Entries tagged as adaptive thresholds
One of my favourite aspects of blogging is that if I were to write a proper article or conference paper, I'd be likely to fix any problems with it as I go (well, the ones I notice!). With blogging, it's more likely that I'll just post something quite quickly without significant editing. Most of the time that's not a good thing, but occasionally a bad decision throws up an interesting point to expand on, and the fluidity of the medium means I can. When I chose the two graphs towards the end of the last post, it would probably have been better if both had been for 'Response Time (per transaction)' so that you could compare the same statistic over different time groupings, but I accidentally chose 'Number of Transactions (per second)' (which is first on the list in the Statistics Preview screen). For completeness, here are the two graphs I should have used ...
... and to save you having to go back to the previous post, here's the Number of Transactions graph I used. (Oh, and for anyone who can't stand looking at graphs, there might be some later posts that are more query based, because everything in the graphs is also available in basic dictionary views that I've probably played around with just as much. Just let me use some pretty pictures for a change, eh?)
So why do I think the misjudgement (I wouldn't go as far as calling it a mistake) is interesting? Because it allows me to move away from the narrow focus on transaction response time and highlight the range of metrics on offer (good choices, I think) and their different purposes. The statistics computation and preview screen doesn't group the metrics as well as the threshold setting screen does.
There are three different groups of metrics, acting as indicators of different, but often related, variations in system behaviour.
Performance ... hopefully that one is self-explanatory. But there are many reasons why performance could become worse; two possibilities are :-
Workload Type ... has the behaviour of the application changed or is someone running a workload that they shouldn't be, perhaps the query from hell during your usual OLTP day?
Workload Volume ... or is the system just much busier than normal - more users, more transactions?
That's my informal translation and I think that the groups are a useful aid to help you consider what the metrics mean, but there are no hard and fast rules, here. For example, redo size could change because of an increase in the volume of transactions or because of a change in the type of transactions. (As I'm writing, I remember saying something about redo generation and User Commits in a blog post a while ago about workload metrics and redo size is certainly something I've used in the past to track the 'busy-ness' of systems before I had the options of setting adaptive thresholds. I know from my own experience how useful some of these metrics can be.)
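As an aside, here's the sort of quick check I mean - a rough sketch against the persisted metric history ('Redo Generated Per Sec' is one of the 15 baseline metrics listed further down this page; the 7-day range is arbitrary):

-- Hourly average redo generation, as a simple indicator of 'busy-ness'.
select trunc(begin_time, 'HH24') as sample_hour,
       round(avg(value), 2)      as avg_redo_per_sec
from   dba_hist_sysmetric_history
where  metric_name = 'Redo Generated Per Sec'
and    begin_time > sysdate - 7
group  by trunc(begin_time, 'HH24')
order  by sample_hour;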
The relationship between the different metrics could be quite complex, though. If you look back at the Response Time and Number of Transactions graphs for the Weekdays and Weekend grouping above, you'll see that there were many fewer transactions at weekends, but their individual response time was higher than on weekdays - not what you'd expect, but adapting to complex system behaviour is precisely what makes this approach useful. (The complexity here is likely to be a direct result of this being an intermittently very busy laptop running a wide variety of transactions.)
Anyway, how do I go about setting thresholds to alert against? There's only one screen to deal with. (I'll use a thumbnail of the threshold-setting screen here because it didn't scale down too well.)
However, that's not to say that I think the defaults are inappropriate or that you shouldn't go with them (it's better to only receive alerts when there really is a problem), but you might find that you don't get any alerts back in the early stages of testing - so how do you know everything's working OK? One approach I've found useful is reducing the Warning level to the minimum allowed - High (.95) - as I did in the example screen above, until you're sure you're getting some alerts back. There's no doubt in my mind that, for a feature that's very easy to use and implement, you probably need to experiment with the values on your systems in the early stages. I also think that you should 'test' this on Production systems wherever possible, or at least on test systems with a proper workload profile. That testing could be as simple as setting Critical to 'None', so that you don't generate high-priority incidents, while setting the Warning threshold to various values.
The screen itself is simple to use and I guess that people will find the Percentage of Maximum thresholds simple to understand, but Significance Level thresholds can be a little trickier.
In essence, they're designed to trap unusual values. There's a lot more detail on the mathematical foundations of their design in the paper I posted a link to previously, and here's an AskTom thread where Graham Wood and John Beresniewicz offer their thoughts, which I liked. Personally, I have a hunch that when people first experiment with Significance Level thresholds, they find that they don't generate any alerts because the higher levels of significance will only catch truly extreme events and so, having been unable to get the damn things to throw alerts, they give them up as a lost cause. The default values in the DB/Grid Control screen are a Critical alert for Severe (.999, or one in a thousand occurrences) and a Warning for Very High (.99, or one in a hundred). Add in the fact that two consecutive occurrences are required before an alert is triggered and it could be a long wait for a Critical alert.
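To put some rough numbers on that (a back-of-the-envelope sketch, assuming one metric observation per minute and treating observations as independent, which they aren't really):

-- If each one-minute observation exceeds the significance level with
-- probability p, a single exceedance is expected about every 1/p minutes
-- and two in a row about every 1/(p*p) minutes.
select level_name,
       round(1 / p)       as avg_minutes_to_one_exceedance,
       round(1 / (p * p)) as avg_minutes_to_two_in_a_row
from  (select 'Very High (.99)' as level_name, 0.01 as p from dual
       union all
       select 'Severe (.999)', 0.001 from dual);

Under those over-simplified assumptions, two consecutive Severe exceedances turn up about once in a million minutes - getting on for two years - so you can see why the wait for a Critical alert might be long.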
What do the values that I set here mean? If you look back at the second graph, you'll see that the 95th percentile is difficult to see, but looks somewhere in the low hundreds for both weekends and weekdays. To a certain extent, the absolute value isn't the key, but we can see it by looking at the underlying statistics for the two time groups. (Warning - query thrown together based on a basic guess at the underlying data model.)
SYS@TEST1020> select n.metric_name, n.metric_unit, p.threshold_method,
                     p.num_occurrences, p.warning_param, p.critical_param,
                     s.compute_date, s.sample_count,
                     s.average, s.minimum, s.maximum, s.sdev,
                     s.pctile_25, s.pctile_50, s.pctile_75, s.pctile_90, s.pctile_95,
                     s.est_pctile_99, s.est_pctile_999, s.est_pctile_9999
              from   dbsnmp.MGMT_BSLN_STATISTICS s, dbsnmp.MGMT_BSLN_THRESHOLD_PARMS p,
                     dbsnmp.MGMT_BSLN_DATASOURCES d, dbsnmp.MGMT_BSLN_BASELINES b,
                     V$METRICNAME n
              where  p.datasource_guid = d.datasource_guid
              and    s.datasource_guid = d.datasource_guid
              and    s.bsln_guid = p.bsln_guid
              and    p.bsln_guid = b.bsln_guid
              and    d.metric_id = n.metric_id
              and    b.name = 'Doug Test';

METRIC_NAME
----------------------------------------------------------------
METRIC_UNIT          THRESHOLD_METHOD NUM_OCCURRENCES WARNING_PARAM
-------------------- ---------------- --------------- -------------
CRITICAL_PARAM COMPUTE_DATE      SAMPLE_COUNT    AVERAGE    MINIMUM    MAXIMUM
-------------- ----------------- ------------ ---------- ---------- ----------
      SDEV  PCTILE_25  PCTILE_50  PCTILE_75  PCTILE_90  PCTILE_95 EST_PCTILE_99
---------- ---------- ---------- ---------- ---------- ---------- -------------
EST_PCTILE_999 EST_PCTILE_9999
-------------- ---------------
Response Time Per Txn
CentiSeconds Per Txn SIGLVL                         2           .95
           .99 10/05/09 18:40:10         1239 71.6869897 .021090732 5146.63499
369.551876 1.88866288 3.97931111 6.94699364 24.1417503 254.782416    1882.14226
    4041.34033      6266.98777

Response Time Per Txn
CentiSeconds Per Txn SIGLVL                         2           .95
           .99 10/05/09 18:40:10          269 112.962703          0   7658.297
746.769388    .339025      .4427     1.8753   11.62364 207.996409     3692.6721
    5575.31008      8696.60199
The weekday time group statistics are based on 1239 values for this metric and the weekend group on 269. Although the thresholds are stored as .95 and .99, I can see that they translate into 207 centiseconds for a weekend Warning and 3692 centiseconds for a Critical alert. Remember that, because this is a Static Baseline, these statistics were computed once and will stay the same, so the absolute values this query returns are used as the thresholds. If I were using a Moving Window baseline, the percentile settings would stay the same, but the statistics would be recalculated regularly and the absolute threshold values would change.
As Moving Window baselines are likely to prove more useful for more people (but, I'll repeat, aren't as easy to experiment with on a home system), I'll move on to those in the next post, as well as switching to 11g.
Oh, but before I do, here's proof that these alerts do work. (Because there were times when I wondered.)
Some features in this post require a Diagnostics Pack license.
Metric Baselines were designed to be easy to implement. There are only two options :-
1) Pick how much recent activity you want to use and let Oracle recompute the statistics based on the most recent metric values over time. (Moving Window)
2) Pick a fixed period of at least 7 days when a system was performing as expected and compute fixed statistics. (Static)
It really is as simple as that and, most of the time, you'd use option 1, but, as the documentation points out :-
"The key in defining a metric snapshot for a target is to select a date during which target performance was acceptable under typical workloads. Given this date, actual values of the performance metrics for the target are retrieved and these represent what is normal or expected performance behavior for the target."
(Although it isn't really talking about Metric Baselines here, but Snapshots, the same principle is particularly relevant to Static Metric Baselines as well. That's one of the reasons why Moving Window baselines will usually be the better option.)
I think there are two objectives, though. One is to find a period that you would describe as 'typical' and 'acceptable' and the other is to ensure that there were enough measurements in that period to be a statistically valid basis for setting Significance Level Adaptive Thresholds. The latter can be a little tricky when you're trying this out on the type of developer test system it wasn't designed for, but Oracle does help you out by displaying useful information on the DB/Grid Control screens.
Shortly I'll show you the problems you're likely to encounter in practice but, for that to make sense, I need to talk about Time Groupings first. (However, note that I'm going to mention a few things that go against the point made in one white paper: 'It is not intended that customers be required to understand this level of detail in order to successfully use the feature.' Amen to that - I just think it's nice to be able to know sometimes ...)
A core design feature is that the alert thresholds are adaptive. Consider the "Response Time Per Txn" metric. Is there a single acceptable value for that? It's unlikely because, as I discussed in a previous post, the primary objective for your system during overnight batch processing could be maximum throughput (so you accept slower transaction response times), whereas during the day you might expect lower throughput but aim for much faster transaction response times. That's just one illustration of an essential truth about most business systems - the workload profile changes over time. If we tried to set a fixed target response time per transaction to cover all possibilities, we would either find it impossible, or the threshold would be too high to be useful.
Adaptive Thresholds change over time, so that the monitoring software compares the current metric values against statistics for other time periods when we would expect a similar workload profile. For example, if it's a quiet Sunday afternoon now, it's probably more sensible to compare the current metric values to another Sunday afternoon, right? Likewise, if we run the same batch processes each night, then let's compare tonight's statistics against recent nights. Most DBAs will be familiar with the recognisable workload profiles and periodicity of the various systems they manage.
Time Groupings define the shape and size of the periods that Oracle will use for comparison. There are nine combinations of grouping within Days and Weeks, but I'll focus on three using a Metric Baseline duration of 7 days as the example.
If you don't specify any Time Groups at all, Oracle will calculate the statistics based on all of the values in your Metric Baseline. For example, if you have your AWR retention set to 7 days and are using a 7-day Moving Window Baseline, Oracle will look at all of the Metric Values in DBA_HIST_SYSMETRIC_HISTORY and base its statistical calculations on those results. Assuming constant activity over those 7 days, there will be 10,080 values for each metric to produce the aggregate statistics from (one measurement per minute x 60 minutes x 24 hours x 7 days).
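If you want to check how many values you actually have to work with, here's a quick sketch against the same view (adjust the date range to match your window):

-- Count the persisted values per metric over the last 7 days. With
-- constant activity this should be close to 10,080 per metric.
select metric_name, count(*) as value_count
from   dba_hist_sysmetric_history
where  begin_time > sysdate - 7
group  by metric_name
order  by metric_name;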
By Day and Night, Over Weekdays and Weekends (The Default)
This creates 4 different time groups.
- Weekdays during the day (07:00-19:00)
- Weekdays overnight (19:00-07:00)
- Weekends during the day (07:00-19:00)
- Weekends overnight (19:00-07:00)
So, if it's Thursday morning on the 7th May, Oracle will base its calculations on aggregated statistics over all of the values between 07:00 and 19:00 on Wednesday 6th, Tuesday 5th, Monday 4th, Friday 1st and Thursday 30th April. Using the 7-day example, there will be 3600 values for each metric in the Weekdays group (5 days x 12 hours x 60 minutes) and 1440 in the Weekends group (2 days x 12 hours x 60 minutes).
By Hour of Day, By Day of Week (Narrowest Groups)
This creates groups for every single hour over the course of the week - 168 groups in all. That means that at 11:00 on Thursday morning, Oracle will compare the current metric values to the statistics as of 11:00 last Thursday morning, which sounds great because you could model fairly complex workload periodicity. However, you probably don't want thresholds based on time periods so specific that they become sensitive to even slight changes in the timing of things - is your system really that variable from hour to hour, and reliably so? More importantly, now that there are so many groups, each group contains only 60 values for each metric, which is a problem.
The Aggregate Cardinality of each Time Group is the number of measurements in the group on which the statistics are calculated (e.g. 60 in the Hour of Day, Day of Week example over a one week period). Good statistics are absolutely critical to the ability to set Significance Level Adaptive Thresholds. If we only have 60 values for a metric for the corresponding hour and day last week, it's impossible to make a sensible statement about the 99.9th percentile! Therefore Oracle won't try to and will just ignore any Significance Level thresholds when there is insufficient data on which to base the calculation.
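You can get a feel for the aggregate cardinality of the default grouping with a sketch like this (the day-abbreviation test and my guess at the 07:00-19:00 boundary make it illustrative rather than definitive):

-- Bucket the persisted values for one metric into the four default
-- time groups and count the measurements in each.
select case when to_char(begin_time, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH')
                 in ('SAT', 'SUN')
            then 'Weekend' else 'Weekday' end as week_group,
       case when to_number(to_char(begin_time, 'HH24')) between 7 and 18
            then 'Day (07:00-19:00)' else 'Night (19:00-07:00)' end as day_group,
       count(*) as aggregate_cardinality
from   dba_hist_sysmetric_history
where  metric_name = 'Response Time Per Txn'
and    begin_time > sysdate - 7
group  by case when to_char(begin_time, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH')
                    in ('SAT', 'SUN')
               then 'Weekend' else 'Weekday' end,
          case when to_number(to_char(begin_time, 'HH24')) between 7 and 18
               then 'Day (07:00-19:00)' else 'Night (19:00-07:00)' end;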
I think this is the main reason why people struggle to have early success with Significance Level Adaptive Thresholds (and not just because I did! LOL).
In fairness to Oracle, DB/Grid Control does warn you that you have insufficient data, includes a facility to look at the data you do have, and offers lots of on-screen help (which I have a bad habit of glossing over), but it won't stop you picking inappropriate periods. Time for an example.
I'll create a Static Baseline based on last week. Remember that this is on a laptop that was intensely busy during some periods (running demos during the class and testing them in advance), but shut down for the majority of the time. Static Baselines are probably less useful than Moving Window Baselines in 'real world' situations, but a Static Baseline allows me to work on this for as long as I want without needing constant activity on my laptop! By default, the Time Grouping is set to By Hour of Day, By Weekdays and Weekends. To see whether the selected date range and time groups would contain sufficient data for the statistical calculations, we can use the 'Statistics Preview' section.
When I click 'Compute Statistics', Oracle will create the Time Groupings and determine whether the Aggregate Cardinality is an adequate basis for setting thresholds.
So it should be clear in this case that there's insufficient data when this period is split up into those time groups. Better still, by clicking on one of the little pairs of spectacles, you can see why ...
Note that the graph values are the same for Saturday/Sunday and for Weekdays - those are the two groups within the week. But within those two groups, that's a pretty sparse graph, so there clearly isn't sufficient data. Let's not go into detail on what the graph means yet; let's try to fix the baseline so there is sufficient data. Here are some factors to consider when configuring your Metric Baselines that might help you determine the most appropriate baseline duration and time grouping. All of the following reduce the aggregate cardinality, i.e. they make it less likely that Oracle will accept the data as sufficient.
Periods of inactivity - What you'll run into if you try to play around with this at home on a PC that's rarely active. (That's the case here. I've tried to offset it by using a Static Baseline on my busiest week but it still isn't consistently busy enough.)
Narrow Time Groupings - Using the narrowest groupings available, you'll need a 5-week moving window to set even a 95th percentile threshold. (That's the case here - I'm grouping by the hour.)
Less History - For example, one week's worth of data contains only one 'Tuesday'; three weeks' worth contains three.
So this example is just about the worst case! However, maybe I can work around it. I can't change the periods of inactivity and introducing more history would just make things worse (more periods of inactivity) so maybe I should try to use less narrow time groupings?
I tried to reduce the Day grouping to Day and Night (rather than every hour) but that didn't work. Next I tried setting the Week grouping to None. i.e. Ask Oracle to assume that Days are different to Nights but that every day looks the same - even weekends.
Excellent. Those green ticks mean that Oracle has determined that the data in the baseline is a sufficient basis for statistical analysis. I'll click on the spectacles to see how it looks.
Now that there are only two groups - Day and Night - each group has enough data. I still won't talk about the thresholds, but how would it look if I chose another option for wider groups? This time I'll say that the Days and Nights are the same - i.e. the system runs the same around the clock - but that the Weekends are different from the Weekdays. That works, too, and here's how the data looks.
Hopefully the graphics make it easier to visualise the underlying mechanics. But we still haven't set any Thresholds yet. That's for the next post.
Despite the documentation, the online help, the graphics and much playing, the penny didn't really drop until John Beresniewicz sent me an Adaptive Thresholds FAQ, at which point it dropped with a vengeance, causing a deep but rapidly healing cut on my head (sorry, I digress). If he lets me host it round these parts (and I suspect he will), I'll link to it here. In the meantime, you'll have to make do with my tortuous explanations. Thanks to JB, though, who finally made me see the light.
(You know, despite my best intentions and determination to make this a three post series at most, I have a bad feeling about this already)
Some features in this post require a Diagnostics Pack license.
[I really didn't want to get into another multi-part blog post, but this has grown longer than I hoped, so I'll split it up ...]
One of the more interesting components of the 10gR2 instrumentation improvements is the use of Metric Baselines and Adaptive Thresholds by OEM DB/Grid Control to generate alerts. (There are further improvements in 11g but I’ll address those in a later post.)
Although these features have been designed to be extremely simple to implement without understanding the mechanics, I suspect that most Oracle DBAs still struggle with the concept of implementing something without having some idea how it works (at least I hope so). There are some good technical documents floating around out there (see References below), but I feel they're at one end of the technical spectrum or the other, so this and the next post are intended to bridge the gap as well as showing the features in use. (Harder than you might think, but I'll come back to that.)
How can we use Metric Baselines and Adaptive Thresholds? I’ll try to put it as simply as possible :-
• Generate a baseline of how a system looks during ‘normal’ operation.
• Generate alerts when specific metrics exceed threshold values relative to that baseline.
There’s quite a bit more to it than that, as I hope will become apparent.
But first, a warning. This is not about system optimisation – that happened during the design, development, test and initial implementation phases, right? ;-) – but about monitoring and capturing unexpected variance in performance indicators. Because even with an optimised application and hardware infrastructure that generally meets requirements ...
• … a new Execution Plan could be produced for a SQL statement because of an overnight change in object statistics.
• … a failed cache battery somewhere in the storage infrastructure could cause a sudden increase in I/O times.
• … you could encounter a larger number of concurrent users than the system has handled before or was designed for.
I've seen all of these and more and I’m sure you’ve experienced plenty of your own. So there's no denying that it’s important to implement an optimised system first and foremost but, even when you’ve done so, something will go wrong one day. I’m probably making up my own phrases for things again, but I think these tools cover the areas of Performance Monitoring and Troubleshooting, rather than Performance Tuning (or Optimisation).
For existing Production systems, I’d like a facility that detects performance anomalies in the single database instance that has a problem right now (out of the hundreds I’m managing) and notifies me so that I can focus my attention where it’s required.
You could argue that such a facility already exists - called Users. They will generally detect serious performance anomalies and they will notify us using the tried and trusted telephone. If this happens too frequently, a further indicator would be the endless crisis meetings with white boards and wildly differing theories. However, it would be nice if I knew there was a problem before someone had to tell me. Users also don’t tend to be very good at reporting overnight batch throughput problems until the next day!
I'll dig into Metric Baselines first because
1) Defining a Baseline is the first thing you have to do.
2) Whilst configuring the Baseline might initially seem the simplest step, it's the foundation on which the threshold alerts depend. (That's why Oracle has tried to simplify it for busy DBAs.)
There are really two types of Baseline, which differ in the way the time period is defined, the way the statistics are calculated and, consequently, their suitability for different uses.
A Moving Window Baseline uses recent data from the AWR repository over 7, 21, 35 or 91 days. As each day passes, the window on the data progresses forward by one day. The statistics are re-calculated on a regular basis (possibly as frequently as every hour depending on the Time Grouping). The effect of this is that the statistics change to reflect the recent workload and performance characteristics of the system.
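If you want to watch that recalculation happening, here's a sketch based on my guess at the underlying repository tables (the same dbsnmp.MGMT_BSLN% tables I'll poke at elsewhere, so treat it accordingly):

-- The COMPUTE_DATE against a Moving Window baseline's statistics should
-- advance as the window moves and the statistics are recomputed.
select b.name, s.compute_date, count(*) as statistics_rows
from   dbsnmp.mgmt_bsln_baselines b, dbsnmp.mgmt_bsln_statistics s
where  b.bsln_guid = s.bsln_guid
group  by b.name, s.compute_date
order  by b.name, s.compute_date;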
A Static Baseline uses AWR data from a user-defined period, which must be at least 7 days long. The statistics are calculated once, when you define the baseline, and are used unchanged until you switch to a new baseline. So Static Baselines don't make any allowance for the change in a system's work profile over time. If you have a steadily increasing number of concurrent users, you will eventually reach a stage where the system is alerting regularly because the current workload is so much greater than the Baseline's.
Which is best for you depends on the characteristics of the system you're managing and what you're trying to achieve. I'd suggest that you need a Moving Window Baseline in most cases, unless you have very strict performance requirements and a very stable system that you don't expect to change over time.
One situation that almost demands the use of Static Baselines is playing around with this on your own setup at home. (In fact, this is just the first of a few difficulties I've faced playing around with this stuff, because a single-user laptop is not the design target!) Think about it. For a Moving Window Baseline to make any sense, your system has to have been processing a 'normal' workload for at least the past 7 days, which is pretty unlikely on a laptop I switch off each night. The design expects systems to be active on a more or less continuous basis, as most business systems are. So, in order to give me Metric Baseline statistics that I could re-use in future without needing ongoing continuous activity, I created a Static Metric Baseline covering last week. Why last week? Well, that brings me to the next difficulty I faced. The Baseline period must include enough activity on which to base the statistical computations used (see next post). Most weeks there probably wouldn't have been sufficient data, but my laptop was more active during a week when I was teaching the course for two days and preparing in the evenings.
The best way to show you what I mean is to create a new Static Metric Baseline. First click on the Metric Baselines link at the bottom of various pages (e.g. Database Home page, Performance Page). If you don't have Baselines enabled, you'll be prompted to confirm that you want to enable them. When you click yes, DB/Grid Control will set the value of the _awr_flush_threshold_metrics hidden parameter to TRUE. (Unfortunately, this is thrown as a compliance error in DB Control 10.2.0.3 – see this Metalink Forum Thread relating to bug number 4749372)
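If you want to confirm that the parameter really has been set, here's a sketch to run as SYS (it uses the undocumented X$ views, so the usual caveats apply - this is for curiosity, not a recommendation):

-- Check the current value of the hidden parameter.
select p.ksppinm  as parameter_name,
       v.ksppstvl as current_value
from   x$ksppi p, x$ksppcv v
where  p.indx = v.indx
and    p.ksppinm = '_awr_flush_threshold_metrics';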
There are 135 metrics in Oracle 10.2.0.4 (including some old friends like the Buffer Cache Hit Ratio!), which you can see by querying the DBA_HIST_METRIC_NAME view. e.g.
SQL> select group_name, metric_name, metric_unit
     from   dba_hist_metric_name
     where  metric_name like 'B%';

GROUP_NAME                      METRIC_NAME                     METRIC_UNIT
------------------------------- ------------------------------- -----------------------------
System Metrics Short Duration   Buffer Cache Hit Ratio          % (LogRead - PhyRead)/LogRead
Session Metrics Long Duration   Blocked User Session Count      Sessions
System Metrics Long Duration    Branch Node Splits Per Txn      Splits Per Txn
System Metrics Long Duration    Branch Node Splits Per Sec      Splits Per Second
System Metrics Long Duration    Background Checkpoints Per Sec  Check Points Per Second
System Metrics Long Duration    Buffer Cache Hit Ratio          % (LogRead - PhyRead)/LogRead

6 rows selected.
... but only 15 of them are used by Metric Baselines and can have Adaptive Thresholds set. These are the metrics persisted in the AWR repository when _awr_flush_threshold_metrics=TRUE :-
SQL> select distinct metric_name
     from   DBA_HIST_SYSMETRIC_HISTORY
     order  by metric_name;

METRIC_NAME
----------------------------------------------------------------
Current Logons Count
DB Block Changes Per Txn
Database Time Per Sec
Enqueue Requests Per Txn
Executions Per Sec
Logical Reads Per Txn
Network Traffic Volume Per Sec
Physical Reads Per Sec
Physical Writes Per Sec
Redo Generated Per Sec
Response Time Per Txn
SQL Service Response Time
Total Parse Count Per Txn
User Calls Per Sec
User Transaction Per Sec

15 rows selected.
Actually, a further 7 metrics are persisted in that view if you also have _awr_flush_workload_metrics=TRUE. Enabling Baselines doesn't set this by default, you won't be able to set Adaptive Thresholds for these metrics, and you shouldn't be changing the values of hidden parameters anyway ... but it's worth mentioning in case you ever see 22 rows returned by the previous query!

METRIC_NAME
----------------------------------------------------------------
DB Block Changes Per User Call
DB Block Gets Per User Call
Executions Per User Call
Logical Reads Per User Call
Total Sorts Per User Call
Total Table Scans Per User Call
User Calls Per Txn

7 rows selected.
When you've enabled Baselines ...
you need to click the 'Manage Static Metric Baselines' link at the bottom of the screen. (No, you won't find me disagreeing that the OEM web interface can be a frustrating search for the right screen/link/button/breadcrumb, but I've sort of got used to it because of some of the cool stuff it can do.) Click the, erm, Create button to create a new baseline and you'll see this screen.
Give it a name, pick a period of at least 7 days and then, if you want, you can just click OK and your Static Metric Baseline is created. The statistics will be computed and you'll be able to see your new baseline in the dictionary.
SQL> select name, type, status from dbsnmp.MGMT_BSLN_BASELINES;
NAME                                                             T
---------------------------------------------------------------- -
STATUS
----------------
Doug Test                                                        S
INACTIVE
(Note that it's INACTIVE and the type is 'S' for Static.)
But that simple creation of a Static Metric Baseline ignored the 'Time Grouping' and 'Statistics Preview' sections that I, and I suspect others, have found a little confusing at first. So they deserve a post of their own - the next one ...
Although the volume of information out there isn't great, the quality of some of it is, so here's some further reading if you're interested in the subject.
"Metric Baselines: Detecting Unusual Performance Events Using System-Level Metrics in EM 10gR2". John Beresniewicz's White Paper. Seriously good stuff, but JB will need to post it at Ashmasters.com Updated later - thanks to JB for letting me host the document here.
Arup Nanda article on OTN
Cary Millsap blog post discussing the underlying concept, rather than this particular implementation. In the post, he mentions a couple of other resources ...
CMG Paper (note also the discussion about Treemaps - blog post about that coming up later)
Robyn Sands on Variance.