May 7: Adaptive Thresholds in 10g - Part 2 (Time Grouping)
Some features in this post require a Diagnostics Pack license.
Metric Baselines were designed to be easy to implement. There are only two options :-
1) Pick how much recent activity you want to use and let Oracle recompute the statistics based on the most recent metric values over time. (Moving Window)
2) Pick a fixed period of at least 7 days when a system was performing as expected and compute fixed statistics. (Static)
It really is as simple as that and most of the time, you'd use option 1, but, as the documentation points out :-
"The key in defining a metric snapshot for a target is to select a date during which target performance was acceptable under typical workloads. Given this date, actual values of the performance metrics for the target are retrieved and these represent what is normal or expected performance behavior for the target."
(Although it isn't really talking about Metric Baselines here, but Snapshots, the same principle is particularly relevant to Static Metric Baselines as well. That's one of the reasons why Moving Window baselines will usually be the better option).
I think there are two objectives, though. One is to find a period that you would describe as 'typical' and 'acceptable' and the other is to ensure that there were enough measurements in that period to be a statistically valid basis for setting Significance Level Adaptive Thresholds. The latter can be a little tricky when you're trying this out on the type of developer test system it wasn't designed for, but Oracle does help you out by displaying useful information on the DB/Grid Control screens.
Shortly I'll show you the problems you're likely to encounter in practice but, for that to make sense, I need to talk about Time Groupings first. (However, note that I'm going to mention a few things that contradict the point made in one white paper 'It is not intended that customers be required to understand this level of detail in order to successfully use the feature.' Amen to that, I just think it's nice to be able to know sometimes ....)
A core design feature is that the alert thresholds are adaptive. Consider the "Response Time Per Txn" metric. Is there a single acceptable value for that? It's unlikely because, as I discussed in a previous post, the primary objective for your system during overnight batch processing could be maximum throughput (so you accept slower transaction response times), whereas during the day you might expect lower throughput but aim for much faster transaction response times. That's just one illustration of an essential truth about most business systems - the workload profile changes over time. If we tried to set a fixed target response time per transaction to cover all possibilities, we would either find it impossible, or the threshold would be too high to be useful.
Adaptive Thresholds change over time, so that the monitoring software compares the current metric values against statistics for other time periods when we would expect a similar workload profile. For example, if it's a quiet Sunday afternoon now, it's probably more sensible to compare the current metric values to another Sunday afternoon, right? Likewise, if we run the same batch processes each night, then let's compare tonight's statistics against recent nights. Most DBAs will be familiar with the recognisable workload profiles and periodicity of the various systems they manage.
Time Groupings define the shape and size of the periods that Oracle will use for comparison. There are nine combinations of grouping within Days and Weeks, but I'll focus on three using a Metric Baseline duration of 7 days as the example.
None
If you don't specify any Time Groups at all, Oracle will calculate the statistics based on all of the values in your Metric Baseline. For example, if you have your AWR retention set to 7 days and are using a 7-day Moving Window Baseline, Oracle will look at all of the Metric Values in DBA_HIST_SYSMETRIC_HISTORY and base its statistical calculations on those results. Assuming constant activity over those 7 days, there will be 10,080 values for each metric that are used to produce the aggregate statistics (one measurement per minute * 60 * 24 * 7).
By Day and Night, Over Weekdays and Weekends (The Default)
This creates 4 different time groups.
- Weekdays during the day (07:00-19:00)
- Weekdays overnight (19:00-07:00)
- Weekends during the day (07:00-19:00)
- Weekends overnight (19:00-07:00)
So, if it's Thursday morning on the 7th May, Oracle will base its calculations on statistics aggregated from all of the values between 07:00 and 19:00 on Wednesday 6th, Tuesday 5th, Monday 4th, Friday 1st and Thursday 30th April. Using the 7-day example, there will be 3600 values for each metric in each Weekday group (5 days * 12 hours * 60) and 1440 in each Weekend group (2 days * 12 hours * 60).
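To make that arithmetic concrete, here's a minimal Python sketch that buckets one-minute samples into the four default groups and counts them. It is not Oracle's implementation: `default_time_group` and `group_cardinalities` are hypothetical helpers, the year 2009 is assumed only because 7th May falls on a Thursday in that year, and each sample is assumed to be classified by its own calendar day and hour (how Oracle treats the overnight hours that straddle the Friday/Saturday boundary isn't covered here).

```python
from collections import Counter
from datetime import datetime, timedelta


def default_time_group(ts: datetime) -> str:
    """Hypothetical helper: map a timestamp to one of the four default groups.

    Assumption: a sample is classified by its own calendar day and hour.
    """
    week_part = "weekday" if ts.weekday() < 5 else "weekend"  # Mon=0 .. Sun=6
    day_part = "day" if 7 <= ts.hour < 19 else "night"
    return f"{week_part}/{day_part}"


def group_cardinalities(start: datetime, days: int) -> Counter:
    """Count one-minute samples per group, assuming no gaps in activity."""
    total_minutes = days * 24 * 60
    return Counter(
        default_time_group(start + timedelta(minutes=m))
        for m in range(total_minutes)
    )


# 7 full days, Thursday 30 April to Wednesday 6 May (year assumed to be 2009).
counts = group_cardinalities(datetime(2009, 4, 30), days=7)
for group, n in sorted(counts.items()):
    print(group, n)
```

Under those assumptions the weekday groups each hold 3600 samples and the weekend groups 1440, and the four groups together account for all 10,080 minutes in the week.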
By Hour of Day, By Day of Week (Narrowest Groups)
This creates groups for every single hour over the course of the week - 168 groups in all. That means that at 11:00 on Thursday morning, Oracle will compare the current metric values to the statistics as of 11:00 last Thursday morning, which sounds great because you could model fairly complex workload periodicity. However, you probably don't want to have thresholds based on such specific time periods that they become sensitive to even slight changes in the timing of things - is your system really that variable from hour to hour and reliably so? More important, now that there are so many groups, each group only contains 60 values for each metric, which is a problem.
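The 168-groups, 60-values-each figures can be checked with a few lines of Python. This is only a sketch under the same constant-activity assumption (one sample per minute, no gaps); `hourly_weekly_group` is a hypothetical helper and the start date (with 2009 as the assumed year) is arbitrary, since any 7 consecutive days give the same result.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_weekly_group(ts: datetime) -> tuple:
    """Hypothetical helper: one group per (day of week, hour of day)."""
    return (ts.weekday(), ts.hour)  # Mon=0 .. Sun=6, hour 0..23


start = datetime(2009, 4, 30)   # assumed start of a 7-day baseline
minutes_in_week = 7 * 24 * 60   # one sample per minute, no gaps

counts = Counter(
    hourly_weekly_group(start + timedelta(minutes=m))
    for m in range(minutes_in_week)
)

n_groups = len(counts)           # number of distinct time groups
sizes = set(counts.values())     # distinct group cardinalities
print(n_groups, sizes)
```

Every one of the 168 groups ends up with exactly 60 samples, which is the cardinality problem in a nutshell.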
The Aggregate Cardinality of each Time Group is the number of measurements in the group on which the statistics are calculated (e.g. 60 in the Hour of Day, Day of Week example over a one week period). Good statistics are absolutely critical to the ability to set Significance Level Adaptive Thresholds. If we only have 60 values for a metric for the corresponding hour and day last week, it's impossible to make a sensible statement about the 99.9th percentile! Therefore Oracle won't try to and will just ignore any Significance Level thresholds when there is insufficient data on which to base the calculation.
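One way to see why 60 values can't say much about the 99.9th percentile is to ask how many of those values you would even expect to fall above it. The sketch below is plain arithmetic, not Oracle's sufficiency test (whatever rule Oracle applies internally is not described here); `expected_tail_count` is a hypothetical helper, and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction


def expected_tail_count(sample_size: int, percentile: str) -> Fraction:
    """Expected number of samples lying above the given percentile.

    `percentile` is a decimal string such as "0.999" so that the
    arithmetic stays exact.
    """
    tail_probability = 1 - Fraction(percentile)
    return sample_size * tail_probability


# 60 samples (one hour-of-week group over one week) at the 99.9th percentile.
print(float(expected_tail_count(60, "0.999")))

# All 10,080 samples from an ungrouped, fully active 7-day baseline.
print(float(expected_tail_count(10080, "0.999")))
```

With 60 samples the expected count above the 99.9th percentile is well under one observation, so there is effectively nothing in the data to estimate that tail from. With all 10,080 samples there are about ten.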
I think this is the main reason why people struggle to have early success with Significance Level Adaptive Thresholds (and not just because I did! LOL)
In fairness to Oracle, DB/Grid Control does warn you that you have insufficient data, includes a facility to look at the data you do have, and offers lots of on-screen help (which I have a bad habit of glossing over), but it won't stop you picking inappropriate periods. Time for an example.
I'll create a Static Baseline based on last week. Remember that this is on a laptop that was intensely busy during some periods (running demos during the class and testing them in advance), but shut down for the majority of the time. Static Baselines are probably less useful than Moving Window Baselines in 'real world' situations, but the Static Baseline allows me to work on this for as long as I want without needing constant activity on my laptop! By default, the Time Grouping is set to By Hour of Day, By Weekdays and Weekends. In order to see whether the selected date range and time groups would contain sufficient data for the statistical calculations, we can use the 'Statistics Preview' section.
When I click 'Compute Statistics', Oracle will create the Time Groupings and determine whether the Aggregate Cardinality is an adequate basis for setting thresholds.
So it should be clear in this case that there's insufficient data when this period is split up into those time groups. Better still, by clicking on one of the little pairs of spectacles, you can see why ...
Note that the graph values are the same for Saturday/Sunday and for Weekdays - those are the two groups within the week. But within those two groups, that's a pretty sparse graph so there clearly isn't sufficient data. Let's not go into detail on what the graph means yet. Let's try to fix the baseline so there is sufficient data.
Here are some factors to consider when configuring your Metric Baselines that might help you determine the most appropriate baseline duration and time grouping to select. All of the following reduce the aggregate cardinality, i.e. they make it less likely that Oracle will accept the data as sufficient.
Periods of inactivity - What you'll run into if you try to play around with this at home on a PC that's rarely active. (That's the case here. I've tried to offset it by using a Static Baseline on my busiest week but it still isn't consistently busy enough.)
Narrow Time Groupings - Using the most narrow groupings available, you'll need a 5 week moving window to set even a 95th percentile threshold. (That's the case here, I'm grouping by the hour)
Less History - For example, one week's worth of data only contains one set of data for 'Tuesday', whereas three weeks' worth contains three 'Tuesdays'.
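The three factors above all feed into one simple product. The Python sketch below is a deliberately crude model, not anything Oracle computes: `max_cardinality` is a hypothetical helper, it assumes one sample per minute, and `active_fraction` pretends that inactivity is spread evenly across the group, which a laptop that's mostly shut down certainly won't honour.

```python
def max_cardinality(hours_per_day: int, days_per_week: int, weeks: int,
                    active_fraction: float = 1.0) -> int:
    """Rough count of one-minute samples landing in a single time group.

    hours_per_day   - hours of each day covered by the group (grouping width)
    days_per_week   - days of each week covered by the group (grouping width)
    weeks           - length of the baseline (amount of history)
    active_fraction - assumed share of that time the system was up and busy
    """
    minutes = hours_per_day * 60 * days_per_week * weeks
    return int(minutes * active_fraction)


# Narrowest grouping (one hour on one day of the week), one week of history.
print(max_cardinality(1, 1, weeks=1))

# Same grouping with five weeks of history.
print(max_cardinality(1, 1, weeks=5))

# A 12-hour Day group with no Week grouping, one week, system up 25% of the time.
print(max_cardinality(12, 7, weeks=1, active_fraction=0.25))
```

Widening the groups or adding history multiplies the count, while inactivity scales it back down, which is why a rarely used test machine combined with hourly groups is such an awkward starting point.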
So this example is just about the worst case! However, maybe I can work around it. I can't change the periods of inactivity and introducing more history would just make things worse (more periods of inactivity) so maybe I should try to use less narrow time groupings?
I tried to reduce the Day grouping to Day and Night (rather than every hour) but that didn't work. Next I tried setting the Week grouping to None. i.e. Ask Oracle to assume that Days are different to Nights but that every day looks the same - even weekends.
Excellent. Those green ticks mean that Oracle has determined that the data in the baseline is a sufficient basis for statistical analysis. I'll click on the spectacles to see how it looks.
Now that there are only two groups - Day and Night - each group has enough data. I still won't talk about the thresholds, but how would it look if I chose another option for wider groups? This time I'll say that the Days and Nights are the same - i.e. the system runs the same around the clock - but that the Weekends are different from Weekdays. That works, too, and here's how the data looks.
Hopefully the graphics make it easier to visualise the underlying mechanics. But we still haven't set any Thresholds yet. That's for the next post.
References
Despite the documentation, the online help, the graphics and much playing, the penny didn't really drop until John Beresniewicz sent me an Adaptive Thresholds FAQ, at which point it dropped with a vengeance, causing a deep but rapidly healing cut on my head (sorry, I digress). If he lets me host it round these parts (and I suspect he will), I'll link to it here. In the meantime, you'll have to make do with my tortuous explanations.
Thanks to JB, though, who finally made me see the light.
(You know, despite my best intentions and determination to make this a three post series at most, I have a bad feeling about this already)

