One of my favourite aspects of blogging is that if I were to write a proper article or conference paper, I'd be likely to fix any problems with it as I go (well, the ones I notice!). With blogging, it's more likely that I'll just post something quite quickly without significant editing. Most of the time that's not a good thing, but occasionally a bad decision throws up an interesting point to expand on and the fluidity of the medium means I can. When I chose the two graphs towards the end of the last post, it would probably have been better if both graphs had been for 'Response Time (per transaction)' so that you could compare the same statistic over different time groupings, but I accidentally chose 'Number of Transactions (per second)' (which is first on the list in the Statistics Preview screen). For completeness, here are the two graphs I should have used
... and to save you having to go back to the previous post, here's the Number of Transactions graph I used. (Oh, and for anyone who can't stand looking at graphs, there might be some later posts that are more query based because everything in the graphs is available in basic dictionary views that I've probably played around with as much. Just let me use some pretty pictures for a change, eh?)
So why do I think the misjudgement (I wouldn't go as far as calling it a mistake) is interesting? Because it allows me to move away from the narrow focus of transaction response time and highlight the metric choices, which I think are good, and their different focus. The statistics computation and preview screen doesn't group the metrics as well as the threshold setting screen does.
There are three different groups of metrics as indicators of different, but often related variations in system behaviour.
Performance ... hopefully that one is self-explanatory. But there are many reasons why performance could become worse, two possibilities are :-
Workload Type ... has the behaviour of the application changed or is someone running a workload that they shouldn't be, perhaps the query from hell during your usual OLTP day?
Workload Volume ... or is the system just much busier than normal - more users, more transactions?
That's my informal translation and I think that the groups are a useful aid to help you consider what the metrics mean, but there are no hard and fast rules, here. For example, redo size could change because of an increase in the volume of transactions or because of a change in the type of transactions. (As I'm writing, I remember saying something about redo generation and User Commits in a blog post a while ago about workload metrics and redo size is certainly something I've used in the past to track the 'busy-ness' of systems before I had the options of setting adaptive thresholds. I know from my own experience how useful some of these metrics can be.)
The relationship between the different metrics could be quite complex, though. If you look back at the Response Time and Number of Transactions graphs for the Weekdays and Weekend grouping above, you'll see that there were many fewer transactions at weekends, but their individual response time was higher than on weekdays - not what you'd expect, but adapting to complex system behaviour is precisely what makes this approach useful. (The complexity here is likely to be a direct result of this being an intermittently very busy laptop running a wide variety of transactions.)
Anyway, how do I go about setting thresholds to alert against? There's only one screen to deal with. (I'll use a thumbnail of the tresholds setting screen here because it didn't scale down too well.)
However, that's not to say that I think the defaults are inappropriate or that you shouldn't go with them (because it's better to only receive alerts when there really is a problem), but that you might find that you don't get alerts back in the early stages of testing, so how do you know everything's working ok? One approach I've found useful is reducing the Warning level to the minimum allowed - High (.95) - (as I did in the example screen above) until you're sure you're getting some alerts back. There's no doubt in my mind that, for a feature that's very easy to use and implement, you probably need to experiment with the values on your systems in the early stages. I also think that you should 'test' this on Production systems wherever possible, or at least test systems with a proper workload profile. That testing could be as simple as setting Critical to 'None' so that you don't generate high priority incidents, but setting the Warning threshold to various values.
The screen itself is simple to use and I guess that people will find the Percentage of Maxmimum thresholds simple to understand, but Significance Level thresholds can be a little trickier.
In essence, they're designed to trap unusual values. There's a lot more detail on the mathematical foundations of their design in this paper I posted a link to previously and here's an AskTom thread where Graham Wood and John Bereniewicz offer their thoughts which I liked. Personally, I have a hunch that when people first experiment with Significance Level thresholds they find that they don't generate any alerts because the higher levels of significance will only catch truly extreme events and so, having been unable to get the damn things to throw alerts, they give them up as a lost cause. The default values in the DB/Grid Control screen are a Critical alert for Severe (.999, or one in a thousand occurrences) and a Warning for Very High (.99, or one in a hundred). Add in the fact that two consecutive occurrences are required before an alert is triggered and it could be a long wait for a Critical alert.
What do the values that I set here mean? If you look back at the second graph, you'll see that the 95th percentile is difficult to see, but looks somewhere in the low hundreds for both weekends and weekdays. To a certain extent, the absolute value isn't the key, but we can see it by looking at the underlying statistics for the two time groups. (Warning - query thrown together based on a basic guess at the underlying data model.)
SYS@TEST1020> select n.metric_name, n.metric_unit, p.threshold_method, 2 p.num_occurrences, p.warning_param, p.critical_param, 3 s.compute_date, s.sample_count, 4 s.average, s.minimum, s.maximum, s.sdev, 5 s.pctile_25, s.pctile_50, s.pctile_75, s.pctile_90, s.pctile_95, 6 s.est_pctile_99, s.est_pctile_999, s.est_pctile_9999 7 from dbsnmp.MGMT_BSLN_STATISTICS s, dbsnmp.MGMT_BSLN_THRESHOLD_PARMS p, 8 dbsnmp.MGMT_BSLN_DATASOURCES d, dbsnmp.MGMT_BSLN_BASELINES b, 9 V$METRICNAME n 10 where p.datasource_guid = d.datasource_guid 11 and s.datasource_guid = d.datasource_guid 12 and s.bsln_guid = p.bsln_guid 13 and p.bsln_guid = b.bsln_guid 14 and d.metric_id = n.metric_id 15 and b.name = 'Doug Test'; METRIC_NAME ---------------------------------------------------------------- METRIC_UNIT THRESHOLD_METHOD NUM_OCCURRENCES WARNING_PARAM -------------------- ---------------- --------------- ------------- CRITICAL_PARAM COMPUTE_DATE SAMPLE_COUNT AVERAGE MINIMUM MAXIMUM -------------- ----------------- ------------ ---------- ---------- ---------- SDEV PCTILE_25 PCTILE_50 PCTILE_75 PCTILE_90 PCTILE_95 EST_PCTILE_99 ---------- ---------- ---------- ---------- ---------- ---------- ------------- EST_PCTILE_999 EST_PCTILE_9999 -------------- --------------- Response Time Per Txn CentiSeconds Per Txn SIGLVL 2 .95 .99 10/05/09 18:40:10 1239 71.6869897 .021090732 5146.63499 369.551876 1.88866288 3.97931111 6.94699364 24.1417503 254.782416 1882.14226 4041.34033 6266.98777 Response Time Per Txn CentiSeconds Per Txn SIGLVL 2 .95 .99 10/05/09 18:40:10 269 112.962703 0 7658.297 746.769388 .339025 .4427 1.8753 11.62364 207.996409 3692.6721 5575.31008 8696.60199
The weekday time group statistics are based on 1239 values for this metric, 269 for the weekend group and, although the thresholds are stored as .95 and .99, I can see those thresholds translate into 207 centiseconds for a weekend warning and 3692 centiseconds for a critical alert. Remember that, because this is a Static Baseline, these statistics were computed once and will stay the same, so the absolute values this query returns are used as the thresholds. If I was using a Moving Window baseline, the percentile settings would stay the same, but the statistics would be recalculated and the absolute threshold values will change.
As Moving Window baselines are likely to prove more useful for more people (but, I'll repeat, aren't as easy to experiment with on a home system), I'll move on to those in the next post, as well as switching to 11g.
Oh, but before I do, the proof that these alerts do work. (Because there were times when I wondered )
ThX Doug for expanding my horizons; I really like the serries. Which OEM version are you using btw?
Cheers, Marco. This is all Database Control 10.2.0.4 but Grid Control would be the same.
126.96.36.199 in the next post, I think, because it's changed quite a bit ...
Although I think, DB Console (188.8.131.52.0) is ahead regarding its big brother Database Grid Control 10g Release 5 would be an interesting alternative, but can't compare it for you due to the fact that I haven't implemented it yet. Too busy regarding testing stuff
Grid Control 10.2.0.5 would actually be my preferred option because then I could blog about the Loadmap page, but it might take a bit longer. I already have 184.108.40.206 installed here.
Dang, Doug, I've really got to chew on this series! My quick read was fruitful, but I need to bare the fangs to get into it fully...very good content!
Thanks Kevin. It's funny you should say this ....
I need to bare the fangs to get into it fully...
Because that's exactly how I feel. I think the general idea is that this is designed to make performance monitoring lots of databases an easier process, and so you shouldn't have to get into it too deeply, but I can't help needing to know a little about how it works.
Awesome 3 posts on Adaptive Thresholds - its super cool stuff. You'd think I knew how all this stuff works - but I didn't! I knew how it was suppose to work in theory but I never understood how the guys actually coded the UI - yikes. I was so overwhelmed by the number of links and web pages the 200 EM developers at the time (now over 1000 developers) were churning outthat it took a lot of my attention just to try and reign a lot it in - mainly to no avail.
For fun here is a graphic I did for the baseline deviation page
Not sure if that image will be stripped from the blog comment. You can see it here otherwise