Adaptive Thresholds in 10g - Part 3 (Setting T ...

Doug's Oracle Blog

  • Home
  • Papers
  • Books
  • C.V.
  • Fun
  • Oracle Blog
  • Personal Blog

May 10: Adaptive Thresholds in 10g - Part 3 (Setting Thresholds)

Some features in this post require a Diagnostics Pack license.

One of my favourite aspects of blogging is that if I were to write a proper article or conference paper, I'd be likely to fix any problems with it as I go (well, the ones I notice!). With blogging, it's more likely that I'll just post something quite quickly without significant editing. Most of the time that's not a good thing, but occasionally a bad decision throws up an interesting point to expand on and the fluidity of the medium means I can. When I chose the two graphs towards the end of the last post, it would probably have been better if both graphs had been for 'Response Time (per transaction)' so that you could compare the same statistic over different time groupings, but I accidentally chose 'Number of Transactions (per second)' (which is first on the list in the Statistics Preview screen). For completeness, here are the two graphs I should have used





... and to save you having to go back to the previous post, here's the Number of Transactions graph I used. (Oh, and for anyone who can't stand looking at graphs, there might be some later posts that are more query based because everything in the graphs is available in basic dictionary views that I've probably played around with as much. Just let me use some pretty pictures for a change, eh?)



So why do I think the misjudgement (I wouldn't go as far as calling it a mistake) is interesting? Because it allows me to move away from the narrow focus of transaction response time and highlight the metric choices, which I think are good, and their different focus. The statistics computation and preview screen doesn't group the metrics as well as the threshold setting screen does.



There are three different groups of metrics as indicators of different, but often related variations in system behaviour.

Performance ... hopefully that one is self-explanatory. But there are many reasons why performance could become worse, two possibilities are :-

Workload Type ... has the behaviour of the application changed or is someone running a workload that they shouldn't be, perhaps the query from hell during your usual OLTP day?

Workload Volume ... or is the system just much busier than normal - more users, more transactions?

That's my informal translation and I think that the groups are a useful aid to help you consider what the metrics mean, but there are no hard and fast rules, here. For example, redo size could change because of an increase in the volume of transactions or because of a change in the type of transactions. (As I'm writing, I remember saying something about redo generation and User Commits in a blog post a while ago about workload metrics and redo size is certainly something I've used in the past to track the 'busy-ness' of systems before I had the options of setting adaptive thresholds. I know from my own experience how useful some of these metrics can be.)

The relationship between the different metrics could be quite complex, though. If you look back at the Response Time and Number of Transactions graphs for the Weekdays and Weekend grouping above, you'll see that there were many fewer transactions at weekends, but their individual response time was higher than on weekdays - not what you'd expect, but adapting to complex system behaviour is precisely what makes this approach useful. (The complexity here is likely to be a direct result of this being an intermittently very busy laptop running a wide variety of transactions.)

Anyway, how do I go about setting thresholds to alert against? There's only one screen to deal with. (I'll use a thumbnail of the tresholds setting screen here because it didn't scale down too well.)



The screen itself is simple to use and I guess that people will find the Percentage of Maxmimum thresholds simple to understand, but Significance Level thresholds can be a little trickier.

In essence, they're designed to trap unusual values. There's a lot more detail on the mathematical foundations of their design in this paper I posted a link to previously and here's an AskTom thread where Graham Wood and John Bereniewicz offer their thoughts which I liked. Personally, I have a hunch that when people first experiment with Significance Level thresholds they find that they don't generate any alerts because the higher levels of significance will only catch truly extreme events and so, having been unable to get the damn things to throw alerts, they give them up as a lost cause. The default values in the DB/Grid Control screen are a Critical alert for Severe (.999, or one in a thousand occurrences) and a Warning for Very High (.99, or one in a hundred).   Add in the fact that two consecutive occurrences are required before an alert is triggered and it could be a long wait for a Critical alert.

However, that's not to say that I think the defaults are inappropriate or that you shouldn't go with them (because it's better to only receive alerts when there really is a problem), but that you might find that you don't get alerts back in the early stages of testing, so how do you know everything's working ok? One approach I've found useful is reducing the Warning level to the minimum allowed - High (.95) - (as I did in the example screen above) until you're sure you're getting some alerts back. There's no doubt in my mind that, for a feature that's very easy to use and implement, you probably need to experiment with the values on your systems in the early stages. I also think that you should 'test' this on Production systems wherever possible, or at least test systems with a proper workload profile. That testing could be as simple as setting Critical to 'None' so that you don't generate high priority incidents, but setting the Warning threshold to various values.

What do the values that I set here mean? If you look back at the second graph, you'll see that the 95th percentile is difficult to see, but looks somewhere in the low hundreds for both weekends and weekdays. To a certain extent, the absolute value isn't the key, but we can see it by looking at the underlying statistics for the two time groups. (Warning - query thrown together based on a basic guess at the underlying data model.)

SYS@TEST1020> select n.metric_name, n.metric_unit, p.threshold_method,
  2     p.num_occurrences, p.warning_param, p.critical_param,
  3     s.compute_date, s.sample_count,
  4     s.average, s.minimum, s.maximum, s.sdev,
  5     s.pctile_25, s.pctile_50, s.pctile_75, s.pctile_90, s.pctile_95,
  6     s.est_pctile_99, s.est_pctile_999, s.est_pctile_9999
  7  from dbsnmp.MGMT_BSLN_STATISTICS s, dbsnmp.MGMT_BSLN_THRESHOLD_PARMS p,
  8     dbsnmp.MGMT_BSLN_DATASOURCES d, dbsnmp.MGMT_BSLN_BASELINES b,
  9     V$METRICNAME n
 10  where p.datasource_guid = d.datasource_guid
 11  and s.datasource_guid = d.datasource_guid
 12  and s.bsln_guid = p.bsln_guid
 13  and p.bsln_guid = b.bsln_guid
 14  and d.metric_id = n.metric_id
 15  and b.name = 'Doug Test';

METRIC_NAME
----------------------------------------------------------------
METRIC_UNIT          THRESHOLD_METHOD NUM_OCCURRENCES WARNING_PARAM
-------------------- ---------------- --------------- -------------
CRITICAL_PARAM COMPUTE_DATE      SAMPLE_COUNT    AVERAGE    MINIMUM    MAXIMUM
-------------- ----------------- ------------ ---------- ---------- ----------
      SDEV  PCTILE_25  PCTILE_50  PCTILE_75  PCTILE_90  PCTILE_95 EST_PCTILE_99
---------- ---------- ---------- ---------- ---------- ---------- -------------
EST_PCTILE_999 EST_PCTILE_9999
-------------- ---------------
Response Time Per Txn
CentiSeconds Per Txn SIGLVL                         2           .95
           .99 10/05/09 18:40:10         1239 71.6869897 .021090732 5146.63499
369.551876 1.88866288 3.97931111 6.94699364 24.1417503 254.782416    1882.14226
    4041.34033      6266.98777

Response Time Per Txn
CentiSeconds Per Txn SIGLVL                         2           .95
           .99 10/05/09 18:40:10          269 112.962703          0   7658.297
746.769388    .339025      .4427     1.8753   11.62364 207.996409     3692.6721
    5575.31008      8696.60199

The weekday time group statistics are based on 1239 values for this metric, 269 for the weekend group and, although the thresholds are stored as .95 and .99, I can see those thresholds translate into 207 centiseconds for a weekend warning and 3692 centiseconds for a critical alert. Remember that, because this is a Static Baseline, these statistics were computed once and will stay the same, so the absolute values this query returns are used as the thresholds. If I was using a Moving Window baseline, the percentile settings would stay the same, but the statistics would be recalculated and the absolute threshold values will change.

As Moving Window baselines are likely to prove more useful for more people (but, I'll repeat, aren't as easy to experiment with on a home system), I'll move on to those in the next post, as well as switching to 11g.

Oh, but before I do, the proof that these alerts do work. (Because there were times when I wondered ;-))


Posted by Doug Burns Comments: (7) Trackbacks: (0)
Defined tags for this entry: adaptive thresholds
Related entries by tags:
Adaptive Thresholds in 10g - Part 2 (Time Grouping)
Adaptive Thresholds in 10g - Part 1 (Metric Baselines)

Trackbacks
Trackback specific URI for this entry

No Trackbacks

Comments
Display comments as (Linear | Threaded)

#1 - Marco Gralike said:
2009-05-12 00:57 - (Reply)

ThX Doug for expanding my horizons; I really like the serries. Which OEM version are you using btw?

#2 - Doug Burns said:
2009-05-12 08:35 - (Reply)

Cheers, Marco. This is all Database Control 10.2.0.4 but Grid Control would be the same.

11.1.0.7 in the next post, I think, because it's changed quite a bit ...

#3 - Marco Gralike said:
2009-05-12 20:40 - (Reply)

Although I think, DB Console (11.1.0.7.0) is ahead regarding its big brother Database Grid Control 10g Release 5 would be an interesting alternative, but can't compare it for you due to the fact that I haven't implemented it yet. Too busy regarding testing stuff

;-)

#4 - Doug Burns said:
2009-05-12 20:46 - (Reply)

Grid Control 10.2.0.5 would actually be my preferred option because then I could blog about the Loadmap page, but it might take a bit longer. I already have 11.1.0.7 installed here.

#5 - Kevin CLosson said:
2009-05-16 06:43 - (Reply)

Dang, Doug, I've really got to chew on this series! My quick read was fruitful, but I need to bare the fangs to get into it fully...very good content!

#6 - Doug Burns said:
2009-05-16 10:28 - (Reply)

Thanks Kevin. It's funny you should say this ....

I need to bare the fangs to get into it fully...

Because that's exactly how I feel. I think the general idea is that this is designed to make performance monitoring lots of databases an easier process, and so you shouldn't have to get into it too deeply, but I can't help needing to know a little about how it works.

#7 - Kyle Hailey said:
2010-05-31 20:38 - (Reply)

Awesome 3 posts on Adaptive Thresholds - its super cool stuff. You'd think I knew how all this stuff works - but I didn't! I knew how it was suppose to work in theory but I never understood how the guys actually coded the UI - yikes. I was so overwhelmed by the number of links and web pages the 200 EM developers at the time (now over 1000 developers) were churning outthat it took a lot of my attention just to try and reign a lot it in - mainly to no avail.
For fun here is a graphic I did for the baseline deviation page

Not sure if that image will be stripped from the blog comment. You can see it here otherwise


Add Comment

Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
BBCode format allowed
 
 

Statistics on Partitioned Tables

Contents

Part 1 - Default options - GLOBAL AND PARTITION
Part 2 - Estimated Global Stats
Part 3 - Stats Aggregation Problems I
Part 4 - Stats Aggregation Problems II
Part 5 - Minimal Stats Aggregation
Part 6a - COPY_TABLE_STATS - Intro
Part 6b - COPY_TABLE_STATS - Mistakes
Part 6c - COPY_TABLE_STATS - Bugs and Patches
Part 6d - COPY_TABLE_STATS - A Light-bulb Moment
Part 6e - COPY_TABLE_STATS - Bug 10268597

Comments

Doug Burns about 10053 Trace Files - Different Plan in Different Environments
Tue, 02.04.2013 08:57
You're welcome. Now I just nee d to pull my finger out and ac tually come up [...]
Howard Rogers about 10053 Trace Files - Different Plan in Different Environments
Mon, 01.04.2013 23:08
Makes a big difference, so tha nks for that! With two brow ser windows, o [...]
stelioscharalambides.com about 10053 Trace Files
Sat, 30.03.2013 16:28

Upcoming Presentations

Bookmark

Open All | Close All

Syndicate This Blog

  • XML RSS 2.0 feed
  • ATOM/XML ATOM 1.0 feed
  • XML RSS 2.0 Comments
  • Feedburner Feed

Powered by

Serendipity PHP Weblog

Show tagged entries

xml 11g
xml ACE
xml adaptive thresholds
xml ASH
xml Audit Vault
xml AWR
xml Blogging
xml conferences
xml Cuddly Toys
xml Database Refresh
xml DBMS_STATS
xml Direct Path Reads
xml Fun
xml grid control
xml hotsos 2010
xml listener
xml Locking
xml oow
xml oow2009
xml optimiser
xml OTN
xml Parallel
xml Partitions
xml Patching
xml swingbench
xml The Reality Gap
xml time matters
xml ukoug
xml ukoug2009
xml Unix/Shell
xml Useful Links

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.

Design by Andreas Viklund | Conversion to s9y by Carl