May 7: Adaptive Thresholds in 10g - Part 1 (Metric Baselines)
Some features in this post require a Diagnostics Pack license.
[I really didn't want to get into another multi-part blog post, but this has grown longer than I hoped, so I'll split it up ...]
One of the more interesting components of the 10gR2 instrumentation improvements is the use of Metric Baselines and Adaptive Thresholds by OEM DB/Grid Control to generate alerts. (There are further improvements in 11g but I’ll address those in a later post.)
Although these features have been designed to be extremely simple to implement without understanding the mechanics, I suspect that most Oracle DBAs still struggle with the concept of implementing something without having some idea how it works (at least I hope so). There are some good technical documents floating around out there (see References below), but I feel they're at one end of the technical spectrum or the other, so this and the next post are intended to bridge the gap as well as showing the features in use. (Harder than you might think, but I'll come back to that.)
How can we use Metric Baselines and Adaptive Thresholds? I’ll try to put it as simply as possible :-
• Generate a baseline of how a system looks during ‘normal’ operation.
• Generate alerts when specific metrics exceed threshold values relative to that baseline.
There’s quite a bit more to it than that, as I hope will become apparent.
But first, a warning. This is not about system optimisation – that happened during the design, development, test and initial implementation phases, right? ;-) – but about monitoring and capturing unexpected variance in performance indicators. Because even with an optimised application and hardware infrastructure that generally meets requirements ...
• … a new Execution Plan could be produced for a SQL statement because of an overnight change in object statistics.
• … a failed cache battery somewhere in the storage infrastructure could cause a sudden increase in I/O times.
• … you could encounter a larger number of concurrent users than the system has handled before or was designed for.
I've seen all of these and more and I’m sure you’ve experienced plenty of your own. So there's no denying that it’s important to implement an optimised system first and foremost but, even when you’ve done so, something will go wrong one day. I’m probably making up my own phrases for things again, but I think these tools cover the areas of Performance Monitoring and Troubleshooting, rather than Performance Tuning (or Optimisation).
For existing Production systems, I’d like a facility that detects performance anomalies in the single database instance that has a problem right now (out of the hundreds I’m managing) and notifies me so that I can focus my attention where it’s required.
You could argue that such a facility already exists - called Users. They will generally detect serious performance anomalies and they will notify us using the tried and trusted telephone. If this happens too frequently, a further indicator would be the endless crisis meetings with white boards and wildly differing theories. However, it would be nice if I knew there was a problem before someone had to tell me. Users also don’t tend to be very good at reporting overnight batch throughput problems until the next day!
Metric Baselines
I'll dig into Metric Baselines first because
1) Defining a Baseline is the first thing you have to do.
2) Whilst configuring the Baseline might initially seem the simplest step, it's the foundation on which the threshold alerts depend. (That's why Oracle has tried to simplify it for busy DBAs.)
There are really two types of Baselines, which differ in the way the time period is defined; the way statistics are calculated and, consequently, their suitability for different uses.
A Moving Window Baseline uses recent data from the AWR repository over 7, 21, 35 or 91 days. As each day passes, the window on the data progresses forward by one day. The statistics are re-calculated on a regular basis (possibly as frequently as every hour depending on the Time Grouping). The effect of this is that the statistics change to reflect the recent workload and performance characteristics of the system.
A Static Baseline uses AWR data from a user-defined period which must be at least 7 days long. The statistics are calculated once, when you define the baseline, and are used forever until you switch to a new baseline. So Static Baselines don't make any allowance for the change in a system's work profile over time. If you have a steadily increasing number of concurrent users, you will eventually reach a stage where the system is alerting regularly because the work profile is so much greater than the Baseline.
Which is best for you depends on the characteristics of the system you're managing and what you're trying to achieve. I'd suggest that you need a Moving Window Baseline in most cases, unless you have very strict performance requirements and a very stable system that you don't expect to change over time.
One situation that almost demands the use of Static Baselines is playing around with this on your own setup at home. (In fact, this is just the first of a few difficulties I've faced playing around with this stuff because a single-user laptop is not the design target!) Think about it. For a Moving Window Baseline to make any sense, your system has to have been processing a 'normal' workload for at least the past 7 days, which is pretty unlikely on a laptop I switch off each night
The design expects systems to be active on a more or less continuous basis, as most business systems are. So, in order to give me Metric Baseline statistics that I could re-use in future without needing ongoing continuous activity, I created a Static Metric Baseline covering last week. Why last week? Well, that comes to the next difficulty I faced. The Baseline period must have included enough activity on which to base the statistical computations used (see next post). Most weeks there probably wouldn't have been sufficient data on which to base the computations, but my laptop was more active during a week when I was teaching the course for two days and preparing in the evenings.
The best way to show you what I mean is to create a new Static Metric Baseline. First click on the Metric Baselines link at the bottom of various pages (e.g. Database Home page, Performance Page). If you don't have Baselines enabled, you'll be prompted to confirm that you want to enable them. When you click yes, DB/Grid Control will set the value of the _awr_flush_threshold_metrics hidden parameter to TRUE. (Unfortunately, this is thrown as a compliance error in DB Control 10.2.0.3 – see this Metalink Forum Thread relating to bug number 4749372)
There are 135 metrics in Oracle 10.2.0.4 (including some old friends like the Buffer Cache Hit Ratio!), which you can see by querying the V$SYSMETRIC_HISTORY view. e.g.
SQL> select group_name, metric_name, metric_unit from dba_hist_metric_name 2* where metric_name like 'B%'GROUP_NAME ---------------------------------------------------------------- METRIC_NAME ---------------------------------------------------------------- METRIC_UNIT ---------------------------------------------------------------- System Metrics Short Duration Buffer Cache Hit Ratio % (LogRead - PhyRead)/LogRead
Session Metrics Long Duration Blocked User Session Count Sessions
System Metrics Long Duration Branch Node Splits Per Txn Splits Per Txn
System Metrics Long Duration Branch Node Splits Per Sec Splits Per Second
System Metrics Long Duration Background Checkpoints Per Sec Check Points Per Second
System Metrics Long Duration Buffer Cache Hit Ratio % (LogRead - PhyRead)/LogRead
6 rows selected.
but there are only 15 used by Metric Baselines and for which you can set Adaptive Thresholds, which are persisted in the AWR repository when _awr_flush_threshold_metrics=TRUE
SQL> select distinct metric_name from DBA_HIST_SYSMETRIC_HISTORY 2 order by metric_name; METRIC_NAME ---------------------------------------------------------------- Current Logons Count DB Block Changes Per Txn Database Time Per Sec Enqueue Requests Per Txn Executions Per Sec Logical Reads Per Txn Network Traffic Volume Per Sec Physical Reads Per Sec Physical Writes Per Sec Redo Generated Per Sec Response Time Per Txn SQL Service Response Time Total Parse Count Per Txn User Calls Per Sec User Transaction Per Sec15 rows selected.
Actually, there are a further 7 metrics persisted in that view if you also have _awr_flush_workload_metrics=TRUE, although enabling Baselines doesn't do this by default, you won't be able to set adaptive thresholds for these metrics and you shouldn't be changing the value of hidden parameters anyway ... but worth mentioning should you ever see 22 rows returned by the previous query!
METRIC_NAME ---------------------------------------------------------------- DB Block Changes Per User Call DB Block Gets Per User Call Executions Per User Call Logical Reads Per User Call Total Sorts Per User Call Total Table Scans Per User Call User Calls Per Txn7 rows selected.
When you've enabled Baselines ...
you need to click the 'Manage Static Metric Baselines' link at the bottom of the screen. (No, you won't find me disagreeing that the OEM web interface can be a frustrating search for the right screen/link/button/breadcrumb, but I've sort of got used to it because of some of the cool stuff it can do.) Click the, erm, Create button to create a new baseline and you'll see this screen.
Give it a name, pick a period of at least 7 days and then, if you want, you can just click OK and your Static Metric Baseline is created. The statistics will be computed and you'll be able to see your new baseline in the dictionary.
SQL> select name, type, status from dbsnmp.MGMT_BSLN_BASELINES;NAME T ---------------------------------------------------------------- - STATUS ---------------- Doug Test S INACTIVE
(Note that it's INACTIVE and the type is 'S' for Static.)
But that simple creation of a Static Metric Baseline ignored the 'Time Grouping' and 'Statistics Preview' sections that I, and I suspect others, have found a little confusing at first. So they deserve a post of their own - the next one ...
References
Although the volume of information out there isn't great, the quality of some of it is, so here's some further reading if you're interested in the subject.
The Documentation
"Metric Baselines: Detecting Unusual Performance Events Using System-Level Metrics in EM 10gR2". John Beresniewicz's White Paper. Seriously good stuff, but JB will need to post it at Ashmasters.com
Updated later - thanks to JB for letting me host the document here.
Arup Nanda article on OTN
Cary Millsap blog post discussing the underlying concept, rather than this particular implementation. In the post, he mentions a couple of other resources ...
CMG Paper– (Note also the discussion about Treemaps - blog post about that coming up later)
Robyn Sands on Variance.
#3 - Gary 2012-06-28 11:38 - (Reply)
Quick question - in order to enable the moving window baseline, do you need to use EM, or is there a DBMS_ procedure that can also do this? I ask because we don't have EM on our Failsafe clustered production environment, but I'm keen to use a non-static baseline, and then further investigate administering the thresholds, again through the command line (so assuming the latter can also be done via DBMS_). Cheers much
#4 - Doug Burns said:
2012-06-29 10:54 - (Reply)
Hi Gary,
Hopefully this covers what you're looking for ...
http://www.oracle-base.com/articles/11g/awr-baseline-enhancements-11gr1.php
#5 - Gary 2012-06-29 11:08 - (Reply)
Cheers Doug - unfortunately we're still on 10.2.0.4 here. If we were on 11g I'd be a much happier bunny :-p
#6 - Doug Burns said:
2012-06-29 11:52 - (Reply)
Mmmm, tricky one. It's a long time since I've worked with this stuff much but I suspect not.
The problem is that baselines and thresholds operate in a weird space that's partly OEM and partly the RDBMS and I suspect in 10g you probably can't do quite a few things without using OEM.
#7 - Gary 2012-06-29 12:44 - (Reply)
Aye - I suspected as much - cheers for the reply. Enjoy the sunshine :-p


Tracked: May 15, 18:00
Hello and welcome to the 146th edition of Log Buffer, the weekly review of database blogs. I have to make this a quick one, but I hope (as always) that the links give you the highlights of this week’s blogs. Oracle Let’s start with Jonatha...
Tracked: May 15, 21:59
Tracked: Aug 06, 15:18