Adaptive Thresholds in 10g - Part 1 (Metric Ba ...

Doug's Oracle Blog

  • Home
  • Papers
  • Books
  • C.V.
  • Fun
  • Oracle Blog
  • Personal Blog

May 7: Adaptive Thresholds in 10g - Part 1 (Metric Baselines)

Some features in this post require a Diagnostics Pack license.

[I really didn't want to get into another multi-part blog post, but this has grown longer than I hoped, so I'll split it up ...]

One of the more interesting components of the 10gR2 instrumentation improvements is the use of Metric Baselines and Adaptive Thresholds by OEM DB/Grid Control to generate alerts. (There are further improvements in 11g but I’ll address those in a later post.)

Although these features have been designed to be extremely simple to implement without understanding the mechanics, I suspect that most Oracle DBAs still struggle with the concept of implementing something without having some idea how it works (at least I hope so). There are some good technical documents floating around out there (see References below), but I feel they're at one end of the technical spectrum or the other, so this and the next post are intended to bridge the gap as well as showing the features in use. (Harder than you might think, but I'll come back to that.)

How can we use Metric Baselines and Adaptive Thresholds? I’ll try to put it as simply as possible :-

• Generate a baseline of how a system looks during ‘normal’ operation.
• Generate alerts when specific metrics exceed threshold values relative to that baseline.

There’s quite a bit more to it than that, as I hope will become apparent.

But first, a warning. This is not about system optimisation – that happened during the design, development, test and initial implementation phases, right? ;-) – but about monitoring and capturing unexpected variance in performance indicators. Because even with an optimised application and hardware infrastructure  that generally meets requirements ...

• … a new Execution Plan could be produced for a SQL statement because of an overnight change in object statistics.
• … a failed cache battery somewhere in the storage infrastructure could cause a sudden increase in I/O times.
• … you could encounter a larger number of concurrent users than the system has handled before or was designed for.

I've seen all of these and more and I’m sure you’ve experienced plenty of your own. So there's no denying that it’s important to implement an optimised system first and foremost but, even when you’ve done so, something will go wrong one day. I’m probably making up my own phrases for things again, but I think these tools cover the areas of Performance Monitoring and Troubleshooting, rather than Performance Tuning (or Optimisation).

For existing Production systems, I’d like a facility that detects performance anomalies in the single database instance that has a problem right now (out of the hundreds I’m managing) and notifies me so that I can focus my attention where it’s required.

You could argue that such a facility already exists - called Users. They will generally detect serious performance anomalies and they will notify us using the tried and trusted telephone. If this happens too frequently, a further indicator would be the endless crisis meetings with white boards and wildly differing theories. However, it would be nice if I knew there was a problem before someone had to tell me. Users also don’t tend to be very good at reporting overnight batch throughput problems until the next day!

Metric Baselines
I'll dig into Metric Baselines first because

1) Defining a Baseline is the first thing you have to do.
2) Whilst configuring the Baseline might initially seem the simplest step, it's the foundation on which the threshold alerts depend. (That's why Oracle has tried to simplify it for busy DBAs.)

There are really two types of Baselines, which differ in the way the time period is defined; the way statistics are calculated and, consequently, their suitability for different uses.

A Moving Window Baseline uses recent data from the AWR repository over 7, 21, 35 or 91 days. As each day passes, the window on the data progresses forward by one day. The statistics are re-calculated on a regular basis (possibly as frequently as every hour depending on the Time Grouping). The effect of this is that the statistics change to reflect the recent workload and performance characteristics of the system.

A Static Baseline uses AWR data from a user-defined period which must be at least 7 days long. The statistics are calculated once, when you define the baseline, and are used forever until you switch to a new baseline. So Static Baselines don't make any allowance for the change in a system's work profile over time. If you have a steadily increasing number of concurrent users, you will eventually reach a stage where the system is alerting regularly because the work profile is so much greater than the Baseline.

Which is best for you depends on the characteristics of the system you're managing and what you're trying to achieve. I'd suggest that you need a Moving Window Baseline in most cases, unless you have very strict performance requirements and a very stable system that you don't expect to change over time.

One situation that almost demands the use of Static Baselines is playing around with this on your own setup at home. (In fact, this is just the first of a few difficulties I've faced playing around with this stuff because a single-user laptop is not the design target!) Think about it. For a Moving Window Baseline to make any sense, your system has to have been processing a 'normal' workload for at least the past 7 days, which is pretty unlikely on a laptop I switch off each night ;-) The design expects systems to be active on a more or less continuous basis, as most business systems are. So, in order to give me Metric Baseline statistics that I could re-use in future without needing ongoing continuous activity, I created a Static Metric Baseline covering last week. Why last week? Well, that comes to the next difficulty I faced. The Baseline period must have included enough activity on which to base the statistical computations used (see next post). Most weeks there probably wouldn't have been sufficient data on which to base the computations, but my laptop was more active during a week when I was teaching the course for two days and preparing in the evenings.

The best way to show you what I mean is to create a new Static Metric Baseline. First click on the Metric Baselines link at the bottom of various pages (e.g. Database Home page, Performance Page). If you don't have Baselines enabled, you'll be prompted to confirm that you want to enable them. When you click yes, DB/Grid Control will set the value of the _awr_flush_threshold_metrics hidden parameter to TRUE. (Unfortunately, this is thrown as a compliance error in DB Control 10.2.0.3 – see this Metalink Forum Thread  relating to bug number 4749372)

There are 135 metrics in Oracle 10.2.0.4 (including some old friends like the Buffer Cache Hit Ratio!), which you can see by querying the V$SYSMETRIC_HISTORY view. e.g.

SQL> select group_name, metric_name, metric_unit from dba_hist_metric_name
  2* where metric_name like 'B%'

GROUP_NAME ---------------------------------------------------------------- METRIC_NAME ---------------------------------------------------------------- METRIC_UNIT ---------------------------------------------------------------- System Metrics Short Duration Buffer Cache Hit Ratio % (LogRead - PhyRead)/LogRead

Session Metrics Long Duration Blocked User Session Count Sessions

System Metrics Long Duration Branch Node Splits Per Txn Splits Per Txn

System Metrics Long Duration Branch Node Splits Per Sec Splits Per Second

System Metrics Long Duration Background Checkpoints Per Sec Check Points Per Second

System Metrics Long Duration Buffer Cache Hit Ratio % (LogRead - PhyRead)/LogRead

6 rows selected.

but there are only 15 used by Metric Baselines and for which you can set Adaptive Thresholds, which are persisted in the AWR repository when _awr_flush_threshold_metrics=TRUE

SQL> select distinct metric_name from DBA_HIST_SYSMETRIC_HISTORY
  2  order by metric_name;



METRIC_NAME
----------------------------------------------------------------
Current Logons Count
DB Block Changes Per Txn
Database Time Per Sec
Enqueue Requests Per Txn
Executions Per Sec
Logical Reads Per Txn
Network Traffic Volume Per Sec
Physical Reads Per Sec
Physical Writes Per Sec
Redo Generated Per Sec
Response Time Per Txn
SQL Service Response Time
Total Parse Count Per Txn
User Calls Per Sec
User Transaction Per Sec

15 rows selected.

Actually, there are a further 7 metrics persisted in that view if you also have _awr_flush_workload_metrics=TRUE, although enabling Baselines doesn't do this by default, you won't be able to set adaptive thresholds for these metrics and you shouldn't be changing the value of hidden parameters anyway ... but worth mentioning should you ever see 22 rows returned by the previous query!

METRIC_NAME
----------------------------------------------------------------
DB Block Changes Per User Call
DB Block Gets Per User Call
Executions Per User Call
Logical Reads Per User Call
Total Sorts Per User Call
Total Table Scans Per User Call
User Calls Per Txn

7 rows selected.


When you've enabled Baselines ...



you need to click the 'Manage Static Metric Baselines' link at the bottom of the screen. (No, you won't find me disagreeing that the OEM web interface can be a frustrating search for the right screen/link/button/breadcrumb, but I've sort of got used to it because of some of the cool stuff it can do.) Click the, erm, Create button to create a new baseline and you'll see this screen.


Give it a name, pick a period of at least 7 days and then, if you want, you can just click OK and your Static Metric Baseline is created. The statistics will be computed and you'll be able to see your new baseline in the dictionary.

SQL> select name, type, status from dbsnmp.MGMT_BSLN_BASELINES;

NAME                                                             T ---------------------------------------------------------------- - STATUS ---------------- Doug Test                                                        S INACTIVE

(Note that it's INACTIVE and the type is 'S' for Static.)

But that simple creation of a Static Metric Baseline ignored the 'Time Grouping' and 'Statistics Preview' sections that I, and I suspect others, have found a little confusing at first. So they deserve a post of their own - the next one ...

References

Although the volume of information out there isn't great, the quality of some of it is, so here's some further reading if you're interested in the subject.

The Documentation

"Metric Baselines: Detecting Unusual Performance Events Using System-Level Metrics in EM 10gR2". John Beresniewicz's White Paper. Seriously good stuff, but JB will need to post it at Ashmasters.com ;-) Updated later - thanks to JB for letting me host the document here.

Arup Nanda article on OTN

Cary Millsap blog post discussing the underlying concept, rather than this particular implementation. In the post, he mentions a couple of other resources ...

CMG Paper– (Note also the discussion about Treemaps - blog post about that coming up later)

Robyn Sands on Variance.

Posted by Doug Burns Comments: (7) Trackbacks: (3)
Defined tags for this entry: adaptive thresholds
Related entries by tags:
Adaptive Thresholds in 10g - Part 3 (Setting Thresholds)
Adaptive Thresholds in 10g - Part 2 (Time Grouping)

Trackbacks
Trackback specific URI for this entry

PingBack
Weblog: www.pythian.com
Tracked: May 15, 18:00
Log Buffer #146: a Carnival of the Vanities for DBAs
Hello and welcome to the 146th edition of Log Buffer, the weekly review of database blogs. I have to make this a quick one, but I hope (as always) that the links give you the highlights of this week’s blogs. Oracle Let’s start with Jonatha...
Weblog: Pythian Group Blog
Tracked: May 15, 21:59
PingBack
Weblog: oakesgr.wordpress.com
Tracked: Aug 06, 15:18

Comments
Display comments as (Linear | Threaded)

#1 - JB 2009-05-07 17:36 - (Reply)

Nice post Doug, thanks. Will send you comments separately.

#2 - gazzali 2010-05-26 09:55 - (Reply)

Its really good and simple sufficient to manage metrics

#3 - Gary 2012-06-28 11:38 - (Reply)

Quick question - in order to enable the moving window baseline, do you need to use EM, or is there a DBMS_ procedure that can also do this? I ask because we don't have EM on our Failsafe clustered production environment, but I'm keen to use a non-static baseline, and then further investigate administering the thresholds, again through the command line (so assuming the latter can also be done via DBMS_). Cheers much

#4 - Doug Burns said:
2012-06-29 10:54 - (Reply)

Hi Gary,

Hopefully this covers what you're looking for ...

http://www.oracle-base.com/articles/11g/awr-baseline-enhancements-11gr1.php

#5 - Gary 2012-06-29 11:08 - (Reply)

Cheers Doug - unfortunately we're still on 10.2.0.4 here. If we were on 11g I'd be a much happier bunny :-p

#6 - Doug Burns said:
2012-06-29 11:52 - (Reply)

Mmmm, tricky one. It's a long time since I've worked with this stuff much but I suspect not.

The problem is that baselines and thresholds operate in a weird space that's partly OEM and partly the RDBMS and I suspect in 10g you probably can't do quite a few things without using OEM.

#7 - Gary 2012-06-29 12:44 - (Reply)

Aye - I suspected as much - cheers for the reply. Enjoy the sunshine :-p


Add Comment

Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
BBCode format allowed
 
 

Statistics on Partitioned Tables

Contents

Part 1 - Default options - GLOBAL AND PARTITION
Part 2 - Estimated Global Stats
Part 3 - Stats Aggregation Problems I
Part 4 - Stats Aggregation Problems II
Part 5 - Minimal Stats Aggregation
Part 6a - COPY_TABLE_STATS - Intro
Part 6b - COPY_TABLE_STATS - Mistakes
Part 6c - COPY_TABLE_STATS - Bugs and Patches
Part 6d - COPY_TABLE_STATS - A Light-bulb Moment
Part 6e - COPY_TABLE_STATS - Bug 10268597

Comments

Doug Burns about 10053 Trace Files - Different Plan in Different Environments
Tue, 02.04.2013 08:57
You're welcome. Now I just nee d to pull my finger out and ac tually come up [...]
Howard Rogers about 10053 Trace Files - Different Plan in Different Environments
Mon, 01.04.2013 23:08
Makes a big difference, so tha nks for that! With two brow ser windows, o [...]
stelioscharalambides.com about 10053 Trace Files
Sat, 30.03.2013 16:28

Upcoming Presentations

Bookmark

Open All | Close All

Syndicate This Blog

  • XML RSS 2.0 feed
  • ATOM/XML ATOM 1.0 feed
  • XML RSS 2.0 Comments
  • Feedburner Feed

Powered by

Serendipity PHP Weblog

Show tagged entries

xml 11g
xml ACE
xml adaptive thresholds
xml ASH
xml Audit Vault
xml AWR
xml Blogging
xml conferences
xml Cuddly Toys
xml Database Refresh
xml DBMS_STATS
xml Direct Path Reads
xml Fun
xml grid control
xml hotsos 2010
xml listener
xml Locking
xml oow
xml oow2009
xml optimiser
xml OTN
xml Parallel
xml Partitions
xml Patching
xml swingbench
xml The Reality Gap
xml time matters
xml ukoug
xml ukoug2009
xml Unix/Shell
xml Useful Links

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.

Design by Andreas Viklund | Conversion to s9y by Carl