That subject has been covered well enough, in my opinion. (To pick one example, this post and its comments are around 5 years old.) Diagnostics Pack customers should almost always increase the default AWR retention period for important systems, even allowing for any additional space required in the SYSAUX tablespace.
However, I've found myself talking about the best default AWR snapshot *interval* several times over recent months and years and realising that I'm slightly out of step with the prevailing wisdom on the subject, so let's talk about intervals.
I'll kick off by saying that I think people should stick to the default 1 hour interval, rather than the 15 or 30 minute intervals that most of my peers seem to want. Let me explain why.
Initially I was influenced by some of the performance guys working in Oracle and I remember being surprised by their insistence that one hour is a good interval, which is why they picked it. Hold on, though - doesn't everyone know that a 1 hour AWR report smoothes out detail too much?
Then I got into some discussions about Adaptive Thresholds and it started to make more sense. If you want to compare performance metrics over time and trigger alerts automatically based on apparently unusual performance events or workload profiles, then comparing specific hours today to specific hours a month ago makes more sense than getting down to 15 minute intervals, which would be far too sensitive to subtle changes. Adaptive Thresholds would become barking mad if the interval granularity was too fine. But when nobody used Adaptive Thresholds much, even though they seemed like a good idea (sorry JB), this argument started to make less sense to me.
However, I still think that there are very solid reasons to stick to 1 hour and they make more sense when you understand all of the metrics and analysis tools at your disposal and treat them as a box of tools appropriate to different problems.
Let's go back to why people think that a 1 hour interval is too long. The problem with AWR, Statspack and bstat/estat is that they are system-wide reporting tools that capture the difference (or deltas) between the values of various metrics over a given interval. There are at least a couple of problems with that that come to mind.
1) Although a bit of a simplification, almost all of the metrics are system-wide, which makes them a poor data source for analysing an individual user's performance experience or an individual batch job, because systems generally have a mixture of different activities running concurrently. (Benchmarks and load tests are notable exceptions.)
2) Problem 1 becomes worse when you are looking at *all* of the activity that occurred over a given period of time (the AWR Interval), condensed into a single data set or report. The longer the AWR period you report on, the more useless the data becomes. What use is an AWR report covering a one week period? So much has happened during that time and we might only be interested in what was happening at 2:13 am this morning.
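To make the averaging effect concrete, here is a minimal Python sketch with entirely made-up numbers. It treats a metric as a cumulative counter, takes deltas between snapshots, and compares the per-minute rate seen at a 60 minute and a 15 minute interval when the hour contains a 3 minute spike. The function names and the workload figures are assumptions for illustration only, not anything AWR itself exposes.

```python
def cumulative(per_minute):
    """Turn per-minute activity into a cumulative counter (one value per snapshot point)."""
    total = 0
    counter = [0]
    for value in per_minute:
        total += value
        counter.append(total)
    return counter


def interval_rates(counter, interval_minutes):
    """Average per-minute rate for each snapshot interval, computed from counter deltas."""
    return [
        (counter[end] - counter[end - interval_minutes]) / interval_minutes
        for end in range(interval_minutes, len(counter), interval_minutes)
    ]


# Hypothetical hour: steady 60 units/minute, with a spike of 600 units/minute
# during minutes 13, 14 and 15.
load = [60] * 60
for minute in (13, 14, 15):
    load[minute] = 600

counter = cumulative(load)
hourly = interval_rates(counter, 60)    # one delta for the whole hour
quarter = interval_rates(counter, 15)   # four 15 minute deltas

print("60 minute interval:", hourly)
print("15 minute interval:", quarter)
```

With these numbers the hourly figure is 87 units/minute and the busiest 15 minute figure is 132, against a true peak of 600. The shorter interval shows more, but it is still an average, and here the spike happens to straddle two snapshots.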
In other words, AWR reports combine a wide activity scope (everything on the system) with a wide time scope (hours or days if generated without thought). Intelligent performance folks reduce the impact of the latter problem by narrowing the time scope and reducing the snapshot interval so that if a problem has just happened or is happening right now, they can focus on the right 15 minutes of activity.[1]
Which makes complete sense in the Statspack world they grew up in, but makes a lot less sense since Oracle 10g was released in 2004! These days there are probably better tools for what you're trying to achieve.
But, as this post is already getting pretty long, I'll leave that for Part 2.
[1] The natural endpoint to this narrowing of time scope is when people use tools like Swingbench for load testing and select the option to generate AWR snapshots immediately before and after the test they're running. Any AWR report of that interval will only contain the relevant information if the test is the only thing running on the system. At last year's OpenWorld, Graham Wood and I also covered the narrowing of the activity scope by, for example, running the AWR SQL report (awrsqrpt.sql) to limit the report to a single SQL statement of interest. It's easy for people to forget - it's a *suite* of tools and worth knowing the full range so that you pick the appropriate one for the problem at hand.
(Reminder, just in case we still need it, that the use of features in this post requires a Diagnostics Pack license.)
Isn't this really about understanding which tools are for which jobs and making the right choices?
AWR (minus ASH) is intended to provide a "big picture" view of a meaningful time interval to the business, where "meaningful time interval" really is a proxy for saying "time of consistent workload". The "big picture" is for determining if there are large or systemic issues that can be identified for improving overall performance, and indeed that is exactly what ADDM does after every AWR snapshot. Many databases support different workloads at different times, e.g. batch vs. online, but typically over multiple hours and not within a single hour (periodically.) So that is the purpose of AWR - big picture - and having 24 snapshots per day provides a very good big picture while also separately capturing the larger workload cycles if they exist.
The default of 8 days retention is a concession to the need to support even small demo databases out of the box with minimum risk of space issues. Any real database supporting production workloads should be modified to at least 35 days in my opinion, or should aggressively move snapshots into a global AWR Warehouse.
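As a rough sketch of what that change involves: the retention and interval arguments of DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS are given in minutes, so 35 days has to be converted first. The Python helper below (snapshot_settings_call is a made-up name, not part of any Oracle tooling) only builds the PL/SQL text. It does not connect to a database, and the call is worth checking against the documentation for your version before you run it.

```python
MINUTES_PER_DAY = 24 * 60


def snapshot_settings_call(retention_days, interval_minutes):
    """Build the PL/SQL text for changing AWR retention and interval (both in minutes)."""
    retention_minutes = retention_days * MINUTES_PER_DAY
    return (
        "BEGIN DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS("
        f"retention => {retention_minutes}, interval => {interval_minutes}); END;"
    )


# 35 days of history at the default 1 hour interval.
print(snapshot_settings_call(35, 60))
```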
You mentioned adaptive thresholds, but that is about monitoring and alerting for potential run-time issues, not about analyzing for systemic performance improvement. Again, purpose-tool mismatch, in spite of the other issues with Adaptive Thresholds (having to do not with that technology itself but rather the very botched integration of DB Server Alerts with EM Alerts)
Diagnosing transient problems whose impact is hidden within AWR by the hour-long averaging is difficult, but attempting to mitigate that by reducing the snapshot interval is simply using the wrong tool for the task. This is what ASH is for, and I would recommend the ASH report or ASH Analytics for investigating such transient issues.
Thanks for popping by, JB
LOL. I came to read the post again so I could write the second part, only to find you've more or less written it already!
"attempting to mitigate that by reducing the snapshot interval is simply using the wrong tool for the task."
Bingo! That's precisely what these two posts were going to be about, because I think people are still using AWR too much and for the wrong types of issues (I think), when ASH is a better alternative.
Which I'll try to get round to writing up
So if ASH is the right tool for shorter-term performance "incidents" vs AWR then it is a legitimate question whether ASH alone is enough for reasonable incident diagnosis, and the answer in general is probably yes. I could see having things like EVENT_HISTOGRAM and WAITCLASSMETRIC captured at higher frequency to fill out the picture, and because of Adaptive Thresholds there are specific SYSMETRICS persisted into AWR at their 1-minute intervals (so these are useful and already there; unfortunately I foolishly overlooked the best of these and it is not included - "DB CPU per Sec")
Anyway, I'm not fundamentally opposed to modifying the snapshot interval, just think it is likely being done for the wrong reasons in many if not most cases.
OK, just to make Doug sorry that his is the only blog that I sometimes visit and then over-comment on, thought I would add a bit more.
If the problem is happening in the moment and needs diagnosis then perhaps the best tool for the job is Tanel Poder's "snapper" in the hands of a competent performance analyst. While I have not used it myself, my understanding is that it offers the ability to create "snapshot deltas" over key V$ data and over relatively small time intervals in "real time"
So there are kind of two orthogonal dimensions to consider:
1/ Systemic or transient performance issue?
2/ Presently occurring or in the past?
The original ADDM working off AWR snapshots was pretty much for systemic issues in the past (with the option to generate snapshot/ADDM report manually at any time for some possible insight into current issue.)
However, now there is a flavor of ADDM (forget what name it finally got) that is supposed to be looking at near-time performance and identifying/diagnosing any issues, and I think that it should cover the present-time transient issues pretty well.
So the two big categories: systemic/chronic issues (visible in some or many AWR snapshots) and transient problems (visible in V$ASH and other V$) should both be pretty well-covered by Diag and Tuning pack features, in particular ADDM, ASH and AWR.
But ADDM is not infallible and nothing beats a good brain, so an expert with snapper still should find herself quite useful!
Systemic/past => AWR+ADDM
Systemic/present => ASH+ASH Analytics, Snapper
Transient/present => ASH+ASH Analytics, Snapper
Transient/past => pray it doesn't happen again
Not sure this was worth the verbiage, but as I said my motive is partly to annoy Doug.
"(forget what name it finally got)"
Real-time ADDM. I've used it a few times and it's ok but I'm not 100% sure what it's added to the toolbox, from my perspective.
Sadly, for you, I actually managed to stay awake long enough to read your comment and agree with it.
My beef is with people using AWR reports for *everything* because they're attached to them from their Statspack days and for some issues they are not the most appropriate tool! Mucking around with AWR intervals is a warning sign to me that people are coming from that angle.
In my practical experience of solving lots of real world performance problems, or at least identifying the root cause, ASH is nearly always good enough at the 1-second in-memory sample granularity. As it moves to the 10-second granularity for the past then it naturally becomes a bit more difficult but is still usually good enough.
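A toy Python sketch of why the coarser granularity is "a bit more difficult but still usually good enough". It assumes, purely for illustration, that samples land on whole seconds and that the retained historical samples fall on multiples of 10 seconds; real ASH sampling is not aligned that neatly. Each sample is taken to stand for one sampling interval of activity.

```python
def samples_seen(start_s, duration_s, sample_every_s):
    """Count sample points (multiples of sample_every_s) that fall inside the event."""
    return sum(
        1 for t in range(start_s, start_s + duration_s) if t % sample_every_s == 0
    )


def estimated_seconds(start_s, duration_s, sample_every_s):
    """Estimate the event's duration: each sample represents sample_every_s seconds."""
    return samples_seen(start_s, duration_s, sample_every_s) * sample_every_s


# Hypothetical events: a 25 second stall and a 4 second blip, both starting at second 13.
stall_1s = estimated_seconds(13, 25, 1)    # in-memory, 1 second samples
stall_10s = estimated_seconds(13, 25, 10)  # history, 10 second samples
blip_1s = estimated_seconds(13, 4, 1)
blip_10s = estimated_seconds(13, 4, 10)

print("25s stall:", stall_1s, "vs", stall_10s)
print(" 4s blip:", blip_1s, "vs", blip_10s)
```

In this sketch the stall is estimated at 25 seconds from the 1 second samples and 20 seconds from the 10 second ones, while the short blip is missed entirely at the coarser granularity.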
I think my concern about people modifying the standard interval is also why they are doing it, but I'd add another couple of points.
1) I remember us having a conversation once about how it made more sense to humans/businesses to talk in terms of 'Monday night' or 'last Thursday between 2 and 4am', i.e. they're not usually interested in, or adjusted to, fine granularities.
2) I made the point in part 2 that as the strength of AWR is the history, then I would always take 4 times the number of one hour snapshots over the equivalent 15 minute snapshots *every time*. In the real world of storage chargeback, you would be surprised how often I see people reduce their retention to reduce space usage whilst retaining their 15 minute interval. Now, *that* I definitely have a problem with, in the presence of ASH.
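The arithmetic behind that trade-off, as a small Python sketch. It assumes, as a simplification, that the space used is roughly proportional to the number of snapshots kept, which will not hold exactly on a real system. The function names are mine.

```python
MINUTES_PER_DAY = 24 * 60


def snapshots_retained(retention_days, interval_minutes):
    """Number of snapshots held for a given retention period and snapshot interval."""
    return retention_days * MINUTES_PER_DAY // interval_minutes


def retention_days_for(snapshot_budget, interval_minutes):
    """Days of history that a fixed number of snapshots covers at a given interval."""
    return snapshot_budget * interval_minutes / MINUTES_PER_DAY


# The default 8 days of retention, but at a 15 minute interval.
budget = snapshots_retained(8, 15)

print("snapshots kept:", budget)
print("days covered at 15 minutes:", retention_days_for(budget, 15))
print("days covered at 60 minutes:", retention_days_for(budget, 60))
```

Under that assumption, the 768 snapshots that cover 8 days at 15 minutes would cover 32 days at one hour.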