(A reminder, just in case we still need it, that the features used in this post require a Diagnostics Pack license.)
Damn me for taking so long to write blog posts these days. By the time I get around to them, certain very knowledgeable people have commented on part 1 and given the game away!
I finished the last part by suggesting that a narrow AWR interval makes less sense in a post-10g Diagnostics Pack landscape than it used to when we used Statspack.
Why do people argue for a Statspack/AWR interval of 15 or 30 minutes on important systems? Because when they encounter a performance problem that is happening right now, or that didn't last very long in the past, they can drill into a narrower period of time in an attempt to improve the quality of the data available to them and any analysis based on it. (As an aside, I'm sure most of us have generated additional Statspack/AWR snapshots manually to *really* reduce the time scope to what is happening right now on the system, although this is not very smart if you're using AWR and Adaptive Thresholds!)
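For reference, taking one of those ad-hoc snapshots is a one-liner via the DBMS_WORKLOAD_REPOSITORY package (sketched here for SQL*Plus; and remember the caveat above about Adaptive Thresholds expecting evenly spaced intervals):

```sql
-- Take an AWR snapshot immediately, outside the regular schedule,
-- so a report can be run from this point to the next snapshot.
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;
```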
However, there are better tools for the job these days.
If I have a user complaining about system performance then I would ideally want to narrow down the scope of the performance metrics to that user’s activity over the period of time they’re experiencing a slow-down. That can be a little difficult on modern systems that use complex connection pools, though. Which session should I trace? How do I capture what has already happened as well as what’s happening right now? Fortunately, if I’ve already paid for Diagnostics Pack then I have *Active Session History* at my disposal, constantly recording snapshots of information for all active sessions. In which case, why not look at
- The session or sessions of interest (which could also be *all* active sessions if I suspect a system-wide issue)
- For the short period of time I’m interested in
- To see what they’re actually doing
Rather than running a system-wide report for a 15 minute interval that aggregates the data I’m interested in with other irrelevant data? (To say nothing of having to wait for the next AWR snapshot or take a manual one and screwing up the regular AWR intervals ...)
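The kind of focused ASH query this implies might look something like the following sketch, which summarises what active sessions have been doing over the last 15 minutes. (The column choices are illustrative; in practice you'd add predicates on SESSION_ID, MODULE, ACTION or one of the other ASH dimensions to narrow the scope to the sessions of interest.)

```sql
-- V$ACTIVE_SESSION_HISTORY samples all active sessions roughly once
-- per second, so each sample approximates one second of DB time.
SELECT session_state,
       NVL(event, 'ON CPU') AS event,
       COUNT(*)             AS samples
FROM   v$active_session_history
WHERE  sample_time > SYSTIMESTAMP - INTERVAL '15' MINUTE
GROUP  BY session_state, NVL(event, 'ON CPU')
ORDER  BY samples DESC;
```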
When analysing system performance, it’s important to use the most appropriate tool for the job and, in particular, focus your data collection on what is *relevant to the problem under investigation*. The beauty of ASH is that if I’m not sure what *is* relevant yet, I can start with a wide scope of all sessions to help me find the session or sessions of interest and gradually narrow my focus. It has the history that AWR has, but with finer granularity of scope (whether that be sessions, sql statements, modules, actions or one of the many other ASH dimensions). Better still, if the issue turns out to be one long-running SQL statement, then a SQL Monitoring Active Report probably blows all the other tools out of the water!
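If that one long-running statement has been identified, a SQL Monitoring Active Report can be pulled for it with DBMS_SQLTUNE. A minimal sketch (the sql_id here is a hypothetical placeholder):

```sql
-- Generate a SQL Monitoring Active Report for a single statement.
-- Spool the CLOB output to an .html file and open it in a browser.
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id => 'abcd1234wxyz9',   -- hypothetical sql_id of interest
         type   => 'ACTIVE') AS report
FROM   dual;
```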
With all that capability, why are experienced people still so obsessed with the Top 5 Timed Events section of an AWR report as one of their first points of reference? Is it just because they've become attached to it over their years of using Statspack? AWR has its uses (see JB's comments for some thoughts on that, and I've blogged about it extensively in the past) but analysing specific performance issues on Production databases is not its strength. In fact, if we're going to use AWR, why not just use ADDM and let software automatically perform the same type of analysis most DBAs would do anyway (and, in many cases, not do as well!)
Remember, there’s a reason behind these Recurring Conversations posts. If I didn’t keep finding myself debating these issues with experienced Oracle techies, I wouldn’t harbour doubts about what seem to be common approaches. In this case, I still think there are far too many people using AWR where ASH or SQL Monitoring are far more appropriate tools. I also think that if we stick with a one hour interval rather than a 15 minute interval, we can retain four times as much *history* in the same space! When it comes to AWR – give me long retention over a shorter interval every time!
P.S. As well as thanking JB for his usual insightful comments, I also want to thank Martin Paul Nash. When I was giving an AWR/ASH presentation at this spring's OUGN conference, he noticed the bullet point I had on a slide suggesting that we *shouldn't* change the AWR interval and asked why. Rather than going into it at the time, I asked him to remind me at the end of the presentation and then, because I had no time to answer, I promised I'd blog about it that weekend. That was almost 4 months ago! Sigh. But at least I got there in the end!
That subject has been covered well enough, in my opinion. (To pick one example, this post and its comments are around 5 years old.) Diagnostics Pack customers should almost always increase the default AWR retention period for important systems, even allowing for any additional space required in the SYSAUX tablespace.
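For reference, both retention and interval are controlled through the same DBMS_WORKLOAD_REPOSITORY call; a sketch of bumping retention to roughly 90 days while leaving the interval alone (both arguments are in minutes):

```sql
-- Increase AWR retention without touching the snapshot interval.
BEGIN
  DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(
    retention => 90 * 24 * 60,   -- 129600 minutes = ~90 days
    interval  => 60);            -- keep the default 1 hour interval
END;
/
```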
However, I've found myself talking about the best default AWR snapshot *interval* several times over recent months and years and realising that I'm slightly out of step with the prevailing wisdom on the subject, so let's talk about intervals.
I'll kick off by saying that I think people should stick to the default 1 hour interval, rather than the 15 or 30 minute intervals that most of my peers seem to want. Let me explain why.
Initially I was influenced by some of the performance guys working in Oracle and I remember being surprised by their insistence that one hour is a good interval, which is why they picked it. Hold on, though - doesn't everyone know that a 1 hour AWR report smoothes out detail too much?
Then I got into some discussions about Adaptive Thresholds and it started to make more sense. If you want to compare performance metrics over time and trigger alerts automatically based on apparently unusual performance events or workload profiles, then comparing specific hours today to specific hours a month ago makes more sense than getting down to 15-minute intervals, which would be far too sensitive to subtle changes. Adaptive Thresholds would become barking mad if the interval granularity were too fine. But when hardly anybody ended up using Adaptive Thresholds, even though they seemed like a good idea (sorry, JB!), this argument started to make less sense to me.
However, I still think that there are very solid reasons to stick to 1 hour and they make more sense when you understand all of the metrics and analysis tools at your disposal and treat them as a box of tools appropriate to different problems.
Let's go back to why people think that a 1 hour interval is too long. The problem with AWR, Statspack and bstat/estat is that they are system-wide reporting tools that capture the difference (or deltas) between the values of various metrics over a given interval. There are at least a couple of problems with that that come to mind.
1) Although this is a bit of a simplification, almost all of the metrics are system-wide, which makes them a poor data source for analysing an individual user's performance experience or an individual batch job, because systems generally have a mixture of different activities running concurrently. (Benchmarks and load tests are notable exceptions.)
2) Problem 1 becomes worse when you are looking at *all* of the activity that occurred over a given period of time (the AWR Interval), condensed into a single data set or report. The longer the AWR period you report on, the more useless the data becomes. What use is an AWR report covering a one week period? So much has happened during that time and we might only be interested in what was happening at 2:13 am this morning.
In other words, AWR reports combine a wide activity scope (everything on the system) with a wide time scope (hours or days if generated without thought). Intelligent performance folks reduce the impact of the latter problem by narrowing the time scope and reducing the snapshot interval so that if a problem has just happened or is happening right now, they can focus on the right 15 minutes of activity¹.
Which makes complete sense in the Statspack world they grew up in, but makes a lot less sense since Oracle 10g was released in 2004! These days there are probably better tools for what you're trying to achieve.
But, as this post is already getting pretty long, I'll leave that for Part 2.
¹ The natural endpoint to this narrowing of time scope is when people use tools like Swingbench for load testing and select the option to generate AWR snapshots immediately before and after the test they're running. Any AWR report of that interval will only contain the relevant information if the test is the only thing running on the system. At last year's OpenWorld, Graham Wood and I also covered the narrowing of the Activity scope by, for example, running the AWR SQL report (awrsqrpt.sql) to limit the report to a single SQL statement of interest. It's easy for people to forget - it's a *suite* of tools and it's worth knowing the full range so that you pick the appropriate one for the problem at hand.
May 28: OUG Scotland
As always, kudos to Thomas Presslie and all the good people in the UKOUG office who work so hard putting this together.
Which is why there's still a place for the detailed technical blog posts, once you've got the basics clear. The problem with not having the basics clear is that I've had people tell me things about Incremental Stats based on the blog posts and, when I read them, they make sense to me but I can also see why they might confuse others. Anyway, for completeness, here are some terrific posts about Incremental Stats.
Randolf Geist always has excellent stuff, as does John Hallas and, of course, when Maria Colgan was the Optimizer PM, she used to write about this stuff all the time.
Read blogs, by all means, but maybe start off with White Papers and the documentation and full presentations?
In the first part, I explained that Incremental Statistics are designed to allow a partitioned table's Global Statistics to be updated based on a combination of
1) Statistics gathered by analysing the contents of one or more partitions that have just been loaded or have been updated (and see this blog post for more depth on what 'updated' means!)
2) The statistics of existing partitions which are represented by synopses that are already stored in the SYSAUX tablespace.
Using this combination, we can avoid scanning every partition in the table every time we want to update the Global Stats, which is an expensive operation likely to be infeasible every time we load some data into a large table.
For me, the key word here is Incremental. Global Statistic updates are an incremental process, building on previous statistics (represented by the synopses) and only updating the Global Statistics based on the changes introduced by loading new partitions.
Understanding this might clear up another area of confusion I keep coming across. After upgrading their database to 11g, people often want to try out Incremental Global Stats on one of their existing large tables because they've always struggled to keep their Global Stats consistent and up to date. Maybe it's just the sites I work at but I'd say this is the most popular use case. Incrementals for a planned large partitioned table in your new systems might be a sensible idea, but there are a lot more existing systems out there with Global Stats collection problems that people have struggled with for years.
Most people I've spoken to initially had the impression that they can simply flick the INCREMENTAL switch and perhaps modify some of the parameters to their existing DBMS_STATS calls so that GRANULARITY is AUTO and they use AUTO sampling sizes. All of which is discussed in the various white papers and blog posts out there.
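That usual recipe looks roughly like the sketch below, using a hypothetical partitioned table SH.SALES as the example:

```sql
-- Switch on Incremental Statistics for one table, then gather with the
-- AUTO defaults so synopses are created and maintained from here on.
BEGIN
  DBMS_STATS.SET_TABLE_PREFS('SH', 'SALES', 'INCREMENTAL', 'TRUE');
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'SH',
    tabname          => 'SALES',
    granularity      => 'AUTO',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
END;
/
```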
Then they get a hell of a surprise when the very first gather runs for ages! How long is ages? I don't know in your particular case, but I've seen this running for hours and hours and hours, with people crying into their keyboards, wondering why something that was supposed to make things run more quickly is so much slower than their usual stats calls.
The best way I've found to explain this phenomenon is to concentrate on the synopses that describe the existing partitions. Where do you think they come from? How are they calculated and populated if you don't ask Oracle to look at the existing data in your enormous table? That's what needs to happen. In order to make future updates to your Global Stats much more efficient, we first need to establish the baseline describing your existing data that Oracle will use as the foundation for the later incremental updates.
Generating the synopses as the baseline for future improvements will be relatively painful for the largest tables (if it weren't, you probably wouldn't be so interested in Incrementals!), but it only has to happen once. You just need to understand that it does have to happen and plan for it as part of your migration.
My personal suggestion is usually to just delete all of the existing stats and start from scratch with modern default parameter values and tidy up any stats-related junk that might be lingering around large, critical tables. Painful but probably worth it!
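A sketch of that clean-slate approach, again against the hypothetical SH.SALES table: wipe the old statistics, switch on INCREMENTAL, then let the first gather (the long one!) build the synopsis baseline with modern defaults.

```sql
-- Delete the accumulated stats junk, then regather from scratch.
-- The first gather after this will scan the whole table to build
-- the synopses - plan for it taking a long time.
BEGIN
  DBMS_STATS.DELETE_TABLE_STATS('SH', 'SALES');
  DBMS_STATS.SET_TABLE_PREFS('SH', 'SALES', 'INCREMENTAL', 'TRUE');
  DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES');
END;
/
```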