That subject has been covered well enough, in my opinion. (To pick one example, this post and its comments are around 5 years old.) Diagnostics Pack customers should almost always increase the default AWR retention period for important systems, even allowing for the additional space required in the SYSAUX tablespace.
However, I've found myself talking about the best default AWR snapshot *interval* several times over recent months and years, and realising that I'm slightly out of step with the prevailing wisdom on the subject, so let's talk about intervals.
I'll kick off by saying that I think people should stick to the default 1 hour interval, rather than the 15 or 30 minute intervals that most of my peers seem to want. Let me explain why.
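For reference, both the retention and the interval are controlled through DBMS_WORKLOAD_REPOSITORY. Here's a minimal sketch that extends retention while leaving the interval at the 1 hour default - the 35 day retention is just an illustrative value, not a recommendation:

```sql
-- Extend AWR retention while keeping the default 60 minute snapshot
-- interval. Both arguments are specified in minutes.
BEGIN
  DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(
      retention => 35 * 24 * 60,   -- 35 days (illustrative value only)
      interval  => 60);            -- stick with the 1 hour default
END;
/
```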
Initially I was influenced by some of the performance guys working in Oracle and I remember being surprised by their insistence that one hour is a good interval, which is why they picked it. Hold on, though - doesn't everyone know that a 1 hour AWR report smooths out detail too much?
Then I got into some discussions about Adaptive Thresholds and it started to make more sense. If you want to compare performance metrics over time and trigger alerts automatically based on apparently unusual performance events or workload profiles, then comparing specific hours today to the same hours a month ago makes more sense than getting down to 15 minute intervals, which would be far too sensitive to subtle changes. Adaptive Thresholds would go barking mad if the interval granularity were too fine. But when hardly anyone ended up using Adaptive Thresholds, even though they seemed like a good idea (sorry, JB), this argument started to make less sense to me.
However, I still think that there are very solid reasons to stick to 1 hour, and they make more sense when you understand the full range of metrics and analysis tools at your disposal and treat them as a toolbox, with different tools appropriate to different problems.
Let's go back to why people think that a 1 hour interval is too long. The problem with AWR, Statspack and bstat/estat is that they are system-wide reporting tools that capture the differences (or deltas) between the values of various metrics over a given interval. At least a couple of problems with that approach come to mind.
1) Although this is a bit of a simplification, almost all of the metrics are system-wide, which makes them a poor data source for analysing an individual user's performance experience or an individual batch job, because systems generally have a mixture of different activities running concurrently. (Benchmarks and load tests are notable exceptions.)
2) Problem 1 becomes worse when you are looking at *all* of the activity that occurred over a given period of time (the AWR Interval), condensed into a single data set or report. The longer the AWR period you report on, the more useless the data becomes. What use is an AWR report covering a one week period? So much has happened during that time and we might only be interested in what was happening at 2:13 am this morning.
In other words, AWR reports combine a wide activity scope (everything on the system) with a wide time scope (hours or days if generated without thought). Intelligent performance folks reduce the impact of the latter problem by narrowing the time scope and reducing the snapshot interval so that if a problem has just happened or is happening right now, they can focus on the right 15 minutes of activity¹.
Which makes complete sense in the Statspack world they grew up in, but makes a lot less sense since Oracle 10g was released in 2004! These days there are probably better tools for what you're trying to achieve.
But, as this post is already getting pretty long, I'll leave that for Part 2.
¹ The natural endpoint to this narrowing of time scope is when people use tools like Swingbench for load testing and select the option to generate AWR snapshots immediately before and after the test they're running. An AWR report of that interval will contain only the relevant information, provided the test is the only thing running on the system. At last year's OpenWorld, Graham Wood and I also covered narrowing the *activity* scope by, for example, running the AWR SQL report (awrsqrpt.sql) to limit the report to a single SQL statement of interest. It's easy for people to forget - it's a *suite* of tools, and it's worth knowing the full range so that you pick the appropriate one for the problem at hand.
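To make that concrete, here's a minimal sketch of both narrowing techniques in SQL*Plus. The DBID and SQL_ID values are made-up placeholders, and the snapshot bracketing mirrors what the Swingbench option does:

```sql
-- Narrow the time scope: bracket a test with manual snapshots.
VARIABLE begin_snap NUMBER
VARIABLE end_snap   NUMBER
EXEC :begin_snap := DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;
-- ... run the test ...
EXEC :end_snap := DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;

-- Narrow the activity scope: report on a single SQL statement over
-- that interval (the DBID and SQL_ID here are made-up placeholders).
SELECT output
FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_SQL_REPORT_TEXT(
           l_dbid     => 1234567890,   -- from V$DATABASE.DBID
           l_inst_num => 1,
           l_bid      => :begin_snap,
           l_eid      => :end_snap,
           l_sqlid    => 'abcd1234efgh5'));
```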
May 28: OUG Scotland
As always, kudos to Thomas Presslie and all the good people in the UKOUG office who work so hard putting this together.
Which is why there's still a place for detailed technical blog posts, once you've got the basics clear. The problem with *not* having the basics clear is that people have told me things about Incremental Stats based on blog posts which, when I read them, make sense to me, though I can also see why they might confuse others. Anyway, for completeness, here are some terrific posts about Incremental Stats.
Randolf Geist always has excellent stuff, as does John Hallas and, of course, when Maria Colgan was the Optimizer PM, she used to write about this area all the time.
Read blogs, by all means, but maybe start off with White Papers and the documentation and full presentations?
In the first part, I explained that Incremental Statistics are designed to allow a partitioned table's Global Statistics to be updated based on a combination of:
1) Statistics gathered by analysing the contents of one or more partitions that have just been loaded or have been updated (and see this blog post for more depth on what 'updated' means!)
2) The statistics of existing partitions which are represented by synopses that are already stored in the SYSAUX tablespace.
Using this combination, we can avoid scanning every partition in the table every time we want to update the Global Stats - an expensive operation that is likely to be infeasible every time we load some data into a large table.
For me, the key word here is Incremental. Global Statistics updates are an incremental process, building on previous statistics (represented by the synopses) and only updating the Global Statistics based on the changes introduced by loading new partitions.
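If you want to check where a given table stands, here are a couple of quick probes. SH.SALES is a hypothetical example, and note that the second query hits an internal, undocumented SYSAUX table whose name can vary between versions:

```sql
-- Is incremental maintenance switched on for this table?
SELECT DBMS_STATS.GET_PREFS('INCREMENTAL', 'SH', 'SALES') AS incremental
FROM   dual;

-- The synopses live in internal SYSAUX tables (undocumented and
-- version-dependent); a rough check that some exist:
SELECT COUNT(*) FROM sys.wri$_optstat_synopsis_head$;
```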
Understanding this might clear up another area of confusion I keep coming across. After upgrading their database to 11g, people often want to try out Incremental Global Stats on one of their existing large tables because they've always struggled to keep their Global Stats consistent and up to date. Maybe it's just the sites I work at but I'd say this is the most popular use case. Incrementals for a planned large partitioned table in your new systems might be a sensible idea, but there are a lot more existing systems out there with Global Stats collection problems that people have struggled with for years.
Most people I've spoken to initially had the impression that they could simply flick the INCREMENTAL switch and perhaps modify some of the parameters of their existing DBMS_STATS calls so that GRANULARITY is AUTO and they use AUTO sample sizes. All of which is discussed in the various white papers and blog posts out there.
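In code, that switch-flicking typically looks something like this minimal sketch (again using the hypothetical SH.SALES):

```sql
-- The 'just flick the switch' approach: enable incremental
-- maintenance and gather with the recommended AUTO settings.
BEGIN
  DBMS_STATS.SET_TABLE_PREFS('SH', 'SALES', 'INCREMENTAL', 'TRUE');
  DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'SH',
      tabname          => 'SALES',
      granularity      => 'AUTO',
      estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
END;
/
```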
Then they get a hell of a surprise when the very first gather runs for ages! How long is ages? I don't know in your particular case, but I've seen this running for hours and hours and hours, with people crying into their keyboards, wondering why something that was supposed to make things run more quickly is so much slower than their usual stats calls.
The best way I've found to explain this phenomenon is to concentrate on the synopses that describe the existing partitions. Where do you think they come from? How are they calculated and populated if you don't ask Oracle to look at the existing data in your enormous table? That's what needs to happen. In order to make future updates to your Global Stats much more efficient, we first need to establish the baseline describing your existing data that Oracle will use as the foundation for the later incremental updates.
Generating the synopses as the baseline for future improvements will be relatively painful for the largest tables (if it wasn't, you probably wouldn't be so interested in Incrementals), but it only has to happen once. You just need to understand that it does have to happen and plan for it as part of your migration.
My personal suggestion is usually to just delete all of the existing stats, start from scratch with modern default parameter values, and tidy up any stats-related junk that might be lingering around large, critical tables. Painful, but probably worth it!
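As a sketch, with the hypothetical names again - ESTIMATE_PERCENT is just one example of a preference that older scripts often left set to a non-default value:

```sql
-- Start from scratch: wipe the existing stats, clear a lingering
-- non-default preference, then gather with modern defaults.
BEGIN
  DBMS_STATS.DELETE_TABLE_STATS('SH', 'SALES');
  DBMS_STATS.DELETE_TABLE_PREFS('SH', 'SALES', 'ESTIMATE_PERCENT');
  DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES');
END;
/
```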
Even though my blogging has slowed to a crawl, I still spend a lot of time having the same conversations with multiple people and on work chat channels, which suggests to me that those topics are not well understood. Because I work with a lot of smart people, it's typically not the details that they struggle with. They've read the detailed technical blog posts and performed multiple web searches, so they've seen the most detailed material available, and yet somehow they're missing the *point*. Maybe that's the problem with learning everything via blogs and white papers? They're no substitute for someone explaining the fundamental concepts and design of a feature. I could probably rephrase that as: maybe there's no substitute for actually reading a book or attending a course occasionally? I appreciate how out-of-date that view might be, though.
I'm sure some will have already realised that Recurring Conversations could probably be called Frequently Asked Questions, so let me begin this first post with:
'Why are my Global Statistics taking longer to gather when I use Oracle's snazzy 11g Incremental Global Statistics feature than when I don't?'
This has baffled a lot of people I know, because I'm not sure they fully understand why the feature was introduced. They want to convert one of their existing partitioned tables to use Incremental Global Statistics, so they test the performance by doing something like this (sketched in code after the list).
1) Delete all of the stats on a large partitioned table.
2) Set INCREMENTAL to FALSE and then gather table stats using GRANULARITY => 'GLOBAL'
3) Set INCREMENTAL to TRUE and then gather table stats using GRANULARITY => 'GLOBAL'
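In DBMS_STATS terms, that test looks roughly like this minimal sketch (SH.SALES is a hypothetical partitioned table):

```sql
-- The naive comparison (hypothetical table SH.SALES).
BEGIN
  -- 1) Delete all of the stats.
  DBMS_STATS.DELETE_TABLE_STATS('SH', 'SALES');

  -- 2) Non-incremental Global gather: scans every partition.
  DBMS_STATS.SET_TABLE_PREFS('SH', 'SALES', 'INCREMENTAL', 'FALSE');
  DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES', granularity => 'GLOBAL');

  -- 3) Incremental Global gather: scans every partition too,
  --    and builds the synopses on top.
  DBMS_STATS.SET_TABLE_PREFS('SH', 'SALES', 'INCREMENTAL', 'TRUE');
  DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES', granularity => 'GLOBAL');
END;
/
```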
When they time this, they find that step 3 takes just as long as step 2 - in fact, it takes a little longer! This seems useless. What is the point of this new feature if it doesn't speed up the gathering of Global Stats?
First I want to look at what we asked Oracle to do in steps 2) and 3) above.
2) Visited all of the partitions of the table to gather information, then updated the Global Stats on the table.
3) Visited all of the partitions of the table to gather information, updated the Global Stats on the table and generated synopses for future use.
On that basis, why *wouldn't* option 3 take longer than option 2? They do more or less the same thing but 3) has to do a little additional work.
So if it isn't quicker to gather Global Stats using Incremental Global Statistics, why would you use it?
The benefits come not from the initial gathering of Global Stats, but when you gather stats on newly loaded partitions and *don't* need to re-gather the Global Stats any more. Instead, Oracle uses those handy synopses to update them, which is a much quicker operation! The Real World cycle of use then looks like this (sketched in code after the list).
1) Delete all of your existing table stats.
2) Set INCREMENTAL to TRUE.
3) Gather table stats using GRANULARITY => 'GLOBAL AND PARTITION' without supplying a PARTNAME. This will re-gather all of your Global and Partition stats across the table and build the initial synopses. Note that at this stage you have achieved no reduction in stats collection times.
4) As you load partitions with new data, or the data changes and you need to update your stats, you have a number of options, but the one I tend to use is to gather stats on each specific partition using GRANULARITY => 'GLOBAL AND PARTITION' with PARTNAME set to the name of the partition we've just loaded. Oracle will then gather Partition stats on just that one partition and update the Global Stats and the synopses based on the new data that's been introduced.
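Here's the whole cycle as a sketch, again using the hypothetical SH.SALES, this time with a newly loaded partition called SALES_2015Q1:

```sql
-- One-off setup: clean out old stats, enable incremental maintenance
-- and rebuild everything, including the initial synopses (steps 1-3).
BEGIN
  DBMS_STATS.DELETE_TABLE_STATS('SH', 'SALES');
  DBMS_STATS.SET_TABLE_PREFS('SH', 'SALES', 'INCREMENTAL', 'TRUE');
  DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES',
      granularity => 'GLOBAL AND PARTITION');
END;
/

-- Step 4, repeated after each load: gather just the new partition;
-- Oracle updates the Global Stats and synopses from there.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES',
      partname    => 'SALES_2015Q1',
      granularity => 'GLOBAL AND PARTITION');
END;
/
```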
Bingo – you've just maintained accurate Global Stats without having to trawl through the entire table again!
That's the point.
It's about *not* gathering Global Stats but also not letting them drift hopelessly out of whack with the contents of the table. Measuring the performance of a full Global Stats gathering operation doesn't illustrate the performance benefits.