Jul 30: AWR Differences Report
This morning I had an opportunity to use one of my favourite AWR tools, the differences report. Our system has a fairly involved overnight batch schedule consisting of multiple concurrent job streams that starts at 02:00 and usually completes at about 05:40. This morning it finished at 06:40, so I needed to work out what had gone wrong. (Better still, because this wasn't a problem with specific SQL statements, I can blog about it without using any system-specific information that needs to be obscured - and that doesn't happen too often.)
Because I don't have access to the server to run $ORACLE_HOME/rdbms/admin/awrddrpt.sql, I used a SQL-only approach.
First, find out which snapshots I'm interested in.
select snap_id, end_interval_time
from   dba_hist_snapshot
where  end_interval_time > trunc(sysdate-1)
order  by snap_id;

8945 28/07/2009 12:00:18.555 AM
8946 28/07/2009 1:00:30.647 AM
8947 28/07/2009 2:00:42.740 AM
8948 28/07/2009 3:00:55.045 AM
8949 28/07/2009 4:00:07.050 AM
8950 28/07/2009 5:00:19.198 AM
8951 28/07/2009 6:00:31.596 AM
8952 28/07/2009 7:00:43.751 AM
8953 28/07/2009 8:00:55.820 AM
8954 28/07/2009 9:00:07.679 AM
8955 28/07/2009 10:00:19.891 AM
8956 28/07/2009 11:00:31.994 AM
8957 28/07/2009 12:00:44.094 PM
8958 28/07/2009 1:00:56.239 PM
8959 28/07/2009 2:00:08.123 PM
8960 28/07/2009 3:00:20.204 PM
8961 28/07/2009 4:00:32.556 PM
8962 28/07/2009 5:00:44.682 PM
8963 28/07/2009 6:00:56.788 PM
8964 28/07/2009 7:00:08.679 PM
8965 28/07/2009 8:00:20.759 PM
8966 28/07/2009 9:00:32.851 PM
8967 28/07/2009 10:00:45.004 PM
8968 28/07/2009 11:00:57.713 PM
8969 29/07/2009 12:00:09.744 AM
8970 29/07/2009 1:00:21.840 AM
8971 29/07/2009 2:00:33.939 AM
8972 29/07/2009 3:00:46.268 AM
8973 29/07/2009 4:00:58.378 AM
8974 29/07/2009 5:00:10.702 AM
8975 29/07/2009 6:00:23.183 AM
8976 29/07/2009 7:00:35.604 AM
8977 29/07/2009 8:00:48.040 AM
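The report call also needs the DBID and instance number. If you don't have them to hand, they're easy to pick up with a trivial query against a couple of standard views (on RAC you'd look at gv$instance instead):

-- Fetch the DBID and instance number the diff report needs
select d.dbid, i.instance_number
from   v$database d, v$instance i;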
Now that I have those snapshot IDs, I can generate the AWR differences report, passing in the DBID, instance number, and begin and end snapshot IDs for the first period, followed by the same for the second period.
select * from TABLE(DBMS_WORKLOAD_REPOSITORY.awr_diff_report_html(
    4034329550, 1, 8947, 8951,
    4034329550, 1, 8971, 8975));
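If you're running this from SQL*Plus, you can spool the rows straight out to an HTML file and open it in a browser. A minimal sketch - the set commands are just the usual ones for getting clean spool output:

set pagesize 0 linesize 1500 trimspool on heading off feedback off
spool awr_diff.html
-- ... run the awr_diff_report_html query above ...
spool off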
The report shows information for both periods, laid out for easy comparison. What immediately jumped out at me was the much higher DB Time over the first period than the second, which is probably more easily expressed as Average Active Sessions (AAS) over the period. (Some people might say AAS is the magic metric.)
Yes, this is over a long four-hour period, but I have the benefit of knowing that this is the same batch process that runs every night, and its performance and workload profile are more or less the same night after night. There's still no substitute for knowing your systems, after all. So to have over four times the number of sessions active, averaged over a long period, is a sure sign that something weird is going on, and I had an immediate suspicion about what it might be.
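If you'd rather calculate the numbers behind AAS yourself than read them off the report, you can do it directly against the workload repository. Here's a minimal sketch using the snapshot IDs above; it assumes a single instance and a single DBID in the repository, and relies on DB time being stored in microseconds, so AAS is just the DB Time delta divided by the elapsed wall-clock seconds:

-- AAS over snapshots 8947 to 8951: DB time delta / elapsed seconds
select (e.value - b.value) / 1e6 /
       ((cast(se.end_interval_time as date) -
         cast(sb.end_interval_time as date)) * 86400) as avg_active_sessions
from   dba_hist_sys_time_model b,
       dba_hist_sys_time_model e,
       dba_hist_snapshot       sb,
       dba_hist_snapshot       se
where  b.stat_name = 'DB time'
and    e.stat_name = 'DB time'
and    sb.snap_id  = 8947
and    se.snap_id  = 8951
and    b.snap_id   = sb.snap_id
and    e.snap_id   = se.snap_id;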
Let's look at the Top 5 Timed Events section.
Yes, as I suspected, the first period, with many more active sessions on average, is showing a lot of time spent on the parallel execution-related 'PX Deq Credit: send blkd' event, which doesn't show up at all in the second period. In fact, that's why it has an asterisk next to it in the first period's numbers: to show it's an event which doesn't appear in the top 5 for the second period. That's what I love about this report. It takes the particular strength of AWR/Statspack - comparing 'good' and 'bad' periods - and lays it out in a readable format. No more sitting at a desk with print-outs of two Statspack reports, poring over the details, or flicking between two vi sessions paging through the results. Those who've done that will know precisely what I mean.
We have four times as many sessions active in the first period, and there's a lot of parallel wait time that doesn't appear in the second period, so I suspect that PX wasn't being used properly in the second period. I can check various statistics for that. This is a small section of the Instance Activity part of the report.
Yes, every parallel operation in the second period was downgraded to serial. You can see that the use of parallelism isn't very successful in the first 'good' period either, but that's a whole other story and possibly a blog post or two. That's where the big difference in DB Time has come from, though - lots of parallel slave activity that wasn't there last night.
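If you want to keep an eye on those counters outside of AWR, the same statistics are exposed in v$sysstat. A quick sketch - the exact statistic names can vary a little between versions:

-- Current cumulative values of the PX downgrade statistics
select name, value
from   v$sysstat
where  name like 'Parallel operations%';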
I think this is an interesting example because this particular report is showing a 'good' period with higher DB Time than a 'bad' period, which isn't what you'd normally expect to see. That's a side effect of looking at consolidated DB Time - there's lots of additional wait time from the various slaves, but they're actually helping push more work through the system, so they're a good thing. To put it another way, the instance was much busier during the good period, which is why it was able to push more work through and complete more quickly. Even though that might make sense, look at the graphical representation in OEM. Here's the good period
... and here's the bad.
I'm not convinced that people wouldn't look at those two pictures and think, 'Mmmmm, things look much worse in the first period'. Oh, I should also point out that the bump at 22:00 each day is the auto stats gather job, which ran for the entire maintenance window on the first night.
The root cause was a little more difficult to find, but I got there in the end. TOAD has a nasty habit of keeping hold of PX slaves long after a parallel query has finished. (Well, that's my experience, using version 188.8.131.52; post a comment if you know different.) Unfortunately we have parallel DEGREE settings of 8 across our logging and monitoring tables (we shouldn't), so if I query those using TOAD and forget to disable parallel query at the session level, Oracle decides to run them with parallel plans. I had one such query which used 16 slaves - our parallel_max_servers setting - which meant that there were no slaves left for any of our batch processes! Result: the batch was delayed by an hour.
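For what it's worth, the session-level setting I keep forgetting is a one-liner (standard syntax):

-- Stop this session from using parallel query plans
alter session disable parallel query;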
The punchline, though, is that although the ETL process sets DEGREE to 1 on most of the tables on completion, it doesn't set the indexes back to 1 (which is an oversight). The end result is that our users use parallel query during the day for their reports and (this is definitely another blog post) it makes them run more slowly. So now that I have all the slaves assigned to my query, it's helped all the user queries run more quickly, so I'm sure they're happy. In fact, maybe I should just run one of those queries each morning from TOAD
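A quick way to spot the indexes the ETL missed is to look at dba_indexes. A sketch - note that DEGREE is stored as a string there, and the schema and index names below are made-up placeholders:

-- Find indexes still carrying a parallel DEGREE setting
select owner, index_name, degree
from   dba_indexes
where  owner = 'MY_SCHEMA'                 -- hypothetical schema
and    trim(degree) not in ('0', '1');

-- ... then reset each one, e.g.
alter index my_schema.my_index noparallel;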
The AWR differences report is very handy sometimes, and I know it can't be that well known because, when I had to request one prior to having the privileges myself, the DBA denied its existence!
Nice blog, Doug! It truly shows the power of the differences report.
I noticed the strange habit TOAD has of keeping the PX slaves attached last week. I used 'select * from dual' to cut them loose. Do you know if SQL Developer does the same?
Also, this post shows that PX is a feature which is only effective if it's explicitly tailored to the environment, to use available resources.
> Do you know if SQL Developer does the same?
I can't say I do, but I'll make a note to check. To be honest, I was surprised that TOAD does.
> PX is a feature which is only effective if it's explicitly tailored to the environment, to use available resources.
Amen to that
I've noticed it very recently; your blog reminded me about the behaviour (the big pink cloud in my waits after the execution has finished).
That statement about PX carries a lot of implications, which aren't communicated to PX users very often:
- If your OLTP workload is (quite) unpredictable in its resource usage, PX will kill performance.
- 'More is better' is not always true.
- If your night batches are subject to change due to company requirements or other reasons, PX can become ineffective.
Fairly regularly, I encounter clients who totally abuse PX, mainly because they couldn't find documentation about the don'ts of PX.
> TOAD has a nasty habit of keeping hold
> of PX slaves long after a parallel
> query has finished.
AFAIK there is no "feature" to keep PX slaves once an execution has finished. The problem with TOAD and similar tools, in my opinion (I almost never use those kinds of tools... so, I might be wrong...), is that by default they only fetch the first few dozen rows. Therefore the cursor is still open and the database engine is waiting for the subsequent fetches. So a user who doesn't realize that could casually have a look at a few rows in a table, do nothing else, and thereby keep the PX slaves "busy".
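You can see this happening from another session, by the way. A sketch - in v$px_session the query coordinator and its slaves share the same QCSID, so a coordinator that's still holding slaves shows up as a group:

-- Sessions currently holding PX slaves, grouped by coordinator
select qcsid, sid, serial#, server_set, degree
from   v$px_session
order  by qcsid, server_set, sid;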
Of course, that's what it is. You're right.
Makes complete sense now that you mention it.
I am slightly confused by this part:
"Because I don't have access to the server to run $ORACLE_HOME/rdbms/admin/awrddrpt.sql.."
I believe this script is part of the Oracle client installation as well (together with awrrpt.sql).
BTW - the fact that AWR usage (e.g. awrrpt.sql) is subject to Diagnostic Pack licensing still p* me off..
I actually thought about that line before I typed it, because in some cases the script will be on the client. However, it's not in the client installation at the site I'm currently working at (I've just checked), and people often don't have local admin rights to install a new client.
Even if it is part of your client installation, the client and server may be slightly different versions, and I've run into small problems running the client version in the past, so I'd use it with caution.
Nice post Doug. Although I routinely compare AWR reports, I never use the diff report. I usually just pop the two reports I'm comparing into a couple of windows and scroll them side by side. I had forgotten that the diff report does provide some additional info (the calculations of percentage differences in values for example). Anyway, thanks for reminding me of this report. I've used it already this morning.
Thanks Doug. The diff report really helped when we hit a problem after applying the January 2010 security patches. We saw direct path reads double, which doubled our query times.
Cool stuff!
Doing the same, but here "gc" is the culprit.