Aug 14: Time Matters - An Infinite Capacity for Waiting*
Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time ----------------------------------------- ------------ ----------- ------ ------ db file sequential read 5,814 461 79 38.0 log file sync 4,246 421 99 34.7 log file parallel write 3,222 109 34 9.0 CPU time 108 8.9 db file parallel write 1,718 60 35 4.9
(Note that this is from a 10.2.0.4 Statspack report, but you would see something similar in all AWR reports and Statspack reports that include CPU time in this section.)
Add up those seconds in the Time column and, even just for the top 5 events, the total is 1159 seconds or 19 minutes and 19 seconds. How can the system have spent 19 minutes and 19 seconds using CPU or waiting on various events when there were only 330 seconds in the reporting period?
The person who asked me to clarify this after the training has seen a few Statspack reports in his time and understood that there is more CPU time available in a given period than the Wall Clock period because servers often have multiple CPUs. But, in this case, I ran the test on my laptop which has a dual-core CPU. Therefore the maximum CPU time available between these snapshots was 660 seconds, or 11 minutes.
No, the real reason is that Statspack is reporting system-wide event timings for more than one session and then aggregating them. In this case, there were four user sessions running the SOE benchmark during the period. Given that each user and background session is instrumented and clocking up both service and wait time, aggregating the data for all of the sessions will record more time than has passed if you look at the clock on the wall.
This is such an intuitive idea to me that I hadn't thought to explain it properly during the course, so I set a mental reminder that I would blog about it soon. Having said that, this has been written about by others on many occasions (*not least by Cary Millsap and Jeff Holt in Optimizing Oracle Performance. In fact, I stole the title of this blog post from their book - see page 215 for more information.) Therefore, if this post makes complete sense to you so far or you've read any of those sources before, continued reading will probably be a little boring. If not, this post is especially for you.
So let's revisit the tests. I ran two 5 minute tests using Swingbench's supplied Sales Order Entry benchmark, but I could have used anything for this. The important thing is that the first test consisted of one session running the SOE application, while four sessions were running during the second test.
Here is the top 5 timed events section of the Statspack report for the single session test.
Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time ----------------------------------------- ------------ ----------- ------ ------ db file sequential read 4,818 175 36 46.3 CPU time 57 15.2 db file parallel write 1,085 42 39 11.2 log file sync 821 38 46 10.0 log file parallel write 2,702 28 10 7.5
... and from the Statspack report for the four session test.
Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time ----------------------------------------- ------------ ----------- ------ ------ db file sequential read 5,814 461 79 38.0 log file sync 4,246 421 99 34.7 log file parallel write 3,222 109 34 9.0 CPU time 108 8.9 db file parallel write 1,718 60 35 4.9
So the two tests were running the same application, on the same hardware and for the same duration. Yet the values in the Time (s) column are much bigger when more sessions are running. The total for just the top 5 timed events is 340 seconds compared to 1159 seconds. (The average wait times have also increased in some cases, because the system is busier, and different events become more significant as concurrent sessions contend with each other. There are various things we could try to deduce from this section, particularly as this was a controlled test, but we'd need more supporting information. Let's just stick to the apparently increased time.)
The problem here, if you see it as a problem, is that aggregated data is being reported. Statspack is not the only example of this, though. Tracing Parallel Execution tasks and then aggregating the trace files using trcsess into one consolidated trace file will also result in variable timings, depending on the Degree of Parallelism. I talked about this in slides 33 to 37 of an earlier presentation. In fact, I can make the Time (s) values as big as I want to, just by running the SOE benchmark with more and more sessions.
That's why the % Total Call Time is a far more useful column. It tells you the system-wide percentage contribution of each event to any performance problem over the given period so that you can focus on the most significant contributors. It should go without saying that looking at session-level statistics is more specific and therefore more reliable but that's not always possible unless you're using ASH, Kyle Hailey's Simulated ASH, some other session history recorder or can recreate the problem session on demand, but that's another, bigger argument that I won't go into here.
So if wall clock time for those tests was about 300 seconds, then what is that bigger value being recorded in the Time (s) column? What I'm really talking about is DB Time, but I'll leave that for the next post.
#1 - Chen Shapira said:
2008-08-15 02:52 - (Reply)
Doug,
Just to check if I follow you correctly - if I have 4 sessions running on the system, and I'm looking at an aggregated report for 6 minutes, the total times reported cannot exceed 6*60*4 seconds, correct?
#1.1 - Doug Burns said:
2008-08-15 07:41 - (Reply)
Chen,
For the purposes of understanding, yes that's about the size of it.
If you look at the entire Statspack report that section came from, though, the "four sessions" in this test were actually over 20 sessions, including background processes. So the real upper limit on time for that period was much higher.
There are also some anomalies in CPU time recording between versions and always bear in mind we're only looking at the top 5 events.
But, as I said, your understanding of what's going on is correct and should become clearer when I get to DB time in the next post.
Cheers,
Doug
#2 - nm 2008-08-20 11:15 - (Reply)
Hi Doug,
Why do you total the time in Time (s).? The reason I say this, is during the time the db file sequential read event occurs, is not possible that a log file parallel write also take place?
Or does this go back to the first comment made - by Chen. Wherein each session has its own time to account for. The session responsible for db file sequential read is not the same session that performs the log file parallel write.
What confuses me, is what I mentioned earlier - since some of these tasks would be occurring at the same time would not totaling all the values result in , well , some form of double counting - if you get what I mean?
Thanks
#2.1 - Doug Burns said:
2008-08-20 12:29 - (Reply)
What confuses me, is what I mentioned earlier - since some of these tasks would be occurring at the same time would not totaling all the values result in , well , some form of double counting - if you get what I mean?
Yes, you're absolutely right! If all four sessions running the benchmark waited for 1 second on a log file sync event, 4 seconds would be added to the relevant row in that column and that's the very aggregation that I'm discussing here.
I don't know if you have Cary and Jeff's book but another read of page 215 is instructive if you do and I recommend you find a copy if you don't. It's why many people question the value of Statspack and AWR reports when analysing performance.
However, what this blog post is really about is
a) That's the reality of the top 5 timed events section and anyone who does use Statspack or AWR reports for performance analysis should at least be aware of what they're looking at. What prompted the blog post is that I've watched people totalling up the time in that column and wondering how it could be greater than the wall clock duration of the snapshot interval.
b) Using my previous example, do you think that it's useful to know that four sessions were waiting on log file sync for 1 second each, rather than just one? Yes, the aggregated value doesn't tell you anything about any of the individual sessions but it does tell you something about the instance as a whole.
Maybe point b) will become more clear in the next blog but your comment shows that you understand what that column means and also the dangers of aggregating the data!
#3 - Lei Zeng said:
2009-07-17 00:15 - (Reply)
I have one quesiton here. Do you know how the %Total Call time is calculated for each of those events?
Thanks very much,
Lei
#3.1 - Doug Burns said:
2009-07-20 10:31 - (Reply)
Lei,
What version of Oracle and are you talking Statspack or AWR?
Cheers,
Doug


[In retrospect, the title of that first blog post might have suited the subject, but doesn't translate too well for subsequent related blog posts. That was a lack of planning or foresight on my part. These blog posts tumbling out of my head in a fairly
Tracked: Aug 20, 13:13