Time Matters - An Infinite Capacity for Waiting*

Doug's Oracle Blog

  • Home
  • Papers
  • Books
  • C.V.
  • Fun
  • Oracle Blog
  • Personal Blog

Aug 14: Time Matters - An Infinite Capacity for Waiting*

When I was teaching the 10g Performance class at my last customer site, I remarked that the total number of seconds you see in the "Top 5 Timed Events" section of a Statspack or AWR report will often be greater than the total number of seconds between the snapshot intervals. For example, this is the "Top 5 Timed Events" section of a Statspack report covering two snapshots that are 5 minutes 30 seconds apart, either side of a 5 minute Sales Order Entry benchmark that I ran using Dominic Giles' Swingbench utility.

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                          5,814         461     79   38.0
log file sync                                    4,246         421     99   34.7
log file parallel write                          3,222         109     34    9.0
CPU time                                                       108           8.9
db file parallel write                           1,718          60     35    4.9

(Note that this is from a 10.2.0.4 Statspack report, but you would see something similar in all AWR reports and Statspack reports that include CPU time in this section.)

Add up those seconds in the Time column and, even just for the top 5 events, the total is 1159 seconds or 19 minutes and 19 seconds. How can the system have spent 19 minutes and 19 seconds using CPU or waiting on various events when there were only 330 seconds in the reporting period?

The person who asked me to clarify this after the training has seen a few Statspack reports in his time and understood that there is more CPU time available in a given period than the Wall Clock period because servers often have multiple CPUs. But, in this case, I ran the test on my laptop which has a dual-core CPU. Therefore the maximum CPU time available between these snapshots was 660 seconds, or 11 minutes.

No, the real reason is that Statspack is reporting system-wide event timings for more than one session and then aggregating them. In this case, there were four user sessions running the SOE benchmark during the period. Given that each user and background session is instrumented and clocking up both service and wait time, aggregating the data for all of the sessions will record more time than has passed if you look at the clock on the wall.

This is such an intuitive idea to me that I hadn't thought to explain it properly during the course, so I set a mental reminder that I would blog about it soon. Having said that, this has been written about by others on many occasions (*not least by Cary Millsap and Jeff Holt in Optimizing Oracle Performance. In fact, I stole the title of this blog post from their book - see page 215 for more information.) Therefore, if this post makes complete sense to you so far or you've read any of those sources before, continued reading will probably be a little boring. If not, this post is especially for you.

So let's revisit the tests. I ran two 5 minute tests using Swingbench's supplied Sales Order Entry benchmark, but I could have used anything for this. The important thing is that the first test consisted of one session running the SOE application, while four sessions were running during the second test.

Here is the top 5 timed events section of the Statspack report for the single session test.

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                          4,818         175     36   46.3
CPU time                                                        57          15.2
db file parallel write                           1,085          42     39   11.2
log file sync                                      821          38     46   10.0
log file parallel write                          2,702          28     10    7.5

... and from the Statspack report for the four session test.

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                          5,814         461     79   38.0
log file sync                                    4,246         421     99   34.7
log file parallel write                          3,222         109     34    9.0
CPU time                                                       108           8.9
db file parallel write                           1,718          60     35    4.9

So the two tests were running the same application, on the same hardware and for the same duration. Yet the values in the Time (s) column are much bigger when more sessions are running. The total for just the top 5 timed events is 340 seconds compared to 1159 seconds. (The average wait times have also increased in some cases, because the system is busier, and different events become more significant as concurrent sessions contend with each other. There are various things we could try to deduce from this section, particularly as this was a controlled test, but we'd need more supporting information. Let's just stick to the apparently increased time.)

The problem here, if you see it as a problem, is that aggregated data is being reported. Statspack is not the only example of this, though. Tracing Parallel Execution tasks and then aggregating the trace files using trcsess into one consolidated trace file will also result in variable timings, depending on the Degree of Parallelism. I talked about this in slides 33 to 37 of an earlier presentation. In fact, I can make the Time (s) values as big as I want to, just by running the SOE benchmark with more and more sessions.

That's why the % Total Call Time is a far more useful column. It tells you the system-wide percentage contribution of each event to any performance problem over the given period so that you can focus on the most significant contributors. It should go without saying that looking at session-level statistics is more specific and therefore more reliable but that's not always possible unless you're using ASH, Kyle Hailey's Simulated ASH, some other session history recorder or can recreate the problem session on demand, but that's another, bigger argument that I won't go into here.

So if wall clock time for those tests was about 300 seconds, then what is that bigger value being recorded in the Time (s) column? What I'm really talking about is DB Time, but I'll leave that for the next post.
Posted by Doug Burns Comments: (6) Trackback: (1)
Defined tags for this entry: Time Matters
Related entries by tags:
Network Events in ASH
Alternative Pictures Demo
That Pictures demo in full
Time Matters: Throughput vs. Response Time - Part 2
Time Matters: Throughput vs. Response Time
Time Matters - DB CPU
Time Matters - DB Time

Trackbacks
Trackback specific URI for this entry

Time Matters - DB Time
[In retrospect, the title of that first blog post might have suited the subject, but doesn't translate too well for subsequent related blog posts. That was a lack of planning or foresight on my part. These blog posts tumbling out of my head in a fairly
Weblog: Doug's Oracle Blog
Tracked: Aug 20, 13:13

Comments
Display comments as (Linear | Threaded)

#1 - Chen Shapira said:
2008-08-15 02:52 - (Reply)

Doug,

Just to check if I follow you correctly - if I have 4 sessions running on the system, and I'm looking at an aggregated report for 6 minutes, the total times reported cannot exceed 6*60*4 seconds, correct?

#1.1 - Doug Burns said:
2008-08-15 07:41 - (Reply)

Chen,

For the purposes of understanding, yes that's about the size of it.

If you look at the entire Statspack report that section came from, though, the "four sessions" in this test were actually over 20 sessions, including background processes. So the real upper limit on time for that period was much higher.

There are also some anomalies in CPU time recording between versions and always bear in mind we're only looking at the top 5 events.

But, as I said, your understanding of what's going on is correct and should become clearer when I get to DB time in the next post.

Cheers,

Doug

#2 - nm 2008-08-20 11:15 - (Reply)

Hi Doug,
Why do you total the time in Time (s).? The reason I say this, is during the time the db file sequential read event occurs, is not possible that a log file parallel write also take place?
Or does this go back to the first comment made - by Chen. Wherein each session has its own time to account for. The session responsible for db file sequential read is not the same session that performs the log file parallel write.
What confuses me, is what I mentioned earlier - since some of these tasks would be occurring at the same time would not totaling all the values result in , well , some form of double counting - if you get what I mean?
Thanks

#2.1 - Doug Burns said:
2008-08-20 12:29 - (Reply)

What confuses me, is what I mentioned earlier - since some of these tasks would be occurring at the same time would not totaling all the values result in , well , some form of double counting - if you get what I mean?

Yes, you're absolutely right! If all four sessions running the benchmark waited for 1 second on a log file sync event, 4 seconds would be added to the relevant row in that column and that's the very aggregation that I'm discussing here.

I don't know if you have Cary and Jeff's book but another read of page 215 is instructive if you do and I recommend you find a copy if you don't. It's why many people question the value of Statspack and AWR reports when analysing performance.

However, what this blog post is really about is

a) That's the reality of the top 5 timed events section and anyone who does use Statspack or AWR reports for performance analysis should at least be aware of what they're looking at. What prompted the blog post is that I've watched people totalling up the time in that column and wondering how it could be greater than the wall clock duration of the snapshot interval.

b) Using my previous example, do you think that it's useful to know that four sessions were waiting on log file sync for 1 second each, rather than just one? Yes, the aggregated value doesn't tell you anything about any of the individual sessions but it does tell you something about the instance as a whole.

Maybe point b) will become more clear in the next blog but your comment shows that you understand what that column means and also the dangers of aggregating the data!

#3 - Lei Zeng said:
2009-07-17 00:15 - (Reply)

I have one quesiton here. Do you know how the %Total Call time is calculated for each of those events?
Thanks very much,
Lei

#3.1 - Doug Burns said:
2009-07-20 10:31 - (Reply)

Lei,

What version of Oracle and are you talking Statspack or AWR?

Cheers,

Doug


Add Comment

Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
BBCode format allowed
 
 

Statistics on Partitioned Tables

Contents

Part 1 - Default options - GLOBAL AND PARTITION
Part 2 - Estimated Global Stats
Part 3 - Stats Aggregation Problems I
Part 4 - Stats Aggregation Problems II
Part 5 - Minimal Stats Aggregation
Part 6a - COPY_TABLE_STATS - Intro
Part 6b - COPY_TABLE_STATS - Mistakes
Part 6c - COPY_TABLE_STATS - Bugs and Patches
Part 6d - COPY_TABLE_STATS - A Light-bulb Moment
Part 6e - COPY_TABLE_STATS - Bug 10268597

Comments

Doug Burns about 10053 Trace Files - Different Plan in Different Environments
Tue, 02.04.2013 08:57
You're welcome. Now I just nee d to pull my finger out and ac tually come up [...]
Howard Rogers about 10053 Trace Files - Different Plan in Different Environments
Mon, 01.04.2013 23:08
Makes a big difference, so tha nks for that! With two brow ser windows, o [...]
stelioscharalambides.com about 10053 Trace Files
Sat, 30.03.2013 16:28

Upcoming Presentations

Bookmark

Open All | Close All

Syndicate This Blog

  • XML RSS 2.0 feed
  • ATOM/XML ATOM 1.0 feed
  • XML RSS 2.0 Comments
  • Feedburner Feed

Powered by

Serendipity PHP Weblog

Show tagged entries

xml 11g
xml ACE
xml adaptive thresholds
xml ASH
xml Audit Vault
xml AWR
xml Blogging
xml conferences
xml Cuddly Toys
xml Database Refresh
xml DBMS_STATS
xml Direct Path Reads
xml Fun
xml grid control
xml hotsos 2010
xml listener
xml Locking
xml oow
xml oow2009
xml optimiser
xml OTN
xml Parallel
xml Partitions
xml Patching
xml swingbench
xml The Reality Gap
xml time matters
xml ukoug
xml ukoug2009
xml Unix/Shell
xml Useful Links

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.

Design by Andreas Viklund | Conversion to s9y by Carl