Another 10046 Success

Doug's Oracle Blog

  • Home
  • Papers
  • Books
  • C.V.
  • Fun
  • Oracle Blog
  • Personal Blog

Dec 19: Another 10046 Success

We're implementing a new packaged application at work. It includes a history import job that takes data in a flat-file and loads it into database tables. It performs some degree of data transformation but, in essence it inserts about 500,000 rows into one table and thousands in to a few others. During the testing, the job took 5 hours to run - slower than you'd expect for what appears to be a simple job, but acceptable to the users who had just got used to it.


However, when it was executed against the new Production database, it was stopped when it was still running after 28 hours. Actually, it was only really after the job was stopped that I was dragged in to have a look. What the software house wanted to look at was what was 'wrong' with the Production database (which I'd created!). I was asked to send a list of parameters for both databases so that they could be compared, check that the optimiser statistics were appropriate (because we had run into problems with out of date statistics during testing) and generally provide information to establish why it was different. There were no obvious differences.


We ran the job again and took some Statspack snapshots and the over-riding view I was getting was that the database server was doing virtually nothing.


Having helped out with this as far as seemed sensible, I suggested that the only way to get to the bottom of this problem was to trace the job. People always seem to view this as a bit of a 'geeky' step that requires a lot of technical work and analysis, but it's better than just flailing around guessing at what might be happening.


Fortunately we were able to run the job against a subset of the data which resulted in a 20 minute execution time. We couldn't wait 28 hours for each test! Getting the trace file proved to be a lot more difficult than I'd hoped because this application implements it's own connection pooling so not only was it difficult to find the correct session, but to filter out irrelevant information. Eventually I managed to get a trace file and, to keep this post as short as possible, I'll just show the timed summary at the end of the formatted output of tkprof.

********************************************************************************


OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 24544 1.41 1.34 0 0 0 0
Execute 12332 4.63 4.09 182 17480 158580 12213
Fetch 226 0.03 0.04 0 3840 0 1817
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 37102 6.07 5.48 182 21320 158580 14030


Misses in library cache during parse: 0


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 24966 0.00 0.04
SQL*Net message from client 24966 35.26 1065.79
db file sequential read 182 0.00 0.02
log file sync 9 0.03 0.14
SQL*Net more data to client 24 0.00 0.00


There are a few things to note here. One, there are more parse operations than executions, which seems wrong. Why would you parse a statement and then not execute it? The fact is that this application does. See this asktom link for more info on why this might be and for a hint of the CPU nightmare that might be in store for us when we have more users!


Here's some more evidence of this activity from the trace file. (Believe me, this is representative of many other statements.)

call     count       cpu    elapsed       disk      query    current        rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 7 0.00 0.00 0 0 0 0
Execute 0 0.00 0.00 0 0 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 7 0.00 0.00 0 0 0 0

Second, look at how little logical and physical i/o this session has performed. In 20 minutes?!

Looking at the timed events there's hardly any time to speak of apart from the "SQL*Net message from client" idle event. Then again, when you look at that total - 1065 seconds - that's about 17 minutes and 45 seconds. Mmmm, in a (roughly) 20 minute job? Then it hit me what the difference between test and production was. The test database and application server are at the same site whereas the production database and app server are at different sites, communicating over a wide area network. When this was planned, we raised a concern about it but were told that there 's no need to worry because it's just a pilot for a small number of users. However, this job is making 24,000+ trips over the network in 20 minutes, so maybe the performance of those trips is a factor?

Looking at the SQL in the trace file, there was a lot of this kind of thing.


INSERT INTO TABLE1 (Col1,Col2,Col3,Col4,Col5,
Col6,Col7,Col8,Col9,Col10,Col11,Col12,
Col13)
VALUES
(:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13);

It uses bind variables, so that's good. However it's still executing this statement once for every row that's inserted - thousands of times in 20 minutes. There are no bulk operations so the application chatters away to the database. Essentially, the database is spending most of it's time waiting for the next insert to arrive. Not good. Will the application be re-written to do this sensibly, perhaps with a dedicated PL/SQL package that handles the activity in the right place - on the server? Not likely. Certainly not in time to meet our deadlines.

Even having shown this, people remained unconvinced because they aren't necessarily used to looking at trace files. I proposed that we perform the tracing on several different environments to see if we could narrow down the differences. Here are the results from the Test environment

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 20573 0.00 0.02
SQL*Net message from client 20573 20.70 66.12
db file sequential read 189 0.01 0.51
log file sync 6 0.02 0.08

And from Production

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 26533 0.00 0.04
SQL*Net message from client 26532 19.79 2137.80
log file sync 33 0.03 0.61
db file sequential read 79 0.00 0.00
SQL*Net more data to client 24 0.00 0.00
SQL*Net more data from client 1 0.00 0.00

Note that, because of the shared activity on individual sessions, the numbers don't match up exactly, but it's hopefully pretty obvious where the time is being lost. I asked the software house to run the same test in their lab and hoped to see their trace file. Instead they asked me to flush the shared pool, run the job and send them the results of a query against the v$sql. I think they were hoping to see higher logical reads for one or more of the statements to prove that they were doing something different. They weren't and they now accept that the time lost on the network is the problem here. Actually, we don't know it's the network really, it could be a slow app server, but we know that there's a reason the database appears to be dead quiet - it spends most of it's time waiting for something to do!

This has taken a bit of work - maybe about 2 weeks - but a lot of that was debate and perhaps only a day of tracing and proper analysis. It was well worth the small amount of effort but you need to be determined to stop, take stock and approach the problem properly. This can be a little difficult when the emails are flying around, suggesting six different directions to go in. The end results are

  • We hope the performance will be better when the database and app servers are co-located
  • We can pressurise the software house to rewrite this bit of the application to use bulk operations
  • We have a clear idea of where the problem is and can stop wasting time investigating irrelevant areas of the server configuration
As for this 'idle' event, there's a presentation here on the subject. I wish I had attended it now ;-) At least I was able to send the link to the software house to add some weight to my argument (as I did the earlier Tom Kyte link)

To wrap up, the best book on this subject is this

Oh, sorry, I almost forgot. Sorry to disagree with Mogens, but I found Statspack really useful when working on this problem too. Maybe I'll write about that another time.

Posted by Doug Burns Comments: (7) Trackback: (1)

Trackbacks
Trackback specific URI for this entry

Network Events in ASH
Note - using ASH and the Top Activity screen require the use of the Diagnostics Pack License.This post was prompted by yet another performance problem identified using pretty pictures and Active Session History data. Although, as you'll see, some pret
Weblog: Doug's Oracle Blog
Tracked: Oct 03, 15:33

Comments
Display comments as (Linear | Threaded)

#1 - Mr. Ed said:
2005-12-20 08:17 - (Reply)

What were the database platforms of test and production?

#2 - Doug Burns said:
2005-12-20 18:52 - (Reply)

Ed,

Thanks for asking because I forgot to mention. Test was 9.2.0.6, Production 9.2.0.7.

I hope you don't know something I don't!

Cheers,

Doug

#3 - David Aldridge said:
2005-12-20 20:04 - (Reply)

This is practically the stuff of which text books are written, Doug. What a lovely result, and how satisfying to have all the numbers to back you up when you go to a third party with an analysis of a problem.

I expect that the next directive will be that you need a network upgrade.

#4 - Doug Burns said:
2005-12-21 00:00 - (Reply)

David,

"and how satisfying to have all the numbers to back you up when you go to a third party with an analysis of a problem."

Well, it took a bit of convincing about the value of trace files to be honest but I persevered and it was a good feeling to come into work yesterday morning and find an email accepting that I might have a point.

"I expect that the next directive will be that you need a network upgrade."

Almost right! The next email in the chain was along the lines of 'So, how quickly can we move the database down south?' ;-)

#5 - Doug Burns said:
2005-12-21 18:49 - (Reply)

I noticed the following blog as an incoming link on Statcounter

http://oracletoday.blogspot.com/

When I went to take a look, it turns out that SQL*Net message from client had been a blog subject back in September. It's worth taking a look at another example of how this 'idle' event can be a sign of problems in some circumstances.

#6 - Muhammad Sarwar 2011-10-28 18:20 - (Reply)

Hi Doug,

Really a good article, i came here in search of a similar issue with "SQL*Net messages to/from client" with our production server running Oracle EBS 11i, where standard programs are hanging for long, i did the same TKPROF, got results, discussed with Network, but they still want solid proof from Network side.

I have uploaded all these to support, but they are still in process of review.

Can you please suggest on this, what test can i do from OS/Network side to get them realized.

Thanks,
Sarwar

#7 - Doug Burns said:
2011-10-31 02:49 - (Reply)

Hi Sarwar,

This same type of problem happened to me earlier this year and we were able to quickly prove the problem by running a tracert between the server and the client, which is the type of evidence network folks are likely to believe.

Cheers,

Doug


Add Comment

Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
BBCode format allowed
 
 

Statistics on Partitioned Tables

Contents

Part 1 - Default options - GLOBAL AND PARTITION
Part 2 - Estimated Global Stats
Part 3 - Stats Aggregation Problems I
Part 4 - Stats Aggregation Problems II
Part 5 - Minimal Stats Aggregation
Part 6a - COPY_TABLE_STATS - Intro
Part 6b - COPY_TABLE_STATS - Mistakes
Part 6c - COPY_TABLE_STATS - Bugs and Patches
Part 6d - COPY_TABLE_STATS - A Light-bulb Moment
Part 6e - COPY_TABLE_STATS - Bug 10268597

Comments

Doug Burns about 10053 Trace Files - Different Plan in Different Environments
Tue, 02.04.2013 08:57
You're welcome. Now I just nee d to pull my finger out and ac tually come up [...]
Howard Rogers about 10053 Trace Files - Different Plan in Different Environments
Mon, 01.04.2013 23:08
Makes a big difference, so tha nks for that! With two brow ser windows, o [...]
stelioscharalambides.com about 10053 Trace Files
Sat, 30.03.2013 16:28

Upcoming Presentations

Bookmark

Open All | Close All

Syndicate This Blog

  • XML RSS 2.0 feed
  • ATOM/XML ATOM 1.0 feed
  • XML RSS 2.0 Comments
  • Feedburner Feed

Powered by

Serendipity PHP Weblog

Show tagged entries

xml 11g
xml ACE
xml adaptive thresholds
xml ASH
xml Audit Vault
xml AWR
xml Blogging
xml conferences
xml Cuddly Toys
xml Database Refresh
xml DBMS_STATS
xml Direct Path Reads
xml Fun
xml grid control
xml hotsos 2010
xml listener
xml Locking
xml oow
xml oow2009
xml optimiser
xml OTN
xml Parallel
xml Partitions
xml Patching
xml swingbench
xml The Reality Gap
xml time matters
xml ukoug
xml ukoug2009
xml Unix/Shell
xml Useful Links

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.

Design by Andreas Viklund | Conversion to s9y by Carl