Dec 19: Another 10046 Success
However, when it was executed against the new Production database, it was stopped when it was still running after 28 hours. Actually, it was only really after the job was stopped that I was dragged in to have a look. What the software house wanted to look at was what was 'wrong' with the Production database (which I'd created!). I was asked to send a list of parameters for both databases so that they could be compared, check that the optimiser statistics were appropriate (because we had run into problems with out of date statistics during testing) and generally provide information to establish why it was different. There were no obvious differences.
We ran the job again and took some Statspack snapshots and the over-riding view I was getting was that the database server was doing virtually nothing.
Having helped out with this as far as seemed sensible, I suggested that the only way to get to the bottom of this problem was to trace the job. People always seem to view this as a bit of a 'geeky' step that requires a lot of technical work and analysis, but it's better than just flailing around guessing at what might be happening.
Fortunately we were able to run the job against a subset of the data which resulted in a 20 minute execution time. We couldn't wait 28 hours for each test! Getting the trace file proved to be a lot more difficult than I'd hoped because this application implements it's own connection pooling so not only was it difficult to find the correct session, but to filter out irrelevant information. Eventually I managed to get a trace file and, to keep this post as short as possible, I'll just show the timed summary at the end of the formatted output of tkprof.
********************************************************************************
OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 24544 1.41 1.34 0 0 0 0
Execute 12332 4.63 4.09 182 17480 158580 12213
Fetch 226 0.03 0.04 0 3840 0 1817
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 37102 6.07 5.48 182 21320 158580 14030
Misses in library cache during parse: 0
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 24966 0.00 0.04
SQL*Net message from client 24966 35.26 1065.79
db file sequential read 182 0.00 0.02
log file sync 9 0.03 0.14
SQL*Net more data to client 24 0.00 0.00
There are a few things to note here. One, there are more parse operations than executions, which seems wrong. Why would you parse a statement and then not execute it? The fact is that this application does. See this asktom link for more info on why this might be and for a hint of the CPU nightmare that might be in store for us when we have more users!
Here's some more evidence of this activity from the trace file. (Believe me, this is representative of many other statements.)
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 7 0.00 0.00 0 0 0 0
Execute 0 0.00 0.00 0 0 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 7 0.00 0.00 0 0 0 0
Second, look at how little logical and physical i/o this session has performed. In 20 minutes?!
Looking at the timed events there's hardly any time to speak of apart from the "SQL*Net message from client" idle event. Then again, when you look at that total - 1065 seconds - that's about 17 minutes and 45 seconds. Mmmm, in a (roughly) 20 minute job? Then it hit me what the difference between test and production was. The test database and application server are at the same site whereas the production database and app server are at different sites, communicating over a wide area network. When this was planned, we raised a concern about it but were told that there 's no need to worry because it's just a pilot for a small number of users. However, this job is making 24,000+ trips over the network in 20 minutes, so maybe the performance of those trips is a factor?
Looking at the SQL in the trace file, there was a lot of this kind of thing.
INSERT INTO TABLE1 (Col1,Col2,Col3,Col4,Col5,
Col6,Col7,Col8,Col9,Col10,Col11,Col12,
Col13)
VALUES
(:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13);
It uses bind variables, so that's good. However it's still executing this statement once for every row that's inserted - thousands of times in 20 minutes. There are no bulk operations so the application chatters away to the database. Essentially, the database is spending most of it's time waiting for the next insert to arrive. Not good. Will the application be re-written to do this sensibly, perhaps with a dedicated PL/SQL package that handles the activity in the right place - on the server? Not likely. Certainly not in time to meet our deadlines.
Even having shown this, people remained unconvinced because they aren't necessarily used to looking at trace files. I proposed that we perform the tracing on several different environments to see if we could narrow down the differences. Here are the results from the Test environment
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 20573 0.00 0.02
SQL*Net message from client 20573 20.70 66.12
db file sequential read 189 0.01 0.51
log file sync 6 0.02 0.08
And from Production
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 26533 0.00 0.04
SQL*Net message from client 26532 19.79 2137.80
log file sync 33 0.03 0.61
db file sequential read 79 0.00 0.00
SQL*Net more data to client 24 0.00 0.00
SQL*Net more data from client 1 0.00 0.00
Note that, because of the shared activity on individual sessions, the numbers don't match up exactly, but it's hopefully pretty obvious where the time is being lost. I asked the software house to run the same test in their lab and hoped to see their trace file. Instead they asked me to flush the shared pool, run the job and send them the results of a query against the v$sql. I think they were hoping to see higher logical reads for one or more of the statements to prove that they were doing something different. They weren't and they now accept that the time lost on the network is the problem here. Actually, we don't know it's the network really, it could be a slow app server, but we know that there's a reason the database appears to be dead quiet - it spends most of it's time waiting for something to do!
This has taken a bit of work - maybe about 2 weeks - but a lot of that was debate and perhaps only a day of tracing and proper analysis. It was well worth the small amount of effort but you need to be determined to stop, take stock and approach the problem properly. This can be a little difficult when the emails are flying around, suggesting six different directions to go in. The end results are
- We hope the performance will be better when the database and app servers are co-located
- We can pressurise the software house to rewrite this bit of the application to use bulk operations
- We have a clear idea of where the problem is and can stop wasting time investigating irrelevant areas of the server configuration
To wrap up, the best book on this subject is this
Oh, sorry, I almost forgot. Sorry to disagree with Mogens, but I found Statspack really useful when working on this problem too. Maybe I'll write about that another time.
#1 - Mr. Ed said:
2005-12-20 08:17 - (Reply)
What were the database platforms of test and production?
#2 - Doug Burns said:
2005-12-20 18:52 - (Reply)
Ed,
Thanks for asking because I forgot to mention. Test was 9.2.0.6, Production 9.2.0.7.
I hope you don't know something I don't!
Cheers,
Doug
#3 - David Aldridge said:
2005-12-20 20:04 - (Reply)
This is practically the stuff of which text books are written, Doug. What a lovely result, and how satisfying to have all the numbers to back you up when you go to a third party with an analysis of a problem.
I expect that the next directive will be that you need a network upgrade.
#4 - Doug Burns said:
2005-12-21 00:00 - (Reply)
David,
"and how satisfying to have all the numbers to back you up when you go to a third party with an analysis of a problem."
Well, it took a bit of convincing about the value of trace files to be honest but I persevered and it was a good feeling to come into work yesterday morning and find an email accepting that I might have a point.
"I expect that the next directive will be that you need a network upgrade."
Almost right! The next email in the chain was along the lines of 'So, how quickly can we move the database down south?' ![]()
#5 - Doug Burns said:
2005-12-21 18:49 - (Reply)
I noticed the following blog as an incoming link on Statcounter
http://oracletoday.blogspot.com/
When I went to take a look, it turns out that SQL*Net message from client had been a blog subject back in September. It's worth taking a look at another example of how this 'idle' event can be a sign of problems in some circumstances.
#6 - Muhammad Sarwar 2011-10-28 18:20 - (Reply)
Hi Doug,
Really a good article, i came here in search of a similar issue with "SQL*Net messages to/from client" with our production server running Oracle EBS 11i, where standard programs are hanging for long, i did the same TKPROF, got results, discussed with Network, but they still want solid proof from Network side.
I have uploaded all these to support, but they are still in process of review.
Can you please suggest on this, what test can i do from OS/Network side to get them realized.
Thanks,
Sarwar
#6.1 - Doug Burns said:
2011-10-31 02:49 - (Reply)
Hi Sarwar,
This same type of problem happened to me earlier this year and we were able to quickly prove the problem by running a tracert between the server and the client, which is the type of evidence network folks are likely to believe.
Cheers,
Doug


Note - using ASH and the Top Activity screen require the use of the Diagnostics Pack License.This post was prompted by yet another performance problem identified using pretty pictures and Active Session History data. Although, as you'll see, some pret
Tracked: Oct 03, 15:33