Network Events in ASH

Doug's Oracle Blog

  • Home
  • Papers
  • Books
  • C.V.
  • Fun
  • Oracle Blog
  • Personal Blog

Oct 3: Network Events in ASH

Note - using ASH and the Top Activity screen require the use of the Diagnostics Pack License.

This post was prompted by yet another performance problem identified using pretty pictures and Active Session History data. Although, as you'll see, some pretty old-fashioned tools played their part too!

ASH entries only exist for certain SQL*Net events. As usual, I think the design is very successful as long as you have a basic understanding of how ASH works.

The events all fall into one of three wait classes - Application, Network and Idle.

SQL> select wait_class, name from v$event_name where name like 'SQL*Net%' order by 1, 2;

WAIT_CLASS           NAME
-------------------- ----------------------------------------
Application          SQL*Net break/reset to client
Application          SQL*Net break/reset to dblink
Idle                 SQL*Net message from client
Idle                 SQL*Net vector message from client
Idle                 SQL*Net vector message from dblink
Network              SQL*Net message from dblink
Network              SQL*Net message to client
Network              SQL*Net message to dblink
Network              SQL*Net more data from client
Network              SQL*Net more data from dblink
Network              SQL*Net more data to client
Network              SQL*Net more data to dblink
Network              SQL*Net vector data from client
Network              SQL*Net vector data from dblink
Network              SQL*Net vector data to client
Network              SQL*Net vector data to dblink

16 rows selected.

Hopefully that immediately dismisses the suggestion I've heard a few times that ASH/Top Activity ignores SQL*Net events. However, there are some events that are classed as Idle here that will not be captured in ASH data and will therefore not appear in the Top Activity screen. (In fact, even if you were to set the _ash_sample_all parameter you would not see these events in the Top Activity screen, even though ASH would contain entries for them. There is an additional filter applied to the data before it's displayed. In any case, I would *not* recommend setting _ash_sample_all except for fun.)

The most contentious of these is probably SQL*Net message from client. This event indicates that the Oracle server is waiting for the client to  do something. If the client is a user session, Oracle is waiting for the user to do something - perhaps someone has a sql*plus session open that they're doing nothing with or they are filling in an application screen with data? From an Oracle perspective that session is Idle and there isn't much we can do to tune it, including the waits would heavily skew the data we're using for our analysis and, because events in the Idle class are specifically excluded from both ASH data and the OEM performance pages,  such activity won't appear. That seems sensible to me.

However, if the client is actually an application server, those supposedly idle events can tell us something about end-to-end system performance. This blog post describes one real life situation where that event helped me identify the performance problem.

Back to the problem at hand. Sadly, in the midst of a a very chaotic work situation, I didn't manage to grab any of the graphs in question but we had a persistent problem with our critical batch processes spending time waiting on 'SQL*Net more data from client', which showed up very clearly as network class waits.

Now there are a number of reasons why this might be happening. Just a couple off the top of my head

1) There is a performance problem with the application server so Oracle is expecting more data to arrive more quickly than the app server can deliver it.

2) A network problem is affecting the delivery speed.

Although the issue here turned out to be number 2, I suppose the true underlying issue is that this application is insanely 'chatty' and uses very frequent round trips to deliver data. Maybe the application design could be improved first?

Ultimately we enabled Extended Tracing on the relevant sessions and confirmed that, yes, the sessions were spending most of their time waiting for the next batch of data and also how long these waits were. In practise, I'm not sure how much value the trace files added in this particular case. In fact, I think this is an excellent example of the underlying beauty of the sampling approach. As an event became *slower* it appeared *more often* in the samples and we had plenty of examples of how long some of the waits were. (But only some - probably the longest ones.)

How did we solve this mysterious problem? With two very useful tools

- ping/tracert. I simply went on to the app server, pinged the db server and confirmed that the response time (from my memory) was in the range of 70-80 milliseconds - far higher than I would expect. One of the Java developers then ran a trace route to the server which highlighted that network performance was great until we hit the db server. It turned out that one of the network interface cards had auto-negotiated down to 100M/s and this was the root cause - easily fixed. Sometimes old-fashioned and simple tools are all that you need, once you know where you should focus your attention (and ASH is great at focussing attention in the right place).

- The next tool is one of the reasons I wanted to write this post. What made me try ping? Whilst I would have got there eventually, I thought I would try my performance optimisation secret weapon! One just for the database wizards! Something I've been keeping an eye on is Oracle Support's attempt to give people a methodical approach to solving performance issues via the Oracle Performance Diagnostic Guide on My Oracle Support - Note ID 390374.1. There are several guides on that landing page but here is the link to the Slow Performance PDF. (You'll need an account to be able to access this.) This is an example of the kind of advice there. (Note that this is for SQL*Net message from client, but you would tend to get most events together anyway and there's similar advice for more data from client.)

"Cause Identified: Slow network limiting the response time between client and database.
The network is saturated and this is limiting the ability of the client and database to communicate with each other.
Cause Justification
TKProf:
1. SQL*Net message from client waits are a large part of the overall time (see the overall summary
section)
2. Array operations are used. This is seen when there are more than 5 rows per execution on average (divide total rows by total execution calls for both recursive and non-recursive calls)
3. The average time for a ping is about equal to twice the average time for a SQL*Net message from client wait and this time is more than a few milliseconds. This indicates that most of the client time is spent in the network."

I've been loathe to mention this document until now because it looks like a work in progress (albeit not updated since the start of 2009), isn't perfect, has gaps and I can imagine a number of my peers taking issue with some of the content. But for something to aid learning if you're not naturally great at solving performance problems, I can think of much worse starting places.

In the end, what could have been a really tricky performance problem (and might have taken a while to notice) was apparent with a brief glance at OEM and we could then follow up by applying the right analysis tools to the problem.

To finish off, when I searched Google for any examples of 'SQL*Net more data from client', one of the first results that cropped up was a chapter from one of my favourite books - 'Optimising Oracle Performance' by Cary Millsap and Jeff Holt. Well worth a read ....
Posted by Doug Burns Comments: (3) Trackbacks: (2)
Defined tags for this entry: ash, awr, grid control, time matters
Related entries by tags:
Alternative Pictures Demo
That Pictures demo in full
Time Matters: Throughput vs. Response Time - Part 2
Diagnosing Locking Problems using ASH/LogMiner – The End
Diagnosing Locking Problems using ASH/LogMiner – Part 9
Diagnosing Locking Problems using ASH/LogMiner – Part 8
Diagnosing Locking Problems using ASH/LogMiner – Part 7
Diagnosing Locking Problems using ASH – Part 6
Diagnosing Locking Problems using ASH – Part 5
Diagnosing Locking Problems using ASH – Part 4

Trackbacks
Trackback specific URI for this entry

PingBack
Weblog: askdba.org
Tracked: Oct 29, 08:31
PingBack
Weblog: askdba.org
Tracked: Oct 29, 08:31

Comments
Display comments as (Linear | Threaded)

#1 - kyle hailey said:
2010-10-04 05:14 - (Reply)

Nice write-up as always! Thanks for putting it together.

There is also http://www.oracledba.ru/orasrp/ to help to some degree with those SQL*Net message from client waits to figure out whether they are idle or application think time. I think more could be done in this reguard - been meaning to sit down and see if I could work some approaches out but haven't yet.

Here my notes on Net wait events:

https://sites.google.com/site/embtdbo/wait-event-documentation/oracle-network-waits

#1.1 - Doug Burns said:
2010-10-04 05:34 - (Reply)

Good point Kyle. Orasrp is normally the tool I use to separate the between call events from the within call events.

#2 - David Aldridge said:
2010-10-04 13:25 - (Reply)

If there is not a special place in hell for those who configure network interfaces to auto-negotiate then ... well, there ought to be.


Add Comment

Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
BBCode format allowed
 
 

Statistics on Partitioned Tables

Contents

Part 1 - Default options - GLOBAL AND PARTITION
Part 2 - Estimated Global Stats
Part 3 - Stats Aggregation Problems I
Part 4 - Stats Aggregation Problems II
Part 5 - Minimal Stats Aggregation
Part 6a - COPY_TABLE_STATS - Intro
Part 6b - COPY_TABLE_STATS - Mistakes
Part 6c - COPY_TABLE_STATS - Bugs and Patches
Part 6d - COPY_TABLE_STATS - A Light-bulb Moment
Part 6e - COPY_TABLE_STATS - Bug 10268597

Comments

Doug Burns about 10053 Trace Files - Different Plan in Different Environments
Tue, 02.04.2013 08:57
You're welcome. Now I just nee d to pull my finger out and ac tually come up [...]
Howard Rogers about 10053 Trace Files - Different Plan in Different Environments
Mon, 01.04.2013 23:08
Makes a big difference, so tha nks for that! With two brow ser windows, o [...]
stelioscharalambides.com about 10053 Trace Files
Sat, 30.03.2013 16:28

Upcoming Presentations

Bookmark

Open All | Close All

Syndicate This Blog

  • XML RSS 2.0 feed
  • ATOM/XML ATOM 1.0 feed
  • XML RSS 2.0 Comments
  • Feedburner Feed

Powered by

Serendipity PHP Weblog

Show tagged entries

xml 11g
xml ACE
xml adaptive thresholds
xml ASH
xml Audit Vault
xml AWR
xml Blogging
xml conferences
xml Cuddly Toys
xml Database Refresh
xml DBMS_STATS
xml Direct Path Reads
xml Fun
xml grid control
xml hotsos 2010
xml listener
xml Locking
xml oow
xml oow2009
xml optimiser
xml OTN
xml Parallel
xml Partitions
xml Patching
xml swingbench
xml The Reality Gap
xml time matters
xml ukoug
xml ukoug2009
xml Unix/Shell
xml Useful Links

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.

Design by Andreas Viklund | Conversion to s9y by Carl