Diagnosing Locking Problems using ASH - Part 2

Doug's Oracle Blog

  • Home
  • Papers
  • Books
  • C.V.
  • Fun
  • Oracle Blog
  • Personal Blog

Mar 30: Diagnosing Locking Problems using ASH - Part 2

Some features in this post require a Diagnostics Pack license.

I tried the graphical approach to tracking down the root cause of a locking problem in the last part. Now let's look at the underlying ASH data. I ran the same test after bouncing my test instance to start from a clean slate. So, here it is again.

Session 1 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where rownum=1 for update;

     PK_ID
----------
      4051

Session 1 holds a lock on that row and I'll use the PK_ID that was returned from the query to make sure I try to lock the same row in the next session.

Session 2 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where pk_id=4051 for update;

Session 2 hangs, waiting for session 1 to release the lock. So I've just created 'a locking problem'. Let's get some details about the two sessions.

SYS@TEST1020> select s.sid, s.state, s.event, s.sql_id, l.type, l.lmode
  2  from v$session s, v$lock l
  3  where s.sid = l.sid  (+)
  4  and s.username='TESTUSER'
  5  order by s.sid;

       SID STATE       EVENT                            SQL_ID        TY LMODE
---------- ----------- -------------------------------- ------------- -- -----
       150 WAITING     enq: TX - row lock contention    0ut22yp7mh229 TM     3
       150 WAITING     enq: TX - row lock contention    0ut22yp7mh229 TX     0
       157 WAITING     SQL*Net message from client                    TX     6
       157 WAITING     SQL*Net message from client                    TM     3

Session 1 is Session ID 157 and currently has the TX lock against the table (mode 6), whilst Session 2 is Session ID 150 and is waiting for the lock to be released.

Session 1 - commits, to release the lock
TESTUSER@TEST1020> commit;

Commit complete.

Session 2 - Is now able to acquire the lock, so locks the row and returns the result
     PK_ID
----------
      4051

... and we can see that Session 2 (SID 150) now has the TX lock, session 157 is holding no locks and neither session is waiting for the other any more - they're both idle (SQL*Net message from client).

SYS@TEST1020> /

       SID STATE       EVENT                            SQL_ID        TY LMODE
---------- ----------- -------------------------------- ------------- -- -----
       150 WAITING     SQL*Net message from client                    TM     3
       150 WAITING     SQL*Net message from client                    TX     6
       157 WAITING     SQL*Net message from client

How does the ASH data look? I'll query all ASH rows for TESTUSER sessions.

SYS@TEST1020> select a.sample_id,a.sample_time,a.session_id,a.event,
  2             a.session_state,a.event,a.sql_id,
  3     a.blocking_session,a.blocking_session_status
  4  from v$active_session_history a, dba_users u
  5  where u.user_id = a.user_id
  6* and u.username = 'TESTUSER';

 SAMPLE_ID
----------
SAMPLE_TIME
---------------------------------------------------------------------------
SESSION_ID EVENT                            SESSION
---------- -------------------------------- -------
EVENT                            SQL_ID        BLOCKING_SESSION BLOCKING_SE
-------------------------------- ------------- ---------------- -----------
    432483
30-MAR-09 14.59.37.509
       150                                  ON CPU
                                 0ut22yp7mh229                  NOT IN WAIT

    432482
30-MAR-09 14.59.36.509
       150                                  ON CPU
                                 0ut22yp7mh229                  NOT IN WAIT

    432481
30-MAR-09 14.59.35.509
       150 db file scattered read           WAITING
db file scattered read           0ut22yp7mh229                  NO HOLDER

    432480
30-MAR-09 14.59.34.509
       150 db file scattered read           WAITING
db file scattered read           0ut22yp7mh229                  NO HOLDER

    432479
30-MAR-09 14.59.33.499
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID

    432478
30-MAR-09 14.59.32.499
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID

    432477
30-MAR-09 14.59.31.499
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID

    432476
30-MAR-09 14.59.30.489
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID

    432475
30-MAR-09 14.59.29.489
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID

    432474
30-MAR-09 14.59.28.489
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID

    432473
30-MAR-09 14.59.27.489
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID

    432472
30-MAR-09 14.59.26.479
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID

    432471
30-MAR-09 14.59.25.479
       150 enq: TX - row lock contention    WAITING
enq: TX - row lock contention    0ut22yp7mh229              157 VALID


13 rows selected.

Working my way from bottom to top (oldest data to most recent), I can see that Session ID 150 (that's Session 2, remember) started off waiting for the row level lock and that lasted for 9 samples and then it started to read data (db file scattered read). I can also see that the session ID of the blocking session was 157. Excellent - just what I need to get to the bottom of this.

But there's a problem, isn't there? Where are the samples for SID 157, which is the root cause of the problem? If we can't see any samples for that session, how can we know what it was doing?

That's exactly what happened when I was demonstrating this last week and, once I'd pondered where the ASH data was for a little while, it hit me. The SQL statement that the blocking session executed was over so quickly, that the ASH sampling process missed it completely. The query started, selected the data and locked the row in less than a second, so it never appeared as an Active session as far as ASH is concerned. In fact, it was connected throughout the demo but was waiting on SQL*Net message from client and, as an Idle session, it wasn't included in the ASH data. Yes, it was Idle, but it was also the root cause of this problem. Of course if I was looking at this problem while it was happening then I could use the BLOCKING_SESSION information from the blocked session's ASH samples to look at V$SESSION, V$LOCK etc and work out what was going on, but that would be useless for diagnosing past locking problems with sessions that are now disconnected, which was the point of the demo.

It was very obvious once it hit me and ASH is behaving exactly as designed and advertised. Just so long as you know that when ASH excludes sessions because they're Idle at sample points, it might exclude the very sessions that are at the root cause of the problem.

During the class, I suggested that this was a slightly artificial test because *all* that the blocking session had done was login, run a very quick SELECT FOR UPDATE and then sit Idle. In practise, most genuine user application sessions are likely to be generating enough activity that they would have appeared somewhere in the ASH samples. In fact, as I've been running tests while writing this, ASH captured the activity of the blocking session three or four times in a row, which was a little frustrating! ;-)

The design of ASH depends on the likelihood that there is enough activity to capture the essence of each session's activity (and at very low cost) so maybe this isn't such a big problem? To put it another way, the blocking session has hardly consumed any Database Time to speak of so it doesn't really need to be optimised, does it? It's the blocked session that needs to be 'fixed' and there are plenty of samples for that. So maybe we just shouldn't use the data this way?

If the blocking session had been active for a little longer - if it was experiencing it's own bottleneck or was a long-running statement or transaction - then there would have been ASH samples for it. The next post will look at an example of that.
Posted by Doug Burns Comments: (3) Trackbacks: (5)
Defined tags for this entry: ash, locking
Related entries by tags:
Network Events in ASH
Alternative Pictures Demo
That Pictures demo in full
Diagnosing Locking Problems using ASH/LogMiner – The End
Diagnosing Locking Problems using ASH/LogMiner – Part 9
Diagnosing Locking Problems using ASH/LogMiner – Part 8
Diagnosing Locking Problems using ASH/LogMiner – Part 7
Diagnosing Locking Problems using ASH – Part 6
Diagnosing Locking Problems using ASH – Part 5
Diagnosing Locking Problems using ASH – Part 4

Trackbacks
Trackback specific URI for this entry

Diagnosing Locking Problems using ASH - Part 3
In the last part I described a locking scenario where the blocking session had only executed one SQL statement that was quick enough to avoid being sampled by ASH and is now inactive. As a result, there is no ASH data for the blocking session, so diagno
Weblog: Doug's Oracle Blog
Tracked: Mar 30, 23:35
Diagnosing Locking Problems using ASH – Part 6
"Toto, I've a feeling we're not in Kansas any more."What started as a simple write-up of a course demo gone wrong (or right, depending the way you look at these things) has grown arms and legs and staggered away from the original intention to ta
Weblog: Doug's Oracle Blog
Tracked: Apr 20, 20:20
PingBack
Weblog: jonathanlewis.wordpress.com
Tracked: May 15, 10:24
PingBack
Weblog: karlarao.wordpress.com
Tracked: Aug 15, 12:59
PingBack
Weblog: www.oraclerealworld.com
Tracked: Jul 25, 16:32

Comments
Display comments as (Linear | Threaded)

#1 - Robert Klemme 2009-04-15 07:27 - (Reply)

QUOTE Doug:
During the class, I suggested that this was a slightly artificial test because *all* that the blocking session had done was login, run a very quick SELECT FOR UPDATE and then sit Idle. In practise, most genuine user application sessions are likely to be generating enough activity that they would have appeared somewhere in the ASH samples.


IMHO there are some issues with that: first, the fact that the session appears at all, does not help in troubleshooting if the offending query is missing. In fact, you might be mislead if the query causing the lock is not included (as in your case). Then, from my experience there is a class of applications (namely those that use distributed systems and hence distributed transactions) which can exhibit exactly the behavior you are quoting: something is updated in the database, then other activity happens and finally - after relatively long time - all resources are committed. In this case it is not too unlikely to have the session sitting there idle for some time.

#1.1 - Doug Burns said:
2009-04-15 07:32 - (Reply)

Robert,

I agree with this ...

In fact, you might be mislead if the query causing the lock is not included (as in your case).

That's what I was trying to show in part 4 (which you might not have read yet).

I think there are certainly problems with diagnosing locking problems after they've happened, using ASH data, but the fact remains that there are several occasions where I've been able to do just that and I wouldn't have been able to do it without ASH.

So I think it's *sometimes* a very useful tool, as long as you understand the limitations, which is why I decided it was worth talking about here.

#1.1.1 - Robert Klemme 2009-04-15 09:22 - (Reply)

Absolutely agree: always know your tools! And you were right: I haven't made my way through to part IV yet. :-)

Greetings


Add Comment

Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
BBCode format allowed
 
 

Statistics on Partitioned Tables

Contents

Part 1 - Default options - GLOBAL AND PARTITION
Part 2 - Estimated Global Stats
Part 3 - Stats Aggregation Problems I
Part 4 - Stats Aggregation Problems II
Part 5 - Minimal Stats Aggregation
Part 6a - COPY_TABLE_STATS - Intro
Part 6b - COPY_TABLE_STATS - Mistakes
Part 6c - COPY_TABLE_STATS - Bugs and Patches
Part 6d - COPY_TABLE_STATS - A Light-bulb Moment
Part 6e - COPY_TABLE_STATS - Bug 10268597

A couple of posts about Incremental Stats confusion

Part 1
Part 2

Comments

personal blog about Moving Sideways
Wed, 01.06.2016 17:34
That is a good tip particularl y to those fresh to the blogos phere. Short [...]
odziezprestige.pl about Moving Sideways
Wed, 01.06.2016 16:07
Please let me know if you're l ooking for a article writer fo r your site. [...]
Doug Burns about Moving Sideways
Tue, 10.05.2016 21:43
Oh, I won't give it that long unless I enjoy it ;-)

Bookmark

Open All | Close All

Syndicate This Blog

  • XML RSS 2.0 feed
  • ATOM/XML ATOM 1.0 feed
  • XML RSS 2.0 Comments
  • Feedburner Feed

Powered by

Serendipity PHP Weblog

Show tagged entries

xml 11g
xml ACE
xml adaptive thresholds
xml ASH
xml Audit Vault
xml AWR
xml Blogging
xml conferences
xml Cuddly Toys
xml Database Refresh
xml dbms_stats
xml Direct Path Reads
xml Fun
xml Grid Control
xml hotsos 2010
xml listener
xml locking
xml oow
xml oow2009
xml optimiser
xml OTN
xml Parallel
xml partitions
xml Patching
xml Swingbench
xml The Reality Gap
xml Time Matters
xml ukoug
xml ukoug2009
xml Unix/Shell
xml Useful Links

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.

Design by Andreas Viklund | Conversion to s9y by Carl