Mar 30: Diagnosing Locking Problems using ASH - Part 2
I tried the graphical approach to tracking down the root cause of a locking problem in the last part. Now let's look at the underlying ASH data. I ran the same test after bouncing my test instance to start from a clean slate. So, here it is again.
Session 1 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where rownum=1 for update; PK_ID ---------- 4051
Session 1 holds a lock on that row and I'll use the PK_ID that was returned from the query to make sure I try to lock the same row in the next session.
Session 2 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where pk_id=4051 for update;
Session 2 hangs, waiting for session 1 to release the lock. So I've just created 'a locking problem'. Let's get some details about the two sessions.
SYS@TEST1020> select s.sid, s.state, s.event, s.sql_id, l.type, l.lmode 2 from v$session s, v$lock l 3 where s.sid = l.sid (+) 4 and s.username='TESTUSER' 5 order by s.sid; SID STATE EVENT SQL_ID TY LMODE ---------- ----------- -------------------------------- ------------- -- ----- 150 WAITING enq: TX - row lock contention 0ut22yp7mh229 TM 3 150 WAITING enq: TX - row lock contention 0ut22yp7mh229 TX 0 157 WAITING SQL*Net message from client TX 6 157 WAITING SQL*Net message from client TM 3
Session 1 is Session ID 157 and currently has the TX lock against the table (mode 6), whilst Session 2 is Session ID 150 and is waiting for the lock to be released.
Session 1 - commits, to release the lock
TESTUSER@TEST1020> commit; Commit complete.
Session 2 - Is now able to acquire the lock, so locks the row and returns the result
PK_ID ---------- 4051
... and we can see that Session 2 (SID 150) now has the TX lock, session 157 is holding no locks and neither session is waiting for the other any more - they're both idle (SQL*Net message from client).
SYS@TEST1020> / SID STATE EVENT SQL_ID TY LMODE ---------- ----------- -------------------------------- ------------- -- ----- 150 WAITING SQL*Net message from client TM 3 150 WAITING SQL*Net message from client TX 6 157 WAITING SQL*Net message from client
How does the ASH data look? I'll query all ASH rows for TESTUSER sessions.
SYS@TEST1020> select a.sample_id,a.sample_time,a.session_id,a.event, 2 a.session_state,a.event,a.sql_id, 3 a.blocking_session,a.blocking_session_status 4 from v$active_session_history a, dba_users u 5 where u.user_id = a.user_id 6* and u.username = 'TESTUSER'; SAMPLE_ID ---------- SAMPLE_TIME --------------------------------------------------------------------------- SESSION_ID EVENT SESSION ---------- -------------------------------- ------- EVENT SQL_ID BLOCKING_SESSION BLOCKING_SE -------------------------------- ------------- ---------------- ----------- 432483 30-MAR-09 14.59.37.509 150 ON CPU 0ut22yp7mh229 NOT IN WAIT 432482 30-MAR-09 14.59.36.509 150 ON CPU 0ut22yp7mh229 NOT IN WAIT 432481 30-MAR-09 14.59.35.509 150 db file scattered read WAITING db file scattered read 0ut22yp7mh229 NO HOLDER 432480 30-MAR-09 14.59.34.509 150 db file scattered read WAITING db file scattered read 0ut22yp7mh229 NO HOLDER 432479 30-MAR-09 14.59.33.499 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432478 30-MAR-09 14.59.32.499 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432477 30-MAR-09 14.59.31.499 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432476 30-MAR-09 14.59.30.489 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432475 30-MAR-09 14.59.29.489 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432474 30-MAR-09 14.59.28.489 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432473 30-MAR-09 14.59.27.489 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432472 30-MAR-09 14.59.26.479 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432471 30-MAR-09 14.59.25.479 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 13 rows selected.
Working my way from bottom to top (oldest data to most recent), I can see that Session ID 150 (that's Session 2, remember) started off waiting for the row level lock and that lasted for 9 samples and then it started to read data (db file scattered read). I can also see that the session ID of the blocking session was 157. Excellent - just what I need to get to the bottom of this.
But there's a problem, isn't there? Where are the samples for SID 157, which is the root cause of the problem? If we can't see any samples for that session, how can we know what it was doing?
That's exactly what happened when I was demonstrating this last week and, once I'd pondered where the ASH data was for a little while, it hit me. The SQL statement that the blocking session executed was over so quickly, that the ASH sampling process missed it completely. The query started, selected the data and locked the row in less than a second, so it never appeared as an Active session as far as ASH is concerned. In fact, it was connected throughout the demo but was waiting on SQL*Net message from client and, as an Idle session, it wasn't included in the ASH data. Yes, it was Idle, but it was also the root cause of this problem. Of course if I was looking at this problem while it was happening then I could use the BLOCKING_SESSION information from the blocked session's ASH samples to look at V$SESSION, V$LOCK etc and work out what was going on, but that would be useless for diagnosing past locking problems with sessions that are now disconnected, which was the point of the demo.
It was very obvious once it hit me and ASH is behaving exactly as designed and advertised. Just so long as you know that when ASH excludes sessions because they're Idle at sample points, it might exclude the very sessions that are at the root cause of the problem.
During the class, I suggested that this was a slightly artificial test because *all* that the blocking session had done was login, run a very quick SELECT FOR UPDATE and then sit Idle. In practise, most genuine user application sessions are likely to be generating enough activity that they would have appeared somewhere in the ASH samples. In fact, as I've been running tests while writing this, ASH captured the activity of the blocking session three or four times in a row, which was a little frustrating!

The design of ASH depends on the likelihood that there is enough activity to capture the essence of each session's activity (and at very low cost) so maybe this isn't such a big problem? To put it another way, the blocking session has hardly consumed any Database Time to speak of so it doesn't really need to be optimised, does it? It's the blocked session that needs to be 'fixed' and there are plenty of samples for that. So maybe we just shouldn't use the data this way?
If the blocking session had been active for a little longer - if it was experiencing it's own bottleneck or was a long-running statement or transaction - then there would have been ASH samples for it. The next post will look at an example of that.
Network Events in ASH
Alternative Pictures Demo
That Pictures demo in full
Diagnosing Locking Problems using ASH/LogMiner – The End
Diagnosing Locking Problems using ASH/LogMiner – Part 9
Diagnosing Locking Problems using ASH/LogMiner – Part 8
Diagnosing Locking Problems using ASH/LogMiner – Part 7
Diagnosing Locking Problems using ASH – Part 6
Diagnosing Locking Problems using ASH – Part 5
Diagnosing Locking Problems using ASH – Part 4
#1 - Robert Klemme 2009-04-15 08:27 - (Reply)
IMHO there are some issues with that: first, the fact that the session appears at all, does not help in troubleshooting if the offending query is missing. In fact, you might be mislead if the query causing the lock is not included (as in your case). Then, from my experience there is a class of applications (namely those that use distributed systems and hence distributed transactions) which can exhibit exactly the behavior you are quoting: something is updated in the database, then other activity happens and finally - after relatively long time - all resources are committed. In this case it is not too unlikely to have the session sitting there idle for some time.
#1.1 - Doug Burns said:
2009-04-15 08:32 - (Reply)
Robert,
I agree with this ...
In fact, you might be mislead if the query causing the lock is not included (as in your case).
That's what I was trying to show in part 4 (which you might not have read yet).
I think there are certainly problems with diagnosing locking problems after they've happened, using ASH data, but the fact remains that there are several occasions where I've been able to do just that and I wouldn't have been able to do it without ASH.
So I think it's *sometimes* a very useful tool, as long as you understand the limitations, which is why I decided it was worth talking about here.
#1.1.1 - Robert Klemme 2009-04-15 10:22 - (Reply)
Absolutely agree: always know your tools! And you were right: I haven't made my way through to part IV yet.
Greetings
In the last part I described a locking scenario where the blocking session had only executed one SQL statement that was quick enough to avoid being sampled by ASH and is now inactive. As a result, there is no ASH data for the blocking session, so diagno
Tracked: Mar 31, 00:35
"Toto, I've a feeling we're not in Kansas any more."What started as a simple write-up of a course demo gone wrong (or right, depending the way you look at these things) has grown arms and legs and staggered away from the original intention to ta
Tracked: Apr 20, 21:20
Tracked: May 15, 11:24
Tracked: Aug 15, 13:59
Tracked: Jul 25, 17:32