Entries tagged as locking
Related tags
ashApr 17: Diagnosing Locking Problems using ASH – Part 5
... and so it continues.
Coincidentally, the subject of tracking down locking problems after they occurred cropped up on the Oak Table mailing list just after the last post. Several suggestions were offered but I think the nearest and most detailed suggestion was that proposed by Kyle Hailey, ex-Oracle and now working Embarcadero. There were one or two problems with the original email which I've fixed up here ...
For the wait event class I was teaching I wanted to show an example of taking ASH data and finding the blocking statement in the log miner data, but found this was less than reliable. Sometimes it would work sometimes it didn't. Not sure what the issue was. I think dumping the redo log itself would be more dependable. I think logminer might not report everything - ie I think there is some filtering going on.
Anyway here are some steps that might be of use to people. I get XID but you probably don't need it, but I recall the XID being useful in doing the redo log dumps
Setup
Test Case
Session 1
Session 2
So Session 1 is now blocking session 2 ... have a look in V$TRANSACTION
Commit Session 1
... so that Session 2 is released.
and take a look in V$TRANSACTION at the new transaction details
So the blocking lock problem has now cleared. Let's start gathering information to see if we can identify the problem.
Analysis
Parameters for 'enq: TX - Row Lock Contention'
P2=usn<<16||slot
P3=Sequence
=> XID
Let's find our WAITER and the BLOCKING XID
Now let's use logminer to find out what the BLOCKING SQL was
Oh
Well, how about trying a different approach?
Which isn't too helpful - I can't see any information about Session 1's row update. If anyone has any ideas why, I'm all ears.
In the next post I'll look at a redo log file dump to see if that has any more information.
Coincidentally, the subject of tracking down locking problems after they occurred cropped up on the Oak Table mailing list just after the last post. Several suggestions were offered but I think the nearest and most detailed suggestion was that proposed by Kyle Hailey, ex-Oracle and now working Embarcadero. There were one or two problems with the original email which I've fixed up here ...
For the wait event class I was teaching I wanted to show an example of taking ASH data and finding the blocking statement in the log miner data, but found this was less than reliable. Sometimes it would work sometimes it didn't. Not sure what the issue was. I think dumping the redo log itself would be more dependable. I think logminer might not report everything - ie I think there is some filtering going on.
Anyway here are some steps that might be of use to people. I get XID but you probably don't need it, but I recall the XID being useful in doing the redo log dumps
Setup
SYS@TEST1020> set lines 120
SYS@TEST1020> column member format a60
SYS@TEST1020>
SYS@TEST1020> alter system switch logfile;
System altered.
SYS@TEST1020> select group#, thread#, status from v$log;
GROUP# THREAD# STATUS
---------- ---------- ----------------
1 1 CURRENT
2 1 ACTIVE
3 1 ACTIVE
SYS@TEST1020> select group#, member from v$logfile;
GROUP# MEMBER
---------- ------------------------------------------------------------
1 C:\ORACLE\PRODUCT\10.2.0\ORADATA\TEST1020\REDO01.LOG
2 C:\ORACLE\PRODUCT\10.2.0\ORADATA\TEST1020\REDO02.LOG
3 C:\ORACLE\PRODUCT\10.2.0\ORADATA\TEST1020\REDO03.LOG
SYS@TEST1020> exec dbms_logmnr.add_logfile('&&my_member');
Enter value for my_member: C:\ORACLE\PRODUCT\10.2.0\ORADATA\TEST1020\REDO01.LOG
PL/SQL procedure successfully completed.
SYS@TEST1020> exec dbms_logmnr.start_logmnr(options => dbms_logmnr.dict_from_online_catalog);
PL/SQL procedure successfully completed.Test Case
Session 1
TESTUSER@TEST1020> create table test_tab1
as select object_id pk_id, object_name from all_objects;
Table created.
TESTUSER@TEST1020> select rowid, object_name from test_tab1 where pk_id=313;
ROWID OBJECT_NAME
------------------ ------------------------------
AAAN2NAAKAAAaBLAEX SYSTEM_PRIVILEGE_MAP
TESTUSER@TEST1020> update test_tab1 set object_name='SESSION 1' where pk_id=313;
1 row updated.Session 2
TESTUSER@TEST1020> update test_tab1 set object_name='SESSION 2' where pk_id=313;
So Session 1 is now blocking session 2 ... have a look in V$TRANSACTION
SYS@TEST1020> select start_time, xid, xidusn, xidslot, xidsqn, start_scn,
to_char(start_scn, 'XXXXXXXXXX')
from v$transaction order by start_time;
START_TIME XID XIDUSN XIDSLOT XIDSQN START_SCN TO_CHAR(STA
-------------------- ---------------- ---------- ---------- ---------- ---------- -----------
04/20/09 20:46:39 0A00240064180000 10 36 6244 0 0Commit Session 1
TESTUSER@TEST1020> commit; Commit complete.
... so that Session 2 is released.
1 row updated.
and take a look in V$TRANSACTION at the new transaction details
SYS@TEST1020> select start_time, xid, xidusn, xidslot, xidsqn, start_scn, to_cha r(start_scn, 'XXXXXXXXXX') from v$transaction order by start_time; START_TIME XID XIDUSN XIDSLOT XIDSQN START_SCN TO_CHAR(STA -------------------- ---------------- ---------- ---------- ---------- ---------- ----------- 04/20/09 20:46:54 08002000681F0000 8 32 8040 9061094 8A42E6
So the blocking lock problem has now cleared. Let's start gathering information to see if we can identify the problem.
Analysis
Parameters for 'enq: TX - Row Lock Contention'
P2=usn<<16||slot
P3=Sequence
=> XID
Let's find our WAITER and the BLOCKING XID
SYS@TEST1020> select * from 2 (select lpad(ltrim(to_char(p2,'XXXXXX')),6,'0')|| 3 '00'|| 4 ltrim(to_char(mod(p3,256),'XX'))|| 5 ltrim(to_char(trunc(p3/256),'XX'))|| 6 '0000' block_xid, 7 to_char(p2,'XXXXXXXX') p2hex, 8 to_char(p3,'XXXXXXXX') p3hex, 9 trunc(p2/65536) usn, 10 mod(p2,65536) slot, 11 p3 sqn, xid wait_xid 12 from v$active_session_history 13 where event like 'enq: T%' 14 order by sample_time desc) where rownum < 2; BLOCK_XID P2HEX P3HEX USN SLOT SQN WAIT_XID ------------------ --------- --------- ---------- ---------- ---------- -------- 0A00240064180000 A0024 1864 10 36 6244
Now let's use logminer to find out what the BLOCKING SQL was
SYS@TEST1020> select username, session# sid, serial# , sql_redo from v$logmnr_co ntents where XID = '&&blocking_xid' 2 / Enter value for blocking_xid: 0A00240064180000 old 1: select username, session# sid, serial# , sql_redo from v$logmnr_content s where XID = '&&blocking_xid' new 1: select username, session# sid, serial# , sql_redo from v$logmnr_content s where XID = '0A00240064180000' USERNAME SID SERIAL# ------------------------------ ---------- ---------- SQL_REDO -------------------------------------------------------------------------------- UNKNOWN 0 0 commit;
Oh
Well, how about trying a different approach?
SYS@TEST1020> select distinct xid , xidusn, xidslt, xidsqn, username, session# s
id, serial# , sql_redo
2 from v$logmnr_contents
3 where timestamp > sysdate- &minutes/(60*24)
4 and xidusn=10
5 and xidslt=36;
Enter value for minutes: 10
old 3: where timestamp > sysdate- &minutes/(60*24)
new 3: where timestamp > sysdate- 10/(60*24)
XID XIDUSN XIDSLT XIDSQN USERNAME
SID SERIAL#
---------------- ---------- ---------- ---------- ------------------------------
---------- ----------
SQL_REDO
--------------------------------------------------------------------------------
----------------------------------------
0A00240064180000 10 36 6244 UNKNOWN
0 0
commit;Which isn't too helpful - I can't see any information about Session 1's row update. If anyone has any ideas why, I'm all ears.
In the next post I'll look at a redo log file dump to see if that has any more information.
Apr 9: Diagnosing Locking Problems using ASH – Part 4
Some features in this post require a Diagnostics Pack license.
No sooner had I finished part 3 with some conclusions than I thought of another example I should have included and then someone else made a comment in an email which suggested another. (Thanks, JB!)
I’ll try to deal with that point and some other thoughts I had here in a further example. For this example, I decided to suggest how to create a quick table so that you can test this out for yourself.
As easy as that.
In the last part I showed that the SQL_ID for the last sample of a blocking session immediately before it went idle might not be the SQL statement holding the lock. In that case, it was fairly obvious that it wasn’t because it was a simple query that couldn’t have been responsible. However, it’s likely to be even more confusing if the SQL_ID captured by the ASH sample *looks* like it’s responsible, but isn’t, and the previous example didn't illustrate that clearly.
Session 1 – Connected as TESTUSER
So session 1 has executed three different queries, all of which lock one or more rows including the row with PK_ID=313, has rolled back the first two (releasing the locks) and has just that row locked now.
Session 2 – Connected as TESTUSER
Session 2 hangs, waiting for the lock.
Session 1 – Connected as TESTUSER
So the lock is released and session 2 acquires the lock and then releases it
Session 2 – Connected as Testuser
OK, time to look at the ASH data for the last two minutes for TESTUSER sessions. (Note that I have added each session’s serial# to the query. Graham Wood corrected a statement I made in the last post …
… by pointing out that session_serial# would allow me to differentiate between different sessions where SID=148. I’ve updated the previous post to that effect.)
So what statement is SQL_ID ‘95k0s60jznz76’?
Let’s face it, if you were to look at this data, how likely is it that you would think that that statement was responsible for blocking session 2, when it wasn't?
No sooner had I finished part 3 with some conclusions than I thought of another example I should have included and then someone else made a comment in an email which suggested another. (Thanks, JB!)
"It's probably worth pointing out that NO TOOL can claim to identify the blocking SID->SQL in such a case as there is no relationship kept in Oracle between locks acquired by users and the SQL that acquired the locks"
I’ll try to deal with that point and some other thoughts I had here in a further example. For this example, I decided to suggest how to create a quick table so that you can test this out for yourself.
SQL> create user testuser identified by testuser 2 default tablespace USERS 3 temporary tablespace TEMP; User created. SQL> grant create session, create table to testuser; Grant succeeded. SQL> connect testuser/testuser Connected. SQL> create table test_tab1 as select object_id pk_id, object_name from all_objects; Table created.
As easy as that.
In the last part I showed that the SQL_ID for the last sample of a blocking session immediately before it went idle might not be the SQL statement holding the lock. In that case, it was fairly obvious that it wasn’t because it was a simple query that couldn’t have been responsible. However, it’s likely to be even more confusing if the SQL_ID captured by the ASH sample *looks* like it’s responsible, but isn’t, and the previous example didn't illustrate that clearly.
Session 1 – Connected as TESTUSER
SQL> select pk_id, object_name from test_tab1 order by pk_id desc for update; <Trimmed> 890 DBMS_STANDARD 889 DBMS_STANDARD PK_ID OBJECT_NAME ---------- ------------------------------ 887 STANDARD 886 RE$NV_LIST 605 MAP_OBJECT 319 STMT_AUDIT_OPTION_MAP 317 STMT_AUDIT_OPTION_MAP 316 TABLE_PRIVILEGE_MAP 314 TABLE_PRIVILEGE_MAP 313 SYSTEM_PRIVILEGE_MAP 311 SYSTEM_PRIVILEGE_MAP 259 DUAL 258 DUAL 4477 rows selected. SQL> rollback; Rollback complete. SQL> select pk_id from test_tab1 where object_name='SYSTEM_PRIVILEGE_MAP' for update; PK_ID ---------- 311 313 SQL> rollback; Rollback complete. SQL> select pk_id from test_tab1 where pk_id=313 for update; PK_ID ---------- 313 SQL>
So session 1 has executed three different queries, all of which lock one or more rows including the row with PK_ID=313, has rolled back the first two (releasing the locks) and has just that row locked now.
Session 2 – Connected as TESTUSER
SQL> select pk_id from test_tab1 where pk_id=313 for update;
Session 2 hangs, waiting for the lock.
Session 1 – Connected as TESTUSER
SQL> rollback; Rollback complete.
So the lock is released and session 2 acquires the lock and then releases it
Session 2 – Connected as Testuser
PK_ID ---------- 313 SQL> rollback; Rollback complete.
OK, time to look at the ASH data for the last two minutes for TESTUSER sessions. (Note that I have added each session’s serial# to the query. Graham Wood corrected a statement I made in the last post …
“who is to say that that session 148 is the same session as the blocking session 148 and not an earlier session that has disconnected?”
… by pointing out that session_serial# would allow me to differentiate between different sessions where SID=148. I’ve updated the previous post to that effect.)
break on MY_TIME
column my_sid format 999
column my_ser format 99999
column my_state format a30
column my_blkr format 999
select to_char(a.sample_time, 'HH24:MI:SS') MY_TIME,a.session_id MY_SID,a.session_serial# MY_SER,
DECODE(a.session_state, 'WAITING' ,a.event, a.session_state) MY_STATE,a.xid, a.sql_id,
a.blocking_session MY_BLKR
from v$active_session_history a, dba_users u
where u.user_id = a.user_id
and u.username = 'TESTUSER'
and a.sample_time > SYSTIMESTAMP-(2/1440);
MY_TIME MY_SID MY_SER MY_STATE XID SQL_ID MY_BLKR
-------- ------ ------ ------------------------------ ---------------- ------------- -------
12:53:01 300 2739 enq: TX - row lock contention 2pa7rnqf65gj3 304
12:53:00 300 2739 enq: TX - row lock contention 2pa7rnqf65gj3 304
12:52:59 300 2739 enq: TX - row lock contention 2pa7rnqf65gj3 304
12:52:50 304 1445 ON CPU 0006000E0000884D 95k0s60jznz76So what statement is SQL_ID ‘95k0s60jznz76’?
SQL> select sql_text from v$sql where sql_id='95k0s60jznz76'; SQL_TEXT ----------------------------------------------------------------------------- select pk_id, object_name from test_tab1 order by pk_id desc for update
Let’s face it, if you were to look at this data, how likely is it that you would think that that statement was responsible for blocking session 2, when it wasn't?
Mar 31: Diagnosing Locking Problems using ASH - Part 3
Some features in this post require a Diagnostics Pack license.
In the last part I described a locking scenario where the blocking session had only executed one SQL statement that was quick enough to avoid being sampled by ASH and is now inactive. As a result, there is no ASH data for the blocking session, so diagnosing the root cause of the problem after the event is next to impossible. In fairness, it would be impossible anyway without ASH or similar instrumentation.
However, as I suggested in that post, I would argue that's a fairly unusual situation arising out of the nature of the demo and that the blocking session would normally appear in ASH. So let's try a more realistic situation where the blocking session does more work.
Session 1 - logged in as TESTUSER
So this time, I ran a simple query before the SELECT FOR UPDATE. Session 1 holds a lock on that row and I'll use the PK_ID that was returned from the query to make sure I try to lock the same row in the next session.
Session 2 - logged in as TESTUSER
Session 2 hangs, waiting for session 1 to release the lock
Session 1 - commits, to release the lock
Session 2 - Is now able to acquire the lock, so locks the row and returns the result
So everything about this test is the same as the last post, but Session 1 is doing a little more work than the simple SELECT FOR UPDATE. Let's look at the ASH data for all TESTUSER sessions.
This time I'm going to work my way from the top of the data to the bottom (or from the most recent data to the oldest).
The most recent sample shows session 147 on CPU but immediately prior to that, I can see that it was stuck waiting for a TX row lock since some time between 22:23:54 and 22:23:55 and that the blocking session was 148. Things are a little better this time because there are samples for the blocking session 148 but, if I look at the ASH sample immediately prior to the locking problem, session 148 is executing SQL_ID "cda14zb83bb5u'. If I check which SQL statement that is
Which isn't the statement that was holding the lock. I've hit the same problem as the previous example because the SELECT FOR UPDATE activity wasn't sampled.
The up-side to this example is that I can at least see some information about the blocking session in the lead-up to the locking problem. That could be very useful in identifying who the user was, what application they were running, which module they were executing and so on. (Then again, who is to say that that session 148 is the same session as the blocking session 148 and not an earlier session that has disconnected? Updated later - as Graham Wood pointed out, I could have used SESSION_SERIAL# to ensure that I am looking at the right session 148.)
The down-side is that it's easier to fool yourself into thinking you're looking at the root cause of the problem, when you aren't. In this case, it wouldn't take long to realise that the simple query is not blocking the other session and that there must be something missing, but imagine if the SQL had been a transactional statement that could be the source of the problem, but wasn't?
Conclusion
I started off the first post by saying "There are few better illustrations of the utility of ASH data than the ability to detect and diagnose locking problems, even after they've occurred" and I still think that's true. Diagnosing concurrency problems without having access to information on all sessions is next to impossible and ASH's low-cost sampling approach to recording session activity is the very reason it can be switched on all the time, waiting for problems to occur. Of course, the nature of sampling implies an approximate view of reality (and I've deliberately picked examples here to illustrate the gaps in the approximation).
- If the problem is happening right now, then you will have all the information you need, whether you have ASH data or not
- If the blocking SQL statement runs for more than a second because it's resource intensive or is experiencing it's own bottleneck, then you'll have good samples of the root cause.
- If the blocking SQL statement runs in less than a second then you might have an ASH sample of it or you might not, but then you might have samples covering previous actions that could help diagnose the problem.
- Or you could be unlucky, as I was last week, and have no samples at all
I also said in the first post that ASH data "has proved very useful to me in solving several production performance problems in the past" and that still holds true. It just works better with some problems than others and you need to be aware of it's limitations if you're going to use it.
In the last part I described a locking scenario where the blocking session had only executed one SQL statement that was quick enough to avoid being sampled by ASH and is now inactive. As a result, there is no ASH data for the blocking session, so diagnosing the root cause of the problem after the event is next to impossible. In fairness, it would be impossible anyway without ASH or similar instrumentation.
However, as I suggested in that post, I would argue that's a fairly unusual situation arising out of the nature of the demo and that the blocking session would normally appear in ASH. So let's try a more realistic situation where the blocking session does more work.
Session 1 - logged in as TESTUSER
TESTUSER@TEST1020> select count(*) from test_tab1;
COUNT(*)
----------
1024000
TESTUSER@TEST1020> select pk_id from test_tab1 where rownum=1 for update;
PK_ID
----------
4051So this time, I ran a simple query before the SELECT FOR UPDATE. Session 1 holds a lock on that row and I'll use the PK_ID that was returned from the query to make sure I try to lock the same row in the next session.
Session 2 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where pk_id=4051 for update;
Session 2 hangs, waiting for session 1 to release the lock
Session 1 - commits, to release the lock
TESTUSER@TEST1020> commit; Commit complete.
Session 2 - Is now able to acquire the lock, so locks the row and returns the result
PK_ID
----------
4051So everything about this test is the same as the last post, but Session 1 is doing a little more work than the simple SELECT FOR UPDATE. Let's look at the ASH data for all TESTUSER sessions.
SYS@TEST1020> select a.sample_id,a.sample_time,a.session_id,
2 a.session_state,a.event,a.sql_id,
3 a.blocking_session,a.blocking_session_status
4 from v$active_session_history a, dba_users u
5 where u.user_id = a.user_id
6* and u.username = 'TESTUSER';
SAMPLE_ID
----------
SAMPLE_TIME
---------------------------------------------------------------------------
SESSION_ID SESSION
---------- -------
EVENT SQL_ID BLOCKING_SESSION BLOCKING_SE
-------------------------------- ------------- ---------------- -----------
445637
30-MAR-09 22.23.59.263
147 ON CPU
0ut22yp7mh229 NOT IN WAIT
445636
30-MAR-09 22.23.58.253
147 WAITING
enq: TX - row lock contention 0ut22yp7mh229 148 VALID
445635
30-MAR-09 22.23.57.253
147 WAITING
enq: TX - row lock contention 0ut22yp7mh229 148 VALID
445634
30-MAR-09 22.23.56.253
147 WAITING
enq: TX - row lock contention 0ut22yp7mh229 148 VALID
445633
30-MAR-09 22.23.55.252
147 WAITING
enq: TX - row lock contention 0ut22yp7mh229 148 VALID
445611
30-MAR-09 22.23.33.181
148 WAITING
db file scattered read cda14zb83bb5u NO HOLDER
445610
30-MAR-09 22.23.32.181
148 WAITING
db file scattered read cda14zb83bb5u NO HOLDER
445609
30-MAR-09 22.23.31.181
148 WAITING
db file scattered read cda14zb83bb5u NO HOLDER
445608
30-MAR-09 22.23.30.181
148 WAITING
db file scattered read cda14zb83bb5u NO HOLDER
9 rows selected.This time I'm going to work my way from the top of the data to the bottom (or from the most recent data to the oldest).
The most recent sample shows session 147 on CPU but immediately prior to that, I can see that it was stuck waiting for a TX row lock since some time between 22:23:54 and 22:23:55 and that the blocking session was 148. Things are a little better this time because there are samples for the blocking session 148 but, if I look at the ASH sample immediately prior to the locking problem, session 148 is executing SQL_ID "cda14zb83bb5u'. If I check which SQL statement that is
SYS@TEST1020> select sql_text from v$sql where sql_id='cda14zb83bb5u'; SQL_TEXT -------------------------------------------------------------------------------- select count(*) from test_tab1
Which isn't the statement that was holding the lock. I've hit the same problem as the previous example because the SELECT FOR UPDATE activity wasn't sampled.
The up-side to this example is that I can at least see some information about the blocking session in the lead-up to the locking problem. That could be very useful in identifying who the user was, what application they were running, which module they were executing and so on. (Then again, who is to say that that session 148 is the same session as the blocking session 148 and not an earlier session that has disconnected? Updated later - as Graham Wood pointed out, I could have used SESSION_SERIAL# to ensure that I am looking at the right session 148.)
The down-side is that it's easier to fool yourself into thinking you're looking at the root cause of the problem, when you aren't. In this case, it wouldn't take long to realise that the simple query is not blocking the other session and that there must be something missing, but imagine if the SQL had been a transactional statement that could be the source of the problem, but wasn't?
Conclusion
I started off the first post by saying "There are few better illustrations of the utility of ASH data than the ability to detect and diagnose locking problems, even after they've occurred" and I still think that's true. Diagnosing concurrency problems without having access to information on all sessions is next to impossible and ASH's low-cost sampling approach to recording session activity is the very reason it can be switched on all the time, waiting for problems to occur. Of course, the nature of sampling implies an approximate view of reality (and I've deliberately picked examples here to illustrate the gaps in the approximation).
- If the problem is happening right now, then you will have all the information you need, whether you have ASH data or not
- If the blocking SQL statement runs for more than a second because it's resource intensive or is experiencing it's own bottleneck, then you'll have good samples of the root cause.
- If the blocking SQL statement runs in less than a second then you might have an ASH sample of it or you might not, but then you might have samples covering previous actions that could help diagnose the problem.
- Or you could be unlucky, as I was last week, and have no samples at all
I also said in the first post that ASH data "has proved very useful to me in solving several production performance problems in the past" and that still holds true. It just works better with some problems than others and you need to be aware of it's limitations if you're going to use it.
Mar 30: Diagnosing Locking Problems using ASH - Part 2
Some features in this post require a Diagnostics Pack license.
I tried the graphical approach to tracking down the root cause of a locking problem in the last part. Now let's look at the underlying ASH data. I ran the same test after bouncing my test instance to start from a clean slate. So, here it is again.
Session 1 - logged in as TESTUSER
Session 1 holds a lock on that row and I'll use the PK_ID that was returned from the query to make sure I try to lock the same row in the next session.
Session 2 - logged in as TESTUSER
Session 2 hangs, waiting for session 1 to release the lock. So I've just created 'a locking problem'. Let's get some details about the two sessions.
Session 1 is Session ID 157 and currently has the TX lock against the table (mode 6), whilst Session 2 is Session ID 150 and is waiting for the lock to be released.
Session 1 - commits, to release the lock
Session 2 - Is now able to acquire the lock, so locks the row and returns the result
... and we can see that Session 2 (SID 150) now has the TX lock, session 157 is holding no locks and neither session is waiting for the other any more - they're both idle (SQL*Net message from client).
How does the ASH data look? I'll query all ASH rows for TESTUSER sessions.
Working my way from bottom to top (oldest data to most recent), I can see that Session ID 150 (that's Session 2, remember) started off waiting for the row level lock and that lasted for 9 samples and then it started to read data (db file scattered read). I can also see that the session ID of the blocking session was 157. Excellent - just what I need to get to the bottom of this.
But there's a problem, isn't there? Where are the samples for SID 157, which is the root cause of the problem? If we can't see any samples for that session, how can we know what it was doing?
That's exactly what happened when I was demonstrating this last week and, once I'd pondered where the ASH data was for a little while, it hit me. The SQL statement that the blocking session executed was over so quickly, that the ASH sampling process missed it completely. The query started, selected the data and locked the row in less than a second, so it never appeared as an Active session as far as ASH is concerned. In fact, it was connected throughout the demo but was waiting on SQL*Net message from client and, as an Idle session, it wasn't included in the ASH data. Yes, it was Idle, but it was also the root cause of this problem. Of course if I was looking at this problem while it was happening then I could use the BLOCKING_SESSION information from the blocked session's ASH samples to look at V$SESSION, V$LOCK etc and work out what was going on, but that would be useless for diagnosing past locking problems with sessions that are now disconnected, which was the point of the demo.
It was very obvious once it hit me and ASH is behaving exactly as designed and advertised. Just so long as you know that when ASH excludes sessions because they're Idle at sample points, it might exclude the very sessions that are at the root cause of the problem.
During the class, I suggested that this was a slightly artificial test because *all* that the blocking session had done was login, run a very quick SELECT FOR UPDATE and then sit Idle. In practise, most genuine user application sessions are likely to be generating enough activity that they would have appeared somewhere in the ASH samples. In fact, as I've been running tests while writing this, ASH captured the activity of the blocking session three or four times in a row, which was a little frustrating!
The design of ASH depends on the likelihood that there is enough activity to capture the essence of each session's activity (and at very low cost) so maybe this isn't such a big problem? To put it another way, the blocking session has hardly consumed any Database Time to speak of so it doesn't really need to be optimised, does it? It's the blocked session that needs to be 'fixed' and there are plenty of samples for that. So maybe we just shouldn't use the data this way?
If the blocking session had been active for a little longer - if it was experiencing it's own bottleneck or was a long-running statement or transaction - then there would have been ASH samples for it. The next post will look at an example of that.
I tried the graphical approach to tracking down the root cause of a locking problem in the last part. Now let's look at the underlying ASH data. I ran the same test after bouncing my test instance to start from a clean slate. So, here it is again.
Session 1 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where rownum=1 for update;
PK_ID
----------
4051Session 1 holds a lock on that row and I'll use the PK_ID that was returned from the query to make sure I try to lock the same row in the next session.
Session 2 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where pk_id=4051 for update;
Session 2 hangs, waiting for session 1 to release the lock. So I've just created 'a locking problem'. Let's get some details about the two sessions.
SYS@TEST1020> select s.sid, s.state, s.event, s.sql_id, l.type, l.lmode 2 from v$session s, v$lock l 3 where s.sid = l.sid (+) 4 and s.username='TESTUSER' 5 order by s.sid; SID STATE EVENT SQL_ID TY LMODE ---------- ----------- -------------------------------- ------------- -- ----- 150 WAITING enq: TX - row lock contention 0ut22yp7mh229 TM 3 150 WAITING enq: TX - row lock contention 0ut22yp7mh229 TX 0 157 WAITING SQL*Net message from client TX 6 157 WAITING SQL*Net message from client TM 3
Session 1 is Session ID 157 and currently has the TX lock against the table (mode 6), whilst Session 2 is Session ID 150 and is waiting for the lock to be released.
Session 1 - commits, to release the lock
TESTUSER@TEST1020> commit; Commit complete.
Session 2 - Is now able to acquire the lock, so locks the row and returns the result
PK_ID ---------- 4051
... and we can see that Session 2 (SID 150) now has the TX lock, session 157 is holding no locks and neither session is waiting for the other any more - they're both idle (SQL*Net message from client).
SYS@TEST1020> / SID STATE EVENT SQL_ID TY LMODE ---------- ----------- -------------------------------- ------------- -- ----- 150 WAITING SQL*Net message from client TM 3 150 WAITING SQL*Net message from client TX 6 157 WAITING SQL*Net message from client
How does the ASH data look? I'll query all ASH rows for TESTUSER sessions.
SYS@TEST1020> select a.sample_id,a.sample_time,a.session_id,a.event, 2 a.session_state,a.event,a.sql_id, 3 a.blocking_session,a.blocking_session_status 4 from v$active_session_history a, dba_users u 5 where u.user_id = a.user_id 6* and u.username = 'TESTUSER'; SAMPLE_ID ---------- SAMPLE_TIME --------------------------------------------------------------------------- SESSION_ID EVENT SESSION ---------- -------------------------------- ------- EVENT SQL_ID BLOCKING_SESSION BLOCKING_SE -------------------------------- ------------- ---------------- ----------- 432483 30-MAR-09 14.59.37.509 150 ON CPU 0ut22yp7mh229 NOT IN WAIT 432482 30-MAR-09 14.59.36.509 150 ON CPU 0ut22yp7mh229 NOT IN WAIT 432481 30-MAR-09 14.59.35.509 150 db file scattered read WAITING db file scattered read 0ut22yp7mh229 NO HOLDER 432480 30-MAR-09 14.59.34.509 150 db file scattered read WAITING db file scattered read 0ut22yp7mh229 NO HOLDER 432479 30-MAR-09 14.59.33.499 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432478 30-MAR-09 14.59.32.499 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432477 30-MAR-09 14.59.31.499 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432476 30-MAR-09 14.59.30.489 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432475 30-MAR-09 14.59.29.489 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432474 30-MAR-09 14.59.28.489 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432473 30-MAR-09 14.59.27.489 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432472 30-MAR-09 14.59.26.479 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 432471 30-MAR-09 14.59.25.479 150 enq: TX - row lock contention WAITING enq: TX - row lock contention 0ut22yp7mh229 157 VALID 13 rows selected.
Working my way from bottom to top (oldest data to most recent), I can see that Session ID 150 (that's Session 2, remember) started off waiting for the row level lock and that lasted for 9 samples and then it started to read data (db file scattered read). I can also see that the session ID of the blocking session was 157. Excellent - just what I need to get to the bottom of this.
But there's a problem, isn't there? Where are the samples for SID 157, which is the root cause of the problem? If we can't see any samples for that session, how can we know what it was doing?
That's exactly what happened when I was demonstrating this last week and, once I'd pondered where the ASH data was for a little while, it hit me. The SQL statement that the blocking session executed was over so quickly, that the ASH sampling process missed it completely. The query started, selected the data and locked the row in less than a second, so it never appeared as an Active session as far as ASH is concerned. In fact, it was connected throughout the demo but was waiting on SQL*Net message from client and, as an Idle session, it wasn't included in the ASH data. Yes, it was Idle, but it was also the root cause of this problem. Of course if I was looking at this problem while it was happening then I could use the BLOCKING_SESSION information from the blocked session's ASH samples to look at V$SESSION, V$LOCK etc and work out what was going on, but that would be useless for diagnosing past locking problems with sessions that are now disconnected, which was the point of the demo.
It was very obvious once it hit me and ASH is behaving exactly as designed and advertised. Just so long as you know that when ASH excludes sessions because they're Idle at sample points, it might exclude the very sessions that are at the root cause of the problem.
During the class, I suggested that this was a slightly artificial test because *all* that the blocking session had done was login, run a very quick SELECT FOR UPDATE and then sit Idle. In practise, most genuine user application sessions are likely to be generating enough activity that they would have appeared somewhere in the ASH samples. In fact, as I've been running tests while writing this, ASH captured the activity of the blocking session three or four times in a row, which was a little frustrating!
The design of ASH depends on the likelihood that there is enough activity to capture the essence of each session's activity (and at very low cost) so maybe this isn't such a big problem? To put it another way, the blocking session has hardly consumed any Database Time to speak of so it doesn't really need to be optimised, does it? It's the blocked session that needs to be 'fixed' and there are plenty of samples for that. So maybe we just shouldn't use the data this way?
If the blocking session had been active for a little longer - if it was experiencing it's own bottleneck or was a long-running statement or transaction - then there would have been ASH samples for it. The next post will look at an example of that.
Mar 30: Diagnosing Locking Problems using ASH - Part 1
Some features in this post require a Diagnostics Pack license.
There are few better illustrations of the utility of ASH data than the ability to detect and diagnose locking problems*, even after they've occurred. Because ASH is sampling all active sessions all of the time, you should have information both for the blocked and blocking sessions, even when problems are first reported after they have disappeared and the sessions have disconnected. This has proved very useful to me in solving several production performance problems in the past, so I usually try to demonstrate it during the course.
Something cropped up when one of the demos went wrong last week and it took me a minute to work out what the problem was. I made a simple mistake in the way I approached the demo, but I decided to blog about it because I think it reinforces the nature and inherent limitations of ASH data.
This is a recreation of the demo, which captures the gist of what went 'wrong'.
Session 1 - logged in as TESTUSER
Session 1 holds a lock on that row and I'll use the PK_ID that was returned from the query to make sure I try to lock the same row in the next session.
Session 2 - logged in as TESTUSER
Session 2 hangs, waiting for session 1 to release the lock
Session 1 - commits, to release the lock
Session 2 - Is now able to acquire the lock, so locks the row and returns the result
So I've just created 'a locking problem' that occurred in the past but has been resolved now. At this point, there are a few different things we could look at. In last week's class I dived straight into the contents of V$ACTIVE_SESSION_HISTORY at the command line but I'll take the GUI approach first here. Here is how the Top Activity screen in DB/Grid Control looks. (Click on the thumbnails to see the full-sized images)

The period of the test is highlighted and I can see the spike for one active session, which is reflected in the Top Sessions panel (Session 137 is responsible for most of the activity) and I can see from the Top SQL panel that a single SQL statement is responsible for most of the activity. The majority of samples are waiting on "Application" events, with some "User I/O" as well. Looking at the chart, I can see that the User I/O activity followed the Application waits.
If I drill down into the Application events they are all "enq: TX - row lock contention", all associated with the same Session and SQL statement. (Note the more descriptive 10g event description, rather than "enqueue" in previous versions.)

Next I'll drill down into the SQL statement.

As well as the lock waits followed by db file scattered reads, I can see the SQL statement is the one I ran in session 2.
Next I'll drill down into the Session details for the blocked session and we'll see the first sign of limitations.

On the Blocking Tree tab, I find 'No sessions found to be currently blocking other sessions". That's true now because Session 1 released the lock and Session 2 (SID 137) was able to acquire it, but for the period of time the ASH data covers, there was a session blocking this one.
Note that if I had run this demo *before* Session 1 commited, this screen would have highlighted the problem, like this.

This was taken from another run of the test, but it should be obvious that the idle Session 1 (SID 150) is blocking Session 2 (SID 147) and that there's a handly "Kill Session" button! (Actually, that "Idle" is a giveaway to what's coming next.)
I can look at the ASH data for the blocked session in more detail by selecting the "Activity" tab and then changing the view to "Show Raw Data". In reverse time order (the way ASH data is returned if you don't order it), it's clear that there were db file scattered reads against TESTUSER.TEST_TAB1, preceded by those lock waits.

But that's only showing the ASH data for one session and doesn't include the useful BLOCKING_SESSION% columns that help diagnose locking problems, so in the next part I'll switch to looking at the raw data in V$ACTIVE_SESSION_HISTORY.
*Why am I still seeing locking problems? I think it's because I often have to support Third-Party "Database-independent" (yuck!) applications, some of which insist on implementing their own locking mechanisms. Sigh.
There are few better illustrations of the utility of ASH data than the ability to detect and diagnose locking problems*, even after they've occurred. Because ASH is sampling all active sessions all of the time, you should have information both for the blocked and blocking sessions, even when problems are first reported after they have disappeared and the sessions have disconnected. This has proved very useful to me in solving several production performance problems in the past, so I usually try to demonstrate it during the course.
Something cropped up when one of the demos went wrong last week and it took me a minute to work out what the problem was. I made a simple mistake in the way I approached the demo, but I decided to blog about it because I think it reinforces the nature and inherent limitations of ASH data.
This is a recreation of the demo, which captures the gist of what went 'wrong'.
Session 1 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where rownum=1 for update; PK_ID ---------- 4051
Session 1 holds a lock on that row and I'll use the PK_ID that was returned from the query to make sure I try to lock the same row in the next session.
Session 2 - logged in as TESTUSER
TESTUSER@TEST1020> select pk_id from test_tab1 where pk_id=4051 for update;
Session 2 hangs, waiting for session 1 to release the lock
Session 1 - commits, to release the lock
TESTUSER@TEST1020> commit; Commit complete.
Session 2 - Is now able to acquire the lock, so locks the row and returns the result
PK_ID ---------- 4051
So I've just created 'a locking problem' that occurred in the past but has been resolved now. At this point, there are a few different things we could look at. In last week's class I dived straight into the contents of V$ACTIVE_SESSION_HISTORY at the command line but I'll take the GUI approach first here. Here is how the Top Activity screen in DB/Grid Control looks. (Click on the thumbnails to see the full-sized images)
The period of the test is highlighted and I can see the spike for one active session, which is reflected in the Top Sessions panel (Session 137 is responsible for most of the activity) and I can see from the Top SQL panel that a single SQL statement is responsible for most of the activity. The majority of samples are waiting on "Application" events, with some "User I/O" as well. Looking at the chart, I can see that the User I/O activity followed the Application waits.
If I drill down into the Application events they are all "enq: TX - row lock contention", all associated with the same Session and SQL statement. (Note the more descriptive 10g event description, rather than "enqueue" in previous versions.)
Next I'll drill down into the SQL statement.
As well as the lock waits followed by db file scattered reads, I can see the SQL statement is the one I ran in session 2.
Next I'll drill down into the Session details for the blocked session and we'll see the first sign of limitations.
On the Blocking Tree tab, I find 'No sessions found to be currently blocking other sessions". That's true now because Session 1 released the lock and Session 2 (SID 137) was able to acquire it, but for the period of time the ASH data covers, there was a session blocking this one.
Note that if I had run this demo *before* Session 1 commited, this screen would have highlighted the problem, like this.
This was taken from another run of the test, but it should be obvious that the idle Session 1 (SID 150) is blocking Session 2 (SID 147) and that there's a handly "Kill Session" button! (Actually, that "Idle" is a giveaway to what's coming next.)
I can look at the ASH data for the blocked session in more detail by selecting the "Activity" tab and then changing the view to "Show Raw Data". In reverse time order (the way ASH data is returned if you don't order it), it's clear that there were db file scattered reads against TESTUSER.TEST_TAB1, preceded by those lock waits.
But that's only showing the ASH data for one session and doesn't include the useful BLOCKING_SESSION% columns that help diagnose locking problems, so in the next part I'll switch to looking at the raw data in V$ACTIVE_SESSION_HISTORY.
*Why am I still seeing locking problems? I think it's because I often have to support Third-Party "Database-independent" (yuck!) applications, some of which insist on implementing their own locking mechanisms. Sigh.
« previous page
(Page 2 of 2, totaling 10 entries)
next page »


Comments