Mar 11: Not all Deadlocks are created the same
I've blogged about deadlocks in Oracle at least once before. I said then that although the following message in deadlock trace files is usually true, it isn't always.
The following deadlock is not an Oracle error. Deadlocks of
this type can be expected if certain SQL statements are
issued. The following information may aid in determining the
cause of the deadlock.
So when I came across another example recently, it seemed worth a quick blog post. Not least for the benefit of other souls who hit the same issue (and probably hit Google moments later).
But while it's easy to say - "Hey! Look! I found an exception! Aren't I clever?" - it occurred to me that by highlighting the occasional anomaly I might actually be underselling Oracle's capabilities in this area. Because the truth is:
1) In most cases, deadlock errors are down to the way you've written your application or some documented restriction in Oracle - the type of restriction you're more likely to hit if you're handling high degrees of concurrency with lots of DDL, parallel query, partition management and the like. Such activities often have unusually restrictive locking requirements, and most locking issues can be turned into deadlock issues quite easily if you have a few sessions running concurrently (see the sketch after this list).
2) It's still the case that Oracle will handle the deadlock situation, at least to the extent of rolling back one of the statements causing the issue. (Although, whilst writing this post, I noticed that Jonathan Lewis raised the question of what exactly people mean when they suggest that Oracle resolves deadlock issues.)
3) Deadlock trace files are typically very useful and not the most difficult to read. Yes, they tend to use Oracle kernel terminology (not surprising), but I'd wager that most people could get a rough idea of the root cause with some initial analysis and a very detailed idea given more time. Even if you can't decipher the things yourself, they give Oracle Support detailed information to help with root cause analysis.
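To illustrate the first point, here's a minimal sketch of the classic application-induced deadlock (the table and column names are made up for illustration) - two sessions updating the same pair of rows in the opposite order:

-- Hypothetical table T (ID, VAL). The statements interleave across two sessions.
-- Session 1:
UPDATE t SET val = 1 WHERE id = 1;
-- Session 2:
UPDATE t SET val = 2 WHERE id = 2;
-- Session 1 (blocks, waiting for the row Session 2 has locked):
UPDATE t SET val = 1 WHERE id = 2;
-- Session 2 (blocks too, completing the cycle; Oracle detects it, signals
-- ORA-00060 to one of the sessions and rolls back that statement only):
UPDATE t SET val = 2 WHERE id = 1;

That one is entirely down to the application's locking order and no amount of patching will make it go away.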
So, to the particular issue we hit. Towards the end of a data loading process that loads around a billion rows in a short period of time (30-60 minutes, which also includes a bunch of surrounding activities), we would hit the occasional deadlock error. Fortunately, the site I'm working at just now has a very enlightened policy towards developer access to the alert log and trace files, so I can do my own initial investigation. On digging out the relevant deadlock trace file, it looked like this (some details changed):
Trace file /app/ora/local/admin/PRD/diag/rdbms/PRD_prod_server/PRD/trace/PRD_dia0_1627682.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /app/ora/local/product/11.2.0.3/db_1
System name:    Linux
Node name:      prod_server.ldn.orcldoug.com
Release:        2.6.32-220.13.1.el6.x86_64
Version:        #1 SMP Thu Mar 29 11:46:40 EDT 2012
Machine:        x86_64
Instance name: PRD
Redo thread mounted by this instance: 1
Oracle process number: 8
Unix process pid: 1627682, image: oracle@prod_server.ldn.orcldoug.com (DIA0)

*** 2013-01-16 09:09:10.925
*** SESSION ID:(201.1) 2013-01-16 09:09:10.925
*** CLIENT ID:() 2013-01-16 09:09:10.925
*** SERVICE NAME:(SYS$BACKGROUND) 2013-01-16 09:09:10.925
*** MODULE NAME:() 2013-01-16 09:09:10.925
*** ACTION NAME:() 2013-01-16 09:09:10.925

-------------------------------------------------------------------------------
DEADLOCK DETECTED (id=0xd0102292)

Chain Signature: 'library cache lock'<='row cache lock' (cycle)
Chain Signature Hash: 0x52a8007d

The following deadlock is not an Oracle error. Deadlocks of
this type can be expected if certain SQL statements are
issued. The following information may aid in determining the
cause of the deadlock.

Resolving deadlock by signaling ORA-00060 to 'instance: 1, os id: 3443329, session id: 161'
dump location: /app/ora/local/admin/PRD/diag/rdbms/PRD_prod_server/PRD/trace/PRD_ora_3443329.trc

Performing diagnostic dump on 'instance: 1, os id: 3443222, session id: 779'
dump location: /app/ora/local/admin/PRD/diag/rdbms/PRD_prod_server/PRD/trace/PRD_ora_3443222.trc

-------------------------------------------------------------------------------
Oracle session identified by:
{
            instance: 1 (PRD_prod_server.PRD)
               os id: 3443222
          process id: 127, oracle@prod_server.ldn.orcldoug.com
          session id: 779
    session serial #: 605
}
is waiting for 'row cache lock' with wait info:
{
                  p1: 'cache id'=0x8
                  p2: 'mode'=0x0
                  p3: 'request'=0x5
        time in wait: 1 min 58 sec
       timeout after: never
             wait id: 1655
            blocking: 1 session
         current sql: Begin run_manager_pkg.finalize_all_values_prc(:v0); End;
        wait history:
          * time between current wait and wait #1: 0.002176 sec
          1.       event: 'enq: PS - contention'
             time waited: 0.000082 sec
                 wait id: 1654             p1: 'name|mode'=0x50530006
                                           p2: 'instance'=0x1
                                           p3: 'slave ID'=0x2f
          * time between wait #1 and #2: 0.000013 sec
          2.       event: 'PX Deq: Slave Session Stats'
             time waited: 0.000001 sec
                 wait id: 1653             p1: 'sleeptime/senderid'=0x0
                                           p2: 'passes'=0x0
          * time between wait #2 and #3: 0.000001 sec
          3.       event: 'PX Deq: Slave Session Stats'
             time waited: 0.000002 sec
                 wait id: 1652             p1: 'sleeptime/senderid'=0x0
                                           p2: 'passes'=0x0
}
and is blocked by
 => Oracle session identified by:
{
            instance: 1 (PRD_prod_server.PRD)
               os id: 3443329
          process id: 134, oracle@prod_server.ldn.orcldoug.com
          session id: 161
    session serial #: 247
}
which is waiting for 'library cache lock' with wait info:
{
                  p1: 'handle address'=0x101f8eac98
                  p2: 'lock address'=0xfdef83738
                  p3: '100*mode+namespace'=0x10f2000010003
        time in wait: 1.739719 sec
       timeout after: never
             wait id: 508
            blocking: 1 session
         current sql: ALTER INDEX "DOUG"."VALUE_PK" REBUILD PARTITION "SYS_P4089"
        wait history:
          * time between current wait and wait #1: 0.000973 sec
          1.       event: 'enq: CR - block range reuse ckpt'
             time waited: 0.003220 sec
                 wait id: 507              p1: 'name|mode'=0x43520006
                                           p2: '2'=0x10086
                                           p3: '0'=0x1
          * time between wait #1 and #2: 0.000008 sec
          2.       event: 'reliable message'
             time waited: 0.000107 sec
                 wait id: 506              p1: 'channel context'=0x101c5afa98
                                           p2: 'channel handle'=0x101c0f2260
                                           p3: 'broadcast message'=0x101b5cfd58
          * time between wait #2 and #3: 0.003791 sec
          3.       event: 'db file sequential read'
             time waited: 0.000321 sec
                 wait id: 505              p1: 'file#'=0x5e
                                           p2: 'block#'=0x118784
                                           p3: 'blocks'=0x1
}
and is blocked by the session at the start of the chain.
I would hope that a few things would be immediately obvious, particularly if it's 'your' application that generated this issue:
1) The two sessions are running the following parts of the application.
Session 1: Begin run_manager_pkg.finalize_all_values_prc(:v0); End;
Session 2: ALTER INDEX "DOUG"."VALUE_PK" REBUILD PARTITION "SYS_P4089"
Which happened to fit in with what we were seeing. We have two concurrent runs which perform similar actions using different input files that load into different partitions.
2) The first session is using Parallel Query (note the enq: PS - contention and the various PX Deq wait events - there's a query sketch after this list showing one way to confirm that).
3) The deadlock is a little unusual because it's not between two transactions each trying to lock rows or objects already locked by the other, but between in-memory structures. One session is waiting on 'row cache lock' and the other on 'library cache lock', as opposed to specific row-level or table-level locks. This is also visible in the chain signature at the start of the trace file.
Chain Signature: 'library cache lock'<='row cache lock' (cycle)
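As an aside on point 2), if the first session had still been connected, one way to confirm it was acting as a parallel query coordinator would be to look for its slaves in V$PX_SESSION - a minimal sketch, assuming the session id of 779 from the trace file:

-- List any PX slave sessions belonging to query coordinator SID 779.
SELECT qcsid, sid, serial#, degree, req_degree
FROM   v$px_session
WHERE  qcsid = 779;

In our case, though, the enq: PS - contention and PX Deq waits in the trace file are evidence enough.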
Armed with 2) and 3) in particular, my next step was to go to My Oracle Support, as usual. I find that Google isn't too great with issues like this because some of them are quite specific and might not be affecting too many others. Sure enough, a search turned up :-
Bug 14356507 Deadlock between partition maintenance and parallel query operations
Which is confirmed as affecting versions 11.2.0.2 and 11.2.0.3. The fix is in Bundle Patch 12 for Exadata and in Oracle 12.1, and is also available as a one-off patch that we're in the process of applying to our different environments.
The issue is that "When a parallel query is hard parsed, first QC hard parses the query and then all the slaves. When a partition maintenance operation (DDL) comes in between the hard parses of QC and Slaves", you can hit the deadlock. There's more detail in the bug notes, but it's worth noting the phrase "This is basically a timing issue, in high concurrency environments", which means it only affects us very intermittently and is a nightmare to prove we've eliminated it without a lot of testing.
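To make that timing window a little more concrete, here's a sketch of the two activities that collide, based on the statements in our trace file (the parallel query shown is illustrative - the real parallel SQL is buried inside run_manager_pkg, and the table name is hypothetical):

-- Session 1: a parallel query being hard parsed. The query coordinator (QC)
-- hard parses first, then each PX slave hard parses in turn.
SELECT /*+ PARALLEL(v 8) */ COUNT(*)
FROM   doug.value_table v;

-- Session 2: partition maintenance DDL arriving in the gap between the QC
-- hard parse and the slave hard parses.
ALTER INDEX "DOUG"."VALUE_PK" REBUILD PARTITION "SYS_P4089";

-- If the DDL lands in that window, the two sessions can end up waiting on each
-- other's library cache / row cache locks and Oracle signals ORA-00060 to one of them.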
What I find a little disconcerting is that there seem to be quite a few of these library cache deadlock issues kicking around in recent versions that I wasn't used to seeing in prior versions. Given some of the library cache madness I've seen in my few years with 11g, I do wonder what on earth they've done to it!