Production Call-out

Doug's Oracle Blog

  • Home
  • Papers
  • Books
  • C.V.
  • Fun
  • Oracle Blog
  • Personal Blog

Jun 13: Production Call-out

On Saturday morning, we ran into a problem on one of our 9.2.0.7 Production databases during an Extract, Transform and Load (ETL) batch process.


I was called at 5:15am to look at a job that normally runs in a few minutes but had failed after 30 or 40 minutes when trying to allocate an extent in a TEMP tablespace. Application Support and the Operators were keen for me to add space to the TEMP tablespace so that the job could run through to completion because there were hundreds of dependant jobs still to run. When you're providing on-call Production support there's a balance between getting things moving again and analysing the problem properly. In this case it looked like a simple space problem.


So I added 10Gb to the existing 10Gb TEMP tablespace, we re-ran the job and it fell over again after another 40 minutes (maybe longer). At this point I wanted to analyse the problem properly because it was clear that the job was doing something completely different to previous runs but I was (slightly) over-ruled. We came to an agreement that we would add 20Gb, run it one more time, but that I would check what the job was doing while it was running.


Looking at the execution plan (extracted from V$SQL_PLAN), there was a MERGE JOIN CARTESIAN step in there. Nothing wrong with that per se, but it looked like it was choosing it for a table with an extremely small number of rows - 1. I investigated the stats in DBA_TABLES and it turned out that NUM_ROWS for one of the tables was zero, but a quick count showed that there were over a quarter of a million rows in the table. What had happened was that the ETL process truncates tables before populating them with data. Normally it should run long before the statistics are gathered but it's been running later and later each night and eventually the two jobs were running at the same time. When dbms_stats looked at that table, it happened to be empty. The stats were an accurate reflection of an unusual mid-process state and the optimiser was picking a sensible plan for a tiny table ;-)


Looking back, I think this is the most common major problem I've seen with stats at site after site - the optimiser thinking a table has zero rows because the stats have not been generated since the table was populated. It leads to some bizarre execution plans.


I regenerated the stats on that table and the job ran in 3 or 4 minutes without using any temp space to speak of. Of course, the Operations report still described it as a space problem, but I had that modified later.


The proper solution will include


1) Configuring the correct dependencies between the ETL process and the stats collection.


2) Dropping the tempfiles I added (done)


3) Keeping a history of CBO statistics. This wouldn't have prevented the problem happening but would have made it a little easier to diagnose and much easier to fix, by importing yesterday's statistics. I'm in the process of implementing this at work, so will probably blog about that later.

Posted by Doug Burns Comments: (4) Trackbacks: (0)

Trackbacks
Trackback specific URI for this entry

No Trackbacks

Comments
Display comments as (Linear | Threaded)

#1 - Pete_S said:
2006-06-13 20:20 - (Reply)

I've been caught out by wrong stats in the past, so many times that I have started to look there first when a job starts failing after a history of working :-)

But stat collection and other housekeeping jobs that float with no dependency on other schedule tasks are a menace – I remember spending an age in the middle of the night tracking an error in a exp to find it caused by another process dropping a table in the middle of an export… don’t ask!

#2 - Doug Burns said:
2006-06-14 05:39 - (Reply)

Pete,

"I've been caught out by wrong stats in the past, so many times that I have started to look there first when a job starts failing after a history of working"

It's not unusual, is it ;-) That was one of the reasons I decided to blog about it, in case there was one poor soul out there adding space in large quantities one night who might think back to this. In fairness, I realised after the first space addition that that wasn't the problem, but was under a lot of pressure to add more.

"I remember spending an age in the middle of the night"

And that's the other thing - you're normally not quite awake at first and it's important to stop and be methodical when everyone around you is excited. I actually said that a few times during the incident :- 'Let's stop and *think* for 10 minutes, have a look around and it could save us much more time and pain in the long run.

It worked out well in the end ;-)

#3 - Jonathan Lewis said:
2006-06-14 17:21 - (Reply)

A little known feature of dbms_stats - all versions - is that if you create a stats table and then do a 'gather_xxx_stats' setting the statid, stattab and statown parameters, the old statistics are automatically saved under that statid.

Obviously the 10g feature of saving stats for 31 days goes one step beyond this, but if all you need is the ability to go back one day in an emergency the old 8i/9i feature leaves you prepared without writing any special code.

#4 - Doug Burns said:
2006-06-14 18:52 - (Reply)

Nice one, Jonathan. I confess I didn't know that.

Our plan is to keep at least a week's statistics and we're in the process of implementing that (and no doubt I'll blog about it). I'm sure you'll agree, though, that any method of saving stats is a very useful activity in a complex system because execution plan changes are a nightmare to debug otherwise.


Add Comment

Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
BBCode format allowed
 
 

Statistics on Partitioned Tables

Contents

Part 1 - Default options - GLOBAL AND PARTITION
Part 2 - Estimated Global Stats
Part 3 - Stats Aggregation Problems I
Part 4 - Stats Aggregation Problems II
Part 5 - Minimal Stats Aggregation
Part 6a - COPY_TABLE_STATS - Intro
Part 6b - COPY_TABLE_STATS - Mistakes
Part 6c - COPY_TABLE_STATS - Bugs and Patches
Part 6d - COPY_TABLE_STATS - A Light-bulb Moment
Part 6e - COPY_TABLE_STATS - Bug 10268597

Comments

Doug Burns about 10053 Trace Files - Different Plan in Different Environments
Tue, 02.04.2013 08:57
You're welcome. Now I just nee d to pull my finger out and ac tually come up [...]
Howard Rogers about 10053 Trace Files - Different Plan in Different Environments
Mon, 01.04.2013 23:08
Makes a big difference, so tha nks for that! With two brow ser windows, o [...]
stelioscharalambides.com about 10053 Trace Files
Sat, 30.03.2013 16:28

Upcoming Presentations

Bookmark

Open All | Close All

Syndicate This Blog

  • XML RSS 2.0 feed
  • ATOM/XML ATOM 1.0 feed
  • XML RSS 2.0 Comments
  • Feedburner Feed

Powered by

Serendipity PHP Weblog

Show tagged entries

xml 11g
xml ACE
xml adaptive thresholds
xml ASH
xml Audit Vault
xml AWR
xml Blogging
xml conferences
xml Cuddly Toys
xml Database Refresh
xml DBMS_STATS
xml Direct Path Reads
xml Fun
xml grid control
xml hotsos 2010
xml listener
xml Locking
xml oow
xml oow2009
xml optimiser
xml OTN
xml Parallel
xml Partitions
xml Patching
xml swingbench
xml The Reality Gap
xml time matters
xml ukoug
xml ukoug2009
xml Unix/Shell
xml Useful Links

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.

Design by Andreas Viklund | Conversion to s9y by Carl