Jun 13: Production Call-out
On Saturday morning, we ran into a problem on one of our 220.127.116.11 Production databases during an Extract, Transform and Load (ETL) batch process.
I was called at 5:15am to look at a job that normally runs in a few minutes but had failed after 30 or 40 minutes when trying to allocate an extent in a TEMP tablespace. Application Support and the Operators were keen for me to add space to the TEMP tablespace so that the job could run through to completion because there were hundreds of dependant jobs still to run. When you're providing on-call Production support there's a balance between getting things moving again and analysing the problem properly. In this case it looked like a simple space problem.
So I added 10Gb to the existing 10Gb TEMP tablespace, we re-ran the job and it fell over again after another 40 minutes (maybe longer). At this point I wanted to analyse the problem properly because it was clear that the job was doing something completely different to previous runs but I was (slightly) over-ruled. We came to an agreement that we would add 20Gb, run it one more time, but that I would check what the job was doing while it was running.
Looking at the execution plan (extracted from V$SQL_PLAN), there was a MERGE JOIN CARTESIAN step in there. Nothing wrong with that per se, but it looked like it was choosing it for a table with an extremely small number of rows - 1. I investigated the stats in DBA_TABLES and it turned out that NUM_ROWS for one of the tables was zero, but a quick count showed that there were over a quarter of a million rows in the table. What had happened was that the ETL process truncates tables before populating them with data. Normally it should run long before the statistics are gathered but it's been running later and later each night and eventually the two jobs were running at the same time. When dbms_stats looked at that table, it happened to be empty. The stats were an accurate reflection of an unusual mid-process state and the optimiser was picking a sensible plan for a tiny table
Looking back, I think this is the most common major problem I've seen with stats at site after site - the optimiser thinking a table has zero rows because the stats have not been generated since the table was populated. It leads to some bizarre execution plans.
I regenerated the stats on that table and the job ran in 3 or 4 minutes without using any temp space to speak of. Of course, the Operations report still described it as a space problem, but I had that modified later.
The proper solution will include
1) Configuring the correct dependencies between the ETL process and the stats collection.
2) Dropping the tempfiles I added (done)
3) Keeping a history of CBO statistics. This wouldn't have prevented the problem happening but would have made it a little easier to diagnose and much easier to fix, by importing yesterday's statistics. I'm in the process of implementing this at work, so will probably blog about that later.
I've been caught out by wrong stats in the past, so many times that I have started to look there first when a job starts failing after a history of working
But stat collection and other housekeeping jobs that float with no dependency on other schedule tasks are a menace – I remember spending an age in the middle of the night tracking an error in a exp to find it caused by another process dropping a table in the middle of an export… don’t ask!
"I've been caught out by wrong stats in the past, so many times that I have started to look there first when a job starts failing after a history of working"
It's not unusual, is it That was one of the reasons I decided to blog about it, in case there was one poor soul out there adding space in large quantities one night who might think back to this. In fairness, I realised after the first space addition that that wasn't the problem, but was under a lot of pressure to add more.
"I remember spending an age in the middle of the night"
And that's the other thing - you're normally not quite awake at first and it's important to stop and be methodical when everyone around you is excited. I actually said that a few times during the incident :- 'Let's stop and *think* for 10 minutes, have a look around and it could save us much more time and pain in the long run.
It worked out well in the end
A little known feature of dbms_stats - all versions - is that if you create a stats table and then do a 'gather_xxx_stats' setting the statid, stattab and statown parameters, the old statistics are automatically saved under that statid.
Obviously the 10g feature of saving stats for 31 days goes one step beyond this, but if all you need is the ability to go back one day in an emergency the old 8i/9i feature leaves you prepared without writing any special code.
Nice one, Jonathan. I confess I didn't know that.
Our plan is to keep at least a week's statistics and we're in the process of implementing that (and no doubt I'll blog about it). I'm sure you'll agree, though, that any method of saving stats is a very useful activity in a complex system because execution plan changes are a nightmare to debug otherwise.