Feb 22: Statistics on Partitioned Tables - Part 2
In the last part, I asked you to trust me that true Global Stats are a good thing so in this post I hope to show you why they are, to make sure you don't kid yourself that you can avoid them. (Updated later - this is all on 10.2.0.4)
Why would you even want to avoid them? Global stats can take a lot of work to gather if you're working with very large objects because Oracle has to visit all partitions. As an alternative, Oracle has the capability to aggregate lower level statistics to generate simulated global statistics at higher levels of the same object. In our case, as we INSERT data into new subpartitions or use partition exchange operations, we gather statistics at the SUBPARTITION level and allow the statistics to aggregate up to the PARTITION and TABLE level. Here's how it looks ....
I'll delete the existing table stats and regather at the SUBPARTITION level.
SQL> exec dbms_stats.delete_table_stats('TESTUSER', 'TEST_TAB1')
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_table_stats('TESTUSER', 'TEST_TAB1', GRANULARITY => 'SUBPARTITION')
PL/SQL procedure successfully completed.
Note that, because I haven't specified a subpartition name, all of the subpartitions will need to be visited to gather these statistics and will result in the following Table, Partition and Subpartition stats.
SQL> select table_name, global_stats, last_analyzed, num_rows from dba_tables where table_name='TEST_TAB1' and owner='TESTUSER' order by 1, 3 desc nulls last TABLE_NAME GLO LAST_ANALYZED NUM_ROWS ------------------------------ --- ---------------------- ---------- TEST_TAB1 NO 16-FEB-2010 16:23:32 11 1 row selected.
I can see that the table statistics are not global stats, but the number of rows looks right. These stats are actually aggregated statistics that Oracle has populated, based on the data found in the subpartitions. Let's look at the partitions.
SQL> select table_name, partition_name, global_stats, last_analyzed, num_rows from dba_tab_partitions where table_name='TEST_TAB1' and table_owner='TESTUSER' order by 1, 2, 4 desc nulls last TABLE_NAME PARTITION_NAME GLO LAST_ANALYZED NUM_ROWS ------------------------------ ------------------------------ --- ---------------------- -------- TEST_TAB1 P_20100131 NO 16-FEB-2010 16:23:32 0 TEST_TAB1 P_20100201 NO 16-FEB-2010 16:23:32 4 TEST_TAB1 P_20100202 NO 16-FEB-2010 16:23:32 2 TEST_TAB1 P_20100203 NO 16-FEB-2010 16:23:32 1 TEST_TAB1 P_20100204 NO 16-FEB-2010 16:23:32 2 TEST_TAB1 P_20100205 NO 16-FEB-2010 16:23:32 1 TEST_TAB1 P_20100206 NO 16-FEB-2010 16:23:32 1 TEST_TAB1 P_20100207 NO 16-FEB-2010 16:23:32 0 8 rows selected.
Again, the aggregated stats appear to be an accurate reflection of the data. How do the subpartition stats look?
SQL> select table_name, subpartition_name, global_stats, last_analyzed, num_rows from dba_tab_subpartitions where table_name='TEST_TAB1' and table_owner='TESTUSER' order by 1, 2, 4 desc nulls last TABLE_NAME SUBPARTITION_NAME GLO LAST_ANALYZED NUM_ROWS ------------------------------ ------------------------------ --- ---------------------- -------- TEST_TAB1 P_20100131_GROT YES 16-FEB-2010 16:23:32 0 TEST_TAB1 P_20100131_HALO YES 16-FEB-2010 16:23:32 0 TEST_TAB1 P_20100131_JUNE YES 16-FEB-2010 16:23:32 0 TEST_TAB1 P_20100131_OTHERS YES 16-FEB-2010 16:23:32 0 TEST_TAB1 P_20100201_GROT YES 16-FEB-2010 16:23:32 1 TEST_TAB1 P_20100201_HALO YES 16-FEB-2010 16:23:32 1 TEST_TAB1 P_20100201_JUNE YES 16-FEB-2010 16:23:32 1 TEST_TAB1 P_20100201_OTHERS YES 16-FEB-2010 16:23:32 1 TEST_TAB1 P_20100202_GROT YES 16-FEB-2010 16:23:32 1 TEST_TAB1 P_20100202_HALO YES 16-FEB-2010 16:23:32 1 <<output_snipped>> 32 rows selected.
So this looks pretty good, doesn't it? We've gathered 'Global' statistics at the subpartition level and yet the stats at the table and partition level look accurate too. Why would we want to use this approach? Well, to use my current system as an example, it's a near-real-time datawarehouse which creates tens of thousands of subpartitions per day, most of them over a period of a few hours so if we were to re-gather global statistics at the table and partition levels, there would be substantial associated stats-gathering workload and the system is under enough strain as it is. Therefore, if we can just gather stats at the subpartition level for the new subpartitions and have Oracle aggregate them to generate derived Table and Partition stats at the same time, so much the better. To simulate that, I'll insert some more data and see if the stats still look accurate after adding data and regathering at the SUBPARTITION level.
SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 100000, 'P'); 1 row created. SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 3000000, 'P'); 1 row created. SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 200000, 'P'); 1 row created. SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 110000, 'P'); 1 row created. SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 240000, 'U'); 1 row created. SQL> COMMIT; Commit complete.
I'll gather stats at the SUBPARTITION level for the only subpartition that has changed data.
SQL> exec dbms_stats.gather_table_stats('TESTUSER', 'TEST_TAB1', GRANULARITY => 'SUBPARTITION',
PARTNAME => 'P_20100206_GROT');
PL/SQL procedure successfully completed.
Time to look at the stats ...
SQL> select table_name, global_stats, last_analyzed, num_rows from dba_tables where table_name='TEST_TAB1' and owner='TESTUSER' order by 1, 2, 4 desc nulls last; TABLE_NAME GLO LAST_ANALYZED NUM_ROWS ------------------------------ --- ---------------------- ---------- TEST_TAB1 NO 16-FEB-2010 16:23:34 16 1 row selected. SQL> select table_name, partition_name, global_stats, last_analyzed, num_rows from dba_tab_partitions where table_name='TEST_TAB1' and table_owner='TESTUSER' order by 1, 2, 4 desc nulls last; TABLE_NAME PARTITION_NAME GLO LAST_ANALYZED NUM_ROWS ------------------------------ ------------------------------ --- ---------------------- -------- TEST_TAB1 P_20100131 NO 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100201 NO 16-FEB-2010 16:23:34 4 TEST_TAB1 P_20100202 NO 16-FEB-2010 16:23:34 2 TEST_TAB1 P_20100203 NO 16-FEB-2010 16:23:34 1 TEST_TAB1 P_20100204 NO 16-FEB-2010 16:23:34 2 TEST_TAB1 P_20100205 NO 16-FEB-2010 16:23:34 1 TEST_TAB1 P_20100206 NO 16-FEB-2010 16:23:34 6 TEST_TAB1 P_20100207 NO 16-FEB-2010 16:23:34 0 8 rows selected. SQL> select table_name, subpartition_name, global_stats, last_analyzed, num_rows from dba_tab_subpartitions where table_name='TEST_TAB1' and table_owner='TESTUSER' order by 1, 2, 4 desc nulls last; TABLE_NAME SUBPARTITION_NAME GLO LAST_ANALYZED NUM_ROWS ------------------------------ ------------------------------ --- ---------------------- -------- TEST_TAB1 P_20100131_GROT YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100131_HALO YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100131_JUNE YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100131_OTHERS YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100201_GROT YES 16-FEB-2010 16:23:34 1 TEST_TAB1 P_20100201_HALO YES 16-FEB-2010 16:23:34 1 TEST_TAB1 P_20100201_JUNE YES 16-FEB-2010 16:23:34 1 TEST_TAB1 P_20100201_OTHERS YES 16-FEB-2010 16:23:34 1 <<output_snipped>> TEST_TAB1 P_20100206_GROT YES 16-FEB-2010 16:23:34 5 TEST_TAB1 P_20100206_HALO YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100206_JUNE YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100206_OTHERS YES 16-FEB-2010 16:23:34 1 TEST_TAB1 P_20100207_GROT YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100207_HALO YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100207_JUNE YES 16-FEB-2010 16:23:34 0 TEST_TAB1 P_20100207_OTHERS YES 16-FEB-2010 16:23:34 0 32 rows selected.
Everything's still looking very good in this case and so it looks like a great strategy - low collection overhead and accurate statistics. That is, until you start drilling down to the column level statistics, using Greg Rahn's query and identify some horrible problems. These are the statistics at the table level.
SQL> select a.column_name, a.num_distinct, display_raw(a.low_value,b.data_type) as low_val, display_raw(a.high_value,b.data_type) as high_val, b.data_type from dba_tab_col_statistics a, dba_tab_cols b where a.owner='TESTUSER' and a.table_name='TEST_TAB1' and a.table_name=b.table_name and a.column_name=b.column_name order by 1; COLUMN_NAME NUM_DISTINCT LOW_VAL HIGH_VAL DATA_TYPE -------------------------- ------------ -------------------- -------------------- ----------- REPORTING_DATE 6 20100201 20100206 NUMBER SEQ_ID 14 400 3000000 NUMBER SOURCE_SYSTEM 7 GROT WINE VARCHAR2 STATUS 4 P U VARCHAR2 4 rows selected.
At first glance they look pretty good, too. To give you a specific example, the STATUS column does have the correct High Value of 'U', which has just appeared in the last set of rows that were inserted. Based on what I've seen to date, Oracle does accurately update the High/Low column values and row counts when generating aggregated stats, but there's a problem here. According to the column statistics, there are 4 distinct STATUSes in the table, but that's not true, there are only 2
SQL> select distinct STATUS from test_tab1; S - U P SQL>
Based on a problem with such a small number of rows and only two distinct values, the chances of the Number of Distinct Values calculated during stats aggregation being accurate looks pretty slim and, when you consider what a key input to cost-based calculations those values are ....
Why aren't the values accurate? Well let's compare High/Low values to Number of Distinct Values (NDV).
When we gathered statistics on the new subpartition, we had access to the previous High/Low values at the table level. Here are the column statistics before stats were gathered on the new subpartition.
SQL> select 2 a.column_name, 3 a.num_distinct, 4 display_raw(a.low_value,b.data_type) as low_val, 5 display_raw(a.high_value,b.data_type) as high_val, 6 b.data_type 7 from 8 dba_tab_col_statistics a, dba_tab_cols b 9 where 10 a.owner='TESTUSER' and 11 a.table_name='TEST_TAB1' and 12 a.table_name=b.table_name and 13 a.column_name=b.column_name 14 order by 1 15 / COLUMN_NAME NUM_DISTINCT LOW_VAL HIGH_VAL DATA_TYPE -------------------------- ------------ -------------------- -------------------- ----------- REPORTING_DATE 6 20100201 20100206 NUMBER SEQ_ID 9 400 30000 NUMBER SOURCE_SYSTEM 8 GROT WINE VARCHAR2 STATUS 1 P P VARCHAR2
So at this stage, there is one distinct value of status, which is P. When we gathered stats on the new subpartition, Oracle could see all of the STATUS values for the rows in that subpartition and noticed STATUS='U' on one of the rows and could work out very easily that it's higher than 'P', so updated the High Value accordingly as per the example shown earlier.
COLUMN_NAME NUM_DISTINCT LOW_VAL HIGH_VAL DATA_TYPE -------------------------- ------------ -------------------- -------------------- ------------ REPORTING_DATE 6 20100201 20100206 NUMBER SEQ_ID 14 400 3000000 NUMBER SOURCE_SYSTEM 7 GROT WINE VARCHAR2 STATUS 4 P U VARCHAR2
Now, what to do about the NDV? Remember, Oracle can't look at any of the data in Partitions or Subpartitions other than the one we're gathering stats on (that's the point, to reduce overhead). So it has to decide what the new NDV should be based on several inputs
1) The actual values in STATUS for the rows in the subpartition we can look at.
2) The previously-gathered and stored NDV for the other subpartitions.
3) The previously-aggregated NDV stored at the table and partition levels.
The problem is that Oracle knows the number of distinct values in other subpartitions (we looked at the data previously to calculate them) but not the values themselves and, without that information, how can it say whether the 2 distinct values (P and U) in this subpartition are distinct when compared to the values in the other subpartitions? Actually, in this case, we might expect Oracle to do something clever and realise that, as there was only one distinct value of P prior to our new subpartition and the only values in the current subpartition are P and U, that there can only be 2 distinct values. Sadly, it just doesn't work that way!
Although relying on lower-level statistics being aggregated up to higher levels might initially seem like a neat trick, it's going to lead to some pretty strange statistics, at least in 10g. Which is why Oracle recommend you gather Global Stats at the TABLE and PARTITION levels.
I wish I could say that was the only problem with this aggregation process, but there's more to come in the next post ...
#1.1 - Doug Burns said:
2010-02-23 04:57 - (Reply)
Greg,
It's 10.2.0.4. I mentioned that in the first post but should probably have repeated it here. I'll be saying more about that in the next post in an hour or two.
Cheers,
Doug
#2 - Kamus said:
2010-03-31 10:38 - (Reply)
Hello, doug
One question, why the new NDV is 4, not 3? This time NDV is 2 and the old NDV on dba_tab_col_statistics is 1, it should be 2+1=3, isn't it?
#2.1 - Doug Burns said:
2010-04-05 09:55 - (Reply)
Hi Kamus,
That's an excellent questions but one I'm not going to answer at this point for a number of reasons.
1) At the start of the series, I very deliberately said
I'll steer clear of any remotely advanced angle
2) Although I think you're absolutely right to say that the NDV would make more sense as 3 instead of 4, I would argue it makes even more sense for it to be 2 and Oracle does have enough information to derive 2 and yet it still doesn't, at least not in 10.2.0.4
3) I might write a blog post specifically about how it derives the NDVs at a later point, or let someone else leave a comment to answer your question here, but I personally think it's better to know that the NDV aggregation process doesn't work properly and should be avoided.
If Oracle doesn't get it right with such simple examples, imagine the potential horrors when you scale it up to more realistic situations.
But, as I said, excellent question.
Cheers,
Doug
#3 - Kamus said:
2010-04-23 12:06 - (Reply)
Thanks Doug,
Excellent answer. ![]()
I agree with you it's a risk to completely believe in NDV aggregation process.
And I'm sad to say, I've tested this process in 11gR1 and 11gR2, still the same issue.
I noticed in 11gR2 doc, it mentioned new value "APPROX_GLOBAL AND PARTITION" for GRANULARITY, it looks like some improvement for NDV aggregation, but I have no time to test yet.
#3.1 - Doug Burns said:
2010-04-23 12:16 - (Reply)
Yes - Incremental Statistics too. I'll be coming to that ... eventually ![]()


As soon as I'd committed my last post, I knew it wasn't what I'd hoped for and said as much to a couple of people before they'd read it. I knew it would probably just add to any confusion people already had about this subject (something I'm particularly k
Tracked: Feb 23, 06:53
Tracked: Feb 26, 18:04
Hello and welcome to Log Buffer #180. Time’s a-wastin’, so let’s go! Oracle There was so much Oracle stuff this week that I’ve decided to cram a little more of it into Log Buffer by providing a little less context than usual. ...
Tracked: Feb 26, 21:25
Tracked: Mar 17, 07:44
Tracked: Mar 21, 03:29