Statistics on Partitioned Tables - Part 2

Doug's Oracle Blog

  • Home
  • Papers
  • Books
  • C.V.
  • Fun
  • Oracle Blog
  • Personal Blog

Feb 22: Statistics on Partitioned Tables - Part 2

In the last part, I asked you to trust me that true Global Stats are a good thing so in this post I hope to show you why they are, to make sure you don't kid yourself that you can avoid them. (Updated later - this is all on 10.2.0.4)

Why would you even want to avoid them? Global stats can take a lot of work to gather if you're working with very large objects because Oracle has to visit all partitions. As an alternative, Oracle has the capability to aggregate lower level statistics to generate simulated global statistics at higher levels of the same object. In our case, as we INSERT data into new subpartitions or use partition exchange operations, we gather statistics at the SUBPARTITION level and allow the statistics to aggregate up to the PARTITION and TABLE level. Here's how it looks ....

I'll delete the existing table stats and regather at the SUBPARTITION level.

SQL> exec dbms_stats.delete_table_stats('TESTUSER', 'TEST_TAB1') 
 

PL/SQL procedure successfully completed. 

SQL> exec dbms_stats.gather_table_stats('TESTUSER', 'TEST_TAB1', GRANULARITY => 'SUBPARTITION') 

PL/SQL procedure successfully completed. 

Note that, because I haven't specified a subpartition name, all of the subpartitions will need to be visited to gather these statistics and will result in the following Table, Partition and Subpartition stats.

SQL> select  table_name, global_stats, last_analyzed, num_rows 
from dba_tables 
where table_name='TEST_TAB1' 
and owner='TESTUSER' 
order by 1, 3 desc nulls last 

TABLE_NAME                     GLO LAST_ANALYZED            NUM_ROWS 
------------------------------ --- ---------------------- ---------- 
TEST_TAB1                      NO  16-FEB-2010 16:23:32           11 

1 row selected. 

I can see that the table statistics are not global stats, but the number of rows looks right. These stats are actually aggregated statistics that Oracle has populated, based on the data found in the subpartitions. Let's look at the partitions.

SQL> select  table_name, partition_name, global_stats, last_analyzed, num_rows 
from dba_tab_partitions 
where table_name='TEST_TAB1' 
and table_owner='TESTUSER' 
order by 1, 2, 4 desc nulls last 

TABLE_NAME                     PARTITION_NAME                 GLO LAST_ANALYZED          NUM_ROWS 
------------------------------ ------------------------------ --- ---------------------- -------- 
TEST_TAB1                      P_20100131                     NO  16-FEB-2010 16:23:32          0 
TEST_TAB1                      P_20100201                     NO  16-FEB-2010 16:23:32          4 
TEST_TAB1                      P_20100202                     NO  16-FEB-2010 16:23:32          2 
TEST_TAB1                      P_20100203                     NO  16-FEB-2010 16:23:32          1 
TEST_TAB1                      P_20100204                     NO  16-FEB-2010 16:23:32          2 
TEST_TAB1                      P_20100205                     NO  16-FEB-2010 16:23:32          1 
TEST_TAB1                      P_20100206                     NO  16-FEB-2010 16:23:32          1 
TEST_TAB1                      P_20100207                     NO  16-FEB-2010 16:23:32          0 

8 rows selected. 

Again, the aggregated stats appear to be an accurate reflection of the data. How do the subpartition stats look?

SQL> select  table_name, subpartition_name, global_stats, last_analyzed, num_rows 
from dba_tab_subpartitions 
where table_name='TEST_TAB1' 
and table_owner='TESTUSER' 
order by 1, 2, 4 desc nulls last 

TABLE_NAME                     SUBPARTITION_NAME              GLO LAST_ANALYZED          NUM_ROWS 
------------------------------ ------------------------------ --- ---------------------- -------- 
TEST_TAB1                      P_20100131_GROT                YES 16-FEB-2010 16:23:32          0 
TEST_TAB1                      P_20100131_HALO                YES 16-FEB-2010 16:23:32          0 
TEST_TAB1                      P_20100131_JUNE                YES 16-FEB-2010 16:23:32          0 
TEST_TAB1                      P_20100131_OTHERS              YES 16-FEB-2010 16:23:32          0 
TEST_TAB1                      P_20100201_GROT                YES 16-FEB-2010 16:23:32          1 
TEST_TAB1                      P_20100201_HALO                YES 16-FEB-2010 16:23:32          1 
TEST_TAB1                      P_20100201_JUNE                YES 16-FEB-2010 16:23:32          1 
TEST_TAB1                      P_20100201_OTHERS              YES 16-FEB-2010 16:23:32          1 
TEST_TAB1                      P_20100202_GROT                YES 16-FEB-2010 16:23:32          1 
TEST_TAB1                      P_20100202_HALO                YES 16-FEB-2010 16:23:32          1 

<<output_snipped>>

32 rows selected. 

So this looks pretty good, doesn't it? We've gathered 'Global' statistics at the subpartition level and yet the stats at the table and partition level look accurate too. Why would we want to use this approach? Well, to use my current system as an example, it's a near-real-time datawarehouse which creates tens of thousands of subpartitions per day, most of them over a period of a few hours so if we were to re-gather global statistics at the table and partition levels, there would be substantial associated stats-gathering workload and the system is under enough strain as it is. Therefore, if we can just gather stats at the subpartition level for the new subpartitions and have Oracle aggregate them to generate derived Table and Partition stats at the same time, so much the better. To simulate that, I'll insert some more data and see if the stats still look accurate after adding data and regathering at the SUBPARTITION level.

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 100000, 'P');

1 row created. 

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 3000000, 'P'); 

1 row created. 

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 200000, 'P');

1 row created. 

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 110000, 'P');

1 row created. 

SQL> INSERT INTO TEST_TAB1 VALUES (20100206, 'GROT', 240000, 'U');

1 row created. 

SQL> COMMIT;

Commit complete. 

I'll gather stats at the SUBPARTITION level for the only subpartition that has changed data.

SQL> exec dbms_stats.gather_table_stats('TESTUSER', 'TEST_TAB1', GRANULARITY => 'SUBPARTITION', 
                           PARTNAME => 'P_20100206_GROT');

PL/SQL procedure successfully completed. 

Time to look at the stats ...

SQL> select  table_name, global_stats, last_analyzed, num_rows
from dba_tables 
where table_name='TEST_TAB1' 
and owner='TESTUSER' 
order by 1, 2, 4 desc nulls last;

TABLE_NAME                     GLO LAST_ANALYZED            NUM_ROWS 
------------------------------ --- ---------------------- ---------- 
TEST_TAB1                      NO  16-FEB-2010 16:23:34           16 

1 row selected. 

SQL> select  table_name, partition_name, global_stats, last_analyzed, num_rows 
from dba_tab_partitions 
where table_name='TEST_TAB1' 
and table_owner='TESTUSER' 
order by 1, 2, 4 desc nulls last; 

TABLE_NAME                     PARTITION_NAME                 GLO LAST_ANALYZED          NUM_ROWS 
------------------------------ ------------------------------ --- ---------------------- -------- 
TEST_TAB1                      P_20100131                     NO  16-FEB-2010 16:23:34          0 
TEST_TAB1                      P_20100201                     NO  16-FEB-2010 16:23:34          4 
TEST_TAB1                      P_20100202                     NO  16-FEB-2010 16:23:34          2 
TEST_TAB1                      P_20100203                     NO  16-FEB-2010 16:23:34          1 
TEST_TAB1                      P_20100204                     NO  16-FEB-2010 16:23:34          2 
TEST_TAB1                      P_20100205                     NO  16-FEB-2010 16:23:34          1 
TEST_TAB1                      P_20100206                     NO  16-FEB-2010 16:23:34          6 
TEST_TAB1                      P_20100207                     NO  16-FEB-2010 16:23:34          0 

8 rows selected. 

SQL> select  table_name, subpartition_name, global_stats, last_analyzed, num_rows 
from dba_tab_subpartitions 
where table_name='TEST_TAB1' 
and table_owner='TESTUSER' 
order by 1, 2, 4 desc nulls last;

TABLE_NAME                     SUBPARTITION_NAME              GLO LAST_ANALYZED          NUM_ROWS 
------------------------------ ------------------------------ --- ---------------------- -------- 
TEST_TAB1                      P_20100131_GROT                YES 16-FEB-2010 16:23:34          0 
TEST_TAB1                      P_20100131_HALO                YES 16-FEB-2010 16:23:34          0 
TEST_TAB1                      P_20100131_JUNE                YES 16-FEB-2010 16:23:34          0 
TEST_TAB1                      P_20100131_OTHERS              YES 16-FEB-2010 16:23:34          0 
TEST_TAB1                      P_20100201_GROT                YES 16-FEB-2010 16:23:34          1 
TEST_TAB1                      P_20100201_HALO                YES 16-FEB-2010 16:23:34          1 
TEST_TAB1                      P_20100201_JUNE                YES 16-FEB-2010 16:23:34          1 
TEST_TAB1                      P_20100201_OTHERS              YES 16-FEB-2010 16:23:34          1 

<<output_snipped>>

TEST_TAB1                      P_20100206_GROT                YES 16-FEB-2010 16:23:34          5
TEST_TAB1                      P_20100206_HALO                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100206_JUNE                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100206_OTHERS              YES 16-FEB-2010 16:23:34          1
TEST_TAB1                      P_20100207_GROT                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100207_HALO                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100207_JUNE                YES 16-FEB-2010 16:23:34          0
TEST_TAB1                      P_20100207_OTHERS              YES 16-FEB-2010 16:23:34          0

32 rows selected. 

Everything's still looking very good in this case and so it looks like a great strategy - low collection overhead and accurate statistics. That is, until you start drilling down to the column level statistics, using Greg Rahn's query and identify some horrible problems. These are the statistics at the table level.

SQL> select 
   a.column_name, 
   a.num_distinct, 
   display_raw(a.low_value,b.data_type) as low_val, 
   display_raw(a.high_value,b.data_type) as high_val, 
   b.data_type 
from 
   dba_tab_col_statistics a, dba_tab_cols b 
where 
   a.owner='TESTUSER' and 
   a.table_name='TEST_TAB1' and 
   a.table_name=b.table_name and 
   a.column_name=b.column_name 
order by 1;

COLUMN_NAME                NUM_DISTINCT LOW_VAL              HIGH_VAL             DATA_TYPE           
-------------------------- ------------ -------------------- -------------------- ----------- 
REPORTING_DATE                        6 20100201             20100206             NUMBER     
SEQ_ID                               14 400                  3000000              NUMBER     
SOURCE_SYSTEM                         7 GROT                 WINE                 VARCHAR2  
STATUS                                4 P                    U                    VARCHAR2  

4 rows selected. 

At first glance they look pretty good, too. To give you a specific example, the STATUS column does have the correct High Value of 'U', which has just appeared in the last set of rows that were inserted. Based on what I've seen to date, Oracle does accurately update the High/Low column values and row counts when generating aggregated stats, but there's a problem here. According to the column statistics, there are 4 distinct STATUSes in the table, but that's not true, there are only 2

SQL> select distinct STATUS from test_tab1;

S
-
U
P

SQL>  

Based on a problem with such a small number of rows and only two distinct values, the chances of the Number of Distinct Values calculated during stats aggregation being accurate looks pretty slim and, when you consider what a key input to cost-based calculations those values are ....

Why aren't the values accurate? Well let's compare High/Low values to Number of Distinct Values (NDV).

When we gathered statistics on the new subpartition, we had access to the previous High/Low values at the table level. Here are the column statistics before stats were gathered on the new subpartition.

SQL> select
  2      a.column_name,
  3      a.num_distinct,
  4      display_raw(a.low_value,b.data_type) as low_val,
  5      display_raw(a.high_value,b.data_type) as high_val,
  6      b.data_type
  7  from
  8      dba_tab_col_statistics a, dba_tab_cols b
  9  where
 10      a.owner='TESTUSER' and
 11      a.table_name='TEST_TAB1' and
 12      a.table_name=b.table_name and
 13      a.column_name=b.column_name
 14  order by 1
 15  /

COLUMN_NAME                NUM_DISTINCT LOW_VAL              HIGH_VAL             DATA_TYPE 
-------------------------- ------------ -------------------- -------------------- ----------- 
REPORTING_DATE                        6 20100201             20100206             NUMBER  
SEQ_ID                                9 400                  30000                NUMBER   
SOURCE_SYSTEM                         8 GROT                 WINE                 VARCHAR2  
STATUS                                1 P                    P                    VARCHAR2  

So at this stage, there is one distinct value of status, which is P. When we gathered stats on the new subpartition, Oracle could see all of the STATUS values for the rows in that subpartition and noticed STATUS='U' on one of the rows and could work out very easily that it's higher than 'P', so updated the High Value accordingly as per the example shown earlier.

COLUMN_NAME                NUM_DISTINCT LOW_VAL              HIGH_VAL             DATA_TYPE           
-------------------------- ------------ -------------------- -------------------- ------------
REPORTING_DATE                        6 20100201             20100206             NUMBER     
SEQ_ID                               14 400                  3000000              NUMBER       
SOURCE_SYSTEM                         7 GROT                 WINE                 VARCHAR2 
STATUS                                4 P                    U                    VARCHAR2 

Now, what to do about the NDV? Remember, Oracle can't look at any of the data in Partitions or Subpartitions other than the one we're gathering stats on (that's the point, to reduce overhead). So it has to decide what the new NDV should be based on several inputs

1) The actual values in STATUS for the rows in the subpartition we can look at.
2) The previously-gathered and stored NDV for the other subpartitions.
3) The previously-aggregated NDV stored at the table and partition levels.

The problem is that Oracle knows the number of distinct values in other subpartitions (we looked at the data previously to calculate them) but not the values themselves and, without that information, how can it say whether the 2 distinct values (P and U) in this subpartition are distinct when compared to the values in the other subpartitions? Actually, in this case, we might expect Oracle to do something clever and realise that, as there was only one distinct value of P prior to our new subpartition and the only values in the current subpartition are P and U, that there can only be 2 distinct values. Sadly, it just doesn't work that way!

Although relying on lower-level statistics being aggregated up to higher levels might initially seem like a neat trick, it's going to lead to some pretty strange statistics, at least in 10g. Which is why Oracle recommend you gather Global Stats at the TABLE and PARTITION levels.

I wish I could say that was the only problem with this aggregation process, but there's more to come in the next post ...

Posted by Doug Burns Comments: (6) Trackbacks: (5)
Defined tags for this entry: DBMS_STATS, Optimiser, Partitions
Related entries by tags:
Statistics on Partitioned Tables - Part 5
Statistics on Partitioned Tables - Part 4
Statistics on Partitioned Tables - Part 3
Statistics on Partitioned Tables - Part 1

Trackbacks
Trackback specific URI for this entry

Statistics on Partitioned Tables - Part 3
As soon as I'd committed my last post, I knew it wasn't what I'd hoped for and said as much to a couple of people before they'd read it. I knew it would probably just add to any confusion people already had about this subject (something I'm particularly k
Weblog: Doug's Oracle Blog
Tracked: Feb 23, 06:53
PingBack
Weblog: www.pythian.com
Tracked: Feb 26, 18:04
Log Buffer #180: a Carnival of the Vanities for DBAs
Hello and welcome to Log Buffer #180. Time’s a-wastin’, so let’s go! Oracle There was so much Oracle stuff this week that I’ve decided to cram a little more of it into Log Buffer by providing a little less context than usual. ...
Weblog: The Pythian Blog
Tracked: Feb 26, 21:25
PingBack
Weblog: jonathanlewis.wordpress.com
Tracked: Mar 17, 07:44
PingBack
Weblog: coskan.wordpress.com
Tracked: Mar 21, 03:29

Comments
Display comments as (Linear | Threaded)

#1 - Greg Rahn said:
2010-02-23 00:05 - (Reply)

What database version are you using?

#1.1 - Doug Burns said:
2010-02-23 04:57 - (Reply)

Greg,

It's 10.2.0.4. I mentioned that in the first post but should probably have repeated it here. I'll be saying more about that in the next post in an hour or two.

Cheers,

Doug

#2 - Kamus said:
2010-03-31 10:38 - (Reply)

Hello, doug
One question, why the new NDV is 4, not 3? This time NDV is 2 and the old NDV on dba_tab_col_statistics is 1, it should be 2+1=3, isn't it?

#2.1 - Doug Burns said:
2010-04-05 09:55 - (Reply)

Hi Kamus,

That's an excellent questions but one I'm not going to answer at this point for a number of reasons.

1) At the start of the series, I very deliberately said

I'll steer clear of any remotely advanced angle

2) Although I think you're absolutely right to say that the NDV would make more sense as 3 instead of 4, I would argue it makes even more sense for it to be 2 and Oracle does have enough information to derive 2 and yet it still doesn't, at least not in 10.2.0.4

3) I might write a blog post specifically about how it derives the NDVs at a later point, or let someone else leave a comment to answer your question here, but I personally think it's better to know that the NDV aggregation process doesn't work properly and should be avoided.

If Oracle doesn't get it right with such simple examples, imagine the potential horrors when you scale it up to more realistic situations.

But, as I said, excellent question.

Cheers,

Doug

#3 - Kamus said:
2010-04-23 12:06 - (Reply)

Thanks Doug,
Excellent answer. :-)
I agree with you it's a risk to completely believe in NDV aggregation process.
And I'm sad to say, I've tested this process in 11gR1 and 11gR2, still the same issue.
I noticed in 11gR2 doc, it mentioned new value "APPROX_GLOBAL AND PARTITION" for GRANULARITY, it looks like some improvement for NDV aggregation, but I have no time to test yet.

#3.1 - Doug Burns said:
2010-04-23 12:16 - (Reply)

Yes - Incremental Statistics too. I'll be coming to that ... eventually ;-)


Add Comment

Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
BBCode format allowed
 
 

Statistics on Partitioned Tables

Contents

Part 1 - Default options - GLOBAL AND PARTITION
Part 2 - Estimated Global Stats
Part 3 - Stats Aggregation Problems I
Part 4 - Stats Aggregation Problems II
Part 5 - Minimal Stats Aggregation
Part 6a - COPY_TABLE_STATS - Intro
Part 6b - COPY_TABLE_STATS - Mistakes
Part 6c - COPY_TABLE_STATS - Bugs and Patches
Part 6d - COPY_TABLE_STATS - A Light-bulb Moment
Part 6e - COPY_TABLE_STATS - Bug 10268597

Comments

jonathanlewis.wordpress.com about 10053 Trace Files - Different Plan in Different Environments
Sat, 01.06.2013 11:26
Doug Burns about 10053 Trace Files - Different Plan in Different Environments
Tue, 02.04.2013 08:57
You're welcome. Now I just nee d to pull my finger out and ac tually come up [...]
Howard Rogers about 10053 Trace Files - Different Plan in Different Environments
Mon, 01.04.2013 23:08
Makes a big difference, so tha nks for that! With two brow ser windows, o [...]

Upcoming Presentations


Bookmark

Open All | Close All

Syndicate This Blog

  • XML RSS 2.0 feed
  • ATOM/XML ATOM 1.0 feed
  • XML RSS 2.0 Comments
  • Feedburner Feed

Powered by

Serendipity PHP Weblog

Show tagged entries

xml 11g
xml ACE
xml adaptive thresholds
xml ASH
xml Audit Vault
xml AWR
xml Blogging
xml conferences
xml Cuddly Toys
xml Database Refresh
xml DBMS_STATS
xml Direct Path Reads
xml Fun
xml grid control
xml hotsos 2010
xml listener
xml Locking
xml oow
xml oow2009
xml optimiser
xml OTN
xml Parallel
xml Partitions
xml Patching
xml swingbench
xml The Reality Gap
xml time matters
xml ukoug
xml ukoug2009
xml Unix/Shell
xml Useful Links

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.

Design by Andreas Viklund | Conversion to s9y by Carl