Feb 22: "How useful are diagnostic/optimization tools?" - Another View
The subject up for discussion is 'How useful are diagnostic/optimisation tools' (Sorry, Dan, I'll revert to true English spelling for this one
"If you do not have a method you can explain to another person, that you can repeat multiple times and reach the same conclusion with the same data, that accurately identifies the root cause and recommends the correct plan of action ... you really do not have a method ... you have a bunch of guesses."
I'd suggest that you have some assertions and a bunch of theories (or educated guesses) that can then be proved right or wrong, but I know where Dan (and Alex-BAAG-Gorby) are coming from. The fact of the matter is that regardless of the currently available tools that you use, there's still a degree of individual skill required to analyse the results and even if you have those skills, you'll still come up against problems that you've never seen before and your skills will grow.
Dan goes on to discuss software as the implementation of a method or process in his next paragraph and concludes that if a tool can't come up with a correct optimisation method, then it's not a complete optimisation tool. He's right, of course, complete optimisation tools for Oracle systems simply don't exist yet. I'm in the midst of writing a course that discusses ADDM (among other things). Now it's a clever tool and I think it helps reduce time to problem resolution (possibly even more so when you understand a lot about performance already!) but the fact is that it often gives utterly stupid recommendations. It has a desire to have you add memory to the SGA constantly, in the absence of any other solution to a sick application, and (so far) I've rarely seen it suggest that CPU is a bottleneck. That "CPU is not a bottleneck" at the end of every ADDM report I've seen so far is enormously reassuring
So if all of these tools are so flawed, what's the point of them? Well, as Dan says in the last sentence of that paragraph ....
"It certainly provides invaluable data, but that data still needs interpretation"
Both parts of that sentence are true. Without wanting to come across as a pompous, be-suited, middle-aged, "this stuff is really hard, Sonny" consultant, the fact remains that the behaviour of entire systems is inherently complex or maybe a better way to say this is that applications are so varied and the parameters governing their behaviour are so wide and sometimes dynamic as to appear unique. Any automated tool or method for diagnosing performance would have to be extremely smart.
As it happens, I'm a bit of an amateur AI fan. I've always had a lurking interest in it and have read a few academic and more populist texts on the subject. What strikes me is that developing software to automatically solve problems in even very simple domains has proved time consuming and fairly unsuccessful to date. That's not to say we won't get there, but it's a long road ahead.
I'm not suggesting that software doesn't have it's place here. Everyone who reads this blog is using a pretty intelligent automated method every time they use Oracle. The CBO is a piece of software that applies an optimisation method automatically and is largely successful, but the CBO can't and could never tell you that the SQL statement that it's analysing is part of a batch report that no-one ever reads. A tool can't know whether what you're doing is sensible - that question is too high-level and abstract. How can any number of lines of code understand that actually, nobody really gives a monkey's about that stupid report that destroys system performance?
That's the problem with performance tuning - not only do you have to solve the narrow technical problem, you have to take a step back and look at human and business issues too. Humans are still much better at that
But let me take a step away from that ultra-wide (but crucial) view and look again at whether, for something such as a Statspack report or a bunch of trace files, a tool could analyse them for us. Well, Anjo Kolk has already demonstrated some attempts at this at oraperf.com a long time ago and they were pretty impressive. (I'd say "are", but I just tried to post a link and I don't think the site's available any more.) ADDM is another impressive tool, as I've mentioned already and I'm more impressed by several of the recent Oracle-supplied tools every time I use them (which is a lot at the moment).
So I'm not saying we shouldn't discuss the problem and take steps towards a solution, but to expect a complete solution to come along any time soon is like expecting a car-driving robot around the corner. Actually, I think I'd rather not come across a car-driving robot around the corner! (Seriously - just think about it. It seems reasonable and I've seen cars park, accelerate and brake themselves. I've seen an automated BMW(?) tear around a track, but what on earth would it make of bikes, children, ice ....)
People like Dan and Alex are gold-dust for the wider Oracle community, because these subjects need to be discussed so everyone can move forward and having a blog is gold-dust for me, because I can ramble away for ages when the mood takes me
I think the best we can hope for at the moment is :-
1) Gather High Quality Information
2) Produce first-pass automatic recommendations.
3) Analyse those recommendations to see if they're any good.
4) Analyse the available information for other causes and develop possible solutions (partly method-based, partly experience and intuition-based)
5) Test each of the theories developed at steps 3 and 4
But, as I said, I'm probably just rambling ....
Diagnosing Locking Problems using ASH/LogMiner – The End
Diagnosing Locking Problems using ASH/LogMiner – Part 9
Diagnosing Locking Problems using ASH/LogMiner – Part 8
Diagnosing Locking Problems using ASH/LogMiner – Part 7
Diagnosing Locking Problems using ASH – Part 6
Diagnosing Locking Problems using ASH – Part 5
Diagnosing Locking Problems using ASH – Part 4
Diagnosing Locking Problems using ASH - Part 3
Diagnosing Locking Problems using ASH - Part 2
Diagnosing Locking Problems using ASH - Part 1
#1 - Tim Hall said:
2008-02-23 09:07 - (Reply)
Hi.
It's a difficult issue because many people see a tool like ADDM and assume it must be correct because Oracle wrote it. As you've said, these tools provide pointers, not hard and fast recommendations, but they are not presented in that way, so people believe them without question.
The lead must surely come from the producers of these tools (in this case Oracle) to explain that they are not the word of God, just a helping hand.
If used in the correct context these tools are great, but if used in the wrong way they can be a complete pain in the ... ![]()
Cheers
Tim...
#1.1 - Doug Burns said:
2008-02-24 08:48 - (Reply)
The lead must surely come from the producers of these tools (in this case Oracle) to explain that they are not the word of God, just a helping hand.
I agree, Tim, but sometimes I'm not sure where these perceptions come from, because surely people can apply their own common sense and realise the limitations?
I suppose calling anything 'Automatic' or 'Self-Managing' is asking for trouble and marketing people did go a little crazy, but only as much as describing Oracle as Unbreakable. I mean, how ridiculous was that?
But it's not just marketing people. When I look at some of the related technical articles on the net, including some on Oracle websites, they're so fluffy and positive as to render them almost useless.
I think that the licensing fiasco has tainted views, too.
#2 - Noons said:
2008-02-23 13:00 - (Reply)
Note - i posted this on behalf of Noons. Apologies for the spam filter playing up again and, should anyone run into problems, just drop me a mail at my name at yahoo. My email address is all over the net anyway :-(
"but the CBO can't and could never tell you that the SQL statement that it's analysing is part of a batch report that no-one ever reads"
As usual: spot-on in a nutshell, Doug.
Folks who rely blindly on tools to replace their brains deserve all they get in the end.
Not to say that it is impossible, as you point out. But we are miles away from anything really useful at the moment.
And it is erroneous, nay, dangerous even to make "infallible" claims about any such product, regardless of who or what wrote it.
The expectation that slapping ADDM in a system is a cure-all is downright stupid, and yet what does sales and marketing go an do? I know: those are not exponents of lack of stupidity, mais quand-meme, alors!
Similar mindset for the cbo, sometimes. Look at the "special values" issue: folks who use a date far in the future to mean "still open".
Fact 1:
people DO INDEED use this, regardless of it being recommended or not. To call them wrong is akin to excuse bad software behaviour by blaming its users!
Fact 2: the alternative is a NULL date - non-indexed - that causes FTS if anyone queries on it. Table scans in almost all cases require more resources than a single-value index query: that is WHY indexes were invented in the first place!
Fact 3: the CBO in some cases will ASSUME that given a high-value and a low-value of a column, the distribution of data will be flat. Then it starts to do silly things when it finds that predicate dates are too skewed to the "start" of the series of values.
And here lies the rub: if the cbo is smart enough to look for a max and min value, then it should ALSO do a full scan to find them. Rather than ASSUME distribution is flat, why not do ALSO a histogram? After all, it's doing a full scan ANYWAY to find the min/max, isn't it?
Once the CBO starts taking into account min-max values into its plans, it MUST NOT assume ANYTHING about the data: it MUST examine ALL of it, not just bits and pieces.
As is, we have a "smart" optimizer baulk and do silly things at something as basic, common and widely used as for example a "high-date" of 31-12-3000!
Silly, and it should never have happened.
Look: we all know users should not use "high-values", for many other reasons.
But to ignore the fact that they indeed do is akin to insist that every driver of a car should be able to optimally break without blocking wheels.
Personally, I'd rather drive a car that uses ABS: it still lets me hit the breaks like crazy, but it doesn't blow up in my face if I do.
That is the same mindset that should be applied to CBOs and other tools: rather than penalize users for doing something they simply won't stop doing, just work around it.
First we must fix this. Then we can take flights into AI land. Walk, before running.
#2.1 - Doug Burns said:
2008-02-24 09:17 - (Reply)
Noons,
There's a lot I agree with there, so let me pick on a few things we differ on ![]()
we are miles away from anything really useful at the moment.
I'm not so sure. I'm biased, of course, because I'm writing a course that uses these tools and will be teaching it for Oracle who will give me money afterwards. But, as I had complete freedom in deciding the subject and it will be *my* material from *my* perspective, I'm no Oracle mouth-piece.
My experiences to date are that ASH and AWR are just complimentary extensions of existing methods and pretty successful ones.
The SQL Tuning and ADDM parts are more of a departure and, unsurprisingly, their usability and degree of success are more variable. When they work well, though, I've found that they can save me a lot of time and offer some useful suggestions.
Do I think they do a better job than an expert who is interpreting the same ASH/AWR data? No. Do I think they can work quicker than most people I've come across? Yes. Do I think they can do better than the vast majority of DBAs I've come across? Yes.
All up for debate and all open to revision as I come across more successes and failures but my point is I don't think we're miles away from something useful.
What concerns me, and this is aimed at many people I've spoken to and not you specifically is that, in the usual slightly sneering, cynical way of DBAs (and I count myself as a member), the terms 'Self-managing' and 'Automatic' can lead to jokes and rejection rather than a serious evaluation of whether these tools are any good or not.
I remember a time when the common view was that rman was useless and would never take off. I don't hear that view very often these days ![]()
Once the CBO starts taking into account min-max values into its plans, it MUST NOT assume ANYTHING about the data: it MUST examine ALL of it, not just bits and pieces.
Just a small section from the large comment about the CBO. I tend to agree with you. Oracle types tend to enjoy the technical challenge of understanding how it works, but many's the time I've been teaching a class and wondered if it isn't all just a massive turn-off for people to have to deal with these interesting twists and turns when the pressure is on at work.
But your example is also interesting because I see it as something that the likes of SQL Tuning in 10g and up is trying to address. Look at the underlying data more carefully to come up with more considered decisions, based on better information. Now I can imagine a couple of retorts already
Why should we have to pay for it? I've given up trying to answer that one - that's for Oracle to decide and they haven't done themselves any favours.
If it's so automatic, why can't it just be enabled and do the job for me? Well, at the moment it can't be trusted to always come up with a better plan for starters! (Could you *ever* trust it completely?)
Oh, and thanks for reminding me of a classic example to use in the course that had slipped my mind completely! Consider it stolen and maybe you can expect a airline miniature of whisky in the mail, by way of thanks ![]()
#2.2 - Doug Burns said:
2008-02-24 12:50 - (Reply)
... and I should also have said :-
if the cbo is smart enough to look for a max and min value, then it should ALSO do a full scan to find them. Rather than ASSUME distribution is flat, why not do ALSO a histogram? After all, it's doing a full scan ANYWAY to find the min/max, isn't it?
If you leave DBMS_STATS to it's own devices in 10g, it will create the histograms. But, as soon as you start doing such things automatically, you find that
a) It generates histograms you probably *don't* want too and then, once you pile your favourite bind variable peeking ;-)(which I agree with you about) on top, it's a recipe for occasional disaster.
b) There's an associated overhead in the work which you might decide you don't want, so people then switch it off.
As Oracle try to do these things automatically, the implementation can often be flawed, so are we *sure* that we want Oracle to handle things automatically?
P.S. What prompted this additional comment was I started to play around with the fake dummy date example this morning (on 10.2.0.3) and, of course, the default DBMS_STATS METHOD_OPT generated histograms for me!
#2.2.1 - Noons said:
2008-02-24 23:53 - (Reply)
Definitely agree with all you said.
The last question you propose is in fact my problem: do we really want Oracle to do this tuning automatically, at this stage of the proceedings?
A quick example - sorry, I'm a sucker for examples! - to help explain why I think not:
- temporary holding table in a batch process. Gets filled up at start of process with intermediate data. Batch goes through, uses the table for whatever, then if all is well it truncates table in readiness for next batch.
Now, what happens when Oracle's default "tuning" - which we didn't pay for but is still there and active anyway... - kicks in at midnight and finds an empty table that has been changed during the day? It analyzes it and correctly marks it as empty.
Next morning batch kicks-in and loads table, then uses it for normal processing: bang, there goes the cbo heading South because now it's assuming this is an empty table and it isn't!
Yeah, I know: dynamic sampling. Can it be enabled JUST for one table? Nope, bang again...
There are current limits to automation that can only be overcome by either one of these:
1- Application itself hints the cbo on what to do because it knows what it needs.
2- CBO uses complex heuristics and hints from a public "helper" API whereby it can "grok" the pattern of execution of certain types of processes and correct and adjust accordingly. Preferably with ample feedback to the admin, just in case things *still* go South.
3- Oracle allows for bypass or fine tweaking of automatic tuning processes, per object. Where is the automation then?
This is why I think we're still a long way from a good, solid solution. Getting there, but still a bit away.
#2.2.1.1 - Jonathan Lewis 2008-02-25 17:23 - (Reply)
I think there are some things that you can't expect the optimizer to handle automatically - and your example of the table that gets truncated and reloaded is one such. How long should the "optimizer" take observing your pattern of use of that table to notice that you truncate it once per day then reload it and run queries ? Because that's the only way it could decide to do something special. Then, should it "automatically" sample it every time after the load - and what if you do a three stage load - or should it sample once and assume no change in data thereafter - which might be someone else's idea of perfection ?
"Yeah, I know: dynamic sampling. Can it be enabled JUST for one table? Nope, bang again..."
A couple of possible workarounds -
a) delete statistics, lock statistics.
b) create or replace view v1 as select /*+ dynamic_sampling(t1,2) */ * from t1. query the view.
Regards
Jonathan Lewis
#2.2.1.1.1 - Doug Burns said:
2008-02-25 18:49 - (Reply)
I think there are some things that you can't expect the optimizer to handle automatically
... and your example there also illustrates that, even if it *could* (and I agree it's unlikely) would it be a sensible use of development effort and code execution overhead anyway? It's so much easier to apply human knowledge to those situations and implement a manual work-around. Then again, in these days of little or no up-front thought and design, it's frustrating when you hit these problems for the first time on a production system.
Nice ideas, too ![]()
#2.2.1.1.1.1 - Noons said:
2008-02-26 07:08 - (Reply)
Turst me: it doesn't help when Oracle's own marketing "gurus" are the ones suggesting all should be well with batch processes like I described, IF we purchase the gazillion dollar ADDM-whatchamacallit-tuning-pack!
Because "it takes care of all that tuning" for you.
We all know it doesn't. But that will never stop Oracle marketing from trying to oversell their so-called "infallible automatic tuning".
We all know the solution is to use one's brains and start locking in stats and other simple such techniques. Which BTW are rarely taught in Oracle's own tuning courses, nowadays.
I blogged about these same problems and solutions nearly 6 months ago with not one single ounce of feedback. Mostly because few modern dbas know it's possible to do this without breaking ANY agreement with Oracle: they have been so oversold on Oracle's own "tuning" that they think it's the only possible game in town.
It's not the "intelligent optimizer" I'm against: it's the irresponsible oversell of its capabilities. Now or in future.
#2.2.1.1.2 - Noons said:
2008-02-26 07:29 - (Reply)
Thanks for jumping in, Jonathan.
Sure: there are situations the cbo can't easily cover. That is precisely what I intended to show with my example.
Yet, note that what I described is a common occurrance, that will strike just about ANY application with a batch component, in ANY database - not just Oracle.
In other words: as common as common can be.
And yet, it is not covered by any "intelligent" optimizer - IMHO with good reason. I blogged about this back in September 07, in the context of Peoplesoft.
Note that it is one of Oracle's own products, and it also suffers from this.
Now, the proper way to address these problems is to apply option 1 in my earlier reply: let the application itself do the analyze when it *knows* it needs to do so. Much, much easier than expect the cbo to reach the same conclusion at the end of an indeterminate sampling period.
Yet, even though Peoplesoft is one of Oracle's own products, does it do that? No way!
What is Oracle support's and marketing reaction then, when shown this very same problem?
Guess what: "you should be running ADDM and the OEM tuning pack, that way you can let it auto-tune all these SQL statements"! (sic)
It only shows neither of these two groups have the slightest notion or grasp of what the real problems are that production dbas face every day.
But that won't stop them from overselling the so-called "intelligent" products...
#2.2.1.1.2.1 - David Kurtz said:
2009-05-29 23:01 - (Reply)
I know this is an old post, but the reference to PeopleSoft caught my eye. Most of the structure of PeopleSoft was laid down long before Oracle's acquisition, and to this day Oracle stick to the platform agnostic line. There still is very little that is done a specifically Oracle way.
One thing that PeopleSoft's Application Engine batch processes do do is analyse statistics on working storage tables during the batch processes (I am not going to be sidetracked by the problems of using the database in this way, but PeopleSoft does it alot, and there isn't much you can do about it). There is a command in the PeopleSoft language that developers must explicitly code. Generally it works quite well, but there is not much control. Either you estimate stats, fully compute stats, or do nothing.
I have found in many cases that Optimizer Dynamic Sampling is sufficient for Oracle to come up with the right answer, but sometimes you do have to explicity collect stats on working storage tables. In which case you have to code that into the problems. The problem can be that PeopleSoft delivered programs collect stats far more often than is really necessary.
Then factor in that 10g does two additonal things by default:
i) It retains previous sets of stats, by default for 31 days. This is fine when a schema wide stats job only refreshes stale stats. But every a process explicitly collects stats on a table, the previous version is stored. The SYSAUX tablespace fills up, and after the retention period dbms_stats also purges the history. On a system with a lot of concurrent batch processes you get contention between concurrent inserts and deletes of this data.
ii) 10g automatically collects histograms where it thinks it is helpful, usually far to often. You can get contention between the concurrent insertion and deletion of statistics. And the histograms are also retained in the history.
Like everything, it depends on many things, you need to be able to stike a balance. You also need to understand your application, and be willing to take control.
#2.2.2 - Jonathan Lewis 2008-02-25 17:12 - (Reply)
"Generating stats automatically for the dummy value data".
I have a testcase for this which covers about 5 years of data at 100 rows per day, with 'two days worth' of data out at the year 4000 - the "null" marker.
With this data, the default collection in 10.2.0.3 doesn't seem interested in generating a histogram. How uniform is the rest of your data - is the histogram there because of the outlier, or because of the general data pattern ?
Regards
Jonathan Lewis
#2.2.2.1 - Doug Burns said:
2008-02-25 19:11 - (Reply)
How uniform is the rest of your data - is the histogram there because of the outlier, or because of the general data pattern ?
I'm *slightly* confused about which comment you were replying to there, but I'm fairly certain it was mine. I think it's probably the general data pattern because it's unevenly distributed and has a small number of distinct values too. Here's the example ...
Quick test in here
Mmmmmm .... here isn't the example, actually. I started to cut and paste the code I thought I'd carefully preserved yesterday and I'm not seeing the histogram being generated! Back to the drawing board ... ![]()
I'll try to recreate what I (think) I saw and will post the test later, one way or the other.
#2.2.2.2 - Doug Burns said:
2008-02-25 21:09 - (Reply)
OK, after wondering what on earth I'd seen yesterday and being unable to create it, I think I've finally narrowed it down to this sequence of events. Although the steps are selective and carefully ordered, this is the entire script with output and there's absolutely nothing missing apart from some output for one of the queries.
In retrospect, I could have made it more elegant, eliminated redundant information and there are a few weird steps in here to try to eliminate some factors (e.g. the instance restart at he beginning) but it'll do for now. Fingers crossed for the formatting.
Oracle 10.2.0.3 on Vista ...
Forgive the lack of trace files and further evidence, but this has already robbed me of my evening, so that's all there'll be for now.
#2.2.2.2.1 - Doug Burns said:
2008-02-25 21:15 - (Reply)
... and there'll probably be more to follow later.
#2.2.2.2.2 - Jonathan Lewis 2008-02-25 21:48 - (Reply)
You got a histogram because of the inherence data skew, not because of the extreme date. You're trying to see if a single extreme value causes a histogram, so you need to create a data set that doesn't - then put in the pseudo-null and see if that introduces a histogram.
Regards
Jonathan Lewis
#2.2.2.2.2.1 - Doug Burns said:
2008-02-25 22:10 - (Reply)
Thanks. Lesson learnt.
I've just noticed that I didn't post the example properly, though. Why I didn't get the histogram was bothering me earlier today. I cut too much out so let me try again.
I think the bit that surprised me most is why the initial stats collection and the one after the first query don't generate a histogram but the final one does. But then I'm probably just being brain-dead and should sleep on it.
(Oh, and the 11g/Direct Path Reads is going even worse, what little time I've spent on it!)
#2.2.2.2.2.1.1 - Jonathan Lewis 2008-02-25 22:38 - (Reply)
The default method_opt is 'for all columns size auto', so Oracle only checks columns that appear in col_usage$ (i.e. ones that have appeared in a WHERE clause) to see if they could use a histogram.
Your first gather is when you haven't used the table - so you don't get a histogram.
Your second call is after a query has used the column, and you HAVE got a histogram, as indicated by endpoint_number values 5 and 6 - which means the 5% sample indicated that 5/6ths of your data was on or before 17th April 07, and 1/6th from there to y3K. (order by endpoint_number helps when reading a histogram, by the way).
Your last call got a better sample, and shows a proper cumulative frequency histogram for the full set of dates.
Regards
Jonathan Lewis
#2.2.2.2.2.1.1.1 - Doug Burns said:
2008-02-25 22:44 - (Reply)
Terrific answer - thanks. I didn't look at the results of the second collection carefully enough.
It's been an education for me. It might be more of an education if I got back to reading that book of yours, of course.
One day ...
#2.2.2.2.2.1.1.1.1 - Doug Burns said:
2008-02-25 22:52 - (Reply)
Oh, but I should also say (in addressing my original comment to Noons) that in some cases, with the default options, Oracle 10 will generate histograms where it wouldn't have done previously. That might (for example in this case) generate histograms that would help with the dummy date, but because of the overall data skew rather than the dummy date value specifically.
That was what I was trying to get at and the problems in my example were a distraction to that point (albeit an interesting distraction)
I've also seen dbms_stats create histograms that seemed unneccessary when using auto sampling.
Thanks for your help.
#3 - Daniel Fink 2008-02-24 05:00 - (Reply)
The comment about interpreting data that you quote, that really is from Jonathan Lewis. I stopped by his recent half day session at RMOUG for the briefest of moments...just long enough to hear him utter the words that I have written (not an exact quote, but the idea is the same). If not for Jonathan and a very opportune visit by myself, I would have have had that insight. Credit given where credit is due.
#3.1 - Doug Burns said:
2008-02-24 08:41 - (Reply)
Well in that case, Dan, I've probably picked up on something he's said that's similar previously and liked it as much as you did! I was at his Statspack presentation so I suspect it was in there.
We can take this attribution thing too far sometimes
There were lots of different thoughts and ideas in your blog - that was just one that I happened to like and agree with.
Keep those blog postings coming, they exercise minds.
#3.1.1 - Daniel Fink 2008-02-28 13:48 - (Reply)
Thank you, Doug. I'll keep exercising those minds...as long as no one refers to me as the Richard Simmons of Oracle!
As for the attribution, it is more that the statement by Jonathan was so absolutely brilliant (and, upon reflection, amazingly obvious) that it just had a profound effect on my thinking.
Sliante!
#3.1.1.1 - Doug Burns said:
2008-02-28 21:07 - (Reply)
as long as no one refers to me as the Richard Simmons of Oracle!
http://en.wikipedia.org/wiki/Richard_Simmons
You know ... if I screw my eyes up ....
![]()
#4 - Doug Burns said:
2008-02-28 05:52 - (Reply)
There's more on this topic over on Richard Foote's blog here


On the one hand, I regret the fact I seem to be posting a lot of links to other people's work at the moment but I couldn't let this one by, if only for those who don't use OraNA or the like and might miss it.The one consistent message I kept hearing fro
Tracked: Feb 27, 06:25