Mar 11: Hotsos 2010 - Day 4
He showed a video of Boeing stress-testing the wing of the 787 and, as he pointed out, aircraft manufacturers really know how to stress-test! (Of course whether that reassures you as it does me, or makes you wish no-one would talk about wings disintegrating, as it probably would Mads, is personal.) The video showed Boeing's test equipment, which is complicated, expensive and non-revenue generating. Those tests are expensive but when people's lives are on the line, what choice is there? Boeing knows that it has to test the analytic models used in the design. He spent a lot of time talking about good test design. A few thoughts that stood out to me ...
- Some stress tests are a waste of time. Will the Boeing 787 land on the moon? If this test fails, what has it proven? If it passes, then it's awesome but it would be a very expensive way to prove it can cope with commercial flights in Earth's atmosphere.
- Why test for more than you will see in Production? Because you don't really know for sure what you'll see in Production.
- At some point, but I can't remember the context, he used a Scottish phrase that he'd heard Billy Connolly shout (although the Big Yin was only fully credited later in the day) ...
"There's no such thing as bad weather, just the wrong clothes"
... looked over at me and said - "I'd love to hear you say that, with the proper accent". I declined politely.
- Most people try to prove only that their systems will work.
- Most tests of systems that are destined to fail never proved it in advance.
- Test to destruction
b) Until the system melts
c) Decide whether your real requirements are likely to be lower or higher than melting point.
There was a small amount of time for questions and once it looked like they were done, I granted Cary's wish (never thought I'd say that), stuck my hand up and repeated The Big Yin's words. It was only after the laughter had stopped that I realised I might have ruined his big closing, but I think he was ok about it.
Next was Tanel Poder talking about LGWR, log file sync waits and COMMIT performance and, shock, horror, I was actually going to say that this was one of the least rewarding sessions of the week for me. What?!? Tanel? But he's, like, an Oracle God! LOL. But there were reasons:
- I realise that I know a *lot* about how log file sync and log file parallel write work, how they relate to each other and some of the problems they might help you identify. Because it's a subject I'm *so* familiar with, I didn't learn much.
- His main demo didn't quite show what he wanted it to because it didn't run multiple sessions but, frankly, I'm in no position to talk about demos this week!
By the end, the presentation turned out ok, not least because there was another unexpected appearance from Bob Sneed to talk about the I/O components involved in redo log management including a suggestion that LGWR be put into a higher scheduling class (but not Real Time!) Updated later - make sure you read Bob and Kevin's comments below. I'll try to find a link to his slides and let you take a look yourself.
I loved Tanel's Big Log File Sync Tuning Secret, though ...
It was particularly relevant to me because I had a Big Log File Sync Tuning Secret as the closing moment of my own presentation. The problem was I couldn't use it after the demos went wrong!
USE ASYNCHRONOUS COMMITS
But, in my case, that was supposed to be funny, too.
I ran off to try and use the free breakfast voucher that Marco had given me but I was just too late. No food again, then. Well, I had a couple of slices of cold meat at lunchtime, but mainly to catch up with Alex G before he had to present and then head back to Ottawa. I managed to skip one session at this stage but, after a quick call home, I decided to go along to Alex's RAC Connection Management presentation after all (a little late). Although I have seen some of this stuff before, I always enjoy watching Alex's demos and was particularly impressed by the fact that he'd managed to write his own RAC connection load balancer! I was waiting for the applause in the room but either people didn't quite get it or there was just a lack of energy post-lunch on the last day. I suspect the latter.
Of course, once I'd said goodbye to Alex properly (don't see him nearly enough), I was a little late for whichever session was going to be my final one of the conference and I was hopelessly torn between Kyle Hailey's modern SQL performance tools presentation (Kyle's done a lot of cool work in the area of Oracle Performance Visualisation) and Chris Antognini's Diagnosing Parallel Executions Performance. In the end I plumped for the latter because I thought it was going to be like something I'd unsuccessfully attempted a couple of years ago and I wanted to see if Chris had a different angle on it and had been more successful. In the end, I probably made the wrong choice because although Chris' presentation was great, it was really all stuff I already knew. Definitely my bad call, though. Hopefully I'll get a chance to catch up with Kyle's presentation at some point in the future too!
After that there was just the usual short farewell and thanks from Gary Goodman of Hotsos. Although the thanks were appreciated, I'm glad they were spread around everybody because the attendees are one of the things that make this conference great and Becky and Rhonda did their usual sterling job of organising everything.
Then it was time for some Fajitas with a few friends (actually, a whopping great number of friends who practically filled the Mexican restaurant!) and a few very sedate beers. (We are old men (and women) now and the night before was a big one!) While we were waiting to go to the Mexican, I had one great surprise left - Alex's flights weren't going to get him home, so he came back from the airport and had to check in overnight! At least I got a chance to talk to him properly when I wasn't hopelessly drunk and didn't try to seduce him this time.
Now I need to stop blogging and get back to listening to Tanel's Training Day (good stuff, too, but more about that later)
LGWR, LGWR, when will they ever learn. I don't care who's talking about elevating LGWR priority. It makes no difference on a busy system. Here's a little problem for you all to consider. When a foreground process is waiting for LGWR to post it (signaling the session's redo has been written), what state and mode is it in? It's sleeping in kernel mode. When a kernel mode process becomes runnable what gets CPU first? A runnable user mode process (with elevated priority) or a runnable kernel mode process? Yep, you guessed right.
So, if you have, say, 8 cores and LGWR is servicing a group commit for 11 sessions, uh, what do you think happens when he posts all of them and what do you think an elevated user mode priority for LGWR has to do with any of that?
Must be time for me to blog all that again.
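The 8-core, 11-session scenario in the comment above can be sketched as a toy model. This is only an illustration of the commenter's claim that kernel mode trumps user mode, not a description of how any real operating system dispatcher works. The `Proc` class, the `pick_for_cpus` function and the priority value of 60 for LGWR are all hypothetical names and numbers made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    name: str
    kernel_mode: bool  # True if the process becomes runnable in kernel mode
    priority: int      # toy scale, higher means more favoured (assumption)


def pick_for_cpus(runnable, cpus):
    """Toy dispatcher: kernel-mode processes first, then by priority.

    This ordering rule is the assumption being illustrated, not a real
    scheduler algorithm.
    """
    ordered = sorted(runnable, key=lambda p: (not p.kernel_mode, -p.priority))
    return ordered[:cpus]


# 11 foreground sessions just posted by LGWR, all waking in kernel mode.
foregrounds = [Proc(f"fg{i}", kernel_mode=True, priority=0) for i in range(11)]

# LGWR running in user mode with an elevated (hypothetical) priority.
lgwr = Proc("LGWR", kernel_mode=False, priority=60)

on_cpu = pick_for_cpus(foregrounds + [lgwr], cpus=8)
lgwr_scheduled = lgwr in on_cpu
print("LGWR got a CPU in the first round:", lgwr_scheduled)
```

Under that ordering rule, all 8 CPUs go to kernel-mode foregrounds and LGWR waits regardless of how high its user-mode priority is set, which is the point the comment is making.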
I don't care who's talking about elevating LGWR priority.
Neither do I. In fact, I spent some time questioning this at my current client a few months ago, to no avail. (Well, it probably does make a bit of a difference who's saying it, if I'm honest.)
But, yeah, I think it is time for you to blog again.
It was great to see you again, albeit through fairly drunken eyes on that last evening!
Hm, noticed this reply. Not sure whether you were in the room and heard the real story (about just committing less and preventing priority decay as the main things).
So, what do you think about your own blog article then?
"It is a scheduling storm. For this reason you should always set LGWR’s scheduling priority to the highest possible (with renice for instance)."
... and that's the thing about blogging quickly about presentations - highlighting the wrong things.
I thought you might want to know that I used your slides yesterday to explain log file sync and log file parallel write wait times and the diagrams really did the trick, so thanks!
I wasn't in the room so I'm not taking a position against anything you've said. As if I'd argue with you anyway. Good grief. In my comment above I explained the perfect case in point where elevated LGWR priority makes no difference. LGWR scheduling priority makes no difference if there are 11 runnable kernel-mode processes and 8 CPUs. Kernel mode trumps user mode. That is the point I'm making. That is one of the main points I was making in my Manly Man LGWR SSD post (the one you quoted).
Yep, you found a blog post where I said to elevate LGWR priority. That's because it can't hurt. Neither can it change the fact that kernel-mode trumps user mode and that's why anyone that reads my blog entry (http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/) would be better served by paying particular attention to the main points I made in that blog entry which were:
1. Process mode (kernel/user) is more important than process priority.
2. Simple processes (e.g., the "noise" process) can get in the way of LGWR and add to LFS.
3. Log file sync waits are generally not LGWR I/O related.
4. Linux offers no lightweight callable preemption protection APIs as did best of breed Unix back in the olden days.
I suppose I should edit that nearly three year old blog post and change:
"For this reason you should always set LGWR’s scheduling priority to the highest possible (with renice for instance)."
to:
"[...] highest possible (with renice for instance). It can't hurt."
Having revisited this topic I realize now that my Manly Men LGWR SSD post morphed into more of a conversation about priority than it should have (in the comment section). Ugh. All that talk about process mode and we got stuck talking about the less interesting topic of process priority.
I have an upcoming fresh post about the topic. We'll see how the saga progresses.
Who said anything about *elevating* LGWR priority? The point was to keep its priority from being inappropriately reduced and otherwise prevent its operation from being degraded by competition.
I'll have my LGWR slides posted soon; after I add a bit about "exactly what command do I type?" (for implementing the widely-proven FX 60 strategy). I'll also add that to my "CPU QoS" slide set.
The importance of this topic is not conjecture, fellows. It has been dramatically demonstrated in innumerable customer scenarios and consideration of the topic is pretty much routine in benchmarking activities.
Thanks for stopping by. I'm glad you noticed Kevin's comment and I was going to drop you a mail to mention it.
Sincere apologies for completely misrepresenting what you said. My excuses are
a) I was trying to avoid specifics until I could post a link to your slides and people could read your own words. So I over-generalised and then changed what you were saying, too!
b) These posts during conferences are often a little rushed and I need to decide whether it's worth persevering with them or not.
Anyway, I've updated the post to hopefully reflect your actual words more effectively and I'll be sure to post a link to your slides here when they're online.
As I find myself repeating fairly regularly, one of the things I like about blogging is that people can correct me when I screw up.
Actually, re-reading the post and comments properly (sigh, I am *so* jet-lagged and busy on top!) I'm not sure I did misrepresent you that much, so I won't modify the blog, but just draw attention to the comments and hopefully your slides, too.
No worries, mate! Sorry if I came off huffy!
There are certainly circumstances where the benefit of assuring LGWR CPU service may go unnoticed, especially on throughput-oriented benchmarks with ample CPU to spare. However, even in those cases, I'd expect to see reduced average response times and reduced response time variance by taking some measures to assure that LGWR does not experience degraded service.
For production workloads on systems that approach 100% CPU utilization, LGWR priority inversion will probably be calamitous. Other effects such as 'interrupt pinning' of LGWR are also root causes of misery when LGWR lands often-enough on interrupt-hot CPUs. My mission to prevent these things is the result of being called into innumerable escalations at Sun where they proved to be crucial.
Anyways - I'll look forward to discussing my "Brute Force Parallelism" material with you sometime between now and Hotsos 2011! Cheers!
Sorry if I came off huffy!
Oh, not at all! I was already twitchy about trying to blog about presentations and reducing the full story to a couple of bullet points. But (stuck record) at least blogging allows people to expand on and/or correct the comments in the same place they were made. i.e. Tom Kyte's 'yes, but what about ...' point he reiterated in his keynote.