Mar 14: PX and "the Magic of 2"
Now that I've had time to sit back and relax and absorb things a little, I thought I should fix a couple of outright bugs in the recent PX paper (Later update - link to the paper) and revisit the conclusions before moving on to testing different aspects. First, the bugs :-
- I've changed processes to 800 and removed the comment about parallel_max_servers never exceeding 385
- I've changed the various references to the EMC stripe width to 960Kb, not 970Kb.
First, some additional info that was available to the Hotsos Symposium attendees as the very last graph I put up, but isn't in the paper. (I'll tell you - you lot owe me! Not only do you get almost twice-daily updates during the conference - you get presentation updates, too!)
I've talked about this before in this blog but, in essence, the tests that I ran didn't use a reasonable amount of CPU because they were based on unusually empty blocks of data with PCTFREE set to 90. Your application might parallel scan a large table and only be interested in 10% of the columns, so it's not a completely redundant test, but it's probably not representative of standard operations and certainly not very CPU-intensive. When I changed the blocks to use PCTFREE 10, I could squeeze 65 million rows into less space than the original 8 million rows. When I re-ran the tests against the new more densely packed blocks, the benefits of PX were more apparent and at higher Degrees of Parallelism.
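As a rough sanity check on those row counts, here's a back-of-the-envelope sketch in Python. It assumes (my simplification, not something measured in the tests) that the row length is unchanged, that block overhead can be ignored, and that inserts fill each block right up to the PCTFREE limit, so the usable share of a block is simply 100 minus PCTFREE. The function name is made up for illustration.

```python
def rows_per_block_ratio(sparse_pctfree: int, dense_pctfree: int) -> float:
    """How many times more rows fit per block at the denser setting.

    Assumption: usable space per block is (100 - PCTFREE) percent,
    ignoring block header overhead and any change in row length.
    """
    return (100 - dense_pctfree) / (100 - sparse_pctfree)


# PCTFREE 90 (sparse blocks) versus PCTFREE 10 (dense blocks)
ratio = rows_per_block_ratio(90, 10)

# Rows that would fit in the space originally holding 8 million rows
original_rows = 8_000_000
dense_capacity = int(original_rows * ratio)

print(ratio)           # 9.0
print(dense_capacity)  # 72000000
```

On those assumptions, the denser blocks hold about nine times as many rows each, so the space that held 8 million rows could hold roughly 72 million. That is at least consistent with 65 million rows fitting into less space than the original 8 million.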
A quick interlude on graphs. Putting aside the very funny 'Whoo! Yeah!' responses from Mogens and him taking pictures of my beautiful graphs (I really enjoyed it and it was hard to carry on and not just burst out laughing), I think they can be both useful and dangerous. Let's deal with the danger first. The graphs reduce the detail, particularly because I wanted to leave the noparallel numbers in there so most of the values are compressed into a very small vertical range. They also don't really prove anything because there could be all sorts of waits, execution plan changes, errors (this could be a long list ...) hidden in there. Let's deal with the useful now. I wanted to illustrate that, even though things might get a little quicker at higher DOPs, the big benefits come when going from noparallel to parallel 2 and also that the additional (and expensive) CPUs didn't help much at all.
Here is the original Graph 5 from the paper, which shows run times when running the Hash Join example against sparsely-populated blocks (PCTFREE 90) on the Sun E10K. (I realise this might not display too well, but I have the source data and images if you want a copy)

And now the same graph against densely-populated (PCTFREE 10) blocks. Because I only had a limited amount of server time left, I didn't run it for all 12 CPU counts individually and the legend's in a strange order, but I think there's enough here to see the difference.

Note that these two tests use different data volumes and numbers of blocks, and that the response times are very different because the second test is doing so much more work, albeit on fewer blocks.
The main difference I see is that the additional CPUs offer much bigger response time improvements as they are added but I hope it's also apparent that the fastest response times are at higher DOPs, up to the 6-8 range. The response time 'knee' is more curved. Now then, what does this all mean?
In fact, in the last few slides of the presentation, I talked about a couple of things:
- That the biggest benefits are gained when going from noparallel to parallel 2 and that benefits diminish rapidly after that. I think that's been apparent in all the tests that I've run and I still believe that to be the case, although I'm looking forward to being proved wrong. That's what I meant by the deliberately provocative comment about sticking to parallel 2, regardless of your configuration.
- That I'd be very interested in seeing any results that showed even a DOP of CPU * 1 that gave the fastest response time. Looking at these results though, and some other results on some weekend tests on the 4 CPU ISP4400, it's pretty close to that already.
You can see again the big difference in response times between 1 CPU and 4 CPUs, so now there is value for money in those extra CPUs! However, of particular interest here is the 4 CPU series. There is a clear (albeit small) performance improvement at DOPs of 5 and 6. Now, if you double that (for the two slave sets that this example uses) and add the Query Co-ordinator, there are a total of 11 or 13 processes working on this query, which is more than 2 * CPUs, or 8.
That would fit in with Oracle's (and Cary's) advice that around 2 * CPUs is good, but that more than 2 * CPUs would suit jobs that are particularly I/O-intensive or where the I/O subsystem isn't quite up to the job (which it probably isn't in this case).
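For anyone who wants to check the process-count arithmetic, here's a minimal sketch in Python. It only encodes what's described above: two slave sets per query plus one Query Co-ordinator. The function and parameter names are made up for illustration and aren't anything Oracle exposes.

```python
def px_process_count(dop: int, slave_sets: int = 2, coordinators: int = 1) -> int:
    """Total processes working on one parallel query.

    Assumption: each slave set runs `dop` slaves, and a single
    Query Co-ordinator drives them.
    """
    return dop * slave_sets + coordinators


cpus = 4
for dop in (5, 6):
    total = px_process_count(dop)
    # Compare against the "around 2 * CPUs" rule of thumb
    print(f"DOP {dop}: {total} processes, 2 * CPUs = {2 * cpus}")
```

So a DOP of 5 gives 11 processes and a DOP of 6 gives 13, both comfortably above the 2 * CPUs figure of 8 on the 4 CPU box.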
Hopefully that clarifies a couple of things. At some point in the future (personal stuff permitting) I plan to blog in more detail about the wait event profile of the tests, because I felt that was lacking in the paper and almost entirely absent from the presentation.
Comments
#1 - Mathew Butler said:
2006-03-14 14:52
Hi Doug, I couldn't see a link to your paper from your blog. Very interested to read it.
Best Regards,
#2 - Doug Burns said:
2006-03-14 15:40
Mathew,
Good spot. I've added a link in the main blog to the paper:
http://oracledoug.com/px_slaves.pdf
(Same link with a .doc extension if you prefer a Word document)
Cheers,
Doug

