Recovery Design Part 5 - Wrap-up

Doug's Oracle Blog

Nov 7: Recovery Design Part 5 - Wrap-up

It's been a very mini-series but I hope I've highlighted some of the challenges when designing systems that need to recover from failures quickly. Here are a few ideas and I'm sure others could add their own.

  • Careful planning is essential.
  • Ask tough questions and imagine the worst.
  • Always be aware of the requirements. The design is entirely dependent on them.
  • You need to consider every single point of failure. RAC still implies a single shared database that can fail (and probably will some day).
  • Test.
  • Keep things as simple as possible and minimise human intervention. Humans make mistakes.
  • Test.
  • Document. It reduces the number of mistakes humans might make.
  • Test.
  • Employ technically strong, responsible humans, and I swear that isn't an advert, just recognition that the more complex the configuration, the better your people need to be.

I mentioned in the previous part that I would reach some unsatisfactory conclusions. Some of those are above - because I'm not telling you what to do, just mentioning some considerations - and here's another.

There really is no one-size-fits-all solution and (slay me for saying this, purists) the chances are that you will not deliver what the business wants, or what you hoped you would, but a compromise resulting from the balance of risks, costs and expectations. It is far better to admit up front that if things go very wrong it might mean four hours of down-time than to come up with a pretty design document full of the buzz-words your managers are looking for, only to spend four hours panicking when disaster strikes after the system's live and five days explaining to managers why 'it didn't work'.

As for the satisfactory sources of information, here are a few.

I already mentioned Mogens' first-class critique of RAC and I only recently came across another excellent RAC paper, by James Morle, that looks at RAC Connection Management. Take a look at James' paper and ask yourself if you understand what 'Transparent Application Failover' really means. It sounds perfect, but I guess it doesn't do what most of you would expect (unless you've worked with it already).

I've recently started reading the eBook version of Julian Dyke and Steve Shaw's Pro Oracle Database 10g RAC on Linux. Although I haven't got into the guts of it yet, I've read all of the appendices and the first few chapters and the tone is pretty much what I've used in this series, but much more detailed. It's very strong on high availability concepts so if you've never worked in that type of environment before, it's a nice way to get started.

Oh, one more thing. Vidya asked about disk-to-disk synchronisation and I thought I'd include a link for anyone who's interested in that kind of thing, although it's clearly vendor-specific.

The series is finished for now, but I think there'll be future blogs about how the implementation of the new system goes, particularly any problems I run into.

Looks like I'll be back on PX for a couple of weeks ;-)

Posted by Doug Burns

Comments

#1 - Noons said:
2006-11-08 08:19 - (Reply)

Another way of winning the VHA war is to use good old "divide and conquer". This is precisely what was done at my previous job:

We received billions of "clicks" from various search engines. Each of these consisted of a target url, a source url, the search terms used and an optional cookie to track sales, pages travelled, clients, etc. Each of these was charged to us by the engines and each was in turn charged back to the clients owning the target sites.

As you can expect because of the vast volumes, the need to track all clicks was of paramount importance. This is not .4X9s HA, this is 0 loss, 100% availability, all integer numbers, period! No excuses. We could lose the odd "click" here and there, but definitely no total service outages, none whatsoever.

Needless to say, there is not only the issue of making it work. What about new versions? And testing? And maintenance? etcetc.

The solution turned out to be Apache servers, heaps of them, on load balancers. Some nifty add-on code to make Apache log more than it usually does, then offline batch loading of the logs into analytics databases so we could do all sorts of fancy footwork with search terms, effectiveness, global marketing campaigns, etcetc.

If any of the servers failed, we'd lose the current "click" being processed but nothing more: all load would be diverted to the next ones in the round robin and bingo, away it ticks.
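As a rough illustration of the round-robin failover described above, here is a minimal Python sketch. It is not the actual load balancer configuration from that site: the class name `RoundRobinBalancer`, its methods and the server names are all hypothetical, and a real balancer would detect failures through health checks rather than an explicit `mark_down` call.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical stand-in for a load balancer (illustration only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # Assumption: something external (e.g. a health check) reports failures.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call, skipping failed ones.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers left")


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
first_pass = [balancer.next_server() for _ in range(3)]

# One server fails: later requests simply flow to the remaining servers.
balancer.mark_down("web2")
after_failure = [balancer.next_server() for _ in range(4)]

print(first_pass)
print(after_failure)
```

The point of the sketch is that a failed server costs, at most, the request it was handling at the time; everything after that is routed around it without human intervention.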

In 24 months, we lost 10 minutes of clicks. Even that was due to a misconfiguration of one of the load balancers, not our application code.
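To put those numbers in perspective, the arithmetic can be sketched in a few lines of Python. The function name `availability_percent` is made up for this example, 24 months is assumed to be 730 days, and the second figure assumes a single four-hour outage (the figure used in the post) in a 365-day year.

```python
MINUTES_PER_DAY = 24 * 60


def availability_percent(downtime_minutes, period_days):
    """Percentage of the period during which the service was up."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_minutes / total_minutes)


# Assumption: 24 months taken as 730 days, with 10 minutes lost.
two_years = availability_percent(10, 730)

# Assumption: one four-hour outage in a 365-day year.
four_hours = availability_percent(4 * 60, 365)

print(f"10 minutes in 24 months: {two_years:.5f}%")
print(f"4 hours in one year:     {four_hours:.3f}%")
```

Under those assumptions, ten minutes in two years works out at roughly "five nines", while a single four-hour outage in a year is closer to "three nines".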

That's not bad, but at what cost point?
Well, all blade commodity servers, RHAS3, Apache and some in-house brain sweat.

Sometimes it pays to think outside the square...

#2 - Doug Burns said:
2006-11-08 09:12 - (Reply)

That, ladies and gentlemen, is what I meant by this

"Employ technically strong, responsible humans, and I swear that isn't an advert, just recognition that the more complex the configuration, the better your people need to be."

I've seen brain power beat financial power many times, although it helps if you have a good budget too as it expands your options.

#3 - M-square said:
2006-11-08 10:22 - (Reply)

"...come up with a pretty design document with the buzz-words that your managers are looking for and then have to spend four hours panicking when disaster strikes after the system's live and five days explaining to managers why 'it didn't work'"

Ahh ... yes ... RealLife(tm). A beautiful summary.

It fits nicely with my other favorite/pet-hate observation: The manager that tells you to choose the (technically) wrong option for reasons of cost/time, telling you that in the event of a disaster he will take full responsibility. Which he does, in a fashion, by ordering you to do overtime to fix the problem when the disaster does strike.

Otherwise, back to the main point, your Recovery Design Mini series: Yes, yes, true, done that, ok, yes, good way of clarifying that, yes, ok ... (and so on) I agree with it.

#4 - vidya said:
2006-11-08 16:40 - (Reply)

Doug, great wrap-up and thanks for including the link.

#5 - Mathew Butler said:
2006-11-09 10:05 - (Reply)

Thanks for the ref to the James Morle paper. I think though that your blog link is wrong.

I managed to find the paper at his web site though (followed the whitepapers link from http://www.scaleabilities.co.uk)

Regards,

#6 - Doug Burns said:
2006-11-09 10:28 - (Reply)

Good spot, Mathew - fixed.

Note that you need to register to access the document, but it's well worth it.

#7 - Amit Kulkarni said:
2007-02-16 19:58 - (Reply)

Sun has their AVS, which supports what Vidya and others probably want; again, it's Solaris-specific. Please look at http://blogs.sun.com/AVS/

