Nov 4: The Reality Gap (2) - Disaster Recovery
Because it's based around a single shared database, RAC will not (and was never designed to) protect you from
- Database corruption
- Erroneous updates or deletes by a user (or DBA!)
Then again, neither will
- SRDF (and other storage-level replication tools)
- Data Guard
- Other replication solutions (e.g. Shareplex and the like)
Unless you have a delay built in to the replication that gives you time to notice the screw-up and do something about it.
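To make the point about the delay concrete, here is a toy Python model. It is only a sketch and makes no attempt to mimic how SRDF or Data Guard really work: the `DelayedReplica` class, the `run` function and the integer "ticks" are all invented for illustration. A change is shipped to the replica straight away, but only applied once it is at least `delay` ticks old.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedReplica:
    """Toy replica that applies each shipped change only after `delay` ticks."""
    delay: int
    data: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # (commit_tick, key, value)

    def ship(self, tick, key, value):
        # Changes arrive immediately but are queued rather than applied.
        self.pending.append((tick, key, value))

    def apply_due(self, now):
        # Apply, in commit order, only changes at least `delay` ticks old.
        still_waiting = []
        for tick, key, value in self.pending:
            if now - tick >= self.delay:
                if value is None:          # None models a delete
                    self.data.pop(key, None)
                else:
                    self.data[key] = value
            else:
                still_waiting.append((tick, key, value))
        self.pending = still_waiting


def run(delay):
    """Replay one good change and one screw-up; return (primary, replica)."""
    primary = {}
    replica = DelayedReplica(delay=delay)

    def change(tick, key, value):
        if value is None:
            primary.pop(key, None)
        else:
            primary[key] = value
        replica.ship(tick, key, value)

    change(0, "orders", "10,000 rows")   # legitimate load at tick 0
    replica.apply_due(now=4)
    change(5, "orders", None)            # the human screw-up at tick 5
    replica.apply_due(now=6)             # someone notices one tick later
    return primary, replica.data


if __name__ == "__main__":
    print("no delay:    ", run(delay=0))
    print("4-tick delay:", run(delay=4))
```

With no delay the replica faithfully copies the mistake and both copies are empty. With a 4-tick delay the delete is still sitting in the queue when somebody notices, so the replica still holds the data and you have something to recover from.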
When it comes to protecting ourselves against disasters, why do we spend so much time and money on major site disasters, but so much less time and brain-power on what I would suggest are far more common: simple human screw-ups?
Think about it :-
- How many times have you screwed up or been sitting next to someone who did?
- How many times have you lost a machine room because of an earthquake, fire or explosion?
- How many times have you had to restore a database to cure a problem?
- How many times have you had to cut-over to your DR site for real? (Not in a practice run)
I'm not saying people shouldn't have DR plans in place, but I'm tired of hearing SRDF sold to managers as the solution to all problems! Reality Gap!
#1 - Carel-Jan Engel said:
2007-11-04 11:45
And then this, from real-life experience: Damagement bought (A)SRDF from this three letter storage vendor, and the storage vendor causes cache corruptions due to insufficient testing of some microcode upgrade (CT had another base version than the one tested). They decide to perform a DR, but on the eve of the first full move to the DR site someone realizes the move-back has never been tested (nor has the move-to, but window-dressing small scale tests claim to prove the opposite). So the DR is canceled, the multi-multi-million EURO DR-site is rendered obsolete and the business has to sit through the unplanned outage. Lessons learned: none. The same vendor creates similar problems two years later, and the DR-site is still not trusted enough to prevent a day-and-a-half outage...
#2 - Noons said:
2007-11-04 11:59
what Carel just mentioned is so typical, it's almost frightening!
just recently been through one of these episodes:
-meeting to "discuss DR/HA test process".
-silly old me asks the obvious question: when do we test fall-back to normal?
-(sound of spanner hitting the gears...)
amazing: 40 years of IT and we still haven't figured out that life goes on beyond the next event: we need solutions that provide continuity, not exceptions!
#3 - Pete Scott said:
2007-11-04 13:55
Far too many people think that RAC = redundant array of computers... The number of systems (proposed or live) I see where RAC is used to protect data is truly frightening.
Often I am fortunate in designing / migrating DW systems to completely new hardware & storage - before I go live I insist that we DR failover (assuming the customer has bought DR) then trash the new system to prove that our recovery from backup works (and document how long it takes to bring back). Perhaps my way of doing things is not that common though.
I remember Carel-Jan's talk at the Miracle Scotland DBF meeting on recovering from logical errors with Data Guard and the value of imposing a replication delay to protect from propagating bad data across the systems designed to protect you.
#4 - Gregory said:
2007-11-04 14:36
Actually there are differences between SRDF and Data Guard regarding "Human Errors". The "rm *.log on some databases with redo logs named .log" will be worse with SRDF.
I wish somebody could create a post to collect all the "Human Errors" (I'm sure people are very creative when generating "Fun"). Maybe you, as you are well known in the community.
To answer your question and thinking about it, it has been years since I saw something broken on a production system (Should I try to bet on something ?). What I believe in this area is not technology based at all, it's :
- Drastic restricted access to production
- A true and well used QA subsystem
- Don't touch anything unless you really cannot avoid it (BTW Patch Sets are really necessary!)
- Test PITR.
#5 - Doug Burns said:
2007-11-04 21:16
Towards the end of the presentation, and after much moaning about The Reality Gap, I thought it would be useful to offer some positive suggestions to address the gap.
My primary suggestion was to expose the truth, because only by doing so can we start to address the real problems that exist.
Therefore, thank you all for adding to the pool of truth. That's the best way to make things better.
I only hope we're not just preaching to the converted.
#6 - Doug Burns said:
2007-11-04 21:18
Carel-Jan,
Wise words as usual.
Lessons learned: none.
Mmmm, that sounds familiar. It's a little tiring, isn't it?
#7 - Doug Burns said:
2007-11-04 21:20
40 years of IT and we still haven't figured out
Thank goodness we don't build major road bridges, eh?
#8 - Doug Burns said:
2007-11-04 21:26
The "rm *.log on some databases with redo logs named .log" will be worse with SRDF.
Definitely, but the sad truth is that managers love this exact copy of failure in two locations! (Well, I'm sure it wasn't described to them that way, otherwise the love affair might be short)
Maybe you, as you are well known in the community.
Oh, no. I wish you hadn't said that. Does that make me a 'luminary'? (Sorry, private joke for myself and perhaps one other ...)
What I believe in this area is not technology based at all
Hallelujah! You're absolutely correct there. Technology is relatively easy. Applying sensible thought and taking a disciplined approach seems to be much more complicated.
I wish somebody could create a post to collect all the "Human Errors"
Yes, that sounds like a good idea. Let me think about it and post something soon.
#9 - Tim Hall said:
2007-11-05 08:22
Doug:
No need for DR. Just issue the "FLASHBACK DATACENTER" command, new feature in Oracle 12.
Cheers
Tim...
#10 - Slater said:
2007-11-05 08:34
When discussing DR with management who ask questions about the current environment, the scene from A Few Good Men always plays through my head.
Jessep: You want answers? Kaffee: I want the truth! Jessep: You can't handle the truth!
#11 - Niall Litchfield said:
2007-11-05 09:27
very nice post and thread - I'm going to shamelessly steal Pete's Redundant Array of Computers acronym as well, it's just brilliant.
I think however that there is one slant that I might bring to this and that is that, in most cases, DR/BCP plans are not in fact made in order to protect against disaster but in order to satisfy regulatory or audit requirements. If a DBA or other privileged user screws up production then that can be managed appropriately (fire or educate as appropriate). However having some tin somewhere else and a report saying how much has been spent to 'comply' and how it was done protects the job and/or meets the performance target of the manager responsible. In other words so long as we incentivise managers to locate tin in a couple of places and don't incentivise them to test/prove DR we don't get proven DR.
I don't think it any coincidence that banks and the like tend to do DR quite well - the incentives are often quite clear there.
#12 - Doug Burns said:
2007-11-05 18:05
Nice
I'll remember that one. I can probably even manage a pathetic Jessep impersonation at the same time ...
#13 - Doug Burns said:
2007-11-05 18:13
Phew, I'll be able to retire then ...
#14 - Doug Burns said:
2007-11-05 18:19
Niall,
I think your comment is both
a) Spot-on; and
b) Utterly depressing
That's what it's all about, isn't it? As long as there's a bit of paper which confirms compliance, who cares too much if it *means* anything.
Yours,
A grumpy old man.
#15 - Tam Nguyen said:
2010-06-03 12:44
Hi Doug,
Not sure if you remember me but I met you at HOTSOS in Dallas this year.
I empathise with you on this one.
I have been out at customer sites where they have been sold SRDF, and the servers at their DR sites sit idle collecting dust whilst they pay hardware maintenance, hoping that one day a DR failover will save them from the inevitable. And even then, it does not prevent block corruption from being replicated.
Why not use Active Data Guard, when you can fail over within minutes and take full advantage of the DR hardware by using it for reporting?!
#16 - Doug Burns said:
2010-06-03 14:27
Normally I would be in 100% agreement but we're having some problems with a Standby at the moment that we probably wouldn't see in an SRDF configuration *but* we have an extremely high volume of redo generated and the DG config probably isn't what it should be.

