Nov 2: Recovery Design part 4 - You need more than one database
There was a good reaction to part 3 with sensible comments from people who've obviously 'been there, seen it, done it, bought the t-shirt' and handled their share of tricky system failures, so thanks to all. The comments are interesting reading for other perspectives and ideas.
Although the sub-title to this part is 'you need more than one database', I hope that the last part already convinced you that, if you really want to protect your data from the widest possible range of failures with the fastest possible recovery, some form of replication is a 'very good idea' (TM). I'll divide how you decide what approach to take into two groups of factors.
Technology
This is the easy bit. What particular technology you use is a matter for you, but there are options. I've worked at sites which used one or more of the following. (I haven't included the use of Oracle Streams here because I haven't used it personally, but anyone who has or does should feel free to comment.)
- User-configured Standby Databases. I include these for historical reasons, because it was the first approach I used on a 7.0.15 database. I vaguely recall that Oracle offered full instructions on Metalink for maintaining a Standby Database by shipping archived redo log files, but didn't officially support it. Nevertheless, it worked well for us although we only ever used the Standby for disaster recovery and as the source of the backups and didn't use it for user reporting.
- Oracle Dataguard. The modern name for a Standby Database and I tend to forget to use it because it sounds to me like a marketing tag for a basic concept. However, I must admit that the modern implementation is much improved and easier to use and people tend to be more familiar with the name 'Dataguard' these days, so I'm trying to use it

- Quest's Shareplex. I worked at one site that used this for everything - dozens of databases worldwide, some of them extremely active. We did run into one or two problems with it but my personal hunch is that it was no worse or better than Oracle's technology, it was just that this particular site had the hots for standby databases. It was used so heavily that we would have probably run into some problems whatever we used. (Although I know a few ex-colleagues who might want to leave a few curses in the comments!)
- Extra disk mirrors. By this I mean something like synchronising disks on a SAN so that updates to one set of disks are propagated to another set, either syncronously or asynchronously. We use a lot of this at my current site, but there are maximum distances if you want synchronous updates and unless you have the technology and the storage administrators who understand it, this won't be an option for many people.
Speaking of synchronous and asynchronous updates brings me on to the next area to consider.
Update lag (or currency)
There is a tension at the centre of any disaster recovery-related replication design.
On the one hand, we want the copy or standby database to be the most up-to-date copy possible because if we need to use it, we'll have less redo information to apply to bring it up to date, which will improve recovery time and we'd ideally like to implement a no data loss solution. i.e. Unless a transaction commits on the Standby, it won't commit on the Primary so the databases should always be identical. This is what Data Guard's Maximum Protection mode offers and it sounds like another excellent idea.
If you ask the users how much down-time they can stand and how much data they can afford to lose (and you absolutely have to ask), then their first answer will nearly always be - no down-time and no data loss.
On the other hand, that could cause a number of problems.
- The Primary is now dependant on the Standby being operational. If the Standby is down, Oracle has to shutdown the primary or the Standby wouldn't be synchronised any more. That doesn't sound promising to me because it makes everything more fragile and failure-prone.
- In the event of someone screwing up on the Primary, you can guarantee that the same error is going to exist on the Standby - that's the way it works!
The worst news is that I'm not going to give you any answers here - just more questions. I'm like that
In the old days there was a built in time lag of anything that was in the current online redo log file, plus how long it took to ship the archived copy over to the standby server and apply it. These days you have more options and need to think about them. For example, a time lag of several hours would give you time to notice someone screwed something up on the primary before allowing it to be applied to the standby but, then again, maybe flashback features are more useful for that kind of thing?
I think the next blog will probably be the last - wrapping up this mini-series, forming some unsatisfactory conclusions and offering some very satisfactory sources of further information.
#1 - Vidya said:
2006-11-02 20:24 - (Reply)
great write up. How about a combination of standby and shareplex. That's what we used at one Customer site. The only reason I liked shareplex,the Replication lag time was minimal and the times I have hated it was when source objects would change or get invaid causing errors with Shareplex Replication Logs and getting paged at night for the same.
synchronising disks on a SAN ?? are there many sites that have this? most often I have been quite unsuccessful with convincing managment on infrastructe changes.
Thanks again great sereis on Recovery Design!!
#2 - Doug Burns said:
2006-11-02 22:25 - (Reply)
when source objects would change or get invaid causing errors with Shareplex Replication Logs and getting paged at night for the same.
I shouldn't smile, but I did when I read that. That was what I meant by (Although I know a few ex-colleagues who might want to leave a few curses in the comments!)"
synchronising disks on a SAN ?? are there many sites that have this?
I would *guess* that most sites that have expensive SANs and storage arrays (which implies a certain budget!) use this kind of solution. EMC, for example, push that kind of thing quite heavily and we use it on a number of our most critical production databases.
#3 - Carel-Jan 2006-11-19 17:51 - (Reply)
On the one hand, we want the copy or standby database to be the most up-to-date copy possible because if we need to use it, we'll have less redo information to apply to bring it up to date, which will improve recovery time and we'd ideally like to implement a no data loss solution. i.e. Unless a transaction commits on the Standby, it won't commit on the Primary so the databases should always be identical. This is what Data Guard's Maximum Protection mode offers and it sounds like another excellent idea.
If you ask the users how much down-time they can stand and how much data they can afford to lose (and you absolutely have to ask), then their first answer will nearly always be - no down-time and no data loss.
On the other hand, that could cause a number of problems.
1. The Primary is now dependant on the Standby being operational. If the Standby is down, Oracle has to shutdown the primary or the Standby wouldn't be synchronised any more. That doesn't sound promising to me because it makes everything more fragile and failure-prone.
2. In the event of someone screwing up on the Primary, you can guarantee that the same error is going to exist on the Standby - that's the way it works!
I have to disagree on 2): You can synchronize the redo to the standby the way you decribed, but still maintain a DELAY in applying the redo. This will take extra time for recovery, but allows for recovery from human errors. Many sites use 2 standbys these days: One locally, maintained in SYNC mode with no delay or just a small delay, and one remote, maintained in ASYNC mode, and a DELAY of 2-8 hours maintained. This setup allows for fast (local) failover if the server crashes. DR to a remote site often implies DNS changes etc., allowing for a longer recovery time for the database whatsoever.
#4 - Doug Burns said:
2006-11-19 23:25 - (Reply)
Thanks Carel-Jan.
I'm aware of what you're proposing but maybe I wasn't clear enough about what I was talking about which was a single Standby solution in SYNC mode (or maximum protection mode). That's what I meant by this
a no data loss solution. i.e. Unless a transaction commits on the Standby, it won't commit on the Primary so the databases should always be identical. This is what Data Guard's Maximum Protection mode offers and it sounds like another excellent idea.
That's what that section was discussing. Based on that, what I was saying was right, wasn't it?
Really my point was that you probably *don't* want to use that as your only option


It's been a very mini-series but I hope I've highlighted some of the problems with designing systems that need to recover from failures quickly. Here are a few ideas and I'm sure others could add their own.Careful planning is essential.Ask tough questio
Tracked: Nov 08, 00:31
I'm going to keep this one short and sweet because I've talked about it before. Because it's based around a single shared database, RAC will not (and was never designed to) protect you from Database corruptionErroneous updates or deletes by a user (or D
Tracked: Nov 04, 09:45