If you're gonna screw up ...

Doug's Oracle Blog


Aug 31: If you're gonna screw up ...

... you should always aim high (or desperately low would be more accurate).

I was sitting at my desk today, deleting some old databases (that phrase in itself should be warning enough) when I typed in the following

dunha1e:oracle#rm -fR *

I looked down at my notes to have a think about what I was doing, then typed this

ls -ltra /ora/data2

What might not be so obvious from the way I've written that out is that I was missing a CR/LF in there, so what I actually typed was

dunha1e:oracle#rm -fR *ls -ltra /ora/data2


[dripping_sarcasm]Fortunately I managed to remember the CR/LF after the 'second command'.[/dripping_sarcasm]

I realised what I'd done immediately. For those who aren't rolling about on the floor laughing or absolutely disgusted already, I deleted the following

  • *ls
  • -ltra
  • /ora/data2 and all of its subdirectories and all of the contents
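
To see why those three things became targets, it helps to look at how the shell splits the fused line into words before rm ever runs. What follows is only a sketch, run in a scratch directory, with a made-up show_args function standing in for rm so that nothing is deleted. The file names are invented for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch system01.dbf users01.dbf

# Hypothetical stand-in for rm: print each argument it receives, one per line.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Same shape as the fused command line, with show_args in place of rm.
# No file name here ends in "ls", so the pattern *ls is passed through as typed.
parsed=$(show_args -fR *ls -ltra /ora/data2)
printf '%s\n' "$parsed"
```

The function receives four words: -fR, *ls, -ltra and /ora/data2. An rm that stops reading options at the first file name, as POSIX describes, treats the last three as things to delete, which matches the list above. An rm that reorders its arguments (GNU rm does by default) would more likely have complained about -l being an invalid option and deleted nothing, so the outcome depends on the platform.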

First, the good news (there's not much)

  • This was a development server. That's probably why I was more casual than normal but, make no mistake, if you're a developer or a tester - this is Production for you.
  • We use RMAN so recovery should be straightforward.
  • In spite of their baser instincts, everyone rallied round to the cause and my team lead didn't have a fit

Now some bad

  • There were about eight databases using files on /ora/data2!
  • The thing with development servers is that they're more of an untidy mess with less surrounding structure and procedures. Which meant a lot more thought needed to be put into what steps would get us back to where we were. (Entertainingly, I'd had a strong argument with my team lead earlier in the day about imposing more structure on things. Maybe he'll do that by revoking my access privileges ;-))
  • For example, we don't back up all of the databases on that server. For some of them we rely on exports. I happen to think an export should never be considered a backup of a database, as we're about to find out when we try to recover them :-(
  • When I started to recover the first few databases, our standard recovery shell-script was falling over. (No recovery test, you see)
  • I've only been on this new team a few weeks, so I don't know the environment that well. Nothing beats practice and prior knowledge.
  • When I started to use RMAN manually, I ran into the most bizarre bug when I retried a recovery in the same session after one had failed. (You'll need a Metalink account for that link.) We spent a bit of time sorting that one out!

With the help of a couple of colleagues, we got the most important 3 databases recovered with no data loss and some flat-files restored too. However, there are about 5 databases that we're going to need to recreate from scratch and then import export files. Normally I would have stayed, but I have a big meeting tomorrow I need to prepare for, so our evening shift guy is progressing the work and I'll be in first thing in the morning to try to finish the clean up.

In fairness, it's a long time since I've made such a stupid mistake but there was some joking discussion about whether this would make it into my blog. Of course it would. Everyone makes stupid mistakes sometimes. However, a little more care would have avoided this (as would have the -i flag on the rm command!) and it was particularly stupid, so I'm not proud of it. However, what's most important when you do make mistakes is

  • You're honest about them.
  • You fix them.
  • You learn from them.
  • You thank anyone else who helped you fix them.

I'll get my coat :-(

P.S. How's this for a punchline? At the Miracle Database Forum in a few weeks' time I'll be giving a presentation in the subject area of DBA Worst Practices. Well, I reckon I probably have some suitable material!
Posted by Doug Burns Comments: (28) Trackback: (1)

Trackbacks

Another way to recover from a screw up… :)
Doug Burns wrote in his blog how he accidentally messed up his Oracle database development environment. Doug, I remember that a little while back, we had a short discussion over at Christo Kutrovsky’s follow-up blog entry to my ZFS post. If you&...
Weblog: outside the box
Tracked: Sep 01, 03:32

Comments

#1 - Pete_s said:
2006-08-31 20:54 - (Reply)

One early lesson for me was to check the directory I was in before deleting files or, in my case, find . -exec rm {}
.... not a good idea from root on a production server - and my server rebuild kit was in another office five miles away.

And all because some buggy software created zero-length files for each transaction and we ran out of i-nodes - there were too many files for rm to recurse.

But confession is good for the soul

#1.1 - Doug Burns said:
2006-08-31 20:58 - (Reply)

Suffice to say I *do* know better than to go around deleting tons of files without being careful about what I'm doing. (Think how often I've got this right)

But something just *happened* today :-(

I think the next 'Housemate of the Month' may have been a distraction ;-)

#2 - Bill S. 2006-08-31 21:13 - (Reply)

Well, shucks, Doug. My belief has always been that if you haven't screwed up lately you just haven't NOTICED it. ;-D

It is always a kindness when this happens on development. You can find solace in the thought of what your recovery would have been like were this NOT an Oracle database.

Glad to see you're human. Well, glad may not be the right word but you know what I mean.

Cheers!

Bill S.

#2.1 - Doug Burns said:
2006-08-31 21:26 - (Reply)

A constant thought that was running through my mind, and several others mentioned it too (like the DBA bogey-man), was 'imagine if this had happened on our super-important production call centre databases'.

I've thought about that enough, though, and don't want to think about it again!

#3 - grash 2006-08-31 21:36 - (Reply)

It happens to everyone, eventually.

I think we relax more on a dev server. We know what we're doing, we've been doing it for years. No change control, it's only dev!

I agree that exports are not backups, but then there's the cost of backing up a database that is only for dev! Surely they can restore their code from the database they use for version control, but then there's no data. I think what we (DBAs/management/production support) forget is that dev databases are developers' jobs. We don't let them near production, but this is where they work; if these databases are down, we're looking at several teams not being able to work.

Anyway, it just proves you're one of us, and I did lend a hand when I stopped laughing.

#4 - jamie 2006-08-31 22:30 - (Reply)

Maybe it was a Karma thing that made you make a mistake? I prefer strong discussions to arguments :-)

Exports are better than nothing and are a cheap way to cover your ....

#4.1 - Doug Burns said:
2006-08-31 22:38 - (Reply)

I had a feeling you lot might show up ;-)

For the rest of you ... Jamie and Grash are colleagues. Jamie's also the team lead who didn't have a fit. Turns out he can still recover databases too, which is handy.

#5 - SwitchBL8 said:
2006-08-31 22:43 - (Reply)

Rule 1: Never, never, ever use the command:

rm -fR *

The shell command should recognise it, and give you an electrical shock via the keyboard.

Rule 2: Always, yes ALWAYS use relative paths.

From the /ora directory give the command :

rm -rf data2/*

If you're in the wrong directory, a "*" would have succeeded, a "data2/*" not.

Rule 3: when absolutely sure about the removal of files, wait for it to finish. Check that it has finished, or check that the files are actually gone. (this last rule would have saved your behind).

Commenter #1 has a strong point in using the find command, but most mouse-addicts don't know how, so...obey rules 1, 2 and 3.

And now the confession: I learned the rules the hard way. Duh.

#5.1 - Doug Burns said:
2006-08-31 22:47 - (Reply)

Can you elaborate on this bit?

Rule 3: when absolutely sure about the removal of files, wait for it to finish. Check that it has finished, or check that the files are actually gone. (this last rule would have saved your behind).

The rest I know, but (and I'm smiling to myself as I write this) I didn't need to follow the rules on this occasion because I knew what I was doing and where I was ;-)

Maybe I'm just a bit brain-dead - long night, but I'm not sure what you mean by point 3.

#5.1.1 - SwitchBL8 said:
2006-09-01 00:19 - (Reply)

If you're going to perform a removal of a lot of files, wait until it's done. Check that it's done by looking at your screen and that you returned to your prompt. Then check with "ls" that the files are gone.

I know, they are obvious rules, but as said: they will save your behind.

#5.1.1.1 - Doug Burns said:
2006-09-01 00:21 - (Reply)

Sorry, I'm still utterly confused. How would checking that all of the files had been deleted have prevented me from deleting them in the first place?

Aren't the only rules that are going to stop me screwing up here the actions I should take *before* hitting return?

#5.1.1.1.1 - SwitchBL8 said:
2006-09-01 09:45 - (Reply)

If you had checked the removal, you would have noticed that you did NOT hit return. Instead, you had already typed in the next command.

#5.1.1.1.1.1 - Doug Burns said:
2006-09-01 13:41 - (Reply)

Mmm, I kind of know what you mean, but I'm not convinced. If you'd said - check the statement before hitting return, that would have saved my *ss, but once I'd hit return, checking to see what I deleted (which I did) isn't much use now.

But I know what you mean at least.

#5.1.1.1.1.1.1 - SwitchBL8 said:
2006-09-01 14:07 - (Reply)

You were in the right directory and entered:

rm -fR *

Checking that it did run would have saved your behind. It is basically what you say in your original post: it's difficult to spot, but I forgot to hit the return key.

Anyways: no more replying. Go, and restore...

#5.1.1.1.1.1.1.1 - Doug Burns said:
2006-09-01 14:09 - (Reply)

The good news is - they're all back.

#6 - Dave Whiting 2006-08-31 23:32 - (Reply)

Guess what my first task in the morning is?? I'm a Unix Admin (boo, hiss!) who has had to recover data in exactly these types of situations.

You know what the worst thing is though? People not owning up to *exactly* what they've done. For example, a DBA at work today deleted /bin/* by mistake. I could recover that in 5 minutes from a system backup. However, I've been in situations where people haven't told the whole truth and you find out later that there's other *stuff* missing. Unless I really trust the person is telling me everything, I often recommend we do a full OS restore, if I believe it has been compromised.

Doug, aren't you going to share the story of when you shut down a system in the City while it was in the middle of processing billions of dollars in trades ;-)

#6.1 - Doug Burns said:
2006-08-31 23:43 - (Reply)

Well, you're all coming out of the woodwork for this one, aren't you?!? ;-)

Doug, aren't you going to share the story of when you shut down a system in the City while it was in the middle of processing billions of dollars in trades

Ah, but you see, you've inadvertently helped bolster my credentials ;-) Mmmm, when *was* that exactly? Let me see ... 10 or 11 years ago. As you can see folks, I make these mistakes *all* the time.

(But now I'm waiting for the Sun guys to turn up and tell the oh-so-funny story of the time I shut down the helpdesk system)

Phew ... you try to be honest and what do you get?!?!?

#6.1.1 - Stephen Booth said:
2006-09-01 11:56 - (Reply)

When I was working for a certain international company in a FM team looking after the backend systems for a certain large (formerly public sector) transportation infrastructure company, a contractor accidentally logged into the live core financials system database server at about ten to six of a Friday evening and deleted the database, believing (we surmise) he was logged into the QA system onto which the live data would be loaded from the Friday night backup. About 20 minutes later it was noticed what had happened and a call was raised with us, the contractor in question having by this time cleared his desk and headed for Heathrow to get a plane home. Unfortunately I was the support tech on call that weekend, and it was about 2am on Monday before the system was back up and fully functioning. This was the weekend that I discovered both that adrenalin is brown and sticky and that not all of our backup tapes were as reliable as we might hope. On the plus side I basically made a month's salary over that weekend.

I learned a number of lessons that weekend (I try to learn from the mistakes of others as I don't have time to make them all myself):

1) uname -a is your friend.
2) No matter how tempting, do not use the same password on production as you do on QA, UAT, VT or Dev.
3) Read before hitting enter.
4) Saving a tenner per DLT isn't worth it.
5) Silver (or higher) support is worth it.

Stephen
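
Lesson 1 lends itself to a small guard at the top of any destructive script. A minimal sketch, assuming the script knows which host it is meant for; the require_host name and the host name in the usage comment are made up for illustration:

```shell
# Hypothetical guard: refuse to continue unless we are on the expected host.
require_host() {
    expected_host=$1
    actual_host=$(uname -n)    # the node name, one of the fields uname -a prints
    if [ "$actual_host" != "$expected_host" ]; then
        printf 'Refusing to run: this is %s, expected %s\n' \
            "$actual_host" "$expected_host" >&2
        return 1
    fi
}

# Example use at the top of a cleanup script (host name is invented):
# require_host qa-db01 || exit 1
```

It does nothing for a command typed by hand, but it does stop a script written for QA from doing its work on live.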

#7 - Eric S. Emrick 2006-09-01 03:42 - (Reply)

Well that reminds me of my first DBA assignment back in 1994. I thought I would be a diligent DBA and do some housecleaning. I saw a few files that were around 30MB each, which of course were simply wasting space. Not two minutes went by and my users were calling off the hook wondering why they could not connect to their coveted development database. It was at that moment I realized I must have done something wrong. Indeed, my poor database had its redo logs wiped clean. I was the new kid on the block and fresh out of university. I will always remember that humbling feeling. Today, I alias rm on my machines to prompt me with one more confirmation. I can't say I'll never make a mistake like that again. But, I will try to make it as difficult as possible to reproduce. Wow, that was therapeutic :-)
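
One caveat on aliasing rm to prompt: the usual form is alias rm='rm -i', and POSIX says a later -f cancels an earlier -i, so a command typed as rm -fR would skip the prompt anyway. A sketch of an alternative is a small wrapper that lists exactly what it was handed and asks before removing anything. The careful_rm name is made up, and this is an illustration only, not what Eric describes using:

```shell
# Hypothetical helper: show what would be removed, then ask before removing.
# The answer is read from standard input; anything other than "yes" aborts.
careful_rm() {
    printf 'About to remove, from %s:\n' "$(pwd)"
    for target in "$@"; do
        printf '  %s\n' "$target"
    done
    printf 'Type "yes" to continue: '
    read -r answer
    if [ "$answer" = "yes" ]; then
        rm -R -- "$@"
    else
        printf 'Aborted, nothing removed.\n'
        return 1
    fi
}
```

Because the shell expands wildcards before the function runs, the list shows the real targets. Handed the fused command line from the post, it would have printed *ls, -ltra and /ora/data2, which might have been enough to prompt a second look.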

#8 - Mike 2006-09-01 08:13 - (Reply)

Were the databases still running?

There are other ways of recovering without resorting to backups, that may have got you out of the brown sticky stuff.

Rule#1.. no matter how desperate the situation looks - pick up the phone and discuss with your local SysAdmin..

#8.1 - Doug Burns 2006-09-01 08:30 - (Reply)

Were the databases still running?

>> No

There are other ways of recovering without resorting to backups, that may have got you out of the brown sticky stuff.

>> That Oracle would have been happy with?

Rule#1.. no matter how desperate the situation looks - pick up the phone and discuss with your local SysAdmin..

>> Humph, to add to the crowd of amused onlookers?

;-)

#8.1.1 - Mike 2006-09-01 08:45 - (Reply)

Well, of course, there would be a certain degree of amusement involved, but once we'd stopped laughing, we'd put our thinking caps on.

Have you ever deleted a file from a filesystem which was still open by an application? You may have noticed that the space doesn't return to the filesystem until the application closes it again.

It can be recovered.. as long as the application holds the file open. But unfortunately if your databases were closed, then this particular technique won't work.

Still - same rule applies.. call the sysadmin - we're here to help.

>> That Oracle would have been happy with?
Good question. I've done this many, many times, but I can't explicitly recall if I've done it on Oracle. Let me know if you ever delete any more datafiles and I'll check.. ;-)

Did I mention the time when I unplumbed a network interface on a firewall carrying 30,000 sessions to Halifax's online banking/sharedealing websites? Fixed it before anyone noticed (but still 'fessed up..), but then forgot to patch up the routing table.. D'oh! - the service was out for about 30 minutes during primetime..
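
The open-file technique mentioned earlier in this comment is not spelled out, so the following is a guess at one common form of it rather than Mike's actual method. It is a Linux-only sketch that assumes /proc is mounted: a deleted file stays reachable through /proc/<pid>/fd for as long as some process holds it open. A background sleep stands in for the application, and the file name is invented.

```shell
# Linux-only sketch: recover a deleted file that another process still has open.
workdir=$(mktemp -d)
printf 'pretend this is a datafile\n' > "$workdir/users01.dbf"

# Stand-in for the application: a background sleep that inherits descriptor 3.
exec 3< "$workdir/users01.dbf"
sleep 60 &
holder=$!
exec 3<&-    # this shell closes its copy; only the background process holds the file

rm "$workdir/users01.dbf"    # the name is gone, the data is not

# While the holder is alive, its descriptor still reaches the data.
cp "/proc/$holder/fd/3" "$workdir/users01.dbf.recovered"

kill "$holder"
```

In a real incident the PID and descriptor number would have to be found first, for example by listing /proc/<pid>/fd for the process in question. Whether Oracle would accept a datafile copied back this way while it was still being written is exactly the open question in this thread, so treat this as a way to buy time, not a replacement for a proper restore.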

#9 - Doug Burns 2006-09-01 08:41 - (Reply)

Nice Trackback!

Check out the trackback. Whilst there's a lot more to a decision to move to ZFS than stupid DBAs, it would have been very nice to have that in place yesterday.

I'm still not convinced Oracle would have liked files disappearing and re-appearing from underneath it, though.

#9.1 - James Foronda said:
2006-09-01 11:25 - (Reply)

Yes, Oracle won't like that. What I usually do for development environments is something like this:

1. Set up the environment and load the initial data, whatever that may be.
2. Shut down the database.
3. Do the ZFS snapshot.

If something gets messed up:

1. Shut down the database.
2. zfs rollback
3. Restart database.

The data you get at this point will be the same as the one at the time the snapshot was taken (as opposed to the data when the screwup occurred). This is how I usually restore a development environment to a known state.

For finer granularity, you can do hot backups on a daily basis also using ZFS snapshot.
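
The two procedures above can be sketched as a pair of shell functions. The dataset name tank/oradata and the snapshot label baseline are invented for illustration, and the database shutdown and restart steps are left out because they are site-specific. The sketch defaults to a dry run that only prints the zfs commands.

```shell
# Hypothetical names: adjust the dataset and snapshot label to your own setup.
DATASET=${DATASET:-tank/oradata}
SNAPSHOT=${SNAPSHOT:-baseline}
DRY_RUN=${DRY_RUN:-1}    # default to printing commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Take the baseline. The database should already be shut down cleanly.
take_baseline() {
    run zfs snapshot "$DATASET@$SNAPSHOT"
}

# Return to the baseline. Shut the database down first, restart it afterwards.
restore_baseline() {
    run zfs rollback "$DATASET@$SNAPSHOT"
}
```

Setting DRY_RUN=0 makes the functions run the commands for real. Note that zfs rollback on its own only goes back to the most recent snapshot; if later snapshots exist, it refuses unless told to destroy them with -r.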

#9.1.1 - Stephen Booth said:
2006-09-01 12:02 - (Reply)

We do something similar using a NetApp filer. Apparently it is a supported configuration by Oracle, we went to one of their sites in Reading and had a demo from them.

I still do an RMAN backup every night, to be on the safe side.

Stephen

#10 - Andy C said:
2006-09-01 11:28 - (Reply)

In the same way Tom Kyte learns something every single day, I probably make a mistake every single day (which probably equates to the same thing).

I like the four closing bullets though. If I had a cube, I would pin them up there.

Funny quote today: 'I have had end users on the phone in tears because they can not use the application.' I was also in tears (of laughter).

#11 - Mark L 2006-09-01 17:07 - (Reply)

Unix is great isn't it.. my favourite is Friday night before the weekend batch, checking that the crontab is set up to kick off backups, and typing crontab -r instead of crontab -e, one key on the keyboard away, one slip of the finger and it's a late Friday night!!

#12 - joel garry 2006-09-02 01:32 - (Reply)

typing crontab -r instead of crontab -e

I learned the hard way that on some recent versions of hp-ux, if you use sam to schedule a backup, it puts #sambackup on the line it creates in crontab. If you subsequently comment out that line then go into and out of sam, crontab gets blanked. Or something like that. Eventually I figured out this can be prevented by putting any character into the string sambackup. (I usually comment out lines in crontab with a date or other human intelligible marker, as well as back it up before changing it in any manner).

One time I went into the O installer to delete a home and oopsie managed to delete the production home. Oddly enough when you do that Oracle doesn't die and existing production kept on until I could reinstall or restore or whatever I did (that part is a blur! But production only had to be down for a few minutes.). Definitely a vote for the open, honest, own-up, fix and learn technique, though I wish I had pictures of me hopping up and down trying to get the busy IS manager's attention. I rarely have anyone to thank for help, though, it's usually all on me.

I've gotten more than one job where the previous person did the rm /oradata/* thing in the wrong X-window.

I've been saved a lot by being paranoid and able to recover mistakes before anyone even noticed. Thank Codd for rollback; imp and recover! (Haven't used flashback in anger yet)

I screwed up the very first backup I ever did. It was switching these huge 20MB disk stacks and typing in copy commands, and the manager was in a hurry to get to bowling and rushed through verbal instructions.

capcha: 2MR


