Entries tagged as The Reality Gap
Dec 27: The Reality Gap (5) - Conclusion
When I delivered the presentation at the Scottish OUG Conference, I concluded with a section explaining why I think The Reality Gap causes genuine problems and isn't just another case of me ranting away to myself. So here are a few negative consequences ...
1) It affects people's perception of themselves and their work-place. For example, going back to part 1, how many Oracle DBAs are living in a perpetual state of self-criticism because they don't apply quarterly patch updates consistently? Perhaps if everyone realised we're facing the same problems and that the vast majority *aren't* applying the patches (at least not consistently), we might be able to assess our work more realistically? Or how about the DBAs who bemoan their lot because they still have to support some (shock! horror!) Oracle 8.1.7 databases when everyone knows that 10g has been around for years and we should all be using that?
2) It stops problems being addressed. I've witnessed this phenomenon at a couple of troubled customer sites I've worked at in the past. Everyone is frightened of telling the truth because they’re worried that if the truth comes out there’ll be trouble. But I’ve always found that exposing the truth tends to lead to a result that really shouldn’t be surprising. Once the true nature of your problems has been exposed, maybe something can be *done* about them? Oh, and this point leads into the more specific point 3 ...
3) It breeds risk. The ostrich-like behaviour of thinking that RAC or SRDF are complete disaster recovery solutions because words like 'Unbreakable' are bandied around or simply because you paid extra money to have them isn't just a bit silly and deluded, it means that when someone *does* drop a table one day, you won't have protected the business against that risk. All the fine words in the world won't help you then.
4) (Fill in your own here ....)
If you want to close the gap, there's no better tool than telling the truth. Although life can never be about unfettered honesty (maybe you *should* just say you like that new jumper your Auntie bought you for Christmas), I've always believed that truth is one of the most powerful weapons you can wield in the IT Systems arena.
It's never the SAN.
Here's a true story. Rest assured that I've watched the same story develop several times so there's no possibility you can identify an individual client. In fact, the thing you're most likely to recognise is a similar experience of your own.
- The business has problems with the overnight batch schedule on a significant system. The problem is that the execution times from night to night are unacceptably variable, for similar data volumes and identical code. The batch window is already tight, so this leads to the system being unavailable during critical hours. The business is upset, particularly as they are fined if they don't deliver certain transaction messages within set deadlines.
- The first port of call is the DBAs because everyone assumes there's a problem with the database. Despite the usual grumbles at this point, I have some sympathy with that perspective because it's a key component of the overall system that's both critical and quite opaque outside the DBA team. Besides, we have the best instrumentation so are likely to be able to help.
- We look at Statspack or AWR reports (this is a complex batch with hundreds of jobs in multiple job streams - a little difficult to trace) and notice that the average single block read time varies from night to night and that there is a close correlation between the nights when i/o performance is poor and the nights when the batch jobs take longer to run. The vast majority of wait events are related to i/o. At this stage, we ask the O/S or Storage guys to take a look at our numbers and investigate i/o performance. They do so and come back with "no, everything looks fine to us". Which is a little tricky to deal with, given that we've demonstrated single block read times over 30ms! Still, what can you do? You ask for expert help and the experts tell you everything is ok.
- The next step might be to trace a few of the batch jobs to show the individual wait times. These look even worse - some are as high as 90ms! We start to look at filesystem configuration options, reducing the i/o workload and several other bright ideas as we desperately cast around for a solution.
- This part of the story lasts for as long as the DBAs and Storage guys want to make it last. At some point, however, the business has had enough of this nonsense and calls in the storage vendor.
- This part saddens me the most. I haven't met every employee of every storage vendor and so I'm sure there are bright guys there (maybe I've been unlucky) but, invariably, there is 'no problem' with the storage. Again. There never is. More to the point, they join in the general finger-pointing in the direction of the DBAs, start asking for initialisation parameter values (guaranteed to drive me daft, is that one, because it's just a knee-jerk reaction) and explain to the business that the back-end infrastructure is working well. Of course, the last thing the vendor is going to do is to explain that the system that they helped the business to specify (typically on cost per Megabyte, rather than bandwidth) is under-specified. Oh, and if you've spent three million quid on something nice, big and shiny, the last thing you want to hear about are its limitations.
- Now at this point of the story, you might start postulating theories which 'prove' it might not be the SAN and I've had a few extended conversations on this subject with sys admins (Hi, Mike), but here's the clincher. A true clincher.
- Coincidentally, new SAN infrastructure is due for deployment. When the database is moved to the new infrastructure, single block read times are reduced to single figures and are consistent. The performance problems are completely solved. "Ah", say the Storage guys, "that's just because not all of the databases have been moved yet. When they are, you should expect to see similar performance levels as before."
- All of the databases are eventually moved and the new improved performance levels are maintained.
My cuddly toy mate, Flatcat, could analyse this situation and see it for what it was (and he's not exactly a computer expert).
It *was* the SAN and I don't enjoy wasting 9 months of my life debating it with people who I'm looking to for expertise, not hand-waving. If this was a one-off story, I might not mind, but I sometimes feel like I'm going round in circles.
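For anyone who wants to put a number on the "close correlation" between slow single block reads and long batch nights, here is a minimal sketch. The nightly figures are invented purely for illustration (they are not from any client), and `pearson` is just a plain textbook correlation coefficient, not something Statspack or AWR gives you directly. In practice you would pull the average read time and the batch elapsed time for each night out of your own reports.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical nightly figures, invented for illustration only.
avg_single_block_read_ms = [8.0, 31.0, 12.0, 35.0, 9.0, 28.0]
batch_duration_minutes = [240, 410, 265, 450, 250, 390]

r = pearson(avg_single_block_read_ms, batch_duration_minutes)
print(f"correlation between read time and batch duration: {r:.2f}")
```

A value close to 1 doesn't prove the storage is at fault on its own, but it is a lot harder to wave away than "the batch feels slow on some nights".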
The last part will discuss the problems caused by The Reality Gap.
P.S. Despite the tone of this post (which has been kicking around my brain cells for a while) I'm looking forward to Christmas more than a middle-aged man should. So, I'm not feeling grumpy and you'll forgive me if I log off now and pick up any comments later. Then again, I'm not expecting too many comments for a few days ....
By coincidence, this question and answer appeared on asktom yesterday. At first I thought it was an incredible coincidence because I was planning to write about this last weekend, but in retrospect, this subject is something Tom had been banging on about for some time. As the first follow-up comment highlights, you can find many more examples of Tom discussing this question in the past with a simple search. In fact if you search for the precise string "Single Instance Per Server" you'll get a couple of results!
Tom's recommendation is that you have a single Oracle instance per server. There are several good reasons that spring to my mind, but I recommend you read Tom's own thoughts on the subject.
- You can allocate one large pot of memory to the single instance and allow Oracle to use it more efficiently, depending on which application needs it. If you have multiple, smaller instances, you're limiting what each pot can be used for. As a general systems management approach, I prefer bigger pots as long as they remain efficient.
- Not only is resource usage more efficient, fewer instances reduce system maintenance overhead, which is always a good thing. We have enough instances to look after, thanks!
- If you don't use a single instance per server, then you're not going to get the best out of Resource Manager (although I'm still hearing anecdotal evidence of bugs there).
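The "bigger pots" argument can be illustrated with some back-of-an-envelope arithmetic. This is a deliberately crude sketch: the server size, the per-instance allocations and the peak demands are all hypothetical, and real Oracle memory management is far more subtle than a single number per application. It only shows why fixed partitions can leave one application short while memory sits idle elsewhere.

```python
def unmet_demand_partitioned(demands_gb, allocations_gb):
    """Demand that cannot be served when each app is capped at its own allocation."""
    return sum(max(0, d - a) for d, a in zip(demands_gb, allocations_gb))


def unmet_demand_pooled(demands_gb, total_gb):
    """Demand that cannot be served when all apps share one pool."""
    return max(0, sum(demands_gb) - total_gb)


# Hypothetical: three applications on a server with 24 GB set aside for Oracle.
total_gb = 24
allocations_gb = [8, 8, 8]    # three separate instances with fixed sizes
peak_demands_gb = [14, 4, 5]  # one busy application, two quiet ones

print("partitioned shortfall (GB):", unmet_demand_partitioned(peak_demands_gb, allocations_gb))
print("pooled shortfall (GB):", unmet_demand_pooled(peak_demands_gb, total_gb))
```

With these made-up numbers the busy application is short of memory under three fixed instances even though the server as a whole has enough to go round.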
I've worked at several sites who do this to some degree. With the growing popularity of tools such as APEX, companies are keen to implement small 'tactical apps' (not my phrase) to meet simple business requirements. A General Purpose database is a good home for them and I work on some at the moment.
However, here are some problems with this approach.
- When you want to upgrade to a new version of Oracle, every application support or project team has to agree. The companies I've worked with over the past few years are buying off-the-shelf applications by the bucket-load and developing less of their own. Which means we need multiple companies to agree that their application can run on the new version. I've heard of a vendor recently who refuses to certify their app to 220.127.116.11. They agree to 18.104.22.168 (!) or 10.2.0.3. Brilliant - we can move to 10g! Erm, no actually, because one of the other vendors isn't certified on 10.2 yet. It's incredibly frustrating. Of course, we have this problem anyway and having multiple different instances on different versions will make our lives more difficult, but at least we don't have every application sitting on an old, unsupported version of Oracle because of one piddling little app.
- When you want an outage for anything, you have to speak to all of the app owners to get it. If one of the app owners doesn't OK it, for whatever reason, we have to wait ... and wait ...
- What happens when one application vendor wants you to set an instance parameter to a specific value and another doesn't? How many vendors do you think state that they are happy to support their application co-existing with another in the same database instance? It really doesn't matter whether I think they're wrong (in fact I've shown more than one vendor that it was achievable with their app), the business will always listen to the vendor. They're paying them money for (cough, splutter ...) support.
This could go on for some time but you get the picture. If this was just a technical decision, then I'd have no argument, but it isn't. It's a business decision and will be dictated by business needs in the end. You might think that it's my job as an IT professional to offer the best solution, explain the reasons and implement it and I attempt that on a day to day basis but I see IT departments playing that role less often these days and just responding to whatever the business tell us to do.
As I will keep repeating during this short series, this is not about the technical merits of the accepted wisdom, it's about the complete gap between recommendations like 'a single instance per server' and the systems I see daily. Remember, too, that as a contractor I didn't build all of these systems; these sites are not following my personal mantra and they aren't all idiots. So, should anyone suggest that I should go and work at a good site, please point me in the right direction.
To finish this blog I tried to think of the last time I saw a server with a single Oracle instance on it. The only one I could think of is the ISP4400 downstairs in my house. Seriously, I can't think of a server that has a single instance on it.
(Updated later - While getting ready for work, I remembered one from my time at Pythian - a client with a two-node RAC cluster.)
(Updated yet again - Oh, and Data Warehouses tend to be on their own server)
Horror of horrors, I see servers with over 20 instances on them regularly.
Sorry, Tom. I don't disagree with the gist of your argument, but the reality I witness is miles apart.
* But maybe not? If you talk to DBAs who work with other DBMS, they often combine application databases into server-wide instances. So maybe I am imagining all of this and it's about how I'm used to working with Oracle? I genuinely don't know. Maybe because the business know that they can have a separate instance and DBAs have encouraged it in some way, then we've created this problem for ourselves?
Updated later - Again, it took me a while to find a related link I remembered vaguely. Over on Howard's site there was a discussion about multiple instances on a server.
Because it's based around a single shared database, RAC will not (and was never designed to) protect you from
- Database corruption
- Erroneous updates or deletes by a user (or DBA!)
Neither will
- SRDF (and other storage-level replication tools)
- Data Guard
- Other replication solutions (e.g. Shareplex and the like)
Unless you have a delay built in to the replication that gives you time to notice the screw-up and do something about it.
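To make the point about a built-in delay concrete, here is a deliberately simplistic sketch. It assumes (my assumptions, not a description of how any particular product behaves) that the replica applies every change exactly `apply_delay_minutes` after it happens on the primary, and that replication can be halted the moment somebody notices the mistake. The function name and the scenarios are hypothetical.

```python
def replica_still_clean(apply_delay_minutes, minutes_until_noticed):
    """True if the mistake is noticed before the replica has applied it."""
    return minutes_until_noticed < apply_delay_minutes


# Hypothetical scenarios: (apply delay, time taken to notice the dropped table)
scenarios = [
    (0, 5),     # synchronous or near-real-time replication
    (60, 5),    # one hour delay, mistake spotted quickly
    (60, 180),  # one hour delay, mistake spotted too late
]

for delay, noticed in scenarios:
    state = "still clean" if replica_still_clean(delay, noticed) else "already damaged"
    print(f"delay={delay:>3} min, noticed after {noticed:>3} min: replica is {state}")
```

With no delay the replica faithfully copies the screw-up straight away, and even with a delay you are only protected if someone notices in time.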
When protecting yourself against disasters, why do we spend so much time and money on protecting ourselves against major site disasters, but much less time and brain-power on protecting ourselves against what I would suggest are far more common - simple human screw-ups?
Think about it :-
- How many times have you screwed up or been sitting next to someone who did?
- How many times have you lost a machine room because of an earthquake, fire or explosion?
- How many times have you had to restore a database to cure a problem?
- How many times have you had to cut-over to your DR site for real? (Not in a practice run)
I'm not saying people shouldn't have DR plans in place, but I'm tired of hearing SRDF sold to managers as the solution to all problems! Reality Gap!
I think my presentation went reasonably well, based on the feedback. For UKOUG members, the presentation slides are here but I have to say I was very reluctant to hand them over because they're useless in isolation - just a few visual gags.
In keeping with the style of a keynote and the chap I was standing in for, I wanted something reasonably controversial but high-level. The original idea for the presentation came from a series of blogs that had been floating around in my head for a while, discussing 'The Reality Gap' or :-
"The difference between what Consulting Firms, Oracle Marketing, Technical Architects, Bloggers and Security Researchers say we should be doing and what most of us really are doing"
One of my favourite stories has always been The Emperor's New Clothes. Nothing drives me nuts like everyone toeing an almost religious 'line' which has no basis in reality. If I can play the little boy and cut through the misinformation a little, that makes me feel better. In fact, I think it's what every consultant, contractor or whatever we want to call ourselves should be doing - asking tricky questions to get to the truth.
Having stolen blog material for a presentation, it's only fair that it should appear here eventually. For my first example, let's talk about Software Maintenance and, more specifically, upgrading to the latest versions of Oracle and applying patches on a regular basis.
I conducted a show-of-hands survey near the start of the presentation which went something like this (and thanks again to everyone who joined in, I was a little nervous that a Scottish crowd might be a little reticent!)
- Who is running Oracle? (Nearly everyone raised their hand. Good start.)
- Who has all of their databases on 22.214.171.124 or 10.2.0.x? (Less than a third kept their hands raised)
- Who has applied the July 2007 CPU to all of their databases? (Not a single hand remained in the air. In October.)
- To start with, you need a regularly scheduled outage on every database. (Plus all the associated Change Management)
- Next you need enough DBAs to plan, test and implement the change.
- Plus the people to perform regression testing? (Well, maybe not and you're prepared to take the risk that no new problems have been introduced)
You and I probably both know that security patches are essential to ensure the security of the business data, but do you think every business truly understands that? They've got better, thanks to SOX legislation, but what do you think they want their DBAs doing - implementing shiny new applications or applying patches to existing applications when they won't see any functional improvement from their perspective? That's our job, though, isn't it? To educate the business about the importance of patching? Mmmm, but who pays us? Who is always looking to cut costs to the bare minimum (ah, the wonders of Capitalism)? Who really controls IT departments these days?
However, none of the pros and cons matter to my specific argument and whether you agree or disagree with me about the importance of security patches isn't the issue. (Oh, but please can we mention company firewalls when we're discussing exposure risks, otherwise people are being disingenuous at least.) In fact, I spend a lot of time working to help companies apply patches more regularly so this isn't an argument about whether we should apply CPUs or not.
The issue is that I predict confidently (is that a guess, Alex?) that the majority* of Oracle customers aren't applying CPUs rigorously, so can we please stop kidding ourselves? Because until the issue of applying the patches is addressed, all the discussion about them might keep a few people busy, but to no material effect!
* Please note I did not say "all"
Updated later - here's an interesting related blog that I wanted to point out but thought I'd lost.