Archive for the ‘Storage’ category

Right Practice

September 16, 2011

Wow, it’s been a while since I wrote a post, sorry about that! I thought that I would take a brief break from the technical postings and espouse some opinion on something that has been bothering me for a while – ‘Best Practices.’

Best Practices have been around a long time, and started with very good intentions. In fact, one could easily claim that they are still produced with good intentions: To communicate methods that the hardware and software vendors recommend. However, the application of Best Practices has become increasingly abused in the field to the point where they have become more like prescriptions of how systems should be built. This has gone too far, and needs to be challenged.

Best Practices are a misnomer. At best, they should be referred to as Default Practices, because they are simply valid starting points for the implementation of a piece of technology. They do not describe the optimal (or even appropriate) recipe for implementation in every given situation. And yet: Best Practices are now becoming The Law.

It is increasingly difficult to deviate from a Best Practice, even when there are clear and logical reasons why this should happen. Best Practices are being used as an alternative to rational thought.

Consider this: Any given Best Practice is only applicable in a certain set of conditions. For example, it is ‘Best Practice’ not to fill a certain storage vendor’s array with its own SSDs; it should only be partially populated. This has been prescribed by the storage vendor because the array has not been designed to cope with the high IOPs and bandwidth rates of every device in the array working at full, uncontended tilt. This Best Practice is not saying that the SSDs will not work; it is simply saying that the full aggregate performance of all those SSDs cannot be sustained by the interconnects and processors in the array. Fair enough. But this Best Practice is being applied in the field as ‘SSDs are bad for the array’. What if your database requires a consistent I/O response time of 1ms or less and masses of storage, but actually isn’t a very high IOPs or bandwidth consumer? Isn’t this the ideal situation to fully populate the array with SSDs, to achieve the consistent response time and deliver the required storage density? The array will easily cope with the throughput and IOPs rate, thus negating the Best Practice.

There are many such examples. It’s time to stop using Best Practice as a substitute for good old-fashioned thinking, and start to implement designs based upon Right Practice: The ‘Right’ Practice for your particular situation. By all means, start with Best Practice documentation; it often contains good information. But read between the lines and understand the reasoning behind those recommendations, and apply that reasoning to your particular application. Are they applicable?

We are big on the Right Practice approach at Scale Abilities; it’s one of the reasons our clients get big value from our involvement. I recommend that you start to question the logic of Best Practices and start to come up with your own Right Practices.

Sane SAN 2010: Fibre Channel – Ready, Aim, Fire

September 30, 2010

In my last blog entry I alluded to perhaps not being all that happy about Fibre Channel. Well, it’s true. I have been having a love/hate relationship with Fibre Channel for the last ten years or so, and we have now decided to get a divorce. I just can’t stand it any more!

I first fell in love with Fibre Channel in the late 90s: How could I resist the prospect of leaving behind multi-initiator SCSI with all its deep, deep electrical issues? Fibre Channel let me hook up multiple hosts to lots of drives, via a switch, and it let me dynamically attach and detach devices from multiple clustered nodes without reboots. Or so I thought. The reality of Fibre Channel is that it was indeed a revelation in its day, but some of that promise never really materialised until recently. And now it’s too late.

I have a number of problems with Fibre Channel as it stands today, and I’m not even going to mention the fact that it is falling behind in bandwidth. Whoops, I just did – try to pretend you didn’t just read that. The real problems are:

  1. It is complex
  2. It is expensive
  3. It is unreliable
  4. It is slow

Complexity

Complexity. Complexity, complexity, complexity. I hate complexity. Complexity is the IT equivalent of communist bureaucracy – it isn’t remotely interesting, it wastes colossal amounts of time, and it ultimately causes the system to go down. Don’t confuse complexity with challenge – Challenge is having to solve new and interesting problems, Complexity is having to fix the same old problems time and time again and having to do it standing on one leg. So why do I think Fibre Channel is complex? For these reasons:

  1. The stack
  2. YANT

The Stack

If you have ever tried to manage the dependencies associated with maintaining a fully supported Fibre Channel infrastructure then you can probably already feel a knot in your stomach. For everyone else, let me explain.

Every component in a Fibre Channel stack needs to be certified to work with the other components: Operating System version, multipath I/O (MPIO) drivers, HBA device drivers, HBA firmware, switch type, switch firmware and storage array firmware. So what happens when you want to, for example, upgrade your MPIO drivers? It is pretty standard for the following process to occur:

  • I want to upgrade to MPIO v2
  • MPIO v2 requires array firmware v42
  • Array firmware v42 requires HBA device driver v3.45
  • HBA device driver v3.45 requires the next release of the Operating System
  • The next release of the Operating System is not yet supported by the Array firmware
  • etc, etc

I think you get the point. But also remember that this wonderful array is shared across 38 different systems, all with different operating systems and HBAs, so the above process has to be followed for every single one, once you have a target release of array firmware that might work across all the platforms. If you are really really lucky, you might get a combination within those factorial possibilities that is actually certified by the array vendor.
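The upgrade chain above can be sketched as a simple dependency walk. This is only an illustration: the component names and version numbers are the made-up ones from the bullet list, and the `REQUIRES` and `UNSUPPORTED` tables are hypothetical stand-ins for a vendor's real support matrix.

```python
# Hypothetical support matrix, transcribed from the bullet list above.
# REQUIRES maps a component release to the release it depends on.
REQUIRES = {
    "MPIO v2": "array firmware v42",
    "array firmware v42": "HBA driver v3.45",
    "HBA driver v3.45": "next OS release",
}

# Pairs that the vendor has not (yet) certified to run together.
UNSUPPORTED = {("next OS release", "array firmware v42")}


def upgrade_chain(target):
    """Follow the 'requires' links from target and return the full chain."""
    chain = [target]
    while chain[-1] in REQUIRES:
        chain.append(REQUIRES[chain[-1]])
    return chain


def conflicts(chain):
    """Return every uncertified pair whose two members are both in the chain."""
    members = set(chain)
    return [pair for pair in sorted(UNSUPPORTED)
            if pair[0] in members and pair[1] in members]


chain = upgrade_chain("MPIO v2")
print(" -> ".join(chain))
print("blocked by:", conflicts(chain))
```

In this toy version the chain ends at a release that conflicts with one of its own prerequisites, which is the dead end described above. Repeating the walk for every host sharing the array is what makes the real exercise so painful.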

Complex enough? Now add YANT…

YANT

Yet Another Networking Technology. I’m all in favour of having different types of networking technology, but not when the advantage is minuscule. All that training, proprietary hardware, cost, and so on: To justify that, the advantage had better be substantial. But it isn’t. Compare Fibre Channel to 10Gbps Ethernet, which is a universal networking standard, and it just doesn’t justify its own existence. To be fair to Fibre Channel, it was the original version of what we are now calling Converged Networking – it has always supported TCP/IP and SCSI protocols, and used to be way faster than Ethernet, but it just never got the traction it needed in that space.

Expensive

It’s tough to argue against this one: Fibre Channel is expensive. 10Gbps Ethernet is also expensive, but the prices will be driven down by volume and ubiquity. In addition, Ethernet switches and so forth can be shared (if you must, that is: I’m still a fan of dedicated storage networks for reasons of reliability), whereas Fibre Channel must be dedicated. Infiniband is expensive too, and will probably stay that way, but it is providing a much higher performance solution than Fibre Channel.

Unreliable

What? Unreliable?

Yes, it’s true. It’s not an inherent problem with the technology itself; Fibre Channel is actually incredibly robust and I can’t fault that fact. However, the promise of real-life reliability is shattered by:

  • Large Fabrics
  • Complexity

What is the point of large fabrics? I can see the point of wanting to stretch I/O capability over a wide area, such as remote replication and so forth, but that does not imply that the whole storage universe of the enterprise should be constructed as a giant fabric, does it? Networks should be composed of relatively small, interconnected failure domains, so that traffic can flow, but the impact of a failure is limited in scope. Building a large fabric is going against that, and I’ve lost count of the number of catastrophic failures I’ve seen as a result of building The Dream Fabric.

Complexity; we’re back there again. Reliability is inversely proportional to complexity: High complexity = Low reliability, and vice versa. This is particularly true while we still entrust humans to administer these networks.

Slow

This is the final nail in the coffin. Times have changed, and Fibre Channel has no space in the new world. The way I see it, there are now just two preferred ways to attach storage to a server:

  • Ethernet-based NFS for general use
  • Infiniband-based for very low latency, high bandwidth use

The former approach is a ‘high enough’ performance solution for most current requirements, with ease of use and well understood protocols and technology. I’m not saying it’s quicker than Fibre Channel (though it certainly can be), just that it is fast enough for most things and is easy to put together and manage. The latter method, Infiniband (or similar), is a step up from both Ethernet and Fibre Channel, offering both higher bandwidth and lower latency, especially when used with RDMA. Infiniband has been a technology searching for a commercial purpose for some time now, and I believe that time has now come, via the route of semiconductor-based storage devices. Consider the following numbers:

  • Fibre Channel Latency: 10-20us (est)
  • Infiniband/RDMA Latency: 1us (est)

Now let’s see how these latencies compare to those of a physical disk read, and a read from a DRAM-based storage device:

  • Disk Read: 8,000 us (ie 8ms)
  • DRAM-based Storage read: 15us (source: TMS Ramsan 440 specification)
  • Ratio of FC latency to Disk Latency: 1:800 (0.125%)
  • Ratio of FC latency to DRAM Latency: 1:1.5 (67%)
  • Ratio of IB latency to Disk Latency: 1:8000 (0.0125%)
  • Ratio of IB latency to DRAM latency: 1:15 (6.67%)

When comparing to disk reads, the Fibre Channel latency does not add much to the total I/O time. However, when accessing DRAM-based storage, it becomes a hugely dominant factor in the I/O time, whereas Infiniband is still single-digit percentage points. This is why I suggest that Fibre Channel has no role in the forthcoming high-performance storage systems. Fibre Channel is neither simple enough for simple systems, nor fast enough for high-performance systems.
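The ratios are easy to recompute. The sketch below uses the estimates quoted above and takes the lower 10us bound for Fibre Channel, an assumption that flatters Fibre Channel; the function name `overhead_pct` is just illustrative.

```python
# Latency figures from the lists above, in microseconds.
# Fibre Channel is quoted as 10-20us; the lower bound is assumed here.
FC_US = 10.0
IB_US = 1.0
DISK_US = 8000.0
DRAM_US = 15.0


def overhead_pct(interconnect_us, device_us):
    """Interconnect latency as a percentage of the device read latency."""
    return 100.0 * interconnect_us / device_us


for link, link_us in (("FC", FC_US), ("IB", IB_US)):
    for device, device_us in (("disk", DISK_US), ("DRAM", DRAM_US)):
        pct = overhead_pct(link_us, device_us)
        print(f"{link} vs {device}: {pct:.4g}% of device latency")
```

With the upper 20us estimate the Fibre Channel figures simply double, which makes the DRAM case look even worse.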

Sane SAN2010: Storage Arrays – Ready, Aim, Fire

September 6, 2010

OK, this one might be contentious, but what the heck – somebody has to say it. Let’s start with a question:

Raise your hand if you have a feeling, even a slight one, that storage arrays suck?

Most DBAs and sysadmins that I speak to certainly have this feeling. They cannot understand why the performance of this very large and expensive array is nearly always lower than they achieve from the hard drive in their desktop computer. OK, so the array can do more aggregate IOPs, but why is it that 13ms, for example, is considered a reasonable average response time? Or worse, why is it that some of my I/Os take several hundred milliseconds? And how is it possible that my database is reporting 500ms I/Os and the array is reporting that they are all less than 10ms? These are the questions that are lodged in the minds of my customers.

Storage Arrays do some things remarkably well. Availability, for example, is something that is pretty much nailed in Storage Array Land, both at the fabric layer and the actual array itself. There are exceptions: I think that large Fibre Channel fabrics are a High Availability disaster, and the cost of entry with director-class switches makes no sense when small fabrics can be built using commodity hardware. I have a more general opinion on Fibre Channel actually – it is an ex-format, it is pushing up the lilies. More on that in another blog post, though, I’m not done with the array yet!

The Storage Array became a real success when it became possible to access storage through Fibre Channel. Until then, the storage array was a niche product, except in the mainframe world where ESCON was available. Fibre Channel, and subsequently Ethernet and Infiniband, enabled the array to be a big shared storage resource. For clustered database systems this was fantastic – an end to the pure hell of multi-initiator parallel SCSI. But then things started getting a little strange. EMC, for example, started advertising during Saturday morning children’s television about how their storage arrays allowed the direct sharing of information between applications. Well, even an eight-year-old knows that you can’t share raw data that way; it has to go via a computer to become anything meaningful. But this became the big selling point: all your data in one place. That also has the implication that all the types of data are the same – database files, VMware machines, backups, file shares. They are not, from an access pattern, criticality or business value standpoint, and so this model does not work. Some back-pedalling has occurred since then, notably in the form of formal tiered storage, but this is still offered under the guise of having all the data in one place – just on cheap or expensive drives.

So now we have this big, all-eggs-in-one-basket, expensive array. What have we achieved by this? Everything is as slow as everything else. I visited a customer last week with just such an array, and this was the straw that broke the camel’s back (it’s been a long time coming). They have a heavily optimised database system that ultimately means the following speeds and feeds are demanded from the array:

  • 10 megabytes per second write, split into around 1300 IOPs
  • 300 reads per second

If you don’t have a good feel for I/O rates, let me tell you: That is a very moderate amount of I/O. And yet the array is consistently returning I/Os that take several hundred milliseconds, both reads and writes. Quite rightly, the customer thinks this is not very acceptable. Let’s have a little analysis of those numbers.

First, the writes: Half of those writes are sequential to the Oracle redo log and could easily be serviced by one physical drive (one 15k drive can sustain at least 100MB/s of sequential I/O). The rest of them are largely random (let’s assume 100% random), as they are dirty datafile blocks being written by the database writer. Again, a single drive could support 200 random writes per second, but let’s conservatively go for 100 – that means we need six or seven drives to support the physical write requirements of the database, plus one for the redo log. Then we need to add another three drives for the reads. That makes a very conservative total of eleven drives to keep up with the sustained workload for this customer, going straight to disk without any intermediate magic. This array also has quite a chunk of write-back cache, which means that writes don’t actually even make it to disk before they can be acknowledged to the host/database. Why, then, is this array struggling to deliver low latency I/O when it has sixty-four drives inside it?
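The drive count above is back-of-the-envelope arithmetic, sketched here with the figures quoted in the text. It assumes, as the text does, that every non-redo I/O is fully random and that a drive is only good for 100 random I/Os per second; the constant names are illustrative.

```python
import math

# Workload figures quoted above.
WRITE_IOPS = 1300
READ_IOPS = 300

# Assumptions taken from the paragraph above.
REDO_FRACTION = 0.5          # half the writes are sequential redo
RANDOM_IOPS_PER_DRIVE = 100  # deliberately conservative per-drive rate
REDO_DRIVES = 1              # one drive absorbs the sequential redo stream

# Random datafile writes left over once the redo stream is set aside.
random_writes = WRITE_IOPS * (1 - REDO_FRACTION)

write_drives = math.ceil(random_writes / RANDOM_IOPS_PER_DRIVE)
read_drives = math.ceil(READ_IOPS / RANDOM_IOPS_PER_DRIVE)
total_drives = write_drives + REDO_DRIVES + read_drives

print(f"random-write drives: {write_drives}")
print(f"redo drives:         {REDO_DRIVES}")
print(f"read drives:         {read_drives}")
print(f"total:               {total_drives}")
```

Rounding up the 6.5 drives needed for random writes gives seven, and the total comes to eleven against the sixty-four drives actually in the array.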

The answer is that a Storage Array is just a big computer itself. Instead of taking HTTP requests or keyboard input, it takes SCSI commands over Fibre Channel. Instead of returning a web page, it returns blocks of data. And like all complex computer systems, the array is subject to performance problems within itself. And the more complex the system, the more likely it is that performance problems will arise. To make things worse, the storage arrays have increasingly encouraged the admin to turn on more and more magic in the software to the point where it is now frequently an impossibility for the storage admin to determine how well a given storage allocation might perform. Modern storage administration has more to do with accountancy than it does performance and technology. Consider this equation:

(number of features) x (complexity of feature) = (total complexity)

Complexity breeds both performance problems and availability problems. This particular customer asked me if there was a way to guarantee that, when they replace this array, the new one will not have these problems. The answer is simple: ‘no’.

Yes, we can go right through the I/O stack, including all the components and software features of the array, and fix them up. We can make sure that Fibre Channel ports are private to the host, and remove all other workloads from the array so that there are no scheduling or capacity problems there. We can turn off all dynamic optimisations in the array software and we can lay out the storage across known physical drives. Then, and only then, might there be a slim chance of a reduced number of high latency I/Os. I have a name for this way of operating a storage array. It’s called Direct Attached Storage (DAS): Welcome to the 1990s.

Now let me combine this reality with the other important aspect: semiconductor-based storage. What happens when the pent-up frustrations of the thousands of storage array owners meet the burgeoning reality of a new and faster storage that is now governed by some kind of accelerated form of Moore’s Law? As my business partner Jeff describes it: It’s gonna be a bloodbath.

I think that we will now see a sea change in the way we connect and use storage. It’s already started with products such as Oracle’s Exadata. I’m not saying that because I am an Oracle bigot (I won’t deny that), but because it is the right thing to do – it’s focused on doing one thing well and it uses emerging technology properly, rather than pretending nothing has changed. I don’t think it’s plug and play for many transactional customers (because of the RAC implication), but the storage component is on the money. Oh, and it is effectively DAS – a virtual machine of Windows running a virus scan won’t slow down your critical invoicing run.

I think that the way we use the storage will have to change too – storage just took a leap up the memory hierarchy. Low latency connectivity such as Infiniband will become more important, as will low latency request APIs, such as SRP. We simply cannot afford to waste time making the request when the response is no longer the major time component.

With all this change, is it now acceptable to have the vast majority of I/O latency accounted for in the complex software and hardware layers of a storage array? I don’t think so.

SaneSAN2010: Serial to Serial – When One Bottleneck Isn’t Enough

August 23, 2010

I was recently looking into a storage-related performance problem at a customer site. The system was an Oracle 10.2.0.4/SLES 9 Linux system, Fibre Channel attached to an EMC DMX storage array. The DMX was replicated to a DR site using SRDF/S.

The problem was only really visible during the overnight batch runs, so AWR reports were the main source of information in diagnosis. In this case, they were more than sufficient, showing clear wait spikes for ‘free buffer waits’ and ‘log file parallel write’ during the problematic period. They were quite impressive, too – sixteen second latencies for some of the writes.

The customer was not oblivious to this fact, of course – it is difficult not to see a problem of such magnitude. They already had an engineering plan to move from SRDF/S (Synchronous) to SRDF/A (Asynchronous), as it was perceived that SRDF was the guilty party in this situation. I had been asked to validate this assumption and to determine the most appropriate roadmap for fixing these I/O problems on this highly critical system.

Of course, SRDF/S will always get the blame in such situations 🙂 I have been involved with many such configurations, and can indeed attest that SRDF/S can lead to trouble, particularly if the implications are not understood correctly. In this case, a very large financial institution, the storage team did indeed understand the main implication (more on this shortly) but, as is often the case in large organisations, the main source of the problem was actually what I call a boundary issue, one which falls between or across the technology focus of two or more teams. In this case, there were three teams involved, leading to configuration issues across all three areas.

Let’s go back to the SRDF implication, as it is the genesis of the problem. In synchronous mode, SRDF will only allow one outstanding write per hyper-volume. Any additional writes to that hyper will be serialised on the local DMX. The storage admin had understood this limitation, and had therefore combined many hyper volumes into a number of striped metavolumes, thus increasing the number of hypers that a given ‘lump of storage’ would contain. All well and good.

The system admin had created striped Veritas volumes over these metavolumes, thus striping even further. A filesystem was then built on the volumes, and presented to the DBA team. The DBAs then built the database and started it up. All apparently ran well for a few years until performance became intolerable, and that’s where my story begins.

I’m going to cut to the chase here, most of us don’t have time to read blogs all day long. There were three factors carefully conspiring on this system to ensure that the write performance was truly terrible:

  1. SRDF/S can only have one outstanding write per hypervolume – that’s a serialisation point.
  2. The filesystem in use was not deployed in any kind of ODM, quick I/O, or other UNIX file locking bypass technology – that’s a serialisation point.
  3. The database was not using Async I/O – that’s (pretty much) a serialisation point.

There you go – one, two, three serialisation points, one from each of the three teams, none of which were understood by the other teams. Let’s step through Oracle attempting to write dirty buffers to disk during the batch run (the major wait times were observed on ‘free buffer waits’, so let’s start there):

  • DML creates dirty buffers
  • Oracle needs to create more dirty buffers to continue DML operation, but cannot because existing dirty buffers must be written to disk to create space
  • Oracle posts the DBWR process(es) to write out dirty buffers

(all the above happen on well-tuned, healthy systems also, though these may never struggle to have free buffers available because of well-performing writes)

  • DBWR scans the dirty list and issues writes one at a time to the operating system, waiting for each to complete before issuing the next. This is the lack of Async I/O configuration in the database
  • The operating system takes out a file lock (for write) on the datafile, and issues the write. No other DBWR processes can write to this file at this point. This is the side effect of having the wrong kind of filesystem, and implies that only one write can go to a file at any one time.
  • The operating system issues a write to the relevant meta on the DMX, which resolves to a specific hyper inside the box. That’s the single outstanding write for that hyper now in flight. No other writes can occur to that hyper at this point until this one is complete.

It’s easy to see how, when all the write I/Os are being fed to the DMX one at a time, the additional latency of having SRDF in the equation makes a really big difference. It’s also easy to see that, by turning off SRDF, the problem will get less severe. I’m not defending EMC here, they deserve everything they get when it’s their fault. It just isn’t primarily an SRDF problem in this case. Yes, turning off SRDF or going SRDF/A will help, but it’s still fixing a downstream bottleneck.
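The effect of feeding writes one at a time can be sketched with Little’s Law (throughput = concurrency / latency). The latencies below are hypothetical, picked only to show the shape of the problem; they are not measurements from this system.

```python
def max_writes_per_sec(outstanding, latency_ms):
    """Little's Law: sustainable throughput = concurrency / latency."""
    return outstanding * 1000.0 / latency_ms


# Hypothetical per-write latencies in milliseconds, for illustration only.
LOCAL_WRITE_MS = 1.0    # write acknowledged by the local array cache
SRDF_S_WRITE_MS = 5.0   # same write with a synchronous remote round trip

for label, outstanding in (("one write at a time", 1),
                           ("eight hypers kept busy", 8)):
    local = max_writes_per_sec(outstanding, LOCAL_WRITE_MS)
    replicated = max_writes_per_sec(outstanding, SRDF_S_WRITE_MS)
    print(f"{label}: {local:.0f}/s local, {replicated:.0f}/s with SRDF/S")
```

With only one write ever in flight, every extra millisecond of replication latency comes straight off the throughput ceiling. Keep several hypers busy and the same latency hurts far less, which is what the storage admin’s metavolume design was trying to achieve.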

The real culprit here is the file locking in the filesystem. This file locking is disabling the storage admin’s design of presenting many hypers up to the host to mitigate the SRDF overhead. In addition, operating system file locking on database files is just plain stupid, and I was hoping to have seen the last example of this in the early 90s; but this is the second one I’ve seen in 3 years… I’m not saying that the people that implement the systems this way are stupid, but it’s pretty easy to be naive about some critical areas when the complexity is so high and unavoidable boundary issues exist between the teams.

The lack of Async I/O is not good here, either, though the presence of multiple DBWRs is mitigating the impact somewhat, and the filesystem would quickly stomp on any improvements made by turning on Async I/O. I don’t believe that this filesystem would support Async anyway until the file locks were bypassed, so it’s two for the price of one here.

With multiple consecutive points of serialisation, it is not surprising that the system was struggling to achieve good throughput.

What’s the lesson here? There are two, really:

  1. Just knowing ‘your area’ isn’t enough.
  2. If you try to walk with two broken legs, you will fall down and bang your head. The fix, however, is not a painkiller for the headache.

EDIT: I have realised upon reflection that I only implied the reason that the file locking makes SRDF/S worse, rather than spelling it out. The reason it makes it worse is that it (file locking) enforces only a single write to that file at once. This means that this particular write is more than likely going to a single hypervolume, thus eliminating any parallelism that might be achievable from SRDF. FYI, metavolumes have a stripe width of 960KB, so it’s really likely that any single write will only go to one hyper.
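To see why a single write almost always lands on one hyper, here is a minimal sketch that assumes plain round-robin striping, with 960KB treated as the chunk placed on each hyper before moving to the next. The real DMX layout may differ; the function name and the eight-hyper metavolume are hypothetical.

```python
STRIPE_KB = 960  # figure quoted above, assumed to be the per-hyper chunk size


def hypers_touched(offset_kb, size_kb, n_hypers, stripe_kb=STRIPE_KB):
    """Return the sorted hyper indexes touched by one write (round-robin model)."""
    first_chunk = offset_kb // stripe_kb
    last_chunk = (offset_kb + size_kb - 1) // stripe_kb
    return sorted({chunk % n_hypers
                   for chunk in range(first_chunk, last_chunk + 1)})


# An 8KB database block write inside a single chunk.
print(hypers_touched(offset_kb=0, size_kb=8, n_hypers=8))
# A 16KB write that happens to straddle a chunk boundary.
print(hypers_touched(offset_kb=952, size_kb=16, n_hypers=8))
```

Only writes that happen to straddle a 960KB boundary touch two hypers, so with one write in flight per file the other hypers in the metavolume sit idle.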

Sane SAN 2010 – Introduction

August 23, 2010

This year at the UKOUG Conference in Birmingham, acceptance permitting, I will present the successor to my original Sane SAN whitepaper first penned in 2000. The initial paper was spectacularly well received, relatively speaking, mostly because disk storage at that time was very much a black box to DBAs and a great deal of mystique surrounded its operation. Well, nothing much has changed on that front, so I figured it was very much time to update/rewrite the paper for modern technology and trends and try to impose my occasionally humble opinion on the reader 🙂

I’ve already posted an article on this blog that will form part of the paper, and I will write a few more over the next few weeks, so keep an eye on the blog. Check out the first one out of the bag: “Serial to Serial – When One Bottleneck Isn’t Enough.”

If the UKOUG don’t accept it, I’ll post it on the Scale Abilities website anyway, and try to palm it off on some other conferences in 2011 🙂

“Flash” Storage Will Be Cheap – The End of the World is Nigh

May 29, 2010

A couple of weeks ago I tweeted a projection that the $/GB for flash drives will meet the $/GB for hard drives within 3-4 years. It was more of a feeling based upon current pricing with Moore’s Law applied than a well researched statement, but it felt about right. I’ve since been thinking some more about this within the context of current storage industry offerings from the likes of EMC, Netapp and Oracle, wondering what this might mean.

First of all I did a bit of research – if my 3-4 years guess-timate was out by an order of magnitude then there is not much point in writing this article (yet). I wanted to find out what the actual trends in flash memory pricing look like and how these might project over time, and I came across the following article: Enterprise Flash Drive Cost and Technology Projections. Though this article is now over a year old, it shows the following chart which illustrates the effect of the observed 60% year on year decline in flash memory pricing:

[Chart: Flash Drive Pricing Projections]

This 60% annual drop in costs is actually an accelerated version of Moore’s Law, and does not take into account any radical technology advances that may happen within the period.  This drop in costs is probably driven in the most part by the consumer thirst for flash technology in iPods and so forth, but naturally ripples back up into the enterprise space in the same way that Intel and AMD’s processor technologies do.
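As a rough check on the ‘accelerated Moore’s Law’ remark, the sketch below converts a price-halving period into an annual decline and estimates how long a 60% annual decline takes to close a given price gap. Two loud assumptions: Moore’s Law is read here as a price halving every 18 to 24 months, and hard drive $/GB is held flat by default. The price gaps in the loop are hypothetical inputs, not market data.

```python
import math


def annual_decline(halving_months):
    """Yearly price decline implied by the price halving every halving_months."""
    return 1.0 - 0.5 ** (12.0 / halving_months)


def years_to_parity(price_gap, flash_decline=0.60, disk_decline=0.0):
    """Years until flash $/GB meets disk $/GB, given today's price ratio.

    price_gap is flash $/GB divided by disk $/GB (a hypothetical input).
    disk_decline defaults to 0.0, i.e. disk prices are assumed to stand still.
    """
    yearly_closure = (1.0 - disk_decline) / (1.0 - flash_decline)
    return math.log(price_gap) / math.log(yearly_closure)


for months in (18, 24):
    print(f"halving every {months} months = "
          f"{annual_decline(months):.0%} decline per year")

# Hypothetical price gaps, for illustration only.
for gap in (10, 30, 100):
    print(f"{gap}x gap closes in {years_to_parity(gap):.1f} years")
```

Any non-zero `disk_decline` pushes the crossover further out, so the flat-disk default is the optimistic case for flash.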

So let’s just assume that my guess-timate and the above chart are correct (they almost precisely agree) – what does that mean for storage products moving forward?

Looking at recent applications of flash technology, we see that EMC were the first off the blocks by offering the option of relatively large flash drives as drop-in replacements for their hard drives in their storage arrays. Netapp took a different approach of putting the flash memory in front of the drives as another caching layer in the stack. Oracle have various options in their (formerly Sun) product line and a formalised mechanism for using flash technology built into the 11g database software and into the Exadata v2 storage platform. Various vendors offer internal flash drives that look like hard drives to the operating system (whether connected by traditional storage interconnects such as SATA or by PCI Express). If  we assume that the cost of flash technology becomes equivalent to hard drive storage in the next three years, I believe all these technologies will quickly become the wrong way to deploy flash technology, and only one (Oracle) has an architecture which lends itself to the most appropriate future model (IMHO).

Let’s flip back to reality and look at how storage is used and where flash technology disrupts that when it is cheap enough to do so.

First, location of data: local or networked in some way? I don’t believe that flash technology disrupts this decision at all. Data will still need to be local in certain cases and networked via some high-speed technology in others, in much the same way as it is today. I believe that the networking technology will need to change for flash technology, but more on that later.

Next, the memory hierarchy: Where does current storage sit in the memory hierarchy? Well, of course, it is at the bottom of the pile, just above tape and other backup technologies if you include those. This is the crucial area where flash technology disrupts all current thinking – the final resting place for data is now close or equal to DRAM memory speeds. One disruptive implication of this is that storage interconnects (such as Fibre Channel, Ethernet, SAS and SATA) are now a latency and bandwidth bottleneck. The other, potentially huge, disruption is what happens to the software architecture when this bottleneck is removed.

Next, capacity: How does the flash capacity sit with hard drive capacity? Well that’s kind of the point of this posting… it’s currently quite a way behind, but my prediction is that they will be equal by 2013/2014. Importantly though, they will then start to accelerate away from hard drives. Given the exponential growth of data volumes, perhaps only semiconductor based storage can keep up with the demand?

Next, IOPs: This is the hugely disruptive part of flash technology, and is a direct result of a dramatically lowered latency (access time) when compared to hard disk technology. Not only is the latency lowered, but semiconductor-based storage is more or less contention-free given the absence of serialised moving parts such as a disk head. Think about it – the service time for a given hard drive I/O is directly affected by the preceding I/O and where the head was left on the platter. With solid-state storage this does not occur and service times are more uniform (though writes are consistently slower than reads).

These disruptions mean that the current architectures of storage systems are not making the most of semiconductor-based storage. Hey, why do I keep calling it “semiconductor-based storage” instead of SSD or flash? The reason is that the technologies used in this area are changing frequently, from DRAM-based systems to NOR-based flash to NAND-based flash to DRAM-fronted flash; single-level cells to multi-level cells; battery-backed to “Super Cap” backed. Flash, as we know it today, could be outdated as a technology in the near future, but “semiconductor-based” storage is the future regardless.

I think that we now need technologies that look more like Oracle Exadata v2, with low-latency RDMA interfaces directly into the Operating System/Database. However, they need to easily and natively support other types of storage (unstructured data such as files, VMware datastores and so forth). The Exadata architecture lends itself well to changes in this area in both hardware trends and access protocols.

Perhaps more importantly, we are also only just beginning to understand the implications in software architecture for the disrupted memory hierarchy. We simply cannot continue to treat semiconductor-based storage as “fast disk” and need to start thinking, literally, outside the box.

Forget I/O Bound, You’re Latency Bound, Bub

September 21, 2009

As it has been nearly ten years since I wrote my book, Scaling Oracle8i, I thought it was about time that I started writing again. I thought I would start with the new-fangled blogging thing, and see where it takes me. Here goes.

As some will know, I run a small consulting company called Scale Abilities, based out of the UK. We get involved in all sorts of fun projects and problems (or are they the same thing?), but one area that I seem to find myself focusing on a lot is storage. Specifically, the performance and architecture of storage in Oracle database environments. In fact I’m doing this so much that, whenever I am writing presentations for conferences these days, it always seems to be the dominant subject at the front of my mind.

One particular common thread has been the effect of latency. This isn’t just a storage issue, of course, as I endeavoured to point out in my Hotsos Symposium 2008 presentation “Latency and Skew”. Latency, as the subtitle of that particular talk said, is a silent killer. Silent, in that it often goes undetected, and the effects of it can kill performance (and still remain undetected). I’m not going to go into all the analogies about latency here, but let’s try and put a simple definition out for it:

Latency is the time taken between a request and a response.

If that’s such a simple definition, why is it so difficult to spot? Surely if a long period of time passes between a request and a response, the latency will be simple to nail? No.

The problem is that it is the small latencies that cause the problems. Specifically, it is the “small, but not so small that they are not important” ones that are so difficult to spot and yet cause so many problems. Perhaps an example is now in order:

A couple of years ago, a customer of mine was experiencing a performance problem on their newly virtualised database server (VMware 3.5). The problem statement went a little bit like this: Oracle on VMware is broken – it runs much slower on VMware than on physical servers. The customer was preparing a new physical server in order to remove VMware from the equation. Upon further investigation, I determined the following:

  1. The VMware host (physical server running VMware) was a completely different architecture to the previous dedicated server. The old server was one of the Intel Prescott core type (3.2GHz, Global Warming included at no extra cost), and the new one was one of the Core 2 type with VT instructions.
  2. Most measurable jobs were actually faster on the virtualised platform
  3. Only one job was slower

The single job that was slower was so much slower that it overshadowed all the other timings that had improved. Of the four critical batch jobs, the timings looked like this:

Physical server:

  • Job 1: 26s
  • Job 2: 201s
  • Job 3: 457s
  • Job 4: 934s
  • Total: 1618s

Virtualised Server:

  • Job 1: 15s
  • Job 2: 111s
  • Job 3: 208s
  • Job 4: 2820s
  • Total: 3154s

It can be seen that, if one takes the total as the yardstick, the virtualised server is almost twice as slow: Therein lies the danger of using averages and leaping to conclusions. If Job 4 is excluded, the totals are 684s vs 334s, making the virtualised server more than twice as quick as the physical one.
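To make the arithmetic behind those two conclusions explicit, here is a minimal sketch in Python. The timings are copied from the lists above; the variable names and the `total` helper are just illustrative choices, not anything from the customer system.

```python
# Batch job timings in seconds, copied from the lists above.
physical = {"Job 1": 26, "Job 2": 201, "Job 3": 457, "Job 4": 934}
virtual = {"Job 1": 15, "Job 2": 111, "Job 3": 208, "Job 4": 2820}


def total(timings, exclude=()):
    """Sum the timings, optionally leaving out the named jobs."""
    return sum(t for job, t in timings.items() if job not in exclude)


# Taking the total as the yardstick: how much slower is the virtualised server?
slowdown_on_totals = total(virtual) / total(physical)

# Excluding Job 4: how much faster is the virtualised server?
speedup_without_job4 = (
    total(physical, exclude=("Job 4",)) / total(virtual, exclude=("Job 4",))
)

# Per-job speedup (values above 1.0 mean the virtualised server is faster).
per_job_speedup = {job: physical[job] / virtual[job] for job in physical}

print(f"Totals: {total(physical)}s vs {total(virtual)}s "
      f"(virtualised is {slowdown_on_totals:.2f}x slower)")
print(f"Without Job 4: virtualised is {speedup_without_job4:.2f}x faster")
for job, speedup in per_job_speedup.items():
    print(f"{job}: {speedup:.2f}x")
```

Looking at the per-job figures rather than the single total is what exposes Job 4 as the outlier: three jobs speed up, one slows down dramatically.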

Upon tracing Job 4 on the VMware platform with Oracle extended SQL tracing (10046 level 8), I discovered that it was making a lot of roundtrips to the database. Hang on, let me give that the right emphasis: A LOT of roundtrips. However, each roundtrip was really fast – about 0.2ms if memory serves. So where’s the problem with that? It’s not exactly high latency, is it? Well it is if you have to do several million of them.

As it turns out, there was something in the VMware stack (perhaps just the additional codepath needed to get through the vSwitch) that was adding around 0.1ms of latency to each roundtrip. When this tenth of a millisecond is multiplied by several million (something like 20 million in this case), it becomes a long time. About 2000s to be precise, which more than accounts for the extra 1886s that Job 4 took on the virtualised server. The real answer – do fewer roundtrips, and the new server will be at least twice as fast as the old server.
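The multiplication is worth spelling out, because it is exactly the kind of sum that nobody does until it is too late. A rough sketch, using the approximate figures quoted above (the 20 million roundtrips and the 0.1ms are recollections, not exact measurements, and the variable names are purely illustrative):

```python
# Approximate figures from the text; these are recollections, not measurements.
roundtrips = 20_000_000        # "something like 20 million"
added_latency_us = 100         # ~0.1ms of extra latency per roundtrip

# Total extra elapsed time caused by the tiny per-roundtrip overhead.
added_time_s = roundtrips * added_latency_us / 1_000_000

# Observed difference for Job 4 between the two platforms (from the timings).
observed_extra_s = 2820 - 934

print(f"Extra time from added latency: ~{added_time_s:.0f}s")
print(f"Observed extra time for Job 4: {observed_extra_s}s")
```

A tenth of a millisecond looks like noise on any single call, yet the product of the two numbers lands in the same range as the whole observed slowdown.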

So what does this have to do with I/O? Plenty. The simple fact is that roundtrips are a classic source of unwanted latency. Latency is the performance killer.

Let’s look at some I/O examples from a real customer system. Note: this is not a particularly well tuned customer system:

  • Latency of 8KB sequential reads: 0.32ms
  • Latency of 4MB sequential reads: 6ms

Obviously, the 4MB reads are taking a lot longer (nearly 19x), but that makes sense, right? Apart from one thing: The 4MB reads are serving 512x more data in each payload. The net result of these numbers is as follows:

  • Time to sequentially read 2.1GB in 8KB pieces: 76s
  • Time to sequentially read 2.1GB in 4MB pieces: 24s

So what happened to the 52s difference in these examples? Was this some kind of tuning problem? Was this a fault on the storage array? No. The 52s of time was lost in latency, nowhere else.

Here’s another definition of latency: Latency is wasted time.

So, look out for that latency. Think about it when selecting nested-loop table joins instead of full-table scans. Think about it when doing single-row processing in your Java app. Think about it before you blame your I/O system!