Archive for the ‘Full Stack’ category

Right Practice

September 16, 2011

Wow, it’s been a while since I wrote a post, sorry about that! I thought that I would take a brief break from the technical postings and espouse some opinion on something that has been bothering me for a while – ‘Best Practices.’

Best Practices have been around a long time, and started with very good intentions. In fact, one could easily claim that they are still produced with good intentions: To communicate methods that the hardware and software vendors recommend. However, the application of Best Practices has become increasingly abused in the field to the point where they have become more like prescriptions of how systems should be built. This has gone too far, and needs to be challenged.

Best Practices are a misnomer. At best, they should be referred to as Default Practices, because they are simply valid starting points for the implementation of a piece of technology. They do not describe the optimal (or even appropriate) recipe for implementation in every given situation. And yet: Best Practices are now becoming The Law.

It is increasingly difficult to deviate from a Best Practice, even when there are clear and logical reasons why this should happen. Best Practices are being used as an alternative to rational thought.

Consider this: Any given Best Practice is only applicable in a certain set of conditions. For example, it is ‘Best Practice’ not to fill a certain storage vendor’s array with its own SSD drives, it should only be partially populated. This has been prescribed by the storage vendor because the array has not been designed to cope with the high IOPs and bandwidth rates of every device in the array working at full, uncontended, tilt. This Best Practice is not saying that the SSDs will not work, it is simply saying that the full aggregate performance of all those SSDs cannot be sustained by the interconnects and processors in the array. Fair enough. But this Best Practice is being applied in the field as ‘SSDs are bad for the array’. What if your database requires consistent I/O response time of 1ms or less, requires masses of storage, but actually isn’t a very high IOPs or bandwidth consumer?  Isn’t this the ideal situation to fully populate the array with SSD, to achieve the consistent response time and deliver the required storage density? The array will easily cope with the throughput and IOPs rate, thus negating the Best Practice.

There are many such examples as this. It’s time to stop using Best Practice as a substitute for good old-fashioned thinking, and start to implement designs based upon Right Practice: The ‘Right’ Practice for your particular situation. By all means, start with Best Practice documentation, they often have good information within them. But read between the lines and understand the reasoning behind those recommendations, and apply these reasons to your particular application. Are they applicable?

We are big on the Right Practice approach at Scale Abilities, it’s one of the reasons our clients get big value from our involvement. I recommend that  you start to question the logic of Best Practices and start to come up with your own Right Practices.


Sane SAN 2010: Fibre Channel – Ready, Aim, Fire

September 30, 2010

In my last blog entry I alluded to perhaps not being all that happy about Fibre Channel. Well, it’s true. I have been having a love/hate relationship with Fibre Channel for the last ten years or so, and we have now decided to get a divorce. I just can’t stand it any more!

I first fell in love with Fibre Channel in the late 90s: How could I resist the prospect of leaving behind multi-initiator SCSI with all it’s deep, deep electrical issues? Fibre Channel let me hook up multiple hosts to lots of drives, via a switch, and it let me dynamically attach and detach devices from multiple clustered nodes without reboots. Or so I thought. The reality of Fibre Channel is that it was indeed a revelation in its day, but some of that promise never really materialised until recently. And now it’s too late.

I have a number of problems with Fibre Channel as it stands today, and I’m not even going to mention the fact that it is falling behind in bandwidth. Whoops, I just did – try to pretend you didn’t just read that. The real problems are:

  1. It is complex
  2. It is expensive
  3. It is unreliable
  4. It is slow


Complexity. Complexity, complexity, complexity. I hate complexity. Complexity is the IT equivalent of communist bureaucracy – it isn’t remotely interesting, it wastes colossal amounts of time, and it ultimately causes the system to go down. Don’t confuse complexity with challenge – Challenge is having to solve new and interesting problems, Complexity is having to fix the same old problems time and time again and having to do it standing on one leg. So why do I think Fibre Channel is complex? For these reasons:

  1. The stack
  2. YANT

The Stack

If you have ever tried to manage the dependencies associated with maintaining a fully supported Fibre Channel infrastructure then you can probably already feel a knot in your stomach. For everyone else, let me explain.

Every component in a Fibre Channel stack needs to be certified to work with the other components. Operating System version, multipath I/O (mpio) drivers, HBA device drivers, HBA firmware, switch type, switch firmware and storage array firmware. So what happens when you want to, for example, upgrade your MPIO drivers? It is pretty standard for the following process to occur:

  • I want to upgrade to MPIO v2
  • MPIO v2 requires array firmware v42
  • Array firmware v42 requires HBA device driver v3.45
  • HBA device driver v3.45 requires the next release of the Operating System
  • The next release of the Operating System is not yet supported by the Array firmware
  • etc, etc

I think you get the point. But also remember that this wonderful array is shared across 38 different systems, all with different operating systems and HBAs, so the above process has to be followed for every single one, once you have a target release of array firmware that might work across all the platforms. If you are really really lucky, you might get a combination within those factorial possibilities that is actually certified by the array vendor.

Complex enough? Now add YANT…


Yet Another Networking Technology. I’m all in favour of having different types of networking technology, but not when the advantage is minuscule. All that training, proprietary hardware, cost, and so on: To justify that, the advantage had better be substantial. But it isn’t. Compare Fibre Channel to 10Gbps Ethernet, which is a universal networking standard, and it just doesn’t justify its own existence. To be fair to Fibre Channel, it was the original version of what we are now calling Converged Networking – it has always supported TCP/IP and SCSI protocols, and used to be way faster than Ethernet, but it just never got the traction it needed in that space.


It’s tough to argue against this one, Fibre Channel is expensive. 10Gbps Ethernet is also expensive, but the prices will be driven down by volume and ubiquity. In addition, Ethernet switches and so forth can be shared (if you must, that is: I’m still a fan of dedicated storage networks for reasons of reliability), whereas Fibre Channel must be dedicated. Infiniband is expensive too, and will probably stay that way, but it is providing a much higher performance solution than Fibre Channel.


What? Unreliable?

Yes, it’s true. It’s not an inherent problem with the technology itself; Fibre Channel is actually incredibly robust and I can’t fault that fact. However, the promise of real-life reliability is shattered by:

  • Large Fabrics
  • Complexity

What is the point of large fabrics? I can see the point of wanting to stretch I/O capability over a wide area, such as remote replication and so forth, but that does not imply that the whole storage universe of the enterprise should be constructed as a giant fabric, does it? Networks should be composed of relatively small, interconnected,  failure domains, so that traffic can flow, but the impact of a failure is limited in scope. Building a large fabric is going against that, and I’ve lost count of the number of catastrophic failures I’ve seen as a result of building The Dream Fabric.

Complexity; we’re back there again. Reliability is inversely proportional to complexity: High complexity = Low reliability, and vice versa. This is particularly true while we still entrust humans to administer these networks.


This is the final nail in the coffin. Times have changed, and Fibre Channel has no space in the new world. The way I see it, there are now just two preferred ways to attach storage to a server:

  • Ethernet-based NFS for general use
  • Infiniband-based for very low latency, high bandwidth use

The former approach is a ‘high enough’ performance solution for most current requirements, with ease of use and well understood protocols and technology. I’m not saying it’s quicker than Fibre Channel (though it certainly can be), just that it is fast enough for most things and is easy to put together and manage. The latter method, Infiniband (or similar), is a step up on both Ethernet and Fibre Channel in both higher bandwidth and lower latency, especially when used with RDMA. Infiniband has been a technology searching for a commercial purpose for some time now, and I believe that time has now come, via the route of semiconductor-based storage devices. Consider the following numbers:

  • Fibre Channel Latency: 10-20us (est)
  • Infiniband/RDMA Latency: 1us (est)

Now let’s see how these latencies compare to the those of a physical disk read,  and a read from a DRAM-based storage device:

  • Disk Read: 8,000 us (ie 8ms)
  • DRAM-based Storage read: 15us (source: TMS Ramsan 440 specification)
  • Ratio of FC latency to Disk Latency: 1:800 (1.25%)
  • Ratio of FC latency to DRAM Latency: 1:1.5  (80%)
  • Ratio of IB latency to Disk Latency: 1:8000 (0.125%)
  • Ratio of IB latency to DRAM latency: 1:15 (6.67%)

When comparing to disk reads, the Fibre Channel latency does not add much to the total I/O time. However, when accessing DRAM-based storage, it becomes a hugely dominant factor in the I/O time, whereas Infiniband is still single-digit percentage points. This is why I suggest that Fibre Channel has no role in the forthcoming high-performance storage systems. Fibre Channel is neither simple enough for simple systems, nor fast enough for high-performance systems.

Sane SAN2010: Storage Arrays – Ready, Aim, Fire

September 6, 2010

OK, this one might be contentious, but what the heck – somebody has to say it. Let’s start with a question:

Raise your hand if you have a feeling, even a slight one, that storage arrays suck?

Most DBAs and sysadmins that I speak to certainly have this feeling. They cannot understand why the performance of this very large and expensive array is nearly always lower than they achieve from the hard drive in their desktop computer. OK, so the array can do more aggregate IOPs, but why is it that 13ms, for example, is considered a reasonable average response time? Or worse, why is that some of my I/Os take several hundred milliseconds? And how is it possible that my database is reporting 500ms I/Os and the array is reporting that they are all less than 10ms? These are the questions that are lodged in the minds of my customers.

Storage Arrays do some things remarkably well. Availability, for example, is something that is pretty much nailed in Storage Array Land, both at the fabric layer and the actual array itself. There are exceptions: I think that large Fibre Channel fabrics are a High Availability disaster, and the cost of entry with director-class switches makes no sense when small fabrics can be built using commodity hardware. I have a more general opinion on Fibre Channel actually – it is an ex-format, it is pushing up the lillies. More on that in another blog post, though, I’m not done with the array yet!

The Storage Array became a real success when it became possible to access storage through Fibre Channel. Until then, the storage array was a niche product, except in the mainframe world where Escon was available. Fibre Channel, and subsequently Ethernet and Infiniband, enabled the array to be a big shared storage resource. For clustered database systems this was fantastic – an end to the pure hell of multi-initiator parallel SCSI. But then things started getting a little strange. EMC, for example, started advertising during Saturday morning children’s television about how their storage arrays allowed the direct sharing of information between applications. Well, even an eight year old knows that you can’t share raw data that way, it has to go via a computer to become anything meaningful. But this became the big selling point: all your data in one place. That also has the implication that all the types of data are the same – database files, VMware machines, backups, file shares. They are not, from an access pattern, criticality or business value standpoint, and so this model does not work. Some back-pedalling has occurred since then, notably in the form of formal tiered storage, but this is  still offered under guise of having all the data in one place – just on cheap or expensive drives.

So now we have this big, all-eggs-in-one-basket, expensive array. What have we achieved by this: everything is as slow as everything else. I visited a customer last week with just such an array, and this was the straw that broke the camel’s back (it’s been a long time coming). They have a heavily optimised database system that ultimately means the following speeds and feeds are demanded from the array:

  • 10 megabytes per second write, split into around 1300 IOPs
  • 300 reads per second

If you don’t have a good feel for I/O rates let me tell you: That is a very moderate amount of I/O. And yet the array is consistently returning I/Os that are in several hundred milliseconds, both reads and writes. Quite rightly, the customer thinks this is not very acceptable. Let’s have a little analysis of those numbers.

First the writes.: Half of those writes are sequential to the Oracle redo log and could easily be serviced by one physical drive (one 15k drive can sustain at least 100MB/s of sequential I/O). The rest of them are largely random (let’s assume 100% random), as they are dirty datafile blocks being written by the database writer. Again, a single drive could support 200 random writes per second, but let’s conservatively go for 100 – that means we need six or seven drives to support the physical write requirements of the database, plus one for the redo log. Then we need to add another three drives for the reads. That makes a very conservative total of eleven drives to keep up with the sustained workload for this customer, going straight to disk without any intermediate magic. This array also has quite a chunk of write-back cache, which means that  writes don’t actually even make it to disk before they can be acknowledged to the host/database. Why then, is this array struggling to delivery low latency I/O when it has sixty four drives inside it?

The answer is that a Storage Array is just a big computer itself. Instead of taking HTTP requests or keyboard input, it takes SCSI commands over Fibre Channel. Instead of returning a web page, it returns blocks of data. And like all complex computer systems, the array is subject to performance problems within itself. And the more complex the system, the more likely it is that performance problems will arise. To make things worse, the storage arrays have increasingly encouraged the admin to turn on more and more magic in the software to the point where it is now frequently an impossibility for the storage admin to determine how well a given storage allocation might perform. Modern storage administration has more to do with accountancy than it does performance and technology. Consider this equation:

(number of features) x (complexity of feature) = (total complexity)

Complexity breeds both performance problems and availability problems. This particular customer asked me if there was a way to guarantee that, when they replace this array, the new one will not have these problems. The answer is simple: ‘no’.

Yes, we can go right through the I/O stack, including all the components and software features of the array and fix them up. We can make sure that Fibre Channel ports are private to the host, remove all other workloads from the array so that there is no scheduling or capacity problems there. We can turn off all dynamic optimisations in the array software and we can layout the storage across known physical drives. Then, and only then, might there be a slim chance of a reduced number of high latency I/Os. I have a name for this way of operating a storage array. It’s call Direct Attached Storage (DAS): Welcome to the 1990s.

Now let me combine this reality with the other important aspect: semiconductor-based  storage. What happens when the pent up frustrations of the thousands of storage array owners meets the burgeoning reality of a new and faster storage that is now governed by some kind of accelerated form of Moore’s Law? As my business partner Jeff describes it: It’s gonna be a bloodbath.

I think that we will now see a sea change in the way we connect and use storage. It’s already started with products such as Oracle’s Exadata. I’m not saying that because I am an Oracle bigot (I won’t deny that), but because it is the right thing to do – it’s focused on doing one thing well and it uses emerging technology properly, rather than pretending nothing has changed. I don’t think it’s plug and play for many transactional customers (because of the RAC implication), but the storage component is on the money. Oh, and it is effectively DAS – a virtual machine of Windows running a virus scan won’t slow down your critical invoicing run.

I think that the way we use the storage will have to change too – storage just took a leap up the memory hierarchy. Low latency connectivity such as Infiniband will become more important, as will low latency request APIs, such as SRP. We simply cannot afford to waste time making the request when the response is no longer the major time component.

With all this change, is it now acceptable to have the vast majority of I/O latency accounted for in the complex software and hardware layers of a storage array? I don’t think so.

SaneSAN2010: Serial to Serial – When One Bottleneck Isn’t Enough

August 23, 2010

I was recently looking into a storage-related performance problem at a customer site. The system was an Oracle 9 Linux system, Fibre Channel attached to an EMC DMX storage array. The DMX was replicated to a DR site using SRDF/S.

The problem was only really visible during the overnight batch runs, so AWR reports were the main source of information in diagnosis. In this case, they were more than sufficient, showing clear wait spikes for ‘free buffer waits’ and ‘log file parallel write’ during the problematic period. They were quite impressive, too – sixteen second latencies for some of the writes.

The customer was not oblivious to this fact, of course – it is difficult not to see a problem of such magnitude. They already had an engineering plan to move from SRDF/S (Synchronous) to SRDF/A (Asynchronous), as it was perceived that SRDF was the guilty party in this situation. I had been asked to validate this assumption and to determine the most appropriate roadmap for fixing these I/O problems on this highly critical system.

Of course, SRDF/S will always get the blame in such situations 🙂 I have been involved with many such configurations, and can indeed attest that SRDF/S can lead to trouble, particularly if the implications are not understood correctly. In this case, a very large financial institution, the storage team did indeed understand the main implication (more on this shortly) but, as is often the case in large organisations, the main source of the problem was actually what I call a boundary issue, one which falls between or across the technology focus of two or more teams. In this case, there were three teams involved, leading to configuration issues across all three areas.

Let’s go back to the SRDF implication, as it is the genesis of the problem. In synchronous mode, SRDF will only allow one outstanding write per hyper-volume. Any additional writes to that hyper will be serialised on the local DMX. The storage admin had understood this limitation, and had therefore combined many hyper volumes into a number of striped metavolumes, thus increasing the number of hypers that a given ‘lump of storage’ would contain. All well and good.

The system admin had created striped Veritas volumes over these metavolumes, thus striping even further. A filesystem was then built on the volumes, and presented to the DBA team. The DBAs then built the database and started it up. All apparently ran well for a few years until performance became intolerable, and that’s where my story begins.

I’m going to cut to the chase here, most of us don’t have time to read blogs all day long. There were three factors carefully conspiring on this system to ensure that the write performance was truly terrible:

  1. SRDF/S can only have one outstanding write per hypervolumethat’s a serialisation point.
  2. The filesystem in use was not deployed in any kind of ODM, quick I/O, or other UNIX file locking bypass technology – that’s a serialisation point.
  3. The database was not using Async I/O – that’s (pretty much) a serialisation point.

There you go – 1,2, 3, serialisation points from each of the three teams, none of which were understood by the other teams. Let’s step through Oracle attempting to write dirty buffers to disk during the batch run (the major wait times were observed on ‘free buffer waits’, so let’s start there):

  • DML creates dirty buffers
  • Oracle needs to create more dirty buffers to continue DML operation, but cannot because existing dirty buffers must be written to disk to create space
  • Oracle posts the DBWR process(es) to write out dirty buffers

(all the above happen on well-tuned, healthy systems also, though these may never struggle to have free buffers available because of well-performing writes)

  • DBWR scans the dirty list and issues writes one at a time to the operating system, waiting for each to complete before issuing the next. This is the lack of Async I/O configuration in the database
  • The operating system takes out a file lock (for write) on the datafile, and issues the write. No other DBWR processes can write to this file at this point. This is the side effect of having the wrong kind of filesystem, and implies that only one write can go to a file at any one time.
  • The operating system issues a write to the relevant meta on the DMX, which resolves to a specific hyper inside the box. That’s the single outstanding write for that hyper now in flight. No other writes can occur to that hyper at this point until this one is complete.

It’s easy to see how, when all the write I/Os are being fed to the DMX one at a time, that the additional latency of having SRDF in the equation makes a really big difference. It’s also easy to see that, by turning off SRDF, the problem will get less severe. I’m not defending EMC here, they deserve everything they get when it’s their fault. It just isn’t primarily an SRDF problem in this case. Yes, turning off SRDF or going SRDF/A will help, but it’s still fixing a downstream bottleneck.

The real culprit here is the file locking in the filesystem. This file locking is disabling the storage admin’s design of presenting many hypers up to the host to mitigate the SRDF overhead. In addition, operating system file locking on database files just just plain stupid, and I was hoping to have seen the last example of this in the early 90s; but this is the second one I’ve seen in 3 years… I’m not saying that the people that implement the systems this way are stupid, but it’s pretty easy to be naive about some critical areas when the complexity is so high and unavoidable boundary issues exist between the teams.

The lack of Async I/O is not good here, either, though the presence of multiple DBWRs is mitigating the impact somewhat, and the filesystem would quickly stomp on any improvements made by turning on Async I/O. I don’t believe that this filesystem would support Async anyway until the file locks were bypassed, so it’s two for the price of one here.

With multiple consecutive points of serialisation, it is not surprising that the system was struggling to achieve good throughput.

What’s the lesson here? There are two, really:

  1. Just knowing ‘your area’ isn’t enough.
  2. If you try to walk with two broken legs, you will fall down and bang your head. The fix, however, is not a painkiller for the headache.

EDIT: I have realised upon reflection that I only implied the reason that the file locking makes SRDF/S worse, rather than spelling it out. The reason it makes it worse is that it (file locking) enforces only a single write to that file at once. This means that this particular write is more than likely going to a single hypervolume, and thus eliminating any parallelism that might be achievable from SRDF. FYI, metavolumes have a stripe width of 960KB, so it’s really likely that any single write will only go to one hyper.