Archive for the ‘Performance’ category

Right Practice

September 16, 2011

Wow, it’s been a while since I wrote a post, sorry about that! I thought that I would take a brief break from the technical postings and espouse some opinion on something that has been bothering me for a while – ‘Best Practices.’

Best Practices have been around a long time, and started with very good intentions. In fact, one could easily claim that they are still produced with good intentions: To communicate methods that the hardware and software vendors recommend. However, the application of Best Practices has become increasingly abused in the field to the point where they have become more like prescriptions of how systems should be built. This has gone too far, and needs to be challenged.

Best Practices are a misnomer. At best, they should be referred to as Default Practices, because they are simply valid starting points for the implementation of a piece of technology. They do not describe the optimal (or even appropriate) recipe for implementation in every given situation. And yet: Best Practices are now becoming The Law.

It is increasingly difficult to deviate from a Best Practice, even when there are clear and logical reasons why this should happen. Best Practices are being used as an alternative to rational thought.

Consider this: Any given Best Practice is only applicable in a certain set of conditions. For example, it is ‘Best Practice’ not to fill a certain storage vendor’s array with its own SSD drives, it should only be partially populated. This has been prescribed by the storage vendor because the array has not been designed to cope with the high IOPs and bandwidth rates of every device in the array working at full, uncontended, tilt. This Best Practice is not saying that the SSDs will not work, it is simply saying that the full aggregate performance of all those SSDs cannot be sustained by the interconnects and processors in the array. Fair enough. But this Best Practice is being applied in the field as ‘SSDs are bad for the array’. What if your database requires consistent I/O response time of 1ms or less, requires masses of storage, but actually isn’t a very high IOPs or bandwidth consumer?  Isn’t this the ideal situation to fully populate the array with SSD, to achieve the consistent response time and deliver the required storage density? The array will easily cope with the throughput and IOPs rate, thus negating the Best Practice.

There are many such examples as this. It’s time to stop using Best Practice as a substitute for good old-fashioned thinking, and start to implement designs based upon Right Practice: The ‘Right’ Practice for your particular situation. By all means, start with Best Practice documentation, they often have good information within them. But read between the lines and understand the reasoning behind those recommendations, and apply these reasons to your particular application. Are they applicable?

We are big on the Right Practice approach at Scale Abilities, it’s one of the reasons our clients get big value from our involvement. I recommend that  you start to question the logic of Best Practices and start to come up with your own Right Practices.


SaneSAN2010: Serial to Serial – When One Bottleneck Isn’t Enough

August 23, 2010

I was recently looking into a storage-related performance problem at a customer site. The system was an Oracle 9 Linux system, Fibre Channel attached to an EMC DMX storage array. The DMX was replicated to a DR site using SRDF/S.

The problem was only really visible during the overnight batch runs, so AWR reports were the main source of information in diagnosis. In this case, they were more than sufficient, showing clear wait spikes for ‘free buffer waits’ and ‘log file parallel write’ during the problematic period. They were quite impressive, too – sixteen second latencies for some of the writes.

The customer was not oblivious to this fact, of course – it is difficult not to see a problem of such magnitude. They already had an engineering plan to move from SRDF/S (Synchronous) to SRDF/A (Asynchronous), as it was perceived that SRDF was the guilty party in this situation. I had been asked to validate this assumption and to determine the most appropriate roadmap for fixing these I/O problems on this highly critical system.

Of course, SRDF/S will always get the blame in such situations 🙂 I have been involved with many such configurations, and can indeed attest that SRDF/S can lead to trouble, particularly if the implications are not understood correctly. In this case, a very large financial institution, the storage team did indeed understand the main implication (more on this shortly) but, as is often the case in large organisations, the main source of the problem was actually what I call a boundary issue, one which falls between or across the technology focus of two or more teams. In this case, there were three teams involved, leading to configuration issues across all three areas.

Let’s go back to the SRDF implication, as it is the genesis of the problem. In synchronous mode, SRDF will only allow one outstanding write per hyper-volume. Any additional writes to that hyper will be serialised on the local DMX. The storage admin had understood this limitation, and had therefore combined many hyper volumes into a number of striped metavolumes, thus increasing the number of hypers that a given ‘lump of storage’ would contain. All well and good.

The system admin had created striped Veritas volumes over these metavolumes, thus striping even further. A filesystem was then built on the volumes, and presented to the DBA team. The DBAs then built the database and started it up. All apparently ran well for a few years until performance became intolerable, and that’s where my story begins.

I’m going to cut to the chase here, most of us don’t have time to read blogs all day long. There were three factors carefully conspiring on this system to ensure that the write performance was truly terrible:

  1. SRDF/S can only have one outstanding write per hypervolumethat’s a serialisation point.
  2. The filesystem in use was not deployed in any kind of ODM, quick I/O, or other UNIX file locking bypass technology – that’s a serialisation point.
  3. The database was not using Async I/O – that’s (pretty much) a serialisation point.

There you go – 1,2, 3, serialisation points from each of the three teams, none of which were understood by the other teams. Let’s step through Oracle attempting to write dirty buffers to disk during the batch run (the major wait times were observed on ‘free buffer waits’, so let’s start there):

  • DML creates dirty buffers
  • Oracle needs to create more dirty buffers to continue DML operation, but cannot because existing dirty buffers must be written to disk to create space
  • Oracle posts the DBWR process(es) to write out dirty buffers

(all the above happen on well-tuned, healthy systems also, though these may never struggle to have free buffers available because of well-performing writes)

  • DBWR scans the dirty list and issues writes one at a time to the operating system, waiting for each to complete before issuing the next. This is the lack of Async I/O configuration in the database
  • The operating system takes out a file lock (for write) on the datafile, and issues the write. No other DBWR processes can write to this file at this point. This is the side effect of having the wrong kind of filesystem, and implies that only one write can go to a file at any one time.
  • The operating system issues a write to the relevant meta on the DMX, which resolves to a specific hyper inside the box. That’s the single outstanding write for that hyper now in flight. No other writes can occur to that hyper at this point until this one is complete.

It’s easy to see how, when all the write I/Os are being fed to the DMX one at a time, that the additional latency of having SRDF in the equation makes a really big difference. It’s also easy to see that, by turning off SRDF, the problem will get less severe. I’m not defending EMC here, they deserve everything they get when it’s their fault. It just isn’t primarily an SRDF problem in this case. Yes, turning off SRDF or going SRDF/A will help, but it’s still fixing a downstream bottleneck.

The real culprit here is the file locking in the filesystem. This file locking is disabling the storage admin’s design of presenting many hypers up to the host to mitigate the SRDF overhead. In addition, operating system file locking on database files just just plain stupid, and I was hoping to have seen the last example of this in the early 90s; but this is the second one I’ve seen in 3 years… I’m not saying that the people that implement the systems this way are stupid, but it’s pretty easy to be naive about some critical areas when the complexity is so high and unavoidable boundary issues exist between the teams.

The lack of Async I/O is not good here, either, though the presence of multiple DBWRs is mitigating the impact somewhat, and the filesystem would quickly stomp on any improvements made by turning on Async I/O. I don’t believe that this filesystem would support Async anyway until the file locks were bypassed, so it’s two for the price of one here.

With multiple consecutive points of serialisation, it is not surprising that the system was struggling to achieve good throughput.

What’s the lesson here? There are two, really:

  1. Just knowing ‘your area’ isn’t enough.
  2. If you try to walk with two broken legs, you will fall down and bang your head. The fix, however, is not a painkiller for the headache.

EDIT: I have realised upon reflection that I only implied the reason that the file locking makes SRDF/S worse, rather than spelling it out. The reason it makes it worse is that it (file locking) enforces only a single write to that file at once. This means that this particular write is more than likely going to a single hypervolume, and thus eliminating any parallelism that might be achievable from SRDF. FYI, metavolumes have a stripe width of 960KB, so it’s really likely that any single write will only go to one hyper.

Sane SAN 2010 – Introduction

August 23, 2010

This year at the UKOUG Conference in Birmingham, acceptance permitting, I will present the successor to my original Sane SAN whitepaper first penned in 2000. The initial paper was spectacularly well received, relatively speaking, mostly because disk storage at that time was very much a black box to DBAs and a great deal of mystique surrounded its operation. Well, nothing much has changed on that front, so I figured it was very much time to update/rewrite the paper for modern technology and trends and try to impose my occasionally humble opinion on the reader 🙂

I’ve already an article on this blog that will form part of the paper, and I will write a few more over the next few weeks. Check out this one, and keep an eye on my blog for the next few weeks. The first one out the bag is “Serial to Serial – When One Bottleneck Isn’t Enough.”

If the UKOUG don’t accept it, I’ll post it on the Scale Abilities website anyway, and try to palm it off on some other conferences in 2011 🙂

“Flash” Storage Will Be Cheap – The End of the World is Nigh

May 29, 2010

A couple of weeks ago I tweeted a projection that the $/GB for flash drives will meet the $/GB for hard drives within 3-4 years. It was more of a feeling based upon current pricing with Moore’s Law applied than a well researched statement, but it felt about right. I’ve since been thinking some more about this within the context of current storage industry offerings from the likes of EMC, Netapp and Oracle, wondering what this might mean.

First of all I did a bit of research – if my 3-4 years guess-timate was out by an order of magnitude then there is not much point in writing this article (yet). I wanted to find out what the actual trends in flash memory pricing look like and how these might project over time, and I came across the following article: Enterprise Flash Drive Cost and Technology Projections. Though this article is now over a year old, it shows the following chart which illustrates the effect of the observed 60% year on year decline in flash memory pricing:

Flash Drive Pricing Projections

This 60% annual drop in costs is actually an accelerated version of Moore’s Law, and does not take into account any radical technology advances that may happen within the period.  This drop in costs is probably driven in the most part by the consumer thirst for flash technology in iPods and so forth, but naturally ripples back up into the enterprise space in the same way that Intel and AMD’s processor technologies do.

So let’s just assume that my guess-timate and the above chart are correct (they almost precisely agree) – what does that mean for storage products moving forward?

Looking at recent applications of flash technology, we see that EMC were the first off the blocks by offering the option of relatively large flash drives as drop-in replacements for their hard drives in their storage arrays. Netapp took a different approach of putting the flash memory in front of the drives as another caching layer in the stack. Oracle have various options in their (formerly Sun) product line and a formalised mechanism for using flash technology built into the 11g database software and into the Exadata v2 storage platform. Various vendors offer internal flash drives that look like hard drives to the operating system (whether connected by traditional storage interconnects such as SATA or by PCI Express). If  we assume that the cost of flash technology becomes equivalent to hard drive storage in the next three years, I believe all these technologies will quickly become the wrong way to deploy flash technology, and only one (Oracle) has an architecture which lends itself to the most appropriate future model (IMHO).

Let’s flip back to reality and look at how storage is used and where flash technology disrupts that when it is cheap enough to do so.

First, location of data: local or networked in some way? I don’t believe that flash technology disrupts this decision at all. Data will still need to be local in certain cases and networked via some high-speed technology in others, in much the same way as it is today. I believe that the networking technology will need to change for flash technology, but more on that later.

Next, the memory hierarchy: Where does current storage sit in the memory hierarchy? Well, of course, it is at the bottom of the pile, just above tape and other backup technologies if you include those. This is the crucial area where flash technology disrupts all current thinking – the final resting place for data is now close or equal to DRAM memory speeds. One disruptive implication of this is that storage interconnects (such as Fibre Channel, Ethernet, SAS and SATA) are now a latency and bandwidth bottleneck. The other, potentially huge, disruption is what happens to the software architecture when this bottleneck is removed.

Next, capacity: How does the flash capacity sit with hard drive capacity? Well that’s kind of the point of this posting… it’s currently quite a way behind, but my prediction is that they will be equal by 2013/2014. Importantly though, they will then start to accelerate away from hard drives. Given the exponential growth of data volumes, perhaps only semiconductor based storage can keep up with the demand?

Next, IOPs: This is the hugely disruptive part of flash technology, and is a direct result of a dramatically lowered latency (access time) when compared to hard disk technology. Not only is the latency lowered, but semiconductor-based storage is more or less contention-free given the absence of serialised moving parts such as a disk head. Think about it – the service time for a given hard drive I/O is directly by the preceding I/O and where the head was left on the platter. With solid-state storage this does not occur and service times are more uniform (though writes are consistently slower than reads).

These disruptions mean that the current architectures of storage systems are not making the most of semiconductor-based storage. Hey, why do I keep calling it “semiconductor-based storage” instead of SSD or flash? The reason is that the technologies used in this area are changing frequently, from DRAM-based systems to NOR-based flash to NAND based flash to DRAM-fronted flash; Single-level cells to Multi-level cells; battery-backed to “Super Cap” backed. Flash, as we know it today, could be outdated as a technology in the near future, but “semiconductor-based” storage is the future regardless.

I think that we now need technologies that look more like Oracle Exadata v2, with low-latency RDMA interfaces directly into the Operating System/Database. However, they need to easily and natively support other types of storage (unstructured data such as files, VMware datastores and so forth). The Exadata architecture lends itself well to changes in this area in both hardware trends and access protocols.

Perhaps more importantly, we are also only just beginning to understand the implications in software architecture for the disrupted memory hierarchy. We simply cannot continue to treat semiconductor-based storage as “fast disk” and need to start thinking, literally, outside the box.