Right Practice

Posted September 16, 2011 by James Morle
Categories: Full Stack, Generic, Oracle, Performance, SaneSAN2010, Storage


Wow, it’s been a while since I wrote a post, sorry about that! I thought that I would take a brief break from the technical postings and espouse some opinion on something that has been bothering me for a while – ‘Best Practices.’

Best Practices have been around a long time, and started with very good intentions. In fact, one could easily claim that they are still produced with good intentions: To communicate methods that the hardware and software vendors recommend. However, the application of Best Practices has become increasingly abused in the field to the point where they have become more like prescriptions of how systems should be built. This has gone too far, and needs to be challenged.

Best Practices are a misnomer. At best, they should be referred to as Default Practices, because they are simply valid starting points for the implementation of a piece of technology. They do not describe the optimal (or even appropriate) recipe for implementation in every given situation. And yet: Best Practices are now becoming The Law.

It is increasingly difficult to deviate from a Best Practice, even when there are clear and logical reasons why this should happen. Best Practices are being used as an alternative to rational thought.

Consider this: Any given Best Practice is only applicable in a certain set of conditions. For example, it is ‘Best Practice’ not to fill a certain storage vendor’s array with its own SSDs; the array should only be partially populated. This has been prescribed by the storage vendor because the array has not been designed to cope with the high IOPs and bandwidth rates of every device in the array working at full, uncontended, tilt. This Best Practice is not saying that the SSDs will not work; it is simply saying that the full aggregate performance of all those SSDs cannot be sustained by the interconnects and processors in the array. Fair enough. But this Best Practice is being applied in the field as ‘SSDs are bad for the array’. What if your database requires consistent I/O response time of 1ms or less, requires masses of storage, but actually isn’t a very high IOPs or bandwidth consumer? Isn’t this the ideal situation to fully populate the array with SSD, to achieve the consistent response time and deliver the required storage density? The array will easily cope with the throughput and IOPs rate, thus negating the Best Practice.

There are many such examples as this. It’s time to stop using Best Practice as a substitute for good old-fashioned thinking, and start to implement designs based upon Right Practice: The ‘Right’ Practice for your particular situation. By all means, start with Best Practice documentation; it often contains good information. But read between the lines and understand the reasoning behind those recommendations, and apply these reasons to your particular application. Are they applicable?

We are big on the Right Practice approach at Scale Abilities; it’s one of the reasons our clients get big value from our involvement. I recommend that you start to question the logic of Best Practices and start to come up with your own Right Practices.


Sane SAN 2010: Fibre Channel – Ready, Aim, Fire

Posted September 30, 2010 by James Morle
Categories: Full Stack, SaneSAN2010, Storage

In my last blog entry I alluded to perhaps not being all that happy about Fibre Channel. Well, it’s true. I have been having a love/hate relationship with Fibre Channel for the last ten years or so, and we have now decided to get a divorce. I just can’t stand it any more!

I first fell in love with Fibre Channel in the late 90s: How could I resist the prospect of leaving behind multi-initiator SCSI with all its deep, deep electrical issues? Fibre Channel let me hook up multiple hosts to lots of drives, via a switch, and it let me dynamically attach and detach devices from multiple clustered nodes without reboots. Or so I thought. The reality of Fibre Channel is that it was indeed a revelation in its day, but some of that promise never really materialised until recently. And now it’s too late.

I have a number of problems with Fibre Channel as it stands today, and I’m not even going to mention the fact that it is falling behind in bandwidth. Whoops, I just did – try to pretend you didn’t just read that. The real problems are:

  1. It is complex
  2. It is expensive
  3. It is unreliable
  4. It is slow

Complexity

Complexity. Complexity, complexity, complexity. I hate complexity. Complexity is the IT equivalent of communist bureaucracy – it isn’t remotely interesting, it wastes colossal amounts of time, and it ultimately causes the system to go down. Don’t confuse complexity with challenge – Challenge is having to solve new and interesting problems, Complexity is having to fix the same old problems time and time again and having to do it standing on one leg. So why do I think Fibre Channel is complex? For these reasons:

  1. The stack
  2. YANT

The Stack

If you have ever tried to manage the dependencies associated with maintaining a fully supported Fibre Channel infrastructure then you can probably already feel a knot in your stomach. For everyone else, let me explain.

Every component in a Fibre Channel stack needs to be certified to work with the other components. Operating System version, multipath I/O (mpio) drivers, HBA device drivers, HBA firmware, switch type, switch firmware and storage array firmware. So what happens when you want to, for example, upgrade your MPIO drivers? It is pretty standard for the following process to occur:

  • I want to upgrade to MPIO v2
  • MPIO v2 requires array firmware v42
  • Array firmware v42 requires HBA device driver v3.45
  • HBA device driver v3.45 requires the next release of the Operating System
  • The next release of the Operating System is not yet supported by the Array firmware
  • etc, etc

I think you get the point. But also remember that this wonderful array is shared across 38 different systems, all with different operating systems and HBAs, so the above process has to be followed for every single one, once you have a target release of array firmware that might work across all the platforms. If you are really really lucky, you might get a combination within those factorial possibilities that is actually certified by the array vendor.

Complex enough? Now add YANT…

YANT

Yet Another Networking Technology. I’m all in favour of having different types of networking technology, but not when the advantage is minuscule. All that training, proprietary hardware, cost, and so on: To justify that, the advantage had better be substantial. But it isn’t. Compare Fibre Channel to 10Gbps Ethernet, which is a universal networking standard, and it just doesn’t justify its own existence. To be fair to Fibre Channel, it was the original version of what we are now calling Converged Networking – it has always supported TCP/IP and SCSI protocols, and used to be way faster than Ethernet, but it just never got the traction it needed in that space.

Expensive

It’s tough to argue against this one, Fibre Channel is expensive. 10Gbps Ethernet is also expensive, but the prices will be driven down by volume and ubiquity. In addition, Ethernet switches and so forth can be shared (if you must, that is: I’m still a fan of dedicated storage networks for reasons of reliability), whereas Fibre Channel must be dedicated. Infiniband is expensive too, and will probably stay that way, but it is providing a much higher performance solution than Fibre Channel.

Unreliable

What? Unreliable?

Yes, it’s true. It’s not an inherent problem with the technology itself; Fibre Channel is actually incredibly robust and I can’t fault that fact. However, the promise of real-life reliability is shattered by:

  • Large Fabrics
  • Complexity

What is the point of large fabrics? I can see the point of wanting to stretch I/O capability over a wide area, such as remote replication and so forth, but that does not imply that the whole storage universe of the enterprise should be constructed as a giant fabric, does it? Networks should be composed of relatively small, interconnected failure domains, so that traffic can flow, but the impact of a failure is limited in scope. Building a large fabric is going against that, and I’ve lost count of the number of catastrophic failures I’ve seen as a result of building The Dream Fabric.

Complexity; we’re back there again. Reliability is inversely proportional to complexity: High complexity = Low reliability, and vice versa. This is particularly true while we still entrust humans to administer these networks.

Slow

This is the final nail in the coffin. Times have changed, and Fibre Channel has no space in the new world. The way I see it, there are now just two preferred ways to attach storage to a server:

  • Ethernet-based NFS for general use
  • Infiniband-based for very low latency, high bandwidth use

The former approach is a ‘high enough’ performance solution for most current requirements, with ease of use and well understood protocols and technology. I’m not saying it’s quicker than Fibre Channel (though it certainly can be), just that it is fast enough for most things and is easy to put together and manage. The latter method, Infiniband (or similar), is a step up on both Ethernet and Fibre Channel in both higher bandwidth and lower latency, especially when used with RDMA. Infiniband has been a technology searching for a commercial purpose for some time now, and I believe that time has now come, via the route of semiconductor-based storage devices. Consider the following numbers:

  • Fibre Channel Latency: 10-20us (est)
  • Infiniband/RDMA Latency: 1us (est)

Now let’s see how these latencies compare to those of a physical disk read, and a read from a DRAM-based storage device:

  • Disk Read: 8,000 us (ie 8ms)
  • DRAM-based Storage read: 15us (source: TMS Ramsan 440 specification)
  • Ratio of FC latency to Disk Latency: 1:800 (0.125%)
  • Ratio of FC latency to DRAM Latency: 1:1.5  (about 67%)
  • Ratio of IB latency to Disk Latency: 1:8000 (0.0125%)
  • Ratio of IB latency to DRAM latency: 1:15 (6.67%)

When comparing to disk reads, the Fibre Channel latency does not add much to the total I/O time. However, when accessing DRAM-based storage, it becomes a hugely dominant factor in the I/O time, whereas Infiniband is still single-digit percentage points. This is why I suggest that Fibre Channel has no role in the forthcoming high-performance storage systems. Fibre Channel is neither simple enough for simple systems, nor fast enough for high-performance systems.
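For anyone who wants to play with these numbers, here is a minimal sketch of the same arithmetic. It uses only the estimates quoted above; taking Fibre Channel at the low end of its 10-20us range is my assumption, and the function name is purely illustrative.

```python
# Latency estimates quoted above, in microseconds.
# ASSUMPTION: Fibre Channel is taken at the low end of the 10-20us estimate.
FC_US = 10.0
IB_US = 1.0
DISK_READ_US = 8000.0
DRAM_READ_US = 15.0


def interconnect_overhead(link_us, device_us):
    """Return the link latency as a percentage of the device's own read latency."""
    return 100.0 * link_us / device_us


for link_name, link_us in (("FC", FC_US), ("IB", IB_US)):
    for dev_name, dev_us in (("disk", DISK_READ_US), ("DRAM", DRAM_READ_US)):
        pct = interconnect_overhead(link_us, dev_us)
        print(f"{link_name} latency is {pct:.4g}% of a {dev_name} read")
```

Plug in 20us for Fibre Channel and the interconnect costs more than the DRAM read itself.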

Sane SAN2010: Storage Arrays – Ready, Aim, Fire

Posted September 6, 2010 by James Morle
Categories: Full Stack, Oracle, SaneSAN2010, Storage

OK, this one might be contentious, but what the heck – somebody has to say it. Let’s start with a question:

Raise your hand if you have a feeling, even a slight one, that storage arrays suck?

Most DBAs and sysadmins that I speak to certainly have this feeling. They cannot understand why the performance of this very large and expensive array is nearly always lower than they achieve from the hard drive in their desktop computer. OK, so the array can do more aggregate IOPs, but why is it that 13ms, for example, is considered a reasonable average response time? Or worse, why is it that some of my I/Os take several hundred milliseconds? And how is it possible that my database is reporting 500ms I/Os and the array is reporting that they are all less than 10ms? These are the questions that are lodged in the minds of my customers.

Storage Arrays do some things remarkably well. Availability, for example, is something that is pretty much nailed in Storage Array Land, both at the fabric layer and the actual array itself. There are exceptions: I think that large Fibre Channel fabrics are a High Availability disaster, and the cost of entry with director-class switches makes no sense when small fabrics can be built using commodity hardware. I have a more general opinion on Fibre Channel actually – it is an ex-format, it is pushing up the lillies. More on that in another blog post, though, I’m not done with the array yet!

The Storage Array became a real success when it became possible to access storage through Fibre Channel. Until then, the storage array was a niche product, except in the mainframe world where Escon was available. Fibre Channel, and subsequently Ethernet and Infiniband, enabled the array to be a big shared storage resource. For clustered database systems this was fantastic – an end to the pure hell of multi-initiator parallel SCSI. But then things started getting a little strange. EMC, for example, started advertising during Saturday morning children’s television about how their storage arrays allowed the direct sharing of information between applications. Well, even an eight year old knows that you can’t share raw data that way, it has to go via a computer to become anything meaningful. But this became the big selling point: all your data in one place. That also has the implication that all the types of data are the same – database files, VMware machines, backups, file shares. They are not, from an access pattern, criticality or business value standpoint, and so this model does not work. Some back-pedalling has occurred since then, notably in the form of formal tiered storage, but this is still offered under the guise of having all the data in one place – just on cheap or expensive drives.

So now we have this big, all-eggs-in-one-basket, expensive array. What have we achieved by this? Everything is as slow as everything else. I visited a customer last week with just such an array, and this was the straw that broke the camel’s back (it’s been a long time coming). They have a heavily optimised database system that ultimately means the following speeds and feeds are demanded from the array:

  • 10 megabytes per second write, split into around 1300 IOPs
  • 300 reads per second

If you don’t have a good feel for I/O rates let me tell you: That is a very moderate amount of I/O. And yet the array is consistently returning I/Os that are in several hundred milliseconds, both reads and writes. Quite rightly, the customer thinks this is not very acceptable. Let’s have a little analysis of those numbers.

First, the writes: Half of those writes are sequential to the Oracle redo log and could easily be serviced by one physical drive (one 15k drive can sustain at least 100MB/s of sequential I/O). The rest of them are largely random (let’s assume 100% random), as they are dirty datafile blocks being written by the database writer. Again, a single drive could support 200 random writes per second, but let’s conservatively go for 100 – that means we need six or seven drives to support the physical write requirements of the database, plus one for the redo log. Then we need to add another three drives for the reads. That makes a very conservative total of eleven drives to keep up with the sustained workload for this customer, going straight to disk without any intermediate magic. This array also has quite a chunk of write-back cache, which means that writes don’t actually even make it to disk before they can be acknowledged to the host/database. Why then, is this array struggling to deliver low latency I/O when it has sixty-four drives inside it?
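The back-of-envelope sizing above can be written down as a small sketch. The workload figures and the conservative 100 random I/Os per second per drive come from the text; the function and variable names are illustrative only, and this is a rough estimate rather than a sizing tool.

```python
import math

# Workload figures quoted above.
WRITE_IOPS = 1300
READ_IOPS = 300

# Conservative per-drive capability used in the text (a drive could do ~200).
RANDOM_IOPS_PER_DRIVE = 100


def drives_needed(iops, per_drive_iops):
    """Round up to whole drives needed to sustain the given random I/O rate."""
    return math.ceil(iops / per_drive_iops)


# Half of the writes are sequential redo, serviced by a single drive.
redo_drives = 1
random_write_iops = WRITE_IOPS / 2
datafile_write_drives = drives_needed(random_write_iops, RANDOM_IOPS_PER_DRIVE)
read_drives = drives_needed(READ_IOPS, RANDOM_IOPS_PER_DRIVE)

total_drives = redo_drives + datafile_write_drives + read_drives
print(f"redo={redo_drives} writes={datafile_write_drives} "
      f"reads={read_drives} total={total_drives}")
```

Rounding the 6.5 drives for random writes up to seven is what gives the total of eleven.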

The answer is that a Storage Array is just a big computer itself. Instead of taking HTTP requests or keyboard input, it takes SCSI commands over Fibre Channel. Instead of returning a web page, it returns blocks of data. And like all complex computer systems, the array is subject to performance problems within itself. And the more complex the system, the more likely it is that performance problems will arise. To make things worse, the storage arrays have increasingly encouraged the admin to turn on more and more magic in the software to the point where it is now frequently an impossibility for the storage admin to determine how well a given storage allocation might perform. Modern storage administration has more to do with accountancy than it does performance and technology. Consider this equation:

(number of features) x (complexity of feature) = (total complexity)

Complexity breeds both performance problems and availability problems. This particular customer asked me if there was a way to guarantee that, when they replace this array, the new one will not have these problems. The answer is simple: ‘no’.

Yes, we can go right through the I/O stack, including all the components and software features of the array and fix them up. We can make sure that Fibre Channel ports are private to the host, remove all other workloads from the array so that there are no scheduling or capacity problems there. We can turn off all dynamic optimisations in the array software and we can lay out the storage across known physical drives. Then, and only then, might there be a slim chance of a reduced number of high latency I/Os. I have a name for this way of operating a storage array. It’s called Direct Attached Storage (DAS): Welcome to the 1990s.

Now let me combine this reality with the other important aspect: semiconductor-based storage. What happens when the pent up frustrations of the thousands of storage array owners meet the burgeoning reality of a new and faster storage that is now governed by some kind of accelerated form of Moore’s Law? As my business partner Jeff describes it: It’s gonna be a bloodbath.

I think that we will now see a sea change in the way we connect and use storage. It’s already started with products such as Oracle’s Exadata. I’m not saying that because I am an Oracle bigot (I won’t deny that), but because it is the right thing to do – it’s focused on doing one thing well and it uses emerging technology properly, rather than pretending nothing has changed. I don’t think it’s plug and play for many transactional customers (because of the RAC implication), but the storage component is on the money. Oh, and it is effectively DAS – a virtual machine of Windows running a virus scan won’t slow down your critical invoicing run.

I think that the way we use the storage will have to change too – storage just took a leap up the memory hierarchy. Low latency connectivity such as Infiniband will become more important, as will low latency request APIs, such as SRP. We simply cannot afford to waste time making the request when the response is no longer the major time component.

With all this change, is it now acceptable to have the vast majority of I/O latency accounted for in the complex software and hardware layers of a storage array? I don’t think so.

SaneSAN2010: Serial to Serial – When One Bottleneck Isn’t Enough

Posted August 23, 2010 by James Morle
Categories: Boundary Issues, Full Stack, Oracle, Performance, SaneSAN2010, Storage

I was recently looking into a storage-related performance problem at a customer site. The system was an Oracle 10.2.0.4/SLES 9 Linux system, Fibre Channel attached to an EMC DMX storage array. The DMX was replicated to a DR site using SRDF/S.

The problem was only really visible during the overnight batch runs, so AWR reports were the main source of information in diagnosis. In this case, they were more than sufficient, showing clear wait spikes for ‘free buffer waits’ and ‘log file parallel write’ during the problematic period. They were quite impressive, too – sixteen second latencies for some of the writes.

The customer was not oblivious to this fact, of course – it is difficult not to see a problem of such magnitude. They already had an engineering plan to move from SRDF/S (Synchronous) to SRDF/A (Asynchronous), as it was perceived that SRDF was the guilty party in this situation. I had been asked to validate this assumption and to determine the most appropriate roadmap for fixing these I/O problems on this highly critical system.

Of course, SRDF/S will always get the blame in such situations 🙂 I have been involved with many such configurations, and can indeed attest that SRDF/S can lead to trouble, particularly if the implications are not understood correctly. In this case, a very large financial institution, the storage team did indeed understand the main implication (more on this shortly) but, as is often the case in large organisations, the main source of the problem was actually what I call a boundary issue, one which falls between or across the technology focus of two or more teams. In this case, there were three teams involved, leading to configuration issues across all three areas.

Let’s go back to the SRDF implication, as it is the genesis of the problem. In synchronous mode, SRDF will only allow one outstanding write per hyper-volume. Any additional writes to that hyper will be serialised on the local DMX. The storage admin had understood this limitation, and had therefore combined many hyper volumes into a number of striped metavolumes, thus increasing the number of hypers that a given ‘lump of storage’ would contain. All well and good.

The system admin had created striped Veritas volumes over these metavolumes, thus striping even further. A filesystem was then built on the volumes, and presented to the DBA team. The DBAs then built the database and started it up. All apparently ran well for a few years until performance became intolerable, and that’s where my story begins.

I’m going to cut to the chase here, most of us don’t have time to read blogs all day long. There were three factors carefully conspiring on this system to ensure that the write performance was truly terrible:

  1. SRDF/S can only have one outstanding write per hypervolume – that’s a serialisation point.
  2. The filesystem in use was not deployed in any kind of ODM, quick I/O, or other UNIX file locking bypass technology – that’s a serialisation point.
  3. The database was not using Async I/O – that’s (pretty much) a serialisation point.

There you go – 1, 2, 3, serialisation points from each of the three teams, none of which were understood by the other teams. Let’s step through Oracle attempting to write dirty buffers to disk during the batch run (the major wait times were observed on ‘free buffer waits’, so let’s start there):

  • DML creates dirty buffers
  • Oracle needs to create more dirty buffers to continue DML operation, but cannot because existing dirty buffers must be written to disk to create space
  • Oracle posts the DBWR process(es) to write out dirty buffers

(all the above happen on well-tuned, healthy systems also, though these may never struggle to have free buffers available because of well-performing writes)

  • DBWR scans the dirty list and issues writes one at a time to the operating system, waiting for each to complete before issuing the next. This is the lack of Async I/O configuration in the database
  • The operating system takes out a file lock (for write) on the datafile, and issues the write. No other DBWR processes can write to this file at this point. This is the side effect of having the wrong kind of filesystem, and implies that only one write can go to a file at any one time.
  • The operating system issues a write to the relevant meta on the DMX, which resolves to a specific hyper inside the box. That’s the single outstanding write for that hyper now in flight. No other writes can occur to that hyper at this point until this one is complete.

It’s easy to see that, when all the write I/Os are being fed to the DMX one at a time, the additional latency of having SRDF in the equation makes a really big difference. It’s also easy to see that, by turning off SRDF, the problem will get less severe. I’m not defending EMC here, they deserve everything they get when it’s their fault. It just isn’t primarily an SRDF problem in this case. Yes, turning off SRDF or going SRDF/A will help, but it’s still fixing a downstream bottleneck.
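A rough way to reason about this is Little’s Law: sustained throughput equals the number of writes in flight divided by the latency of each write. The sketch below is only an illustration. The latencies are hypothetical numbers I have picked to show the shape of the problem, not measurements from this system.

```python
def max_writes_per_second(outstanding_writes, latency_ms):
    """Little's Law: throughput = concurrency / latency."""
    return outstanding_writes * 1000.0 / latency_ms


# HYPOTHETICAL latencies, for illustration only.
LOCAL_WRITE_MS = 1.0    # write acknowledged by the local array cache
SRDF_S_WRITE_MS = 5.0   # same write plus a synchronous round trip to the DR site

for label, in_flight in (("fully serialised", 1), ("8 writes in flight", 8)):
    local = max_writes_per_second(in_flight, LOCAL_WRITE_MS)
    replicated = max_writes_per_second(in_flight, SRDF_S_WRITE_MS)
    print(f"{label}: {local:.0f}/s local, {replicated:.0f}/s with sync replication")
```

With only one write ever in flight, every extra millisecond of replication latency comes straight off the throughput. Removing the replication latency raises the ceiling, but so does restoring the concurrency that the upstream layers took away.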

The real culprit here is the file locking in the filesystem. This file locking is disabling the storage admin’s design of presenting many hypers up to the host to mitigate the SRDF overhead. In addition, operating system file locking on database files is just plain stupid, and I was hoping to have seen the last example of this in the early 90s; but this is the second one I’ve seen in 3 years… I’m not saying that the people that implement the systems this way are stupid, but it’s pretty easy to be naive about some critical areas when the complexity is so high and unavoidable boundary issues exist between the teams.

The lack of Async I/O is not good here, either, though the presence of multiple DBWRs is mitigating the impact somewhat, and the filesystem would quickly stomp on any improvements made by turning on Async I/O. I don’t believe that this filesystem would support Async anyway until the file locks were bypassed, so it’s two for the price of one here.

With multiple consecutive points of serialisation, it is not surprising that the system was struggling to achieve good throughput.

What’s the lesson here? There are two, really:

  1. Just knowing ‘your area’ isn’t enough.
  2. If you try to walk with two broken legs, you will fall down and bang your head. The fix, however, is not a painkiller for the headache.

EDIT: I have realised upon reflection that I only implied the reason that the file locking makes SRDF/S worse, rather than spelling it out. The reason it makes it worse is that it (file locking) enforces only a single write to that file at once. This means that this particular write is more than likely going to a single hypervolume, thus eliminating any parallelism that might be achievable from SRDF. FYI, metavolumes have a stripe width of 960KB, so it’s really likely that any single write will only go to one hyper.
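To make that last point concrete, here is a minimal sketch of which hypers a single write would touch. It assumes that 960KB is the contiguous chunk placed on each hyper before the metavolume moves on to the next one, and that chunks are laid out round-robin. The eight-member metavolume and the write sizes are hypothetical.

```python
STRIPE_KB = 960  # from the text; ASSUMED to be the contiguous chunk per hyper


def hypers_touched(offset_kb, size_kb, members, stripe_kb=STRIPE_KB):
    """Return the member hypers covered by one write, assuming round-robin striping."""
    first_chunk = offset_kb // stripe_kb
    last_chunk = (offset_kb + size_kb - 1) // stripe_kb
    return sorted({chunk % members for chunk in range(first_chunk, last_chunk + 1)})


MEMBERS = 8  # HYPOTHETICAL number of hypers in the metavolume

# A small database-block-sized write lands on a single hyper...
print(hypers_touched(offset_kb=4096, size_kb=8, members=MEMBERS))
# ...unless it happens to straddle a chunk boundary.
print(hypers_touched(offset_kb=956, size_kb=8, members=MEMBERS))
```

Under these assumptions, one write at a time per file means, in practice, one hyper busy at a time per file.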

Sane SAN 2010 – Introduction

Posted August 23, 2010 by James Morle
Categories: Oracle, Performance, SaneSAN2010, Storage

This year at the UKOUG Conference in Birmingham, acceptance permitting, I will present the successor to my original Sane SAN whitepaper first penned in 2000. The initial paper was spectacularly well received, relatively speaking, mostly because disk storage at that time was very much a black box to DBAs and a great deal of mystique surrounded its operation. Well, nothing much has changed on that front, so I figured it was very much time to update/rewrite the paper for modern technology and trends and try to impose my occasionally humble opinion on the reader 🙂

I already have an article on this blog that will form part of the paper, and I will write a few more over the next few weeks. Check out this one, and keep an eye on my blog for the next few weeks. The first one out the bag is “Serial to Serial – When One Bottleneck Isn’t Enough.”

If the UKOUG don’t accept it, I’ll post it on the Scale Abilities website anyway, and try to palm it off on some other conferences in 2011 🙂

Log File Sync and AWR – Not Good Bedfellows

Posted June 2, 2010 by James Morle
Categories: Oracle

I’m sure many of you are already enlightened enough about the dangers of ratios and averages, the most famous blunder of all being the ‘cache hit ratio’. It’s not that all ratios and averages are useless, but it is surely the case that most are useless, and that all can be misleading. This posting is about an example of a misleading average, one that hides the truth from the reader unless they look elsewhere.

I am actually a big fan of AWR (and Statspack). They are chock full of misleading averages and ratios, but that does not mean they don’t have value. Often it is more a case of reading what is absent from the report, reading between the lines and so forth, that makes them so valuable. I’m also frequently asked to fix problems that are no longer occurring – without the history that AWR provides this would be an impossible task, assuming that Active Session History would therefore also be unavailable.

I was recently asked to look over some AWR reports from a performance testing environment  to determine what was going on. As usual, I started by looking over the top section of the reports to get a feel for the workload and to look at the wait events. Here’s the load profile:

                 Per Second   Per Transaction
                 ----------   ---------------
Redo size:       140,604.47            325.65
Logical reads:    16,712.80             38.71
Block changes:       904.42              2.09
Physical reads:        7.59              0.02
Physical writes:      70.76              0.16
User calls:        2,069.02              4.79
Parses:              342.31              0.79
Hard parses:           0.07              0
Sorts:                52.54              0.12
Logons:                0.47              0
Executes:            342.82              0.79
Transactions:        431.77

A few things immediately stand out on this load profile:

  • this is not a really busy system
  • there is virtually nil I/O to/from the datafiles
  • it is using bind variables (hard parses is low)
  • it is still making parse calls for every statement (parses=executes)
  • it commits or rolls back very aggressively (transactions high compared to executes)

It’s the last point that is most notable here. Transactions is actually higher than executes, meaning there are more commit/rollback calls than there are calls to execute SQL. So, we know we have an application that is committing pretty much after every statement. Let’s move on to the wait events:
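The observations above boil down to a handful of ratios taken from the load profile. A quick sketch, using only the per-second figures from the report (the dictionary and variable names are my own):

```python
# Per-second figures copied from the load profile above.
load_profile = {
    "parses": 342.31,
    "hard_parses": 0.07,
    "executes": 342.82,
    "transactions": 431.77,
}

# Close to 1.0 means almost every execute is preceded by a parse call.
parses_per_execute = load_profile["parses"] / load_profile["executes"]

# A tiny percentage means the statements are being shared (bind variables).
hard_parse_pct = 100.0 * load_profile["hard_parses"] / load_profile["parses"]

# Above 1.0 means more commit/rollback calls than SQL executions.
transactions_per_execute = load_profile["transactions"] / load_profile["executes"]

print(f"parses per execute:       {parses_per_execute:.3f}")
print(f"hard parses:              {hard_parse_pct:.3f}% of parses")
print(f"transactions per execute: {transactions_per_execute:.3f}")
```

These are still ratios of averages, so they carry the usual health warning. They are only good for a first impression of the workload.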

Event                       Waits  Time(s)  AvgWait(ms)  % Total  Wait Class
-----                       -----  -------  -----------  -------  ----------
log file sync             357,552   22,427           63     90.4  Commit
CPU time                             1,636                   6.6
log file parallel write   275,265      972            4      3.9  System I/O
read by other session       7,848      533           68      2.1  User I/O
db file parallel write     87,473      325            4      1.3  System I/O

It’s pretty clear from the wait events that this really is a commit-heavy application. I validated this with the developer – each transaction is basically an INSERT or an UPDATE followed by a commit. But here’s the mystery – why are the ‘log file sync’ wait times so much higher than the ‘log file parallel write’ times? An average write time of 4ms isn’t fantastic by any stretch of the imagination, but it’s not bad, and where are the other 59ms going to? I had a bit of a think about this and found the answer – the truth is hidden between the lines, and disguised by the averaging inside AWR.
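The averages are simply total wait time divided by the number of waits, so the gap can be recomputed from the wait counts and times in the report. A small sketch (the names are illustrative):

```python
# (waits, total wait time in seconds) copied from the wait event section above.
wait_events = {
    "log file sync": (357_552, 22_427),
    "log file parallel write": (275_265, 972),
}


def average_wait_ms(event):
    """Average wait in milliseconds = total time / number of waits."""
    waits, time_s = wait_events[event]
    return time_s * 1000.0 / waits


sync_ms = average_wait_ms("log file sync")
write_ms = average_wait_ms("log file parallel write")

print(f"log file sync:           {sync_ms:.1f} ms")
print(f"log file parallel write: {write_ms:.1f} ms")
print(f"unexplained gap:         {sync_ms - write_ms:.1f} ms per commit")
```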

The key to this mystery is what the specific wait events mean, and which sessions wait for them:

  • log file sync – user sessions wait on this event after a commit is issued
  • log file parallel write – LGWR is the only process that waits on this event, and it is the actual I/O (and associated setup) that stamps the redo buffer information onto disk

The process flow goes a bit like this for a session making a commit:

  1. User session issues a commit and goes into a ‘log file sync’ wait
  2. LGWR is posted to flush the redo buffer to disk
  3. LGWR issues the I/O and waits on ‘log file parallel write’
  4. LGWR returns from I/O, clears wait, and posts the user session to signal completion
  5. User session clears ‘log file sync’ wait

There are other operations that I have deliberately omitted from that description for clarity (they are not really relevant for the purposes of this discussion). In the case where there are many sessions wanting to commit, the process is slightly different, because of Oracle’s group commit (or piggyback commit) functionality. If a second session issues a commit while the LGWR is busy writing the redo for the first session, it too will wait on log file sync. If a third, fourth, fifth and sixth session issue commits, they will also wait on log file sync. They will wait for the LGWR to finish writing the redo buffer required for the first session; the LGWR is then immediately posted to write the redo for the newly waiting sessions. These sessions (2-6) will continue to wait until this second write is complete.

What this means is that we have a strange kind of queue. It can only be two entries deep (the session currently being taken care of by LGWR and the session(s) that committed after the first session) and infinitely wide (all the subsequent sessions are handled as a single request). What it means in terms of AWR, however, is that the averages get distorted beyond all recognition.
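The shape of that queue can be seen in a minimal sketch. This is a toy model, not Oracle's actual implementation: a single writer, a constant write time, and the function name `simulate_group_commit` are all assumptions made for illustration.

```python
def simulate_group_commit(commit_times_ms, write_ms):
    """Return (batch_sizes, waits_ms) for a single LGWR-like writer.

    commit_times_ms: sorted arrival times of commits, in ms.
    write_ms: duration of every redo write, in ms (held constant here).
    """
    batch_sizes, waits_ms = [], []
    writer_free_at = 0.0
    i, n = 0, len(commit_times_ms)
    while i < n:
        # The writer starts as soon as it is idle and a commit is pending.
        start = max(writer_free_at, commit_times_ms[i])
        # Every commit that has arrived by then rides in the same write.
        j = i
        while j < n and commit_times_ms[j] <= start:
            j += 1
        finish = start + write_ms
        # Each session waits from its own commit until the write completes.
        waits_ms.extend(finish - t for t in commit_times_ms[i:j])
        batch_sizes.append(j - i)
        writer_free_at = finish
        i = j
    return batch_sizes, waits_ms


# Six sessions commit 1ms apart while each write takes 10ms.
batch_sizes, waits_ms = simulate_group_commit([0, 1, 2, 3, 4, 5], write_ms=10)
print(batch_sizes)  # [1, 5]
print(waits_ms)
```

The first session gets a write to itself, and sessions 2-6 are swept up into a single second write, so two writes produce six waits, five of which are longer than any single write.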

I’ll tell you now that the underlying problem with this system was a response time problem on redo writes, validated by looking at the LGWR trace file, which was littered with ‘slow write’ warnings (what a great feature, introduced in 10g!). I believe the write cache on the storage was getting saturated, but have not yet been able to confirm that with the client’s storage team. So why was this not being reflected in the ‘log file parallel write’ wait events? The reason is that the slow writes were very slow (some greater than one second), but the majority of the writes were very fast. Consider the following set of write times:

  • 3000 writes take 0.5ms
  • 20 writes take 1000ms
  • 2000 writes take 1ms

The total time for all those writes is 23,500ms. Divide that by the number of requests (5020), and the average write time is 4.68ms, close to the number reported in the AWR report. The trouble is that 85% of the time is spent in only 0.4% of the calls, and the thousands of fast writes drag the average down far enough to hide the outliers (even with very rapid AWR sampling this would be the case). That’s the first part of the problem.
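The arithmetic is easy to check with a few lines of Python. The figures are the theoretical ones from the list above, not measurements from the client system; the median is added here only as a point of comparison.

```python
import statistics

# (number of writes, latency in ms) for each group in the example
writes = [(3000, 0.5), (20, 1000.0), (2000, 1.0)]

total_calls = sum(count for count, _ in writes)
total_ms = sum(count * ms for count, ms in writes)
average_ms = total_ms / total_calls

# Share of the time and of the calls taken by the 1000ms outliers
slow_calls = sum(count for count, ms in writes if ms >= 1000)
slow_ms = sum(count * ms for count, ms in writes if ms >= 1000)
slow_time_pct = 100 * slow_ms / total_ms
slow_call_pct = 100 * slow_calls / total_calls

# The median ignores the outliers entirely
median_ms = statistics.median(ms for count, ms in writes for _ in range(count))

print(f"average {average_ms:.2f}ms, median {median_ms}ms")
print(f"{slow_time_pct:.0f}% of the time in {slow_call_pct:.1f}% of the calls")
```

Neither the mean nor the median reveals the one-second writes on its own, which is why a single averaged figure is a poor way to spot this kind of problem.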

The second part of the problem is what happens when one of these outlier 1000ms I/O times takes place. We can see that the average (!) execute rate was 342/s, and let’s assume that all of these executions were DML that had a commit after them. We could use the Transactions rate, but that will also include a bunch of rollbacks, and I happen to know from the developer that each execute is a DML operation. So, we have an arrival rate of 342 commits per second and occasional peaks of I/O time of 1000ms. As soon as one of these spikes occurs, 342 sessions will start to accumulate a portion of that 1000ms (depending upon where in that period they issued the commit). If we assume that they arrive evenly spread out across that 1000ms, we can use the area of a triangle to calculate the total time clocked up to log file sync during that period:

  • Area of right-angled triangle = 1/2 base x height
  • Base=1000ms
  • Height=number of sessions waiting, maximum 342
  • Answer=171,000ms

So, the total wait time attributed to log file parallel write is only 1000ms during this period, but the total wait time attributed to log file sync is 171,000ms. You see what I did there? I concealed the calculation of an average into something called ‘area of a triangle’. Shh, don’t tell anyone.
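For anyone who doesn't trust the triangle, here is a small sketch that adds up the individual waits instead. It assumes, as above, that the 342 commits arrive evenly across the 1000ms spike; placing each arrival at the midpoint of its slot is an extra assumption made here so that the sum matches the continuous triangle.

```python
spike_ms = 1000.0
commits_per_spike = 342

# Evenly spaced arrivals; each session waits until the slow write ends.
arrivals_ms = [(k + 0.5) * spike_ms / commits_per_spike
               for k in range(commits_per_spike)]
waits_ms = [spike_ms - t for t in arrivals_ms]

total_sync_ms = sum(waits_ms)
triangle_ms = 0.5 * spike_ms * commits_per_spike
mean_wait_ms = total_sync_ms / commits_per_spike

print(round(total_sync_ms), triangle_ms, round(mean_wait_ms))
```

The mean wait per session comes out at half the spike duration, which is the average hiding inside the triangle.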

When this slow write is over, the sessions will once again wait for the fast writes, perhaps one or two sessions at a time waiting for sub-millisecond writes, which means the average time is not dragged down quite so far as it is for the ‘log file parallel write’ wait event. Let’s apply the waits to the same theoretical model as the log file parallel write to see:

  • 3000 single log file sync waits at 0.5ms = 1500ms
  • 20 periods of log file sync madness described above = 3,420,000 ms
  • 2000 single log file sync waits at 1ms = 3000ms

Total number of waits here is 5000+6840 (this is 20 x 342)=11,840 total waits, compared to 5020 for the log file parallel write. The total wait time, however, is 3,424,500 ms, making an average wait time of 289ms which would be reported by AWR for ‘log file sync’, compared to a 4.68ms ‘log file parallel write’ time. That’s somewhat more exaggerated than the actual figures in the AWR report, but it does highlight the issue quite well.

AWR is a useful tool, but issues like this are masked. Use ASH or session-based tracing for these kinds of problems, and the answer will become clear!

“Flash” Storage Will Be Cheap – The End of the World is Nigh

Posted May 29, 2010 by James Morle
Categories: Oracle, Performance, SaneSAN2010, Storage


A couple of weeks ago I tweeted a projection that the $/GB for flash drives will meet the $/GB for hard drives within 3-4 years. It was more of a feeling based upon current pricing with Moore’s Law applied than a well researched statement, but it felt about right. I’ve since been thinking some more about this within the context of current storage industry offerings from the likes of EMC, Netapp and Oracle, wondering what this might mean.

First of all I did a bit of research – if my 3-4 years guess-timate was out by an order of magnitude then there is not much point in writing this article (yet). I wanted to find out what the actual trends in flash memory pricing look like and how these might project over time, and I came across the following article: Enterprise Flash Drive Cost and Technology Projections. Though this article is now over a year old, it shows the following chart which illustrates the effect of the observed 60% year on year decline in flash memory pricing:

Flash Drive Pricing Projections

This 60% annual drop in costs is actually an accelerated version of Moore’s Law, and does not take into account any radical technology advances that may happen within the period. This drop in costs is probably driven for the most part by the consumer thirst for flash technology in iPods and so forth, but naturally ripples back up into the enterprise space in the same way that Intel and AMD’s processor technologies do.
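To get a feel for how quickly a 60% annual decline compounds, here is a rough sketch. Only the 60% figure comes from the article cited above. The function names are made up for illustration, the 10x starting price gap is purely hypothetical, and holding hard drive prices flat is a simplifying assumption that flatters flash, since drives that keep getting cheaper would push the crossover later.

```python
import math

ANNUAL_DECLINE = 0.60  # year-on-year drop in flash pricing, from the cited article


def remaining_fraction(years, annual_decline=ANNUAL_DECLINE):
    """Fraction of today's $/GB that is left after the given number of years."""
    return (1.0 - annual_decline) ** years


def years_to_close_gap(price_ratio, annual_decline=ANNUAL_DECLINE):
    """Years for flash to fall by a factor of price_ratio.

    ASSUMPTION: the hard drive $/GB stays flat over the period.
    """
    return math.log(price_ratio) / -math.log(1.0 - annual_decline)


for years in (1, 2, 3, 4):
    print(f"after {years} years: {remaining_fraction(years):.1%} of today's price")

# Hypothetical example: flash costing 10x the $/GB of hard drives today.
print(f"{years_to_close_gap(10):.1f} years to close a 10x gap")
```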

So let’s just assume that my guess-timate and the above chart are correct (they almost precisely agree) – what does that mean for storage products moving forward?

Looking at recent applications of flash technology, we see that EMC were the first off the blocks by offering the option of relatively large flash drives as drop-in replacements for their hard drives in their storage arrays. Netapp took a different approach of putting the flash memory in front of the drives as another caching layer in the stack. Oracle have various options in their (formerly Sun) product line and a formalised mechanism for using flash technology built into the 11g database software and into the Exadata v2 storage platform. Various vendors offer internal flash drives that look like hard drives to the operating system (whether connected by traditional storage interconnects such as SATA or by PCI Express). If we assume that the cost of flash technology becomes equivalent to hard drive storage in the next three years, I believe all these technologies will quickly become the wrong way to deploy flash technology, and only one (Oracle) has an architecture which lends itself to the most appropriate future model (IMHO).

Let’s flip back to reality and look at how storage is used and where flash technology disrupts that when it is cheap enough to do so.

First, location of data: local or networked in some way? I don’t believe that flash technology disrupts this decision at all. Data will still need to be local in certain cases and networked via some high-speed technology in others, in much the same way as it is today. I believe that the networking technology will need to change for flash technology, but more on that later.

Next, the memory hierarchy: Where does current storage sit in the memory hierarchy? Well, of course, it is at the bottom of the pile, just above tape and other backup technologies if you include those. This is the crucial area where flash technology disrupts all current thinking – the final resting place for data is now close or equal to DRAM memory speeds. One disruptive implication of this is that storage interconnects (such as Fibre Channel, Ethernet, SAS and SATA) are now a latency and bandwidth bottleneck. The other, potentially huge, disruption is what happens to the software architecture when this bottleneck is removed.

Next, capacity: How does the flash capacity sit with hard drive capacity? Well that’s kind of the point of this posting… it’s currently quite a way behind, but my prediction is that they will be equal by 2013/2014. Importantly though, they will then start to accelerate away from hard drives. Given the exponential growth of data volumes, perhaps only semiconductor based storage can keep up with the demand?

Next, IOPs: This is the hugely disruptive part of flash technology, and is a direct result of a dramatically lowered latency (access time) when compared to hard disk technology. Not only is the latency lowered, but semiconductor-based storage is more or less contention-free given the absence of serialised moving parts such as a disk head. Think about it – the service time for a given hard drive I/O is directly affected by the preceding I/O and where the head was left on the platter. With solid-state storage this does not occur and service times are more uniform (though writes are consistently slower than reads).

These disruptions mean that the current architectures of storage systems are not making the most of semiconductor-based storage. Hey, why do I keep calling it “semiconductor-based storage” instead of SSD or flash? The reason is that the technologies used in this area are changing frequently, from DRAM-based systems to NOR-based flash to NAND-based flash to DRAM-fronted flash; single-level cells to multi-level cells; battery-backed to “Super Cap” backed. Flash, as we know it today, could be outdated as a technology in the near future, but “semiconductor-based” storage is the future regardless.

I think that we now need technologies that look more like Oracle Exadata v2, with low-latency RDMA interfaces directly into the Operating System/Database. However, they need to easily and natively support other types of storage (unstructured data such as files, VMware datastores and so forth). The Exadata architecture lends itself well to changes in this area in both hardware trends and access protocols.

Perhaps more importantly, we are also only just beginning to understand the implications in software architecture for the disrupted memory hierarchy. We simply cannot continue to treat semiconductor-based storage as “fast disk” and need to start thinking, literally, outside the box.