Posted tagged ‘latency’

Latency Hiding For Fun and Profit

September 28, 2009

Yep, another post with the word ‘latency’ written all over it.

I’ve talked a lot about latency, and how it is more often than not completely immutable. So, if the latency cannot be improved upon because of some pesky law of physics, what can be done to reduce that wasted time? Just three things, actually:

  1. Don’t do it.
  2. Do it less often.
  3. Be productive with the otherwise wasted time.

The first option is constantly overlooked – do you really need to be doing this task that makes you wait around? The second option is the classic ‘do things in bigger lumps between the latency’ approach – making fewer roundtrips being the obvious example. This post is about the third option, which is technically referred to as latency hiding.

Everybody knows what latency hiding is, but most don’t realise it. Here’s a classic example:

I need some salad to go with the chicken I am about to roast. Do I:

(a) go to the supermarket immediately and buy the salad, then worry about cooking the chicken?


(b) get the chicken in the oven right away, then go to the supermarket?

Unless the time required to buy the salad is much longer than the chicken’s cook-time, the answer is always going to be (b), right? That’s latency hiding, also known as Asynchronous Processing. Let’s look at the numbers:

Variable definitions:

Supermarket Trip=1800s

Chicken Cook-Time=4800s


Option (a)=1800s+4800s=6600s (oh man, nearly two hours until dinner!)

Option (b)=4800s (with 1800s supermarket time hidden within it)
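For the programmers in the audience, the chicken-and-salad overlap can be sketched with a couple of threads. This is just a scaled-down simulation (milliseconds standing in for the seconds above), not anything Oracle-specific:

```python
import threading
import time

# Scaled-down simulation: 1ms of real time stands in for 1s of task time.
CHICKEN_COOK = 4.8 / 1000   # stands in for 4800s
SUPERMARKET = 1.8 / 1000    # stands in for 1800s

def cook_chicken():
    time.sleep(CHICKEN_COOK)

def buy_salad():
    time.sleep(SUPERMARKET)

# Option (a): serial -- pay both latencies, one after the other.
start = time.perf_counter()
buy_salad()
cook_chicken()
serial = time.perf_counter() - start

# Option (b): overlap -- the supermarket trip hides inside the cook time.
start = time.perf_counter()
oven = threading.Thread(target=cook_chicken)
oven.start()     # chicken goes in the oven first
buy_salad()      # shop while it cooks
oven.join()
overlapped = time.perf_counter() - start

print(f"serial: {serial:.4f}s, overlapped: {overlapped:.4f}s")
```

The overlapped run takes roughly the cook time alone; the supermarket trip has disappeared inside it.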

Here’s another example: You have a big code compile to do, and an empty stomach to fill. In which order do you execute those tasks? Hit ‘make’, then grab a sandwich, right?

As a side note, this is one of my classic character flaws – I just live for having tasks running in parallel this way. Not a flaw, I hear you say? Anyone who has tried to get software parallelism (such as Oracle Parallel Execution) working knows the problem – some tasks finish quicker than expected, and then there’s a bunch of idle threads. In the real world, this means that my lunch is often very delayed, much to the chagrin of my lunch buddies.

OK, so how does this kind of thing work with software? Let’s look at a couple of examples:

  1. Read Ahead
  2. Async Writes

Read ahead from physical disk is the most common example of (1), but it equally applies to result prefetching in, say, AJAX applications. Whatever the specific type, it capitalises on having two resources working in parallel. Let’s look at the disk example for clarification.

Disk read ahead is where additional, unrequested reads are carried out after an initial batch of real, requested reads. So, if a batch job makes a read request for blocks 1, 2, 3 and 4 of a file, “the disk” returns those blocks and then immediately goes on to read blocks 5, 6, 7 and 8, keeping them in cache. If blocks 5, 6, 7 and 8 are subsequently requested by the batch job after the first blocks are processed, they can be returned immediately from cache, thus hiding the latency from the batch job and increasing throughput as a direct result.
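Here’s a toy sketch of that mechanism – a made-up `ReadAheadDisk` class, with an assumed 2ms per physical block read, nothing like a real storage stack. The point is only that the prefetch happens in the background while the caller is busy processing:

```python
import threading
import time

BLOCK_LATENCY = 0.002  # assumed 2ms per physical block read (illustrative)

class ReadAheadDisk:
    """Toy disk: after serving a batch of blocks, prefetch the next
    batch in the background while the caller is busy processing."""

    def __init__(self, readahead=4):
        self.readahead = readahead
        self.cache = {}
        self.hits = 0            # reads satisfied from cache
        self._prefetcher = None

    def _physical_read(self, block):
        time.sleep(BLOCK_LATENCY)   # the immutable latency
        return f"data-{block}"

    def _prefetch(self, blocks):
        for b in blocks:
            self.cache[b] = self._physical_read(b)

    def read(self, blocks):
        if self._prefetcher:        # let any prefetch in flight finish
            self._prefetcher.join()
        out = []
        for b in blocks:
            if b in self.cache:
                self.hits += 1
                out.append(self.cache.pop(b))
            else:
                out.append(self._physical_read(b))
        # Speculatively read the next few blocks, betting on a sequential scan.
        nxt = range(max(blocks) + 1, max(blocks) + 1 + self.readahead)
        self._prefetcher = threading.Thread(target=self._prefetch, args=(nxt,))
        self._prefetcher.start()
        return out

disk = ReadAheadDisk()
disk.read([1, 2, 3, 4])          # pays the latency; kicks off prefetch of 5-8
time.sleep(0.01)                 # "processing" -- prefetch completes meanwhile
data = disk.read([5, 6, 7, 8])   # served from cache, latency hidden
```

If the batch job’s processing time per batch is at least as long as the prefetch, the second read costs it essentially nothing.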

Async writes are essentially the exact opposite of read-ahead. Let’s take the well-known Oracle example of async writes, that of the DBWR process flushing out dirty buffers to disk. The synchronous way to do this is to generate a list of dirty buffers and then issue a series of synchronous writes (one after the other) until they are all complete. Then start again by looking for more dirty buffers. The async I/O way to do the same operation is to generate the list, issue an async write request (which returns instantly), and immediately start looking for more dirty buffers. This way, the DBWR process can spend more useful time looking for buffers – the latency is hidden, assuming the storage can keep up.
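The shape of those two loops is easy to show in miniature. This uses a Python thread pool to stand in for the OS async I/O layer, with made-up timings – a sketch of the pattern, not of DBWR itself:

```python
import time
from concurrent.futures import ThreadPoolExecutor

WRITE_LATENCY = 0.002   # assumed 2ms per write (illustrative)
SCAN_TIME = 0.0005      # time spent finding the next batch of dirty buffers

def write_batch(batch):
    time.sleep(WRITE_LATENCY)   # the storage does its work

def sync_flush(n_batches):
    # Synchronous loop: scan, write, wait for completion, repeat.
    for _ in range(n_batches):
        time.sleep(SCAN_TIME)
        write_batch(None)

def async_flush(n_batches):
    # Async loop: issue the write (submit returns immediately) and go
    # straight back to scanning; reap the completions at the end.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = []
        for _ in range(n_batches):
            time.sleep(SCAN_TIME)
            futures.append(pool.submit(write_batch, None))
        for f in futures:
            f.result()

t0 = time.perf_counter(); sync_flush(10); sync_time = time.perf_counter() - t0
t0 = time.perf_counter(); async_flush(10); async_time = time.perf_counter() - t0
print(f"sync: {sync_time:.3f}s  async: {async_time:.3f}s")
```

The async version spends its elapsed time scanning, with the write latency hidden behind it – exactly the DBWR behaviour described above, provided the storage can keep up.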

By the way, the other horror of the synchronous write is that the I/O queues can never be driven deep enough for efficient I/O when sending out a single buffer at a time. Async writes remedy that problem.

I’ve left a lot of the technical detail out of that last example, such as the reaping of return results from the async I/O process, but didn’t want to cloud the issue. Oops, I guess I just clouded the issue, just ignore that last sentence…


Forget I/O Bound, You’re Latency Bound, Bub

September 21, 2009

Since it’s been nearly ten years since I wrote my book, Scaling Oracle8i, I thought it was about time that I started writing again. I thought I would start with the new-fangled blogging thing, and see where it takes me. Here goes.

As some will know, I run a small consulting company called Scale Abilities, based out of the UK. We get involved in all sorts of fun projects and problems (or are they the same thing?), but one area that I seem to find myself focusing on a lot is storage. Specifically, the performance and architecture of storage in Oracle database environments. In fact I’m doing this so much that, whenever I am writing presentations for conferences these days, it always seems to be the dominant subject at the front of my mind.

One particular common thread has been the effect of latency. This isn’t just a storage issue, of course, as I endeavoured to point out in my Hotsos Symposium 2008 presentation “Latency and Skew”. Latency, as the subtitle of that particular talk said, is a silent killer. Silent, in that it often goes undetected, and the effects of it can kill performance (and still remain undetected). I’m not going to go into all the analogies about latency here, but let’s try and put a simple definition out for it:

Latency is the time taken between a request and a response.

If that’s such a simple definition, why is it so difficult to spot? Surely if a long period of time passes between a request and a response, the latency will be simple to nail? No.

The problem is that it is the small latencies that cause the problems. Specifically, it is the “small, but not so small that they are not important” ones that are so difficult to spot and yet cause so many problems. Perhaps an example is now in order:

A couple of years ago, a customer of mine was experiencing a performance problem on their newly virtualised database server (VMware 3.5). The problem statement went a little bit like this: Oracle on VMware is broken – it runs much slower on VMware than on physical servers. The customer was preparing a new physical server in order to remove VMware from the equation. Upon further investigation, I determined the following:

  1. The VMware host (physical server running VMware) was a completely different architecture to the previous dedicated server. The old server was one of the Intel Prescott core type (3.2GHz, Global Warming included at no extra cost), and the new one was one of the Core 2 type with VT instructions.
  2. Most measurable jobs were actually faster on the virtualised platform
  3. Only one job was slower

The single job that was slower was so much slower that it overshadowed all the other timings that had improved. Of the four critical batch jobs, the timings looked like this:

Physical server:

  • Job 1: 26s
  • Job 2: 201s
  • Job 3: 457s
  • Job 4: 934s
  • Total: 1618s

Virtualised Server:

  • Job 1: 15s
  • Job 2: 111s
  • Job 3: 208s
  • Job 4: 2820s
  • Total: 3154s

It can be seen that, if one takes the total as the yardstick, the virtualised server is almost twice as slow. Therein lies the danger of using averages and leaping to conclusions. If Job 4 is excluded, the totals are 684s vs 334s, making the virtualised server more than twice as quick as the physical one.

Upon tracing Job 4 on the VMware platform with Oracle extended SQL tracing (10046 level 8), I discovered that it was making a lot of roundtrips to the database. Hang on, let me give that the right emphasis: A LOT of roundtrips. However, each roundtrip was really fast – about 0.2ms if memory serves. So where’s the problem with that? It’s not exactly high latency, is it? Well it is if you have to do several million of them.

As it turns out, there was something in the VMware stack (perhaps just additional codepath to get through the vSwitch) that was adding around 0.1ms of latency to each roundtrip. When this tenth of a millisecond is multiplied by several million (something like 20 million in this case), it becomes a long time. About 2000s to be precise, which more than wiped out the time saved on the other jobs. The real answer – make fewer roundtrips, and the new server will be at least twice as fast as the old one.
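The arithmetic behind that 2000s figure is worth spelling out, because it’s the kind of sum anyone can do on the back of an envelope before blaming the platform:

```python
# Back-of-the-envelope: what an extra 0.1ms per roundtrip costs.
roundtrips = 20_000_000
added_latency = 0.0001   # 0.1ms extra per trip, in seconds

extra_time = roundtrips * added_latency
print(f"{extra_time:.0f}s of extra elapsed time")  # 2000s -- over half an hour
```

A tenth of a millisecond looks like nothing in a trace file; twenty million of them is over half an hour.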

So what does this have to do with I/O? Plenty. The simple fact is that roundtrips are a classic source of unwanted latency. Latency is the performance killer.

Let’s look at some I/O examples from a real customer system. Note: this is not a particularly well tuned customer system:

  • Latency of 8KB sequential reads: 0.32ms
  • Latency of 4MB sequential reads: 6ms

Obviously, the 4MB reads are taking a lot longer (18x), but that makes sense, right? Apart from one thing: The 4MB reads are serving 512x more data in each payload. The net result of these numbers is as follows:

  • Time to sequentially read 2.1GB in 8KB pieces: 76s
  • Time to sequentially read 2.1GB in 4MB pieces: 24s

So what happened to the 52s difference in these examples? Was this some kind of tuning problem? Was this a fault on the storage array? No. The 52s of time was lost in latency, nowhere else.
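The exact breakdown of those elapsed times depends on things like queue depth and transfer rate, so this is only a scale check, but the request counts alone tell the story:

```python
total_bytes = 2.1 * 1024**3           # 2.1GB to read
small, big = 8 * 1024, 4 * 1024**2    # 8KB vs 4MB read sizes

small_reqs = total_bytes / small      # ~275,000 requests
big_reqs = total_bytes / big          # ~540 requests
print(f"8KB: {small_reqs:,.0f} requests, 4MB: {big_reqs:,.0f} requests")
print(f"The 8KB scan pays the per-request latency "
      f"{small_reqs / big_reqs:.0f}x more often")
```

Every one of those extra requests pays the request/response latency again, and that is where the missing seconds go.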

Here’s another definition of latency: Latency is wasted time.

So, look out for that latency. Think about it when selecting nested-loop table joins instead of full-table scans. Think about it when doing single-row processing in your Java app. Think about it before you blame your I/O system!