Right Practice

Posted September 16, 2011 by James Morle
Categories: Full Stack, Generic, Oracle, Performance, SaneSAN2010, Storage

Wow, it’s been a while since I wrote a post, sorry about that! I thought that I would take a brief break from the technical postings and espouse some opinion on something that has been bothering me for a while – ‘Best Practices.’

Best Practices have been around a long time, and started with very good intentions. In fact, one could easily claim that they are still produced with good intentions: To communicate methods that the hardware and software vendors recommend. However, the application of Best Practices has become increasingly abused in the field to the point where they have become more like prescriptions of how systems should be built. This has gone too far, and needs to be challenged.

Best Practices are a misnomer. At best, they should be referred to as Default Practices, because they are simply valid starting points for the implementation of a piece of technology. They do not describe the optimal (or even appropriate) recipe for implementation in every given situation. And yet: Best Practices are now becoming The Law.

It is increasingly difficult to deviate from a Best Practice, even when there are clear and logical reasons why this should happen. Best Practices are being used as an alternative to rational thought.

Consider this: Any given Best Practice is only applicable in a certain set of conditions. For example, it is ‘Best Practice’ not to fill a certain storage vendor’s array with its own SSD drives, it should only be partially populated. This has been prescribed by the storage vendor because the array has not been designed to cope with the high IOPs and bandwidth rates of every device in the array working at full, uncontended, tilt. This Best Practice is not saying that the SSDs will not work, it is simply saying that the full aggregate performance of all those SSDs cannot be sustained by the interconnects and processors in the array. Fair enough. But this Best Practice is being applied in the field as ‘SSDs are bad for the array’. What if your database requires consistent I/O response time of 1ms or less, requires masses of storage, but actually isn’t a very high IOPs or bandwidth consumer?  Isn’t this the ideal situation to fully populate the array with SSD, to achieve the consistent response time and deliver the required storage density? The array will easily cope with the throughput and IOPs rate, thus negating the Best Practice.

There are many examples like this one. It’s time to stop using Best Practice as a substitute for good old-fashioned thinking, and start to implement designs based upon Right Practice: the ‘Right’ Practice for your particular situation. By all means start with the Best Practice documentation – it often contains good information. But read between the lines, understand the reasoning behind those recommendations, and ask whether that reasoning applies to your particular application.

We are big on the Right Practice approach at Scale Abilities – it’s one of the reasons our clients get big value from our involvement. I recommend that you start to question the logic behind Best Practices and come up with your own Right Practices.

Sane SAN 2010: Fibre Channel – Ready, Aim, Fire

Posted September 30, 2010 by James Morle
Categories: Full Stack, SaneSAN2010, Storage

In my last blog entry I alluded to perhaps not being all that happy about Fibre Channel. Well, it’s true. I have been having a love/hate relationship with Fibre Channel for the last ten years or so, and we have now decided to get a divorce. I just can’t stand it any more!

I first fell in love with Fibre Channel in the late 90s: How could I resist the prospect of leaving behind multi-initiator SCSI with all its deep, deep electrical issues? Fibre Channel let me hook up multiple hosts to lots of drives, via a switch, and it let me dynamically attach and detach devices from multiple clustered nodes without reboots. Or so I thought. The reality of Fibre Channel is that it was indeed a revelation in its day, but some of that promise never really materialised until recently. And now it’s too late.

I have a number of problems with Fibre Channel as it stands today, and I’m not even going to mention the fact that it is falling behind in bandwidth. Whoops, I just did – try to pretend you didn’t just read that. The real problems are:

  1. It is complex
  2. It is expensive
  3. It is unreliable
  4. It is slow

Complexity

Complexity. Complexity, complexity, complexity. I hate complexity. Complexity is the IT equivalent of communist bureaucracy – it isn’t remotely interesting, it wastes colossal amounts of time, and it ultimately causes the system to go down. Don’t confuse complexity with challenge – Challenge is having to solve new and interesting problems, Complexity is having to fix the same old problems time and time again and having to do it standing on one leg. So why do I think Fibre Channel is complex? For these reasons:

  1. The stack
  2. YANT

The Stack

If you have ever tried to manage the dependencies associated with maintaining a fully supported Fibre Channel infrastructure then you can probably already feel a knot in your stomach. For everyone else, let me explain.

Every component in a Fibre Channel stack needs to be certified to work with the other components. Operating System version, multipath I/O (mpio) drivers, HBA device drivers, HBA firmware, switch type, switch firmware and storage array firmware. So what happens when you want to, for example, upgrade your MPIO drivers? It is pretty standard for the following process to occur:

  • I want to upgrade to MPIO v2
  • MPIO v2 requires array firmware v42
  • Array firmware v42 requires HBA device driver v3.45
  • HBA device driver v3.45 requires the next release of the Operating System
  • The next release of the Operating System is not yet supported by the Array firmware
  • etc, etc

I think you get the point. But also remember that this wonderful array is shared across 38 different systems, all with different operating systems and HBAs, so the above process has to be followed for every single one, once you have a target release of array firmware that might work across all the platforms. If you are really really lucky, you might get a combination within those factorial possibilities that is actually certified by the array vendor.

Complex enough? Now add YANT…

YANT

Yet Another Networking Technology. I’m all in favour of having different types of networking technology, but not when the advantage is minuscule. All that training, proprietary hardware, cost, and so on: To justify that, the advantage had better be substantial. But it isn’t. Compare Fibre Channel to 10Gbps Ethernet, which is a universal networking standard, and it just doesn’t justify its own existence. To be fair to Fibre Channel, it was the original version of what we are now calling Converged Networking – it has always supported TCP/IP and SCSI protocols, and used to be way faster than Ethernet, but it just never got the traction it needed in that space.

Expensive

It’s tough to argue against this one, Fibre Channel is expensive. 10Gbps Ethernet is also expensive, but the prices will be driven down by volume and ubiquity. In addition, Ethernet switches and so forth can be shared (if you must, that is: I’m still a fan of dedicated storage networks for reasons of reliability), whereas Fibre Channel must be dedicated. Infiniband is expensive too, and will probably stay that way, but it is providing a much higher performance solution than Fibre Channel.

Unreliable

What? Unreliable?

Yes, it’s true. It’s not an inherent problem with the technology itself; Fibre Channel is actually incredibly robust and I can’t fault that fact. However, the promise of real-life reliability is shattered by:

  • Large Fabrics
  • Complexity

What is the point of large fabrics? I can see the point of wanting to stretch I/O capability over a wide area, such as remote replication and so forth, but that does not imply that the whole storage universe of the enterprise should be constructed as a giant fabric, does it? Networks should be composed of relatively small, interconnected failure domains, so that traffic can flow, but the impact of a failure is limited in scope. Building a large fabric is going against that, and I’ve lost count of the number of catastrophic failures I’ve seen as a result of building The Dream Fabric.

Complexity; we’re back there again. Reliability is inversely proportional to complexity: High complexity = Low reliability, and vice versa. This is particularly true while we still entrust humans to administer these networks.

Slow

This is the final nail in the coffin. Times have changed, and Fibre Channel has no space in the new world. The way I see it, there are now just two preferred ways to attach storage to a server:

  • Ethernet-based NFS for general use
  • Infiniband-based for very low latency, high bandwidth use

The former approach is a ‘high enough’ performance solution for most current requirements, with ease of use and well understood protocols and technology. I’m not saying it’s quicker than Fibre Channel (though it certainly can be), just that it is fast enough for most things and is easy to put together and manage. The latter method, Infiniband (or similar), is a step up from both Ethernet and Fibre Channel, offering higher bandwidth and lower latency, especially when used with RDMA. Infiniband has been a technology searching for a commercial purpose for some time now, and I believe that time has now come, via the route of semiconductor-based storage devices. Consider the following numbers:

  • Fibre Channel Latency: 10-20us (est)
  • Infiniband/RDMA Latency: 1us (est)

Now let’s see how these latencies compare to those of a physical disk read and a read from a DRAM-based storage device:

  • Disk Read: 8,000 us (ie 8ms)
  • DRAM-based Storage read: 15us (source: TMS Ramsan 440 specification)
  • Ratio of FC latency to Disk Latency: 1:800 (0.125%)
  • Ratio of FC latency to DRAM Latency: 1:1.5 (~67%)
  • Ratio of IB latency to Disk Latency: 1:8000 (0.0125%)
  • Ratio of IB latency to DRAM latency: 1:15 (6.67%)
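
These ratios fall straight out of the estimated latencies above; here is a trivial sketch that reproduces them, using the same rough figures (estimates, not measurements):

#include <stdio.h>

int main(void)
{
    /* All figures in microseconds; estimates from the text, not measurements. */
    const double fc = 10.0, ib = 1.0, disk = 8000.0, dram = 15.0;

    printf("FC : disk = 1:%.0f (%.3f%%)\n", disk / fc, 100.0 * fc / disk);
    printf("FC : DRAM = 1:%.1f (%.1f%%)\n", dram / fc, 100.0 * fc / dram);
    printf("IB : disk = 1:%.0f (%.4f%%)\n", disk / ib, 100.0 * ib / disk);
    printf("IB : DRAM = 1:%.0f (%.2f%%)\n", dram / ib, 100.0 * ib / dram);
    return 0;
}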

When comparing to disk reads, the Fibre Channel latency does not add much to the total I/O time. However, when accessing DRAM-based storage, it becomes a hugely dominant factor in the I/O time, whereas Infiniband is still single-digit percentage points. This is why I suggest that Fibre Channel has no role in the forthcoming high-performance storage systems. Fibre Channel is neither simple enough for simple systems, nor fast enough for high-performance systems.

Sane SAN2010: Storage Arrays – Ready, Aim, Fire

Posted September 6, 2010 by James Morle
Categories: Full Stack, Oracle, SaneSAN2010, Storage

OK, this one might be contentious, but what the heck – somebody has to say it. Let’s start with a question:

Raise your hand if you have a feeling, even a slight one, that storage arrays suck?

Most DBAs and sysadmins that I speak to certainly have this feeling. They cannot understand why the performance of this very large and expensive array is nearly always lower than they achieve from the hard drive in their desktop computer. OK, so the array can do more aggregate IOPs, but why is it that 13ms, for example, is considered a reasonable average response time? Or worse, why is it that some of my I/Os take several hundred milliseconds? And how is it possible that my database is reporting 500ms I/Os and the array is reporting that they are all less than 10ms? These are the questions that are lodged in the minds of my customers.

Storage Arrays do some things remarkably well. Availability, for example, is something that is pretty much nailed in Storage Array Land, both at the fabric layer and the actual array itself. There are exceptions: I think that large Fibre Channel fabrics are a High Availability disaster, and the cost of entry with director-class switches makes no sense when small fabrics can be built using commodity hardware. I have a more general opinion on Fibre Channel actually – it is an ex-format, it is pushing up the lilies. More on that in another blog post, though, I’m not done with the array yet!

The Storage Array became a real success when it became possible to access storage through Fibre Channel. Until then, the storage array was a niche product, except in the mainframe world where Escon was available. Fibre Channel, and subsequently Ethernet and Infiniband, enabled the array to be a big shared storage resource. For clustered database systems this was fantastic – an end to the pure hell of multi-initiator parallel SCSI. But then things started getting a little strange. EMC, for example, started advertising during Saturday morning children’s television about how their storage arrays allowed the direct sharing of information between applications. Well, even an eight-year-old knows that you can’t share raw data that way, it has to go via a computer to become anything meaningful. But this became the big selling point: all your data in one place. That also has the implication that all the types of data are the same – database files, VMware machines, backups, file shares. They are not, from an access pattern, criticality or business value standpoint, and so this model does not work. Some back-pedalling has occurred since then, notably in the form of formal tiered storage, but this is still offered under the guise of having all the data in one place – just on cheap or expensive drives.

So now we have this big, all-eggs-in-one-basket, expensive array. What have we achieved by this? Everything is as slow as everything else. I visited a customer last week with just such an array, and this was the straw that broke the camel’s back (it’s been a long time coming). They have a heavily optimised database system that ultimately means the following speeds and feeds are demanded from the array:

  • 10 megabytes per second write, split into around 1300 IOPs
  • 300 reads per second

If you don’t have a good feel for I/O rates let me tell you: That is a very moderate amount of I/O. And yet the array is consistently returning I/Os that are in several hundred milliseconds, both reads and writes. Quite rightly, the customer thinks this is not very acceptable. Let’s have a little analysis of those numbers.

First, the writes: Half of those writes are sequential to the Oracle redo log and could easily be serviced by one physical drive (one 15k drive can sustain at least 100MB/s of sequential I/O). The rest of them are largely random (let’s assume 100% random), as they are dirty datafile blocks being written by the database writer. Again, a single drive could support 200 random writes per second, but let’s conservatively go for 100 – that means we need six or seven drives to support the physical write requirements of the database, plus one for the redo log. Then we need to add another three drives for the reads. That makes a very conservative total of eleven drives to keep up with the sustained workload for this customer, going straight to disk without any intermediate magic. This array also has quite a chunk of write-back cache, which means that writes don’t actually even make it to disk before they can be acknowledged to the host/database. Why, then, is this array struggling to deliver low-latency I/O when it has sixty-four drives inside it?
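
(As an aside, here is that back-of-envelope drive count as a few lines of code; the per-drive figures are the deliberately conservative assumptions above, not vendor specifications.)

#include <stdio.h>

int main(void)
{
    /* Workload figures from the customer system described above. */
    int write_iops = 1300;                 /* ~10MB/s of writes, mostly small */
    int read_iops  = 300;

    /* Deliberately conservative per-drive figures (assumptions, not specs). */
    int rand_writes_per_drive = 100;
    int rand_reads_per_drive  = 100;

    int redo_drives = 1;                  /* half the writes are sequential redo  */
    int dbwr_writes = write_iops / 2;     /* the rest: random datafile writes     */
    int dbwr_drives = (dbwr_writes + rand_writes_per_drive - 1) / rand_writes_per_drive;
    int read_drives = (read_iops + rand_reads_per_drive - 1) / rand_reads_per_drive;

    printf("redo: %d, datafile writes: %d, reads: %d => %d drives in total\n",
           redo_drives, dbwr_drives, read_drives,
           redo_drives + dbwr_drives + read_drives);
    return 0;
}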

The answer is that a Storage Array is just a big computer itself. Instead of taking HTTP requests or keyboard input, it takes SCSI commands over Fibre Channel. Instead of returning a web page, it returns blocks of data. And like all complex computer systems, the array is subject to performance problems within itself. And the more complex the system, the more likely it is that performance problems will arise. To make things worse, the storage arrays have increasingly encouraged the admin to turn on more and more magic in the software to the point where it is now frequently an impossibility for the storage admin to determine how well a given storage allocation might perform. Modern storage administration has more to do with accountancy than it does performance and technology. Consider this equation:

(number of features) x (complexity of feature) = (total complexity)

Complexity breeds both performance problems and availability problems. This particular customer asked me if there was a way to guarantee that, when they replace this array, the new one will not have these problems. The answer is simple: ‘no’.

Yes, we can go right through the I/O stack, including all the components and software features of the array and fix them up. We can make sure that Fibre Channel ports are private to the host, remove all other workloads from the array so that there are no scheduling or capacity problems there. We can turn off all dynamic optimisations in the array software and we can lay out the storage across known physical drives. Then, and only then, might there be a slim chance of a reduced number of high latency I/Os. I have a name for this way of operating a storage array. It’s called Direct Attached Storage (DAS): Welcome to the 1990s.

Now let me combine this reality with the other important aspect: semiconductor-based storage. What happens when the pent-up frustrations of the thousands of storage array owners meet the burgeoning reality of a new and faster storage that is now governed by some kind of accelerated form of Moore’s Law? As my business partner Jeff describes it: It’s gonna be a bloodbath.

I think that we will now see a sea change in the way we connect and use storage. It’s already started with products such as Oracle’s Exadata. I’m not saying that because I am an Oracle bigot (I won’t deny that), but because it is the right thing to do – it’s focused on doing one thing well and it uses emerging technology properly, rather than pretending nothing has changed. I don’t think it’s plug and play for many transactional customers (because of the RAC implication), but the storage component is on the money. Oh, and it is effectively DAS – a virtual machine of Windows running a virus scan won’t slow down your critical invoicing run.

I think that the way we use the storage will have to change too – storage just took a leap up the memory hierarchy. Low latency connectivity such as Infiniband will become more important, as will low latency request APIs, such as SRP. We simply cannot afford to waste time making the request when the response is no longer the major time component.

With all this change, is it now acceptable to have the vast majority of I/O latency accounted for in the complex software and hardware layers of a storage array? I don’t think so.

SaneSAN2010: Serial to Serial – When One Bottleneck Isn’t Enough

Posted August 23, 2010 by James Morle
Categories: Boundary Issues, Full Stack, Oracle, Performance, SaneSAN2010, Storage

I was recently looking into a storage-related performance problem at a customer site. The system was an Oracle 10.2.0.4/SLES 9 Linux system, Fibre Channel attached to an EMC DMX storage array. The DMX was replicated to a DR site using SRDF/S.

The problem was only really visible during the overnight batch runs, so AWR reports were the main source of information in diagnosis. In this case, they were more than sufficient, showing clear wait spikes for ‘free buffer waits’ and ‘log file parallel write’ during the problematic period. They were quite impressive, too – sixteen second latencies for some of the writes.

The customer was not oblivious to this fact, of course – it is difficult not to see a problem of such magnitude. They already had an engineering plan to move from SRDF/S (Synchronous) to SRDF/A (Asynchronous), as it was perceived that SRDF was the guilty party in this situation. I had been asked to validate this assumption and to determine the most appropriate roadmap for fixing these I/O problems on this highly critical system.

Of course, SRDF/S will always get the blame in such situations 🙂 I have been involved with many such configurations, and can indeed attest that SRDF/S can lead to trouble, particularly if the implications are not understood correctly. In this case, a very large financial institution, the storage team did indeed understand the main implication (more on this shortly) but, as is often the case in large organisations, the main source of the problem was actually what I call a boundary issue, one which falls between or across the technology focus of two or more teams. In this case, there were three teams involved, leading to configuration issues across all three areas.

Let’s go back to the SRDF implication, as it is the genesis of the problem. In synchronous mode, SRDF will only allow one outstanding write per hyper-volume. Any additional writes to that hyper will be serialised on the local DMX. The storage admin had understood this limitation, and had therefore combined many hyper volumes into a number of striped metavolumes, thus increasing the number of hypers that a given ‘lump of storage’ would contain. All well and good.

The system admin had created striped Veritas volumes over these metavolumes, thus striping even further. A filesystem was then built on the volumes, and presented to the DBA team. The DBAs then built the database and started it up. All apparently ran well for a few years until performance became intolerable, and that’s where my story begins.

I’m going to cut to the chase here, most of us don’t have time to read blogs all day long. There were three factors carefully conspiring on this system to ensure that the write performance was truly terrible:

  1. SRDF/S can only have one outstanding write per hypervolume – that’s a serialisation point.
  2. The filesystem in use was not deployed in any kind of ODM, quick I/O, or other UNIX file locking bypass technology – that’s a serialisation point.
  3. The database was not using Async I/O – that’s (pretty much) a serialisation point.

There you go – 1, 2, 3: serialisation points from each of the three teams, none of which were understood by the other teams. Let’s step through Oracle attempting to write dirty buffers to disk during the batch run (the major wait times were observed on ‘free buffer waits’, so let’s start there):

  • DML creates dirty buffers
  • Oracle needs to create more dirty buffers to continue DML operation, but cannot because existing dirty buffers must be written to disk to create space
  • Oracle posts the DBWR process(es) to write out dirty buffers

(all the above happen on well-tuned, healthy systems also, though these may never struggle to have free buffers available because of well-performing writes)

  • DBWR scans the dirty list and issues writes one at a time to the operating system, waiting for each to complete before issuing the next. This is the effect of the missing Async I/O configuration in the database.
  • The operating system takes out a file lock (for write) on the datafile, and issues the write. No other DBWR processes can write to this file at this point. This is the side effect of having the wrong kind of filesystem, and implies that only one write can go to a file at any one time.
  • The operating system issues a write to the relevant meta on the DMX, which resolves to a specific hyper inside the box. That’s the single outstanding write for that hyper now in flight. No other writes can occur to that hyper at this point until this one is complete.

It’s easy to see how, when all the write I/Os are being fed to the DMX one at a time, that the additional latency of having SRDF in the equation makes a really big difference. It’s also easy to see that, by turning off SRDF, the problem will get less severe. I’m not defending EMC here, they deserve everything they get when it’s their fault. It just isn’t primarily an SRDF problem in this case. Yes, turning off SRDF or going SRDF/A will help, but it’s still fixing a downstream bottleneck.

The real culprit here is the file locking in the filesystem. This file locking is disabling the storage admin’s design of presenting many hypers up to the host to mitigate the SRDF overhead. In addition, operating system file locking on database files is just plain stupid, and I was hoping to have seen the last example of this in the early 90s; but this is the second one I’ve seen in 3 years… I’m not saying that the people that implement the systems this way are stupid, but it’s pretty easy to be naive about some critical areas when the complexity is so high and unavoidable boundary issues exist between the teams.

The lack of Async I/O is not good here, either, though the presence of multiple DBWRs is mitigating the impact somewhat, and the filesystem would quickly stomp on any improvements made by turning on Async I/O. I don’t believe that this filesystem would support Async anyway until the file locks were bypassed, so it’s two for the price of one here.
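
To make the async I/O point concrete, here is a minimal sketch, assuming a Linux host with the libaio library available. This is not Oracle’s code – just an illustration of the difference between issuing writes one at a time (each one waits for the previous) and submitting a batch in a single call:

/* Minimal sketch; assumes Linux with libaio. Build with: gcc -std=gnu99 sketch.c -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NWRITES 8
#define BLKSZ   8192

/* One write at a time: each pwrite() must complete before the next is issued -
   effectively what a database writer is reduced to when async I/O is unavailable. */
static void serial_writes(int fd, char *buf)
{
    for (int i = 0; i < NWRITES; i++)
        pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ);
}

/* Batched submission: all the writes are in flight together, then reaped. */
static void batched_writes(int fd, char *buf)
{
    io_context_t ctx = 0;
    struct iocb cbs[NWRITES], *cbp[NWRITES];
    struct io_event events[NWRITES];

    io_setup(NWRITES, &ctx);
    for (int i = 0; i < NWRITES; i++) {
        io_prep_pwrite(&cbs[i], fd, buf, BLKSZ, (long long)i * BLKSZ);
        cbp[i] = &cbs[i];
    }
    io_submit(ctx, NWRITES, cbp);                  /* one call, NWRITES I/Os queued */
    io_getevents(ctx, NWRITES, NWRITES, events, NULL);
    io_destroy(ctx);
}

int main(void)
{
    char *buf;
    int fd = open("aio_demo.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0 || posix_memalign((void **)&buf, 4096, BLKSZ) != 0) {
        perror("setup");
        return 1;
    }
    memset(buf, 'x', BLKSZ);
    serial_writes(fd, buf);
    batched_writes(fd, buf);
    close(fd);
    free(buf);
    return 0;
}

With the serial loop, the storage below only ever sees one outstanding write per file, no matter how many hypers sit underneath it.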

With multiple consecutive points of serialisation, it is not surprising that the system was struggling to achieve good throughput.

What’s the lesson here? There are two, really:

  1. Just knowing ‘your area’ isn’t enough.
  2. If you try to walk with two broken legs, you will fall down and bang your head. The fix, however, is not a painkiller for the headache.

EDIT: I have realised upon reflection that I only implied the reason that the file locking makes SRDF/S worse, rather than spelling it out. The reason it makes it worse is that it (file locking) enforces only a single write to that file at once. This means that this particular write is more than likely going to a single hypervolume, and thus eliminating any parallelism that might be achievable from SRDF. FYI, metavolumes have a stripe width of 960KB, so it’s really likely that any single write will only go to one hyper.
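
To illustrate that last point, here is a small sketch that maps a write onto the stripe members of a metavolume. The 960KB stripe width is the figure quoted above; the write sizes, the metavolume width and the round-robin striping assumption are purely illustrative:

#include <stdio.h>

#define STRIPE_BYTES (960 * 1024)   /* metavolume stripe width quoted above */

/* Which stripe members (hypers) does a write at this offset touch?
   Assumes simple round-robin striping purely for illustration.     */
static void map_write(long long offset, long long size, int nhypers)
{
    long long first = (offset / STRIPE_BYTES) % nhypers;
    long long last  = ((offset + size - 1) / STRIPE_BYTES) % nhypers;
    printf("%4lldKB write at offset %lldMB -> hypers %lld..%lld\n",
           size / 1024, offset / (1024 * 1024), first, last);
}

int main(void)
{
    int nhypers = 16;                                    /* illustrative metavolume width */
    map_write(100LL * 1024 * 1024, 8 * 1024, nhypers);   /* a typical 8KB DBWR write      */
    map_write(100LL * 1024 * 1024, 512 * 1024, nhypers); /* a large multi-block write     */
    return 0;
}

Even the large half-megabyte write straddles at most two hypers, so with only one write in flight per file there is essentially nothing for SRDF to overlap.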

Sane SAN 2010 – Introduction

Posted August 23, 2010 by James Morle
Categories: Oracle, Performance, SaneSAN2010, Storage

This year at the UKOUG Conference in Birmingham, acceptance permitting, I will present the successor to my original Sane SAN whitepaper first penned in 2000. The initial paper was spectacularly well received, relatively speaking, mostly because disk storage at that time was very much a black box to DBAs and a great deal of mystique surrounded its operation. Well, nothing much has changed on that front, so I figured it was very much time to update/rewrite the paper for modern technology and trends and try to impose my occasionally humble opinion on the reader 🙂

I’ve already posted one article on this blog that will form part of the paper, and I will write a few more over the next few weeks – keep an eye on the blog. The first one out of the bag is “Serial to Serial – When One Bottleneck Isn’t Enough.”

If the UKOUG don’t accept it, I’ll post it on the Scale Abilities website anyway, and try to palm it off on some other conferences in 2011 🙂

Log File Sync and AWR – Not Good Bedfellows

Posted June 2, 2010 by James Morle
Categories: Oracle

I’m sure many of you are already enlightened enough about the dangers of ratios and averages, the most famous blunder of all being the ‘cache hit ratio’. It’s not that all ratios and averages are useless, but it is surely the case that most are useless, and that all can be misleading. This posting is about an example of a misleading average, one that hides the truth from the reader unless they look elsewhere.

I am actually a big fan of AWR (and Statspack). They are chock full of misleading averages and ratios, but that does not mean they don’t have value. Often it is more a case of reading what is absent from the report, reading between the lines and so forth, that makes them so valuable. I’m also frequently asked to fix problems that are no longer occurring – without the history that AWR provides this would be an impossible task, assuming that Active Session History would therefore also be unavailable.

I was recently asked to look over some AWR reports from a performance testing environment  to determine what was going on. As usual, I started by looking over the top section of the reports to get a feel for the workload and to look at the wait events. Here’s the load profile:

                 Per Second   Per Transaction
                 ----------   ---------------
Redo size:       140,604.47            325.65
Logical reads:    16,712.80             38.71
Block changes:       904.42              2.09
Physical reads:        7.59              0.02
Physical writes:      70.76              0.16
User calls:        2,069.02              4.79
Parses:              342.31              0.79
Hard parses:           0.07              0
Sorts:                52.54              0.12
Logons:                 0.47              0
Executes:            342.82              0.79
Transactions:        431.77

A few things immediately stand out on this load profile:

  • this is not a really busy system
  • there is virtually nil I/O to/from the datafiles
  • it is using bind variables (hard parses is low)
  • it is still making parse calls for every statement (parses=executes)
  • it commits or rolls back very aggressively (transactions high compared to executes)

It’s the latter point that is most notable here. Transactions is actually higher than executes, meaning there are more commit/rollback calls than there are calls to execute SQL. So, we know we have an application that is committing pretty much after every statement. Let’s move on to the wait events:

Event                       Waits  Time(s)  AvgWait   % Total Wait Class
-----                       -----  -------  -------   ------- ----------
log file sync             357,552   22,427       63    90.4   Commit
CPU time                             1,636      6.6
log file parallel write   275,265      972        4     3.9   System I/O
read by other session       7,848      533       68     2.1   User I/O
db file parallel write     87,473      325        4     1.3   System I/O

It’s pretty clear from the wait events that this really is a commit-heavy application. I validated this with the developer – each transaction is basically an INSERT or an UPDATE followed by a commit. But here’s the mystery – why are the ‘log file sync’ wait times so much higher than the ‘log file parallel write’ times? An average write time of 4ms isn’t fantastic by any stretch of the imagination, but it’s not bad, and where are the other 59ms going? I had a bit of a think about this and found the answer – the truth is hidden between the lines, and disguised by the averaging inside AWR.

The key to this mystery is what the specific wait events mean, and which sessions wait for them:

  • log file sync – user sessions wait on this event after a commit is issued
  • log file parallel write – LGWR is the only process that waits on this event, and it is the actual I/O (and associated setup) that stamps the redo buffer information onto disk

The process flow goes a bit like this for a session making a commit:

  1. User session issues a commit and goes into a ‘log file sync’ wait
  2. LGWR is posted to flush the redo buffer to disk
  3. LGWR issues the I/O and waits on ‘log file parallel write’
  4. LGWR returns from I/O, clears wait, and posts the user session to signal completion
  5. User session clears ‘log file sync’ wait

There are other operations that I have deliberately omitted from that description for clarity (they are not really relevant for the purposes of this discussion). In the case where there are many sessions wanting to commit, the process is slightly different, because of Oracle’s group commit (or piggyback commit) functionality. If a second session issues a commit while the LGWR is busy writing the redo for the first session, it too will wait on log file sync. If a third, fourth, fifth and sixth session issue commits, they will also wait on log file sync. They will wait for LGWR to finish writing the redo required for the first session; LGWR is then immediately posted again to write the redo for all of the newly waiting sessions. These sessions (2-6) will continue to wait until this second write is complete.

What this means is that we have a strange kind of queue. It can only be two entries deep (the session currently being taken care of by LGWR and the session(s) that committed after the first session) and infinitely wide (all the subsequent sessions are handled as a single request). What it means in terms of AWR, however, is that the averages get distorted beyond all recognition.

I’ll tell you now that the underlying problem with this system was a response time problem on redo writes, validated by looking at the LGWR trace file, which was littered with ‘slow write’ warnings (what a great feature, introduced in 10g!). I believe the write cache on the storage was getting saturated, but have not yet been able to confirm that with the client’s storage team. So why was this not being reflected in the ‘log file parallel write’ wait events? The reason is that the slow writes were very slow (some greater than one second), but the majority of the writes were very fast. Consider the following set of write times:

  • 3000 writes take 0.5ms
  • 20 writes take 1000ms
  • 2000 writes take 1ms

The total time for all those writes is 23,500ms. Divide that by the number of requests (5020), and the average write time is 4.68ms, close to the number reported in the AWR report. The trouble is that 85% of the time is spent in only 0.4% of the calls, so the huge number of fast writes drags the average down and hides the outliers (even with very rapid AWR sampling this would be the case). That’s the first part of the problem.

The second part of the problem is what happens when one of these outlier 1000ms I/O times takes place. We can see that the average (!) execute rate was 342/s, and let’s assume that all of these executions were DML that had a commit after them. We could use the Transactions rate, but that will also include a bunch of rollbacks, and I happen to know from the developer that each execute is a DML operation. So, we have an arrival rate of 342 commits per second and occasional peaks of I/O time of 1000ms. As soon as one of these spikes occurs, 342 sessions will start to accumulate a portion of that 1000ms (depending upon where in that period they issued the commit). If we assume that they arrive evenly spread out across that 1000ms, we can use the area of a triangle to calculate the total time clocked up to log file sync during that period:

  • Area of right-angled triangle = 1/2 base x height
  • Base=1000ms
  • Height=number of sessions waiting, maximum 342
  • Answer=171,000ms

So, the total wait time attributed to log file parallel write is only 1000ms during this period, but the total wait time attributed to log file sync is 171,000ms. You see what I did there? I disguised the calculation of an average as something called ‘area of a triangle’. Shh, don’t tell anyone.

When this slow write is over, the sessions will once again wait for the fast writes, perhaps one or two sessions at a time waiting for sub-millisecond writes, which means the average time is not dragged down quite so far as the ‘log file parallel write’ wait event. Let’s apply the waits to the same theoretical model as the log file parallel write to see:

  • 3000 single log file sync waits at 0.5ms = 1500ms
  • 20 periods of log file sync madness described above = 3,420,000 ms
  • 2000 single log file sync waits at 1ms = 2000ms

Total number of waits here is 5000+6840 (this is 20 x 342) = 11,840 total waits, compared to 5020 for the log file parallel write. The total wait time, however, is 3,423,500ms, making an average wait time of 289ms which would be reported by AWR for ‘log file sync’, compared to a 4.68ms ‘log file parallel write’ time. That’s somewhat more exaggerated than the actual figures in the AWR report, but it does highlight the issue quite well.
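
The arithmetic is easy to reproduce; here is a minimal sketch using the same synthetic numbers as above:

#include <stdio.h>

int main(void)
{
    /* Synthetic redo write profile from the example above (times in ms). */
    double fast_a = 3000 * 0.5;        /* 3000 writes at 0.5ms  */
    double slow   = 20 * 1000.0;       /* 20 outliers at 1000ms */
    double fast_b = 2000 * 1.0;        /* 2000 writes at 1ms    */
    double lfpw_waits = 3000 + 20 + 2000;
    double lfpw_time  = fast_a + slow + fast_b;

    /* log file sync: during each 1000ms outlier, ~342 committing sessions
       queue up, each accumulating (on average) half of that 1000ms.      */
    double sessions   = 342;
    double burst_time = 0.5 * 1000.0 * sessions;        /* the triangle area */
    double lfs_waits  = 3000 + 2000 + 20 * sessions;
    double lfs_time   = fast_a + fast_b + 20 * burst_time;

    printf("log file parallel write: %.0f waits, average %.2fms\n",
           lfpw_waits, lfpw_time / lfpw_waits);
    printf("log file sync          : %.0f waits, average %.0fms\n",
           lfs_waits, lfs_time / lfs_waits);
    return 0;
}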

AWR is a useful tool, but issues like this are masked. Use ASH or session based tracing for these kind of problems, and the answer will become clear!

“Flash” Storage Will Be Cheap – The End of the World is Nigh

Posted May 29, 2010 by James Morle
Categories: Oracle, Performance, SaneSAN2010, Storage

A couple of weeks ago I tweeted a projection that the $/GB for flash drives will meet the $/GB for hard drives within 3-4 years. It was more of a feeling based upon current pricing with Moore’s Law applied than a well researched statement, but it felt about right. I’ve since been thinking some more about this within the context of current storage industry offerings from the likes of EMC, Netapp and Oracle, wondering what this might mean.

First of all I did a bit of research – if my 3-4 years guess-timate was out by an order of magnitude then there is not much point in writing this article (yet). I wanted to find out what the actual trends in flash memory pricing look like and how these might project over time, and I came across the following article: Enterprise Flash Drive Cost and Technology Projections. Though this article is now over a year old, it shows the following chart which illustrates the effect of the observed 60% year on year decline in flash memory pricing:

Flash Drive Pricing Projections

This 60% annual drop in costs is actually an accelerated version of Moore’s Law, and does not take into account any radical technology advances that may happen within the period. This drop in costs is probably driven for the most part by the consumer thirst for flash technology in iPods and so forth, but naturally ripples back up into the enterprise space in the same way that Intel and AMD’s processor technologies do.
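
As a rough illustration of how quickly a 60% annual decline closes a price gap, here is a small sketch. The starting price ratio and the disk decline rate are pure assumptions for illustration, not researched figures:

#include <stdio.h>

int main(void)
{
    /* All of these figures are assumptions for illustration only. */
    double ratio      = 20.0;   /* assume flash starts at 20x the $/GB of disk */
    double flash_keep = 0.40;   /* flash keeps 40% of its price each year      */
    double disk_keep  = 0.90;   /* assume disk $/GB falls 10% per year         */

    for (int year = 1; year <= 5; year++) {
        ratio *= flash_keep / disk_keep;
        printf("year %d: flash is %.2fx the $/GB of disk\n", year, ratio);
    }
    return 0;
}

With these made-up starting assumptions the crossover lands around year four, which is at least in the same ballpark as the chart.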

So let’s just assume that my guess-timate and the above chart are correct (they almost precisely agree) – what does that mean for storage products moving forward?

Looking at recent applications of flash technology, we see that EMC were the first off the blocks by offering the option of relatively large flash drives as drop-in replacements for their hard drives in their storage arrays. Netapp took a different approach of putting the flash memory in front of the drives as another caching layer in the stack. Oracle have various options in their (formerly Sun) product line and a formalised mechanism for using flash technology built into the 11g database software and into the Exadata v2 storage platform. Various vendors offer internal flash drives that look like hard drives to the operating system (whether connected by traditional storage interconnects such as SATA or by PCI Express). If  we assume that the cost of flash technology becomes equivalent to hard drive storage in the next three years, I believe all these technologies will quickly become the wrong way to deploy flash technology, and only one (Oracle) has an architecture which lends itself to the most appropriate future model (IMHO).

Let’s flip back to reality and look at how storage is used and where flash technology disrupts that when it is cheap enough to do so.

First, location of data: local or networked in some way? I don’t believe that flash technology disrupts this decision at all. Data will still need to be local in certain cases and networked via some high-speed technology in others, in much the same way as it is today. I believe that the networking technology will need to change for flash technology, but more on that later.

Next, the memory hierarchy: Where does current storage sit in the memory hierarchy? Well, of course, it is at the bottom of the pile, just above tape and other backup technologies if you include those. This is the crucial area where flash technology disrupts all current thinking – the final resting place for data is now close or equal to DRAM memory speeds. One disruptive implication of this is that storage interconnects (such as Fibre Channel, Ethernet, SAS and SATA) are now a latency and bandwidth bottleneck. The other, potentially huge, disruption is what happens to the software architecture when this bottleneck is removed.

Next, capacity: How does the flash capacity sit with hard drive capacity? Well that’s kind of the point of this posting… it’s currently quite a way behind, but my prediction is that they will be equal by 2013/2014. Importantly though, they will then start to accelerate away from hard drives. Given the exponential growth of data volumes, perhaps only semiconductor based storage can keep up with the demand?

Next, IOPs: This is the hugely disruptive part of flash technology, and is a direct result of a dramatically lowered latency (access time) when compared to hard disk technology. Not only is the latency lowered, but semiconductor-based storage is more or less contention-free given the absence of serialised moving parts such as a disk head. Think about it – the service time for a given hard drive I/O is directly affected by the preceding I/O and where the head was left on the platter. With solid-state storage this does not occur and service times are more uniform (though writes are consistently slower than reads).

These disruptions mean that the current architectures of storage systems are not making the most of semiconductor-based storage. Hey, why do I keep calling it “semiconductor-based storage” instead of SSD or flash? The reason is that the technologies used in this area are changing frequently, from DRAM-based systems to NOR-based flash to NAND based flash to DRAM-fronted flash; Single-level cells to Multi-level cells; battery-backed to “Super Cap” backed. Flash, as we know it today, could be outdated as a technology in the near future, but “semiconductor-based” storage is the future regardless.

I think that we now need technologies that look more like Oracle Exadata v2, with low-latency RDMA interfaces directly into the Operating System/Database. However, they need to easily and natively support other types of storage (unstructured data such as files, VMware datastores and so forth). The Exadata architecture lends itself well to changes in this area in both hardware trends and access protocols.

Perhaps more importantly, we are also only just beginning to understand the implications in software architecture for the disrupted memory hierarchy. We simply cannot continue to treat semiconductor-based storage as “fast disk” and need to start thinking, literally, outside the box.

The Oracle Wait Interface Is Useless (sometimes) – part 3b

Posted February 16, 2010 by James Morle
Categories: Generic, Oracle

Welcome back for the concluding part of this series of blogs.

In the last part I went through a brief primer about code execution and stack frames in preparation for this posting. I hope that wasn’t too boring for everyone – I think it’s really important to make sure all the preliminary knowledge is covered and not make too many assumptions about the reader. In this part I will finally get to the point, and talk about some alternative techniques for determining the reasons for poor performance for our example user session.

To recap very quickly, we have a user session taking 100% CPU in top, nothing is emitted in SQL trace files, Oracle’s Time Model shows it is executing SQL, and that’s about all the information we have. As Tanel stated in his posting for Part Two, this is the stage where it makes good sense to dig into session-level statistics (not system-level statistics) to see if any of the counters are incrementing in a conspicuous way. If you don’t already have a copy of Tanel’s Snapper code, I suggest you grab it and get familiar with it – it’s a nice implementation of an essential tool that many DBAs already have some kind of script for.

Tanel found in Part Two that the SQL execution was revisiting the same small number of buffers very frequently, indicating that there might be a problem with one of the Nested Loop join methods used by the optimiser. Perhaps there is something wrong with that part of the plan (or with the whole plan)? Using counters can often reveal a potential problem this way, but they don’t necessarily  put a neon sign reading “Here’s Where All The Time Went” on top of the offending item. This is where we might want to dig further using some of our knowledge of code execution and stack traces, in combination with a dynamic tracing infrastructure such as Dtrace or Systemtap.

The power of Dtrace and friends lies in a few key areas:

  1. It can report time-based information
  2. It can show user and kernel information
  3. It is dynamic – unlike gprof and other code profilers, it is not necessary to run everything with a profiled binary

These are really key features which, when combined, allow for tremendously powerful on-demand tracing.

Consider the following sample output from a very small Dtrace script to look into what happens when a single printf() library call is made from a C program:

CPU FUNCTION
  0  -> my_function                            USER CODE
    (57 lines of dynamic library loading omitted)
  0    -> printf                               USER CODE
  0      -> _setorientation                    USER CODE
  0      <- _setorientation                    USER CODE
  0      -> _ndoprnt                           USER CODE
  0        -> _findbuf                         USER CODE
  0          -> isatty                         USER CODE
  0            -> ___errno                     USER CODE
  0            <- ___errno                     USER CODE
  0            -> ioctl                        USER CODE
  0              -> syscall_mstate             KERN
  0                -> gethrtime_unscaled       KERN
  0                  -> tsc_gethrtimeunscaled_delta  KERN
  0                    -> tsc_gethrtimeunscaled  KERN
  0                    <- tsc_gethrtimeunscaled  KERN
  0                  <- tsc_gethrtimeunscaled_delta  KERN
    (180 more lines of kernel functions for ioctl call omitted)
 0            <= ioctl                         SYSCALL: ioctl 1743296 ns
    (34 lines of kernel function (probably DTrace intrumentation!) omitted)
 0        <- ioctl                             USER CODE
 0  <= fstat64                                 SYSCALL: fstat64 352699 ns
    (~300 lines of kernel  and dynamic library loading omitted)
 0    -> _write                                USER CODE
    (~200 lines of kernel  and dynamic library loading omitted)
 0  <= write                                   SYSCALL: write 2154427 ns
    (123 lines of kernel code omitted)
 0  <- printf                                  USER CODE

In the above output, there are around 1000 functions called in User and Kernel domains. The output gives you some idea of the level of detail possible with Dtrace; we can see all of the following attributes:

  • All User mode function calls are logged
  • All Kernel mode function calls are logged
  • All system calls are timed

Note: The printf() itself is very much a user-mode call, part of the standard C library (libc). It actually makes a couple of system calls under the bonnet: Initially an ioctl() and eventually the write(2) system call.

Let’s look at something else – time spent in the Virtual Memory subsystem of the kernel doing Page Ins (page faults). Again, this is a very simple script, and produces output similar to the following:

WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 90 ms
WAIT: PAGE IN: 836 ms
WAIT: PAGE IN: 1098 ms
WAIT: PAGE IN: 853 ms
WAIT: PAGE IN: 816 ms
FINAL WAIT: PAGE IN: 263 ms

So now we can quantify how much time is spent servicing page faults, in this case showing approximately how much per second. Here we have very specific information about how much time our particular session is spending servicing page faults. Don’t assume that soft page faults are cheap/fast, either – with increasing virtualisation trends, soft page faults as observed by the guest operating system could actually be hard page faults in the underlying hypervisor.

These are just a few of the things that can be monitored in Dtrace. Now let’s combine the Oracle stack tracing that we introduced in part 3a.  In that posting I concluded that we could dump the stack trace once in a while, observe the name of the current function, and determine where we are spending the most time. With Dtrace we can make this much more concrete, and can quantitatively report the time spent in each function, and combine other contributing factors from the Operating System! I have put together a Dtrace script to demonstrate this concept. I won’t post the script here (yet) because it is still very much in development and still has a number of deficiencies. What I’m trying to get to here is a really quick way to see exactly what an Oracle session is doing by only looking at the running process using Dtrace. The script currently has the following features:

  • Shows all OWI wait events and durations
  • Shows all system calls and durations
  • Shows timing stats every second for: OWI wait times, page faults, latch acquisition, and a selection of common rowsource fetch operators

The rowsource operators are taken from tracing the function calls to the various rowsource operators defined in Metalink note 175982.1. Specifically, the script currently traces calls to the following rowsource operators:

  • GS – Group By Sort
  • JO – Join (Nested Loop)
  • UA – Union All
  • VW – View
  • HJ – Hash Join
  • TB – Table
  • JOT – Unknown. Possible relation to Join, observed frequently during stack tracing
  • SO – Sort
  • IXT – Index
  • PF – Prefetch

Running the same misbehaving SQL as Tanel in his last posting, we get the following output (snippet only):

LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 464:JO 0:UA 0:VW 0:HJ 0:TB 21:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=1 rsrc=(0:GS 430:JO 0:UA 0:VW 0:HJ 0:TB 9:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 471:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 467:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
SYSCALL: times 7104 ns
SYSCALL: times 1749 ns
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 22:JOT 0:SO)

Remember that this SQL example precisely fits our profile where the session is using 100% usermode CPU and emitting no wait information, so we cannot see any OWI waits detailed here – just summaries of where the time was spent every second and a couple of system calls. We also have no page faults taking place on this test, so no time is accounted for there. The big user of CPU (468ms per second) is the nested loop join rowsource operator, which agrees with Tanel’s findings using session-statistics. We are also spending a little time acquiring latches, but it’s not very significant. There are clearly some rowsource functions missing, given that we can only account for half the CPU time, so I did a bit more function tracing to find the names of the missing functions, and added them to the script. The output now looks as follows (scroll right to see the new functions):

LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 461:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 163:IXT 759:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 691:JO 0:UA 0:VW 0:HJ 0:TB 455:JOT 0:SO 304:IXT 843:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 460:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 162:IXT 759:PF)
SYSCALL: times 7810 ns
SYSCALL: times 1750 ns
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 686:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 301:IXT 842:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 436:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 177:IXT 721:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 708:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 286:IXT 880:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 357:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 228:IXT 588:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 788:JO 0:UA 0:VW 0:HJ 0:TB 445:JOT 0:SO 235:IXT 1013:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 277:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 276:IXT 458:PF)

Looks like we’ve found the missing time, and then some (the timers add up to more than 1000ms each second). The reason for this timing exaggeration is that there is a good deal of double accounting going on with the current version of the script – child function times are also accounted for in the parent. This is a known bug, hopefully I’ll get a chance to fix that one soon.

Well that’s about it for now. I hope that this journey has been an interesting one, I know it’s been a lot of reading to take in all the parts. We started with a session that was emitting no wait information, and now we have a simple Dtrace script that is easy to execute with no impact to other sessions, and easy to stop once the underlying cause is determined. The query does not need to be stopped and restarted, and no additional overhead is imposed from enabling additional Oracle diagnostics such as setting statistics_level=all. I intend to expand on this approach over time, with a particular focus on:

  • Multiplatform support – Systemtap version is required
  • Better design to support more function tracing
  • Fix double counting issue
  • Add identifiers for optimiser plan line number rather than rolling up by function name

Upcoming Presentation Slots, Spring 2010

Posted February 1, 2010 by James Morle
Categories: Uncategorized

I will be presenting the Oracle Wait Interface is Useless Dog and Pony show a couple of times this spring. It was first presented at the UKOUG 2009 with Tanel Poder, and will be seeing the light of day at the following events:

I hope to see some of you there, and will get the final part posted before RMOUG – so the picture will be complete. If you’ve got any questions, it’s a good opportunity to come and heckle at one of those events!

The Oracle Wait Interface Is Useless (sometimes) – part 3a

Posted January 18, 2010 by James Morle
Categories: Generic, Oracle

OK, here it is, the ‘first part of the last part’, though the topics discussed in these articles will be discussed more over time in my blog and in Tanel’s. I’ve split it into two subparts, because it was just getting insanely long as a single posting.

Before I get going on this one, let’s just clear up a misunderstanding from the first part of this series. The first part uses a specific example which, for clarity reasons, will continue to be used for the remainder of the series. The example shown is some kind of query-related issue in this case, but the approach shown here is a more general one that will locate the root cause in many, many other cases than just the case of a bad query. This SQL example is still a good one, though, because there are no waits, lots of CPU, and no system calls. But the underlying problem might not be caused by poor SQL, and diving straight into SQL tuning could waste you 2 days fixing SQL when that is not the root cause. Just because we happen to be running SQL, it does not mean that it is a SQL problem. What we are trying to achieve here is a more robust approach to root cause diagnosis of Oracle performance problems, one where we don’t guess, we don’t assume, we don’t stick with things we’ve seen before: We quantitatively find the real problem. That might be the smart way, it might be the lazy way – those might be the same thing! But most crucially, it’s the fast way to fixing the real problem.

So, back to the main topic. In the first part of this blog, I walked through the example problem and showed that we have a problem that has the following attributes:

  • The user is reporting a server response time problem
  • Zero Oracle waits
  • 100% CPU
  • Zero system calls
  • DB Time is showing all the CPU and elapsed time is spent executing SQL (in this example…)

Note to Kevin: Though I stated the query ‘never’ returns in part one, I was, of course, using my usual exaggerative vernacular. I meant that it was taking a very long time to return, but still returning correct results. Sorry about that – it will ‘never’ happen again.

At this point, we might jump into a prolonged SQL tuning exercise. But let’s not for now, because that is still guesswork in my opinion. Let’s make the following reasonable assumption: The whole system is running slowly, using huge quantities of CPU, and this is a mission-critical system where you cannot simply start killing queries and re-executing them in your SQL*Plus session. That’s a reasonable assumption – I have worked on many, many, production systems where this is the case. So let’s carry on diagnosing without actually interfering with the production system in any way.

This is where we might deviate from time-based diagnostics. Not because we don't think that time is the single most valuable metric for determining what 'fast' and 'slow' actually mean in quantitative terms, but because there is a bit of a shortcut available to us here that might just nail the cause of the problem. That shortcut is session-based statistics, which have nothing to do with time: they are simple counters.

This is where it makes no sense at all to re-invent the wheel, which is why Tanel is on the hook for this part: he has already done significant work on session-based statistics analysis and has written a great tool, snapper, for just this purpose. If you have not already read part two of this series, head over to Tanel's blog now to catch up before proceeding!

OK, so we’ve abandoned the wait interface, and let’s assume that the sessions statistics did not yield anything particularly obvious. Or, maybe you just want to dig a little deeper, or possibly take yet another short cut to the truth? Now it’s time to look beyond Oracle and into the Operating System. The Operating System has a completely different viewpoint of the operation of Oracle, notably through the use of execution stack profiling. Don’t run away just yet, it’s not as difficult as you might imagine.

Various methods exist to probe the execution stack on the different Operating Systems. All of these should be used with caution, as they have varying degrees of intrusiveness on the operation of Oracle. The more intrusive methods, such as gdb, and the (possibly) less tested methods, such as oradebug short_stack, have very real potential to crash the process that you are attaching to, or worse. Don't say I didn't warn you! However, other methods for profiling the stack are less intrusive and are already very robust: I'm talking here about DTrace and its variants.

I have held off from getting too excited by DTrace in the past, even though the technology always looked amazing. The reason for holding back was that it was a single-OS tool, and without anything comparable on the Linux platform I didn't see it being useful to more than 75% of my customers. That's changing now, though, with the (at least partial) production release of Systemtap in RHEL 5.4, and there is even a similar tool for AIX named ProbeVue. So now I'm excited about this technology!

Before digging into DTrace and friends, I think it's important that we take a little diversion to refresh what a stack actually is, and why it is important. For those with a good coding background: look away now, and rejoin the group for part 3b. Just to keep your interest up, though, would it be interesting to get a second-by-second view of where the time goes for a process, down to the following levels of detail?

  • Which line of the plan did I spend most time in?
  • How much time did I spend waiting for page faults?
  • Which wait events did I spend most time in?

That’s going to be part 3b… so please check back periodically!

Back to the “Stack Primer for DBAs”: Consider the following elementary piece of C code:

#include <stdio.h>

int i;

void my_function(void);

int
main(void) {

	scanf("%d", &i);	/* pause for some input, then make the call */
	my_function();

	return 0;
}

void
my_function(void) {
	/* imagine taking a stack dump while execution is in here */
}

Don’t worry about the details of the language if you are unfamiliar with C. The only things to observe here are that we have one “function”, named ‘main()’, which is making a call to another function ‘my_function()’. The main() function is a special one – all standalone C programs have a main() function, which is the initial point of execution. A C program is just a set of compiled machine code instructions at execute time, just like any other computer program. These instructions occupy a chunk of memory, where each memory location is either a machine code instruction or a piece of data. When the code gets down to the call to my_function() in main(), the execution must jump (or branch) to the address that my_function()’s machine code instructions reside at. Before this can happen, an amount of context must be stored, namely:

  • The ‘return’ address, so that execution can resume in the main() function after my_function() is complete
  • Any ‘data’ passed to my_function() (the function arguments)

In C, this information is pushed onto the stack and then execution branches to the entry point of my_function(). As a sidenote, mixing data and execution context on the stack in this way is a common vector for security exploits, where oversized data from the data portion overwrites the stored execution context. We're not interested in all that here, though; we are only interested in observing the stack contents. If we took a stack dump while the program was in the my_function() function, it would look like this:

my_function()
main()

Pretty simple, huh? If there were arguments to main() and my_function(), they would also be shown in some way. Most stack trace facilities can only show primitive datatypes (such as integers); all other, more complex arguments (such as structs) are shown as pointers (addresses) to the actual data, without decoding the content, as follows:

my_function(42, 0x12345678)
main(0x45678900)
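As a quick aside, if you want to play with this yourself on Linux, glibc provides a backtrace() facility that lets a program print its own stack. The following is just a minimal sketch of mine to illustrate the idea, reusing the my_function()/main() example from above; it is not anything that Oracle itself does:

#include <stdio.h>
#include <stdlib.h>
#include <execinfo.h>

void
my_function(void) {
	void *frames[16];
	int depth, i;
	char **names;

	depth = backtrace(frames, 16);            /* capture the chain of return addresses */
	names = backtrace_symbols(frames, depth); /* resolve them to symbol names */

	for (i = 0; i < depth; i++)
		printf("%s\n", names[i]);         /* my_function() first, then main(), then the libc startup frames */

	free(names);
}

int
main(void) {
	my_function();
	return 0;
}

Compile it with something like 'cc -rdynamic -o stackdemo stackdemo.c' so that backtrace_symbols() can resolve the function names, and the output has the same shape as the two hand-written dumps above, with the libc startup frames (__libc_start_main() and friends) at the bottom, just as we are about to see in a real Oracle stack.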

Some tools will print a stack trace (known as a backtrace, because it is unwound backwards from the current position in the stack) with some known pointers decoded into human readable form, such as simple struct datatypes. Anyway, let’s have a look at something a bit more complicated – an Oracle backtrace taken from an errorstack trace file:

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   7FFFB345A778 ? 000000000 ?
ksedst1()+98         call     skdstdst()           000000000 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksedst()+34          call     ksedst1()            000000001 ? 000000001 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
dbkedDefDump()+2736  call     ksedst()             000000001 ? 000000001 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksedmp()+36          call     dbkedDefDump()       000000001 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksdxfdmp()+1837      call     ksedmp()             000000001 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksdxcb()+1782        call     ksdxfdmp()           7FFFB345BAE0 ? 000000011 ?
                                                   000000003 ? 7FFFB345BA40 ?
                                                   7FFFB345B9A0 ? 000000000 ?
sspuser()+112        call     ksdxcb()             000000001 ? 000000011 ?
                                                   000000001 ? 000000001 ?
                                                   7FFFB345B9A0 ? 000000000 ?
__restore_rt()       call     sspuser()            000000001 ? 000000011 ?
                                                   000000001 ? 000000001 ?
                                                   7FFFB345B9A0 ? 000000000 ?
semtimedop()+10      signal   __restore_rt()       000018000 ? 7FFFB345C730 ?
                                                   000000001 ?
                                                   FFFFFFFFFFFFFFFF ?
                                                   FFFFFFFFFFD23940 ?
                                                   000000000 ?
sskgpwwait()+259     call     semtimedop()         000018000 ? 7FFFB345C730 ?
                                                   000000001 ? 7FFFB345C6D8 ?
                                                   FFFFFFFFFFD23940 ?
                                                   000000000 ?
skgpwwait()+151      call     sskgpwwait()         7FFFB345CB94 ? 00A9A15C0 ?
                                                   07848C4D8 ? 0002DC6C0 ?
                                                   7FFF00000000 ? 000000000 ?
ksliwat()+1816       call     skgpwwait()          7FFFB345CB94 ? 00A9A15C0 ?
                                                   000000000 ? 0002DC6C0 ?
                                                   000000000 ? 000000000 ?
kslwaitctx()+157     call     ksliwat()            078666348 ? 078666348 ?
                                                   005F5DFA7 ? 000000000 ?
                                                   100000000 ? 000000000 ?
kslwait()+136        call     kslwaitctx()         7FFFB345CE30 ? 000000000 ?
                                                   005F5DFA7 ? 000000000 ?
                                                   100000000 ? 000000000 ?
psdwat()+107         call     kslwait()            005F5DFA7 ? 000000167 ?
                                                   000000000 ? 005F5DFA7 ?
                                                   000000000 ? 000000000 ?
pevm_icd_call_commo  call     psdwat()             005F5DFA7 ? 000000167 ?
n()+421                                            000000000 ? 005F5DFA7 ?
                                                   000000000 ? 000000000 ?
pfrinstr_ICAL()+164  call     pevm_icd_call_commo  7FFFB345E0F0 ? 000000000 ?
                              n()                  000000001 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
pfrrun_no_tool()+63  call     pfrinstr_ICAL()      7FAF7E8A3500 ? 06A03B9AA ?
                                                   7FAF7E8A3570 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
pfrrun()+1025        call     pfrrun_no_tool()     7FAF7E8A3500 ? 06A03B9AA ?
                                                   7FAF7E8A3570 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
plsql_run()+769      call     pfrrun()             7FAF7E8A3500 ? 000000000 ?
                                                   7FAF7E8A3570 ? 7FFFB345E0F0 ?
                                                   7FAF00000001 ? 070C18646 ?
peicnt()+296         call     plsql_run()          7FAF7E8A3500 ? 000000001 ?
                                                   000000000 ? 7FFFB345E0F0 ?
                                                   7FAF00000001 ? 000000000 ?
kkxexe()+520         call     peicnt()             7FFFB345E0F0 ? 7FAF7E8A3500 ?
                                                   7FAF7E8C4028 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
opiexe()+14796       call     kkxexe()             7FAF7E8A5128 ? 7FAF7E8A3500 ?
                                                   000000000 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
kpoal8()+2283        call     opiexe()             000000049 ? 000000003 ?
                                                   7FFFB345F678 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
opiodr()+1149        call     kpoal8()             00000005E ? 00000001C ?
                                                   7FFFB3462750 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 5E00000001 ?
ttcpip()+1251        call     opiodr()             00000005E ? 00000001C ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   008C5E7F0 ? 5E00000001 ?
opitsk()+1628        call     ttcpip()             00A9B0890 ? 0086BA768 ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opiino()+953         call     opitsk()             00A9B0890 ? 000000000 ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opiodr()+1149        call     opiino()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opidrv()+565         call     opiodr()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
sou2o()+98           call     opidrv()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
opimai_real()+128    call     sou2o()              7FFFB3463E20 ? 00000003C ?
                                                   000000004 ? 7FFFB3463E48 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
ssthrdmain()+209     call     opimai_real()        000000002 ? 7FFFB3464010 ?
                                                   000000004 ? 7FFFB3463E48 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
main()+196           call     ssthrdmain()         000000002 ? 7FFFB3464010 ?
                                                   000000001 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
__libc_start_main()  call     main()               000000002 ? 7FFFB34641B8 ?
+253                                               000000001 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
_start()+36          call     __libc_start_main()  0009D3D74 ? 000000002 ?
                                                   7FFFB34641A8 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?

I know, it’s a bit much to deal with at first glance. But take it a piece at a time, reading in conjunction with Metalink note 175982.1, and it all begins to make sense. Actually, I’m guessing some of these, because the Metalink note doesn’t give all the answers, but it seems to make sense.

Let’s start at the bottom, which is the beginning of the execution of the C program named ‘oracle’. The first two lines are the GLIBC execution wrappers that eventually call Oracle’s main() statement which starts at line 110 of the listing. The next several lines above are the initialisation of Oracle and the (O)racle (P)rogramatic (I)nterface, which is the internal API used in Oracle. TTCPIP is the Two-Task-Communication interface with my SQL*Plus process, but the first interesting line is at line 74, kkxexe(), which is the (k)ernel (k)ompile e(x)ecute) (exe)cute statement call, which is the start of the actual SQL processing. In this case it is actually a simple PL/SQL call to DBMS_LOCK.SLEEP(100000), and the entry to the Oracle Wait interface can be seen at line 50 with the kslwait() call, eventually ending up with a semtimedop() call, which is the way the sleep call has been implemented. Incidentally, the preceding sskgpwwait() call at line 37 is Oracle’s platform-specific code which, in this case, is Red Hat Linux.  At line 11 we see the call to ksedst(), which probably stands for (k)ernel (s)ervices (e)rrorstack (d)ump (s)tack (t)race, which is the part of code actually producing the stack trace we are reading.

So what does all this actually mean in terms of seeing what our Oracle session is up to? Well, we can see what is being called and by whom; and once we start sampling over time, we can see where the time is going. We don't need to know all the function names and what they are responsible for, but we can come up with a selection of interesting ones that we would like to pay attention to. Thinking in terms of SQL execution we can see, for example, a nested loop rowsource function calling out to an index fetch rowsource function, which in turn calls out to a table fetch function.

Even just a few consecutive stack traces of a running query will make it very plain which rowsource Oracle is spending the most time within, simply because of the laws of probability. If you look once a second for a minute, and every stack you see has the 'index fetch' rowsource as the current function, then you probably spent most of the last minute doing that kind of processing. But we can do better than that… we can start to quantify where the time is spent, and supplement that information with other interesting aspects of operating system behaviour, aspects which are somewhat concealed from Oracle and yet can detrimentally affect its performance.
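To make that 'look once a second and count what you see' idea a bit more concrete, here is a toy self-profiler in C. I must stress that this is only a sketch of the principle, and emphatically not how DTrace or Systemtap work (they sample other processes from the outside, without modifying them); calling backtrace() from a signal handler is not strictly safe either, so treat it purely as an illustration:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/time.h>
#include <execinfo.h>

#define MAX_SAMPLES 100000

static void *samples[MAX_SAMPLES];
static volatile int nsamples;

/* Runs on every SIGPROF tick: record (roughly) where the program was running.
 * Frame 0 is this handler and frame 1 the signal trampoline, so frame 2 is
 * usually the interrupted function; the exact index can vary by platform. */
static void
on_tick(int sig) {
	void *frames[4];
	int depth = backtrace(frames, 4);

	(void)sig;	/* unused */
	if (depth > 2 && nsamples < MAX_SAMPLES)
		samples[nsamples++] = frames[2];
}

/* Something CPU-bound to sample, standing in for a busy Oracle rowsource. */
static double
busy_work(void) {
	double x = 0.0;
	long i;

	for (i = 1; i < 300000000L; i++)
		x += 1.0 / (double)i;
	return x;
}

int
main(void) {
	/* fire SIGPROF roughly 100 times per second of CPU consumed */
	struct itimerval timer = { { 0, 10000 }, { 0, 10000 } };
	char **names;
	int i;

	signal(SIGPROF, on_tick);
	setitimer(ITIMER_PROF, &timer, NULL);

	printf("result=%f\n", busy_work());

	/* resolve the sampled addresses; pipe through 'sort | uniq -c' for the profile */
	names = backtrace_symbols(samples, nsamples);
	for (i = 0; i < nsamples; i++)
		printf("%s\n", names[i]);
	free(names);

	return 0;
}

Compile with -rdynamic again and nearly every resolved sample should land in busy_work(), which is exactly the laws-of-probability argument above, just automated. Real tools do the same thing far more cleverly, against someone else's process, and with the plan-line and wait-event detail from the teaser list above layered on top.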

Join me back for part 3b for the conclusion of this series of articles!