The Oracle Wait Interface Is Useless (sometimes) – part 3b

Posted February 16, 2010 by James Morle
Categories: Generic, Oracle

Welcome back for the concluding part of this series of blogs.
In the last part I went through a brief primer about code execution and stack frames in preparation for this posting. I hope that wasn’t too boring for everyone – I think it’s really important to make sure all the preliminary knowledge is covered and not make too many assumptions about the reader. In this part I will finally get to the point, and talk about some alternative techniques for determining the reasons for poor performance for our example user session.
To recap very quickly, we have a user session taking 100% CPU in top, nothing is emitted in SQL trace files, Oracle’s Time Model shows it is executing SQL, and that’s about all the information we have. As Tanel stated in his posting for Part Two, this is the stage where it makes good sense to dig into session-level statistics (not system-level statistics) to see if any of the counters are incrementing in a conspicuous way. If you don’t already have a copy of Tanel’s Snapper code, I suggest you grab it and get familiar with it – it’s a nice implementation of an essential tool that many DBAs already have some kind of script for.

Tanel found in Part Two that the SQL execution was revisiting the same small number of buffers very frequently, indicating that there might be a problem with one of the Nested Loop join methods used by the optimiser. Perhaps there is something wrong with that part of the plan (or with the whole plan)? Using counters can often reveal a potential problem this way, but they don’t necessarily  put a neon sign reading “Here’s Where All The Time Went” on top of the offending item. This is where we might want to dig further using some of our knowledge of code execution and stack traces, in combination with a dynamic tracing infrastructure such as Dtrace or Systemtap.

The power of Dtrace and friends lies in a few key areas:

  1. It can report time-based information
  2. It can show user and kernel information
  3. It is dynamic – unlike gprof and other code profilers, it is not necessary to run everything with a profiled binary

These are really key features which, when combined, allow for tremendously powerful on-demand tracing.

Consider the following sample output from a very small Dtrace script to look into what happens when a single printf() library call is made from a C program:

CPU FUNCTION
  0  -> my_function                            USER CODE
    (57 lines of dynamic library loading omitted)
  0    -> printf                               USER CODE
  0      -> _setorientation                    USER CODE
  0      <- _setorientation                    USER CODE
  0      -> _ndoprnt                           USER CODE
  0        -> _findbuf                         USER CODE
  0          -> isatty                         USER CODE
  0            -> ___errno                     USER CODE
  0            <- ___errno                     USER CODE
  0            -> ioctl                        USER CODE
  0              -> syscall_mstate             KERN
  0                -> gethrtime_unscaled       KERN
  0                  -> tsc_gethrtimeunscaled_delta  KERN
  0                    -> tsc_gethrtimeunscaled  KERN
  0                    <- tsc_gethrtimeunscaled  KERN
  0                  <- tsc_gethrtimeunscaled_delta  KERN
    (180 more lines of kernel functions for ioctl call omitted)
 0            <= ioctl                         SYSCALL: ioctl 1743296 ns
    (34 lines of kernel function (probably DTrace intrumentation!) omitted)
 0        <- ioctl                             USER CODE
 0  <= fstat64                                 SYSCALL: fstat64 352699 ns
    (~300 lines of kernel  and dynamic library loading omitted)
 0    -> _write                                USER CODE
    (~200 lines of kernel  and dynamic library loading omitted)
 0  <= write                                   SYSCALL: write 2154427 ns
    (123 lines of kernel code omitted)
 0  <- printf                                  USER CODE

In the above output, there are around 1000 functions called in User and Kernel domains. The output gives you some idea of the level of detail possible with Dtrace; we can see all of the following attributes:

  • All User mode function calls are logged
  • All Kernel mode function calls are logged
  • All system calls are timed

Note: The printf() itself is very much a user-mode call,part of the standard C library (libc). It actually makes a couple of system calls under the bonnet: Initially an ioctl() and eventually the write(2) system call.

Let’s look at something else – time spent in the Virtual Memory subsystem of the kernel doing Page Ins (page faults). Again, this is a very simple script, and produces output similar to the following:

WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 90 ms
WAIT: PAGE IN: 836 ms
WAIT: PAGE IN: 1098 ms
WAIT: PAGE IN: 853 ms
WAIT: PAGE IN: 816 ms
FINAL WAIT: PAGE IN: 263 ms

So now we can quantify how much time is spent servicing page faults, in this case showing approximately how much per second. Here we have very specific information about how much time our particular session is spending servicing page faults. Don’t assume that soft page faults are cheap/fast, either – with increasing virtualisation trends, soft page faults as observed by the guest operating system could actually be hard page faults in the underlying hypervisor.

These are just a few of the things that can be monitored in Dtrace. Now let’s combine the Oracle stack tracing that we introduced in part 3a.  In that posting I concluded that we could dump the stack trace once in a while, observe the name of the current function, and determine where we are spending the most time. With Dtrace we can make this much more concrete, and can quantitatively report the time spent in each function, and combine other contributing factors from the Operating System! I have put together a Dtrace script to demonstrate this concept. I won’t post the script here (yet) because it is still very much in development and still has a number of deficiencies. What I’m trying to get to here is a really quick way to see exactly what an Oracle session is doing by only looking at the running process using Dtrace. The script currently has the following features:

  • Shows all OWI wait events and durations
  • Shows all system calls and durations
  • Shows timing stats every second for: OWI wait times, page faults, latch acquisition, and a selection of common rowsource fetch operators

The rowsource operators are taken from tracing the function calls to the various rowsource operators defined in Metalink note 175982.1. Specifically, the script currently traces calls to the following rowsource operators:

  • GS – Group By Sort
  • JO – Join (Nested Loop)
  • UA – Union All
  • VW – View
  • HJ – Hash Join
  • TB – Table
  • JOT – Unknown. Possible relation to Join, observed frequently during stack tracing
  • SO – Sort
  • IXT – Index
  • PF – Prefetch

Running the same misbehaving SQL as Tanel in his last posting, we get the following output (snippet only):

LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 464:JO 0:UA 0:VW 0:HJ 0:TB 21:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=1 rsrc=(0:GS 430:JO 0:UA 0:VW 0:HJ 0:TB 9:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 471:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 467:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
SYSCALL: times 7104 ns
SYSCALL: times 1749 ns
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 22:JOT 0:SO)

Remember that this SQL example precisely fits our profile where the session is using 100% usermode CPU and emitting no wait information, so we cannot see any OWI waits detailed here - just summaries of where the time was spent every second and a couple of system calls. We also have no page faults taking place on this test, so no time is accounted for there. The big user of CPU (468ms per second) is the nested loop join rowsource operator, which agrees with Tanel's findings using session-statistics. We are also spending a little time acquiring latches, but it's not very significant. There are clearly some rowsource functions missing, given that we can only account for half the CPU time, so I did a bit more function tracing to find the names of the missing functions, and added them to the script. The output now looks as follows (scroll right to see the new functions):

LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 461:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 163:IXT 759:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 691:JO 0:UA 0:VW 0:HJ 0:TB 455:JOT 0:SO 304:IXT 843:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 460:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 162:IXT 759:PF)
SYSCALL: times 7810 ns
SYSCALL: times 1750 ns
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 686:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 301:IXT 842:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 436:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 177:IXT 721:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 708:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 286:IXT 880:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 357:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 228:IXT 588:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 788:JO 0:UA 0:VW 0:HJ 0:TB 445:JOT 0:SO 235:IXT 1013:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 277:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 276:IXT 458:PF)

Looks like we've found the missing time, and then some (the timers add up to more than 1000ms each second). The reason for this timing exaggeration is that there is a good deal of double accounting going on with the current version of the script - child function times are also accounted for in the parent. This is a known bug, hopefully I'll get a chance to fix that one soon.

Well that's about it for now. I hope that this journey has been an interesting one, I know it's been a lot of reading to take in all the parts. We started with a session that was emitting no wait information, and now we have a simple Dtrace script that is easy to execute with no impact to other sessions, and easy to stop once the underlying cause is determined. The query does not need to be stopped and restarted, and no additional overhead is imposed from enabling additional Oracle diagnostics such as setting statistics_level=all. I intend to expand on this approach over time, with a particular focus on:

  • Multiplatform support - Systemtap version is required
  • Better design to support more function tracing
  • Fix double counting issue
  • Add identifiers for optimiser plan line number rather than rolling up by function name

Upcoming Presentation Slots, Spring 2010

Posted February 1, 2010 by James Morle
Categories: Uncategorized

I will be presenting the Oracle Wait Interface is Useless Dog and Pony show a couple of times this spring. It was first presented at the UKOUG 2009 with Tanel Poder, and will be seeing the light of day at the following events:

I hope to see some of you there, and will get the final part posted before RMOUG – so the picture will be complete. If you’ve got any questions, it’s a good opportunity to come and heckle at one of those events!

The Oracle Wait Interface Is Useless (sometimes) – part 3a

Posted January 18, 2010 by James Morle
Categories: Generic, Oracle

OK, here it is, the ‘first part of the last part’, though the topics discussed in these articles will be discussed more over time in my blog and in Tanel’s. I’ve split it into two subparts, because it was just getting insanely long as single posting.

Before I get going on this one, let’s just clear up a misunderstanding from the first part of this series. The first part uses a specific example which, for clarity reasons, will continue to be used for the remainder of the series. The example shown is some kind of query-related issue in this case, but the approach shown here is a more general one that will locate the root cause in many, many other cases than just the case of a bad query. This SQL example is still a good one, though, because there are no waits, lots of CPU, and no system calls. But the underlying problem might not be caused by poor SQL, and diving straight into SQL tuning could waste you 2 days fixing SQL when that is not the root cause. Just because we happen to be running SQL, it does not mean that it is a SQL problem. What we are trying to achieve here is a more robust approach to root cause diagnosis of Oracle performance problems, one where we don’t guess, we don’t assume, we don’t stick with things we’ve seen before: We quantitatively find the real problem. That might be the smart way, it might be the lazy way – those might be the same thing! But most crucially, it’s the fast way to fixing the real problem.

So, back to the main topic. In the first part of this blog, I walked through the example problem and showed that we have a problem that has the following attributes:

  • The user is reporting a server response time problem
  • Zero Oracle waits
  • 100% CPU
  • Zero system calls
  • DB Time is showing all the CPU and elapsed time is spent executing SQL (in this example…)

Note to Kevin: Though I stated the query ‘never’ returns in part one, I was, of cause, using my usual exaggerative vernacular. I meant, of course, that it was taking a very long time to return, but still returning correct results. Sorry about that – it will ‘never’ happen again.

At this point, we might jump into a prolonged SQL tuning exercise. But let’s not for now, because that is still guesswork in my opinion. Let’s make the following reasonable assumption: The whole system is running slowly, using huge quantities of CPU, and this is a mission-critical system where you cannot simply start killing queries and re-executing them in your SQL*Plus session. That’s a reasonable assumption – I have worked on many, many, production systems where this is the case. So let’s carry on diagnosing without actually interfering with the production system in any way.

This is where we might deviate from time-based diagnostics. Not because we don’t think that time is the single most valuable metric on which to determine what ‘fast’ and ’slow’ actually mean in quantitative terms, but because there is a bit of a shortcut available to us here that might just nail the cause of the problem. That shortcut is session-based statistics, and they have nothing to do with time, being simple counters.

This is where it makes no sense at all to re-invent the wheel, and was why Tanel is on the hook for this part: Tanel has already done some significant work on session-based statitics analysis and has written a great tool, snapper, for just this purpose. If you have not already read part two of this series, head over to Tanel’s blog now to catch up before proceeding!

OK, so we’ve abandoned the wait interface, and let’s assume that the sessions statistics did not yield anything particularly obvious. Or, maybe you just want to dig a little deeper, or possibly take yet another short cut to the truth? Now it’s time to look beyond Oracle and into the Operating System. The Operating System has a completely different viewpoint of the operation of Oracle, notably through the use of execution stack profiling. Don’t run away just yet, it’s not as difficult as you might imagine.

Various methods exist to probe the execution stack across the various Operating Systems. All of these should be used with caution, as they have varying degrees of intrusiveness on the operation of Oracle. The more intrusive methods, such as gdb, and the (possibly) less tested methods (oradebug short_stack) have very real potential to crash the process that you are attaching to, or worse. Don’t say I didn’t warn you! However, other methods for profiling the stack are less intrusive, and are already very robust: I’m talking here about Dtrace and variants.

I have held off from getting too excited by Dtrace in the past, even though the technology always looked amazing. The reason for holding back was that it was a single-OS tool, and without anything comparable on the Linux platform I didn’t see it being useful to >75% of my customers. That’s changing now, though, with the production release of Systemtap (partial, at least) in RHEL 5.4 and there is even a similar tool for AIX named Probevue. So now I’m excited about this technology!

Before digging into Dtrace and friends, I think it’s important that we take a little diversionary refresh into what a stack actually is, and why it is important. For those with a good coding background, look away now, and rejoin the group for part 3b. Just to keep your interest level, though, would it be interesting to get a second-by-second view of where the time goes for a process, down to the following levels of detail?

  • Which line of the plan did I spend most time in?
  • How much time did I spend waiting for page faults?
  • Which wait events did I spend most time in?

That’s going to be part 3b… so please check back periodically!

Back to the “Stack Primer for DBAs”: Consider the following elementary piece of C code:

#include <stdio.h>

int i;

void my_function();

void
main() {

	scanf("%c",&i);
	my_function();

}

Don’t worry about the details of the language if you are unfamiliar with C. The only things to observe here are that we have one “function”, named ‘main()’, which is making a call to another function ‘my_function()’. The main() function is a special one – all standalone C programs have a main() function, which is the initial point of execution. A C program is just a set of compiled machine code instructions at execute time, just like any other computer program. These instructions occupy a chunk of memory, where each memory location is either a machine code instruction or a piece of data. When the code gets down to the call to my_function() in main(), the execution must jump (or branch) to the address that my_function()’s machine code instructions reside at. Before this can happen, an amount of context must be stored, namely:

  • The ‘return’ address, so that execution can resume in the main() function after my_function() is complete
  • Any ‘data’ passed to my_function() (the function arguments)

In C, this information is pushed onto the stack and then the execution branches to the entry point of my_function(). As a sidenote, the sharing of data and execution context in this way is a common way to exploit security holes by overwriting execution context with oversized arguments from the data portion. We’re not interested in all that stuff for this – we are only interested in observing the stack contents. If we took a stack dump while the program were in the my_function() function, it would look like this:

my_function()
main()

Pretty simple, huh? If there were arguments for main() and my_function(), they would also be shown in some way. Most stack trace facilities can only show primitive datatypes (such as integers) and all other more complex arguments (such as structs) are shown as pointers (addresses) to the actual data without decoding the actual content, as follows:

my_function(42, 0x12345678)
main(0x45678900)

Some tools will print a stack trace (known as a backtrace, because it is unwound backwards from the current position in the stack) with some known pointers decoded into human readable form, such as simple struct datatypes. Anyway, let’s have a look at something a bit more complicated – an Oracle backtrace taken from an errorstack trace file:

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   7FFFB345A778 ? 000000000 ?
ksedst1()+98         call     skdstdst()           000000000 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksedst()+34          call     ksedst1()            000000001 ? 000000001 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
dbkedDefDump()+2736  call     ksedst()             000000001 ? 000000001 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksedmp()+36          call     dbkedDefDump()       000000001 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksdxfdmp()+1837      call     ksedmp()             000000001 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksdxcb()+1782        call     ksdxfdmp()           7FFFB345BAE0 ? 000000011 ?
                                                   000000003 ? 7FFFB345BA40 ?
                                                   7FFFB345B9A0 ? 000000000 ?
sspuser()+112        call     ksdxcb()             000000001 ? 000000011 ?
                                                   000000001 ? 000000001 ?
                                                   7FFFB345B9A0 ? 000000000 ?
__restore_rt()       call     sspuser()            000000001 ? 000000011 ?
                                                   000000001 ? 000000001 ?
                                                   7FFFB345B9A0 ? 000000000 ?
semtimedop()+10      signal   __restore_rt()       000018000 ? 7FFFB345C730 ?
                                                   000000001 ?
                                                   FFFFFFFFFFFFFFFF ?
                                                   FFFFFFFFFFD23940 ?
                                                   000000000 ?
sskgpwwait()+259     call     semtimedop()         000018000 ? 7FFFB345C730 ?
                                                   000000001 ? 7FFFB345C6D8 ?
                                                   FFFFFFFFFFD23940 ?
                                                   000000000 ?
skgpwwait()+151      call     sskgpwwait()         7FFFB345CB94 ? 00A9A15C0 ?
                                                   07848C4D8 ? 0002DC6C0 ?
                                                   7FFF00000000 ? 000000000 ?
ksliwat()+1816       call     skgpwwait()          7FFFB345CB94 ? 00A9A15C0 ?
                                                   000000000 ? 0002DC6C0 ?
                                                   000000000 ? 000000000 ?
kslwaitctx()+157     call     ksliwat()            078666348 ? 078666348 ?
                                                   005F5DFA7 ? 000000000 ?
                                                   100000000 ? 000000000 ?
kslwait()+136        call     kslwaitctx()         7FFFB345CE30 ? 000000000 ?
                                                   005F5DFA7 ? 000000000 ?
                                                   100000000 ? 000000000 ?
psdwat()+107         call     kslwait()            005F5DFA7 ? 000000167 ?
                                                   000000000 ? 005F5DFA7 ?
                                                   000000000 ? 000000000 ?
pevm_icd_call_commo  call     psdwat()             005F5DFA7 ? 000000167 ?
n()+421                                            000000000 ? 005F5DFA7 ?
                                                   000000000 ? 000000000 ?
pfrinstr_ICAL()+164  call     pevm_icd_call_commo  7FFFB345E0F0 ? 000000000 ?
                              n()                  000000001 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
pfrrun_no_tool()+63  call     pfrinstr_ICAL()      7FAF7E8A3500 ? 06A03B9AA ?
                                                   7FAF7E8A3570 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
pfrrun()+1025        call     pfrrun_no_tool()     7FAF7E8A3500 ? 06A03B9AA ?
                                                   7FAF7E8A3570 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
plsql_run()+769      call     pfrrun()             7FAF7E8A3500 ? 000000000 ?
                                                   7FAF7E8A3570 ? 7FFFB345E0F0 ?
                                                   7FAF00000001 ? 070C18646 ?
peicnt()+296         call     plsql_run()          7FAF7E8A3500 ? 000000001 ?
                                                   000000000 ? 7FFFB345E0F0 ?
                                                   7FAF00000001 ? 000000000 ?
kkxexe()+520         call     peicnt()             7FFFB345E0F0 ? 7FAF7E8A3500 ?
                                                   7FAF7E8C4028 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
opiexe()+14796       call     kkxexe()             7FAF7E8A5128 ? 7FAF7E8A3500 ?
                                                   000000000 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
kpoal8()+2283        call     opiexe()             000000049 ? 000000003 ?
                                                   7FFFB345F678 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
opiodr()+1149        call     kpoal8()             00000005E ? 00000001C ?
                                                   7FFFB3462750 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 5E00000001 ?
ttcpip()+1251        call     opiodr()             00000005E ? 00000001C ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   008C5E7F0 ? 5E00000001 ?
opitsk()+1628        call     ttcpip()             00A9B0890 ? 0086BA768 ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opiino()+953         call     opitsk()             00A9B0890 ? 000000000 ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opiodr()+1149        call     opiino()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opidrv()+565         call     opiodr()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
sou2o()+98           call     opidrv()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
opimai_real()+128    call     sou2o()              7FFFB3463E20 ? 00000003C ?
                                                   000000004 ? 7FFFB3463E48 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
ssthrdmain()+209     call     opimai_real()        000000002 ? 7FFFB3464010 ?
                                                   000000004 ? 7FFFB3463E48 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
main()+196           call     ssthrdmain()         000000002 ? 7FFFB3464010 ?
                                                   000000001 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
__libc_start_main()  call     main()               000000002 ? 7FFFB34641B8 ?
+253                                               000000001 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
_start()+36          call     __libc_start_main()  0009D3D74 ? 000000002 ?
                                                   7FFFB34641A8 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?

I know, it’s a bit much to deal with at first glance. But take it a piece at a time, reading in conjunction with Metalink note 175982.1, and it all begins to make sense. Actually, I’m guessing some of these, because the Metalink note doesn’t give all the answers, but it seems to make sense.

Let’s start at the bottom, which is the beginning of the execution of the C program named ‘oracle’. The first two lines are the GLIBC execution wrappers that eventually call Oracle’s main() statement which starts at line 110 of the listing. The next several lines above are the initialisation of Oracle and the (O)racle (P)rogramatic (I)nterface, which is the internal API used in Oracle. TTCPIP is the Two-Task-Communication interface with my SQL*Plus process, but the first interesting line is at line 74, kkxexe(), which is the (k)ernel (k)ompile e(x)ecute) (exe)cute statement call, which is the start of the actual SQL processing. In this case it is actually a simple PL/SQL call to DBMS_LOCK.SLEEP(100000), and the entry to the Oracle Wait interface can be seen at line 50 with the kslwait() call, eventually ending up with a semtimedop() call, which is the way the sleep call has been implemented. Incidentally, the preceding sskgpwwait() call at line 37 is Oracle’s platform-specific code which, in this case, is Red Hat Linux.  At line 11 we see the call to ksedst(), which probably stands for (k)ernel (s)ervices (e)rrorstack (d)ump (s)tack (t)race, which is the part of code actually producing the stack trace we are reading.

So what does all this actually mean in terms of seeing what our Oracle session is up to? Well, we can see what is called, by whom, and how long each call took. We don’t need to know all the function names and what they are responsible for, but we can come up with a selection of interesting ones that we would like to pay attention to. Thinking in terms of SQL execution we can see, for example, a nested loop rowsource function calling out to an index fetch rowsource function, which in turn calls out to a table fetch function.

Even just a few consecutive stack traces of a running query will make it very plain which rowsource Oracle is spending the most time within, simply because of the laws of probability. If you look once a second for a minute, and every stack you see has the ‘index fetch’ rowsource as the current function, then you probably spent most of the last minute doing that kind of processing. But we can do better than that… we can start to look at quantifying where the time is spent, and also to supplement that information with other interesting aspects of the operating system operation, aspects which are somewhat  concealed from Oracle and yet can detrimentally affect its performance.

Join me back for part 3b for the conclusion of this series of articles!

I’m now an Oracle ACE Director

Posted January 15, 2010 by James Morle
Categories: Oracle

Just a quick post in celebration of my recent appointment as an Oracle ACE Director! I’m very happy to be invited to this group, and look forward to getting involved in some of the discussions.

The Oracle Wait Interface Is Useless – News About Part Two

Posted January 14, 2010 by James Morle
Categories: Oracle

As many have probably noticed, the blogging experiment with Tanel hasn’t been quite as productive as it might optimally have been. Tanel doesn’t currently have the time to write part two of this article, so I’ll I’m going to pick up the baton and write the next part. Watch this space!

The Oracle Wait Interface Is Useless (sometimes) – Part One: The Problem Definition

Posted November 9, 2009 by James Morle
Categories: Oracle

So here we go, this is part one of this experiment in blogging and co-writing. Tanel has actually written some good stuff already for this, but I wanted to try and formalise things under a common title and make it easier to follow between our sites.

I thought it would be logical to start this process by producing a more concrete problem definition, so that’s the focus of this part. It’s unlikely that we will come up with a complete method in this initial work, but hopefully the wheels will at least turn a little by the end of it!

So first of all, why would I dare to say that the Oracle Wait Interface is useless? Well, partly because I quite like titles that are a little bit catchy, and partly because it is indeed sometimes useless. The emphasis is on the word sometimes, though, because the Oracle Wait Interface is still the single most useful feature in any database product. Wow – that’s quite a claim, isn’t it? This isn’t the place to fully explain why that is, and many others have written great works on this subject already. Check out Cary Millsap’s works, notably his book, Optimizing Oracle Performance, which focuses in great detail on this subject. For the sake of this article, however, here’s why it is so useful: It tells you where the time goes. Think about it: If something is running too slowly, knowing where the time is used up is the single piece of information required to focus on the right subject for tuning.

So what’s wrong with the Oracle wait interface? Just one thing, actually – it is designed to  provide visibility of relatively slow waits. The reason for this is simply that there is a slight overhead in timing every single wait. If that overhead becomes a noticeable proportion of the actual wait itself then the measurement becomes inaccurate (and makes the problem worse). On UNIX-like platforms (yes, that includes Linux), the actual timing interface is implemented using gettimeofday(2) system calls, one before the event and one after the event. This call gives microsecond granularity of timing, at least in theory (on my Opteron 280 test machine, gettimeofday() calls take 1.5 microseconds). So, using this kind of mechanism for events that take a relatively long time makes perfect sense – disk I/O, for example, that will take at least three orders of magnitude longer to complete than the timing calls themselves. Conversely, they make no sense at all for calls that take even as little as 50 microseconds, as the 3 microsecond penalty for measuring the wait is 6% of the actual event time itself at that point. There you go, that’s the beginning of the justification that the wait interface is useless,  in a nutshell.

But hang on, isn’t 50 microseconds really, really fast? Well no, actually, it isn’t. Taking Intel’s Nehalem processor (with Quickpath) as an example, a memory latency is around the 50 ns range – three orders of magnitude faster than the 50 microsecond cut-off that I just arbitrarily invented. Memory access is also the slowest thing that a CPU can do (without factoring in peripheral cards) – in this case the CPU has to wait for about 150 cycles while that memory access takes place. So it’s very possible to have a function call that does fairly complex work and is still an order of magnitude or two faster than the gettimeofday() system call.

Time for an example. Actually, this is a variation on the example that made me start thinking about this – I had been perfectly happy with the Oracle Wait Interface until this point for 99% of cases!
Read the rest of this post »

And Now For Something Completely Different…

Posted October 12, 2009 by James Morle
Categories: Oracle

So just a short post today, more of an announcement…

At 11.05am on Wednesday 2nd December, I will co-present a talk at the UKOUG conference with Tanel Poder. OK, so nothing particularly earth-shattering there. However, we’ve decided to do something a little bit different with the creative process. The talk is entitled “The Oracle Wait Interface Is Useless (sometimes)”, and is a subject that both Tanel and I have been working on separately for a little while. The premise is this: The wait interface is great for ’slow’ waits, but what about waits that are not instrumented by the wait interface? What about waits that are not waits from Oracle’s perspective, such as reading a page of memory? What about pure inefficiency? There is, of course, the concept of DB Time, but it is not currently granular enough.

So, you probably can see the idea behind the presentation: The goal is to present some alternative diagnostic techniques to determine the cause of poor performance. We happen to both have similar ideas on this, and they don’t just stop at Oracle.

So here’s the new concept, at least for us: We are going to write the content for the presentation as a tag team effort between our two blogs. Airing our dirty laundry in public, so to speak. We think this will give a fairly unique opportunity for public comment before the presentation is actually given!

So, over to Tanel for part one…

Latency Hiding For Fun and Profit

Posted September 28, 2009 by James Morle
Categories: Generic, Oracle

Tags: , , ,

Yep, another post with the word ‘latency’ written all over it.

I’ve talked a lot about latency, and how it is more often than not completely immutable. So, if the latency cannot be improved upon because of some pesky law of physics, what can be done to reduce that wasted time? Just three things, actually:

  1. Don’t do it.
  2. Do it less often.
  3. Be productive with the otherwise wasted time.

The first option is constantly overlooked – do you really need to be doing this task that makes you wait around? The second option is the classic ‘do things in bigger lumps between the latency’ – making less roundtrips being the classic example. This post is about the third option, which is technically referred to as latency hiding.

Everybody knows what latency hiding is, but most don’t realise it. Here’s a classic example:

I need some salad to go with the chicken I am about to roast. Do I:

(a) go to the supermarket immediately and buy the salad, then worry about cooking the chicken?

OR

(b) get the chicken in the oven right away, then go to the supermarket?

Unless the time required to buy the salad is much longer than the chicken’s cook-time, the answer is always going to be (b), right? That’s latency hiding, also known as Asynchronous Processing. Let’s look at the numbers:

Variable definitions:

Supermarket Trip=1800s

Chicken Cook-Time=4800s

Calculations:

Option (a)=1800s+4800s=6600s (oh man, nearly two hours until dinner!)

Option (b)=4800s (with 1800s supermarket time hidden within it)

Here’s another example: You have a big code compile to do, and an empty stomach to fill. In which order do you execute those tasks? Hit ‘make’, then grab a sandwich, right?

As a side note, this is one of my classic character flaws – I just live for having tasks running in parallel this way. Not a flaw, I hear you say? Anyone that has tried to get software parallelism  (such as Oracle Parallel Execution) knows the problem – some tasks finish quicker than expected, and then there’s a bunch of idle threads.  In the real world, this means that my lunch is often very delayed, much to the chagrin of my lunch buddies.

OK, so how does this kind of thing work with software? Let’s look at a couple of examples:

  1. Read Ahead
  2. Async Writes

Read ahead  from physical disk is the most common example of (1), but it equally applies to result prefetching in, say, AJAX applications. Whatever the specific type, it capitalises on parallel processing from two resources. Let’s look at the disk example for clarification.

Disk read ahead is where additional, unrequested, reads are carried out after an initial batch of real requested reads. So, if a batch job makes a read request for blocks 1,2,3 and 4 of a file, “the disk” returns those blocks back and then immediately goes on to read blocks 5,6,7,8, keeping them in cache. If blocks 5,6,7,8 are then subsequently requested by the batch job after the first blocks are processed, they can immediately be returned from cache, thus hiding the latency from the batch job. This has the impact of hiding the latency from the batch job and increases throughput as a direct result.

Async writes are essentially the exact opposite of read-ahead. Let’s take the well-known Oracle example of async writes, that of the DBWR process flushing out dirty buffers to disk. The synchronous way to do this is to generate a list of dirty buffers and then issue a series of synchronous writes (one after the other) until they are all complete. Then start again by looking for more dirty buffers. The async I/O way to do the same operation is to generate the list, issue an async write request (which returns instantly), and immediately start looking for more dirty buffers. This way, the DBWR process can spend more useful time looking for buffers – the latency is hidden, assuming the storage can keep up.

By the way, the other horror of the synchronous write is that there is no way that the I/O queues can be pushed hard enough for efficient I/O when sending out a single buffer at a time. Async writes remedy that problem.

I’ve left a lot of the technical detail out of that last example, such as the reaping of return results from the async I/O process, but didn’t want to cloud the issue. Oops, I guess I just clouded the issue, just ignore that last sentence…

Spotting the Red Flags (Part 1 of n)

Posted September 24, 2009 by James Morle
Categories: Oracle

Tags:

As a consultant I get to see many different systems, managed by different people. A large proportion of these systems are broken in exactly the same ways to others, which makes spotting the problems pretty straightforward! It occurred to me at some point that this is a bit like a crime scene – somebody had murdered the system, and it was my job to find out ‘whodunnit’, or at least ‘whatdunnit’. The ‘what’ is quite frequently one (or few) actions performed by the administrator, normally with good intention, that result in system performance or availability carnage. I call these actions ‘Red Flags’, and spotting them can save a lot of time.

A couple of years ago at the Hotsos conference, I did a small impromptu survey. It’s not all that often that you have 500 Oracle geeks in a room that you can solicit opinion from, so it seemed like a good chance to cast the net for Red Flags. I seeded the process with a few of my personal favourites and then started writing as the suggestions came in. Here are the ’seed’ entries:

  • Any non-default setting for optimizer_index_cost_adj
  • Carefully configured KEEP and RECYCLE pools
  • Multiple block sizes
  • Some kind of spin_count configuration

Remember, these are not necessarily the wrong thing to do to your system, but they probably are. That’s why they attract my attention in the first instance.

I got a whole load of responses. Of those responses, some had missed the point and just identified something that was broken. This is more subtle than that – these are things that are forensic evidence of system homicide. Some of the decent ones I got back follow:

  • /*+ RULE */ hint in abundance (I know, lot’s of apps do this, and it sometimes makes sense)
  • Numeric data in VARCHAR2 columns
  • Indexes and tables on ’separate disks’
  • All columns in a table NULLable

Of course, there were more cynical responses too, such as  (he he): cluster_database=true

I was going to write some kind of presentation about it, but I think this might work nicely as a blog entry, potentially multiple entries! If anyone has some good suggestions, please let me know and I’ll start adding them into part 2, along with any more I think of… :)

The amazing truth about the Red Flags, though, is that they are incredibly hard to dislodge. Once optimizer_index_cost_adj is set to anything but default, it is nigh on impossible to convince anyone that the system is probably now tuned on its head. Same with a multiple buffer pool configuration, and same with multiple block sizes.

Forget I/O Bound, You’re Latency Bound, Bub

Posted September 21, 2009 by James Morle
Categories: Storage

Tags: , , ,

Since it’s been nearly ten years since I wrote my book, Scaling Oracle8i, I thought it was about time that I started writing again. I thought I would start with the new-fangled blogging thing, and see where it takes me. Here goes.

As some will know, I run a small consulting company called Scale Abilities, based out of the UK. We get involved in all sorts of fun projects and problems (or are they the same thing?), but one area that I seem to find myself focusing on a lot is storage. Specifically, the performance and architecture of storage in Oracle database environments. In fact I’m doing this so much that, whenever I am writing presentations for conferences these days, it always seems to be the dominant subject at the front of my mind.

One particular common thread has been the effect of latency. This isn’t just a storage issue, of course, as I endeavoured to point out in my Hotsos Symposium 2008 presentation “Latency and Skew”. Latency, as the subtitle of that particular talk said, is a silent killer. Silent, in that it often goes undetected, and the effects of it can kill performance (and still remain undetected). I’m not going to go into all the analogies about latency here, but let’s try and put a simple definition out for it:

Latency is the time taken between a request and a response.

If that’s such a simple definition, why is it so difficult to spot? Surely if a log period of time passes between a request and a response, the latency will be simple to nail? No.

The problem is that it is the small latencies that cause the problems. Specifically, it is the “small, but not so small that they are not important” ones that are so difficult to spot and yet cause so many problems. Perhaps an example is now in order:

A couple of years ago, a customer of mine was experiencing a performance problem on their newly virtualised database server (VMware 3.5). The problem statement went a little bit like this: Oracle on VMware is broken – it runs much slower on VMware than on physical servers. The customer was preparing a new physical server in order to remove VMware from the equation. Upon further investigation, I determined the following:

  1. The VMware host (physical server running VMware) was a completely different architecture to the previous dedicated server. The old server was one of the Intel Prescott core type (3.2GHz, Global Warming included at no extra cost), and the new one was one of the Core 2 type with VT instructions.
  2. Most measurable jobs were actually faster on the virtualised platform
  3. Only one job was slower

The single job that was slower was so much slower that it overshadowed all the other timings that had improved. Of the four critical batch jobs, the timings looked like this:

Physical server:

  • Job 1: 26s
  • Job 2: 201s
  • Job 3: 457s
  • Job 4: 934s
  • Total: 1618s

Virtualised Server:

  • Job 1: 15s
  • Job 2: 111s
  • Job 3: 208s
  • Job 4: 2820s
  • Total: 3154s

It can be seen that, if one takes the total as the yardstick, the virtualised server is almost twice as slow: Therein lies the danger of using averages and leaping to conclusions. If Job 4 is excluded, the totals are 684s vs 334s, making the virtualised server more than twice as quick as the physical one.

Upon tracing Job 4 on the VMware platform with Oracle extended SQL tracing (10046 level 8), I discovered that it was making a lot of roundtrips to the database. Hang on, let me give that the right emphasis: A LOT of roundtrips. However, each roundtrip was really  fast – about 0.2ms if memory serves. So where’s the problem with that? It’s not exactly high latency, is it? Well it is if you have to do several million of them.

As it turns out, there was something in the VMware stack (perhaps just additional codepath to get through the vSwitch) that was adding around 0.1ms latency to each roundtrip. When this tenth of a millisecond is multiplied by several million (something like 20 million in this case), it becomes a long time. About 2000s to be precise, which more than made up for the extra time. The real answer – do less roundtrips, the new server will be at least twice as fast as the old server.

So what does this have to do with I/O? Plenty. The simple fact is that roundtrips are a classic source of unwanted latency. Latency is the performance killer.

Let’s look at some I/O examples from a real customer system. Note: this is not a particularly well tuned customer system:

  • Latency of 8KB sequential reads: 0.32ms
  • Latency of 4MB sequential reads: 6ms

Obviously, the 4MB reads are taking a lot longer (18x), but that makes sense, right? Apart from one thing: The 4MB reads are serving 512x more data in each payload. The net result of these numbers is as follows:

  • Time to sequentially read 2.1GB in 8KB pieces: 76s
  • Time to sequentially read 2.1GB in 4MB pieces: 24s

So what happened to the 52s difference in these examples? Was this some kind of tuning problem? Was this a fault on the storage array? No. The 52s of time was lost in latency, nowhere else.

Here’s another definition of latency: Latency is wasted time.

So, look out for that latency. Think about it when selecting nested-loop table joins instead of full-table scans. Think about it when doing single-row processing in your Java app. Think about it before you blame your I/O system!