The Oracle Wait Interface Is Useless (sometimes) – part 3b

Posted February 16, 2010 by James Morle
Categories: Generic, Oracle

Welcome back for the concluding part of this series of blogs.
In the last part I went through a brief primer about code execution and stack frames in preparation for this posting. I hope that wasn’t too boring for everyone – I think it’s really important to make sure all the preliminary knowledge is covered and not make too many assumptions about the reader. In this part I will finally get to the point, and talk about some alternative techniques for determining the reasons for poor performance for our example user session.
To recap very quickly, we have a user session taking 100% CPU in top, nothing is emitted in SQL trace files, Oracle’s Time Model shows it is executing SQL, and that’s about all the information we have. As Tanel stated in his posting for Part Two, this is the stage where it makes good sense to dig into session-level statistics (not system-level statistics) to see if any of the counters are incrementing in a conspicuous way. If you don’t already have a copy of Tanel’s Snapper code, I suggest you grab it and get familiar with it – it’s a nice implementation of an essential tool that many DBAs already have some kind of script for.

Tanel found in Part Two that the SQL execution was revisiting the same small number of buffers very frequently, indicating that there might be a problem with one of the Nested Loop join methods used by the optimiser. Perhaps there is something wrong with that part of the plan (or with the whole plan)? Using counters can often reveal a potential problem this way, but they don’t necessarily put a neon sign reading “Here’s Where All The Time Went” on top of the offending item. This is where we might want to dig further using some of our knowledge of code execution and stack traces, in combination with a dynamic tracing infrastructure such as Dtrace or Systemtap.

The power of Dtrace and friends lies in a few key areas:

  1. It can report time-based information
  2. It can show user and kernel information
  3. It is dynamic – unlike gprof and other code profilers, it does not require everything to be run with a specially profiled binary

These are really key features which, when combined, allow for tremendously powerful on-demand tracing.

Consider the following sample output from a very small Dtrace script to look into what happens when a single printf() library call is made from a C program:

CPU FUNCTION
  0  -> my_function                            USER CODE
    (57 lines of dynamic library loading omitted)
  0    -> printf                               USER CODE
  0      -> _setorientation                    USER CODE
  0      <- _setorientation                    USER CODE
  0      -> _ndoprnt                           USER CODE
  0        -> _findbuf                         USER CODE
  0          -> isatty                         USER CODE
  0            -> ___errno                     USER CODE
  0            <- ___errno                     USER CODE
  0            -> ioctl                        USER CODE
  0              -> syscall_mstate             KERN
  0                -> gethrtime_unscaled       KERN
  0                  -> tsc_gethrtimeunscaled_delta  KERN
  0                    -> tsc_gethrtimeunscaled  KERN
  0                    <- tsc_gethrtimeunscaled  KERN
  0                  <- tsc_gethrtimeunscaled_delta  KERN
    (180 more lines of kernel functions for ioctl call omitted)
 0            <= ioctl                         SYSCALL: ioctl 1743296 ns
    (34 lines of kernel functions (probably DTrace instrumentation!) omitted)
 0        <- ioctl                             USER CODE
 0  <= fstat64                                 SYSCALL: fstat64 352699 ns
    (~300 lines of kernel  and dynamic library loading omitted)
 0    -> _write                                USER CODE
    (~200 lines of kernel  and dynamic library loading omitted)
 0  <= write                                   SYSCALL: write 2154427 ns
    (123 lines of kernel code omitted)
 0  <- printf                                  USER CODE

In the above output, there are around 1000 functions called in User and Kernel domains. The output gives you some idea of the level of detail possible with Dtrace; we can see all of the following attributes:

  • All User mode function calls are logged
  • All Kernel mode function calls are logged
  • All system calls are timed

Note: The printf() itself is very much a user-mode call, part of the standard C library (libc). It actually makes a couple of system calls under the bonnet: initially an ioctl() and eventually the write(2) system call.
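
For the curious, the kind of script that produces output like this is quite small. What follows is a rough sketch only, not the exact script used above – it assumes Solaris DTrace, the script and program names are purely illustrative, and tracing every function like this is far too expensive for anything other than a small test program:

#!/usr/sbin/dtrace -s
#pragma D option flowindent

/*
 * Rough sketch only: follow the user and kernel function flow of one
 * command and time each system call.
 * Run as: dtrace -s flow.d -c ./myprog  (script and program names are
 * illustrative).
 */

pid$target:::entry,
pid$target:::return
{
        printf("USER CODE");
}

fbt:::entry,
fbt:::return
/pid == $target/
{
        printf("KERN");
}

syscall:::entry
/pid == $target/
{
        self->sc = timestamp;
}

syscall:::return
/self->sc/
{
        printf("SYSCALL: %s %d ns", probefunc, timestamp - self->sc);
        self->sc = 0;
}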

Let’s look at something else – time spent in the Virtual Memory subsystem of the kernel doing Page Ins (page faults). Again, this is a very simple script, and produces output similar to the following:

WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 90 ms
WAIT: PAGE IN: 836 ms
WAIT: PAGE IN: 1098 ms
WAIT: PAGE IN: 853 ms
WAIT: PAGE IN: 816 ms
FINAL WAIT: PAGE IN: 263 ms

So now we can quantify how much time is spent servicing page faults, in this case showing approximately how much per second. Here we have very specific information about how much time our particular session is spending servicing page faults. Don’t assume that soft page faults are cheap/fast, either – with increasing virtualisation trends, soft page faults as observed by the guest operating system could actually be hard page faults in the underlying hypervisor.
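
I won’t claim this is the exact script either, but a minimal sketch of the idea might look like the following. It assumes Solaris, and it uses the as_fault() kernel routine as the probe point – an assumption on my part, and something that will vary between platforms and releases:

#!/usr/sbin/dtrace -s

/*
 * Rough sketch: accumulate the time spent servicing page faults for one
 * process and report it once per second.
 * Attach with: dtrace -s pgin.d -p <pid>  (script name is illustrative).
 * The as_fault probe point is an assumption and may differ by platform.
 */

fbt::as_fault:entry
/pid == $target/
{
        self->ts = timestamp;
}

fbt::as_fault:return
/self->ts/
{
        @pgtm = sum(timestamp - self->ts);
        self->ts = 0;
}

profile:::tick-1sec
{
        normalize(@pgtm, 1000000);      /* nanoseconds -> milliseconds */
        printa("WAIT: PAGE IN: %@d ms\n", @pgtm);
        clear(@pgtm);
}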

These are just a few of the things that can be monitored in Dtrace. Now let’s combine this with the Oracle stack tracing that we introduced in part 3a. In that posting I concluded that we could dump the stack trace once in a while, observe the name of the current function, and determine where we are spending the most time. With Dtrace we can make this much more concrete: we can quantitatively report the time spent in each function, and combine it with other contributing factors from the Operating System! I have put together a Dtrace script to demonstrate this concept. I won’t post the script here (yet) because it is still very much in development and still has a number of deficiencies. What I’m trying to get to here is a really quick way to see exactly what an Oracle session is doing, purely by looking at the running process with Dtrace. The script currently has the following features (a sketch of the wait-event probing appears after the list):

  • Shows all OWI wait events and durations
  • Shows all system calls and durations
  • Shows timing stats every second for: OWI wait times, page faults, latch acquisition, and a selection of common rowsource fetch operators
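
To give a flavour of the wait-event part of the script without publishing the whole thing: on 11g the wait interface is commonly reported to bracket each wait with the kslwtbctx()/kslwtectx() functions, so something along the lines of the following sketch can time each wait from outside the database. The function names are an assumption on my part and are release-dependent, so treat this as an illustration rather than the real script:

#!/usr/sbin/dtrace -s

/*
 * Rough sketch: time Oracle wait events from outside the database by
 * probing the functions commonly reported to begin and end a wait on 11g
 * (kslwtbctx/kslwtectx). These names are an assumption and vary by release.
 * Attach with: dtrace -s owi.d -p <oracle_pid>  (script name illustrative).
 */

pid$target::kslwtbctx:entry
{
        self->wt = timestamp;
}

pid$target::kslwtectx:entry
/self->wt/
{
        printf("WAIT: %d us\n", (timestamp - self->wt) / 1000);
        self->wt = 0;
}

This only gives durations, not event names – decoding the event itself needs the function arguments or a lookup against the session – but it is enough to see wait time accumulating without touching the database.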

The rowsource timings are gathered by tracing the function calls for the various rowsource operators defined in Metalink note 175982.1. Specifically, the script currently traces calls to the following rowsource operators (a sketch of the probing approach for one of them follows the list):

  • GS – Group By Sort
  • JO – Join (Nested Loop)
  • UA – Union All
  • VW – View
  • HJ – Hash Join
  • TB – Table
  • JOT – Unknown. Possible relation to Join, observed frequently during stack tracing
  • SO – Sort
  • IXT – Index
  • PF – Prefetch
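
As an illustration of how the per-second rowsource numbers are gathered, here is a sketch for a single operator. I have used qerjoFetch (nested loop join fetch) as the probe name purely as an example – the real function list comes from the Metalink note and from stack tracing, and may differ on your version:

#!/usr/sbin/dtrace -s

/*
 * Rough sketch: accumulate per-second time inside one rowsource fetch
 * function. qerjoFetch is used as an illustrative name only.
 * Note that time spent in child calls is included, which is exactly the
 * double-accounting issue discussed further down.
 * Attach with: dtrace -s rsrc.d -p <oracle_pid>  (script name illustrative).
 */

pid$target::qerjoFetch:entry
{
        self->jo = timestamp;
}

pid$target::qerjoFetch:return
/self->jo/
{
        @jo = sum(timestamp - self->jo);
        self->jo = 0;
}

profile:::tick-1sec
{
        normalize(@jo, 1000000);        /* nanoseconds -> milliseconds */
        printa("LAST1S (ms): rsrc=(%@d:JO)\n", @jo);
        clear(@jo);
}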

Running the same misbehaving SQL as Tanel in his last posting, we get the following output (snippet only):

LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 464:JO 0:UA 0:VW 0:HJ 0:TB 21:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=1 rsrc=(0:GS 430:JO 0:UA 0:VW 0:HJ 0:TB 9:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 471:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 467:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
SYSCALL: times 7104 ns
SYSCALL: times 1749 ns
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 22:JOT 0:SO)

Remember that this SQL example precisely fits our profile where the session is using 100% user-mode CPU and emitting no wait information, so we cannot see any OWI waits detailed here – just summaries of where the time was spent every second and a couple of system calls. We also have no page faults taking place on this test, so no time is accounted for there. The big user of CPU (468ms per second) is the nested loop join rowsource operator, which agrees with Tanel’s findings using session statistics. We are also spending a little time acquiring latches, but it’s not very significant. There are clearly some rowsource functions missing, given that we can only account for half the CPU time, so I did a bit more function tracing to find the names of the missing functions, and added them to the script. The output now looks as follows (scroll right to see the new functions):

LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 461:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 163:IXT 759:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 691:JO 0:UA 0:VW 0:HJ 0:TB 455:JOT 0:SO 304:IXT 843:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 460:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 162:IXT 759:PF)
SYSCALL: times 7810 ns
SYSCALL: times 1750 ns
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 686:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 301:IXT 842:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 436:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 177:IXT 721:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 708:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 286:IXT 880:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 357:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 228:IXT 588:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 788:JO 0:UA 0:VW 0:HJ 0:TB 445:JOT 0:SO 235:IXT 1013:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 277:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 276:IXT 458:PF)

Looks like we’ve found the missing time, and then some (the timers add up to more than 1000ms each second). The reason for this exaggeration is that there is a good deal of double accounting going on in the current version of the script – child function times are also accounted for in the parent. This is a known bug; hopefully I’ll get a chance to fix it soon.

Well, that’s about it for now. I hope that this journey has been an interesting one; I know it’s been a lot of reading to take in all the parts. We started with a session that was emitting no wait information, and now we have a simple Dtrace script that is easy to execute with no impact on other sessions, and easy to stop once the underlying cause is determined. The query does not need to be stopped and restarted, and no additional overhead is imposed from enabling additional Oracle diagnostics such as setting statistics_level=all. I intend to expand on this approach over time, with a particular focus on:

  • Multiplatform support – a Systemtap version is required
  • A better design to support more function tracing
  • A fix for the double-counting issue
  • Identifiers for the optimiser plan line number, rather than rolling up by function name

Upcoming Presentation Slots, Spring 2010

Posted February 1, 2010 by James Morle
Categories: Uncategorized

I will be presenting the “Oracle Wait Interface is Useless” dog and pony show a couple of times this spring. It was first presented at UKOUG 2009 with Tanel Poder, and will be seeing the light of day at the following events:

I hope to see some of you there, and will get the final part posted before RMOUG – so the picture will be complete. If you’ve got any questions, it’s a good opportunity to come and heckle at one of those events!

The Oracle Wait Interface Is Useless (sometimes) – part 3a

Posted January 18, 2010 by James Morle
Categories: Generic, Oracle

OK, here it is, the ‘first part of the last part’, though the topics covered in these articles will be discussed further over time on my blog and on Tanel’s. I’ve split it into two subparts, because it was just getting insanely long as a single posting.

Before I get going on this one, let’s just clear up a misunderstanding from the first part of this series. The first part uses a specific example which, for clarity, will continue to be used for the remainder of the series. The example shown is some kind of query-related issue in this case, but the approach shown here is a more general one that will locate the root cause in many, many cases other than just that of a bad query. This SQL example is still a good one, though, because there are no waits, lots of CPU, and no system calls. But the underlying problem might not be caused by poor SQL, and diving straight into SQL tuning could cost you two days of fixing SQL when that is not the root cause. Just because we happen to be running SQL, it does not mean that it is a SQL problem. What we are trying to achieve here is a more robust approach to root cause diagnosis of Oracle performance problems, one where we don’t guess, we don’t assume, and we don’t stick with things we’ve seen before: we quantitatively find the real problem. That might be the smart way, it might be the lazy way – those might be the same thing! But most crucially, it’s the fast way to fixing the real problem.

So, back to the main topic. In the first part of this blog, I walked through the example problem and showed that we have a problem that has the following attributes:

  • The user is reporting a server response time problem
  • Zero Oracle waits
  • 100% CPU
  • Zero system calls
  • DB Time is showing all the CPU and elapsed time is spent executing SQL (in this example…)

Note to Kevin: Though I stated the query ‘never’ returns in part one, I was, of course, using my usual exaggerative vernacular. I meant that it was taking a very long time to return, but still returning correct results. Sorry about that – it will ‘never’ happen again.

At this point, we might jump into a prolonged SQL tuning exercise. But let’s not for now, because that is still guesswork in my opinion. Let’s make the following reasonable assumption: The whole system is running slowly, using huge quantities of CPU, and this is a mission-critical system where you cannot simply start killing queries and re-executing them in your SQL*Plus session. That’s a reasonable assumption – I have worked on many, many production systems where this is the case. So let’s carry on diagnosing without actually interfering with the production system in any way.

This is where we might deviate from time-based diagnostics. Not because we don’t think that time is the single most valuable metric on which to determine what ‘fast’ and ‘slow’ actually mean in quantitative terms, but because there is a bit of a shortcut available to us here that might just nail the cause of the problem. That shortcut is session-based statistics, and they have nothing to do with time, being simple counters.

This is where it makes no sense at all to re-invent the wheel, which is why Tanel was on the hook for this part: Tanel has already done some significant work on session-based statistics analysis and has written a great tool, Snapper, for just this purpose. If you have not already read part two of this series, head over to Tanel’s blog now to catch up before proceeding!

OK, so we’ve abandoned the wait interface, and let’s assume that the session statistics did not yield anything particularly obvious. Or maybe you just want to dig a little deeper, or possibly take yet another shortcut to the truth? Now it’s time to look beyond Oracle and into the Operating System. The Operating System has a completely different viewpoint of the operation of Oracle, notably through the use of execution stack profiling. Don’t run away just yet, it’s not as difficult as you might imagine.

Various methods exist to probe the execution stack across the various Operating Systems. All of these should be used with caution, as they have varying degrees of intrusiveness on the operation of Oracle. The more intrusive methods, such as gdb, and the (possibly) less tested methods, such as oradebug short_stack, have very real potential to crash the process that you are attaching to, or worse. Don’t say I didn’t warn you! However, other methods for profiling the stack are less intrusive, and are already very robust: I’m talking here about Dtrace and its variants.

I have held off from getting too excited by Dtrace in the past, even though the technology always looked amazing. The reason for holding back was that it was a single-OS tool, and without anything comparable on the Linux platform I didn’t see it being useful to more than 75% of my customers. That’s changing now, though, with the (at least partial) production release of Systemtap in RHEL 5.4, and there is even a similar tool for AIX named ProbeVue. So now I’m excited about this technology!

Before digging into Dtrace and friends, I think it’s important that we take a little diversionary refresher on what a stack actually is, and why it is important. For those with a good coding background, look away now, and rejoin the group for part 3b. Just to keep your interest level up, though, would it be interesting to get a second-by-second view of where the time goes for a process, down to the following levels of detail?

  • Which line of the plan did I spend most time in?
  • How much time did I spend waiting for page faults?
  • Which wait events did I spend most time in?

That’s going to be part 3b… so please check back periodically!

Back to the “Stack Primer for DBAs”: Consider the following elementary piece of C code:

#include <stdio.h>

int i;

/* An empty function - it exists only so that we can watch main() call it */
void my_function() {
}

int
main() {

	scanf("%d", &i);
	my_function();

	return 0;
}

Don’t worry about the details of the language if you are unfamiliar with C. The only things to observe here are that we have one “function”, named ‘main()’, which is making a call to another function ‘my_function()’. The main() function is a special one – all standalone C programs have a main() function, which is the initial point of execution. A C program is just a set of compiled machine code instructions at execute time, just like any other computer program. These instructions occupy a chunk of memory, where each memory location is either a machine code instruction or a piece of data. When the code gets down to the call to my_function() in main(), the execution must jump (or branch) to the address that my_function()’s machine code instructions reside at. Before this can happen, an amount of context must be stored, namely:

  • The ‘return’ address, so that execution can resume in the main() function after my_function() is complete
  • Any ‘data’ passed to my_function() (the function arguments)

In C, this information is pushed onto the stack and then the execution branches to the entry point of my_function(). As a side note, the sharing of data and execution context in this way is a common way to exploit security holes, by overwriting execution context with oversized arguments from the data portion. We’re not interested in all that stuff here – we are only interested in observing the stack contents. If we took a stack dump while the program was inside my_function(), it would look like this:

my_function()
main()

Pretty simple, huh? If there were arguments for main() and my_function(), they would also be shown in some way. Most stack trace facilities can only show primitive datatypes (such as integers) and all other more complex arguments (such as structs) are shown as pointers (addresses) to the actual data without decoding the actual content, as follows:

my_function(42, 0x12345678)
main(0x45678900)

Some tools will print a stack trace (known as a backtrace, because it is unwound backwards from the current position in the stack) with some known pointers decoded into human readable form, such as simple struct datatypes. Anyway, let’s have a look at something a bit more complicated – an Oracle backtrace taken from an errorstack trace file:

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   7FFFB345A778 ? 000000000 ?
ksedst1()+98         call     skdstdst()           000000000 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksedst()+34          call     ksedst1()            000000001 ? 000000001 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
dbkedDefDump()+2736  call     ksedst()             000000001 ? 000000001 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksedmp()+36          call     dbkedDefDump()       000000001 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksdxfdmp()+1837      call     ksedmp()             000000001 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksdxcb()+1782        call     ksdxfdmp()           7FFFB345BAE0 ? 000000011 ?
                                                   000000003 ? 7FFFB345BA40 ?
                                                   7FFFB345B9A0 ? 000000000 ?
sspuser()+112        call     ksdxcb()             000000001 ? 000000011 ?
                                                   000000001 ? 000000001 ?
                                                   7FFFB345B9A0 ? 000000000 ?
__restore_rt()       call     sspuser()            000000001 ? 000000011 ?
                                                   000000001 ? 000000001 ?
                                                   7FFFB345B9A0 ? 000000000 ?
semtimedop()+10      signal   __restore_rt()       000018000 ? 7FFFB345C730 ?
                                                   000000001 ?
                                                   FFFFFFFFFFFFFFFF ?
                                                   FFFFFFFFFFD23940 ?
                                                   000000000 ?
sskgpwwait()+259     call     semtimedop()         000018000 ? 7FFFB345C730 ?
                                                   000000001 ? 7FFFB345C6D8 ?
                                                   FFFFFFFFFFD23940 ?
                                                   000000000 ?
skgpwwait()+151      call     sskgpwwait()         7FFFB345CB94 ? 00A9A15C0 ?
                                                   07848C4D8 ? 0002DC6C0 ?
                                                   7FFF00000000 ? 000000000 ?
ksliwat()+1816       call     skgpwwait()          7FFFB345CB94 ? 00A9A15C0 ?
                                                   000000000 ? 0002DC6C0 ?
                                                   000000000 ? 000000000 ?
kslwaitctx()+157     call     ksliwat()            078666348 ? 078666348 ?
                                                   005F5DFA7 ? 000000000 ?
                                                   100000000 ? 000000000 ?
kslwait()+136        call     kslwaitctx()         7FFFB345CE30 ? 000000000 ?
                                                   005F5DFA7 ? 000000000 ?
                                                   100000000 ? 000000000 ?
psdwat()+107         call     kslwait()            005F5DFA7 ? 000000167 ?
                                                   000000000 ? 005F5DFA7 ?
                                                   000000000 ? 000000000 ?
pevm_icd_call_commo  call     psdwat()             005F5DFA7 ? 000000167 ?
n()+421                                            000000000 ? 005F5DFA7 ?
                                                   000000000 ? 000000000 ?
pfrinstr_ICAL()+164  call     pevm_icd_call_commo  7FFFB345E0F0 ? 000000000 ?
                              n()                  000000001 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
pfrrun_no_tool()+63  call     pfrinstr_ICAL()      7FAF7E8A3500 ? 06A03B9AA ?
                                                   7FAF7E8A3570 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
pfrrun()+1025        call     pfrrun_no_tool()     7FAF7E8A3500 ? 06A03B9AA ?
                                                   7FAF7E8A3570 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
plsql_run()+769      call     pfrrun()             7FAF7E8A3500 ? 000000000 ?
                                                   7FAF7E8A3570 ? 7FFFB345E0F0 ?
                                                   7FAF00000001 ? 070C18646 ?
peicnt()+296         call     plsql_run()          7FAF7E8A3500 ? 000000001 ?
                                                   000000000 ? 7FFFB345E0F0 ?
                                                   7FAF00000001 ? 000000000 ?
kkxexe()+520         call     peicnt()             7FFFB345E0F0 ? 7FAF7E8A3500 ?
                                                   7FAF7E8C4028 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
opiexe()+14796       call     kkxexe()             7FAF7E8A5128 ? 7FAF7E8A3500 ?
                                                   000000000 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
kpoal8()+2283        call     opiexe()             000000049 ? 000000003 ?
                                                   7FFFB345F678 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
opiodr()+1149        call     kpoal8()             00000005E ? 00000001C ?
                                                   7FFFB3462750 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 5E00000001 ?
ttcpip()+1251        call     opiodr()             00000005E ? 00000001C ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   008C5E7F0 ? 5E00000001 ?
opitsk()+1628        call     ttcpip()             00A9B0890 ? 0086BA768 ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opiino()+953         call     opitsk()             00A9B0890 ? 000000000 ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opiodr()+1149        call     opiino()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opidrv()+565         call     opiodr()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
sou2o()+98           call     opidrv()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
opimai_real()+128    call     sou2o()              7FFFB3463E20 ? 00000003C ?
                                                   000000004 ? 7FFFB3463E48 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
ssthrdmain()+209     call     opimai_real()        000000002 ? 7FFFB3464010 ?
                                                   000000004 ? 7FFFB3463E48 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
main()+196           call     ssthrdmain()         000000002 ? 7FFFB3464010 ?
                                                   000000001 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
__libc_start_main()  call     main()               000000002 ? 7FFFB34641B8 ?
+253                                               000000001 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
_start()+36          call     __libc_start_main()  0009D3D74 ? 000000002 ?
                                                   7FFFB34641A8 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?

I know, it’s a bit much to deal with at first glance. But take it a piece at a time, reading in conjunction with Metalink note 175982.1, and it all begins to make sense. Actually, I’m guessing at some of these function names, because the Metalink note doesn’t give all the answers, but it seems to make sense.

Let’s start at the bottom, which is the beginning of the execution of the C program named ‘oracle’. The bottom two frames are the GLIBC execution wrappers that eventually call Oracle’s main() function. The next several frames above that are the initialisation of Oracle and the (O)racle (P)rogrammatic (I)nterface, which is the internal API used within Oracle. TTCPIP is the Two-Task Communication interface with my SQL*Plus process, but the first really interesting frame is kkxexe(), which is the (k)ernel (k)ompile e(x)ecute (exe)cute statement call – the start of the actual SQL processing. In this case it is actually a simple PL/SQL call to DBMS_LOCK.SLEEP(100000), and the entry to the Oracle Wait Interface can be seen at the kslwait() call, eventually ending up with a semtimedop() call, which is the way the sleep has been implemented. Incidentally, the sskgpwwait() frame that calls semtimedop() is Oracle’s platform-specific code which, in this case, is Red Hat Linux. Near the top we see the call to ksedst(), which probably stands for (k)ernel (s)ervices (e)rrorstack (d)ump (s)tack (t)race – the part of the code actually producing the stack trace we are reading.

So what does all this actually mean in terms of seeing what our Oracle session is up to? Well, we can see what is being called and by whom and, once we add dynamic tracing to the picture, how long each call takes. We don’t need to know all the function names and what they are responsible for, but we can come up with a selection of interesting ones that we would like to pay attention to. Thinking in terms of SQL execution we can see, for example, a nested loop rowsource function calling out to an index fetch rowsource function, which in turn calls out to a table fetch function.

Even just a few consecutive stack traces of a running query will make it very plain which rowsource Oracle is spending the most time within, simply because of the laws of probability. If you look once a second for a minute, and every stack you see has the ‘index fetch’ rowsource as the current function, then you probably spent most of the last minute doing that kind of processing. But we can do better than that… we can start to look at quantifying where the time is spent, and also to supplement that information with other interesting aspects of the operating system operation, aspects which are somewhat concealed from Oracle and yet can detrimentally affect its performance.
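
To put the ‘look once in a while’ idea into DTrace terms, a sampling profiler for a single process can be as small as the following sketch (the script name and sampling rate are arbitrary):

#!/usr/sbin/dtrace -s

/*
 * Rough sketch: sample the user-mode stack of one process at ~101Hz and
 * report the ten most frequently seen stacks when the script is stopped.
 * Attach with: dtrace -s stacks.d -p <oracle_pid>, Ctrl-C to finish.
 */

profile-101
/pid == $target/
{
        @stacks[ustack()] = count();
}

dtrace:::END
{
        trunc(@stacks, 10);
        printa(@stacks);
}

The hottest stacks will have the dominant rowsource function sitting near the top, which is the ‘laws of probability’ argument above, but with numbers attached.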

Join me back for part 3b for the conclusion of this series of articles!

I’m now an Oracle ACE Director

Posted January 15, 2010 by James Morle
Categories: Oracle

Just a quick post in celebration of my recent appointment as an Oracle ACE Director! I’m very happy to be invited to this group, and look forward to getting involved in some of the discussions.

The Oracle Wait Interface Is Useless – News About Part Two

Posted January 14, 2010 by James Morle
Categories: Oracle

As many have probably noticed, the blogging experiment with Tanel hasn’t been quite as productive as it might optimally have been. Tanel doesn’t currently have the time to write part two of this article, so I’m going to pick up the baton and write the next part. Watch this space!

The Oracle Wait Interface Is Useless (sometimes) – Part One: The Problem Definition

Posted November 9, 2009 by James Morle
Categories: Oracle

So here we go, this is part one of this experiment in blogging and co-writing. Tanel has actually written some good stuff already for this, but I wanted to try and formalise things under a common title and make it easier to follow between our sites.

I thought it would be logical to start this process by producing a more concrete problem definition, so that’s the focus of this part. It’s unlikely that we will come up with a complete method in this initial work, but hopefully the wheels will at least turn a little by the end of it!

So first of all, why would I dare to say that the Oracle Wait Interface is useless? Well, partly because I quite like titles that are a little bit catchy, and partly because it is indeed sometimes useless. The emphasis is on the word sometimes, though, because the Oracle Wait Interface is still the single most useful feature in any database product. Wow – that’s quite a claim, isn’t it? This isn’t the place to fully explain why that is, and many others have written great works on this subject already. Check out Cary Millsap’s works, notably his book, Optimizing Oracle Performance, which focuses in great detail on this subject. For the sake of this article, however, here’s why it is so useful: It tells you where the time goes. Think about it: If something is running too slowly, knowing where the time is used up is the single piece of information required to focus on the right subject for tuning.

So what’s wrong with the Oracle wait interface? Just one thing, actually – it is designed to provide visibility of relatively slow waits. The reason for this is simply that there is a slight overhead in timing every single wait. If that overhead becomes a noticeable proportion of the actual wait itself then the measurement becomes inaccurate (and makes the problem worse). On UNIX-like platforms (yes, that includes Linux), the actual timing interface is implemented using gettimeofday(2) system calls, one before the event and one after the event. This call gives microsecond granularity of timing, at least in theory (on my Opteron 280 test machine, gettimeofday() calls take 1.5 microseconds). So, using this kind of mechanism for events that take a relatively long time makes perfect sense – disk I/O, for example, which will take at least three orders of magnitude longer to complete than the timing calls themselves. Conversely, it makes no sense at all for calls that take even as little as 50 microseconds, as the 3 microsecond penalty for measuring the wait is already 6% of the actual event time at that point. There you go, that’s the beginning of the justification that the wait interface is useless, in a nutshell.
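
If you want to check the cost of the timing call on your own kit, a dynamic tracing one-liner will do it. The following is just a sketch – it assumes Solaris-style DTrace and that the call resolves to gettimeofday() in libc.so.1; on other platforms both the library and the mechanism differ:

#!/usr/sbin/dtrace -s

/*
 * Rough sketch: measure how long each gettimeofday() call takes inside a
 * target process. Attach with: dtrace -s gtod.d -p <pid>
 * The libc.so.1 module name is a Solaris assumption.
 */

pid$target:libc.so.1:gettimeofday:entry
{
        self->ts = timestamp;
}

pid$target:libc.so.1:gettimeofday:return
/self->ts/
{
        @["gettimeofday latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}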

But hang on, isn’t 50 microseconds really, really fast? Well no, actually, it isn’t. Taking Intel’s Nehalem processor (with Quickpath) as an example, a memory latency is around the 50 ns range – three orders of magnitude faster than the 50 microsecond cut-off that I just arbitrarily invented. Memory access is also the slowest thing that a CPU can do (without factoring in peripheral cards) – in this case the CPU has to wait for about 150 cycles while that memory access takes place. So it’s very possible to have a function call that does fairly complex work and is still an order of magnitude or two faster than the gettimeofday() system call.

Time for an example. Actually, this is a variation on the example that made me start thinking about this – I had been perfectly happy with the Oracle Wait Interface until this point for 99% of cases!

And Now For Something Completely Different…

Posted October 12, 2009 by James Morle
Categories: Oracle

So just a short post today, more of an announcement…

At 11.05am on Wednesday 2nd December, I will co-present a talk at the UKOUG conference with Tanel Poder. OK, so nothing particularly earth-shattering there. However, we’ve decided to do something a little bit different with the creative process. The talk is entitled “The Oracle Wait Interface Is Useless (sometimes)”, and is a subject that both Tanel and I have been working on separately for a little while. The premise is this: The wait interface is great for ‘slow’ waits, but what about waits that are not instrumented by the wait interface? What about waits that are not waits from Oracle’s perspective, such as reading a page of memory? What about pure inefficiency? There is, of course, the concept of DB Time, but it is not currently granular enough.

So, you probably can see the idea behind the presentation: The goal is to present some alternative diagnostic techniques to determine the cause of poor performance. We happen to both have similar ideas on this, and they don’t just stop at Oracle.

So here’s the new concept, at least for us: We are going to write the content for the presentation as a tag team effort between our two blogs. Airing our dirty laundry in public, so to speak. We think this will give a fairly unique opportunity for public comment before the presentation is actually given!

So, over to Tanel for part one…