OK, here it is, the ‘first part of the last part’, though the topics discussed in these articles will be covered further over time in my blog and in Tanel’s. I’ve split it into two subparts, because it was getting insanely long as a single posting.
Before I get going on this one, let’s clear up a misunderstanding from the first part of this series. The first part uses a specific example which, for clarity, will continue to be used for the remainder of the series. That example happens to be a query-related issue, but the approach shown here is a general one that will locate the root cause in many, many cases beyond just a bad query. The SQL example is still a good one, though, because there are no waits, lots of CPU, and no system calls. But the underlying problem might not be caused by poor SQL, and diving straight into SQL tuning could waste two days of effort when SQL is not the root cause. Just because we happen to be running SQL does not mean we have a SQL problem. What we are trying to achieve here is a more robust approach to root cause diagnosis of Oracle performance problems, one where we don’t guess, we don’t assume, and we don’t stick with things we’ve seen before: we quantitatively find the real problem. That might be the smart way, it might be the lazy way – those might be the same thing! But most crucially, it’s the fast way to fixing the real problem.
So, back to the main topic. In the first part of this blog, I walked through the example problem and showed that we have a problem that has the following attributes:
- The user is reporting a server response time problem
- Zero Oracle waits
- 100% CPU
- Zero system calls
- DB Time shows that all of the CPU and elapsed time is spent executing SQL (in this example…)
Note to Kevin: Though I stated the query ‘never’ returns in part one, I was, of course, using my usual exaggerative vernacular. I meant that it was taking a very long time to return, but still returning correct results. Sorry about that – it will ‘never’ happen again.
At this point, we might jump into a prolonged SQL tuning exercise. But let’s not, for now, because in my opinion that is still guesswork. Let’s make the following reasonable assumption: the whole system is running slowly, using huge quantities of CPU, and this is a mission-critical system where you cannot simply start killing queries and re-executing them in your SQL*Plus session. That’s a reasonable assumption – I have worked on many, many production systems where this is the case. So let’s carry on diagnosing without actually interfering with the production system in any way.
This is where we might deviate from time-based diagnostics. Not because we don’t think that time is the single most valuable metric on which to determine what ‘fast’ and ‘slow’ actually mean in quantitative terms, but because there is a bit of a shortcut available to us here that might just nail the cause of the problem. That shortcut is session-based statistics, and they have nothing to do with time, being simple counters.
This is where it makes no sense at all to re-invent the wheel, and is why Tanel is on the hook for this part: Tanel has already done some significant work on session-based statistics analysis and has written a great tool, snapper, for just this purpose. If you have not already read part two of this series, head over to Tanel’s blog now to catch up before proceeding!
OK, so we’ve abandoned the wait interface, and let’s assume that the session statistics did not yield anything particularly obvious. Or maybe you just want to dig a little deeper, or possibly take yet another shortcut to the truth? Now it’s time to look beyond Oracle and into the Operating System. The Operating System has a completely different viewpoint on the operation of Oracle, notably through the use of execution stack profiling. Don’t run away just yet – it’s not as difficult as you might imagine.
Various methods exist to probe the execution stack across the various Operating Systems. All of them should be used with caution, as they have varying degrees of intrusiveness on the operation of Oracle. The more intrusive methods (such as gdb) and the possibly less-tested ones (such as oradebug short_stack) have very real potential to crash the process that you are attaching to, or worse. Don’t say I didn’t warn you! Other methods for profiling the stack, however, are less intrusive and already very robust: I’m talking here about DTrace and its variants.
I have held off from getting too excited about DTrace in the past, even though the technology always looked amazing. The reason for holding back was that it was a single-OS tool, and without anything comparable on the Linux platform I didn’t see it being useful to >75% of my customers. That’s changing now, though, with the (at least partial) production release of SystemTap in RHEL 5.4, and there is even a similar tool for AIX named ProbeVue. So now I’m excited about this technology!
Before digging into DTrace and friends, I think it’s important that we take a little diversionary refresher on what a stack actually is, and why it is important. For those with a good coding background: look away now, and rejoin the group for part 3b. Just to keep your interest level up, though, would it be interesting to get a second-by-second view of where the time goes for a process, down to the following levels of detail?
- Which line of the plan did I spend most time in?
- How much time did I spend waiting for page faults?
- Which wait events did I spend most time in?
That’s going to be part 3b… so please check back periodically!
Back to the “Stack Primer for DBAs”: Consider the following elementary piece of C code:
#include <stdio.h>
int i;
void my_function(void) {
    printf("%d\n", i);   /* trivial body so the program links and runs */
}
int main(void) {
    scanf("%d", &i);     /* %d matches the int variable i */
    my_function();
    return 0;
}
Don’t worry about the details of the language if you are unfamiliar with C. The only thing to observe here is that we have a function named main() which makes a call to another function, my_function(). The main() function is a special one – all standalone C programs have a main() function, which is the initial point of execution. At execution time, a C program is just a set of compiled machine code instructions, like any other computer program. These instructions occupy a chunk of memory, where each memory location holds either a machine code instruction or a piece of data. When execution gets down to the call to my_function() inside main(), it must jump (or branch) to the address at which my_function()’s machine code instructions reside. Before this can happen, an amount of context must be stored, namely:
- The ‘return’ address, so that execution can resume in the main() function after my_function() is complete
- Any ‘data’ passed to my_function() (the function arguments)
In C, this information is pushed onto the stack, and then execution branches to the entry point of my_function(). As a side note, the sharing of data and execution context in this way is a common way to exploit security holes, by overwriting execution context with oversized arguments from the data portion. We’re not interested in all that here – we only want to observe the stack contents. If we took a stack dump while the program was executing inside my_function(), it would look like this:
my_function()
main()
Pretty simple, huh? If there were arguments for main() and my_function(), they would also be shown in some way. Most stack trace facilities can only show primitive datatypes (such as integers) and all other more complex arguments (such as structs) are shown as pointers (addresses) to the actual data without decoding the actual content, as follows:
my_function(42, 0x12345678)
main(0x45678900)
Some tools will print a stack trace (known as a backtrace, because it is unwound backwards from the current position in the stack) with some known pointers decoded into human readable form, such as simple struct datatypes. Anyway, let’s have a look at something a bit more complicated – an Oracle backtrace taken from an errorstack trace file:
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36 call kgdsdst() 000000000 ? 000000000 ?
7FFFB3456278 ? 000000001 ?
7FFFB345A778 ? 000000000 ?
ksedst1()+98 call skdstdst() 000000000 ? 000000000 ?
7FFFB3456278 ? 000000001 ?
000000000 ? 000000000 ?
ksedst()+34 call ksedst1() 000000001 ? 000000001 ?
7FFFB3456278 ? 000000001 ?
000000000 ? 000000000 ?
dbkedDefDump()+2736 call ksedst() 000000001 ? 000000001 ?
7FFFB3456278 ? 000000001 ?
000000000 ? 000000000 ?
ksedmp()+36 call dbkedDefDump() 000000001 ? 000000000 ?
7FFFB3456278 ? 000000001 ?
000000000 ? 000000000 ?
ksdxfdmp()+1837 call ksedmp() 000000001 ? 000000000 ?
7FFFB3456278 ? 000000001 ?
000000000 ? 000000000 ?
ksdxcb()+1782 call ksdxfdmp() 7FFFB345BAE0 ? 000000011 ?
000000003 ? 7FFFB345BA40 ?
7FFFB345B9A0 ? 000000000 ?
sspuser()+112 call ksdxcb() 000000001 ? 000000011 ?
000000001 ? 000000001 ?
7FFFB345B9A0 ? 000000000 ?
__restore_rt() call sspuser() 000000001 ? 000000011 ?
000000001 ? 000000001 ?
7FFFB345B9A0 ? 000000000 ?
semtimedop()+10 signal __restore_rt() 000018000 ? 7FFFB345C730 ?
000000001 ?
FFFFFFFFFFFFFFFF ?
FFFFFFFFFFD23940 ?
000000000 ?
sskgpwwait()+259 call semtimedop() 000018000 ? 7FFFB345C730 ?
000000001 ? 7FFFB345C6D8 ?
FFFFFFFFFFD23940 ?
000000000 ?
skgpwwait()+151 call sskgpwwait() 7FFFB345CB94 ? 00A9A15C0 ?
07848C4D8 ? 0002DC6C0 ?
7FFF00000000 ? 000000000 ?
ksliwat()+1816 call skgpwwait() 7FFFB345CB94 ? 00A9A15C0 ?
000000000 ? 0002DC6C0 ?
000000000 ? 000000000 ?
kslwaitctx()+157 call ksliwat() 078666348 ? 078666348 ?
005F5DFA7 ? 000000000 ?
100000000 ? 000000000 ?
kslwait()+136 call kslwaitctx() 7FFFB345CE30 ? 000000000 ?
005F5DFA7 ? 000000000 ?
100000000 ? 000000000 ?
psdwat()+107 call kslwait() 005F5DFA7 ? 000000167 ?
000000000 ? 005F5DFA7 ?
000000000 ? 000000000 ?
pevm_icd_call_commo call psdwat() 005F5DFA7 ? 000000167 ?
n()+421 000000000 ? 005F5DFA7 ?
000000000 ? 000000000 ?
pfrinstr_ICAL()+164 call pevm_icd_call_commo 7FFFB345E0F0 ? 000000000 ?
n() 000000001 ? 000000004 ?
7FAF00000001 ? 000000000 ?
pfrrun_no_tool()+63 call pfrinstr_ICAL() 7FAF7E8A3500 ? 06A03B9AA ?
7FAF7E8A3570 ? 000000004 ?
7FAF00000001 ? 000000000 ?
pfrrun()+1025 call pfrrun_no_tool() 7FAF7E8A3500 ? 06A03B9AA ?
7FAF7E8A3570 ? 000000004 ?
7FAF00000001 ? 000000000 ?
plsql_run()+769 call pfrrun() 7FAF7E8A3500 ? 000000000 ?
7FAF7E8A3570 ? 7FFFB345E0F0 ?
7FAF00000001 ? 070C18646 ?
peicnt()+296 call plsql_run() 7FAF7E8A3500 ? 000000001 ?
000000000 ? 7FFFB345E0F0 ?
7FAF00000001 ? 000000000 ?
kkxexe()+520 call peicnt() 7FFFB345E0F0 ? 7FAF7E8A3500 ?
7FAF7E8C4028 ? 7FFFB345E0F0 ?
7FAF7E8C1FD0 ? 000000000 ?
opiexe()+14796 call kkxexe() 7FAF7E8A5128 ? 7FAF7E8A3500 ?
000000000 ? 7FFFB345E0F0 ?
7FAF7E8C1FD0 ? 000000000 ?
kpoal8()+2283 call opiexe() 000000049 ? 000000003 ?
7FFFB345F678 ? 7FFFB345E0F0 ?
7FAF7E8C1FD0 ? 000000000 ?
opiodr()+1149 call kpoal8() 00000005E ? 00000001C ?
7FFFB3462750 ? 7FFFB345E0F0 ?
7FAF7E8C1FD0 ? 5E00000001 ?
ttcpip()+1251 call opiodr() 00000005E ? 00000001C ?
7FFFB3462750 ? 000000000 ?
008C5E7F0 ? 5E00000001 ?
opitsk()+1628 call ttcpip() 00A9B0890 ? 0086BA768 ?
7FFFB3462750 ? 000000000 ?
7FFFB34621B0 ? 7FFFB3462954 ?
opiino()+953 call opitsk() 00A9B0890 ? 000000000 ?
7FFFB3462750 ? 000000000 ?
7FFFB34621B0 ? 7FFFB3462954 ?
opiodr()+1149 call opiino() 00000003C ? 000000004 ?
7FFFB3463E48 ? 000000000 ?
7FFFB34621B0 ? 7FFFB3462954 ?
opidrv()+565 call opiodr() 00000003C ? 000000004 ?
7FFFB3463E48 ? 000000000 ?
008C5E2A0 ? 7FFFB3462954 ?
sou2o()+98 call opidrv() 00000003C ? 000000004 ?
7FFFB3463E48 ? 000000000 ?
008C5E2A0 ? 7FFFB3462954 ?
opimai_real()+128 call sou2o() 7FFFB3463E20 ? 00000003C ?
000000004 ? 7FFFB3463E48 ?
008C5E2A0 ? 7FFFB3462954 ?
ssthrdmain()+209 call opimai_real() 000000002 ? 7FFFB3464010 ?
000000004 ? 7FFFB3463E48 ?
008C5E2A0 ? 7FFFB3462954 ?
main()+196 call ssthrdmain() 000000002 ? 7FFFB3464010 ?
000000001 ? 000000000 ?
008C5E2A0 ? 7FFFB3462954 ?
__libc_start_main() call main() 000000002 ? 7FFFB34641B8 ?
+253 000000001 ? 000000000 ?
008C5E2A0 ? 7FFFB3462954 ?
_start()+36 call __libc_start_main() 0009D3D74 ? 000000002 ?
7FFFB34641A8 ? 000000000 ?
008C5E2A0 ? 7FFFB3462954 ?
I know, it’s a bit much to take in at first glance. But take it a piece at a time, reading in conjunction with Metalink note 175982.1, and it all begins to make sense. Actually, I’m guessing at some of these function names, because the Metalink note doesn’t give all the answers, but they seem to make sense.
Let’s start at the bottom, which is the beginning of the execution of the C program named ‘oracle’. The bottom two entries, _start() and __libc_start_main(), are the glibc execution wrappers that eventually call Oracle’s main() function. The next several entries above that are the initialisation of Oracle and of the (O)racle (P)rogrammatic (I)nterface, the internal API used within Oracle; ttcpip() is the Two-Task Communication interface to my SQL*Plus process. The first really interesting entry is kkxexe(), the (k)ernel (k)ompile e(x)ecute (exe)cute-statement call, which is the start of the actual SQL processing. In this case it is a simple PL/SQL call to DBMS_LOCK.SLEEP(100000): the entry into the Oracle wait interface can be seen at the kslwait() call, eventually ending up in a semtimedop() system call, which is how the sleep has been implemented. Incidentally, the sskgpwwait() call just above semtimedop() in the listing is Oracle’s platform-specific code, the platform in this case being Red Hat Linux. Near the top we see the call to ksedst(), which probably stands for (k)ernel (s)ervices (e)rrorstack (d)ump (s)tack (t)race, and is the part of the code actually producing the stack trace we are reading.
So what does all this actually mean in terms of seeing what our Oracle session is up to? Well, a single trace shows us what is called and by whom; repeated traces let us infer where the time is going. We don’t need to know all the function names and what they are responsible for, but we can come up with a selection of interesting ones to pay attention to. Thinking in terms of SQL execution we can see, for example, a nested loop rowsource function calling out to an index fetch rowsource function, which in turn calls out to a table fetch function.
Even just a few consecutive stack traces of a running query will make it very plain which rowsource Oracle is spending the most time within, simply because of the laws of probability. If you look once a second for a minute, and every stack you see has the ‘index fetch’ rowsource as the current function, then you probably spent most of the last minute doing that kind of processing. But we can do better than that… we can start to look at quantifying where the time is spent, and also to supplement that information with other interesting aspects of the operating system operation, aspects which are somewhat concealed from Oracle and yet can detrimentally affect its performance.
Join me back for part 3b for the conclusion of this series of articles!