Friday, April 20, 2007

Today's dtrace one-liners

Well, I've been spending a bit more time with dtrace lately, and I'm now starting to do interesting things with it without looking at the manual or snagging other people's scripts. Today I found some processes that looked like they weren't doing anything useful. In particular, I saw the following in "prstat -mL":

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
 15274 oracle    46  53 0.0 0.0 0.0 0.0 0.0 0.0   0  67 .22   0 racgmain/1
 15295 oracle    47  53 0.0 0.0 0.0 0.0 0.0 0.0   0  74 .23   0 racgmain/1
 16810 oracle    47  53 0.0 0.0 0.0 0.0 0.0 0.0   0  66 .23   0 racgmain/1
 25029 oracle    47  53 0.0 0.0 0.0 0.0 0.0 0.0   0  70 .23   0 racgmain/1

First, it is confusing that prstat reports these processes making less than one system call per second (.22 or .23 in the SCL column) while spending more than 50% of their time in system calls. Looks like a bug in prstat. To see which system calls were being made...

# dtrace -n 'syscall:::entry/execname == "racgmain"/ { @[probefunc] = count() }'
dtrace: description 'syscall:::entry' matched 232 probes
^C

  read                                                         269039

Ahh, so it is only reads that are causing all of this. Let's see which file descriptors are involved:

# dtrace -n 'syscall::read:entry/execname == "racgmain"/{ @[pid, probefunc,arg0] = count() }'
^C
    22469  read      9                7
    22469  read      4                7
    22469  read      3                9
    22469  read      8                9
    22469  read     10               10
    22469  read      7               21
    22469  read      6               24
    15274  read     10          1026462
    15295  read     10          1044932
    25029  read     10          1051450
    16810  read     11          1060720

It seemed rather odd that there would be so much reading on a single file descriptor in each process (10 or 11, depending on the process). If these reads were really transferring data, I would expect to see some other processes showing up in prstat that were high in %sys doing the corresponding writes. Using pfiles on each of the pids showed that the file descriptor in question was a FIFO, but gave no indication of what (if anything) was on the other end of the FIFO.
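One way to chase down the other end of a FIFO (not something I did here) is to match the dev and ino values that pfiles prints for the FIFO against the open files of other candidate processes: both ends of a pipe or FIFO refer to the same underlying inode. A minimal Python sketch of that idea, using an anonymous pipe to stand in for the FIFO:

```python
import os
import stat

# Create a pipe; both ends of a pipe/FIFO refer to the same underlying inode.
r, w = os.pipe()

rst = os.fstat(r)
wst = os.fstat(w)

# Each end really is a FIFO as far as fstat() is concerned...
assert stat.S_ISFIFO(rst.st_mode)
assert stat.S_ISFIFO(wst.st_mode)

# ...and the (dev, ino) pair is identical on both ends, which is what
# lets you match up the two sides across different processes' pfiles output.
assert (rst.st_dev, rst.st_ino) == (wst.st_dev, wst.st_ino)
print("both ends share dev/ino:", (rst.st_dev, rst.st_ino))

os.close(r)
os.close(w)
```

So in principle, running pfiles over the other processes on the box and looking for the same ino would have identified the peer, if one existed.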

To help answer the question of what might be on the other end of the FIFO, I tried this (ok, not quite a one-liner...)

#!/usr/sbin/dtrace -qs 

syscall::read:return
/execname == "racgmain"/
{
        @c[pid, arg0] = count()
}

tick-1s
{
        printf("%6s %6s %4s\n", " pid  ", " size ", "count");
        printa("%6d %6d %@4d\n", @c);
        clear(@c);
}

That gave the following output...

 pid    size  count
 15295      0 117524
 15274      0 121065
 25029      0 121308
 16810      0 125358
 pid    size  count
 15274      0 108177
 25029      0 109639
 15295      0 115735
 16810      0 117175
 pid    size  count
 15295      0 119411
 15274      0 119471
 25029      0 120273
 16810      0 120650

Imagine that... these processes are each doing 100,000+ zero-byte reads per second. A zero-byte return from read on a FIFO means end-of-file (the write end is gone), so these daemons are apparently spinning on EOF rather than handling it. Time to tell the DBAs that they have some real problems with one of their daemons.
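For what it's worth, the busy-loop failure mode is easy to demonstrate: once the write end of a pipe or FIFO is closed, every read returns zero bytes immediately instead of blocking, so a reader that doesn't treat a zero-byte read as EOF will spin flat out. A quick sketch (in Python rather than C, but the underlying read(2) semantics are the same):

```python
import os

r, w = os.pipe()   # an anonymous pipe behaves like a FIFO here
os.close(w)        # close the write end: the reader now sees EOF

# Every read now returns zero bytes immediately -- it never blocks.
# A loop that doesn't treat b'' as EOF will spin exactly like the
# racgmain processes above, burning CPU in %sys.
for _ in range(5):
    data = os.read(r, 4096)
    assert data == b""   # zero bytes: EOF, not "no data yet"

print("5 reads, all returned 0 bytes without blocking")
os.close(r)
```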
