Wednesday, February 15, 2006

hunting bugs in filebench

I've been using filebench a bit at work and decided that I would like to try a few things out at home. My home machine is not quite as beefy as the V40z's that I have been testing on at work. Getting filebench to compile in the first place is a bit of work. Probably works really well on someone else's system, but mine is obviously different. That's another story though. After compiling filebench, I ran it for the first time and saw this:

$ /opt/filebench/bin/filebench
Segmentation fault (core dumped)
Bummer. Well, let's see where that is at:
$ gdb /opt/filebench/bin/filebench core
GNU gdb 6.4-debian

. . .

(gdb) where
#0  0x37dd84aa in memset () from /lib/tls/i686/cmov/libc.so.6
#1  0x0807b01e in ?? ()
#2  0x080522da in ipc_init () at ipc.c:264
#3  0x08058bc1 in main (argc=1, argv=0x3f8fdcf4) at parser_gram.y:1140
OK, so let's go with the assumption that the bug is in the code listed as alpha on the web site, and not libc. So we go up the stack a couple levels.
(gdb) up 2
#2  0x080522da in ipc_init () at ipc.c:264
264             memset(filebench_shm, 0, c2 - c1);
(gdb) print filebench_shm
$1 = (filebench_shm_t *) 0xffffffff
Hmmm... 0x with a bunch of f's looks like -1. Perhaps some system call on Solaris (presumably where filebench started) returns NULL on error and on Linux it returns -1. Let's go looking for that system call.
(gdb) list
259     #endif /* USE_PROCESS_MODEL */
260
261             c1 = (caddr_t)filebench_shm;

262 c2 = (caddr_t)&filebench_shm->marker; 263 264 memset(filebench_shm, 0, c2 - c1); 265 filebench_shm->epoch = gethrtime(); 266 filebench_shm->debug_level = 2; 267 filebench_shm->string_ptr = &filebench_shm->strings[0]; 268 filebench_shm->shm_ptr = (char *)filebench_shm->shm_addr;

Nope, not there. Maybe a bit further up.
(gdb) list 250
245     #endif
246
247             if ((filebench_shm = (filebench_shm_t *)mmap(0, sizeof(filebench_shm_t),
248                     PROT_READ | PROT_WRITE,
249                     MAP_SHARED, shmfd, 0)) == NULL) {
250                     filebench_log(LOG_FATAL, "Cannot mmap shm");
251                     exit(1);
252             }
253
254     #else
It looks like mmap may be the culprit. I first asked man, but this is Linux, not Solaris. No man page for mmap! Next try google. Google comes up with this page that looks a lot like a man page. Why isn't that found on my system? Another thing for another day. Anyway, it says:
RETURN VALUE

On success, mmap returns a pointer to the mapped area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set appropriately. On success, munmap returns 0, on failure -1, and errno is set (probably to EINVAL).

Ok, so it is returning -1 because it doesn't like something. Let's see what it is trying to mmap:
(gdb) print sizeof(filebench_shm_t)
$2 = 907368000
(gdb) print sizeof(filebench_shm_t) / 1024 / 1024

$3 = 865 (gdb)

That 'splains it. It looks like it is trying to set up a shared memory segment that is 865 MB. My poor little system only has 512. FWIW, I have created a patch that addresses this one problem but I haven't had a chance to test it on Solaris yet. Unfortunately, with the patch, it just tells me that the mmap failed. It doesn't address the fact that it is trying to allocate a shared memory segment larger than the size of RAM on my system.

Update 1:

I have posted several patches to the bug tracking system at sourceforge.net. This particular one is 1432638. It turns out that mmap on Solaris also returns MAP_FAILED so the patch is simpler than I originally expected.

T:

No comments: