Friday, June 12, 2009

Kernel dives vs. sysctl

A few weeks ago I decided to take a look at porting a multiprocessor xcpustate to FreeBSD 7.2[*]. It compiled, but it didn't actually run, and it failed less than gracefully. It turned out that xcpustate was doing a kmem dive to pull utilization stats out of the kernel, but it wasn't detecting that the variable it was looking for is no longer there. Let me back up a little.

Back in the day, we used to collect system performance data, in most cases, by grotting around in kernel memory. It worked more-or-less like this:
  1. identify the kernel variable containing the data you want
  2. do an nlist(3) to get the address of the data
  3. open /dev/kmem
  4. lseek(2) to the address returned by nlist
  5. read(2) the data
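
Those five steps can be sketched in C roughly as follows. This is a minimal sketch, not the xcpustate code: the helper name kmem_read_symbol and its kernel_path parameter are made up for illustration, the FreeBSD guard is only there so the file still compiles elsewhere, and reading /dev/kmem normally requires elevated privileges. On current systems kvm(3) wraps the same steps.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <nlist.h>
#endif

/*
 * Hypothetical helper: copy `len` bytes of the kernel variable named
 * `symbol` (step 1, chosen by the caller) into `buf`.
 * `kernel_path` is the image of the running kernel.
 * Returns 0 on success, -1 on any failure.
 */
int kmem_read_symbol(const char *kernel_path, const char *symbol,
                     void *buf, size_t len)
{
#ifdef __FreeBSD__
    struct nlist nl[2];
    int fd, rc = -1;

    memset(nl, 0, sizeof(nl));        /* nl[1] stays empty and ends the list */
    nl[0].n_name = (char *)symbol;

    /* Step 2: nlist(3) returns the number of unresolved names, or -1. */
    if (nlist(kernel_path, nl) != 0 || nl[0].n_value == 0)
        return -1;                    /* the check xcpustate was missing */

    /* Step 3: open kernel memory. */
    fd = open("/dev/kmem", O_RDONLY);
    if (fd < 0)
        return -1;

    /* Steps 4 and 5: seek to the symbol's address and read it. */
    if (lseek(fd, (off_t)nl[0].n_value, SEEK_SET) != (off_t)-1 &&
        read(fd, buf, len) == (ssize_t)len)
        rc = 0;

    close(fd);
    return rc;
#else
    /* No nlist(3)/kmem path assumed on other platforms. */
    (void)kernel_path; (void)symbol; (void)buf; (void)len;
    return -1;
#endif
}
```

A caller would pass something like the path of the booted kernel (often /boot/kernel/kernel on FreeBSD, but that is an assumption), the symbol name, and a buffer of the variable's size.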
This assumed that the person writing the utility had some knowledge of the kernel and kernel data structures. To my knowledge the first operating system that supported some sort of direct copyout() of some kernel variables was Unicos (Cray's version of Unix for their supercomputers; I used it on the X/MP version). The mechanism eventually became ubiquitous as sysctl(3). Doing the same job (grabbing the contents of a kernel variable) became as simple as this:
  1. identify the sysctl-supported variable you want
  2. call sysctlbyname(3)
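
For comparison, here is a minimal sketch of the two-step version for an int-valued variable. The helper name sysctl_int_by_name is made up for illustration, and the platform guard is an assumption so the file compiles on systems without sysctlbyname(3).

```c
#include <stddef.h>
#include <sys/types.h>
#if defined(__FreeBSD__) || defined(__APPLE__)
#include <sys/sysctl.h>
#endif

/*
 * Hypothetical helper: fetch an int-valued sysctl variable by its
 * dotted name. Returns 0 on success, -1 on failure.
 */
int sysctl_int_by_name(const char *name, int *out)
{
#if defined(__FreeBSD__) || defined(__APPLE__)
    int value = 0;
    size_t len = sizeof(value);

    /* The last two arguments (new value, new length) are unused for a read. */
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0 ||
        len != sizeof(value))
        return -1;

    *out = value;
    return 0;
#else
    /* No sysctlbyname(3) assumed on other platforms. */
    (void)name; (void)out;
    return -1;
#endif
}
```

Calling sysctl_int_by_name("kern.maxfilesperproc", &v) would fetch the variable used in the comparison below.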
Here's the big surprise: the conventional wisdom has always been that lseek()ing around kmem was costly and that it would be cheaper to just copy the data out directly (although copyout() isn't free). Out of curiosity, while waiting for a phone call today, I found a variable that was both accessible through sysctl and present in the kernel symbol table, maxfilesperproc, and implemented grabbing it in both styles. What I found surprised me a bit: the nlist/lseek/read version consistently took about half the time of a sysctlbyname. The sysctl man page does warn that converting the string form of the name to the mib form can be costly, and I'll assume for now that that's where the time is being lost. I'll experiment with constructing the mib by hand later.
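
The timing code isn't shown in this post. As a rough sketch of how a comparison like this might be timed, assuming gettimeofday(2) as the clock, something along these lines would do. Both function names are made up for illustration.

```c
#include <stddef.h>
#include <sys/time.h>

/*
 * Elapsed microseconds between two gettimeofday(2) samples.
 * The tv_usec difference may be negative; adding it to the scaled
 * tv_sec difference handles the borrow.
 */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/*
 * Hypothetical harness: average cost, in microseconds, of one call
 * to fn() over `iterations` calls. Returns -1.0 for a bad count.
 */
double time_calls_usec(void (*fn)(void), int iterations)
{
    struct timeval t0, t1;

    if (iterations <= 0)
        return -1.0;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < iterations; i++)
        fn();
    gettimeofday(&t1, NULL);

    return (double)elapsed_usec(&t0, &t1) / iterations;
}
```

Averaging over many iterations matters here, since a single sysctl or kmem read is close to the resolution of the clock.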

Update: I reinstrumented my code and found a bug (oh no!) in the stuff to format printing microsecond timings, and what I found was: 1) the version that just does a sysctlbyname() runs two orders of magnitude (that's a lot!) faster than the code that does a kernel dive, and 2) the code that uses a preformatted/loaded mib and calls sysctl instead of sysctlbyname runs about 4 times faster than using sysctlbyname, or about 400 times faster than doing a kernel dive. That's a little dramatic.
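
The preloaded-mib variant might look roughly like this: resolve the name once with sysctlnametomib(3), then reuse the numeric mib for every later read. This is a sketch under assumptions, not the code that was timed; the struct, the helper names, and the MIB_MAX bound are made up for illustration.

```c
#include <stddef.h>
#include <sys/types.h>
#if defined(__FreeBSD__) || defined(__APPLE__)
#include <sys/sysctl.h>
#endif

#define MIB_MAX 8   /* illustrative bound; CTL_MAXNAME is the system limit */

/* Hypothetical container for a name already converted to mib form. */
struct cached_mib {
    int    mib[MIB_MAX];
    size_t len;
};

/* Pay the string-to-mib conversion cost once. Returns 0 or -1. */
int mib_resolve(const char *name, struct cached_mib *c)
{
#if defined(__FreeBSD__) || defined(__APPLE__)
    c->len = MIB_MAX;
    return sysctlnametomib(name, c->mib, &c->len) == 0 ? 0 : -1;
#else
    (void)name; (void)c;
    return -1;
#endif
}

/* Read an int-valued variable through the cached mib. Returns 0 or -1. */
int mib_read_int(struct cached_mib *c, int *out)
{
#if defined(__FreeBSD__) || defined(__APPLE__)
    int value = 0;
    size_t len = sizeof(value);

    if (sysctl(c->mib, (unsigned int)c->len, &value, &len, NULL, 0) != 0 ||
        len != sizeof(value))
        return -1;

    *out = value;
    return 0;
#else
    (void)c; (void)out;
    return -1;
#endif
}
```

A polling tool would call mib_resolve() once at startup and mib_read_int() on every sample, which is where the saving over sysctlbyname() comes from.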

Also, note that the memory scrape approach can be extended to read the contents of variables in executing programs other than the kernel. Start with reading the proc table from /dev/kmem to get the address space of the process of interest, get the name list out of the executable (if you've ever wondered why strip(1) exists ... ) and go forward from there.

[*]All of this, of course, doesn't address the question of whether or not FreeBSD is instrumented to be able to pull per-processor performance data out of the kernel, which it appears not to be.
