Friday, June 12, 2009

Kernel dives vs. sysctl

A few weeks ago I decided to take a look at porting a multiprocessor xcpustate to FreeBSD 7.2[*]. It compiled but didn't actually run, and it failed less than gracefully. It turned out that xcpustate was doing a kmem dive to pull utilization stats out of the kernel, but it wasn't detecting that the variable it was trying to find was no longer there. Let me back up a little.

Back in the day, we used to collect system performance data, in most cases, by grotting around in kernel memory. It worked more-or-less like this:
  1. identify the kernel variable containing the data you want
  2. do an nlist(3) to get the address of the data
  3. open /dev/kmem
  4. lseek(2) to the address returned by nlist
  5. read(2) the data
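For the curious, here's a minimal sketch of that procedure in C, using maxfilesperproc (the variable I experiment with below) as the target. The kernel path and symbol name here are assumptions based on a stock FreeBSD 7.x install, and reading /dev/kmem requires appropriate privileges (root or the kmem group):

#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <nlist.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    struct nlist nl[2];
    int fd, value;

    /* Steps 1-2: look up the kernel variable's address with nlist(3).
     * "/boot/kernel/kernel" is the usual kernel path on FreeBSD 7.x. */
    memset(nl, 0, sizeof(nl));
    nl[0].n_name = "maxfilesperproc";
    if (nlist("/boot/kernel/kernel", nl) != 0 || nl[0].n_value == 0)
        errx(1, "nlist: symbol not found");

    /* Step 3: open kernel memory. */
    if ((fd = open("/dev/kmem", O_RDONLY)) == -1)
        err(1, "open /dev/kmem");

    /* Steps 4-5: seek to the symbol's address and read the variable. */
    if (lseek(fd, (off_t)nl[0].n_value, SEEK_SET) == -1)
        err(1, "lseek");
    if (read(fd, &value, sizeof(value)) != (ssize_t)sizeof(value))
        err(1, "read");

    printf("maxfilesperproc = %d\n", value);
    close(fd);
    return 0;
}
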
This assumed that the person writing the utility had some knowledge of the kernel and its data structures. To my knowledge the first operating system that supported some sort of direct copyout() of selected kernel variables was UNICOS (Cray's version of Unix for their supercomputers; I used it on the X-MP). The idea eventually became ubiquitous as sysctl(3), and doing the same job (grabbing the contents of a kernel variable) became as simple as this:
  1. identify the sysctl-supported variable you want
  2. call sysctlbyname(3)
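Here's the sysctl equivalent of the same job, again using maxfilesperproc as the example; on FreeBSD the variable is exported under the name kern.maxfilesperproc. A minimal sketch, no special privileges required:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
    int value;
    size_t len = sizeof(value);

    /* Steps 1-2: name the variable and let the kernel copy it out. */
    if (sysctlbyname("kern.maxfilesperproc", &value, &len, NULL, 0) == -1)
        err(1, "sysctlbyname");

    printf("kern.maxfilesperproc = %d\n", value);
    return 0;
}
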
Here's the big surprise: the conventional wisdom has always been that lseek()ing around kmem was costly and that it would be cheaper to just copy the data out directly (although copyout() isn't free). Out of curiosity, while waiting for a phone call today, I picked a variable that is both accessible through sysctl and present in the kernel symbol table, maxfilesperproc, and implemented grabbing it in both styles. What I found surprised me a bit: the nlist/lseek/read version consistently took about half the time of a sysctlbyname(3) call. The sysctl man page does warn that converting the string form of the name to the mib form can be costly, and for now I'll assume that's where the time is being lost. I'll experiment with constructing the mib by hand later.

Update: I reinstrumented my code and found a bug (oh no!) in the code that formats the microsecond timings for printing. With that fixed: 1) the version that just does a sysctlbyname() runs two orders of magnitude (that's a lot!) faster than the code that does a kernel dive, and 2) the code that uses a preformatted mib and calls sysctl() instead of sysctlbyname() runs about 4 times faster than sysctlbyname(), or about 400 times faster than doing a kernel dive. That's a little dramatic.
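For reference, here's roughly what the preformatted-mib variant looks like: the string name is translated to its numeric mib once with sysctlnametomib(3), and each subsequent sample is just a sysctl(3) call. A sketch; the name kern.maxfilesperproc is again the FreeBSD spelling:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
    int mib[CTL_MAXNAME];
    size_t miblen = CTL_MAXNAME;
    int value;
    size_t len;

    /* One-time cost: translate the string name into its numeric mib. */
    if (sysctlnametomib("kern.maxfilesperproc", mib, &miblen) == -1)
        err(1, "sysctlnametomib");

    /* Per-sample cost: a cheap numeric lookup, no string parsing. */
    len = sizeof(value);
    if (sysctl(mib, (u_int)miblen, &value, &len, NULL, 0) == -1)
        err(1, "sysctl");

    printf("kern.maxfilesperproc = %d\n", value);
    return 0;
}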

Also, note that the memory-scrape approach can be extended to read the contents of variables in executing programs other than the kernel. Start by reading the proc table from /dev/kmem to find the address space of the process of interest, get the name list out of the executable (if you've ever wondered why strip(1) exists ... ), and go forward from there; a sketch of that first step follows.
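The proc-table walk is less painful with the kvm(3) convenience library than with raw lseek()s. Something like this (a sketch assuming FreeBSD's struct kinfo_proc field names; link with -lkvm) lists every process the kernel knows about, after which you'd nlist(3) the target's unstripped binary and read its variables much as above:

#include <sys/param.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <fcntl.h>
#include <kvm.h>
#include <limits.h>
#include <stdio.h>

int
main(void)
{
    char errbuf[_POSIX2_LINE_MAX];
    kvm_t *kd;
    struct kinfo_proc *kp;
    int cnt, i;

    if ((kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf)) == NULL) {
        fprintf(stderr, "kvm_openfiles: %s\n", errbuf);
        return 1;
    }

    /* Pull the whole proc table and print pid/command for each process. */
    if ((kp = kvm_getprocs(kd, KERN_PROC_PROC, 0, &cnt)) == NULL) {
        fprintf(stderr, "kvm_getprocs: %s\n", kvm_geterr(kd));
        return 1;
    }
    for (i = 0; i < cnt; i++)
        printf("%d\t%s\n", kp[i].ki_pid, kp[i].ki_comm);

    kvm_close(kd);
    return 0;
}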

[*]All of this, of course, doesn't address the question of whether or not FreeBSD is instrumented to be able to pull per-processor performance data out of the kernel, which it appears not to be.

Monday, June 1, 2009

Facebook performance

I continue to be surprised by Facebook's performance problems. Not just that they're not improving, but that they're actually getting worse. Maybe that's to be expected when you've got a highly interactive system with millions of active users that's written in a scripting language (PHP). But just saying "It's written in PHP" isn't really enough - it could be that the PHP interpreter has performance problems, or the scripts themselves could be inefficient.

And that doesn't even begin to address what are essentially data center questions. There may be insufficient computing or network resources, the resources they have may be badly allocated or administered, etc. I once worked on a hierarchical storage management system where users were complaining that it was taking as long as half an hour to open a file. I turned on disk tracing (this was AIX on an RS/6000) and found that namei() calls on files in one particular directory were causing a single disk head to skip all over the place. From there it was easy to determine that the directory in question contained an insane number of files, with indirect indirect indirect indirect ... blocks holding the directory. By simply reorganizing the filesystem - striping it across multiple platters and multiple spindles and relieving that one poor disk head of having to do all the work - I was able to speed up lookups by three orders of magnitude. That is to say, I got lookups down to 1/1000 of what they had been without changing a single line of code or forcing the users to change their (unfortunate) behavior. I suspect that there are lots of problems of that sort lurking inside the Facebook data center, and I'm not sure the operating systems they're using are instrumented well enough to isolate them.

But still, I can't believe that sending over a megabyte of scripts down to the user's browser has nothing to do with it. What a mess.

Thursday, May 21, 2009

An old Unix virus

Earlier this morning someone mentioned in passing that an advantage of Linux is that it can't get viruses. I think we all know that's not really true, but it seems to be a common misconception.

The first virus I ever heard about was actually a Unix virus, presented at a Usenix conference. Andrew Hume stuffed some self-replicating code into the padding of Z-format executables and passed them around Bell Labs during, if I recall correctly, a wc performance war that was taking place inside the Unix Systems Lab. It didn't do much other than attach itself to other Z-format executables, but it provided a pretty decent proof of concept.

Tuesday, May 19, 2009

"Unguessable" URLs

Yesterday I saw a tweet go by saying that some websites provide security by generating "unguessable" URLs.  It didn't say what they're trying to protect; if it's something low-value that's probably not a problem.  But it does raise the question of what's meant by "unguessable" and whether or not the people writing the code have any kind of understanding of randomness and how to get it.  It is reminiscent, to some extent, of people who post to Usenet that they've got an "unbreakable" crypto algorithm and it turns out to be XOR-based.  So, given the constraints of a URL (characters, length, etc.), how likely is it that someone who has seen a few examples of these unguessable URLs would really be unable to start generating guesses and getting hits?
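For what it's worth, generating a genuinely unguessable token isn't hard if the bytes come from the operating system's CSPRNG rather than from something like rand(3) seeded with the time of day. A minimal sketch in C, assuming arc4random_buf(3) is available (it is on FreeBSD and the other BSDs) and using a made-up example.com path:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    unsigned char buf[16];                  /* 128 bits of CSPRNG output */
    char token[2 * sizeof(buf) + 1];
    size_t i;

    /* Fill the buffer from the OS CSPRNG; with 128 random bits the odds
     * of guessing a live token are about 2^-128 per attempt, regardless
     * of which URL-safe alphabet is used to encode it. */
    arc4random_buf(buf, sizeof(buf));

    /* Hex-encode: 32 URL-safe characters. */
    for (i = 0; i < sizeof(buf); i++)
        snprintf(&token[2 * i], 3, "%02x", buf[i]);

    printf("https://example.com/doc/%s\n", token);
    return 0;
}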

In 2001 Michal Zalewski published a really fascinating paper in which he provided visualizations of the required-to-be-random TCP initial sequence numbers generated by a number of different operating systems, here.  Kevin Mitnick was able to exploit poor ISN randomness to hijack TCP sessions and break into some systems.  The combination of the high visibility of Mitnick's actions and Zalewski's striking visualizations motivated a number of OS vendors to improve their random ISN generation, and a year later Zalewski published a follow-up paper with visualizations of the revised generators.  There were still problems.  Randomness is hard.  Relying on "unguessable" URLs to protect content or transactions seems unduly risky to me.

Saturday, May 16, 2009

Cloud computing did NOT cause Google's outage

This strikes me as just wrong.  The issue here isn't that there's an inherent problem with cloud computing that led to the outage.  Indeed, cloud/distributed computing creates robustness in the face of a number of different kinds of failures if processes and services can migrate or distribute onto unaffected nodes.  This was a routing failure, one that would have been a problem whether the services were running on a single server or many servers or virtualized servers or distributed servers or ... .

There are certainly reasons to be careful about storing data on servers you don't control (see, for example, this), but I really don't think concerns about robustness should be among them.