Re: mystery NCSA httpd problems on gnn.com

Robert S. Thau (rst@ai.mit.edu)
Mon, 30 Jan 1995 22:22:08 +0100


Date: Mon, 30 Jan 1995 10:03:16 -0800
From: Rob McCool <robm@neon.mcom.com>
Cc: Multiple recipients of list <www-talk@www0.cern.ch>

Yes. We run an SGI machine with Netsite (which is not based on the
NCSA code). I don't know about NCSA httpd, but in the case of Netsite
there are processes ready to accept connections which the kernel is
not giving to them. This led us to suspect kernel-level problems.

--Rob

I believe I've seen the same sort of thing: incoming connections time
out, no new connections are logged, the CPU and disks sit idle, and the
main server process shows up as blocked in accept() if I gcore(1) it
and do a backtrace. Note that this doesn't seem to be entirely
consistent with the "accept queue backup" story --- if the accept()
queue on the socket is full to bursting, why doesn't the blocked
accept() return with one of the queued connections?

One other piece of puzzling evidence --- intense bursts of connections
don't always provoke the bug. I try to keep track of peak load here
by logging a histogram of transactions/sec vs. number-of-seconds (that
is, how many seconds saw each transaction rate). We routinely log
bursts of >10 transactions/sec a few times a day even on weekends,
when this sort of "freeze-up" behavior doesn't seem to have been a
problem.
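
For illustration only, here is a minimal sketch of how a histogram of
transactions/sec vs. number-of-seconds could be kept. It is not the
logging code actually in use here; the names rate_hist, rate_hist_record
and MAX_RATE are hypothetical, and rates above MAX_RATE are assumed to
be lumped into the last bucket.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_RATE 64   /* assumed cap; higher rates share the last bucket */

struct rate_hist {
    time_t current_sec;                          /* second being counted */
    unsigned in_current;                         /* transactions so far in it */
    unsigned long seconds_at_rate[MAX_RATE + 1]; /* [r] = seconds that saw r */
};

void rate_hist_init(struct rate_hist *h, time_t now)
{
    memset(h, 0, sizeof *h);
    h->current_sec = now;
}

/* Close out every second that ended before `now`. */
void rate_hist_advance(struct rate_hist *h, time_t now)
{
    unsigned bucket;

    if (now <= h->current_sec)
        return;
    bucket = h->in_current > MAX_RATE ? MAX_RATE : h->in_current;
    h->seconds_at_rate[bucket]++;
    /* Seconds skipped between the two timestamps saw no transactions. */
    h->seconds_at_rate[0] += (unsigned long)(now - h->current_sec - 1);
    h->current_sec = now;
    h->in_current = 0;
}

/* Call once per completed transaction. */
void rate_hist_record(struct rate_hist *h, time_t now)
{
    rate_hist_advance(h, now);
    h->in_current++;
}

/* Write one "rate seconds" line per non-empty bucket. */
void rate_hist_dump(const struct rate_hist *h, FILE *out)
{
    int r;

    for (r = 0; r <= MAX_RATE; r++)
        if (h->seconds_at_rate[r] != 0)
            fprintf(out, "%d %lu\n", r, h->seconds_at_rate[r]);
}
```

A server would call rate_hist_record(&h, time(NULL)) after each
transaction and dump the table periodically; a burst of >10
transactions/sec then shows up as a nonzero count in bucket 11 or above.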

Incidentally, killing off the server process and restarting it always
gets things moving again (at least it does here), so that action seems
to clear whatever inside the kernel is causing the bottleneck.

This led me to the wild-ass guess that there was some kind of race
condition inside the kernel leading to a missed wakeup call. On that
theory I put a select() with a one-second timeout in front of the
accept() in standalone_main --- the idea being that if a wakeup() was
missed, the timeout (and subsequently reentering the select) would
rescue the server.
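
A rough sketch of that select()-before-accept() arrangement follows. It
is not the actual patch to standalone_main; the function names
wait_readable and accept_with_poll are hypothetical, and the listening
socket is assumed to be an ordinary blocking one.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_sec seconds for fd to become
 * readable. Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, long timeout_sec)
{
    fd_set readfds;
    struct timeval tv;
    int rc;

    do {
        /* select() may modify both arguments, so rebuild them each time. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/* Hypothetical stand-in for a bare accept(): never sleeps in the kernel
 * for more than about one second at a stretch. */
int accept_with_poll(int listen_fd)
{
    for (;;) {
        int conn;
        int rc = wait_readable(listen_fd, 1);

        if (rc < 0)
            return -1;
        if (rc == 0)
            continue;            /* timed out: go around and re-enter select() */

        conn = accept(listen_fd, NULL, NULL);
        if (conn >= 0)
            return conn;
        /* The connection may have gone away between select() and accept(). */
        if (errno != EINTR && errno != EAGAIN &&
            errno != EWOULDBLOCK && errno != ECONNABORTED)
            return -1;
    }
}
```

The one-second timeout only helps if the suspected missed wakeup affects
the sleeping process and not the kernel's own idea of whether the socket
is readable; if the latter is wrong, select() will keep timing out too.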

That hack seems to have helped matters, but I'm not sure that it's
gotten rid of the freeze-ups entirely --- I spotted something which
looked an awful lot like the same old freeze on Friday, although this
time the process was waiting in select(). If the bug keeps on showing
up at an annoying rate, the next thing I'll try is closing and
reopening the socket if no connection requests have come in for ten
seconds or so, but that seems a little drastic.
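
A sketch of what the close-and-reopen fallback might look like, under
stated assumptions: open_listener, reopen_listener, idle_too_long and
IDLE_LIMIT_SEC are hypothetical names, the backlog of 5 is arbitrary,
and none of this has been tried against the actual bug.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define IDLE_LIMIT_SEC 10   /* "ten seconds or so" */

/* Create, bind and listen on a TCP socket. Returns the fd, or -1. */
int open_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    /* Allow an immediate rebind of the same port after a close. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 5) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Nonzero once no connection has arrived for IDLE_LIMIT_SEC seconds. */
int idle_too_long(time_t last_conn, time_t now)
{
    return now - last_conn >= IDLE_LIMIT_SEC;
}

/* Throw away the old listening socket and make a fresh one. */
int reopen_listener(int old_fd, unsigned short port)
{
    close(old_fd);
    return open_listener(port);
}
```

The main loop would note the time of each successful accept() and call
reopen_listener() whenever idle_too_long() fires. Closing a listening
socket discards any connections still sitting in its queue, which is
part of why this counts as drastic.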

rst