mystery NCSA httpd problems on gnn.com

John Labovitz (johnl@ora.com)
Sat, 28 Jan 1995 02:01:55 +0100


The server machine that hosts Global Network Navigator
(gnn.com, aka nearnet.gnn.com) has been having some very
mysterious problems over the past couple of weeks, and so
far a solution has eluded us.

We've got a 2-processor Sun SPARCstation 20 with 128 MB of
memory, running Solaris 2.3. The HTTP server is stock
NCSA httpd 1.3. The machine is located at the NEARnet
office in Cambridge, on their Ethernet, with access
to their T3 network connection(s?).

Most of the time, the server works well. But occasionally,
connections made to the HTTP server will hang, with
xmosaic saying 'Making HTTP connection...' and eventually
timing out. If I cancel the connection (click the
spinning earth) and try again, sometimes the connection
will succeed, but more often it will hang again. If I
retry 10-20 seconds later, the connection will usually
succeed. Oddly, either I get a connection quickly, or
it never comes through.
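
One way to put numbers on this symptom is to time bare TCP
connections to the server's port from a client machine. The
following is only a minimal sketch, assuming Python is available
on the client; the names probe and summarize are hypothetical
helpers, not part of httpd or xmosaic, and the 1-second "fast"
threshold is an arbitrary assumption.

```python
import socket
import time


def probe(host, port=80, timeout=5.0):
    """Attempt one TCP connection; return (connected, seconds_elapsed)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (timeouts included) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start


def summarize(results, fast=1.0):
    """Count probes that connected quickly, connected slowly, or failed.

    `fast` is an assumed threshold in seconds, chosen for illustration.
    """
    counts = {"fast": 0, "slow": 0, "failed": 0}
    for ok, elapsed in results:
        if not ok:
            counts["failed"] += 1
        elif elapsed <= fast:
            counts["fast"] += 1
        else:
            counts["slow"] += 1
    return counts
```

Calling probe("gnn.com") every few seconds and passing the
collected results to summarize would show whether connections
really fall into just two groups (quick or never), as described
above, with few or none in the "slow" bucket.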

So it seems to be some kind of cycle -- the server hangs,
it eventually times out, connections work, and then
eventually it hangs again. My guess is that the main
httpd process -- the parent that forks a child for each request --
is hanging or sleeping for some reason, and not waking
up to take incoming requests. It times out, eventually,
and traffic starts flowing again.

It may have something to do with the amount of traffic --
yesterday's HTTP access log contained 279,471 requests
(an average of about 3.2 per second). Ironically, it was
the busiest day in GNN's history!
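
For reference, the per-second figure follows directly from the
log count. A small sketch of the arithmetic, assuming the log
covers one full 24-hour day (average_rate is a hypothetical
helper name):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(requests, seconds=SECONDS_PER_DAY):
    """Return the mean number of requests per second over the period."""
    return requests / seconds


# Request count taken from the access log figure quoted above.
rate = average_rate(279_471)
print(f"{rate:.2f} requests/second on average")
```

This is only a daily mean; it says nothing about how high the
rate gets during the busiest minutes of the day.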

The machine doesn't appear to be heavily loaded; the
load average rarely gets to 1.0. Memory usage seems to
be OK (although it is getting tight, with only 10-20 MB
of free RAM).

Since we've been getting complaints from users all
over the net, we feel this is not a problem related
to the network connection between the O'Reilly offices
and NEARnet. Also, pinging the host seems to work
regardless of the state of the HTTP server, which
suggests a problem in httpd rather than in the
network connection or kernel.

In general, the machine has been running quite well
for the last 6 months, except for recently, when we
lost a disk on our RAID array. The relationship between
the current problems and the bad disk may just be
coincidental, since we *think* we've gotten the disk
system back to normal, and since this seems to affect
only httpd.

Has anyone else noticed problems with NCSA httpd
under these kinds of conditions? Any thoughts?
Suggestions? Magic invocations? ;)

--
John Labovitz
Technical Services Manager, Global Network Navigator <http://gnn.com/>
O'Reilly & Associates, Sebastopol, California, USA (+1 707 829 0515)