Re: mystery NCSA httpd problems on gnn.com

Rob McCool (robm@neon.mcom.com)
Sat, 28 Jan 1995 06:06:05 +0100


/*
* "mystery NCSA httpd problems on gnn.com" by John Labovitz <johnl@ora.com>
* written Sat, 28 Jan 1995 02:02:27 +0100
*
* Most of the time, the server works well. But occasionally,
* connections made to the HTTP server will hang, with xmosaic saying
* 'Making HTTP connection...' and eventually timing out. If I cancel
* the connection (click the spinning earth) and try again, sometimes
* the connection will succeed, but more often it will hang again. If
* I retry 10-20 seconds later, the connection will usually succeed.
* Oddly, either I get a connection quickly, or it never comes
* through.
[...]
* So it seems to be some kind of cycle -- the server hangs, it
* eventually times out, connections work, and then eventually it
* hangs again. My guess is that the main httpd server -- the one
* that forks off for each request -- is hanging or sleeping for some
* reason, and not waking up to take incoming requests. It times out,
* eventually, and traffic starts flowing again.
[...]
* Since we've been getting complaints from users all over the net, we
* feel this is not a problem related to the network connection
* between the O'Reilly offices and NEARnet. Also, pinging the host
* seems to work regardless of the state of the HTTP server, which
* seems to signal some problem in httpd, not in the network
* connection or kernel.
*/

These problems are suspiciously similar to problems we've been having
with home.mcom.com in recent weeks. Each time a failure similar to the
one you describe happens, we've been able to isolate a backbone or
trunk line somewhere which has gone down and is causing some amount of
the Internet to get suddenly cut off from our server for indefinite
amounts of time. When these failures happen, we now use traceroute
with a set of common hosts to locate the line which is down.

Our analysis of this problem, which is confirmed by gathered evidence
and correspondence with SGI (we run IRIX web servers), is that the
incoming connection queue is being filled. Normally, TCP kernels
maintain a queue of connections which are in the process of being
negotiated. An entry in this queue is used when a browser initiates a
connection to the server machine, and is occupied until the connection
is fully negotiated and has been accepted by the server software.

When a trunk line goes down, many of these queue slots can be occupied
by connections which are in the process of negotiation. Because the
line between the server and that machine has been severed, those queue
slots will be occupied either until the line comes back up or until
two minutes have elapsed.

If the server is accepting between 20 and 30 new connections per
second as home.mcom.com often does, it does not take long to fill this
queue. We had our queue size set to 128, which was sufficient for most
of December, but recent failures along with our increased traffic have
been enough to exhaust even this size.
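
To put rough numbers on this, here is a small back-of-the-envelope
sketch. The function names are mine, and the rate of stuck connections
is an assumption: I don't have a measured figure for what share of
incoming connections sits behind a dead line, so treat the inputs as
illustrative only.

```c
/*
 * Hypothetical helpers for estimating listen-queue exhaustion.
 * "stuck_per_sec" is the rate of connection attempts that never
 * finish negotiating because the path to the client is down.
 * This is an assumed input, not a measured value.
 */

/* Seconds until a queue of `slots` entries is full, assuming no
 * stuck entry is released in the meantime. Returns -1 if nothing
 * is getting stuck. */
double seconds_to_fill(int slots, double stuck_per_sec)
{
    if (stuck_per_sec <= 0.0)
        return -1.0;
    return slots / stuck_per_sec;
}

/* Slots needed so the queue never fills if every stuck entry is
 * held for `hold_seconds` before the kernel gives up on it. */
double slots_needed(double stuck_per_sec, double hold_seconds)
{
    return stuck_per_sec * hold_seconds;
}
```

For example, at 25 connections per second with an assumed 10% of them
stuck (2.5 per second), seconds_to_fill(128, 2.5) comes to about 51
seconds, well short of the two-minute hold time, and
slots_needed(2.5, 120.0) comes to 300.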

To change your queue size, you need to raise the kernel's limit on
pending connection requests. Under BSD, this parameter is called
SOMAXCONN. Since you're using Solaris, you can use the ndd command to
set your maximum queue size higher. The parameter you want to
experiment with is tcp_conn_req_max.

Finally, you have to change your server software to ask for a larger
queue, since the kernel limit is only a ceiling on what the server
requests. The NCSA httpd, if I remember correctly, uses a queue size of
5. Search for a call to listen() and experiment with its second
argument.
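
For anyone hunting for the spot, here is a minimal sketch of what the
listening-socket setup typically looks like. This is not the actual
NCSA httpd source; open_listener is a hypothetical helper of mine, and
128 is just the value we happened to use.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * open_listener is a hypothetical helper, not code from NCSA httpd.
 * It opens a TCP socket on the given port (0 lets the kernel pick one)
 * and asks for `backlog` pending-connection slots. The kernel may
 * silently cap this at its own maximum (SOMAXCONN on BSD,
 * tcp_conn_req_max on Solaris), so both limits need raising.
 * Returns the listening descriptor, or -1 on failure.
 */
int open_listener(unsigned short port, int backlog)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Let the server rebind quickly after a restart. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    /* The second argument to listen() is the queue size to tune. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A server would call something like open_listener(80, 128) once at
startup and then accept() on the returned descriptor as before.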

If anyone else has more data on this problem, we'd love to hear about
it. I hope this helps anyone who is having similar problems with their
servers. I heard a rumor that these problems are being caused somehow
by routing, Sprint and MCI. Anybody know more?

--Rob