zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sat Sep 02 2000 - 03:45:41 EST
On Sat, 2 Sep 2000, Dan Maas wrote:
> There are various other tricks that can be done to speed up network
> servers, like passing files directly from the buffer cache to the
> network card. This one is currently frowned upon by the Linux
> community, [...]
FYI, the TUX patch (released yesterday) includes a lightweight zero-copy
TCP implementation for the 2.4 Linux kernel. The interface is not yet
exported to user-space (simply because TUX uses it from kernel-space so
the user-space bits were not needed), but the network driver framework and
TCP-stack bits are there, so the hard part is done. The two most widely
used gigabit drivers are 'converted' to support zero-copy, the SysKonnect
and the Acenic driver (the modifications are well tested). I plan to add
the user-space bits in the near future.
Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
From: Jes Sorensen (jes@linuxcare.com)
Date: Sat Sep 02 2000 - 16:20:48 EST
>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes:
Ingo> On Sat, 2 Sep 2000, Dan Maas wrote:
>> There are various other tricks that can be done to speed up network
>> servers, like passing files directly from the buffer cache to the
>> network card. This one is currently frowned upon by the Linux
>> community, [...]
Ingo> FYI, the TUX patch (released yesterday) includes a lightweight
Ingo> zero-copy TCP implementation for the 2.4 Linux kernel. The
Ingo> interface is not yet exported to user-space (simply because TUX
Ingo> uses it from kernel-space so the user-space bits were not
Ingo> needed), but the network driver framework and TCP-stack bits are
Ingo> there, so the hard part is done. The two most widely used
Ingo> gigabit drivers are 'converted' to support zero-copy, the
Ingo> SysKonnect and the Acenic driver (the modifications are well
Ingo> tested). I plan to add the user-space bits in the near future.
Could you comment a bit on the design you used or do I have to go read
the code? Some of us had a good chat at OLS about how to do zero copy
TCP xmits by kiobufifying the skb's.
Jes
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 16:25:48 EST
The entire Linux Network subsystem needs an overhaul. The code copies
data all over the place. I am at present pulling it apart and porting it
to MANOS, and what a mess indeed. In NetWare, the only time data ever
gets copied from incoming packets is:
1. A copy to userspace at a stream head.
2. An incoming write that gets copied into the file cache.
Reads from cache are never copied. In fact, the network server locks a
file cache page and sends it unaltered to the network drivers and DMA's
directly from it. Since NetWare has WTD's these I/O requests get
processed at the highest possible priority. In networking, the enemy is
LATENCY for fast performance. That's why NetWare can handle 5000 users
and Linux barfs on 100 in similar tests. Copying increases latency,
as do the long code paths in the Linux Network layer.
Jeff
Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 16:35:11 EST
> to MANOS, and what a mess indeed. In NetWare, the only time data ever
> gets copied from incoming packets is:
>
> 1. A copy to userspace at a stream head.
> 2. An incoming write that gets copied into the file cache.
Sounds like Linux - one DMA and one copy to user space.
> Reads from cache are never copied. In fact, the network server locks a
> file cache page and sends it unaltered to the network drivers and DMA's
> directly from it. Since NetWare has WTD's these I/O requests get
Doesn't work with IP - you have to be able to checksum the data. For the
recent cards that can handle this, have a look at TUX. The work is there, ready
for 2.5.
Alan
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 16:45:42 EST
Alan Cox wrote:
>
> > to MANOS, and what a mess indeed. In NetWare, the only time data ever
> > gets copied from incoming packets is:
> >
> > 1. A copy to userspace at a stream head.
> > 2. An incoming write that gets copied into the file cache.
>
> Sounds like Linux - one DMA and one copy to user space.
Alan, Please. I'm in your code and there are copies all over the
place. I agree you have a "fast path" for most stuff, but there's all
kinds of handle lookups, linear list searching like
while (x)
{
x = x->next;
}
all over the place that increases latency. Not to mention the overhead
of the type of interrupt and trap gates that suck up about 50 clocks to
fetch the IDT, PDE, and GDT tables for every interrupt. NetWare copies
nothing in TCPIP except at the stream head. Why do you need to copy
data to checksum an IP packet anyway? I noticed you do the right
thing and keep the headers and data as separate fragments during header
construction, so why do you need to copy data for checksumming?
Jeff
Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:10:25 EST
> > Sounds like Linux - one DMA and one copy to user space.
>
> Alan, Please. I'm in your code and there are copies all over the
> place. I agree you have a "fast path" for most stuff, but there's all
There aren't copies all over the place for the paths that actually occur, like
99.999% of the time. Fragmented packets don't happen except for NFS (which is a
rather broken protocol anyway).
One DMA, one copy to user space
> kinds of handles lookups, linear list searching like
>
> while (x)
> {
> x = x->next
> }
Timers are constructed to be close to O(1), the TCP hash isn't a linear lookup,
and the socket operations from user space use file-> dereferences, not a lookup.
> nothing in TCPIP except at the stream head. Why do you need to copy
> data anyway to checksum an IP packet anyway? I noticed you do the right
> thing and keep the headers and data as separate fragments during header
> construction, so why do you need to copy data for checksumming?
We don't copy for checksumming. We fold the single user space copy and the
checksum operation into one path, because on any modern CPU it costs precisely
the same to copy as to copy/checksum.
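
For illustration only (this is not the kernel's routine): a minimal sketch of folding the copy and the checksum into one pass, in portable C rather than the hand-tuned assembly a real stack would use. The name copy_and_checksum is hypothetical. It computes the standard 16-bit Internet checksum (RFC 1071) while copying, so each source byte is loaded once and used for both the store and the sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst and return the 16-bit Internet checksum
 * (ones' complement of the ones' complement sum, RFC 1071) of the data.
 * Hypothetical helper for illustration; not a kernel function. */
static uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;
    size_t i;

    assert(dst != NULL && src != NULL);

    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];                              /* the copy ...          */
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1];  /* ... and the sum, once */
    }
    if (i < len) {                  /* odd trailing byte, padded with zero */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
    }
    while (sum >> 16)               /* fold carries into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7 (the example data from RFC 1071) the 16-bit words sum to 0x2ddf0, which folds to 0xddf2, so the function returns 0x220d.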
I don't think you've actually sat and instrumented the TCP code
Alan
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:20:58 EST
Alan Cox wrote:
>
> > > Sounds like Linux - one DMA and one copy to user space.
> >
> > Alan, Please. I'm in your code and there are copies all over the
> > place. I agree you have a "fast path" for most stuff, but there's all
>
> There arent copies all over the case for the paths that occur. Like 99.999%
> of the time. Fragmented packets dont happen except for NFS (which is a rather
> broken protocol anyway).
There are.
>
> One DMA, one copy to user space
>
> > kinds of handles lookups, linear list searching like
> >
> > while (x)
> > {
> > x = x->next
> > }
>
> timers are constructed to be close to O(1), the tcp hash isnt a linear lookup,
> the socket operations from user space use file-> dereferences not a lookup
It is if there's a hash collision.
>
> > nothing in TCPIP except at the stream head. Why do you need to copy
> > data anyway to checksum an IP packet anyway? I noticed you do the right
> > thing and keep the headers and data as separate fragments during header
> > construction, so why do you need to copy data for checksumming?
>
> We dont copy for checksumming. We fold the single user space copy and the
> checksum operation into one path, because on any modern CPU it costs precisely
> the same to copy as to copy/checksum.
>
> I don't think you've actually sat and instrumented the TCP code
In Linux, no, in Netware, yes. I'm in your TCP code now and it's
fairly large.
Jeff
>
> Alan
>
Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:21:13 EST
> > There arent copies all over the case for the paths that occur. Like 99.999%
> > of the time. Fragmented packets dont happen except for NFS (which is a rather
> > broken protocol anyway).
>
> There are.
You forgot to cite them
> > the socket operations from user space use file-> dereferences not a lookup
>
> It is if there's a hash collision.
So you want to compute a perfect hash from unknown data which may also come
from a hostile party attacking your hash function. If you can do that, stop off
and claim a PhD.
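
For illustration only: a minimal sketch of a chained hash table keyed on the connection 4-tuple, to show where the "while (x) x = x->next" walk actually happens. The list walk covers only the entries that collided in one bucket, not every connection. All names here (struct conn, conn_hash, NBUCKETS) are hypothetical and the mixing function is a toy; none of it is taken from the Linux TCP code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 256            /* hypothetical size; must be a power of two */

struct conn {
    uint32_t saddr, daddr;      /* source and destination address */
    uint16_t sport, dport;      /* source and destination port */
    struct conn *next;          /* next entry that hashed to the same bucket */
};

static struct conn *buckets[NBUCKETS];

/* Toy mixing function; a real stack would want something harder to attack. */
static unsigned int conn_hash(uint32_t saddr, uint16_t sport,
                              uint32_t daddr, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);

    h ^= h >> 16;
    h ^= h >> 8;
    return h & (NBUCKETS - 1);
}

/* Push the entry onto the front of its bucket's chain. */
static void conn_insert(struct conn *c)
{
    unsigned int b;

    assert(c != NULL);
    b = conn_hash(c->saddr, c->sport, c->daddr, c->dport);
    c->next = buckets[b];
    buckets[b] = c;
}

/* The only list walk is over the entries that collided in one bucket. */
static struct conn *conn_lookup(uint32_t saddr, uint16_t sport,
                                uint32_t daddr, uint16_t dport)
{
    struct conn *c = buckets[conn_hash(saddr, sport, daddr, dport)];

    while (c != NULL) {
        if (c->saddr == saddr && c->sport == sport &&
            c->daddr == daddr && c->dport == dport)
            return c;
        c = c->next;
    }
    return NULL;
}
```

With a reasonable hash and enough buckets the chains stay short, so the expected lookup cost is close to constant. A collision-free hash over arbitrary, possibly hostile input is the part that cannot be guaranteed.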
Alan
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:28:18 EST
Alan Cox wrote:
>
> We dont copy for checksumming. We fold the single user space copy and the
> checksum operation into one path, because on any modern CPU it costs precisely
> the same to copy as to copy/checksum.
You stated in an earlier message you copied the data when you calculated
the TCPIP checksum? Now you say you don't. Perhaps I misunderstood.
>
> I don't think you've actually sat and instrumented the TCP code
The TCPIP stack in Wolf Mountain has my name as the author, and it was
one of the nastiest projects I've ever done. OSPF routing is a bitch
BTW. Try again.
>
> Alan
>
Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:30:19 EST
> You stated in an earlier message you copied the data when you calculated
> the TCPIP checksum? Now you say you don't. Perhaps I misunderstood.
We do a single copy/checksum from user space. You have to do the copy because
the packet may not be DMAable, may not be aligned for most PCI hardware, and
numerous other things. Since that copy costs as much as the checksum, it's
effectively free in the checksum computation. It also avoids considerable
complexity on the TCP paths when you need to retransmit.
> > I don't think you've actually sat and instrumented the TCP code
>
> The TCPIP stack in Wolf Mountain has my name as the author, and it was
The Linux TCP code..
> one of the nastiest projects I've ever done. OSPF routing is bitch
> BTW. Try again.
OSPF is a matter of getting the graph theory right.
Re: zero-copy TCP
From: Andi Kleen (ak@suse.de)
Date: Sat Sep 02 2000 - 17:39:38 EST
On Sat, Sep 02, 2000 at 04:28:18PM -0600, Jeff V. Merkey wrote:
>
>
> Alan Cox wrote:
> >
> > We dont copy for checksumming. We fold the single user space copy and the
> > checksum operation into one path, because on any modern CPU it costs precisely
> > the same to copy as to copy/checksum.
>
> You stated in an earlier message you copied the data when you calculated
> the TCPIP checksum? Now you say you don't. Perhaps I misunderstood.
Linux always does a single copy for TCP, and the checksum is folded into
that. Doing just the checksum alone wouldn't be much less costly.
[Note this is only true for 2.4 in the fast path. 2.2 RX usually does the
checksum and the copy-to-user separately, unless you have hardware RX checksumming.
For TX we always do a single copy/checksum out of user space, or out of the
page cache when you use sendfile or mmap.]
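
For illustration only: a minimal user-space sketch of the sendfile path mentioned above, using the Linux sendfile(2) call. The helper name send_file_to is hypothetical. The application never touches the file data; whether the kernel then copies and checksums out of the page cache or hands the pages to the card depends on the kernel and the driver.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the whole file at path to out_fd (normally a connected TCP socket).
 * Returns the number of bytes sent, or -1 on error.
 * Hypothetical helper for illustration. */
static ssize_t send_file_to(int out_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int in_fd;

    assert(path != NULL);
    in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }
    while (offset < st.st_size) {
        /* The kernel advances offset by the number of bytes it sent. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0) {           /* error, or the file shrank under us */
            close(in_fd);
            return -1;
        }
    }
    close(in_fd);
    return (ssize_t)offset;
}
```

A server would call send_file_to(client_socket, path) after writing its response headers. The loop is there because sendfile may send fewer bytes than requested.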
-Andi
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:47:33 EST
Andi Kleen wrote:
>
> On Sat, Sep 02, 2000 at 04:28:18PM -0600, Jeff V. Merkey wrote:
> >
> >
> > Alan Cox wrote:
> > >
> > > We dont copy for checksumming. We fold the single user space copy and the
> > > checksum operation into one path, because on any modern CPU it costs precisely
> > > the same to copy as to copy/checksum.
> >
> > You stated in an earlier message you copied the data when you calculated
> > the TCPIP checksum? Now you say you don't. Perhaps I misunderstood.
>
> Linux always does a single copy for TCP, and the checksum is folded into
> that. Doing just the checksum alone wouldn't be much less costly.
>
> [Note this is only true for 2.4 in the fast path, 2.2 RX usually does
> checksum and copy-to-user separated, unless you have hardware RX checksumming
>
> For TX we always do a single copy checksum out of user space or out of the
> page cache when you use sendfile or mmap]
This makes sense.
Jeff
>
> -Andi
Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sun Sep 03 2000 - 03:29:50 EST
On Sat, 2 Sep 2000, Jeff V. Merkey wrote:
> while (x)
> {
> x = x->next
> }
>
> all over the place that increases latency. [...]
i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If
it's all over the place and if it increases latency, you certainly can
show at least one such place.
Ingo
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 05:14:10 EST
Ingo,
When I have time to do this exercise, I will. I've finished merging
Alan's Code into MANOS (completed last night). Most of the cases I saw
where there were copies were not fast path. It takes some time to go
through all this code you guys have written. It is actually looking
good.
Jeff
Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 05:39:03 EST
On Tue, 5 Sep 2000, Jeff V. Merkey wrote:
> > > while (x)
> > > {
> > > x = x->next
> > > }
> > >
> > > all over the place that increases latency. [...]
> >
> > i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If
> > it's all over the place and if it increases latency, you certainly can
> > show at least one such place.
>
> When I have time to do this exercise, I will. [...]
well, your original claim (quoted above) shows that you have identified
numerous such places already, so you don't have to do any additional
'exercise'. The "all over the place" code shouldn't be too hard to find
again - please just say filename and line number in any kernel version of
your choice and we'll look into it.
Ingo
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 05:58:10 EST
Alright Ingo, you asked for it. I am going through it now and going
over ALL my notes. I will catalog ALL of them and post it. Is this
what you really want?
:-)
Jeff
Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:15:25 EST
On Tue, 5 Sep 2000, Jeff V. Merkey wrote:
> Alright Ingo, you asked for it. I am going through it now and going
> over ALL my notes. I will catalog ALL of them and post it. Is this
> what you really want?
yes, this would be the best indeed, to get those places fixed. But if you
don't want to spend your time on that then it's enough to just post a
single incident of such inefficiency and list-walking that impacts latency
like you claim.
Ingo
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:09:10 EST
The origin of this comment was related to a comparison of the
MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's
code handles fast paths well and from what I've seen is comparable to
NetWare. The areas I saw were sideband cases and issues of fragment
re-assembly. It's as good as what's in NetWare.
Jeff
Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:41:05 EST
On Tue, 5 Sep 2000, Jeff V. Merkey wrote:
> The origin of this comment was related to a comparison of the
> MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's
> code handles fast paths well and from what I've seen is comparable to
> NetWare. [...]
can we thus take this as a retraction of your below quoted three
derogatory comments?
" The entire Linux Network subsystem needs an overhaul. "
" In networking, the enemy is LATENCY for fast performance. That's why
NetWare can handle 5000 users and Linux barfs on 100 in similar tests.
Copying increases latency, as do the long code paths in the Linux Network
layer. "
" Alan, Please. I'm in your code and there are copies all over the
place. I agree you have a "fast path" for most stuff, but there's all
kinds of handles lookups, linear list searching like
while (x)
{
x = x->next
}
all over the place that increases latency. "
Ingo
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:41:35 EST
Ingo Molnar wrote:
>
> On Tue, 5 Sep 2000, Jeff V. Merkey wrote:
>
> > The origin of this comment was related to a comparison of the
> > MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's
> > code handles fast paths well and from what I've seen is comparable to
> > NetWare. [...]
>
> can we thus take this as a retraction of your below quoted three
> derogatory comments?
>
> " The entire Linux Network subsystem needs an overhaul. "
To support the performance metrics of NetWare, there are some changes I
will make that will allow Alan's code to beat Native NetWare. One is
allowing pre-scan protocol stacks to exist. Another is a WTD
optimization to allow Alan's code to tag pages in the page cache and
post them with a preemptive IO WTD. Another is moving ALL of the
routing code into the kernel space. Another is consolidation of bottom
and top halves to allow a single interrupt thread to run all the way into
the router and out without the need to schedule. Another is moving the
NCP server into the kernel. Another is enabling "gang" tagging and
release of a single cache page by hundreds or thousands of users at one
time for incoming reads. The list is very long.
>
> " In networking, the enemy is LATENCY for fast performance. That's why
> NetWare can handle 5000 users and Linux barfs on 100 in similar tests.
> Copying increases latency, as do the long code paths in the Linux Network
> layer. "
>
> " Alan, Please. I'm in your code and there are copies all over the
> place. I agree you have a "fast path" for most stuff, but there's all
> kinds of handles lookups, linear list searching like
>
> while (x)
> {
> x = x->next
> }
>
> all over the place that increases latency. "
>
> Ingo
I already said this code is more than suitable, and better yet, it's
something folks are familiar with in Linux. Alan and I went over some
of this off line. Sorry you missed it.
Jeff
Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:16:19 EST
btw., - the maintainers of the 2.4 networking and TCP/IP code are Alexey
Kuznetsov and David S. Miller - please direct your findings towards them,
not me :-)
Ingo
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:10:28 EST
You opened your mouth.
:-)
Jeff
Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sun Sep 03 2000 - 03:28:18 EST
On Sat, 2 Sep 2000, Jeff V. Merkey wrote:
> Alan, Please. I'm in your code and there are copies all over the
> place. I agree you have a "fast path" for most stuff, but there's
> all kinds of handles lookups, linear list searching like
have you ever bothered actually measuring the impact? I have. Is the Linux
kernel perfect? Not at all. I don't understand why you take this as a
personal insult - you are certainly free to add your improvements, no
insults or patronizing is necessary, this is a technical forum.
Ingo
Re: zero-copy TCP
From: Andi Kleen (ak@suse.de)
Date: Sat Sep 02 2000 - 17:02:27 EST
On Sat, Sep 02, 2000 at 10:35:11PM +0100, Alan Cox wrote:
> > to MANOS, and what a mess indeed. In NetWare, the only time data ever
> > gets copied from incoming packets is:
> >
> > 1. A copy to userspace at a stream head.
> > 2. An incoming write that gets copied into the file cache.
>
> Sounds like Linux - one DMA and one copy to user space.
Granted, for NFS over UDP it is usually more, because of the defragmentation
pass. That will be fixed in 2.5; the code is already written and just needs
to be ported to kiobufs. 2.4 NFSD at least receives directly into the
page cache, unlike 2.2 (so it'll do two copies, usually three on alpha).
Samba probably does more copies though, I don't think it receives directly
into a mmap'ed buffer (so there are at least two copies to write something
to disk).
-Andi
Re: zero-copy TCP
From: Jes Sorensen (jes@linuxcare.com)
Date: Sat Sep 02 2000 - 16:40:18 EST
>>>>> "Jeff" == Jeff V Merkey <jmerkey@timpanogas.com> writes:
Jeff, could you start by learning to quote email and not send a full
copy of the entire email you reply to (read rfc1855).
Jeff> The entire Linux Network subsystem needs an overhaul. The code
Jeff> copies data all over the place. I am at present pulling it apart
Jeff> and porting it to MANOS, and what a mess indeed. In NetWare, the
Jeff> only time data ever gets copied from incoming packets is:
Try and understand the code before you make such bold statements.
Jeff> 1. A copy to userspace at a stream head. 2. An incoming write
Jeff> that gets copied into the file cache.
Jeff> Reads from cache are never copied. In fact, the network server
Jeff> locks a file cache page and sends it unaltered to the network
Jeff> drivers and DMA's directly from it. Since NetWare has WTD's
Jeff> these I/O requests get processed at the highest possible
Jeff> priority. In networking, the enemy is LATENCY for fast
Jeff> performance. That's why NetWare can handle 5000 users and Linux
Jeff> barfs on 100 in similar tests. Copying increases latency, as do
Jeff> the long code paths in the Linux Network layer.
You can't DMA directly from a file cache page unless you have a
network card that does scatter/gather DMA and surprise surprise,
80-90% of the cards on the market don't support this. Besides that, you
need to do copy-on-write if you want to be able to do zero copy on
write() from user space. Marking data copy-on-write is *expensive* on
x86 SMP boxes since you have to modify the TLB on all
processors. On top of that you have to look at the packet size: for
small packets a copy is often a lot cheaper than modifying the page
tables, even on UP systems, so you need a copy/break scheme here.
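
For illustration only: a minimal sketch of the copy/break decision described above. The names (choose_tx_path, COPYBREAK_BYTES) and the 256-byte threshold are assumptions made for the example; a real threshold would have to come from measurement on the target hardware.

```c
#include <assert.h>
#include <stddef.h>

#define COPYBREAK_BYTES 256     /* hypothetical threshold; tune by measurement */

enum tx_path { TX_COPY, TX_ZEROCOPY };

/* Decide how to transmit len bytes of payload.
 * nic_has_sg_dma is nonzero if the card supports scatter/gather DMA. */
static enum tx_path choose_tx_path(size_t len, int nic_has_sg_dma)
{
    assert(len > 0);

    /* Without scatter/gather the card cannot DMA header and page
     * separately, so the data has to be copied into one linear buffer. */
    if (!nic_has_sg_dma)
        return TX_COPY;

    /* Below the threshold a plain copy is assumed to be cheaper than
     * the page table and TLB work that zero copy would need. */
    if (len < COPYBREAK_BYTES)
        return TX_COPY;

    return TX_ZEROCOPY;
}
```

So choose_tx_path(64, 1) picks the copy even on a capable card, while choose_tx_path(4096, 1) picks the zero-copy path.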
As for your statement on latency, it's nice to see that you don't
know what you are talking about. Latency is one issue in fast
networking, but it's far from the only one. Latency is important for
message-passing type applications; for bulk data transfers, however, it's
less relevant, since you really want deep pipelining here and properly
written applications. If your TCP window is too small, even zero latency
will only buy you so much on a really fast network.
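
For illustration only: the window remark can be put in numbers with the usual bandwidth-delay relation, throughput <= window / round-trip time. The function names and the 1 ms round-trip time below are assumptions for the example, not measurements from this thread.

```c
#include <assert.h>

/* Upper bound on throughput, in bits per second, for one TCP connection
 * that keeps at most window_bytes in flight per round trip. */
static double window_limited_bps(double window_bytes, double rtt_seconds)
{
    assert(window_bytes > 0.0 && rtt_seconds > 0.0);
    return window_bytes * 8.0 / rtt_seconds;
}

/* Window, in bytes, needed to keep a link of link_bps full
 * (the bandwidth-delay product). */
static double window_needed_bytes(double link_bps, double rtt_seconds)
{
    assert(link_bps > 0.0 && rtt_seconds > 0.0);
    return link_bps * rtt_seconds / 8.0;
}
```

With a 65535-byte window (the largest possible without window scaling) and an assumed 1 ms round trip, window_limited_bps(65535.0, 0.001) comes to about 524 Mbit/s, roughly half of a gigabit link no matter how fast the host is. Filling the link at that round-trip time would take a window of about 125000 bytes.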
Jes
Re: zero-copy TCP
From: Jamie Lokier (lk@tantalophile.demon.co.uk)
Date: Sat Sep 02 2000 - 22:22:44 EST
Jes Sorensen wrote:
> You can't DMA directly from a file cache page unless you have a
> network card that does scatter/gather DMA and surprise surprise,
> 80-90% of the cards on the market don't support this. Besides that you
> need to do copy-on-write if you want to be able to do zero copy on
> write() from user space, marking data copy on write is *expensive* on
> x86 SMP boxes since you have to modify the tlb on all
> processors. On top of that you have to look at the packet size, for
> small packets a copy is often a lot cheaper than modifying the page
> tables, even on UP systems so you need a copy/break scheme here.
I just thought I'd mention that you can do zero copy TCP in and out
*without* any page marking schemes. All you need is a network card with
quite a lot of RAM and some intelligence. An Alteon could do it, with
extra RAM or an impressively underloaded network.
(for example) http://www.digital.com/info/DTJS05/
-- Jamie
Re: zero-copy TCP
From: Linus Torvalds (torvalds@transmeta.com)
Date: Sun Sep 03 2000 - 01:33:27 EST
In article <20000903052244.B15788@pcep-jamie.cern.ch>,
Jamie Lokier <lk@tantalophile.demon.co.uk> wrote:
>
>I just thought I'd mention that you can do zero copy TCP in and out
>*without* any page marking schemes. All you need is a network card with
>quite a lot of RAM and some intelligence. An Alteon could do it, with
>extra RAM or an impressively underloaded network.
>
>(for example) http://www.digital.com/info/DTJS05/
The thing is, that at least historically it has always been a bad bet to
bet on special-purpose hardware over general-purpose stuff.
What I'm saying is that basically you should not design your TCP layer
around the 0.1% of cards that have tons of intelligence, when you have a
general-purpose CPU that tends to be faster in the end.
The smart cards can actually have higher latency than just doing it
the "stupid" way with the CPU. Yes, they'll offload some of the
computation, and may make system throughput better, but at what cost?
[ Same old example: just calculate how quickly you can get your packet
on the wire with a smart card that does checksumming in hardware, and
do the same calculations with a CPU that does the checksums. Take into
account that the checksum is at the _head_ of the packet. The CPU will
win.
Proof: the data to be sent out is in RAM. In fact, often it is cached
in the CPU these days. In order to start sending out the packet, the
smart card has to move all of the data from RAM/cache over the bus to
the card. It can only start actually sending after that. Cost: bus
speed to copy it over.
In contrast, if you do it on the CPU, you can basically start feeding
the packet out on the net after doing a CPU checksum that is limited
by RAM/cache speeds. Bus speed isn't the limiting factor any more on
packet latency, as you can send out the start of the packet on the
network before the whole packet has even been copied over the internal
bus! ]
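The latency argument above can be put into rough numbers. What follows is a minimal sketch in C, not a measurement: the function names, the fifo_bytes parameter, and any bandwidth figures fed to it are assumptions made for illustration only.

```c
#include <assert.h>

/* Time, in microseconds, to move `bytes` at `bytes_per_sec`. */
static double transfer_us(double bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return bytes / bytes_per_sec * 1e6;
}

/*
 * Card checksums in hardware: the checksum lives in the header, so the
 * whole packet has to cross the bus before the first byte can go out.
 */
static double first_byte_us_card_csum(double pkt_bytes,
                                      double bus_bytes_per_sec)
{
    return transfer_us(pkt_bytes, bus_bytes_per_sec);
}

/*
 * CPU checksums: one pass over the data at RAM/cache speed, after which
 * the card may start transmitting while the rest of the packet is still
 * crossing the bus. `fifo_bytes` (an assumption) is how much data the
 * card wants buffered before it starts sending.
 */
static double first_byte_us_cpu_csum(double pkt_bytes,
                                     double mem_bytes_per_sec,
                                     double bus_bytes_per_sec,
                                     double fifo_bytes)
{
    return transfer_us(pkt_bytes, mem_bytes_per_sec)
         + transfer_us(fifo_bytes, bus_bytes_per_sec);
}
```

Assuming, for example, a 1500-byte packet, a bus that moves about 133 MB/s, memory that moves about 800 MB/s, and a card that starts sending once 64 bytes have arrived, the first function gives about 11.3 microseconds and the second about 2.4 microseconds. The model ignores DMA setup and interrupt costs, and it assumes the bus can keep the wire fed once transmission has started.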
So. Smart cards are not necessarily better for latency. They are
certainly not cheaper. They _are_ better for throughput, no question
about that. But so is adding another CPU. Or beefing up your memory
subsystem. Or any number of other things that are more generic than some
smart network card - and often cheaper because they are "standard
components", useful regardless of _what_ you do.
End result: smart cards only make sense in systems that are really
pushing the performance envelope. Which, after all, is not that common,
as it's usually easier to just beef up the machine in other ways until
the network is not the worst bottle-neck. Very few places outside
benchmark labs have networks _that_ studly.
Right now gigabit is heavy-duty enough that it is worth smart cards.
The same used to be true about the first generation of 100Mbit cards.
The same will be true of 10Gbps cards in another few years. But
basically, they'll probably always end up being the exception rather
than the rule, unless they become so cheap that it doesn't matter. But
"cheap" and "pushing the performance envelope" do not tend to go hand in
hand.
Linus
Re: zero-copy TCP
From: Jamie Lokier (lk@tantalophile.demon.co.uk)
Date: Sun Sep 03 2000 - 15:46:54 EST
Linus Torvalds wrote:
> Proof: the data to be sent out is in RAM. In fact, often it is cached
> in the CPU these days. In order to start sending out the packet, the
> smart card has to move all of the data from RAM/cache over the bus to
> the card. It can only start actually sending after that. Cost: bus
> speed to copy it over.
>
> In contrast, if you do it on the CPU, you can basically start feeding
> the packet out on the net after doing a CPU checksum that is limited
> by RAM/cache speeds. Bus speed isn't the limiting factor any more on
> packet latency, as you can send out the start of the packet on the
> network before the whole packet has even been copied over the internal
> bus!
Nice point! Only valid for TCP & UDP though.
When people want _real_ low latency, they don't use TCP or UDP, and they
certainly don't put data checksums at the start. They still aim for
zero copies. That pass, even over cached data, is still significant.
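When a copy cannot be avoided, one common way to keep the number of passes over the data down is to compute the checksum during the copy instead of in a separate loop. The sketch below illustrates that idea in portable C; copy_and_checksum is a hypothetical name, not a kernel interface, and the byte-at-a-time loop is written for clarity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `len` bytes from src to dst and return the 16-bit Internet
 * checksum (RFC 1071: one's-complement sum of big-endian 16-bit words,
 * then complemented) of the data. Each source byte is read only once.
 */
static uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint32_t sum = 0;
    size_t i;

    assert((dst != NULL && src != NULL) || len == 0);

    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1];
        sum = (sum & 0xffff) + (sum >> 16);   /* fold the carry back in */
    }
    if (i < len) {                            /* odd trailing byte, zero-padded */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
    }
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

This removes one pass, not the copy itself, so it does not settle the zero-copy question either way; it only shows that the checksum need not cost a pass of its own.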
> Right now gigabit is heavy-duty enough that it is worth smart cards.
> The same used to be true about the first generation of 100Mbit cards.
> The same will be true of 10Gbps cards in another few years. But
> basically, they'll probably always end up being the exception rather
> than the rule, unless they become so cheap that it doesn't matter. But
> "cheap" and "pushing the performance envelope" do not tend to go hand in
> hand.
Fair enough. Please read my description of a zero-copy scheme that
doesn't require much intelligence on the card though. I think it's a
neat kernel trick that might just pay off. Sometimes, maybe.
-- Jamie
Re: zero-copy TCP
From: Linus Torvalds (torvalds@transmeta.com)
Date: Sun Sep 03 2000 - 16:03:03 EST
On Sun, 3 Sep 2000, Jamie Lokier wrote:
>
> Nice point! Only valid for TCP & UDP though.
Yeah. But "we need oxygen" is only a valid point for carbon-based
life-forms. You might as well argue that oxygen is not a valid criterion for
being livable, because it's only valid for the particular kind of
creatures we are.
Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really
make a big selling point any more.
> When people want _real_ low latency, they don't use TCP or UDP, and they
> certainly don't put data checksums at the start. They still aim for
> zero copies. That pass, even over cached data, is still significant.
I disagree.
Look at history.
Exercise 1: name a protocol that did something like that
(yes, I know, there are multiple).
Exercise 2: name one of them that is still relevant today.
See? Performance, in the end, is very much secondary. It doesn't matter
one whit if you perform better than everybody else, if you cannot _talk_
to everybody else.
I think the RISC vendors found that out. And I think most network vendors
find that out.
(Yes, I know, you're probably talking about things like the networking
protocols for clusters etc. I'm just saying that historically such
special-purpose stuff always tends to end up being not as good as the
"real thing".)
> Fair enough. Please read my description of a zero-copy scheme that
> doesn't require much intelligence on the card though. I think it's a
> neat kernel trick that might just pay off. Sometimes, maybe.
We could certainly try to do better. But some of the schemes I've seen have
implied a lot of complexity for gains that aren't actually real in the end
(eg playing expensive games with memory mapping in order to avoid a copy
that ends up happening anyway because the particular card you're using
doesn't do scatter-gather: you'd perform a lot better if you just did the
copy outright and forgot about the expensive games - which is what Linux
does).
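The trade-off described here, together with the copy/break idea quoted earlier in the thread, can be sketched as a simple decision function. Everything in the block is hypothetical: struct nic_caps, choose_tx_path, and the copybreak threshold are illustrative names and not part of any driver or kernel API, and the right threshold would have to be measured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tx_path { TX_COPY, TX_ZERO_COPY };

/* Hypothetical description of what the network card can do. */
struct nic_caps {
    bool scatter_gather;   /* can DMA one packet from several fragments */
};

/*
 * Pick the transmit path for a payload of `len` bytes.
 * `copybreak` is an assumed tunable: below it, a plain copy is taken to
 * be cheaper than the page-table work needed to avoid it.
 */
static enum tx_path choose_tx_path(const struct nic_caps *nic, size_t len,
                                   size_t copybreak)
{
    assert(nic != NULL);

    if (!nic->scatter_gather)
        return TX_COPY;    /* a copy would happen anyway; do it up front */
    if (len < copybreak)
        return TX_COPY;    /* small packet: copying beats remapping */
    return TX_ZERO_COPY;
}
```

The point of checking the card first is that the memory-mapping work is wasted whenever the driver has to linearize the packet regardless.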
Linus
Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 05:36:05 EST
Linus Torvalds wrote:
> Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really
> make a big selling point any more.
Linus,
IPX is a really good LAN protocol (but totally sucks for internet). A
full blown NCP server in-kernel that's tightly coupled to the page
cache running over IPX would make flames shoot out of the back of a
Linux server, and make NT look like an old lady hobbling down the
street. There's no need to configure client addresses with it, and for
file and print, it's the best.
Jeff
Re: zero-copy TCP
From: Henning P. Schmiedehausen (hps@tanstaafl.de)
Date: Tue Sep 05 2000 - 08:34:02 EST
jmerkey@timpanogas.com (Jeff V. Merkey) writes:
>Linus Torvalds wrote:
>> Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really
>> make a big selling point any more.
>Linus,
>IPX is a really good LAN protocol (but totally sucks for internet). A
>full blown NCP server in-kernel that's tightly coupled to the page
>cache running over IPX would make flames shoot out of the back of a
>Linux server, and make NT look like an old lady hobbling down the
>street. There's no need to configure client addresses with it, and for
>file and print, it's the best.
And it would be a good bit of necrophilia, too.
Jeff, Netware is dead. Please leave it there. IP won. The number of
new Netware Installations (as compared to existing or just upgrades)
is close (really close) to nil.
Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH hps@intermeta.de
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 info@intermeta.de
D-91054 Buckenhof Fax.: 09131 / 50654-20
Re: zero-copy TCP
From: Dan Hollis (goemon@anime.net)
Date: Tue Sep 05 2000 - 13:25:12 EST
On 5 Sep 2000, Henning P. Schmiedehausen wrote:
> jmerkey@timpanogas.com (Jeff V. Merkey) writes:
> >IPX is a really good LAN protocol (but totally sucks for internet). A
> Jeff, Netware is dead. Please leave it there. IP won. The number of
> new Netware Installations (as compared to existing or just upgrades)
> is close (really close) to nil.
I think you mean IPX is dead. Netware *could* work over TCP or UDP.
IP is definitely king. Even micro$haft gave up on NetBEUI.
-Dan
Re: zero-copy TCP
From: Henning P. Schmiedehausen (hps@tanstaafl.de)
Date: Tue Sep 05 2000 - 14:32:46 EST
On Tue, Sep 05, 2000 at 11:25:12AM -0700, Dan Hollis wrote:
> On 5 Sep 2000, Henning P. Schmiedehausen wrote:
> > jmerkey@timpanogas.com (Jeff V. Merkey) writes:
> > >IPX is a really good LAN protocol (but totally sucks for internet). A
> > Jeff, Netware is dead. Please leave it there. IP won. The number of
> > new Netware Installations (as compared to existing or just upgrades)
> > is close (really close) to nil.
>
> I think you mean IPX is dead. Netware *could* work over TCP or UDP.
> IP is definitely king. Even micro$haft gave up on NetBEUI.
Yep, that's what I meant. Sorry that I was not clearer. But I think
that even with NetWare on IP there are not many new
installations. There is lots of migration of existing servers and
keeping existing systems alive, but new rollouts?
But then again, maybe with MANOS and OpenNetWare, everything will be
different.
Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH hps@intermeta.de
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 info@intermeta.de
D-91054 Buckenhof Fax.: 09131 / 50654-20
Re: zero-copy TCP
From: Chris Wedgwood (cw@f00f.org)
Date: Tue Sep 05 2000 - 14:20:31 EST
On Tue, Sep 05, 2000 at 03:34:02PM +0200, Henning P. Schmiedehausen wrote:
> And it would be a good bit of necrophilia, too.
> Jeff, Netware is dead. Please leave it there. IP won. The number of
> new Netware Installations (as compared to existing or just upgrades)
> is close (really close) to nil.
Sadly neither of these comments is true -- there are still a great
many NetWare installations and many of the existing installations are
far from dead as they move to IP...
--cw