zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sat Sep 02 2000 - 03:45:41 EST 


On Sat, 2 Sep 2000, Dan Maas wrote: 


> There are various other tricks that can be done to speed up network 
> servers, like passing files directly from the buffer cache to the 
> network card. This one is currently frowned upon by the Linux 
> community, [...] 


FYI, the TUX patch (released yesterday) includes a lightweight zero-copy 
TCP implementation for the 2.4 Linux kernel. The interface is not yet 
exported to user-space (simply because TUX uses it from kernel-space so 
the user-space bits were not needed), but the network driver framework and 
TCP-stack bits are there, so the hard part is done. The two most widely 
used gigabit drivers are 'converted' to support zero-copy, the SysKonnect 
and the Acenic driver (the modifications are well tested). I plan to add 
the user-space bits in the near future. 


Ingo 


- 
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in 
the body of a message to majordomo@vger.kernel.org 
Please read the FAQ at http://www.tux.org/lkml/ 

Re: zero-copy TCP
From: Jes Sorensen (jes@linuxcare.com)
Date: Sat Sep 02 2000 - 16:20:48 EST 


>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes: 


Ingo> On Sat, 2 Sep 2000, Dan Maas wrote: 


>> There are various other tricks that can be done to speed up network 
>> servers, like passing files directly from the buffer cache to the 
>> network card. This one is currently frowned upon by the Linux 
>> community, [...] 


Ingo> FYI, the TUX patch (released yesterday) includes a lightweight 
Ingo> zero-copy TCP implementation for the 2.4 Linux kernel. The 
Ingo> interface is not yet exported to user-space (simply because TUX 
Ingo> uses it from kernel-space so the user-space bits were not 
Ingo> needed), but the network driver framework and TCP-stack bits are 
Ingo> there, so the hard part is done. The two most widely used 
Ingo> gigabit drivers are 'converted' to support zero-copy, the 
Ingo> SysKonnect and the Acenic driver (the modifications are well 
Ingo> tested). I plan to add the user-space bits in the near future. 


Could you comment a bit on the design you used or do I have to go read 
the code? Some of us had a good chat at OLS about how to do zero copy 
TCP xmits by kiobufifying the skb's. 


Jes 

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 16:25:48 EST 


The entire Linux Network subsystem needs an overhaul. The code copies 
data all over the place. I am at present pulling it apart and porting it 
to MANOS, and what a mess indeed. In NetWare, the only time data ever 
gets copied from incoming packets is: 


1. A copy to userspace at a stream head. 
2. An incoming write that gets copied into the file cache. 


Reads from cache are never copied. In fact, the network server locks a 
file cache page and sends it unaltered to the network drivers and DMA's 
directly from it. Since NetWare has WTD's these I/O requests get 
processed at the highest possible priority. In networking, the enemy is 
LATENCY for fast performance. That's why NetWare can handle 5000 users 
and Linux barfs on 100 in similar tests. Copying increases latency, 
as do the long code paths in the Linux Network layer. 


Jeff 


Jes Sorensen wrote: 
> 
> >>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes: 
> 
> Ingo> On Sat, 2 Sep 2000, Dan Maas wrote: 
> 
> >> There are various other tricks that can be done to speed up network 
> >> servers, like passing files directly from the buffer cache to the 
> >> network card. This one is currently frowned upon by the Linux 
> >> community, [...] 
> 
> Ingo> FYI, the TUX patch (released yesterday) includes a lightweight 
> Ingo> zero-copy TCP implementation for the 2.4 Linux kernel. The 
> Ingo> interface is not yet exported to user-space (simply because TUX 
> Ingo> uses it from kernel-space so the user-space bits were not 
> Ingo> needed), but the network driver framework and TCP-stack bits are 
> Ingo> there, so the hard part is done. The two most widely used 
> Ingo> gigabit drivers are 'converted' to support zero-copy, the 
> Ingo> SysKonnect and the Acenic driver (the modifications are well 
> Ingo> tested). I plan to add the user-space bits in the near future. 
> 
> Could you comment a bit on the design you used or do I have to go read 
> the code? Some of us had a good chat at OLS about how to do zero copy 
> TCP xmits by kiobufifying the skb's. 
> 
> Jes 

Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 16:35:11 EST 


> to MANOS, and what a mess indeed. In NetWare, the only time data ever 
> gets copied from incoming packets is: 
> 
> 1. A copy to userspace at a stream head. 
> 2. An incoming write that gets copied into the file cache. 


Sounds like Linux - one DMA and one copy to user space. 


> Reads from cache are never copied. In fact, the network server locks a 
> file cache page and sends it unaltered to the network drivers and DMA's 
> directly from it. Since NetWare has WTD's these I/O requests get 


Doesn't work with IP - you have to be able to checksum the data. For the 
recent cards that can handle this have a look at TUX. The work is there ready 
for 2.5 


Alan 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 16:45:42 EST 


Alan Cox wrote: 
> 
> > to MANOS, and what a mess indeed. In NetWare, the only time data ever 
> > gets copied from incoming packets is: 
> > 
> > 1. A copy to userspace at a stream head. 
> > 2. An incoming write that gets copied into the file cache. 
> 
> Sounds like Linux - one DMA and one copy to user space. 


Alan, Please. I'm in your code and there are copies all over the 
place. I agree you have a "fast path" for most stuff, but there's all 
kinds of handles lookups, linear list searching like 


while (x) 
{ 
x = x->next 
} 


all over the place that increases latency. Not to mention the overhead 
of the type of interrupt and trap gates that suck up about 50 clocks to 
fetch the IDT, PDE, and GDT tables for every interrupt. NetWare copies 
nothing in TCPIP except at the stream head. Why do you need to copy 
data to checksum an IP packet anyway? I noticed you do the right 
thing and keep the headers and data as separate fragments during header 
construction, so why do you need to copy data for checksumming? 


Jeff 

Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:10:25 EST 


> > Sounds like Linux - one DMA and one copy to user space. 
> 
> Alan, Please. I'm in your code and there are copies all over the 
> place. I agree you have a "fast path" for most stuff, but there's all 


There aren't copies all over the place on the paths that occur 99.999% 
of the time. Fragmented packets don't happen except for NFS (which is a rather 
broken protocol anyway). 


One DMA, one copy to user space 


> kinds of handles lookups, linear list searching like 
> 
> while (x) 
> { 
> x = x->next 
> } 


Timers are constructed to be close to O(1), the TCP hash isn't a linear lookup, 
and the socket operations from user space use file-> dereferences, not a lookup. 
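
[Editor's sketch, not the kernel's code: a hashed connection lookup of the
kind referred to above. The struct layout, the table size NBUCKETS, and the
mixing in conn_hash() are all assumptions made for illustration. The point
is only that a lookup indexes straight to one bucket and walks a chain only
when several connections hash to the same slot.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 256u /* hypothetical table size, a power of two */

struct conn {
    uint32_t saddr, daddr;  /* source and destination address */
    uint16_t sport, dport;  /* source and destination port */
    struct conn *next;      /* next entry sharing this bucket */
};

struct conn_table {
    struct conn *bucket[NBUCKETS];
};

/* Hypothetical mixing function: fold the 4-tuple into a bucket index. */
static unsigned conn_hash(uint32_t saddr, uint16_t sport,
                          uint32_t daddr, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    h ^= h >> 8;
    return h & (NBUCKETS - 1);
}

/* Push a connection onto the front of its bucket's chain. */
void conn_insert(struct conn_table *t, struct conn *c)
{
    assert(t != NULL && c != NULL);
    unsigned b = conn_hash(c->saddr, c->sport, c->daddr, c->dport);
    c->next = t->bucket[b];
    t->bucket[b] = c;
}

/* Index to one bucket, then compare only the entries chained there. */
struct conn *conn_lookup(const struct conn_table *t,
                         uint32_t saddr, uint16_t sport,
                         uint32_t daddr, uint16_t dport)
{
    assert(t != NULL);
    struct conn *c = t->bucket[conn_hash(saddr, sport, daddr, dport)];
    while (c) {
        if (c->saddr == saddr && c->daddr == daddr &&
            c->sport == sport && c->dport == dport)
            return c;
        c = c->next;
    }
    return NULL;
}
```

[With a reasonable hash and a table sized to the connection count, the chain
walked in conn_lookup() is short on average; it degrades toward a linear
search only when many entries collide in one bucket.]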


> nothing in TCPIP except at the stream head. Why do you need to copy 
> data anyway to checksum an IP packet anyway? I noticed you do the right 
> thing and keep the headers and data as separate fragments during header 
> construction, so why do you need to copy data for checksumming? 


We don't copy for checksumming. We fold the single user space copy and the 
checksum operation into one path, because on any modern CPU it costs precisely 
the same to copy as to copy/checksum. 
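
[Editor's sketch of the folded copy/checksum idea described above, not the
kernel's actual routine. The hypothetical helper copy_and_checksum() copies
a buffer and accumulates the 16-bit ones'-complement Internet checksum
(RFC 1071) in the same loop, so each source byte is read once. The name,
signature, and simple byte-pair loop are assumptions for illustration; real
implementations are usually hand-tuned per architecture.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst and return the Internet checksum of the
 * data, computed during the same pass over the source. */
uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Copy two bytes at a time and add them as one big-endian 16-bit word. */
    for (; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }

    /* An odd trailing byte is treated as the high byte of a final word. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

[For the eight bytes 00 01 f2 03 f4 f5 f6 f7, the worked example in
RFC 1071, this returns 0x220d and leaves an identical copy in dst. The loads
and stores dominate the loop, which is the sense in which the extra
additions come nearly for free.]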


I don't think you've actually sat and instrumented the TCP code 



Alan 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:20:58 EST 


Alan Cox wrote: 
> 
> > > Sounds like Linux - one DMA and one copy to user space. 
> > 
> > Alan, Please. I'm in your code and there are copies all over the 
> > place. I agree you have a "fast path" for most stuff, but there's all 
> 
> There aren't copies all over the place on the paths that occur 99.999% 
> of the time. Fragmented packets don't happen except for NFS (which is a rather 
> broken protocol anyway). 


There are. 


> 
> One DMA, one copy to user space 
> 
> > kinds of handles lookups, linear list searching like 
> > 
> > while (x) 
> > { 
> > x = x->next 
> > } 
> 
> timers are constructed to be close to O(1), the tcp hash isnt a linear lookup, 
> the socket operations from user space use file-> dereferences not a lookup 


It is if there's a hash collision. 


> 
> > nothing in TCPIP except at the stream head. Why do you need to copy 
> > data anyway to checksum an IP packet anyway? I noticed you do the right 
> > thing and keep the headers and data as separate fragments during header 
> > construction, so why do you need to copy data for checksumming? 
> 
> We dont copy for checksumming. We fold the single user space copy and the 
> checksum operation into one path, because on any modern CPU it costs precisely 
> the same to copy as to copy/checksum. 
> 
> I don't think you've actually sat and instrumented the TCP code 


In Linux, no, in Netware, yes. I'm in your TCP code now and it's 
fairly large. 


Jeff 




> 
> Alan 
> 

Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:21:13 EST 


> > There aren't copies all over the place on the paths that occur 99.999% 
> > of the time. Fragmented packets don't happen except for NFS (which is a rather 
> > broken protocol anyway). 
> 
> There are. 


You forgot to cite them 


> > the socket operations from user space use file-> dereferences not a lookup 
> 
> It is if there's a hash collision. 


So you want to compute a perfect hash from unknown data, which may also come 
from a hostile party attacking your hash function. If you can do that, stop 
off and claim a PhD. 


Alan 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:28:18 EST 


Alan Cox wrote: 
> 
> We dont copy for checksumming. We fold the single user space copy and the 
> checksum operation into one path, because on any modern CPU it costs precisely 
> the same to copy as to copy/checksum. 


You stated in an earlier message you copied the data when you calculated 
the TCPIP checksum. Now you say you don't. Perhaps I misunderstood. 


> 
> I don't think you've actually sat and instrumented the TCP code 


The TCPIP stack in Wolf Mountain has my name as the author, and it was 
one of the nastiest projects I've ever done. OSPF routing is bitch 
BTW. Try again. 


> 
> Alan 
> 

Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:30:19 EST 


> You stated in an earlier message you copied the data when you calculated 
> the TCPIP checksum. Now you say you don't. Perhaps I misunderstood. 


We do a single copy/checksum from user space. You have to do the copy because 
the packet may not be DMAable, may not be aligned for most PCI hardware and 
numerous other things. Since that copy costs as much as the checksum, it's 
effectively free in the checksum computation. It also avoids considerable 
complexity on the TCP paths when you need to retransmit. 


> > I don't think you've actually sat and instrumented the TCP code 
> 
> The TCPIP stack in Wolf Mountain has my name as the author, and it was 


The Linux TCP code.. 


> one of the nastiest projects I've ever done. OSPF routing is bitch 
> BTW. Try again. 


OSPF is a matter of getting the graph theory right. 




Re: zero-copy TCP
From: Andi Kleen (ak@suse.de)
Date: Sat Sep 02 2000 - 17:39:38 EST 


On Sat, Sep 02, 2000 at 04:28:18PM -0600, Jeff V. Merkey wrote: 
> 
> 
> Alan Cox wrote: 
> > 
> > We dont copy for checksumming. We fold the single user space copy and the 
> > checksum operation into one path, because on any modern CPU it costs precisely 
> > the same to copy as to copy/checksum. 
> 
> You stated in an earlier message you copied the data when you calculated 
> the TCPIP checksum. Now you say you don't. Perhaps I misunderstood. 


Linux always does a single copy for TCP, and the checksum is folded into 
that. Doing just the checksum alone wouldn't be much less costly. 


[Note this is only true for 2.4 in the fast path; 2.2 RX usually does the 
checksum and copy-to-user separately, unless you have hardware RX checksumming. 


For TX we always do a single copy/checksum out of user space or out of the 
page cache when you use sendfile or mmap] 



-Andi 

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:47:33 EST 


Andi Kleen wrote: 
> 
> On Sat, Sep 02, 2000 at 04:28:18PM -0600, Jeff V. Merkey wrote: 
> > 
> > 
> > Alan Cox wrote: 
> > > 
> > > We dont copy for checksumming. We fold the single user space copy and the 
> > > checksum operation into one path, because on any modern CPU it costs precisely 
> > > the same to copy as to copy/checksum. 
> > 
> > You stated in an earlier message you copied the data when you calculated 
> > the TCPIP checksum. Now you say you don't. Perhaps I misunderstood. 
> 
> Linux always does a single copy for TCP, and the checksum is folded into 
> that. Doing just the checksum alone wouldn't be much less costly. 
> 
> [Note this is only true for 2.4 in the fast path, 2.2 RX usually does 
> checksum and copy-to-user separated, unless you have hardware RX checksumming 
> 
> For TX we always do a single copy checksum out of user space or out of the 
> page cache when you use sendfile or mmap] 


This makes sense. 


Jeff 


> 
> -Andi 

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sun Sep 03 2000 - 03:29:50 EST 


On Sat, 2 Sep 2000, Jeff V. Merkey wrote: 


> while (x) 
> { 
> x = x->next 
> } 
> 
> all over the place that increases latency. [...] 


i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If 
it's all over the place and if it increases latency, you certainly can 
show at least one such place. 


Ingo 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 05:14:10 EST 


Ingo, 


When I have time to do this exercise, I will. I've finished merging 
Alan's Code into MANOS (completed last night). Most of the cases I saw 
where there were copies were not fast path. It takes some time to go 
through all this code you guys have written. It is actually looking 
good. 


Jeff 


Ingo Molnar wrote: 
> 
> On Sat, 2 Sep 2000, Jeff V. Merkey wrote: 
> 
> > while (x) 
> > { 
> > x = x->next 
> > } 
> > 
> > all over the place that increases latency. [...] 
> 
> i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If 
> it's all over the place and if it increases latency, you certainly can 
> show at least one such place. 
> 
> Ingo 
> 

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 05:39:03 EST 


On Tue, 5 Sep 2000, Jeff V. Merkey wrote: 


> > > while (x) 
> > > { 
> > > x = x->next 
> > > } 
> > > 
> > > all over the place that increases latency. [...] 
> > 
> > i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If 
> > it's all over the place and if it increases latency, you certainly can 
> > show at least one such place. 
> 
> When I have time to do this exercise, I will. [...] 


well, your original claim (quoted above) shows that you have identified 
numerous such places already, so you don't have to do any additional 
'exercise'. The "all over the place" code shouldn't be too hard to find 
again - please just say filename and line number in any kernel version of 
your choice and we'll look into it. 


Ingo 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 05:58:10 EST 


Alright Ingo, you asked for it. I am going through it now and going 
over ALL my notes. I will catalog ALL of them and post it. Is this 
what you really want? 


:-) 


Jeff 



Ingo Molnar wrote: 
> 
> On Tue, 5 Sep 2000, Jeff V. Merkey wrote: 
> 
> > > > while (x) 
> > > > { 
> > > > x = x->next 
> > > > } 
> > > > 
> > > > all over the place that increases latency. [...] 
> > > 
> > > i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If 
> > > it's all over the place and if it increases latency, you certainly can 
> > > show at least one such place. 
> > 
> > When I have time to do this exercise, I will. [...] 
> 
> well, your original claim (quoted above) shows that you have identified 
> numerous such places already, so you dont have to do any additional 
> 'exercise'. The "all over the place" code shouldnt be too hard to find 
> again - please just say filename and line number in any kernel version of 
> your choice and we'll look into it. 
> 
> Ingo 
> 

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:15:25 EST 


On Tue, 5 Sep 2000, Jeff V. Merkey wrote: 


> Alright Ingo, you asked for it. I am going through it now and going 
> over ALL my notes. I will catalog ALL of them and post it. Is this 
> what you really want? 


yes, this would be the best indeed, to get those places fixed. But if you 
don't want to spend your time on that then it's enough to just post a 
single incident of such inefficiency and list-walking that impacts latency 
like you claim. 


Ingo 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:09:10 EST 


The origin of this comment was related to a comparison of the 
MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's 
code handles fast paths well and from what I've seen is comparable to 
NetWare. The areas I saw were sideband cases and issues of fragment 
re-assembly. It's as good as what's in NetWare. 


Jeff 


Ingo Molnar wrote: 
> 
> On Tue, 5 Sep 2000, Jeff V. Merkey wrote: 
> 
> > Alright Ingo, you asked for it. I am going through it now and going 
> > over ALL my notes. I will catalog ALL of them and post it. Is this 
> > what you really want? 
> 
> yes, this would be the best indeed, to get those places fixed. But if you 
> dont want to spend your time on that then it's enough to just post a 
> single incident of such inefficiency and list-walking that impacts latency 
> like you claim. 
> 
> Ingo 

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:41:05 EST 


On Tue, 5 Sep 2000, Jeff V. Merkey wrote: 


> The origin of this comment was related to a comparison of the 
> MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's 
> code handles fast paths well and from what I've seen is comparable to 
> NetWare. [...] 


can we thus take this as a retraction of your below quoted three 
derogatory comments? 


" The entire Linux Network subsystem needs an overhaul. " 


" In networking, the enemy is LATENCY for fast performance. That's why 
NetWare can handle 5000 users and Linux barfs on 100 in similar tests. 
Copying increases latency, and the long code paths in the Linux Network 
layer. " 



" Alan, Please. I'm in your code and there are copies all over the 
place. I agree you have a "fast path" for most stuff, but there's all 
kinds of handles lookups, linear list searching like 


while (x) 
{ 
x = x->next 
} 

all over the place that increases latency. " 


Ingo 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:41:35 EST 


Ingo Molnar wrote: 
> 
> On Tue, 5 Sep 2000, Jeff V. Merkey wrote: 
> 
> > The origin of this comment was related to a comparison of the 
> > MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's 
> > code handles fast paths well and from what I've seen is comparable to 
> > NetWare. [...] 
> 
> can we thus take this as a retraction of your below quoted three 
> derogatory comments? 
> 
> " The entire Linux Network subsystem needs an overhaul. " 


To support the performance metrics of NetWare, there are some changes I 
will make that will allow Alan's code to beat Native NetWare. One is 
allowing pre-scan protocol stacks to exist. Another is a WTD 
optimization to allow Alan's code to tag pages in the page cache and 
post them with a preemptive IO WTD. Another is moving ALL of the 
routing code into the kernel space. Another is consolidation of bottom 
and top halves to allow a single interrupt thread to run all the way into 
the router and out without the need to schedule. Another is moving the 
NCP server into the kernel. Another is enabling "gang" tagging and 
release of a single cache page by hundreds or thousands of users at one 
time for incoming reads. The list is very long. 


> 
> " In networking, the enemy is LATENCY for fast performance. That's why 
> NetWare can handle 5000 users and Linux barfs on 100 in similar tests. 
> Copying increases latency, and the long code paths in the Linux Network 
> layer. " 
> 
> " Alan, Please. I'm in your code and there are copies all over the 
> place. I agree you have a "fast path" for most stuff, but there's all 
> kinds of handles lookups, linear list searching like 
> 
> while (x) 
> { 
> x = x->next 
> } 
> 
> all over the place that increases latency. " 
> 
> Ingo 


I already said this code is more than suitable, and better yet, it's 
something folks are familiar with in Linux. Alan and I went over some 
of this off line. Sorry you missed it. 


Jeff 

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:16:19 EST 


btw., - the maintainers of the 2.4 networking and TCP/IP code are Alexey 
Kuznetsov and David S. Miller - please direct your findings towards them, 
not me :-) 


Ingo 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:10:28 EST 


You opened your mouth. 


:-) 


Jeff 


Ingo Molnar wrote: 
> 
> btw., - the maintainers of the 2.4 networking and TCP/IP code are Alexey 
> Kuznetsov and David S. Miller - please direct your findings towards them, 
> not me :-) 
> 
> Ingo 

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sun Sep 03 2000 - 03:28:18 EST 


On Sat, 2 Sep 2000, Jeff V. Merkey wrote: 


> Alan, Please. I'm in your code and there are copies all over the 
> place. I agree you have a "fast path" for most stuff, but there's 
> all kinds of handles lookups, linear list searching like 


have you ever bothered actually measuring the impact? I have. Is the Linux 
kernel perfect? Not at all. I don't understand why you take this as a 
personal insult - you are certainly free to add your improvements, no 
insults or patronizing is necessary, this is a technical forum. 


Ingo 



Re: zero-copy TCP
From: Andi Kleen (ak@suse.de)
Date: Sat Sep 02 2000 - 17:02:27 EST 


On Sat, Sep 02, 2000 at 10:35:11PM +0100, Alan Cox wrote: 
> > to MANOS, and what a mess indeed. In NetWare, the only time data ever 
> > gets copied from incoming packets is: 
> > 
> > 1. A copy to userspace at a stream head. 
> > 2. An incoming write that gets copied into the file cache. 
> 
> Sounds like Linux - one DMA and one copy to user space. 


Given for NFS over UDP it is usually more, because of the defragmentation 
pass. That will be fixed in 2.5 and the code is already written, it just wants 
to be ported to kiobufs. 2.4 NFSD at least receives directly into the 
page cache unlike 2.2 (so it'll do two copies, three usually on alpha) 


Samba probably does more copies though, I don't think it receives directly 
into a mmap'ed buffer (so there are at least two copies to write something 
to disk). 


-Andi 
- 
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in 
the body of a message to majordomo@vger.kernel.org 
Please read the FAQ at http://www.tux.org/lkml/ 

Re: zero-copy TCP
From: Jes Sorensen (jes@linuxcare.com)
Date: Sat Sep 02 2000 - 16:40:18 EST 


>>>>> "Jeff" == Jeff V Merkey <jmerkey@timpanogas.com> writes: 


Jeff, could you start by learning to quote email and not send a full 
copy of the entire email you reply to (read rfc1855). 


Jeff> The entire Linux Network subsystem needs an overhaul. The code 
Jeff> copies data all over the place. I am at present pulling it apart 
Jeff> and porting it to MANOS, and what a mess indeed. In NetWare, the 
Jeff> only time data ever gets copied from incoming packets is: 


Try and understand the code before you make such bold statements. 


Jeff> 1. A copy to userspace at a stream head. 2. An incoming write 
Jeff> that gets copied into the file cache. 


Jeff> Reads from cache are never copied. In fact, the network server 
Jeff> locks a file cache page and sends it unaltered to the network 
Jeff> drivers and DMA's directly from it. Since NetWare has WTD's 
Jeff> these I/O requests get processed at the highest possible 
Jeff> priority. In networking, the enemy is LATENCY for fast 
Jeff> performance. That's why NetWare can handle 5000 users and Linux 
Jeff> barfs on 100 in similar tests. Copying increases latency, and 
Jeff> the long code paths in the Linux Network layer. 


You can't DMA directly from a file cache page unless you have a 
network card that does scatter/gather DMA and surprise surprise, 
80-90% of the cards on the market don't support this. Besides that you 
need to do copy-on-write if you want to be able to do zero copy on 
write() from user space, marking data copy on write is *expensive* on 
x86 SMP boxes since you have to modify the tlb on all 
processors. On top of that you have to look at the packet size, for 
small packets a copy is often a lot cheaper than modifying the page 
tables, even on UP systems so you need a copy/break scheme here. 
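
[Editor's sketch of the copy/break decision described above. The threshold
COPYBREAK_BYTES, the enum, and choose_tx_path() are hypothetical names
chosen for illustration; this is not code from any particular driver, and
the right threshold depends on the hardware and would have to be measured.]

```c
#include <assert.h>
#include <stddef.h>

#define COPYBREAK_BYTES 256u /* hypothetical threshold; tune by measurement */

enum tx_path { TX_COPY, TX_ZERO_COPY };

/* Decide how to hand a payload of len bytes to the network card. */
enum tx_path choose_tx_path(size_t len, int nic_has_scatter_gather)
{
    assert(len > 0); /* an empty send has nothing to copy or map */

    /* Without scatter/gather DMA the card needs one contiguous buffer,
     * so the header and payload must be copied together. */
    if (!nic_has_scatter_gather)
        return TX_COPY;

    /* For small payloads a copy is cheaper than the page-table and TLB
     * work that a zero-copy send would require. */
    if (len < COPYBREAK_BYTES)
        return TX_COPY;

    return TX_ZERO_COPY;
}
```

[Under these assumptions a 64-byte payload is copied even on a
scatter/gather card, while a 4096-byte payload takes the zero-copy path
only when the card supports it.]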


As for your statement on latency, it's nice to see that you don't 
know what you are talking about. Latency is one issue in fast 
networking, but it's far from the only one. Latency is important for 
message passing type applications; however, for bulk data transfers it's 
less relevant since you really want deep pipelining here and properly 
written applications. If your TCP window is too small, even zero latency 
will only buy you so much on a really fast network. 
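
[Editor's sketch of the window argument above, using the usual rule of
thumb that a sender can have at most one window of data in flight per round
trip. The function name and parameters are hypothetical, and the model
ignores slow start, loss, and header overhead.]

```c
#include <assert.h>

/* Upper bound on bulk throughput in bits per second for a given TCP window
 * (bytes), round-trip time (seconds), and link rate (bits per second). */
double window_limited_bps(double window_bytes, double rtt_seconds,
                          double link_bps)
{
    assert(window_bytes > 0 && rtt_seconds > 0 && link_bps > 0);

    /* One window per round trip, converted from bytes to bits. */
    double bps = window_bytes * 8.0 / rtt_seconds;

    /* The link itself is the other ceiling. */
    return bps < link_bps ? bps : link_bps;
}
```

[With a 65535-byte window and a 1 ms round trip this gives roughly 524
Mbit/s on a gigabit link, so about half the link stays idle no matter how
quickly the host processes each packet. A larger window lifts the bound
until the link rate becomes the limit.]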


Jes 

Re: zero-copy TCP
From: Jamie Lokier (lk@tantalophile.demon.co.uk)
Date: Sat Sep 02 2000 - 22:22:44 EST 


Jes Sorensen wrote: 
> You can't DMA directly from a file cache page unless you have a 
> network card that does scatter/gather DMA and surprise surprise, 
> 80-90% of the cards on the market don't support this. Besides that you 
> need to do copy-on-write if you want to be able to do zero copy on 
> write() from user space, marking data copy on write is *expensive* on 
> x86 SMP boxes since you have to modify the tlb on all 
> processors. On top of that you have to look at the packet size, for 
> small packets a copy is often a lot cheaper than modifying the page 
> tables, even on UP systems so you need a copy/break scheme here. 


I just thought I'd mention that you can do zero copy TCP in and out 
*without* any page marking schemes. All you need is a network card with 
quite a lot of RAM and some intelligence. An Alteon could do it, with 
extra RAM or an impressively underloaded network. 


(for example) http://www.digital.com/info/DTJS05/ 


-- Jamie 

Re: zero-copy TCP
From: Linus Torvalds (torvalds@transmeta.com)
Date: Sun Sep 03 2000 - 01:33:27 EST 


In article <20000903052244.B15788@pcep-jamie.cern.ch>, 
Jamie Lokier <lk@tantalophile.demon.co.uk> wrote: 
> 
>I just thought I'd mention that you can do zero copy TCP in and out 
>*without* any page marking schemes. All you need is a network card with 
>quite a lot of RAM and some intelligence. An Alteon could do it, with 
>extra RAM or an impressively underloaded network. 
> 
>(for example) http://www.digital.com/info/DTJS05/ 


The thing is, that at least historically it has always been a bad bet to 
bet on special-purpose hardware over general-purpose stuff. 


What I'm saying is that basically you should not design your TCP layer 
around the 0.1% of cards that have tons of intelligence, when you have a 
general-purpose CPU that tends to be faster in the end. 


The smart cards can actually have higher latency than just doing it 
the "stupid" way with the CPU. Yes, they'll offload some of the 
computation, and may make system throughput better, but at what cost? 


[ Same old example: just calculate how quickly you can get your packet 
on the wire with a smart card that does checksumming in hardware, and 
do the same calculations with a CPU that does the checksums. Take into 
account that the checksum is at the _head_ of the packet. The CPU will 
win. 


Proof: the data to be sent out is in RAM. In fact, often it is cached 
in the CPU these days. In order to start sending out the packet, the 
smart card has to move all of the data from RAM/cache over the bus to 
the card. It can only start actually sending after that. Cost: bus 
speed to copy it over. 


In contrast, if you do it on the CPU, you can basically start feeding 
the packet out on the net after doing a CPU checksum that is limited 
by RAM/cache speeds. Bus speed isn't the limiting factor any more on 
packet latency, as you can send out the start of the packet on the 
network before the whole packet has even been copied over the internal 
bus! ] 
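
To put rough numbers on the argument in the aside above, here is a minimal 
Python sketch. It shows a standard 16-bit one's-complement Internet 
checksum (which needs a pass over all the data before the header can be 
completed) and a deliberately crude model of when the first byte can reach 
the wire. The function names and the bandwidth figures in the note below 
are assumptions for illustration, not measurements, and the model ignores 
setup costs on both paths. 

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over the whole buffer."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += int.from_bytes(data[i:i + 2], "big")
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def first_byte_delay_smart_card(packet_bytes, bus_bytes_per_s):
    """Card checksums: the whole packet must cross the bus before sending starts."""
    return packet_bytes / bus_bytes_per_s


def first_byte_delay_cpu_checksum(packet_bytes, mem_bytes_per_s):
    """CPU checksums at RAM/cache speed; the card can then send while data streams in."""
    return packet_bytes / mem_bytes_per_s
```

For a 1500-byte packet, assuming a bus that moves about 133 MB/s (the 
theoretical peak of 32-bit/33 MHz PCI) and a CPU that checksums at an 
assumed 500 MB/s, this gives about 11 microseconds before the smart card 
can start sending versus 3 microseconds for the CPU path. 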


So. Smart cards are not necessarily better for latency. They are 
certainly not cheaper. They _are_ better for throughput, no question 
about that. But so is adding another CPU. Or beefing up your memory 
subsystem. Or any number of other things that are more generic than some 
smart network card - and often cheaper because they are "standard 
components", useful regardless of _what_ you do. 


End result: smart cards only make sense in systems that are really 
pushing the performance envelope. Which, after all, is not that common, 
as it's usually easier to just beef up the machine in other ways until 
the network is not the worst bottleneck. Very few places outside 
benchmark labs have networks _that_ studly. 


Right now gigabit is heavy-duty enough that it is worth smart cards. 
The same used to be true about the first generation of 100Mbit cards. 
The same will be true of 10Gbps cards in another few years. But 
basically, they'll probably always end up being the exception rather 
than the rule, unless they become so cheap that it doesn't matter. But 
"cheap" and "pushing the performance envelope" do not tend to go hand in 
hand. 


Linus 

Re: zero-copy TCP
From: Jamie Lokier (lk@tantalophile.demon.co.uk)
Date: Sun Sep 03 2000 - 15:46:54 EST 


Linus Torvalds wrote: 
> Proof: the data to be sent out is in RAM. In fact, often it is cached 
> in the CPU these days. In order to start sending out the packet, the 
> smart card has to move all of the data from RAM/cache over the bus to 
> the card. It can only start actually sending after that. Cost: bus 
> speed to copy it over. 
> 
> In contrast, if you do it on the CPU, you can basically start feeding 
> the packet out on the net after doing a CPU checksum that is limited 
> by RAM/cache speeds. Bus speed isn't the limiting factor any more on 
> packet latency, as you can send out the start of the packet on the 
> network before the whole packet has even been copied over the internal 
> bus! 


Nice point! Only valid for TCP & UDP though. 


When people want _real_ low latency, they don't use TCP or UDP, and they 
certainly don't put data checksums at the start. They still aim for 
zero copies. That pass, even over cached data, is still significant. 


> Right now gigabit is heavy-duty enough that it is worth smart cards. 
> The same used to be true about the first generation of 100Mbit cards. 
> The same will be true of 10Gbps cards in another few years. But 
> basically, they'll probably always end up being the exception rather 
> than the rule, unless they become so cheap that it doesn't matter. But 
> "cheap" and "pushing the performance envelope" do not tend to go hand in 
> hand. 


Fair enough. Please read my description of a zero-copy scheme that 
doesn't require much intelligence on the card though. I think it's a 
neat kernel trick that might just pay off. Sometimes, maybe. 


-- Jamie 

Re: zero-copy TCP
From: Linus Torvalds (torvalds@transmeta.com)
Date: Sun Sep 03 2000 - 16:03:03 EST 


On Sun, 3 Sep 2000, Jamie Lokier wrote: 
> 
> Nice point! Only valid for TCP & UDP though. 


Yeah. But "we need oxygen" is only a valid point for carbon-based 
life-forms. You might as well argue that oxygen is not a valid criterion for 
being livable, because it's only valid for the particular kind of 
creatures we are. 


Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really 
make a big selling point any more. 


> When people want _real_ low latency, they don't use TCP or UDP, and they 
> certainly don't put data checksums at the start. They still aim for 
> zero copies. That pass, even over cached data, is still significant. 


I disagree. 


Look at history. 


Exercise 1: name a protocol that did something like that 
(yes, I know, there are multiple). 


Exercise 2: name one of them that is still relevant today. 


See? Performance, in the end, is very much secondary. It doesn't matter 
one whit if you perform better than everybody else, if you cannot _talk_ 
to everybody else. 


I think the RISC vendors found that out. And I think most network vendors 
find that out. 


(Yes, I know, you're probably talking about things like the networking 
protocols for clusters etc. I'm just saying that historically such 
special-purpose stuff always tends to end up being not as good as the 
"real thing".) 


> Fair enough. Please read my description of a zero-copy scheme that 
> doesn't require much intelligence on the card though. I think it's a 
> neat kernel trick that might just pay off. Sometimes, maybe. 


We could certainly try to do better. But some of the schemes I've seen have 
implied a lot of complexity for gains that aren't actually real in the end 
(eg playing expensive games with memory mapping in order to avoid a copy 
that ends up happening anyway because the particular card you're using 
doesn't do scatter-gather: you'd perform a lot better if you just did the 
copy outright and forgot about the expensive games - which is what Linux 
does). 


Linus 



Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 05:36:05 EST 


Linus Torvalds wrote: 
> 
> 
> 
> Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really 
> make a big selling point any more. 
> 
> 


Linus, 


IPX is a really good LAN protocol (but totally sucks for internet). A 
full-blown NCP server in-kernel that's tightly coupled to the page 
cache running over IPX would make flames shoot out of the back of a 
Linux server, and make NT look like an old lady hobbling down the 
street. There's no need to configure client addresses with it, and for 
file and print, it's the best. 


Jeff 

Re: zero-copy TCP
From: Henning P. Schmiedehausen (hps@tanstaafl.de)
Date: Tue Sep 05 2000 - 08:34:02 EST 


jmerkey@timpanogas.com (Jeff V. Merkey) writes: 




>Linus Torvalds wrote: 
>> 
>> 
>> 
>> Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really 
>> make a big selling point any more. 
>> 
>> 


>Linus, 


>IPX is a really good LAN protocol (but totally sucks for internet). A 
>full-blown NCP server in-kernel that's tightly coupled to the page 
>cache running over IPX would make flames shoot out of the back of a 
>Linux server, and make NT look like an old lady hobbling down the 
>street. There's no need to configure client addresses with it, and for 
>file and print, it's the best. 


And it would be a good bit of necrophilia, too. 


Jeff, NetWare is dead. Please leave it there. IP won. The number of 
new NetWare installations (as compared to existing or just upgrades) 
is close (really close) to nil. 


Regards 
Henning 


-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH hps@intermeta.de
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 info@intermeta.de
D-91054 Buckenhof Fax.: 09131 / 50654-20 

Re: zero-copy TCP
From: Dan Hollis (goemon@anime.net)
Date: Tue Sep 05 2000 - 13:25:12 EST 


On 5 Sep 2000, Henning P. Schmiedehausen wrote: 
> jmerkey@timpanogas.com (Jeff V. Merkey) writes: 
> >IPX is a really good LAN protocol (but totally sucks for internet). A 
> Jeff, Netware is dead. Please leave it there. IP won. The number of 
> new Netware Installations (as compared to existing or just upgrades) 
> is close (really close) to nil. 


I think you mean IPX is dead. Netware *could* work over TCP or UDP. 
IP is definitely king. Even micro$haft gave up on NetBEUI. 


-Dan 



Re: zero-copy TCP
From: Henning P . Schmiedehausen (hps@tanstaafl.de)
Date: Tue Sep 05 2000 - 14:32:46 EST 


On Tue, Sep 05, 2000 at 11:25:12AM -0700, Dan Hollis wrote: 
> On 5 Sep 2000, Henning P. Schmiedehausen wrote: 
> > jmerkey@timpanogas.com (Jeff V. Merkey) writes: 
> > >IPX is a really good LAN protocol (but totally sucks for internet). A 
> > Jeff, Netware is dead. Please leave it there. IP won. The number of 
> > new Netware Installations (as compared to existing or just upgrades) 
> > is close (really close) to nil. 
> 
> I think you mean IPX is dead. Netware *could* work over TCP or UDP. 
> IP is definitely king. Even micro$haft gave up on NetBEUI. 


Yep, that's what I meant. Sorry that I was not clearer. But I think 
that even with NetWare on IP there are not many new 
installations. There is lots of migration of existing servers and 
keeping existing systems alive, but new rollouts? 


But then again, maybe with MANOS and OpenNetWare, everything will be 
different. 


Regards 
Henning 



-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH hps@intermeta.de
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 info@intermeta.de
D-91054 Buckenhof Fax.: 09131 / 50654-20 

Re: zero-copy TCP
From: Chris Wedgwood (cw@f00f.org)
Date: Tue Sep 05 2000 - 14:20:31 EST 


On Tue, Sep 05, 2000 at 03:34:02PM +0200, Henning P. Schmiedehausen wrote: 


> And it would be a good bit of necrophilia, too. 
> 
> Jeff, Netware is dead. Please leave it there. IP won. The number of 
> new Netware Installations (as compared to existing or just upgrades) 
> is close (really close) to nil. 


Sadly neither of these comments is true -- there are still a great 
many NetWare installations, and many of the existing installations are 
far from dead as they move to IP... 



--cw 
