On Dec 6, 2007, at 9:54 AM, Alexander Dupuy wrote:
Paolo Abeni writes:
It does not use environment variables to control the memory-mapped
ring parameters; instead, the requested snaplen is used: the
low-order bytes select the ring frame size and the high-order bytes
select the ring frame count. If the high-order bytes are 0, as in
all current libpcap usage, a reasonable default is used.
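A minimal sketch of the encoding described above, assuming a
16/16-bit split between the "low order" and "high order" parts (the
excerpt doesn't say where the boundary actually lies):

    /* Sketch of the snaplen overloading described above; the 16/16-bit
     * split and the default are assumptions, not values from the patch. */
    #include <stdio.h>

    #define DEFAULT_FRAME_NR 4096           /* hypothetical default */

    int main(void)
    {
        int snaplen = (256 << 16) | 2048;   /* 256 frames of 2048 bytes */

        unsigned frame_size = (unsigned)snaplen & 0xffff;         /* low-order bytes */
        unsigned frame_nr   = ((unsigned)snaplen >> 16) & 0xffff; /* high-order bytes */

        if (frame_nr == 0)                  /* all current libpcap usage */
            frame_nr = DEFAULT_FRAME_NR;

        printf("frame_size=%u frame_nr=%u\n", frame_size, frame_nr);
        return 0;
    }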
Using the snaplen for the ring frame size certainly makes sense, but
I'm uncomfortable with overloading the high-order bytes to specify
the number of ring slots. The right way to do this is to provide a
general interface for setting the buffer size (WinPcap has a
system-specific extension), but the problem in the past has been
that some systems, such as the *BSDs, require the buffer size to be
set when opening the underlying (in this case, BPF) device. Guy
Harris has in the past talked about an extensible key/value list
parameter to pcap_open, but as far as I know, nothing has ever come
of it. In our version, we use a pcap_setbufsize function, like
WinPcap's pcap_setbuff, but this only works on Linux (packet socket
read or mmap), Windows (WinPcap), and SunOS 3.x and SGI IRIX (the
only other systems that use sockets for packet capture and support
SO_RCVBUF).
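pcap_setbufsize() is a local extension in Dupuy's tree, not a stock
libpcap API; a plausible shape for it, assumed by analogy with
WinPcap's pcap_setbuff(), might look like the following (this won't
link against stock libpcap):

    #include <stdio.h>
    #include <pcap.h>

    /* Hypothetical prototype; pcap_setbufsize() is a local extension,
     * assumed analogous to WinPcap's pcap_setbuff(). */
    int pcap_setbufsize(pcap_t *p, int bufsize);

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 0, errbuf);

        if (p == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }
        /* Ask for a 4 MB capture buffer; presumably this would map to
         * SO_RCVBUF or the mmap ring on Linux, pcap_setbuff on Windows. */
        if (pcap_setbufsize(p, 4 * 1024 * 1024) != 0)
            fprintf(stderr, "pcap_setbufsize: %s\n", pcap_geterr(p));

        pcap_close(p);
        return 0;
    }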
How does pcap_setbufsize() differ from pcap_setbuff()?
Rounding the ring frame size up to the nearest power of two wastes
quite a bit of memory for full capture on standard Ethernet
(1514-byte frames in 2048-byte slots, ~26% wasted) and even more for
typical jumbo frames (9000 bytes in 16384-byte slots, ~45% wasted).
How exactly does this simplify ring navigation? I don't recall
seeing this in any other pcap-mmap implementation (admittedly, I
never looked too closely at Phil Wood's code). Also, what do you do
when the snaplen is zero - implying maximum packet size?
0 is not a valid snapshot length in pcap_open_live(); it's valid in
current versions of tcpdump, but only because tcpdump maps it to
65535. 65535 is also the default in Wireshark.
If that wastes wired-down ring-buffer memory, the right thing to do
is probably, as you note, to use the interface MTU (although you
have to add not only the maximum link-layer header size but also
room for things such as the radio header on 802.11 adapters).
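A sketch of MTU-based frame sizing along those lines; it is
Linux-specific (SIOCGIFMTU), and the link-layer and radio-header
allowances are illustrative guesses, not values from any patch:

    /* Size ring frames from the interface MTU instead of a power of
     * two.  The header allowances below are assumptions. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>

    #define MAX_LINKHDR   32   /* assumed worst-case link-layer header */
    #define MAX_RADIOHDR  64   /* assumed allowance for 802.11 radio headers */

    static int frame_size_for(const char *dev)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, dev, sizeof(ifr.ifr_name) - 1);
        if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
        /* MTU plus room for link-layer and radio-style headers. */
        return ifr.ifr_mtu + MAX_LINKHDR + MAX_RADIOHDR;
    }

    int main(void)
    {
        printf("frame size: %d\n", frame_size_for("eth0"));
        return 0;
    }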
I wonder if your "power-of-two" approach is just covering up some
memory overflow problems. I also notice that you are limiting the
number of ring slots to 128K (MAX_BLOCK_NR). While this is correct
for 32-bit i386 Linux 2.4 (and earlier) kernels, the values are
different on other architectures, and the kmalloc limit no longer
applies to 2.6 kernels (there are other limits, though). My version
uses a binary-search approach to find a working buffer size if the
requested one fails when allocating the ring buffer; this isn't
ideal, but it is more practical across different kernels and
simplifies application programming considerably.
...and is similar to what's done for the BPF buffer size.
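A sketch of that kind of fallback on Linux: ask the kernel for a
PACKET_RX_RING and halve the requested block count until setsockopt()
accepts it (a simple back-off rather than a true binary search; the
page-sized blocks and the error handling are assumptions):

    #include <string.h>
    #include <errno.h>
    #include <sys/socket.h>
    #include <linux/if_packet.h>

    /* Assumes frame_size divides the 4096-byte block size. */
    static int setup_ring(int fd, unsigned int frame_size,
                          unsigned int want_frames)
    {
        unsigned int frames_per_block = 4096 / frame_size;
        struct tpacket_req req;

        memset(&req, 0, sizeof(req));
        req.tp_block_size = 4096;          /* one page per block, assumed */
        req.tp_frame_size = frame_size;
        req.tp_block_nr   = want_frames / frames_per_block;

        while (req.tp_block_nr > 0) {
            req.tp_frame_nr = req.tp_block_nr * frames_per_block;
            if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING,
                           &req, sizeof(req)) == 0)
                return (int)req.tp_frame_nr;  /* ring of this many frames */
            if (errno != ENOMEM)
                return -1;                    /* real failure, give up */
            req.tp_block_nr /= 2;             /* back off and retry */
        }
        return -1;
    }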
Using a zero timeout to indicate "wait forever" introduces some
compatibility and consistency problems; the original (and probably
best) use of the timeout is for in-kernel delays. The
application-level read timeout (or lack thereof) is better taken
from the pcap_setnonblock() call (i.e., wait forever is the default
unless nonblocking mode is set). If I'm not mistaken, this is the
current behavior of the socket read() implementation on Linux. You
also have to be much more careful about multiple calls to poll()
within the loop, due to interrupts and the interface going down, and
you have to handle pcap_breakloop() correctly.
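A sketch of the kind of poll() loop being warned about here: EINTR
must not abort the capture loop, an interface going down shows up in
revents, and a pcap_breakloop()-style flag has to be re-checked
after every wakeup (the handle structure is a stand-in, not
libpcap's real pcap_t):

    #include <errno.h>
    #include <poll.h>

    struct capture_handle {              /* stand-in for the real pcap_t */
        int fd;
        volatile int break_loop;         /* set by pcap_breakloop() */
    };

    static int wait_for_frames(struct capture_handle *h, int timeout_ms)
    {
        struct pollfd pfd;

        pfd.fd = h->fd;
        pfd.events = POLLIN;
        for (;;) {
            if (h->break_loop) {
                h->break_loop = 0;
                return -2;               /* the PCAP_ERROR_BREAK convention */
            }
            /* timeout_ms == -1 means wait forever, as when nonblocking
             * mode is not set. */
            int n = poll(&pfd, 1, timeout_ms);
            if (n > 0) {
                if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
                    return -1;           /* e.g. the interface went down */
                return 0;                /* frames are ready */
            }
            if (n == 0)
                return 1;                /* timeout expired */
            if (errno != EINTR)
                return -1;               /* real error */
            /* EINTR: a signal interrupted us; loop and re-check
             * break_loop before polling again. */
        }
    }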
On platforms where the timeout is supported, 0 means "wait forever";
to quote the pcap man page's description of pcap_open_live():
    to_ms specifies the read timeout in milliseconds. The read timeout
    is used to arrange that the read not necessarily return immediately
    when a packet is seen, but that it wait for some amount of time to
    allow more packets to arrive and to read multiple packets from the OS
    kernel in one operation. Not all platforms support a read timeout;
    on platforms that don't, the read timeout is ignored. A zero value
    for to_ms, on platforms that support a read timeout, will cause a read
    to wait forever to allow enough packets to arrive, with no timeout.
Linux is one of the platforms that doesn't support a read timeout.
Note that it is *NOT* guaranteed that a read will complete within
"to_ms" milliseconds; on Solaris, for example, the timer doesn't start
until at least one packet is seen, so the read could block forever if
no packets arrive. (Applications should *NOT* be using the timeout
to, for example, allow them to do other things if no packets arrive.)
There's also an issue that, with the ring buffer, the initial
contents can be quite substantial after the fraction of a second
between pcap_open and the application's call to pcap_setfilter; for
some reason this is not so much an issue with the socket read()
interface, although buffering takes place there as well. Perhaps the
kernel (re-)filters the socket buffer when the filter is changed?
With BPF and Digital UNIX's packetfilter, changing the filter flushes
the buffer. With Linux, changing the filter doesn't flush the buffer,
so current versions of libpcap purge the buffer themselves; after you
change a filter, you don't get any packets that wouldn't have passed
it. (On platforms where filtering is done in userland, that's not an
issue.)
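For illustration, the same purge sketched at application level with
stock libpcap calls (current libpcap does this internally on Linux,
so an application shouldn't need to):

    #include <stdio.h>
    #include <pcap.h>

    /* No-op callback used while draining. */
    static void drop(u_char *user, const struct pcap_pkthdr *h,
                     const u_char *bytes)
    {
        (void)user; (void)h; (void)bytes;
    }

    /* Discard packets queued before the new filter took effect. */
    static int drain_after_setfilter(pcap_t *p, struct bpf_program *fp)
    {
        char errbuf[PCAP_ERRBUF_SIZE];

        if (pcap_setfilter(p, fp) < 0)
            return -1;
        if (pcap_setnonblock(p, 1, errbuf) < 0)
            return -1;
        while (pcap_dispatch(p, -1, drop, NULL) > 0)
            ;                        /* eat everything already buffered */
        return pcap_setnonblock(p, 0, errbuf);
    }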
I also wonder whether it might make sense to look at a libpcap-ng
development effort; this would be an upwards-compatible replacement
for libpcap that would also offer new APIs with the key/value
extensions, and support reading/dumping both classic pcap savefile
and ntar formats (possibly using the ntar library, or new code).
Clearly, this would not be for the libpcap 1.0 branch, but rather a
new libpcap 2.0.
...or libpcap 1.1; calling it libpcap 2.0 might imply binary
incompatibility, and if the library is renamed libpcap.so.2 on
ELF-based systems, that would strongly imply binary incompatibility
(i.e., programs linked with libpcap 1.x wouldn't work with 2.x even
if 2.x *is* binary-compatible).
(That's also an issue for going to libpcap 1.0 from 0.x.)