On Dec 6, 2007, at 9:54 AM, Alexander Dupuy wrote:
Paolo Abeni writes:
It does not use environment variables to control the memory-mapped
ring parameters; instead, the requested snaplen is used: the
low-order bytes select the ring frame size and the high-order bytes
select the ring frame count. If the high-order bytes are 0, as in
all current libpcap usage, a reasonable default is used.
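A minimal sketch of the encoding described above, assuming a
16/16-bit split between the "low order" and "high order" parts (the
excerpt doesn't say where the boundary actually lies):

    /* Sketch of the snaplen overloading described above; the 16/16-bit
     * split and the default are assumptions, not values from the patch. */
    #include <stdio.h>

    #define DEFAULT_FRAME_NR 4096           /* hypothetical default */

    int main(void)
    {
        int snaplen = (256 << 16) | 2048;   /* 256 frames of 2048 bytes */

        unsigned frame_size = (unsigned)snaplen & 0xffff;         /* low-order bytes */
        unsigned frame_nr   = ((unsigned)snaplen >> 16) & 0xffff; /* high-order bytes */

        if (frame_nr == 0)                  /* all current libpcap usage */
            frame_nr = DEFAULT_FRAME_NR;

        printf("frame_size=%u frame_nr=%u\n", frame_size, frame_nr);
        return 0;
    }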
Using the snaplen for the ring frame size certainly makes sense, but
I'm uncomfortable with overloading the high-order bytes to specify
the number of ring slots. The right way to do this is to provide a
general interface for setting the buffer size (WinPcap has a
system-specific extension), but the problem in the past has been
that some systems, such as the *BSDs, require the buffer size to be
set when opening the underlying (in this case, BPF) device. Guy
Harris has in the past talked about an extensible key/value list
parameter to pcap_open, but as far as I know, nothing has ever come
of it. In our version, we use a pcap_setbufsize function, like
WinPcap's pcap_setbuff, but this only works on Linux (packet socket
read or mmap), Windows (WinPcap), and SunOS 3.x and SGI IRIX (the
only other systems that use sockets for packet capture and support
SO_RCVBUF).
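pcap_setbufsize() is a local extension in Dupuy's tree, not a stock
libpcap API; a plausible shape for it, assumed by analogy with
WinPcap's pcap_setbuff(), might look like the following (this won't
link against stock libpcap):

    #include <stdio.h>
    #include <pcap.h>

    /* Hypothetical prototype; pcap_setbufsize() is a local extension,
     * assumed analogous to WinPcap's pcap_setbuff(). */
    int pcap_setbufsize(pcap_t *p, int bufsize);

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 0, errbuf);

        if (p == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }
        /* Ask for a 4 MB capture buffer; presumably this would map to
         * SO_RCVBUF or the mmap ring on Linux, pcap_setbuff on Windows. */
        if (pcap_setbufsize(p, 4 * 1024 * 1024) != 0)
            fprintf(stderr, "pcap_setbufsize: %s\n", pcap_geterr(p));

        pcap_close(p);
        return 0;
    }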
How does pcap_setbufsize() differ from pcap_setbuff()?
Rounding the ring frame size up to the nearest power of two wastes
quite a bit of memory for full capture on standard Ethernet
(1514-byte frames in 2048-byte slots, ~26% wasted) and even more for
typical jumbo frames (9000 bytes in 16384-byte slots, ~45% wasted).
How exactly does this simplify ring navigation? I don't recall
seeing this in any other pcap-mmap implementation (admittedly, I
never looked too closely at Phil Wood's code). Also, what do you do
when the snaplen is zero - implying maximum packet size?
0 is not a valid snapshot length in pcap_open_live(); it's valid in
current versions of tcpdump, but only because tcpdump maps it to
65535. 65535 is also the default in Wireshark.
If that wastes wired-down ring-buffer memory, the right thing to do
is probably, as you note, to use the interface MTU (although you
have to add not only the maximum link-layer header size but also
room for things such as the radio header on 802.11 adapters).
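A sketch of MTU-based frame sizing along those lines; it is
Linux-specific (SIOCGIFMTU), and the link-layer and radio-header
allowances are illustrative guesses, not values from any patch:

    /* Size ring frames from the interface MTU instead of a power of
     * two.  The header allowances below are assumptions. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>

    #define MAX_LINKHDR   32   /* assumed worst-case link-layer header */
    #define MAX_RADIOHDR  64   /* assumed allowance for 802.11 radio headers */

    static int frame_size_for(const char *dev)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, dev, sizeof(ifr.ifr_name) - 1);
        if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
        /* MTU plus room for link-layer and radio-style headers. */
        return ifr.ifr_mtu + MAX_LINKHDR + MAX_RADIOHDR;
    }

    int main(void)
    {
        printf("frame size: %d\n", frame_size_for("eth0"));
        return 0;
    }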
I wonder if your "power-of-two" approach is just covering up some
memory overflow problems. I also notice that you are limiting the
number of ring slots to 128K (MAX_BLOCK_NR). While this is correct
for 32-bit i386 Linux 2.4 (and earlier) kernels, the values are
different on other architectures, and the kmalloc limit no longer
applies to 2.6 kernels (there are other limits, though). My version
uses a binary-search approach to find a working buffer size if the
requested one fails when allocating the ring buffer; this isn't
ideal, but it is more practical across different kernels and
simplifies application programming considerably.
...and is similar to what's done for the BPF buffer size.
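A sketch of that kind of fallback on Linux: ask the kernel for a
PACKET_RX_RING and halve the requested block count until setsockopt()
accepts it (a simple back-off rather than a true binary search; the
page-sized blocks and the error handling are assumptions):

    #include <string.h>
    #include <errno.h>
    #include <sys/socket.h>
    #include <linux/if_packet.h>

    /* Assumes frame_size divides the 4096-byte block size. */
    static int setup_ring(int fd, unsigned int frame_size,
                          unsigned int want_frames)
    {
        unsigned int frames_per_block = 4096 / frame_size;
        struct tpacket_req req;

        memset(&req, 0, sizeof(req));
        req.tp_block_size = 4096;          /* one page per block, assumed */
        req.tp_frame_size = frame_size;
        req.tp_block_nr   = want_frames / frames_per_block;

        while (req.tp_block_nr > 0) {
            req.tp_frame_nr = req.tp_block_nr * frames_per_block;
            if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING,
                           &req, sizeof(req)) == 0)
                return (int)req.tp_frame_nr;  /* ring of this many frames */
            if (errno != ENOMEM)
                return -1;                    /* real failure, give up */
            req.tp_block_nr /= 2;             /* back off and retry */
        }
        return -1;
    }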
Using a zero timeout to indicate "wait forever" introduces some
compatibility and consistency problems; the original (and probably
best) use of the timeout is for in-kernel delays. The
application-level read timeout (or lack thereof) is better taken
from the pcap_setnonblock() call (i.e., wait forever is the default
unless nonblocking mode is set). If I'm not mistaken, this is the
current behavior of the socket read() implementation on Linux. You
also have to be much more careful about multiple calls to poll()
within the loop, due to interrupts and the interface going down, and
you have to handle pcap_breakloop() correctly.
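A sketch of the kind of poll() loop being warned about here: EINTR
must not abort the capture loop, an interface going down shows up in
revents, and a pcap_breakloop()-style flag has to be re-checked
after every wakeup (the handle structure is a stand-in, not
libpcap's real pcap_t):

    #include <errno.h>
    #include <poll.h>

    struct capture_handle {              /* stand-in for the real pcap_t */
        int fd;
        volatile int break_loop;         /* set by pcap_breakloop() */
    };

    static int wait_for_frames(struct capture_handle *h, int timeout_ms)
    {
        struct pollfd pfd;

        pfd.fd = h->fd;
        pfd.events = POLLIN;
        for (;;) {
            if (h->break_loop) {
                h->break_loop = 0;
                return -2;               /* the PCAP_ERROR_BREAK convention */
            }
            /* timeout_ms == -1 means wait forever, as when nonblocking
             * mode is not set. */
            int n = poll(&pfd, 1, timeout_ms);
            if (n > 0) {
                if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
                    return -1;           /* e.g. the interface went down */
                return 0;                /* frames are ready */
            }
            if (n == 0)
                return 1;                /* timeout expired */
            if (errno != EINTR)
                return -1;               /* real error */
            /* EINTR: a signal interrupted us; loop and re-check
             * break_loop before polling again. */
        }
    }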
On platforms where the timeout is supported, 0 means "wait forever";
to quote the pcap man page's description of pcap_open_live():
    to_ms specifies the read timeout in milliseconds. The read timeout
    is used to arrange that the read not necessarily return immediately
    when a packet is seen, but that it wait for some amount of time to
    allow more packets to arrive and to read multiple packets from the OS
    kernel in one operation. Not all platforms support a read timeout;
    on platforms that don't, the read timeout is ignored. A zero value
    for to_ms, on platforms that support a read timeout, will cause a read
    to wait forever to allow enough packets to arrive, with no timeout.
Linux is one of the platforms that doesn't support a read timeout.
Note that it is *NOT* guaranteed that a read will complete within
"to_ms" milliseconds; on Solaris, for example, the timer doesn't start
until at least one packet is seen, so the read could block forever if
no packets arrive. (Applications should *NOT* be using the timeout
to, for example, allow them to do other things if no packets arrive.)
There's also an issue that, with the ring buffer, the initial
contents can be quite substantial after the fraction of a second
between pcap_open and the application's call to pcap_setfilter; for
some reason this is not so much an issue with the socket read()
interface, although buffering takes place there as well. Perhaps the
kernel (re-)filters the socket buffer when the filter is changed?
With BPF and Digital UNIX's packetfilter, changing the filter flushes
the buffer. With Linux, changing the filter doesn't flush the buffer,
so current versions of libpcap purge the buffer themselves; after you
change a filter, you don't get any packets that wouldn't have passed
it. (On platforms where filtering is done in userland, that's not an
issue.)
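For illustration, the same purge sketched at application level with
stock libpcap calls (current libpcap does this internally on Linux,
so an application shouldn't need to):

    #include <stdio.h>
    #include <pcap.h>

    /* No-op callback used while draining. */
    static void drop(u_char *user, const struct pcap_pkthdr *h,
                     const u_char *bytes)
    {
        (void)user; (void)h; (void)bytes;
    }

    /* Discard packets queued before the new filter took effect. */
    static int drain_after_setfilter(pcap_t *p, struct bpf_program *fp)
    {
        char errbuf[PCAP_ERRBUF_SIZE];

        if (pcap_setfilter(p, fp) < 0)
            return -1;
        if (pcap_setnonblock(p, 1, errbuf) < 0)
            return -1;
        while (pcap_dispatch(p, -1, drop, NULL) > 0)
            ;                        /* eat everything already buffered */
        return pcap_setnonblock(p, 0, errbuf);
    }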
I also wonder whether it might make sense to look at a libpcap-ng
development effort; this would be an upwards-compatible replacement
for libpcap that would also offer new APIs with the key/value
extensions, and support reading/dumping both classic pcap savefile
and ntar formats (possibly using the ntar library, or new code).
Clearly, this would not be for the libpcap 1.0 branch, but rather a
new libpcap 2.0.
...or libpcap 1.1; calling it libpcap 2.0 might imply binary
incompatibility, and if the library is renamed libpcap.so.2 on
ELF-based systems, that would strongly imply binary incompatibility
(i.e., programs linked with libpcap 1.x wouldn't work with 2.x even
if 2.x *is* binary-compatible).
(That's also an issue for going to libpcap 1.0 from 0.x.)