On Mon, Nov 20, 2000 at 07:31:52PM -0500, Anand Mohanadoss wrote:
> I wanted to verify if the libpcap and tcpdump implementation on
> Solaris is done in kernel or at user level.

Presumably by "the libpcap and tcpdump implementation" you're referring
only to the packet-filtering part of the implementation, i.e. the part
that decides which packets are to be viewed or saved to the capture file.

All of tcpdump and libpcap runs at user level, and is supposed to.
However, *if* the raw-packet-capture mechanism in the OS supports
kernel-level packet filtering, *and* that packet filtering uses Berkeley
Packet Filter (BPF) programs to specify the test that decides whether a
packet is copied from the kernel to user space, libpcap will use that
kernel-level packet filtering.  Otherwise, it'll cause all packets to be
copied to user space and do the filtering in user mode.

> In case it is currently being done at user level,

On Solaris, it's being done in user mode, because the kernel packet
filtering mechanism in Solaris supports only a CMU/Stanford-style packet
filter, not a BPF-style filter.

> is there some way I can contribute towards moving it to kernel level?

Well, one problem is that the CMU/Stanford-style packet filter (CSPF)
isn't as powerful as BPF; the Winter 1993 USENIX paper on BPF, a link to
which can be found at

        http://www.tcpdump.org/related.html

(it's in PostScript form) says on page 5 (at the bottom of the first
column):

        Another problem with CSPF, recognized by its designers, is its
        inability to parse variable length packet headers, e.g. TCP
        headers encapsulated in a variable length IP header.  Because
        the CSPF instruction set didn't include an indirection operator,
        only packet data at fixed offsets is accessible.

This means that, unlike BPF, CSPF doesn't allow a program to correctly
check all packets to, for example, see whether they're being sent to or
from TCP port 80; the BPF code to do that, on Ethernet, is (comments
added by me):

        /*
         * Check whether the Ethernet type field - the 16-bit field
         * starting at an offset of 12 from the beginning of the
         * Ethernet header, i.e. from the beginning of the packet -
         * has a value of 0x0800, i.e. "IP".
         */
        (000) ldh      [12]
        (001) jeq      #0x800           jt 2    jf 12

        /*
         * That test passed, so we know this is an IP packet.
         *
         * Now check whether the IP protocol field - the 8-bit field at
         * an offset of 9 from the beginning of the IP header, i.e. at
         * an offset of 9+14 from the beginning of the packet - has a
         * value of 6, i.e. "TCP".
         */
        (002) ldb      [23]
        (003) jeq      #0x6             jt 4    jf 12

        /*
         * That test passed, so we know this is a TCP packet.
         *
         * Now check whether this packet is not the first fragment of
         * an IP datagram, by checking whether the fragment offset
         * field - the bottom 13 bits of the 16-bit field at an offset
         * of 6 from the beginning of the IP header, i.e. at an offset
         * of 6+14 from the beginning of the packet - is non-zero.
         */
        (004) ldh      [20]
        (005) jset     #0x1fff          jt 12   jf 6

        /*
         * That test passed - i.e., none of the bottom 13 bits were
         * set, so the fragment offset is 0 - so we know this packet
         * has a TCP header.
         *
         * Now load the IP header length, in bytes, into the X register.
         * The IP header length is the bottom 4 bits of the byte at
         * an offset of 0 from the beginning of the IP header, i.e.
         * at an offset of 0+14 from the beginning of the packet;
         * it is in units of 32-bit words, so it must be multiplied
         * by 4 to convert it to units of 8-bit bytes.
         */
        (006) ldxb     4*([14]&0xf)

        /*
         * Now check whether the TCP source port - the 16-bit field
         * at an offset of 0 from the beginning of the TCP header,
         * i.e. at an offset of {IP header length in bytes} from
         * the beginning of the IP header, or {IP header length in
         * bytes}+14 from the beginning of the packet - has a value
         * of 80 (hex 50).
         *
         * If it does, go to the "ret #68", which means "success".
         */
        (007) ldh      [x + 14]
        (008) jeq      #0x50            jt 11   jf 9

        /*
         * Well, that test failed.  Now try the TCP destination port -
         * the 16-bit field at an offset of 2 from the beginning of
         * the TCP header, i.e. at an offset of 2+{IP header length in
         * bytes}+14, or X+16, from the beginning of the packet.
         */
        (009) ldh      [x + 16]
        (010) jeq      #0x50            jt 11   jf 12
        (011) ret      #68
        (012) ret      #0
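
For readers who find the BPF listing hard to follow, here is a rough C
rendering of the same test.  It's only a sketch: the function name
matches_tcp_port_80() and the helper load_be16() are made up for
illustration and are not part of libpcap, and the length checks are a
bit stricter than what a real BPF interpreter does (BPF rejects a
packet only when a load it actually executes runs off the end).  The
numbers in the comments refer to the BPF instructions above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit network-byte-order value at byte offset "off". */
static uint16_t load_be16(const uint8_t *pkt, size_t off)
{
    return (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
}

/*
 * Hypothetical helper, not a libpcap function: returns 1 if the
 * Ethernet frame is a first-or-only IP fragment carrying TCP with
 * source or destination port 80, 0 otherwise.
 */
static int matches_tcp_port_80(const uint8_t *pkt, size_t len)
{
    size_t x;

    assert(pkt != NULL);
    if (len < 14 + 20)                      /* Ethernet + minimal IP header */
        return 0;
    if (load_be16(pkt, 12) != 0x0800)       /* (000)-(001): not IP */
        return 0;
    if (pkt[23] != 6)                       /* (002)-(003): not TCP */
        return 0;
    if (load_be16(pkt, 20) & 0x1fff)        /* (004)-(005): later fragment */
        return 0;
    x = 4u * (pkt[14] & 0x0f);              /* (006): IP header length */
    if (x + 16 + 2 > len)                   /* both port fields must fit */
        return 0;
    if (load_be16(pkt, x + 14) == 80)       /* (007)-(008): source port */
        return 1;
    if (load_be16(pkt, x + 16) == 80)       /* (009)-(010): destination port */
        return 1;
    return 0;                               /* (012) */
}
```

The load at "x + 14" is the indirection that the quoted paragraph from
the paper is talking about: the offset depends on a value read from the
packet itself.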

That BPF program cannot be expressed in the CMU/Stanford packet filter
machine code.  Instead, I think the code would, at best, be something like

        /*
         * Push the 16-bit field at an offset of 6 16-bit words, or
         * 12 bytes, onto the stack - i.e., push the Ethernet type
         * field.
         */
        ENF_PUSHWORD+6

        /*
         * Push the constant "0x0800" onto the stack, and compare the two
         * fields at the top of the stack; fail immediately if they're
         * not equal, keep going otherwise.
         */
        ENF_PUSHLIT|ENF_CAND
        htons(0x0800)

        /*
         * OK, they matched, so we know this is an IP packet.
         * Now push the 16-bit field containing the IP protocol
         * field (CSPF can deal only with 16-bit fields) onto the
         * stack; that's at a byte offset of 8+14, or a 16-bit-word
         * offset of 11.
         */
        ENF_PUSHWORD+11

        /*
         * Push a bitmask onto the stack that, when ANDed with that word,
         * clears the byte that's not part of the IP protocol field, and
         * then AND it.  (XXX - I may get the byte order wrong here, but
         * that's not germane to this discussion.)
         */
        ENF_PUSH00FF
        ENF_AND

        /*
         * The IP protocol field, by itself, is at the top of the stack.
         * Compare it with 6, and fail immediately if they're not equal.
         */
        ENF_PUSHLIT|ENF_CAND
        6

        /*
         * OK, we know it's TCP; check the fragment offset, by pushing
         * the 16-bit quantity containing the fragment offset, ANDing
         * it with 0x1fff, and failing if it's not zero.
         */
        ENF_PUSHWORD+10
        ENF_PUSHLIT
        0x1fff
        ENF_AND
        ENF_PUSHZERO|ENF_CAND

        /*
         * OK, it's TCP, and the first or only fragment; *now* what do
         * we do?
         *
         * We can push the IP header length onto the stack, but we can't
         * use that as an offset into the packet.
         *
         * At best, we can assume the header length is the standard 20
         * bytes - i.e., that there are no IP options - and do that;
         * we could conceivably also compare the header length with
         * all possible values, and do N different checks of the TCP
         * port number, but the code to generate that would be a
         * bit gross.
         */
        left as an exercise to the reader
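
To make the "compare the header length with all possible values" idea
concrete, here is a sketch in C rather than in CSPF machine code.  The
function name and helper are hypothetical.  It models CSPF's
restrictions by reading only whole 16-bit words, and the loop stands in
for what a code generator would unroll into 11 separate blocks (header
lengths of 5 through 15 words), each with constant offsets; whether
the unrolled result would fit within a kernel's filter size limit is
not something this sketch addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the 16-bit network-byte-order word at byte offset "off". */
static uint16_t be16_at(const uint8_t *pkt, size_t off)
{
    return (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
}

/*
 * Hypothetical model of a CSPF-style "tcp port 80" test: every packet
 * access is a 16-bit word, and each pass of the loop corresponds to
 * one generated block whose offsets are all constants.
 */
static int matches_tcp_port_80_fixed(const uint8_t *pkt, size_t len)
{
    unsigned ihl;

    assert(pkt != NULL);
    if (len < 14 + 20)
        return 0;
    if (be16_at(pkt, 12) != 0x0800)            /* Ethernet type: IP? */
        return 0;
    if ((be16_at(pkt, 22) & 0x00ff) != 6)      /* TTL/protocol word: TCP? */
        return 0;
    if (be16_at(pkt, 20) & 0x1fff)             /* fragment offset non-zero */
        return 0;

    for (ihl = 5; ihl <= 15; ihl++) {
        /* In generated code, "tcp" is a constant within each block. */
        size_t tcp = 14 + 4u * ihl;

        /* Header length is the low nibble of the word's high byte. */
        if ((unsigned)(be16_at(pkt, 14) & 0x0f00u) != (ihl << 8))
            continue;
        if (tcp + 4 > len)
            return 0;
        return be16_at(pkt, tcp) == 80 || be16_at(pkt, tcp + 2) == 80;
    }
    return 0;                                  /* header length below 5 */
}
```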

I don't have Solaris handy right now, but I suspect "snoop" either

        1) punts and assumes a header length of 20

or

        2) punts and does the filtering in userland.

So if you want to do the packet filtering in the kernel of standard
Solaris, you'd first have to figure out how to generate CSPF code for
all the expressions you can specify in libpcap (including, for example,
"tcp port 80", which is what generated the BPF code above), or figure
out how to generate code for the part that *can* be done
straightforwardly in CSPF and do the rest in a userland BPF interpreter,
or completely punt to userland for the hard ones.
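
As an illustration of what "a userland BPF interpreter" amounts to,
here is a minimal sketch that can run only the seven kinds of
instruction used in the "tcp port 80" program above.  The opcode names,
struct insn, and run_filter() are invented for this sketch; libpcap's
own instruction encoding is different, and real BPF jump fields hold
offsets relative to the next instruction, whereas this sketch stores
the absolute targets as they appear in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch only. */
enum op { LDH_ABS, LDB_ABS, LDX_MSH, LDH_IND, JEQ, JSET, RET };

struct insn {
    enum op  op;
    uint32_t k;        /* constant, or packet offset */
    uint8_t  jt, jf;   /* absolute jump targets, as in the listing */
};

/* Returns the snapshot length to keep, or 0 to reject the packet. */
static uint32_t run_filter(const struct insn *prog, size_t n,
                           const uint8_t *pkt, size_t len)
{
    uint32_t a = 0, x = 0;   /* accumulator and index register */
    size_t pc = 0, off;

    assert(prog != NULL && pkt != NULL);
    while (pc < n) {
        const struct insn *i = &prog[pc];

        switch (i->op) {
        case LDH_ABS:
        case LDH_IND:
            off = i->k + (i->op == LDH_IND ? x : 0);
            if (off + 2 > len)
                return 0;            /* load past end of packet: reject */
            a = ((uint32_t)pkt[off] << 8) | pkt[off + 1];
            pc++;
            break;
        case LDB_ABS:
            if (i->k >= len)
                return 0;
            a = pkt[i->k];
            pc++;
            break;
        case LDX_MSH:                /* x = 4*([k]&0xf) */
            if (i->k >= len)
                return 0;
            x = 4u * (pkt[i->k] & 0x0f);
            pc++;
            break;
        case JEQ:
            pc = (a == i->k) ? i->jt : i->jf;
            break;
        case JSET:
            pc = (a & i->k) ? i->jt : i->jf;
            break;
        case RET:
            return i->k;
        }
    }
    return 0;                        /* ran off the end of the program */
}

/* The "tcp port 80" program from the listing above. */
static const struct insn tcp_port_80[] = {
    { LDH_ABS, 12,      0,  0 },
    { JEQ,     0x800,   2, 12 },
    { LDB_ABS, 23,      0,  0 },
    { JEQ,     0x6,     4, 12 },
    { LDH_ABS, 20,      0,  0 },
    { JSET,    0x1fff, 12,  6 },
    { LDX_MSH, 14,      0,  0 },
    { LDH_IND, 14,      0,  0 },
    { JEQ,     0x50,   11,  9 },
    { LDH_IND, 16,      0,  0 },
    { JEQ,     0x50,   11, 12 },
    { RET,     68,      0,  0 },
    { RET,     0,       0,  0 },
};
```

In a split arrangement, the kernel's CSPF filter would discard what it
can (non-IP and non-TCP packets, say), and something like run_filter()
would apply the full test to whatever reaches user space.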

Then you'd have to change the way the code generator works so that, on
systems with only CSPF support in the kernel, it can generate CSPF code
(for those cases where you can do some or all of the filtering in the
kernel) and/or BPF code (for cases where you can't do it all in the
kernel - *including* reading from a capture file rather than doing a
live capture, as you can't do any of it in the kernel there).
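
The choices described so far can be summarized as a small decision
function.  This is purely illustrative: none of these names exist in
libpcap, and "fit" stands for the hard part, namely working out how
much of a given filter expression CSPF can express.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum filter_plan {
    PLAN_KERNEL_BPF,     /* whole filter runs in the kernel as BPF */
    PLAN_KERNEL_CSPF,    /* whole filter runs in the kernel as CSPF */
    PLAN_SPLIT,          /* CSPF pre-filter in kernel, BPF in userland */
    PLAN_USERLAND_BPF    /* copy everything up, filter in userland */
};

/* How much of the expression CSPF can express. */
enum cspf_fit { CSPF_NONE, CSPF_PARTIAL, CSPF_FULL };

static enum filter_plan choose_plan(bool live_capture, bool kernel_has_bpf,
                                    bool kernel_has_cspf, enum cspf_fit fit)
{
    assert(fit == CSPF_NONE || fit == CSPF_PARTIAL || fit == CSPF_FULL);

    if (!live_capture)               /* reading a capture file */
        return PLAN_USERLAND_BPF;
    if (kernel_has_bpf)
        return PLAN_KERNEL_BPF;
    if (kernel_has_cspf && fit == CSPF_FULL)
        return PLAN_KERNEL_CSPF;
    if (kernel_has_cspf && fit == CSPF_PARTIAL)
        return PLAN_SPLIT;
    return PLAN_USERLAND_BPF;
}
```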

This is, at best, a non-trivial task; it may well be impossible if
either

        1) my assumption that you can do tests of all possible values of
           "indirect index" fields such as the IP header length field,
           and use that to select which of N different comparisons with
           fixed offsets to do, is incorrect (I suspect my assumption is
           correct, but I haven't rigorously proved it)

or

        2) kernel limitations on the size of packet filter expressions
           mean that the resulting code will, for common cases, not fit
           in the kernel (although in that case you could punt to
           userland for some or all of the testing).

Filter expressions that just check particular IP addresses are probably
a very common case, and CSPF can handle them, so this effort might pay
off in some cases, I guess.

(IPv6, however, is probably even more painful for CSPF than for BPF, and
BPF doesn't quite handle it as well as we might like; BSDI made some
additions to BPF to handle that, if I correctly remember some mail a
while ago to tcpdump.org, but I don't know whether any OSes other than
BSD/OS support them.)

An alternative would be a STREAMS module that implements BPF, which
you'd push onto the DLPI stream instead of pushing the CSPF module
"pfmod"; I think Rick Jones might have done such a module for HP-UX,
although I don't know how much pain would be involved in making it work
on Solaris.

However, that alternative would work only if you were allowed to add a
new kernel module to your system ("allowed" here means more than just
"have root access" - you typically need root access to do packet
captures at all, so one could argue that requiring root privileges to
add a kernel module isn't a problem, *but* the people running the system
might not be willing to *let* you add that module).

It's also probably not a trivial task, and if the module has a bug that
causes a system crash or hang or other such problem, I suspect the
people running the system won't be very happy with you....

So there are two not-exactly-pleasant alternatives (which is, I suspect,
one reason why nobody's done anything about it yet).
-
This is the TCPDUMP workers list. It is archived at
http://www.tcpdump.org/lists/workers/index.html