http://gicl.cs.drexel.edu/people/sevy/network/Linux_network_stack_walkthrough.html


Linux Network Stack Walkthrough (2.4.20)


The following is a walkthrough of the routines in the Linux 2.4.20 network stack, focusing on IP networking. I wrote it up in an attempt to understand in some detail how the Linux networking code functions. The walkthrough follows the sequence of function calls that occur for received and locally generated network traffic. For each function, it gives a short description of what each successive clause (a related group of lines) does and indicates which routines are called from within the function. Corresponding line numbers are listed for each clause; the Linux source code can be conveniently inspected using the Linux Cross Reference website. This document thus provides a sort of annotated reference to the source code. It is still a work in progress, so additional explanation will be added; in particular, question marks indicate that a clause needs more description. I hope eventually to add a more discursive treatment, with overviews at different levels of detail.




Received traffic:

Core (layer 2, protocol-independent) receive routines:

  • net/core/dev.c: netif_rx( )
    Receive routine, called by driver when frame received on interface; enqueues frame on current processor's packet receive queue, and kickstarts receive routine
    • updates stats, applies socketbuffer timestamp (1237)
    • check processor's softnet_data input packet queue length: if less than maximum, (1248)
      • if queue length is 0 (before skb added), call netif_rx_schedule(queue->blog_dev) to schedule current processor's softnet_data backlog device to have queued received packets handled (1271)
        • the poll method of the "pseudo-dev" queue->blog_dev, associated with the cpu socket queue, is process_backlog( )
    • call __skb_queue_tail to enqueue socketbuffer on current processor's input packet queue (1255)
    • return the queue's congestion level (queue->cng_level)(?) (1260)

  • include/linux/netdevice.h: netif_rx_schedule( )
    Called to add a processor's backlog_dev to the processor's receive poll list, and to raise the NET_RX_SOFTIRQ to get the network code to handle received frames
    • call netif_rx_schedule_prep( ) to make sure device is running and has not already been added to poll list (746); if true,
      • add device to current processor's softnet_data rx poll list (733)
      • adjust dev->quota (?) (734)
      • raise softirq NET_RX_SOFTIRQ (738)
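
The enqueue-and-schedule behaviour of netif_rx( ) and netif_rx_schedule( ) described above can be sketched as a small user-space model. This is Python, not kernel code: the SoftnetData class, the NETDEV_MAX_BACKLOG value, and the boolean flags are assumptions made for illustration, and the congestion-level bookkeeping is left out.

```python
from collections import deque

NETDEV_MAX_BACKLOG = 300  # assumed queue limit for this model


class SoftnetData:
    """Toy stand-in for one processor's softnet_data (hypothetical fields)."""

    def __init__(self):
        self.input_pkt_queue = deque()
        self.poll_list = []            # backlog "devices" waiting to be polled
        self.backlog_scheduled = False
        self.softirq_raised = False
        self.dropped = 0


def netif_rx_schedule(sd):
    """Add the backlog device to the poll list once, then raise the softirq."""
    if sd.backlog_scheduled:           # models the netif_rx_schedule_prep() test
        return
    sd.backlog_scheduled = True
    sd.poll_list.append("blog_dev")
    sd.softirq_raised = True           # stands in for raising NET_RX_SOFTIRQ


def netif_rx(sd, skb):
    """Enqueue a frame; return True if queued, False if dropped."""
    if len(sd.input_pkt_queue) >= NETDEV_MAX_BACKLOG:
        sd.dropped += 1
        return False
    if not sd.input_pkt_queue:         # queue was empty: kick the receive path
        netif_rx_schedule(sd)
    sd.input_pkt_queue.append(skb)
    return True
```

In this model, the first frame to arrive on an empty queue is the only one that touches the poll list; later frames are simply appended until the softirq handler drains the queue.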

  • net/core/dev.c: net_rx_action( )
    Receive softirq handler, registered for NET_RX_SOFTIRQ. Runs through the current processor's receive poll list, which is a list of "backlog devices" associated with the processor's packet input queues, and calls each backlog device's poll method. The poll method of each backlog device is process_backlog( ), assigned during system initialization in net_dev_init( ); this routine dequeues packets and calls netif_receive_skb( ) for each.
    • set budget for handling received traffic to netdev_max_backlog (1563)
    • loop through current processor's softnet_data poll list, consisting of blog_dev's of processor packet input queues (1568): for each device on the list
      • if budget for handling network receive traffic is exhausted, or 1 jiffy has elapsed, reschedule this softirq and exit (1571)
      • if device quota exhausted, put device back on end of queue, continue (1578)
      • call dev->poll( ), which equals process_backlog( ) (assigned for each blog_dev in net_dev_init( )) (1578)
        • if dev->poll returns non-zero, indicating packets remaining to be handled, put device back on end of queue, adjust device quota(?), continue (1579 - 1585)
Question??? Why loop through poll list? Seems like only one item could ever be on list, which is processor's softnet_data blog_dev???

  • net/core/dev.c: process_backlog( )
    Assigned as the poll method of each cpu's socket queue's backlog device (blog_dev) in net_dev_init( ); the backlog device is added to the poll list (if not already present) whenever netif_rx( ) is called. This routine is called from within the net_rx_action( ) receive softirq routine, and in turn dequeues packets and passes them for further processing to netif_receive_skb( ).
    • loops, dequeuing packets from processor's input queue until queue empty, quota exceeded, or 1 jiffy passes
      • calls netif_receive_skb( ) for each dequeued packet (1504)
    • adjust value of budget parameter passed in, and device quota, decrementing by number of packets dequeued (1536, 1542 )
    • if queue empty, remove dev from poll list, return 0 (1544)
    • if queue not empty, return -1 (1538)
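
The interplay between the softirq budget and the per-device quota in net_rx_action( ) and process_backlog( ) can be sketched as follows. This is a simplified Python model under stated assumptions: the BacklogDev class and the (work, more) return value are hypothetical, the 1-jiffy time limit is omitted, and the quota refill is reduced to a plain reset.

```python
from collections import deque


class BacklogDev:
    """Toy backlog device: a packet queue plus a per-device quota."""

    def __init__(self, packets, weight):
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.queue = deque(packets)
        self.weight = weight
        self.quota = weight


def process_backlog(dev, budget, deliver):
    """Dequeue at most min(quota, budget) packets.

    Returns (work, more): packets handled, and whether any remain queued.
    """
    limit = min(dev.quota, budget)
    work = 0
    while dev.queue and work < limit:
        deliver(dev.queue.popleft())   # stands in for netif_receive_skb()
        work += 1
    dev.quota -= work
    return work, bool(dev.queue)


def net_rx_action(devices, budget, deliver):
    """Poll devices round-robin; return True if a reschedule is needed."""
    poll_list = deque(devices)
    while poll_list:
        if budget <= 0:
            return True                # budget exhausted: softirq re-raised
        dev = poll_list.popleft()
        if dev.quota <= 0:
            dev.quota = dev.weight     # refill and move to the end of the list
            poll_list.append(dev)
            continue
        work, more = process_backlog(dev, budget, deliver)
        budget -= work
        if more:
            poll_list.append(dev)      # packets remain: poll again later
    return False
```

For example, a device holding five packets with a weight of 2 and a budget of 10 is polled three times (2, 2, then 1 packet), with a quota refill between polls.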

  • net/core/dev.c: netif_receive_skb( )
    Main device receive routine, called from within NET_RX_SOFTIRQ softirq handler. Checks the payload type, and calls any handler(s) registered for that type (or bridges frame if bridging enabled on incoming interface). For IP traffic, the registered handler is ip_rcv( ).
    • updates stats (1426)
    • passes socketbuffer to any handlers registered for packet type ETH_P_ALL (i.e., handlers to be executed for all packets) (1438)
    • calls handle_bridge, if bridging enabled (1457)
      • calls br_handle_frame_hook, returns
    • passes socketbuffer to any handlers registered for specific payload type (e.g., ARP, IP handlers) (1464)
      • Since handlers stored in simple hashtable, check each handler in list to make sure it is in fact for specified payload type (1465)
Note: when searching for packet handler in ptype list, always "one behind" in search loop (pt_prev)
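
The handler search with its "one behind" (pt_prev) loop can be modelled as below. This is an illustrative Python sketch, not the kernel routine: dispatch_frame, the 16-bucket table indexed by type & 15, and the "extra_ref"/"final" labels are assumptions. The labels reflect one plausible reading of why the loop runs one behind: every handler except the last needs its own reference to the buffer, while the last can be handed the original.

```python
ETH_P_IP = 0x0800
ETH_P_ARP = 0x0806
ETH_P_8021Q = 0x8100


def dispatch_frame(frame_type, ptype_all, ptype_base):
    """Return (handler, mode) pairs in the order handlers would run.

    ptype_all:  handlers registered for every frame (ETH_P_ALL).
    ptype_base: 16 hash buckets, each a list of (type, handler) entries.
    """
    calls = []
    pt_prev = None                         # handler found but not yet run

    for handler in ptype_all:
        if pt_prev is not None:
            calls.append((pt_prev, "extra_ref"))
        pt_prev = handler

    for ptype, handler in ptype_base[frame_type & 15]:
        if ptype != frame_type:            # same bucket, different type
            continue
        if pt_prev is not None:
            calls.append((pt_prev, "extra_ref"))
        pt_prev = handler

    if pt_prev is not None:
        calls.append((pt_prev, "final"))   # last handler gets the original
    return calls
```

Because several payload types can land in the same bucket (0x0800 and 0x8100 both hash to bucket 0 here), each entry's type has to be compared explicitly before its handler is used.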


IP (layer 3) receive routines:

  • net/ipv4/ip_input.c: ip_rcv( )
    Main IP receive routine, called from netif_receive_skb( ) when an IP packet is received on an interface. Together with ip_rcv_finish( ), checks packet, passes to netfilter hook, gets route for packet and assigns it to skb->dst (if netfilter doesn't assign one), and calls skb->dst->input( ) to complete packet routing.
    • drop IP_OTHERHOST packets (386)
    • update stats (389)
    • call skb_share_check( ) to clone socketbuffer if it’s shared (391)
    • make sure socketbuffer is large enough to contain IP header, header size field is reasonable, version correct, header checksum correct (394 - 418)
    • check that total packet length reported in header is OK (bigger than header and less than total socket buffer size) (423)
    • trim socketbuffer to correct size (430)
    • call netfilter NF_IP_PRE_ROUTING hook (437)
      • call ip_rcv_finish (if netfilter hook didn't steal packet)
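
The header sanity checks listed above (buffer size, header length field, version, header checksum, total length, trimming) can be illustrated with a short Python sketch operating on raw bytes. The function names ip_checksum and check_ip_header are hypothetical, and the order of the checks is an assumption rather than a transcription of the kernel source.

```python
import struct


def ip_checksum(header: bytes) -> int:
    """Internet checksum over 16-bit words; 0 means a header verifies."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def check_ip_header(packet: bytes):
    """Return the packet trimmed to its IP total length, or None to drop."""
    if len(packet) < 20:                    # too short for a minimal header
        return None
    version, ihl = packet[0] >> 4, packet[0] & 0x0F
    if version != 4 or ihl < 5:
        return None
    hdr_len = ihl * 4
    if len(packet) < hdr_len:               # options run past the buffer
        return None
    if ip_checksum(packet[:hdr_len]) != 0:
        return None
    (tot_len,) = struct.unpack_from("!H", packet, 2)
    if tot_len < hdr_len or tot_len > len(packet):
        return None
    return packet[:tot_len]                 # drop any link-layer padding
```

The final slice corresponds to the trimming step: a short IP packet carried in a minimum-size Ethernet frame arrives with padding that is not part of the datagram.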

  • net/ipv4/ip_input.c: ip_rcv_finish( )
    • call ip_route_input( ) if skb->dst is NULL - i.e., if netfilter hook didn't assign its own routing-table entry (315)
      • if return is error value, drop packet
    • process IP header options, if any (331 - 365)
      • ???
    • return skb->dst->input( ) (367)

Routing (received packets):

  • net/ipv4/route.c: ip_route_input( )
    Find route for supplied packet (skb), looking first in route cache hashtable and calling ip_route_input_slow( ) if not found, and assign result to skb->dst.
    • get hash from supplied key value (1630)
    • loop through hash entries for one matching supplied src addr, dst addr, tos, iif, oif == 0 (1632)
      • if rtable found, assign it to skb->dst, update stats, return
    • handle multicast dest addr (1664)
      • if multicast address is one we've registered to receive, or we're a multicast router, call ip_route_input_mc( ), return
    • call ip_route_input_slow( )
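
The fast-path/slow-path split in ip_route_input( ) amounts to a cache lookup with a fallback. A minimal Python sketch follows, assuming a dict in place of the kernel's hash chains; the RouteCache class and its hit/miss counters are hypothetical, multicast handling is omitted, and the cache insertion is done here in the caller whereas the walkthrough places it inside ip_route_input_slow( ) via rt_intern_hash( ).

```python
class RouteCache:
    """Toy route cache keyed on (src, dst, tos, iif, oif)."""

    def __init__(self, slow_lookup):
        self.table = {}
        self.slow_lookup = slow_lookup   # stands in for ip_route_input_slow()
        self.hits = 0
        self.misses = 0

    def route_input(self, src, dst, tos, iif):
        key = (src, dst, tos, iif, 0)    # oif == 0 for received packets
        route = self.table.get(key)
        if route is not None:
            self.hits += 1
            return route
        self.misses += 1
        route = self.slow_lookup(src, dst, tos, iif)
        if route is not None:
            self.table[key] = route      # later packets take the fast path
        return route
```

Only the first packet of a flow pays for the full lookup; later packets with the same key are resolved from the table.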

  • net/ipv4/route.c: ip_route_input_slow( )
    Main routing routine for incoming packets when entry not in route cache hashtable. Get appropriate dst_entry struct from "slow" routing tables, assign it to skb->dst and add it to the route cache hashtable
    • check that the supplied (incoming) interface has IP enabled, i.e., that there's an in_dev associated with it, otherwise drop
    • create a new key based on the supplied dst addr, src addr, tos, iif, oif == 0, and scope RT_SCOPE_UNIVERSE (1332 - 1342)
    • check for bad source, destination address (“martians”), log errors and return
    • call fib_lookup( ) to get route (1366)
      • uses key, assigns value to fib_result
      • exits if error, i.e., no route found or input interface not configured for forwarding
    • handle NAT stuff (1375 - 1398)
      • ???
    • if fib_result type RTN_BROADCAST (1400)
      • discard if packet type not IP (1511)
      • check source address
        • call ip_get_address( ) to get source address if supplied src addr is 0, else call fib_validate_source( ) to make sure source address is as expected
      • create and populate a dst_entry struct (1528 - 1567)
        • call rt_set_nexthop to set mtu information in dst_entry from fib_result (1485)
      • call rt_intern_hash( ) (1501)
        • add dst_entry info to the route cache hashtable
        • call arp_bind_neighbour( ) if unicast or output (oif == 0) route
        • assign skb->dst to hold dst_entry
    • if fib_result type RTN_LOCAL (1403)
      • create and populate a dst_entry struct (1528 - 1567)
        • call rt_set_nexthop to set mtu information in dst_entry from fib_result (1485)
      • call rt_intern_hash( ) (1501)
        • add dst_entry info to the route cache hashtable
        • call arp_bind_neighbour( ) if unicast or output (oif == 0) route
        • assign skb->dst to hold dst_entry
    • Otherwise (packet to be unicast-forwarded)
      • check if input interface not configured for forwarding, or fib_result type not RTN_UNICAST (1416 - 1419)
        • exit if error
      • get output device associated with fib_result (1425)
      • call fib_validate_source( ) to make sure source address is as expected (1433)
        • exit if error
      • set RTCF_DOREDIRECT, RTCF_DIRECTSRC flags if needed (???) (1441)
      • reject packet if socketbuffer protocol not IP (e.g., ARP) and RTCF_DNAT flag not set
      • create and populate a dst_entry struct (1454 - 1487)
        • call rt_set_nexthop to set mtu information in dst_entry from fib_result (1485)
      • call rt_intern_hash( ) (1501)
        • add dst_entry info to the route cache hashtable
        • call arp_bind_neighbour( ) if unicast or output (oif == 0) route
        • assign skb->dst to hold dst_entry

Local delivery:

  • net/ipv4/ip_input.c: ip_local_deliver( )
    Assigned as the dst->input routine for local routes in ip_route_input_slow( ); called from ip_rcv_finish( ), via skb->dst->input( ), after route assigned to skb->dst
    • if needed, call ip_defrag( ) to defragment packet (296)
      • returns reassembled packet if complete, or NULL if not
    • call netfilter hook NF_IP_LOCAL_IN (302)
      • call ip_local_deliver_finish( ) if netfilter hook didn't "steal" packet

  • net/ipv4/ip_input.c: ip_local_deliver_finish( )
    Find layer-4 protocol handler(s) for packet, call handler's receive function
    • pull IP header from packet (227)
    • call nf_conntrack_put to release packet from netfilter connection-tracking module (232)
    • check raw socket hashtable to see if there are any potential raw sockets registered (250)
      • if so, call raw_v4_input( )
        • call __raw_v4_lookup( ) in loop to get successive sockets matching info (addresses, etc.)
        • for each socket except last, clone the socketbuffer and call raw_rcv( ) (???)
          • call sock_queue_rcv_skb( ) (???)
          • return the last socket (or NULL if none)
    • find protocol handlers for packet payload type (255)
      • if single handler and no raw sockets, call its handler( ) function, return
      • if multiple handlers, or raw sockets, call ip_run_ipprot( )
        • run through handlers, cloning socket buffer for any handlers with copy field set
        • return 1 if at least one handler handles packet, 0 if none handle it
    • call raw_rcv( ) for last raw socket, if any (275)
    • otherwise, if no protocol handler handled packet, and no raw sockets, send ICMP error message (279)
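
The delivery order in ip_local_deliver_finish( ) (clones to all but the last raw socket, then the protocol handlers, then the original buffer to the last raw socket, or an ICMP error if nobody took the packet) can be sketched as follows. This is a Python model with hypothetical names and data shapes; it records what would be called instead of calling anything.

```python
def deliver_local(protocol, raw_sockets, handlers):
    """Return the ordered list of delivery actions for one packet.

    raw_sockets: list of (socket_name, protocol) pairs.
    handlers:    dict mapping protocol number -> list of handler names.
    """
    matching = [name for name, proto in raw_sockets if proto == protocol]
    # every matching raw socket except the last receives a clone
    actions = [("raw_clone", name) for name in matching[:-1]]
    last_raw = matching[-1] if matching else None

    handled = False
    for handler in handlers.get(protocol, []):
        actions.append(("handler", handler))
        handled = True

    if last_raw is not None:
        actions.append(("raw_last", last_raw))   # gets the original buffer
    elif not handled:
        actions.append(("icmp_error", protocol)) # nobody wanted the packet
    return actions
```

For instance, an ICMP packet (protocol 1) with two raw sockets bound yields a clone for the first socket, the ICMP handler, and then the original buffer for the second socket.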



Forwarding:

  • net/ipv4/ip_forward.c: ip_forward( )
    Assigned as the dst->input routine for non-local routes in ip_route_input_slow( ); called from ip_rcv_finish( ), via skb->dst->input( ), after route assigned to skb->dst
    • check socketbuffer opt.router_alert field, return if true (?) (81)
    • drop packet if type isn't PACKET_HOST (84)
    • discard packet, send ICMP time-exceeded if TTL <= 1 (98)
    • drop packet, send ICMP host unreachable message if strict routing indicated and next-hop inappropriate (101)
    • send route-redirect message if indicated by assigned route (117)
    • copy socket buffer so can mangle entries (decrease TTL, etc.) (121)
    • decrement TTL (126)
    • if packet size greater than MTU and don't-fragment flag is set, send ICMP destination-unreachable/fragmentation-needed message (133)
    • if route uses fast NAT (indicated by rt_flags including RTCF_NAT), call ip_do_nat( ) (137)
    • call netfilter hook NF_IP_FORWARD, with completion routine ip_forward_finish( ) if packet not stolen
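
The main accept/reject decisions in ip_forward( ) can be condensed into a small decision function. This Python sketch covers only the packet-type, TTL, and MTU/don't-fragment checks; router alert, strict source routing, redirects, and NAT are omitted, and the function name and return values are hypothetical.

```python
def forward_decision(pkt_type, ttl, length, mtu, dont_fragment):
    """Return (action, detail) for a packet about to be forwarded."""
    if pkt_type != "PACKET_HOST":
        return ("drop", None)                    # not addressed to us at layer 2
    if ttl <= 1:
        return ("icmp", "time-exceeded")
    if length > mtu and dont_fragment:
        return ("icmp", "fragmentation-needed")
    return ("forward", ttl - 1)                  # new TTL after decrement
```

A packet that is too large but has the don't-fragment flag clear still returns "forward" here; it is fragmented later on the output path.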

  • net/ipv4/ip_forward.c: ip_forward_finish( )
    • update forwarding statistics (48)
    • if no IP options:
      • handle fast routing if enabled (?) (51 - 65)
      • return ip_send( ) (66)
    • if IP options:
      • call ip_forward_options( ) (69)
      • return ip_send( ) (70)

Output:

IP (layer 3) output routines:

  • include/net/ip.h: ip_send( )
    • if packet length greater than next-hop MTU
      • call ip_fragment( ), with completion function ip_finish_output( ) (165)
    • call ip_finish_output( ) (167)
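
When ip_send( ) finds the packet larger than the next-hop MTU, the payload has to be split so that every fragment except the last carries a multiple of 8 bytes, since the IP fragment offset field counts 8-byte units. The following Python sketch computes such a split; the function name and the tuple layout are assumptions for illustration, and it is not a transcription of ip_fragment( ).

```python
def fragment(payload_len, mtu, header_len=20):
    """Return (offset_in_8_byte_units, length, more_fragments) per fragment."""
    if header_len + payload_len <= mtu:
        return [(0, payload_len, False)]        # fits: no fragmentation
    chunk = (mtu - header_len) & ~7             # round down to a multiple of 8
    if chunk <= 0:
        raise ValueError("MTU too small to carry any payload")
    frags = []
    offset = 0
    while offset < payload_len:
        size = min(chunk, payload_len - offset)
        more = offset + size < payload_len
        frags.append((offset // 8, size, more))
        offset += size
    return frags
```

With a 1500-byte MTU and a 20-byte header, a 3000-byte payload splits into fragments of 1480, 1480, and 40 bytes at offsets 0, 185, and 370.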

  • net/ipv4/ip_output.c: ip_finish_output( )
    • call netfilter hook NF_IP_POST_ROUTING, with completion function ip_finish_output2( ) (191)

  • net/ipv4/ip_output.c: ip_finish_output2( )
    • if destination entry has a cached hardware header, return its hh_output( ) method (174)
      • for ARP, hh_output( ) is dev_queue_xmit( )
    • if no hardware header cache, return destination's neighbour->output( ) method (176)
      • for ARP, the output( ) method is neigh_resolve_output( )

Core (layer 2) output routines:

  • net/core/dev.c: dev_queue_xmit( )
    • linearize socket buffer, if needed (996 - 1011)
    • checksum packet, if needed (1017 - 1022)
    • if device has a queue (qdisc):
      • enqueue packet (1029)
      • call qdisc_run( ) for the output device, which calls qdisc_restart( ) if device is not stopped:
        • if netdev_nit is nonzero, there are protocol handler(s) registered for ETH_P_ALL; call dev_queue_xmit_nit( ) (???)
        • call dev->hard_start_xmit( )
    • if device has no queue
      • if netdev_nit is nonzero, there are protocol handler(s) registered for ETH_P_ALL; call dev_queue_xmit_nit( ) (???) (1057)
      • call dev->hard_start_xmit( ) (1060)
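
The two transmit paths in dev_queue_xmit( ), queued and unqueued, can be sketched as below. This is a Python model under stated assumptions: the Device class is hypothetical, a plain FIFO stands in for the queueing discipline, the taps argument stands in for the ETH_P_ALL handlers reached through dev_queue_xmit_nit( ), and linearization and checksumming are omitted.

```python
from collections import deque


class Device:
    """Toy network device, optionally with a FIFO queue in front of it."""

    def __init__(self, has_qdisc, stopped=False):
        self.qdisc = deque() if has_qdisc else None
        self.stopped = stopped
        self.sent = []

    def hard_start_xmit(self, skb):
        self.sent.append(skb)          # stands in for the driver's transmit


def qdisc_run(dev, taps):
    """Drain the queue while the device is able to transmit."""
    while dev.qdisc and not dev.stopped:
        skb = dev.qdisc.popleft()
        for tap in taps:               # packet sniffers see outgoing frames
            tap(skb)
        dev.hard_start_xmit(skb)


def dev_queue_xmit(dev, skb, taps=()):
    if dev.qdisc is not None:
        dev.qdisc.append(skb)          # enqueue, then try to send
        qdisc_run(dev, taps)
    else:
        for tap in taps:
            tap(skb)
        dev.hard_start_xmit(skb)       # queueless devices send immediately
```

If the device is stopped, the packet stays on the queue and goes out on a later qdisc_run( ) once the device is running again.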




Locally generated traffic:


UDP (layer 4) send routines:

  • net/ipv4/udp.c: udp_sendmsg( )
    • check length, flags
    • set destination address and port from supplied sock or msghdr
    • set source address, port from sock
    • set up ipc (an ipcm_cookie holding address, options, and output interface), and process any control messages supplied in the msghdr (?)
    • get TOS bits, and set RTO_ONLINK if sk->localroute set
    • find routing information (rtable struct):
      • call sk_dest_check to see if sock already has destination attached (516)
      • if not, call ip_route_output (518)
        • calls ip_route_output_key to get rtable
        • call sk_dst_set( ) to attach rtable to socket for future use if connected
    • call ip_build_xmit( ) to send packet (passing in rtable struct) (547)
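
The route handling in udp_sendmsg( ) is a per-socket cache: a connected socket reuses the route attached to it, and only falls back to a full lookup when none is attached. A minimal Python sketch, with a hypothetical Sock class and get_route helper; validity checking of the cached entry is omitted.

```python
class Sock:
    """Toy socket holding an optional cached route."""

    def __init__(self, connected):
        self.connected = connected
        self.dst_cache = None


def get_route(sk, daddr, route_lookup):
    """Return a route for daddr, reusing the socket's cached one if possible."""
    rt = sk.dst_cache if sk.connected else None   # models sk_dst_check()
    if rt is None:
        rt = route_lookup(daddr)                  # models ip_route_output()
        if sk.connected:
            sk.dst_cache = rt                     # models sk_dst_set()
    return rt
```

An unconnected socket may send to a different destination on every call, so in this model it performs a lookup each time and never caches the result.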

IP (layer 3) send routines:

  • net/ipv4/ip_output.c: ip_build_xmit( )
    • call ip_build_xmit_slow( ) if fragmentation needed or packet has IP options (654)
    • call ip_local_error( ) if header already included and packet size too large (657)
    • call sock_alloc_send_skb( ) to allocate a socket buffer (678)
    • call skb_reserve( ) to reserve space at head of socket buffer for hardware header, and skb_put to move tail pointer for space for packet data (682 - 688)
    • assign skb->dst to clone of dest entry in rtable passed in
    • set fields of IP header (if not supplied) (690 - 704)
    • call getfrag, which is either udp_getfrag( ) or udp_getfrag_nosum( ), to copy packet data into socket buffer (705)
    • call netfilter hook NF_IP_LOCAL_OUT (713)
    • call output_maybe_reroute( ) if netfilter hook doesn't steal packet
    • call skb->dst->output( )

Routing (locally-generated packets):

  • net/ipv4/route.c: ip_route_output( )
    • calls ip_route_output_key to get rtable
    • call sk_dst_set( ) to attach rtable to socket for future use if connected

  • net/ipv4/route.c: ip_route_output_key( )
    • get hash from supplied key value
    • loop through hash entries for one matching key fields src addr, dst addr, tos, oif, iif == 0
    • also need hash entry to have iif == 0; so only for locally-generated packets???
    • if rtable found, update stats, return rtable in supplied pointer argument
    • if none found, call ip_route_output_slow

  • net/ipv4/route.c: ip_route_output_slow( )
    Main routing routine for outgoing (locally generated?) packets when entry not in fast-route hashtable
    • create a new key based on the supplied key
    • in particular, set key.iif = loopback interface
    • if supplied key has nonzero source address

    • if supplied key has nonzero output interface

    • if no destination address supplied

  • net/ipv4/devinet.c: inet_select_addr
    Given destination address, device, scope, returns one of device’s inet addresses to use as a source address
    • return 0 if device has no inet addresses
    • return interface local address(???) if dest matches one of device's address / mask combinations




Bridging code:

  • net/core/dev.c: handle_bridge( )
    • call any leftover handler for ETH_P_ALL
    • call br_handle_frame_hook( ), return

  • net/bridge/br_input.c: br_handle_frame( )
    Hook installed at boot time as br_handle_frame_hook; handles bridging
    • checks that interface frame was received on was configured for bridging, that interface is up, and that source address not broadcast (123)
    • if bridging state is forwarding or learning, calls br_fdb_insert( ) to add entry to forwarding table if not already present (139)
    • handles "special" frames ??? (143)
    • if bridging state is forwarding, calls netfilter hook NF_BR_PRE_ROUTING
      • if frame not stolen, calls br_handle_frame_finish( )

  • net/bridge/br_input.c: br_handle_frame_finish( )
    • if interface in promiscuous mode, clones socketbuffer, calls br_pass_frame_up( ) (69)
    • if dest address is broadcast, calls br_flood_forward( ), and br_pass_frame_up( ) if not already done (79)
    • calls br_fdb_get( ) to get routing info for frame (86)
    • forwards frame:
      • if destination is local, calls br_pass_frame_up( ) if not already done (87)
      • if destination not local, calls br_forward( ) (96)
      • if destination NULL (unknown), calls br_flood_forward( ) (102)
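
The forwarding decision in br_handle_frame_finish( ) can be summarised as a lookup in the forwarding table with flooding as the fallback. The Python sketch below uses a dict for the forwarding database, with the string "local" marking the bridge's own addresses; these data shapes and the bridge_actions name are assumptions, and the function returns the actions it would take instead of performing them.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


def bridge_actions(fdb, dst, promiscuous=False):
    """Return the ordered actions for a frame with destination dst.

    fdb maps a MAC address to a port name, or to "local" for the
    bridge's own addresses; unknown addresses are simply absent.
    """
    actions = []
    passed_up = False
    if promiscuous:
        actions.append("pass_up")          # local listeners see every frame
        passed_up = True
    if dst == BROADCAST:
        actions.append("flood")
        if not passed_up:
            actions.append("pass_up")
        return actions
    entry = fdb.get(dst)
    if entry is None:
        actions.append("flood")            # unknown destination
    elif entry == "local":
        if not passed_up:
            actions.append("pass_up")
    else:
        actions.append("forward:" + entry)
    return actions
```

The passed_up flag mirrors the "if not already done" qualifier in the list above, so a frame is handed to the local stack at most once.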

  • net/bridge/br_input.c: br_pass_frame_up( )
    Called to pass frame for local consumption
    • increment stats (36)
    • set socketbuffer packet type to PACKET_HOST, and incoming interface to bridge interface
    • calls netfilter hook NF_BR_LOCAL_IN
      • if frame not stolen, calls br_pass_frame_up_finish( )
      • br_pass_frame_up_finish( ) calls netif_rx( )

  • net/bridge/br_forward.c: br_forward( )
    Forwarding routine for packets to be forwarded on single bridge port
    • call should_deliver( ) to test if frame should be forwarded (85)
      • returns false if skb->dev == bridge_port->dev or state not forwarding
    • calls __br_forward if true
      • sets indev = skb->dev, skb->dev = bridge_port->dev
      • calls netfilter hook NF_BR_FORWARD
        • calls __br_forward_finish if packet not stolen
    • __br_forward_finish
      • calls netfilter hook NF_BR_POST_ROUTING
        • calls __dev_queue_push_xmit if packet not stolen
    • __dev_queue_push_xmit
      • calls skb_push( )
      • calls dev_queue_xmit( )

  • net/bridge/br_forward.c: br_deliver( )
    Delivery routine for locally-originated packets to be forwarded on single bridge port
    • call should_deliver( ) to test if frame should be forwarded (85)
      • returns false if skb->dev == bridge_port->dev or state not forwarding
    • calls __br_deliver( ) if true
      • sets indev = skb->dev, skb->dev = bridge_port->dev
      • calls netfilter hook NF_BR_LOCAL_OUT
        • calls __br_forward_finish if packet not stolen
    • __br_forward_finish( )
      • calls netfilter hook NF_BR_POST_ROUTING
        • calls __dev_queue_push_xmit if packet not stolen
    • __dev_queue_push_xmit( )
      • calls skb_push( )
      • calls dev_queue_xmit( )

  • net/bridge/br_forward.c: br_flood_forward( )
    Forwarding routine for packets to be flooded on all bridge ports
    • call br_flood( ), with packet_hook = __br_forward( )

  • net/bridge/br_forward.c: br_flood_deliver( )
    Delivery routine for locally-originated packets to be flooded on all bridge ports
    • call br_flood( ), with packet_hook = __br_deliver( )

  • net/bridge/br_forward.c: br_flood( )
    Routine for packets to be flooded on all bridge ports
    • clone socketbuffer if this is specified
    • run through port list of bridge
      • clone frame
      • call __packet_hook( ), supplied as argument



To do:

  • fib_lookup( )
  • route cache
  • multicast:
  • net/ipv4/route.c: ip_route_input_mc( ) (1235)
    • validate packet (check that dev not null, source address OK, protocol is IP, call fib_validate_source( )) (1246)



