Adventures with bpftrace

After a move to new hardware and a new kernel, a pair of hosts in two of our production environments started seeing out of order traffic. Had we set something up wrong, or was the hardware/kernel change the cause?

Crossposted from https://technology.lmax.com

You’re out of order!

Editor’s note: There are going to be a lot of order puns here.

We’ve written previously about our software’s reliance on UDP multicast for the majority of its messaging needs.

Now. What does the internet have to say about UDP? Well, mostly it consists of terrible jokes.

I tried to tell a UDP joke, but no-one got it.

Jokes like this are all variations on the same theme: that the UDP protocol provides no guarantees of delivery. Packets may be lost or turn up out of order.

It may surprise you to learn, therefore, that we treat out of order delivery of UDP multicast traffic in our systems as something to investigate. Why is that?

Order, Order

At an application level we almost always want messages on a given multicast topic to arrive in the order they were sent. Why? Well, there may be dependencies between the messages, e.g. [order 1 accepted, order 1 cancelled] only really makes sense in one of the two possible orderings. Typically, in this circumstance we enforce a single publisher per logical context - so while there may be multiple sessions placing orders, only one of them will be placing order 1 for client abc.

For many years, multicast UDP packets stayed in the same order across our network. Our messaging library’s ability to cope with out of order messages lay unused, and the metrics we captured just in case it started happening rested quietly at 0. This is a good thing - any ordering issues are going to incur some processing cost while the messaging layer tries to get things back into the right order for us, and many parts of our production systems are latency sensitive in the microsecond range.

Unfortunately, this happy period recently came to an end, when two of our systems started angrily reporting that messages on our busiest topic were infrequently appearing out of order. Interestingly the problem occurred in two different places, but they shared a feature: both the publishing and subscribing processes were running on the same physical host.

A sneaky peekcap

We wanted to reduce the surface area of places where the reordering could be happening, so we took a packet capture on the machine in question. This turned out to be quite useful, because according to the pcap, everything was in its right place.

That just about makes sense, because we have other subscribers attached to the same multicast bus on other physical hosts, and they aren’t reporting anything out of order. We try to capture the same traffic using tcpdump -Qin ... to see if it arrives in a different order, but we don’t record any packets at all that way. Curious.
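For reference, -Q restricts the capture to a single traffic direction, so -Q in should only record packets that genuinely arrived from the network. A hypothetical invocation (the interface and multicast group below are placeholders, not our real capture filter) looks something like this:

 # tcpdump -Q in -i <interface> -w ingress-only.pcap udp and dst host <multicast-group>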

We return to a previous map of the kernel receive path (we’ve done this before), and work out that we’d really like to be able to see what’s going on in udp.c.

byattj: now we just need to be able to drop arbitrary breakpoints/debug logging into the kernel…
biguml: should I bring my eBPF book tomorrow? :wink:

Enter bpftrace

bpftrace is a scripting language whose programs are compiled into eBPF bytecode and attached to probe points inside the running kernel. What does that mean in practice? Well, here’s an example bpftrace program from bpftrace.org that creates a histogram of the time spent in the read syscall:

 // read.bt file
 tracepoint:syscalls:sys_enter_read
 {
   @start[tid] = nsecs;
 }
 
 tracepoint:syscalls:sys_exit_read / @start[tid] /
 {
   @times = hist(nsecs - @start[tid]);
   delete(@start[tid]);
 }

and here’s the output when you run that bpftrace script:

 # bpftrace read.bt
 Attaching 2 probes...
 ^C
 
 @times:
 [256, 512)           326 |@                                                   |
 [512, 1k)           7715 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
 [1k, 2k)           15306 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
 [2k, 4k)             609 |@@                                                  |
 [4k, 8k)             611 |@@                                                  |
 [8k, 16k)            438 |@                                                   |
 [16k, 32k)            59 |                                                    |
 [32k, 64k)            36 |                                                    |
 [64k, 128k)            5 |                                                    |

We’ve been waiting for years to have a bpftrace-shaped problem, and now, finally, one has turned up. Even better, it’s turned up somewhere that we don’t have kernel bypass turned on, so there are (almost) no limits to what we can capture. Let’s start by seeing if we can capture these packets roughly half-way between the sender and receiver.

Getting our first script off the ground

Reading the bpftrace reference guide, we think we want a kprobe on one of the functions in udp.c. We try a few possible candidates without success; a brief google suggests we should be able to probe anything listed in /boot/System.map-$(KERNEL_VERSION). Our first script ends up targeting __udp4_lib_rcv.
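As an aside, bpftrace can list the probe points it knows how to attach to, which is a friendlier way of hunting for candidates than reading the symbol table by hand. Something like the following works (the pattern here is illustrative, and the available probes vary by kernel):

 # bpftrace -l 'kprobe:*udp*'
 # grep udp /boot/System.map-$(uname -r)

The first command lists UDP-related functions that can take a kprobe; the second goes straight to the kernel’s symbol table. With a target picked, our first script looks like this: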

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <uapi/linux/ip.h>
#include <uapi/linux/udp.h>

kprobe:__udp4_lib_rcv
{
    $i = (struct sk_buff *) arg0;
    $dst_addr = ((struct iphdr *) ($i->head + $i->network_header))->daddr;
        
    // here $i->data still points at the udp header, so the lbm payload starts at $i->data + 8:
    // if payload byte 0 is 0, then the long at payload offset 8 is the lbm sequence number
    $lbm_hdr = (uint8 *) ($i->data + 8);
    $lbm_seq_1 = *((uint8 *) ($i->data + 16));
    $lbm_seq_2 = *((uint8 *) ($i->data + 17));
    $lbm_seq_3 = *((uint8 *) ($i->data + 18));
    $lbm_seq_4 = *((uint8 *) ($i->data + 19));
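    // stitch the four bytes back into a 32-bit sequence number (big-endian on the wire, so the first byte is the most significant)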
    $lbm_seq = (uint32) 0;
    $lbm_seq |= ($lbm_seq_1 << 24);
    $lbm_seq |= ($lbm_seq_2 << 16);
    $lbm_seq |= ($lbm_seq_3 << 8);  
    $lbm_seq |= $lbm_seq_4;

    if ($dst_addr == 311791) // this corresponds to the multicast address of our out of order data
    {
        if (*$lbm_hdr == 0)
        {
            printf("pid/tid/cpu %d,%d,%d recving len %d at daddr %d, seq %u\n", pid, tid, cpu, $i->len, $dst_addr, $lbm_seq);
        }
    }
}

Here we trace every packet being received for our out-of-order topics, and emit a line of text with some information about it. At the same time we run a test program that Informatica provide (lbmrcv), which makes it very easy to see if a receiver sees out of order packets.

We kick off the trace and the test program at the same time; the test program very rapidly reports out of order packets, and our bpftrace output…does not. So the reordering must be happening after our probe? Perhaps it happens at the point where we’re fanning out the packets - there are several receivers for this topic on this host.

We try a second probe further down, in __udp_enqueue_schedule_skb, which sees a packet per receiver, but still doesn’t show any reordering. That’s really curious, because as far as we can see, there’s very little opportunity for reordering after this point.
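For completeness, here’s a minimal sketch of what that second probe can look like. It isn’t the exact script we ran: it assumes the kernel signature __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb), so the sk_buff arrives as arg1, and it skips the LBM sequence decoding (the payload offsets within $i->data are different at this layer, because the UDP header has already been pulled by the time the packet is queued).

#include <linux/skbuff.h>
#include <linux/ip.h>

// enqueue_probe.bt - illustrative sketch, not the production script
kprobe:__udp_enqueue_schedule_skb
{
    // assumption: arg0 is the struct sock *, arg1 is the struct sk_buff *
    $i = (struct sk_buff *) arg1;
    $dst_addr = ((struct iphdr *) ($i->head + $i->network_header))->daddr;

    if ($dst_addr == 311791) // same multicast address filter as before
    {
        printf("pid/tid/cpu %d,%d,%d enqueueing len %d at daddr %d\n", pid, tid, cpu, $i->len, $dst_addr);
    }
}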

We’re now very confused. We’ve got two data points that fundamentally disagree, so we’ve got an assumption wrong somewhere. Let’s try something simpler: can we at least isolate the thread reading the UDP packets, and see the out of order packets there? This also gives us an opportunity to try our hand at a kretprobe.

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <uapi/linux/ip.h>
#include <uapi/linux/udp.h>

kretprobe:__skb_recv_udp {
    $i = (struct sk_buff *) retval;
    $dst_addr = ((struct iphdr *) ($i->head + $i->network_header))->daddr;
    // by this point the udp header has been pulled, so if byte at $i->data[0] is 0,
    // then the long at $i->data[8] is the lbm sequence number
    $lbm_hdr = (uint8 *) ($i->data);
    // Snipped out the same sequence bit shifts as we've seen before here.
    if ($dst_addr == 311791) {
        if (*$lbm_hdr == 0) {
            printf("pid/tid/cpu %d,%d,%d,%u recving len %d at daddr %d, seq %u\n", pid, tid, cpu, nsecs, $i->len, $dst_addr, $lbm_seq);
        }
    }
}

and now we can see the out of order packets…

pid/tid/cpu/ns 319417,319417,26,768486861 recving len 1092 at daddr 311791, seq 4093587009
pid/tid/cpu/ns 319417,319417,26,768491450 recving len 120  at daddr 311791, seq 4093587010
pid/tid/cpu/ns 319417,319417,26,768495127 recving len 54   at daddr 311791, seq 4093587191 <<<---
pid/tid/cpu/ns 319417,319417,26,769580746 recving len 1092 at daddr 311791, seq 4093587011
pid/tid/cpu/ns 319417,319417,26,769585475 recving len 294  at daddr 311791, seq 4093587012

…and they’re being read from a single thread, so those packets must have been out of order when they got written to whatever that thread is reading from. So we’ve got an assumption wrong in our brains about how bpftrace works. What does the write side look like for these packets? We compare a read-trace with a write-trace and discover that the first out of order packet is delivered by a different thread. In fact, it’s always the first message of a batch sent by a thread that didn’t send the prior message.

Artist’s impression of the write-trace, where sequence 76 appears before sequence 5 on the read-trace.

thread,sequence
98,1
98,2
98,3
98,4
98,5
31,76

This suggests that the reordering is happening at the point of our previous probes (or before!), but somehow, the architecture of how bpftrace ferries messages around is hiding it from us. We add some timestamping to those scripts, and lo and behold, the ordering issue is made visible there too.

Here’s the only line we changed in the first script:

    printf("pid/tid/cpu/ns %d,%d,%d,%u recving seq %u nseq %u qlen %u\n", pid, tid, cpu, nsecs, $lbm_seq, $lbm_seq_n, $q->qlen);

and now a write-trace sorted by timestamp.


Direction   pid      tid      cpu  timestamp   length  sequence
S           3977661  3984187  4    2612886314  1100    1620722478
S           36       36       4    2612961369  302     1620722479
S           36       36       4    2612983582  858     1620722480
S           3977661  3984187  5    2612984594  1100    1620722482
S           36       36       4    2613002528  842     1620722481
S           3977661  3984187  5    2613018098  112     1620722483

…and to go with the reordering, we see we have two different pid/tid/cpu combinations racing with each other to write to the socket buffer; there’s the race we were looking for.

Excellent - we’ve managed to prove to ourselves that for localised multicast sender/receivers, reordering is now a thing that can happen. Now - is this our fault? Has the kernel changed? Or has our move from Intel -> AMD brought us here?

What changed?

We’ve done a sizable kernel jump from X to Y. We’ve moved from an Intel CPU to an AMD CPU. We’ve also inherited a bag of our own changes to cope with those first two.

Our best guess is a kernel change.

That’s a big search area. We enlist help from our friends at Informatica, because it seems likely they’ll have come across this issue before and might be able to signpost us to a smaller search area. They quickly come back to us with an interesting fact: they can trivially reproduce this, and even more interestingly can make the problem disappear…by pinning the sending thread to a single core.
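(For anyone wanting to reproduce the experiment, pinning a sending process, and therefore its sending thread, to one core is a one-liner; the program name and core number below are placeholders rather than Informatica’s actual repro.)

 # taskset -c 2 <sending-program> <args>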

Why is that? We’ll look into that in a future installment…