The Impossible NullPointerException

Our new production exchange recently produced an impossible-looking NullPointerException.

At the same time, we saw another application in the same deployment throw an OutOfMemoryError.

Both problems turned out to have the same root cause. This post tells the story of how we found that out.

Crossposted from https://technology.lmax.com

The Problem

Or problems, plural, in this case. We’ve added a new exchange. To bring it into service, it needs to join an upstream ‘global’ service that maintains data that’s shared across multiple exchanges. To do so, that global service needs to send our new exchange quite a lot of data. One of the larger datasets that we need to synchronise is customer data.

Our first attempt to do this did not go well. Here’s the first few elements of a stack trace that we thought was, well, exceptional.

Caused by: java.lang.NullPointerException
        at CustomerDao.upsert(CustomerDao.java:345)
        at CustomerManager.upsertCustomer(CustomerManager.java:173)
        at CustomerManager.upsert(CustomerManager.java:150) 

Why is that so impossible? We’ll need a crash course in how our messaging system works to see why. First though - let’s have a look at that line, and work our way up…

Here’s CustomerDao.java:345

customerParams.addValue("principalRealm", writeAlphaCharCode(changedBy.getRealm()));

From which we guess that changedBy is null. That particular object has type Identity and gets passed down from CustomerManager::upsert.

public void upsert(
        final String requestId,
        final Customer customer,
        final Identity changedBy,
        final @Milliseconds long changeTimestamp,
        final long bookingCustomerId)

Now. How did this function call get here? Well, it was proxied over from another (remote) service via multicast. CustomerManager implements a messaging interface that contains that method signature. With a bit of magic, for a topic like that, we generate an implementation that transforms the method call into a multicast packet, and another piece of code that can ‘invoke’ that packet on an instance of that interface in another application entirely.
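
To give a flavour of that magic, here’s a deliberately simplified, hypothetical sketch of the generated pair - a publisher that turns the call into bytes on the bus, and an invoker that replays those bytes onto a local implementation. The topic, names and wire format here are invented for illustration; the real generated code deals with many topics, types and framing details.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical, heavily simplified illustration of the generated proxy pair.
interface GreetingTopic
{
    void greet(String name);
}

// Generated on the sending side: turns the method call into a packet for the bus.
final class GreetingTopicPublisher implements GreetingTopic
{
    private final Consumer<ByteBuffer> bus; // stand-in for the multicast publisher

    GreetingTopicPublisher(final Consumer<ByteBuffer> bus)
    {
        this.bus = bus;
    }

    @Override
    public void greet(final String name)
    {
        final byte[] payload = name.getBytes(StandardCharsets.UTF_8);
        final ByteBuffer packet = ByteBuffer.allocate(Integer.BYTES + payload.length);
        packet.putInt(payload.length).put(payload).flip();
        bus.accept(packet);
    }
}

// Generated on the receiving side: 'invokes' the packet on a local implementation.
final class GreetingTopicInvoker
{
    void invoke(final ByteBuffer packet, final GreetingTopic target)
    {
        final byte[] payload = new byte[packet.getInt()];
        packet.get(payload);
        target.greet(new String(payload, StandardCharsets.UTF_8));
    }
}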

This is very handy, because we can see who calls that method on that interface via IntelliJ’s “find usages”, and quickly find the culprit who has passed in a null Identity. There are six callers, but we know that only one of them is called by our sync process, so we trace where the Identity instance is constructed by that caller, and suddenly, we’re very confused, because in the database reading code, we find this:

return new Identity(
        getByAlphaChar(rs.getString("principal_realm"), RealmId.class), 
        rs.getString("principal_domain"), 
        rs.getString("principal_username"));

We double and triple check our work, but no. The caller’s got no way of passing null in. And yet…NullPointerException. What have we missed? Well, we guess there must be something wrong in the magic that marshals those objects across the network. Unfortunately, that magic, in this case, is a slightly larger area than usual…

The other problem

We haven’t been entirely honest.

The vast majority of the messaging at LMAX happens as described - a method call transported across a multicast bus between two applications. In this particular case though, the story is more complicated, because global and the new exchange do not live in the same multicast network.

How do we get around this? We use a bridge. The bridge application has two parts - a process in the exchange, and one in global. Each process proxies a subset of the multicast traffic from its end down a single, shared TCP link to the other end, which then faithfully copies that traffic back out onto the multicast bus at the other end unchanged. The global end is slightly more special because it maintains a connection per exchange (it sends anything it hears to every exchange), but otherwise the processes are very similar. We call these processes wan-tunnel-local (deployed in the exchange) and wan-tunnel-global (deployed in global).
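
For a rough picture of what one direction of such a bridge does, here’s a hypothetical, much-simplified sketch (invented names; the real tunnels add filtering, work queues, backlogs and the command protocol on top of this):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SocketChannel;

// One direction of a tunnel, in spirit: receive a multicast datagram and
// forward it, length-prefixed, down the TCP connection to the other end.
final class OneWayTunnelSketch
{
    private final DatagramChannel multicastIn;
    private final SocketChannel tcpOut;
    private final ByteBuffer datagram = ByteBuffer.allocateDirect(64 * 1024);
    private final ByteBuffer lengthPrefix = ByteBuffer.allocateDirect(Integer.BYTES);

    OneWayTunnelSketch(final DatagramChannel multicastIn, final SocketChannel tcpOut)
    {
        this.multicastIn = multicastIn;
        this.tcpOut = tcpOut;
    }

    void forwardOne() throws IOException
    {
        datagram.clear();
        multicastIn.receive(datagram);                            // one multicast packet in...
        datagram.flip();

        lengthPrefix.clear();
        lengthPrefix.putInt(datagram.remaining()).flip();
        tcpOut.write(new ByteBuffer[] {lengthPrefix, datagram});  // ...framed onto the TCP link
    }
}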

Our real message path is actually more like:

global-admin-svc -> multicast -> wan-tunnel-global -> tcp link -> wan-tunnel-local -> multicast -> account-svc

With this extra knowledge, we can reveal part two of the issue: at around the same time we see the NullPointerException in account-svc, wan-tunnel-local throws an OutOfMemoryError.

java.lang.OutOfMemoryError: Direct buffer memory 
  at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
  at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
  at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
  at TunnelInbound.readNext(TunnelInbound.java:113)
  at TunnelNetworkLoop.onRead(TunnelNetworkLoop.java:222)
  at TunnelNetworkLoop.doSelect(TunnelNetworkLoop.java:182)
  at TunnelNetworkLoop.doSelect(TunnelNetworkLoop.java:141)
  at TunnelAcceptor.run(TunnelAcceptor.java:139)
  at java.base/java.lang.Thread.run(Thread.java:834)

Even more notably, only one wan-tunnel-local (out of five) OOMed, and only one account-svc NPEd. And yes, they were in the same exchange.

Where should we go next? There are some big clues in there. Not least the presence of our old frenemy, ByteBuffer.

Oh, ByteBuffer, we have not missed you

We look very closely at wan-tunnel-global, and work out that it is definitely copying the same multicast packets to each TCP connection handler’s work queue. This means we’ve either got a bug in the code to copy the work queue to the TCP socket in this process, or a bug in the code that reads and translates the other end of the connection in wan-tunnel-local.

Unfortunately, that code is full of ByteBuffers. It also seems unnecessarily convoluted to our eyes. That’s an easy criticism to make when you’re looking at code for the first time and you know a bug lives in it, so we should reserve judgement. Walking through it and running the unit tests doesn’t immediately find a problem at either end. Unsurprising, we suppose - if the tests exercised the bug, they’d be failing, and this bit of software has worked without error for several years. We write more unit tests, and they continue to tell us everything is fine.

We’re at that funny sort of impasse where it’s tempting to start blaming cosmic rays for corrupting a packet. That usually means it’s time to try a reproduction at a different scale; we’ve been going small - can we go large? Handily, yes.

An Example In Staging

We have virtualized versions of our exchanges available, using a subset of cleansed production data. In the case of our bug we have the whole (cleansed) data set, so we should be able to exercise exactly the same code path that triggered this in live.

We get staging into an appropriate state to do this, push the sync button, and…it works. The data syncs correctly. Ok, so, it’s intermittent. We work out how to easily repeat the experiment and run it a few tens of times across lunch. It works correctly every time. That’s odd - we tried this once in live, and it failed the first time. This is either a spectacularly unlikely coincidence, or our reproduction is insufficiently accurate. We reassure ourselves that computers are deterministic and non-malicious, and wonder if we need to simulate the exact network conditions too; the TCP link between global and the exchange often goes over a VPN, and sometimes over a long enough distance to generate 100ms+ of RTT.

Do we finally have an excuse to get tc/netem out? Yes we do.

We’re in control of the traffic now

We make our way to a wan-tunnel-local host in staging, and crack our fingers in preparation for typing arcane demands into the black screen of tiny letters.

# throttle bandwidth:
sudo tc qdisc add dev eth55 root handle 1: tbf rate 1mbit burst 32kbit limit 3000
# and delay packets (both commands need to be used together with the 'handle' part):
sudo tc qdisc add dev eth55 parent 1:1 handle 10: netem delay 150ms 20ms distribution normal

We could spend time explaining what’s going on there, but having blindly copied those commands from one of these two excellent resources, we suspect a link might be a better idea. We believe we’re adding 150ms of delay (with a 20ms standard deviation) and rate limiting all traffic on eth55 to 1mbit/s (with a 32kbit burst allowance).

Whether our understanding is correct or not swiftly becomes irrelevant - the very first time we attempt this with both the rate limit and the delay, we see our wan-tunnel-local happily throw our OOME with the stack we’re after. We visit the logs for the local account-svc and find a familiar NPE, too. Aha. Not cosmic rays after all. The reproducer seems pretty reliable, too - our first three tries all generate the desired result.

Now we’re back on the science train, we can gather some more data. Attaching a debugger to account-svc gets us a step further - we’re trying to deserialize an Identity (in fact probably most of the objects in the second half of an RPC) out of a giant array of zeroed bytes.

We take some packet captures at the wan-tunnel-local. I forget how to use Wireshark filters and instead get it to export the socket’s data as a file, then search for contiguous zeroes in it. Oh. Yes. There they are.

We take a hop upstream - we assume the wan-tunnel-global must have sent a packet with zeroes in, but did it receive any? Our packet captures suggest it did not receive any big arrays of 0, and it definitely sent one.

We already looked at this code once and didn’t find anything, but perhaps, armed with this new knowledge, something else might jump out?

Back in the Wan Tunnel

No, no it doesn’t. But it does change which bits of the code we look at. We’re only hitting a problem when we try to send a reasonably sized chunk of data down a constricted pipe. How does this code cope with that? Does it block, or drop, or what?

Reminder: we’re still in the global end of the wan-tunnel, doing a send to each downstream local tunnel. The channel is a standard java.nio.channels.GatheringByteChannel representing the socket with a local tunnel at the other end.

This method sends some data messages (there are also ‘command’ messages like acknowledgements and heartbeats) to a given channel. It is full of comments. That’s not usually a good sign…

    private final GatheringByteChannel channel; 
    private ByteBuffer dataBackLog = ByteBuffer.allocateDirect(1 << 22);      
    private final ByteBuffer commandBacklog = ByteBuffer.allocateDirect(1 << 16);

    public void writeMessages(ByteBuffer buffer) throws IOException
    {
        int messageCount = validateCompleteMessages(flyweight, buffer);

        drain();

        if (!hasPendingMessages() && !hasPendingCommands())
        {
            // nothing pending anywhere; just try and write it!
            channel.write(buffer);
            // Only if we get a partial write do we need to make sure that
            // the remaining message data fits in the buffer.
            // Given this only happens when the backlog is empty,
            // the total buffer length should not be more than the largest
            // message seen + a little bit to round up to page size.
            resizeDataBacklogToFitPartialWrite(buffer.remaining());
        }

        if (buffer.hasRemaining()) // partial write or no write at all for this buffer
        {
            // this copies this buffer into the backlog...
            addToBacklog(dataBackLog, buffer, messageCount);
            // and then we stick it in the queue.
            appendToDrainQueue(dataBackLog);
        }

        if (hasPendingCommands())
        {
            appendToDrainQueue(commandBacklog);
            drain();
        }

        outboundCounters.increment(TunnelMessageType.DATA);
    }

What is all this? Well. This implementation tries really hard to keep data and command messages apart (separate dataBackLog and commandBacklog fields) - why? It also refuses to store partial messages in its internal buffers (trust us on this one) - this feels sensible. It appears to make a token attempt to resize its internal buffer should it see a large enough message, but…only if there’s no data already pending? We just don’t get that at all.

What we’re looking for in here is buffer fullness, probably in either of those two backlog fields. How does addToBacklog work?

    private static void addToBacklog(final ByteBuffer backlog, final ByteBuffer buffer, final int messageCount) throws DataLossException
    {
        backlog.compact();
        if (backlog.remaining() < buffer.remaining())
        {
            throw new DataLossException(backlog.remaining(), buffer.remaining(), messageCount);
        }
        backlog.put(buffer);
        backlog.flip();
    }

OK - so here’s the answer - if we get a full buffer, we throw a well named DataLossException, and we just drop the data on the floor. That’s probably sensible - when the underlying bytes are retranslated into application level messages, they have sequence numbers on them, and if there’s a gap, the receiver can request a resend of what is missing (this is one of the usages of the commands in the tunnel protocol).
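
For intuition, here’s a hypothetical sketch (invented names, not the real tunnel protocol) of the kind of sequence-gap detection the receiving end performs before requesting a resend:

// Hypothetical gap detection on the receiving side, for illustration only.
final class GapDetector
{
    private long nextExpectedSequence = 0;

    // Returns the first missing sequence number if a gap is detected, or -1 if not.
    // In the real protocol, a detected gap would trigger a resend request command.
    long onMessage(final long sequence)
    {
        final long firstMissing = sequence > nextExpectedSequence ? nextExpectedSequence : -1;
        nextExpectedSequence = Math.max(nextExpectedSequence, sequence + 1);
        return firstMissing;
    }
}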

We check how the DataLossException gets propagated and that looks alright.

So this method is fine then. Right? We thought so - there’s even a test that checks for the DataLossException, and it passes, and a quick bit of debugging shows it’s passing for the right reason.

ByteBuffer veterans may be crying into their coffee at this point, because NO, that method is NOT alright.

The missing flip

Let’s look at that code once more, but slower. We’ll add some comments inline, to help.

    private static void addToBacklog(final ByteBuffer backlog, final ByteBuffer buffer, final int messageCount) throws DataLossException
    {
        // move the content of backlog to the front of the buffer
        // the position is set to the first byte _after_ the content, so
        // after a compact, this buffer is in _write_ mode.
        backlog.compact();
        if (backlog.remaining() < buffer.remaining())
        {             
            throw new DataLossException(backlog.remaining(), buffer.remaining(), messageCount);
        }
        // Given we're in write mode, add the buffer to the backlog
        backlog.put(buffer);
        // Flip the buffer back into read mode so we're ready to be
        // copied into `channel` by the next invocation of `drain`
        backlog.flip();
    }

Can you see it now? It took the CTO sitting down next to me and pointing at it questioningly for my temporary ByteBuffer blindness to wear off, so don’t feel bad if you didn’t.

In the case of data loss, we omit the call to flip and leave the backlog in a state where it’s ready to be written. What happens if you attempt to read from it in that state? Well, what you read is the never-written tail of the buffer - (buffer capacity minus content length) bytes of 0s, since direct buffers start out zero-filled.
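
If that’s hard to picture, here’s a tiny standalone sketch of the same failure mode - not the real tunnel code, just a bare ByteBuffer being mishandled in the same way:

import java.nio.ByteBuffer;
import java.util.Arrays;

// Standalone illustration of the missing flip: after compact() the buffer is in
// write mode, and if the data-loss path skips flip(), the next drain reads the
// never-written tail of the buffer, which for a fresh direct buffer is all zeros.
public class MissingFlipDemo
{
    public static void main(final String[] args)
    {
        final ByteBuffer backlog = ByteBuffer.allocateDirect(16);

        // Buffer a 6-byte 'message' and flip into read mode, as the happy path does.
        backlog.put(new byte[] {1, 2, 3, 4, 5, 6});
        backlog.flip();

        // The congested link means drain() makes no progress, so the 6 bytes stay queued.

        // A 12-byte message arrives; addToBacklog compacts into write mode...
        backlog.compact(); // position = 6, limit = capacity = 16
        // ...sees remaining() = 10 < 12, throws DataLossException, and never flips.

        // The next drain reads from position 6 to limit 16: ten bytes of zeros
        // instead of the six real bytes sitting at the front of the buffer.
        final byte[] drained = new byte[backlog.remaining()];
        backlog.get(drained);
        System.out.println(Arrays.toString(drained)); // [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    }
}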

We add a test that triggers data loss, and then tries to continue sending data afterwards. Straight away we are greeted with giant arrays of 0s in the test output. Quite a simple error in the end; it turns out that this is the first time that we’ve ever suffered data loss in a real environment, and this component fundamentally doesn’t handle that scenario!

What about the OOM, though?

That’s a bit fiddlier to explain. Let’s imagine our buffer was small, say 1024 bytes. The following sequence of events then occurs.

  1. Send message one - it’s 900 bytes long. This buffers the message in one of the backlogs.
  2. Attempt to drain the backlogs to the underlying channel. This succeeds with a partial write of, say, 300 bytes.
  3. Send message two - it’s 600 bytes long. This triggers data loss, and, critically, leaves the buffer in a state where it will now send 0s in the place of the last 600 bytes of message one.
  4. Send more messages - it doesn’t matter what, really.

At the other end, the reader manages to discard most of the 0s by interpreting them as empty packets - an accident, we think, rather than an explicit bit of design. Unfortunately, when the real messages turn up again, that run of ‘empty’ packets hasn’t left the reader on a valid packet boundary, so the first actual packet does not start where the reader thinks it should, and we read a message length from somewhere totally inappropriate. We then try to allocate a buffer of that size. Boom. Or, rather, OOM.
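
To make that last step concrete, here’s a hypothetical sketch (invented names, not our real TunnelInbound) of a length-prefixed reader with exactly this failure mode - once the stream is misaligned, the ‘length’ comes from the middle of a real message and the allocation blows up:

import java.nio.ByteBuffer;

// Hypothetical length-prefixed framing reader, for illustration only.
final class FramedReader
{
    // Reads one length-prefixed message from the stream, or returns null for a
    // zero-length 'packet' (which is how a run of 0s gets silently skipped).
    ByteBuffer readNext(final ByteBuffer stream)
    {
        final int length = stream.getInt(); // garbage if the stream is misaligned
        if (length == 0)
        {
            return null;
        }
        final ByteBuffer message = ByteBuffer.allocateDirect(length); // 'Direct buffer memory' OOM if length is huge garbage
        for (int i = 0; i < length; i++)
        {
            message.put(stream.get());
        }
        message.flip();
        return message;
    }
}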

How did it take us so long to work this out?

This bug made it to a blog post. Many of our other bugs do not. Their errors are often just as trivial, but identifying them requires less work.

It would be easy to blame the original author, but in fact, much of the blame lay with us.

  1. We gave too much weight to the fact that the code had worked flawlessly for so long. We looked straight past the gaps in the test coverage because of this, I think.

  2. Despite all the evidence pointing at a congestion problem, we still needed a full-fat reproducer to force us to look properly at the buffer-full case. Even once we were looking at it, it took a while for my ByteBuffer bug blindness to lift!

  3. At some point in the past, we noticed that code was complicated, and tried to compensate with comments. We could have worked out why it was complicated, and documented that with tests. Alternatively, the tests could have shown us what we could delete (the true utility of coverage). After writing a round-trip fuzz test of the sender code, we found we could hugely simplify it - down to less than half the size it was originally, and requiring none of the comments.

  4. Some absolute idiot (you’re fired -Ed) at the beginning called it an impossible NullPointerException, and so we went looking for zebras when in fact the usual ByteBuffer::flip was what we needed to find…