💥 Crash Investigations 02: Debugging a Rare Native Plugin Memory Crash
Table of Contents
This time we’re going to look at another crash that we encountered during the development of Deliver At All Costs. The crash in question is particularly interesting, because it was very rare - only happening every few months while playtesting.
Backstory
This crash happened in seemingly random situations, once every few months. The crash log showed one of our native plugins in the stack trace, but I didn’t have enough information to verify that plugin as the root cause. Since the crash was so rare, and the conditions so random, I had no way to reproduce it either, at the time I brushed it off as pure coincidence, hoping that it would go away on its own.
As we were making our way closer to our beta milestone, during heavy playtesting, we hit the crash again. Seeing a crash with the same stack trace again was starting to make me nervous. What if this is not a coincidence, and if we see it every few months, how often are players going to see it once we ship? I still had no useful information on how to reproduce it, and as we were so close to the milestone I again had to leave it be for a while.
A few months later we were showing off the game at a press event. Since events are guaranteed to lure out all the nastiest of bugs, this is where the crash happened the next time. After the event I decided to put all my energy into finding the culprit of this crash. I got in contact with the developer of the plugin asking them if they had seen anything like this before. I also asked if they could provide me with debug symbols so that I could get some more information out of the crash dump, but unfortunately they didn’t have any. While they did not have the resources to assist me in tracking down the core issue at the time, they were very helpful in getting me access to the source code for the plugin, which was written in C. Step one was to get a more concrete lead to where the bug could be, so using the source code I recompiled the plugin with debugging info enabled, and placed the new binary in our project. The idea being that the next time the game crashes, I will be able to get more information and debug the crash properly.
Analyzing the crash
2 months later the next crash occurs and I immediately start analyzing the crash dump with the symbols that I had compiled. The exception record shows that it was triggered by an access violation.
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 00007fcc1cf38a6b (plugin+0x0000000000068a6b)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000019cd8a61000
Attempt to read from address 0000019cd8a61000
...
STACK_TEXT:
00000007`7e35ed48 : plugin+0x6ca6b (somefunction2)
00000007`7e35ed50 : plugin+0x2cd16 (somefunction1)
00000007`7e35ed58 : 0x00000198`dec8df50
00000007`7e35ed60 : 0x00000197`d8a60fd0
00000007`7e35ed68 : 0x00000197`2f9c6b80
00000007`7e35ed70 : 0x00000198`dec8df50
Thanks to the debug symbols, the stack trace now included function names from the library, and as I had access to the source code I was able to follow the code to where the crash happened. It now seemed plausible that the crash was indeed related to this library, and that it happened during a specific API call that the game was doing every now and then.
An access violation points to a memory issue, perhaps a use after free, or out of bounds access, or something similar. Knowing where the memory access happened, I went through the relevant code paths many times but couldn’t find anything obvious that looked suspicious. Here is a small high-level overview of how the relevant part of the C-API worked.
// Simple tagged union pattern.
typedef struct {
uint tag;
union {
VariantA a;
VariantB b;
VariantC c;
};
} MyUnion;
// Public interface for the plugin. User passes a pointer to the data structure as an argument.
MY_API void MyPublicApi(MyUnion *userData)
{
// Library makes an internal copy of the data, perhaps to be able to keep it around for longer.
MyUnion *myData = malloc(sizeof(MyUnion));
memcpy(myData, userData, sizeof(MyUnion));
// ...
if (data->tag == MY_TAG_A)
{
// Each time the library reads the data it always checks the tag first. It never touches
// the inactive variants since the data would be undefined.
UseVariantA(&myData->a);
}
// ...
}
As you can see there is nothing too concerning. It just takes in a pointer to a struct, makes a copy for itself, then being careful to only read from the correct member depending on the tag. This was my conclusion after walking through the source code. There is nothing that looks wrong here.
I decided to look at the C# bindings that called the API instead. Here is a stripped down example of how it worked.
[StructLayout(LayoutKind.Sequential)]
struct RawVariantA
{
// The variant struct on this side also contains the tag to match the union.
public uint Tag;
// ...data members for VariantA
}
void CallApiWithVariantA(UserFriendlyDataForVariantA userData)
{
RawVariantA data = new(userData); // re-arrange the data to correctly match the C-API
IntPtr memory = Marshal.AllocHGlobal(Marshal.SizeOf<RawVariantA>());
Marshal.StructureToPtr(data, memory, false);
NativeLibrary.MyPublicApi(memory);
Marshal.FreeHGlobal(memory);
}
Looks good too right? But wait, there is actually a subtle but dangerous bug here! Notice that the function only allocates memory to fit the struct RawVariantA. This is the correct size needed to store all the data that the C-API needs to read it safely. But this is only true if the C-API only looks at the members of VariantA. If for instance, VariantB is larger than VariantA, reading from VariantB could mean reading outside of the allocation! Good thing then that the C-API only ever reads from the correct variant based on the tag. …right?
If you haven’t figured it out already, take a look at the code from C-API again. While the business logic only ever uses the data from the correct variant, the memcpy does not look at the tag before copying the full size of MyUnion. This means that the call to memcpy expects the object pointed to to be sizeof(MyUnion), not just the Tag plus VariantA (which is what the C# bindings allocate). This means that if the RawVariantA struct in C# happens to be smaller than the full union as defined in C, the C-code will access memory that is outside of the allocation.
This just happens to be fine most of the time, since the code doesn’t actually care about the contents of the data that lies after VariantA in memory. But we are still accessing unallocated (or someone else’s) memory. This can definitely lead to unexpected crashes. Let’s see how that happens.
Cause of the crash
Lets look at the memory address from the crash again.
Attempt to read from address 0000019cd8a61000
This address is actually very near the pointer that was passed to the MyPublicApi function. In fact, it is within the range pointer..pointer+sizeof(MyUnion). And this makes a lot of sense, because with the debug symbols I could see that the instruction pointer was in the middle of the (in this case inlined) memcpy code.
Notice that the memory address ends with hex 000. This means that it is 4096 byte aligned, which just so happens to be the size of a typical memory page on x86 Windows. Now, what could happen is that if the C# bindings allocate a buffer that ends just at the end of a memory page boundary, that could lead to the C code attempting to copy data from that second memory page as well, even though the allocation doesn’t span that far. Here is how that could look like in memory:
ACTUAL [TAG VARIANTA]
COPIED [TAG VARIANTA ....... ] <-- notice where the expected buffer size ends
|------------------------|------------------------|
MEMORY PAGE 01 MEMORY PAGE 02
If MEMORY PAGE 02 has not been committed yet (or in other words is unallocated), any memory access to that page would result in an access violation and crash the program. This is exactly what happened in our crash, and the very specific conditions required for this to happen is also the reason why the crash was so rare.
Fixing the bindings
Given what we now know, the fix is actually very simple, we just need to make sure that the buffer is allocated with the correct size of the MyUnion structure - updating the call to Marshal.AllocHGlobal is enough.
An interesting discovery I made was that the designer of the bindings actually had a correctly defined structure matching MyUnion from the C-API. They just didn’t use it in this case and had instead, mistakenly, opted to use a ‘simpler’ struct for each variant to call the API.
[StructLayout(LayoutKind.Explicit)]
struct MyUnion
{
[FieldOffset(0)] public uint Tag;
[FieldOffset(0)] public VariantA a;
[FieldOffset(0)] public VariantB b;
[FieldOffset(0)] public VariantC c;
}
This is a good example as to why, when designing a native API working with pointers, it is a good idea to always take an extra argument specifying the size of the passed structure. This helps protect against this type of mistake, and also helps with API versioning. If you have ever used the Windows API; this is exactly what those extra cbSize parameters are for.