Sunday, February 25, 2018

libc:malloc meets ADIHEAP

Oracle Solaris 11.4 comes with ADIHEAP , a new security extension that acts as a management interface for allocators that implement ADI based defenses. In this blog entry we'll walk through the implementation of ADIHEAP within libc:malloc in Solaris.


So why ADIHEAP at all? Shortly before the advent of libadimalloc, I started thinking about a better way to integrate ADI in the Solaris ecosystem. I love the technology and libadimalloc, while doing its job as a testing library, wasn't cutting it for production environments. LD_PRELOAD is a poor controlling interface and relinking existing applications just wasn't happening. It didn't really help that libadimalloc wasn't seen as a viable non-ADI allocator and hence had no consumer out of the box.

My vision for ADI was a bit different, involving the main Solaris allocators (libc:malloc, libumem and libmtmalloc) supporting ADI-based defenses and the Security Extensions Framework acting as a more advanced and coherent controlling interface. That's when ADIHEAP was born.

ADIHEAP brought all the usual security extensions goodness:
  • progressive introduction to sensitive binaries, through the tagged-files model and binary tagging (ld -z sx=adiheap=enable)
  • ease of switching between enabled/disabled state, especially system wide (capture different production scenarios)
  • simplified/advanced testing through sxadm exec -i and the ability to unleash ADI checks over the entire system with model=all
  • reporting through the Compliance framework 
  • kernel and user process cooperation. The kernel knows whether the extension will be enabled on the target process and can do operations on its behalf. In particular, this vastly simplifies ADI support in brk-based allocators, since the kernel now pre enables ADI over the brk pages.
Of course, ADIHEAP by itself does very little without libraries support. The choice for the first ADIHEAP consumer (never introduce something without a consumer!) ended on libc:malloc, as it was/is small, self contained, (still) vastly used across the system and implemented as a non-slab, brk-based allocator, which provides an interesting alternative to the mmap-and-slab-based libadimalloc example.


The implementation of libc:malloc hasn't changed much over the years and an older implementation can be found from the old open-source days. We'll use this public implementation as a reference, since the main goal of this entry is to show how an allocator can be evolved to incorporate ADI checks. Keep in mind that some of the code here described may not strictly apply to the 11.4 Oracle Solaris codebase.

libc:malloc is composed by two main files, mallint.h (which contains general definitions for the allocator) and malloc.c (which has the bulk of the implementation). It's a best fit allocator based on a self-adjusting tree of free elements grouped by size. Element information is contained inside the chunk itself and described by the TREE structure:
 110 /* structure of a node in the free tree */
 111 typedef struct _t_ {
 112         WORD    t_s;    /* size of this element */
 113         WORD    t_p;    /* parent node */
 114         WORD    t_l;    /* left child */
 115         WORD    t_r;    /* right child */
 116         WORD    t_n;    /* next in link list */
 117         WORD    t_d;    /* dummy to reserve space for self-pointer */
 118 } TREE;
Free objects use all the elements of the TREE structure and also have a pointer to the start of the chunk at the end of the buffer (which basically mimics the t_d member for chunks larger than sizeof(TREE)).

Allocated objects, instead, only use the first element, which contains the size of the chunk. The first element is ensured to be ALIGN bytes in size:
 103 /* the proto-word; size must be ALIGN bytes */
 104 typedef union _w_ {
 105         size_t          w_i;            /* an unsigned int */
 106         struct _t_      *w_p;           /* a pointer */
 107         char            w_a[ALIGN];     /* to force size */
 108 } WORD;
so that the data portion is guaranteed to start at the required alignment boundary, which is 16 bytes on 64-bit systems.

Since every chunk is guaranteed to be aligned to and a multiple of ALIGN, the last bits of the size member are equally guaranteed to be zero. libc:malloc exploits this to store further information in the last two bits, with BIT0 that specifies whether the block is in use (1) or not and BIT1 that specifies, for blocks in use, whether the preceding block is free. This information is used at free time to handle the coalescing of free blocks together (adjacent free blocks are always coalesced, in order to reduce fragmentation).

Allocations smaller than the TREE structure itself would waste a lot of space for the header, so the allocator provides a specialized path: ,smalloc(). Small allocations are based off on an array of linked lists of free chunks. Each entry of the array satisfies a multiple of WORDSIZE requirement, so that List[0] contains 16 bytes chunks, List[1] 32 bytes chunks and so forth up to sizeof(TREE).

libc:malloc grows its memory footprint through sbrk() in the _morecore() function, which extends the Bottom block, the last free block available. As it's common in this kind of allocators, the Bottom block is treated a bit specially.

Introducing ADI defenses to libc:malloc

Porting an allocator to use ADI requires to solve three main problems: buffer alignment/size, what versioning scheme to use and minimize performance degradation. The former two end up directly or indirectly affecting the latter. It's also paramount for the allocator to run with basically the same speed and memory consumption when the ADI protections are disabled, so that applications that do not have ADIHEAP enabled do not suffer of any potential regression.

The first step in opening the allocator to the ADI world is to change the minimum unit of operation to 64 bytes. Looking at the libc:malloc code, this means to move ALIGN to a value of 64. Unfortunately, in libc:malloc, ALIGN meaning is overloaded, since it's both the alignment that malloc() expects for the pointer and the alignment that chunks have in memory within the allocator. While this normally aligns (no pun intended), when it comes to ADI, it's decoupled: malloc() still lives happily with 16-bytes aligned pointers, but we need 64-byte aligned chunks that back them. This can be a common scenario in allocators, but is luckily also a fairly simple one to fix, by capturing the decoupling into different variables.

Actually, the best thing to do here is to rely as little as possible on static defines and have alignment and other requirements extracted at runtime, through an init function. The init function is also the best place to learn on whether we are going to enable ADI protections or not, by consuming the  sx_enabled(3C) interface from the security extensions framework:
    if (sx_enabled(SX_ADIHEAP)) { ADI specific initialization...
         adi_enabled = B_TRUE;
The ADI specific initialization should probably leverage the ADI API to collect information about the versioning space, the default block size, etc (see the examples in the mentioned blog entry).

Done with the alignment, we need to concentrate on the allocator metadata and how it affects chunks size and behavior. In the reference libc:malloc implementation, all the metadata concentrates into the TREE header, which is WORDSIZE * 6 = 16 * 6 = 92 bytes in size. This would dictate a minimum unit of allocation of 128 bytes, which is a bit scary for workloads that allocate lots of small objects. Ideally we'd like for the minimum unit to be 64 bytes and we can actually achieve that here with a more clever use of the data types. In fact only the very first member of the TREE structure needs to be 16 bytes in size, while all the others (with the exception of some implication for the last member as well) don't. This allows to get rid of the WORD union padding and just declare the members as struct pointers, which are 8 bytes in size. By keeping the first and last member as proto WORD and the rest as pointers we get to 16 * 2 + 8 * 4 = 64 bytes, which is exactly our goal.

Now that we have the minimum unit down to the cacheline size, we can start thinking about small allocations in the smalloc() path. There are basically two options: keep smaller buffers together under a common version, hoping to not have small controlled overflows between them or ditch the smalloc() algorithm altogether and declare that any allocation will at least get one 64 bytes chunk. Simplicity trumps memory usage by a fair bit here and we get the added benefit of a more secure implementation, so we say goodbye to smalloc() when ADIHEAP is enabled. With the proper adjustment at runtime for MINSIZE (another good candidate for a variable set up during the init path), smalloc() path is just never entered.

With the size/alignment solved, it's time for the versioning scheme, which again requires to evaluate the metadata behavior to make a decision. Either we isolate the header into its own 64 byte chunk, but de fact we end up extending the minimum allocation unit back to 128 bytes and that's not good, or just accept in the threat model that an underflow can target the size member. It's a tradeoff, but we can make a case for underflows to be significantly less common than overflows and that it has to be a controlled underflow (anything above 16 bytes would trap). It's not uncommon for security defenses to have to take tradeoffs and certainly we could provide a paranoid mode where the metadata is decoupled. I'm usually very complexity and knobs adverse, but that must not come with ineffective implementations. In this case, the major feature of ADI (detecting and invariantly preventing linear overflows) is maintained, so we can go for the simpler/faster implementation.

The versioning scheme is pretty standard for allocators that don't have isolated metadata, as we just need to reserve one version (1 is a popular choice) for free objects. At allocation time, instead, we're free to cycle to the rest of the versioning space. There are usually two different algorithms for versioning objects, which depend on whether the relative position between chunks is known and fixed at allocation time or not. For libc:malloc it clearly isn't, so we go for "randomization" plus verification. The fake randomization is based on RDTICK (the timestamp register) and improved with left and right verification.

Each time a version is selected, it is compared with the antecedent chunk and, if identical, it's incremented by one. The resulting value is then compared with the following chunk and, if identical, it's incremented again, taking care of looping back to 2 when hitting the end of the versioning space. This ensure the most important property for an ADIHEAP capable allocator to act as an invariant against linear overflows: no two adjacent allocated buffers should be found having the same version.

Run, Crash, Fix, Rinse and Repeat

ADIHEAP and the above prototype come along pretty nicely and seem to go through the most obvious testing of doing a bunch of malloc() and checking the returned version. It's with a lot of confidence that I type sxadm exec -s adiheap=enable /bin/ls, already imagining how the mail to showcase ADIHEAP to my colleagues would look like.

Segmentation Fault (core dump). Almost immediately and within libc:malloc.

Turns out that any non trivial sequence of memory operations requires coalescing and tree rebalancing, which do a heavy use of the LAST() and NEXT() macros.
 145 #define LAST(b)         (*((TREE **)(((uintptr_t)(b)) - WORDSIZE)))
 146 #define NEXT(b)         ((TREE *)(((uintptr_t)(b)) + SIZE(b) + WORDSIZE))
LAST() takes as an argument a TREE pointer and subtracts WORDSIZE. This effectively accesses the last 16 bytes of the previous chunk which, when free, contains the self pointer to the chunk itself. In other words, LAST() allows to reach back to the previous free chunk without knowing its size. NEXT() takes as a pointer a TREE structure, adds the size of the chunk and the extra WORDSIZE that corresponds to the size metadata at the start, effectively accessing the next adjacent chunk. This is generally fine and dandy, except that with ADIHEAP, the previous and next chunk most likely have mismatching versions and the segmentation fault is inevitable.

Once again we have two ways to fix this: we can either declare NEXT() and LAST() trusted paths and have them use non faulting loads that ignore ADI tagging, or we can read the previous/next ADI tag and fixup the pointer ourselves before accessing it. The first solution is a bit faster, but feels and looks more hackish in an allocator, so we go for the latter, rewrite NEXT() and LAST() as functions and wait to see how bad the performance numbers look like. Turns they are not that bad, so we go for it.
darkhelmet$ sxadm exec -s adiheap=enable /bin/ls
0_sun_studio_12.4  ssh-XXXX.xbadd     volatile-user
hmptemp            ssh-xauth-xSGDfd
Woot. I'm the coolest boy in town. Let's see what else we can run with ADIHEAP enabled. /bin/w ? Done. /sbin/ping ? Yeah. /bin/bash ?

Segmentation Fault (core dump). Once again within libc:malloc.

A few imprecations later, the fulguration on the road to Damascus. While our brk space is backed by ADI enabled pages, other adjacent memory isn't. Our versioning algorithm, when it allocates a new version, checks the previous and next chunk, but what if there is no previous or next chunk at all, because we are just at the start or at the end of the brk space (something that ASLR had hidden from previous tests)? We core dump attempting to use an ADI API over non ADI enabled pages.

libc:malloc only keeps track of where the brk ends, but we also need to know where brk started. In general, you may have to keep track of the boundaries of your allocated space, especially if you have it fragmented in memory. In this case, it's simple, we just store the starting address in our init function.

The above are just two examples of the many joy, segmentation fault, despair, debug, fix loops that touching an allocator brings. I spare you from many of the others, from off by one in the versioning code (that led to certain chunks getting adjacent ones retagged), to fixing and then fixing again memalign(), from discovering that emacs takes an entire copy of the lisp interpreter at build time, dumps its memory and restores it back when executed on another system and you've got to work around that to make it work (ignore free()s and replay any realloc() as if it was a new malloc() does the trick) and fixing python clever memory management code (see the objmalloc patch for an example of using a non faulting load in a performance sensitive code path).

Python, why you so slow?

The first performance numbers out of libc:malloc are encouraging, but while testing the objmalloc patch on Python things look ridiculously slow. A pkg solving operation that takes 30-40 seconds with the standard libc bumps up to minutes with ADIHEAP enabled, a 5x regression which is just as bad as unexpected.

Oracle Studio Performance Analyzer is a friend here and I start collecting samples to figure out where the problem lies. One outlier shows up: some PyList_Append() calls seem to spend an awful lot of time in realloc() compared to other paths that call into it as well, even other invocations of the function itself. Some DTrace allows to further isolate the realloc() sequences in PyList_Append(), identifying an interesting pattern, whereby PyList_Append() allocates a small amount of memory and reallocates it up multiple times in small chunks.

I take a long stare at the realloc() code, but nothing seems wrong and there aren't really any changes from the ADIHEAP code: adjacent blocks are checked and if one fitting is found, it is merged with the existing one and the rest is freed.

And then it hits me.

Some of this reallocation calls extend into the Bottom chunk, which is a couple of kilobytes in size.
Here is what happens:

  • a small 64 byte chunk is allocated
  • the chunk ends up being right next to the Bottom chunk
  • the chunk is reallocated extending its size to 128/256 bytes
  • the bottom chunk (which is 8/16K bytes) is splitted into two parts, the extra 64/192 bytes needed for the reallocation and the remaining 8/16K - 64/192 bytes
  • the chunk version is extended into the extra 64/192 bytes
  • the free version is stored on the remaining 8/16K bytes

multiply the above for a couple of incremental loops and we are doing kilobytes and kilobytes of unnecessary tagging! No kidding that this is killing performance.

I rewrite the free tagging code with an extra check, which evaluates the first and last chunk of the to-be-freed chunk: if both versions match the free version it means that we have been called on a split free block and so we have nothing to do and can just return. With this simple check alone, the 5x regression goes away and performance is almost on par with the non ADIHEAP case.

Almost, though. There is still a bit too much time spent in _morecore() compared to the vanilla libc case. The impact is significantly smaller, but notwithstanding this it's noticeable in the Studio analyzer output.

The lesson learned with realloc() sort of applies to _morecore() as well: when we extend the brk space, we ask for a couple of pages and mark them as free. Once again, we're tagging kilobytes and kilobytes of memory unnecessarily, because those pages come fresh (and empty) from the kernel. Just as we used as a sentinel in realloc() the presence of the free tag, we use as a sentinel here the presence of the universal tag, and limit ourselves to tag only the first and last 64 byte chunks of the freshly received memory.  Similarly, during a split, we check again for an underlying zero tag and convert the first chunk to the free tag.

This works because ADI allows to access with an arbitrary tag in the pointer memory that is physically tagged as universal match (0 or 0xF), which is what the kernel returns by virtue of zeroing the pages out. With this extra change there are more unexpected slowdowns and the pkg operation runs almost as fast as with the vanilla libc.

Debugging Easter Eggs

The ADIHEAP libc:malloc implementation is gathered towards production scenarios, but since I was there, I left a couple of debugging easter eggs that you might find interesting. You can turn these on through the following environment variables:
  • _LIBC_ADI_PRECISE: when set, the application runs with ADI set to precise mode. This is handy when you have a store version mismatch and you want to pinpoint exactly where it happened. 
  • _LIBC_MALLOC_ZEROFILL: when set, malloc() behaves just like calloc() and zeroes out a buffer before handing it out (useful in certain cases where memory is freed and later reused, but not initialized properly). This is implemented regardless of whether ADIHEAP is enabled, but is kind of cute with ADIHEAP, since we can use adi_memset() rather than adi_set_version() and get it basically for free.
  • SXADM_EXTEND: ADIHEAP and ADISTACK can uncover a large number of latent bugs and hence bring disruption to software that otherwise "just works" (think of an off by one read outside of the allocated chunk, which is generally non fatal for an application, but immediately traps under ADIHEAP). For this reason, model=all is not exposed for ADIHEAP and ADISTACK. When this variable is set, model=all can be used there too, e.g. through SXADM_EXTEND=1 sxadm enable -c model=all adiheap. I'm a huge fan of model=all for testing and that's how I run my T7 environments, in order to catch bugs left and right (and it did catch many).

Final Words

I've been waiting to write this blog post for quite some time now and I'm glad this stuff is finally out. The most important message that should get conveyed is that, while there might be tricky corner cases, any default/system allocator can be converted to use ADI and benefit from the added security checks that come with it.

If you have a T7 and an Oracle Solaris 11.4 copy, you can now go out and play with sxadm and ADIHEAP (and ADISTACK, more about it coming), I look forward to hear from your experience!

No comments:

Post a Comment