Thread-Local Storage Descriptors for IA32 and AMD64/EM64T Version 0.9.6 - 2018-10-11 Alexandre Oliva Copyright 2005, 2008, 2018 Alexandre Oliva. Permission is granted to distribute verbatim copies of this document. Please contact the author at aoliva@redhat.com or oliva@gnu.org to request additional permissions. This is a draft of work in progress. The ABI changes suggested below are not to be taken as final. In particular, the relocation numbers and dynamic table entries have not been approved by the official maintainers of the ABIs, the instruction selection is still subject to change and even the calling conventions may be modified without notice. Updated versions of this document may be located at http://people.redhat.com/aoliva/writeups/TLS/RFC-TLSDESC-x86.txt or http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt. Introduction ============ While porting NPTL to the FR-V architecture, an idea occurred to me that would enable significant improvements to the General Dynamic and Local Dynamic access models to thread-local variables. These methods are known to be extremely inefficient because of the need to call a function to obtain the address of a thread-local variable, a function that can often be quite inefficient. The reason for calling such a function is that, when code is compiled for a dynamic library (the cases in which these access models are used), it is not generally possible to know whether a thread-local variable is going to be allocated in the Static Thread Local Storage Block or not. Dynamic libraries that are loaded at program start up can have their thread local storage sections assigned to this static block, since their TLS requirements are all known before the program is initiated. Libraries loaded at a later time, however, may need dynamically-allocated storage for their TLS blocks. In the former case, the offset from the Thread Pointer, usually held in a register, to the thread-local variable is going to be the same for all threads, whereas in the latter case, such offset may vary, and it may even be necessary to allocate storage for the variable at the time it is accessed. Existing implementations of GD and LD access models did not take advantage of this run-time constant to speed up the case of libraries loaded before program start up, a case that is certainly the most common. Even though libraries can choose more efficient access models, they can only do so by giving up the ability for the modules that define such variables to be loaded after program start up, since different code sequences have to be generated in this case. The method proposed here doesn't make a compile-time trade off; it rather decides, at run time, how each variable can be accessed most efficiently, without any penalty on code or data sizes in case the variable is defined in an initially-loaded module, and with some additional data, allocated dynamically, for the case of late-loaded modules. In both cases, performance is improved over the traditional access models. Another advantage of this novel design for such access models is that it enables relocations to thread-local variables to be resolved lazily. Background ========== Thread-local storage is organized as follows: for every thread, two blocks of memory are allocated: a Static TLS block and a Dynamic Thread Vector. A thread pointer, normally a reserved register, points to some fixed location in the Static TLS block, that contains a pointer to the dynamic thread vector at some fixed location as well. TLS for the main executable, if it has a TLS section, is also at a fixed offset from the thread pointer. Other modules loaded before the program starts will also have their TLS sections assigned to the Static TLS block. The dynamic thread vector starts with a generation count, followed by pointers to TLS blocks holding thread-specific copies of the TLS sections of each module. If modules are loaded at run time, the dynamic thread vector may need to grow, and the corresponding TLS blocks may have to be allocated dynamically. The generation count is used to control whether the DTV is up-to-date with regard to newly-loaded or unloaded modules, enabling the reuse of entries without confusion on whether a TLS block has been created and initialized for a newly-loaded module or whether that block was used by a module already unloaded, that is still waiting for deallocation. Programs can access thread-local variables by using code that follows 4 different models: Local Exec, Initial Exec, General Dynamic and Local Dynamic. Local Exec is only applicable when code in the main executable accesses a thread-local variable also defined in the main executable. In such cases, the offset from the Thread Pointer to the variable can be computed by the linker, so the access model consists in simply adding this offset to the Thread Pointer to compute the address of the variable. Initial Exec is applicable when code in the main executable accesses thread-local variables that are defined in other loadable modules that are dependencies of this executable. Since the dynamic loader will make sure they are loaded before the program can start running, it can lay out the Static TLS Block taking their TLS sections' sizes and alignment requirements into account. Since the Thread Pointer points into the same location within the Static TLS Block of each thread, the offset from the Thread Pointer to the variable one wants to access is constant for all threads. However, unlike the Local Exec case, it is not known at link time, but rather at run time. So, the compiler arranges for the offset to be loaded from a Global Offset Table entry, that the linker creates to hold the TP offset of the variable, and adds that to the TP to access the variable. The linker also emits a dynamic relocation for the dynamic loader to fill in the GOT entry with the correct value. Although the Initial Exec access model is intended for use in code generated for executables, some dynamic libraries may use it as well, when they know the referenced variable is defined in the main executable or in another initially-loaded dynamic library. In fact, if a dynamic library references a thread-local variable defined in itself, using the Initial Exec access model, it may be giving up the ability to be loaded at run time, with mechanisms such as the dlopen() library call. General Dynamic is the most general access model, in that it poses no requirement whatsoever on the variable-accessing or the variable-defining modules. This access model has traditionally involved calling a function, typically called __tls_get_addr(), to obtain the address of a thread-local variable. This function takes two pieces of information to compute the address: a module id, an index into the dynamic thread vector corresponding to the module in which the variable is defined, and an offset of the variable from the beginning of the corresponding TLS block. These two pieces of information are computed by the dynamic loader, in GOT locations determined by the linker. If they're adjacent, it's possible to pass both pieces of information to __tls_get_addr() by means of a single pointer; otherwise, they're usually passed as two separate arguments. In either case, the code sets up arguments for __tls_get_addr() and then calls it explicitly, using the returned value as the address of the variable. The Local Dynamic access model can only be used to access variables that are local to the module in which they're used, and it only makes sense when accessing multiple such variables. The idea is to use a single call to __tls_get_addr() to compute the base address of the TLS block for the module. Then, to access variables, it suffices to add to this base address the offset corresponding to each variable. The offset can be used as a literal in the code, since it's determined by the linker when it lays out the TLS section of the module. Overall Design ============== The optimization proposed herein intends to improve the performance of dynamic libraries that access thread-local variables using the dynamic access models, in cases they're loaded before program start up, without slowing down (or, if possible, speeding up) the performance when the same libraries are loaded during run time. The idea is to use TLS descriptors containing a pointer to a function and an argument. This uses two words in the GOT, just like the arguments traditionally passed to __tls_get_addr(), but uses this space in a very different way. A single relocation is to be used to compute the value of the two words of the TLS descriptor. The dynamic loader, when resolving the relocation, can determine whether the TLS block used for the module that defines the TLS symbol is in the static TLS block or not. If it is in the static TLS block, the offset from the TP to the symbol is constant for all threads, so it sets the argument in the TLS descriptor to this constant offset, and sets the function pointer to a piece of code that simply returns the argument (plus the TP, in an alternate design). If the defining module is in dynamically-allocated TLS, it sets the argument to a dynamically-allocated extended descriptor containing the arguments passed to __tls_get_addr() and the module's generation count, and the function pointer to a piece of code that checks whether the generation count is sufficiently up to date and that the DTV entry is set. If so, this piece of code loads the DTV entry, adds the symbol offset to it, and subtracts the TP value (unless using the alternate design), returning the result. Otherwise, it calls __tls_get_addr(), and subtracts from its return value the TP (unless using the alternate design). The functions defined above use custom calling conventions that require them to preserve any registers they modify. This penalizes the case that requires dynamic TLS, since it must preserve (*) all call-clobbered registers before calling __tls_get_addr(), but it is optimized for the most common case of static TLS, and also for the case in which the code generated by the compiler can be relaxed by the linker to a more efficient access model: being able to assume no registers are clobbered by the call tends to improve register allocation. Also, the function that handles the dynamic TLS case will most often be able to avoid calling __tls_get_addr(), thus potentially avoiding the need for preserving registers. (*) Preserving a register does not necessarily imply saving and restoring it. If the system library implementation does not use or even know about a certain extended register set, it needs not save it, because it will presumably not modify it. This assumes the TLS Descriptor implementation is self-contained within the system library, without no overridable callbacks. A consequence is that, even if other parts of the system library are compiled so as to use an extended register set, those used by the implementation of TLS Descriptors, including lazy relocations, should be limited to using the register set that the interfaces are known to preserve. Lazy relocation is accomplished by setting the argument portion of the TLS descriptor to point to the relocation, and the function pointer to point to a lazy relocation function. Some effort is needed to ensure that the TLS descriptor is modified and accessed atomically, and that the lazy relocation function can quickly identify the module that contains the relocation. This optimized access model thus consists in setting up a pointer to the TLS descriptor (or, in another alternate design, loading its contents atomically), then calling the function at the address given by the pointer to function in the TLS descriptor, passing to it a pointer to the descriptor itself (or, in an alternate design, the argument portion of the TLS descriptor). The value returned by this call can then be used as an offset from the thread pointer to access the variable (or as the address of the variable, in the first alternate design). The actual code sequences and implementation details for IA32 and AMD64/EM64T are depicted in the following subsections. IA32 ---- The general dynamic access model used to be: leal variable@TLSGD(,%ebx,1), %eax call ___tls_get_addr@PLT # use %eax as the address, or (%eax) to access the value After the call instruction, %eax holds the address of variable. variable@TLSGD is resolved by the linker to the GOT offset holding the data structure passed by reference to ___tls_get_addr(), containing the TLS module id and the TLS offset, computed by two separate dynamic relocations. The odd addressing mode is needed to make the instruction longer, such that, in case the linker finds out it can relax the code sequence, the necessary code fits. The optimized method proposed here uses: leal variable@TLSDESC(%ebx), %eax [...] # any other instructions that preserve %eax call *variable@TLSCALL(%eax) # add %gs:0 to %eax to compute the address, # or use %gs:(%eax) to access the value Note that this call instruction is actually emitted without an offset, so it's 4 bytes shorter than it appears to be. It's equivalent to `call *(%eax)', that emits a two-byte instruction, but it's annotated with a relocation that enables the leal and the call to be moved apart from each other, for better scheduling. The following new relocations types are used by the code above: #define R_386_TLS_GOTDESC 39 /* GOT offset for TLS descriptor. */ #define R_386_TLS_DESC_CALL 40 /* Marker of call through TLS descriptor for relaxation. */ #define R_386_TLS_DESC 41 /* TLS descriptor containing pointer to code and to argument, returning the TLS offset for the symbol. */ TLS_DESC is the dynamic relocation that the linker emits in response to the two other relocations. All of the relocations accept addends; being REL, the relocations have in-place addends. In the TLS_DESC case, that applies to a pair of words, the addend is stored in the second word, to simplify some lazy relocation cases. In order to enable lazy relocation, %ebx MUST point to the GOT at the point of the call instruction. Instead of adding more relocations for the Local Dynamic case, we propose the use a symbol _TLS_MODULE_BASE_, that the linker implicitly defines, as a hidden symbol, to the base address of the TLS section of a module. Because of the way @DTPOFF has historically been relaxed in main executables, when using the GNU TLS extensions, _TLS_MODULE_BASE_ is handled differently on final executables than on dynamic libraries. In general, _TLS_MODULE_BASE_ is defined at the lowest address in the TLS segment, i.e., so that adding its NTPOFF (computed in the sequence above) to the linker-computed DTPOFF of a variable yields the TP offset of the variable. On main executables, the DTPOFF is relaxed into NTPOFF, so we must stop _TLS_MODULE_BASE_'s standard NTPOFF from affecting the direct LE addressing. Therefore, for main executables, the call sequence must be relaxed in such a way that the base address computed into %eax is zero. Whether this is accomplished by setting _TLS_MODULE_BASE_ to the end of the TLS section, by handling its relaxation especially, or by any other means, is unspecified. _TLS_MODULE_BASE_ is a special internal symbol, not to be relied upon for any other purposes; it needs not even be defined as a symbol in the linker output. When relaxing the sequence above to the Initial Exec model, we'd get sequences such as: movl variable@GOTNTPOFF(%ebx), %eax [...] nop; nop Double-nop and jmp .+2 were more efficient than any prefix applied to nop, mov %eax,%eax or breaking up the movl into leal and movl (%eax),%eax on Pentium III and Athlon 64, so double-nop it is. In case there's a GOT entry holding the positive TPOFF: movl variable@GOTTPOFF(%ebx), %eax [...] negl %eax I'm not sure which of these would be better to use, but in my current implementation I've preferred the first alternative, unless a GOT entry for the third is needed for other reasons, and one for the first case isn't. The Local Exec model uses code such as: leal variable@NTPOFF, %eax [...] nop; nop Relaxing Local Dynamic model such as: leal _TLS_MODULE_BASE_@TLSDESC(%ebx), %eax [...] # any other instructions that preserve %eax call *_TLS_MODULE_BASE_@TLSCALL(%eax) [...] # any other instructions that preserve %eax leal variable@dtpoff(%eax), %edx to Local Exec is to yield code such as: leal 0, %eax [...] nop; nop [...] leal variable@ntpoff(%eax), %edx In theory, it would be possible to eliminate entirely the zero-initialization of %eax and remove it from the second leal. However, this would prevent a compiler from moving the addition of the thread pointer to the zero-initialized %eax, rather than adding it separately to each local dynamic variable. Since there is no relocation to identify the addition, we mandate the zero-initialization and the use of the base address. The function for the static TLS case can be as simple as: _dl_tlsdesc_return: movl 4(%eax), %eax ret whereas the other takes a bit more effort, requiring the following data structure definitions: struct tlsdesc { ptrdiff_t __attribute__((regparm(1))) (*entry)(struct tlsdesc *); void *arg; }; typedef struct dl_tls_index { unsigned long int ti_module; unsigned long int ti_offset; } tls_index; struct tlsdesc_dynamic_arg { tls_index tlsinfo; size_t gen_count; }; The definition of the resolution function follows the logic depicted below, except for the need to preserve all registers except %eax. ptrdiff_t __attribute__ ((__regparm__ (1))) _dl_tlsdesc_dynamic (struct tlsdesc *tdp) { struct tlsdesc_dynamic_arg *td = tdp->arg; dtv_t *dtv = *(dtv_t **)((char *)__thread_pointer + DTV_OFFSET); if (__builtin_expect (td->gen_count <= dtv[0].counter && (dtv[td->tlsinfo.ti_module].pointer.val != TLS_DTV_UNALLOCATED), 1)) return dtv[td->tlsinfo.ti_module].pointer.val + td->tlsinfo.ti_offset - __thread_pointer; /* Preserve any call-clobbered registers not preserved because of the above across the call below. */ return ___tls_get_addr (&td->tlsinfo) - __thread_pointer; } The tlsdesc_dynamic_arg objects are allocated by the dynamic loader when resolving a relocation, and stored in a hash table created for the module in which the symbol is defined. Note that this dynamic allocation has no implications to prelinking, since prelinking is only applicable to modules loaded before program start up, and thus always uses the static TLS case, that does not need dynamic allocation. An alternate design in which the function called through the TLS descriptor returns not the TP offset, but rather the address of the variable of interest, could refrain from adding %gs:0 to the value returned by the call to compute the address of a symbol, and from using the %gs: prefix when accessing the variable, but it would require the use of a longer call instruction to enable proper relaxation. The call instruction would have to be 7, instead of 2 bytes long, such that the linker could relax it to `addl %gs:0, %eax'. This would make code that accesses the variable 4 bytes longer on average (5 bytes minus one used by the %gs prefix), whereas code that computes its address would be shorter by only two bytes. It's not clear such a change would be profitable. Lazy relocation is not profitable in all cases. Consider that we need the argument in the TLS descriptor to hold both the addend and the address of the relocation, if it's a REL relocation, and it probably wouldn't make sense to depend on dynamic memory allocation for the relocation. So, in case we actually need both pieces of information, we perform the relocation immediately. In other cases, such as that of RELA relocations and REL relocations with zero addends, we use the argument to hold a pointer to the relocation. Another special case we handle is that of REL relocations with addends that reference the *ABS* section, i.e., that reference the local TLS section. All cases handled by lazy relocation start by grabbing a global dynamic loader lock and checking that the pointer in the TLS descriptor hasn't changed. If it has, we return immediately into the new pointer, after releasing the lock. Otherwise, we set the function pointer to a hold function that will get all other threads that attempt to use this variable to wait until relocation is complete. After diverting any other threads to the hold function, we can perform the relocation, determining whether the referenced symbol is in Static TLS or not, and deciding which of the two functions to use from that point on and computing the argument to pass to it. The argument is stored first in the TLS descriptor, and the new entry point is stored last. The relocation functions finally release the lock and return into the newly-computed function. In the current implementation, the hold function attempts to grab the lock, checks that the pointer hasn't changed and releases the lock, such that, if any other such TLS descriptor lazy relocation is in progress, it will wait until the lock is released. When it obtains the lock, it releases it immediately and returns into the function newly-stored in the TLS descriptor. An alternate implementation is envisioned that relies on condition variables that hold functions would wait on, such that the relocation functions wouldn't have to hold the lock throughout their execution, rather waking up all hold functions upon completion of the relocation. AMD64/EM64T ----------- The design is very similar to that of IA32. The main difference stems from the fact that AMD64's IP-relative addressing modes have enabled it to do away with the need for a register holding the GOT pointer, which in turn required additional measures to enable lazy relocation of TLS descriptors. Where the existing ABI uses the following sequence: .byte 0x66 leaq variable@TLSGD(%rip), %rdi .word 0x6666 rex64 call __tls_get_addr@PLT # use %eax as the address, or (%eax) to access the value we propose this instead: leaq variable@TLSDESC(%rip), %rax [...] # any other instructions that preserve %eax call *variable@TLSCALL(%rax) # add %fs:0 to %eax to compute the address, # or use %fs:(%eax) to access the value Note that, as in the IA32 case, the call instruction is a two-byte instruction, the offset is completely discarded by the assembler. The following new relocations types are used by the code above: #define R_X86_64_GOTPC32_TLSDESC 34 /* GOT offset for TLS descriptor. */ #define R_X86_64_TLSDESC_CALL 35 /* Marker for call through TLS descriptor. */ #define R_X86_64_TLSDESC 36 /* TLS descriptor. */ As on IA32, _TLS_MODULE_BASE_ is to be used to obtain the base address for the Local Dynamic access model. The same considerations for relaxation apply. The alternate design proposed for IA32 that gets the TLS descriptor call to compute not the offset, but the actual address of the variable, would require a much longer call instruction to accommodate the 9 bytes needed to add the TP to the address in case of relaxation, so it is even less likely to be profitable. When relaxing the sequence above to the Initial Exec model, we'd get sequences such as: movq variable@GOTTPOFF(%rip), %rax [...] nop; nop double-nop and jmp .+2 were more efficient than any prefix applied to nop an Athlon 64, so go with the former. It's used for Local Exec as well: movq $variable@TPOFF, %rax [...] nop; nop Relaxing Local Dynamic model such as: leaq variable@TLSDESC(%rip), %rax [...] # any other instructions that preserve %eax call *variable@TLSCALL(%rax) [...] # any other instructions that preserve %eax leaq variable@dtpoff(%rax), %rcx to Local Exec is to yield code such as: movq $0, %rax [...] nop; nop [...] leaq variable@tpoff(%rax), %rcx As on IA32, optimizing away the zero initialization would only be possible with further additions to the ABI specification, and it is currently forbidden. The other data structures and TLS descriptor offset computation functions are equivalent to those used on IA32, with one point worth noting that the type of the `entry' member of struct tlsdesc cannot be represented in C, and not even in GNU-extended C, since the function takes its argument in %rax, rather than in %rdi. Since AMD64 uses RELA dynamic relocations, all TLS descriptors are suitable for lazy relocation; there's no need to worry about preserving the addend, since it is held in the relocation table itself. Thus, there's only need for one lazy relocation function. That said, this function needs some means to be told what module the relocation at hand refers to. On IA32, this is done by means of the reserved register %ebx, that points to the Global Offset Table, but no register holds this pointer on AMD64. Although it would be possible to use the relocation pointer to search the relocation ranges for all loaded modules, this would be extremely inefficient, so we've introduced two dynamic table entries to enable the dynamic loader to communicate with modules that use lazy relocation: #define DT_TLSDESC_PLT 0x6ffffef6 /* Location of PLT entry for TLS descriptor resolver calls. */ #define DT_TLSDESC_GOT 0x6ffffef7 /* Location of GOT entry used by TLS descriptor resolver PLT entry. */ A module that uses lazy TLSDESC relocations MUST define these two entries, unless the DF_BIND_NOW flag is set. The former indicates the address of a GOT entry to be filled in by the dynamic loader with the address of the internal function to be used for lazy relocation of TLS descriptors. The latter must hold the address of a PLT entry that pushes onto the stack the module's link map address, located in the GOT portion reserved for the dynamic loader to use, and then jumps to the lazy relocation function, using the address stored in the TLSDESC_GOT entry. The lazy relocation function is responsible for releasing the stack slot taken up by the PLT entry. Future Improvements =================== The use of TLS descriptors to access thread-local variables would enable the compression of the DTV such that it contained only entries for non-static modules. Static ones could be given negative ids, such that legacy relocations and direct calls to __tls_get_addr() could still work correctly, but entries could be omitted from the DTV, and the DTV entries would no longer need the boolean currently used to denote entries that are in Static TLS. The DTV could also be modified so as to contain TP offsets instead of absolute addresses. Some refactoring of _dl_tlsdesc_dynamic() and __tls_get_addr() could avoid the need to subtract TP in the former, using an alternate entry point that refrains from adding the TP to the offsets in the new DTV. It might make sense to combine the current design, in which TLSCALL functions return the TP offsets, with the alternate design in which they return the actual address, introducing different relocations for each case, enabling code to use the former when accessing variables directly using the TP segment register, and using the latter when computing their addresses. It's not obvious that this would have any significant impact on performance, and although code size could certainly be reduced for code that computes the address of thread-local variables, it's not obvious that the need for additional GOT entries and supporting code would not make up for it. Conclusion ========== The design described above is currently functional, using CVS snapshots of GCC, binutils and glibc taken in the past few days, plus local changes about to be contributed. Even though the performance of the standard NPTL benchmark is improved by only a negligible margin, that's understandable: glibc and nptl, aware that the dynamic access models used to be so inefficient, take advantage of the fact that libc is always loaded initially and reference almost all thread-local variables using the more efficient access models. Besides, the benchmark program itself is not affected by the TLS access model glibc uses. Even after disabling the code in glibc that selects the Initial Exec model to access variables defined in libc, the effect on the benchmark's performance was negligible. However, synthetic benchmarks designed to time functions that return the value or the address of thread-local variables, have shown that the performance of the proposed method is significantly better than that of the currently-used method. When the referenced variable is found to be in static TLS at run time, the newly-proposed method makes such functions about twice as fast as when using the method in wide use today, bringing it close to the performance of the Initial Exec model. Even when the variable is in dynamic TLS, the speedup is still over 20%. As for code size, the new method tends to be a win for dynamic libraries that access TLS variables through the segment registers more often than they compute the address of such variables. Libraries in GNU libc such as libpthread, libmemusage, as well as most dynamic libraries used in its testsuite, have experienced small reductions in code size on both architectures, and no libraries whatsoever have experienced code size increases. Disabling the use of Initial Exec in the glibc build, libc.so shrinks by almost 1% on AMD64 and 0.5% on IA32, which more than compensates the almost-1% growth of the 10-times-smaller ld.so on AMD64, without change on IA32. Change Log ========== 0.9.6 - 2018-10-11: Document TLS Descriptor interface rationale under extended register sets. Drop long-dead lsd.ic.unicamp.br address. 0.9.5 - 2008-07-03: Detail specification of LD to LE relaxation, after binutils bug report from Cary Coutant. 0.9.4 - 2005-10-10: Whitespace fixes. More info on performance and code size changes. DT_TLSDESC_* may be omitted if DF_BIND_NOW is set. Use the same variable name in all examples, for uniformity. Select double-nop on relaxations; commit to other instruction selections. 0.9.3 - 2005-09-16: Fix information about code size differences in the latest build. Modify relocation numbers to avoid conflicts, and move dynamic table entries to a better range. Add URL for updated versions. Move copyright notice and work in progress notice to the beginning. 0.9.2 - 2005-09-15: Rename main section to Overall Design. Clarify how the dynamic TLS handler can often avoid the overhead of preserving registers. Add rough figures of performance improvements. 0.9.1 - 2005-09-13: Fixed typos and thinkos. Improved readability. Added section with Future Improvements. 0.9 - 2005-09-09: initial version.