Thread-Local Storage Descriptors for the ARM platform

Revision 0.2.2 - 2008-03-15

An updated version is available at
http://www.codesourcery.com/publications/RFC-TLSDESC-ARM.txt

Glauber de Oliveira Costa
Alexandre Oliva

With the special help of 3 trained soccer player monkeys


Rationale and TLS Descriptors
=============================

Accessing TLS variables in dlopened modules usually happens by means
of a call to __tls_get_addr(), which is, predictably enough,
responsible for getting the real address of the variable, given its
module index and offset.

Currently, choosing either the Local Dynamic or the General Dynamic
model of TLS access involves a call to __tls_get_addr() in the
resolution of the thread-local address.  However, the dynamic loader
has enough information to detect whether the variable being accessed
lives in the static TLS block (initially loaded modules) or can only
be reached by way of the Dynamic Thread Vector (DTV) (dlopened
modules).  This knowledge would enable it to select specialized
functions that may be able to access the variable more efficiently,
given a suitable mechanism for it to record such selections.

Such a mechanism is built by storing the specialized function together
with its argument in two consecutive GOT entries: the argument in the
first one and the specialized function in the second one.  This data
structure in the GOT is given the name Thread Local Storage
Descriptor, or simply TLS Descriptor.  The linker may also relax the
code sequence to the Local Exec or Initial Exec models.


Proposed Dynamic Access
=======================

Currently, the dynamic access model is as follows:

            ldr     r0, .Lt0
    .L1:    add     r0, pc, r0
            bl      __tls_get_addr(PLT)
    .Lt0:   .word   foo(tlsgd) + (. - .L1 - 8)

At the end of the call sequence, the r0 register holds the address of
the desired variable, and its contents can be accessed by issuing a
simple load instruction like:

            ldr     r0, [r0]

In our proposal, code for dynamic access changes slightly, turning out
to be:

            ldr     r0, .Lt0
    .L1:    bl      variable(tlscall)
    .Lt0:   .word   variable(tlsdesc) + (. - .L1)

One major difference from both current dynamic models is that, at the
end of the code sequence, r0 no longer holds the address of the
variable, but rather its tp-relative offset.  One must therefore issue
an

            ldr     r0, [$tp, r0]

in order to get its contents.

The tlsdesc relocation in .Lt0 gives the pc-relative address of the
TLS descriptor representing the thread-local variable we're interested
in.  The addend of the relocation is incomplete in order to properly
allow linker relaxations (see `Addend Adjustments' below for a
rationale) and must be adjusted at link time.

In the case of the Local Dynamic access model, to avoid the definition
of new relocations, the linker defines, for every module that has a
TLS section, a hidden per-module symbol called _TLS_MODULE_BASE_ that
denotes the beginning of the TLS section for that module.

In both the Local and General Dynamic cases, in the absence of
relaxations, tlscall is resolved to a call to a trampoline provided by
the linker.  Our proposal for the trampoline is:

    __tls_trampoline:
            add     r0, lr, r0
            ldr     r1, [r0, #4]    // load the resolver address
            bx      r1              // jump to it, passing it r0

Also, note that the last branch is a tail call.  This is expected to
make things simpler, as the resolver function can then rely on the
link register to hold the return address without needing to know where
the branch came from.
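As a rough C model of what the call sequence and the trampoline accomplish, consider the sketch below. It is illustrative only: the names tlsdesc_model, call_through_descriptor, static_resolver and load_tls_int are hypothetical and not part of this proposal, and an actual resolver follows a special calling convention that plain C cannot express. The descriptor is modeled as two consecutive words, argument first and resolver second; the trampoline calls the resolver with the descriptor address, and the caller adds the thread pointer to the offset it gets back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a TLS descriptor: two consecutive GOT words,
   the argument first and the resolver function second. */
struct tlsdesc_model {
    intptr_t argument;
    ptrdiff_t (*resolver)(struct tlsdesc_model *);
};

/* Models the trampoline: fetch the resolver from the second word and
   call it with the descriptor address, as r0 would carry it. */
ptrdiff_t call_through_descriptor(struct tlsdesc_model *desc)
{
    assert(desc->resolver != NULL);
    return desc->resolver(desc);
}

/* Example resolver for this model: the argument word already holds
   the tp-relative offset, so it is returned unchanged. */
ptrdiff_t static_resolver(struct tlsdesc_model *desc)
{
    return (ptrdiff_t)desc->argument;
}

/* Models the final `ldr r0, [$tp, r0]': add the thread pointer to the
   returned offset and load an int from the resulting address. */
int load_tls_int(const char *tp, struct tlsdesc_model *desc)
{
    int value;
    memcpy(&value, tp + call_through_descriptor(desc), sizeof value);
    return value;
}
```

In this model the caller never sees an absolute address from the resolver, only an offset, which is what lets the relaxed sequences drop the call without having to add the thread pointer themselves.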
The resolver function may then return by issuing:

            bx      lr

on architecture versions that support both ARM and Thumb; otherwise:

            mov     pc, lr


Relaxations
===========

The result of relaxing our proposed dynamic sequence to Initial Exec
is:

            ldr     r0, .Lt0
    .L1:    ldr     r0, [pc, r0]
    .Lt0:   .word   foo(gottpoff) + (. - .L1 - 8)

The branch and link instruction turns into a load.  The addend of the
relocation must be adjusted by the linker so that it provides the
correct offset for the load relative to the instruction at .L1.  The
resulting sequence is essentially the one the current Initial Exec
model specifies.

And by relaxing to Local Exec, we get:

            ldr     r0, .Lt0
            nop
    .Lt0:   .word   foo(tpoff)

Apart from the nop instruction, executed instead of the branch and
link one, the sequence is exactly what the current Local Exec model
specifies.  When relaxing to this access model, the addend is
completely ignored, as the tpoff relocation does not resolve to a
pc-relative address.

Since our proposed dynamic access models return a TP offset rather
than an absolute address, the relaxation to the above sequences is
much simpler: we do not need to fit an additional instruction that
adds TP into the short sequences above.


Inlining the Trampoline
=======================

To make it possible to inline the trampoline, for those who would
choose performance despite the increase in code size, the compiler
should be able to generate an instruction sequence that does the same
job the trampoline would otherwise have done.  Such an instruction
sequence may be:

            ldr     rt, .Lt1
    .L1:    add     rx, pc, rt
            ldr     ry, [rx, #4]
            [mov    r0, rx]
            blx     ry
    .Lt1:   .word   variable(tlsdesc) + (. - .L1)

Note that the addend is also incomplete here.  In order to keep the
ability to relax the code sequence, the instructions must be
annotated, as follows:

            ldr     rt, .Lt1
            .tlsdescseq variable
    .L1:    add     rx, pc, rt
            .tlsdescseq variable
            ldr     ry, [rx, #4]
            [mov    r0, rx]
            .tlsdescseq variable
            blx     ry
    .Lt1:   .word   variable(tlsdesc) + (. - .L1)

Note that we do not force the use of specific registers other than r0
for the argument to the resolver (see `Resolver Functions' below),
granting the compiler the ability to choose the best possible register
allocation.  There is no requirement that the instructions be issued
in this particular order either, or that no other instructions be
interspersed, or even that the values not be reused when it makes
sense.  It is even permitted for different registers to be used where
the specification above implies a single register, provided the value
is copied from one to the other.

Shown below is the code sequence after relaxation to the Initial Exec
model:

            ldr     rt, .Lt1
    .L1:    add     rx, pc, rt
            ldr     ry, [rx]
            [mov    r0, rx]
            mov     r0, ry
    .Lt1:   .word   variable(gottpoff) + (. - .L1 - 8)

In the Local Exec model, there is no need either to add back the pc or
to load the variable offset from anywhere, so the sequence turns into:

            ldr     rt, .Lt1
    .L1:    mov     rx, rt
            nop
            [mov    r0, rx]
            nop
    .Lt1:   .word   variable(tpoff)

Also, the mov instruction can be turned into a nop in case the source
and destination registers happen to be the same.  In both models the
final branch instruction is replaced (by `mov r0, ry' for Initial Exec
and by a nop for Local Exec), as there is no need to issue any
function call.


Addend Adjustments
==================

Depending on whether the trampoline is inlined or not, we use
different methods to compute the absolute address of a TLS descriptor.
The out-of-line trampoline adds lr to the relative address it is
passed in r0, previously loaded from the tlsdesc constant pool entry;
here lr contains an address that is one instruction past the bl
instruction annotated with the tlscall relocation.  The inline
trampoline instead adds pc to the relative address loaded from the
tlsdesc constant pool entry; here pc contains an address that is two
instructions past the address of the instruction that refers to it.
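The two base addresses just described determine what the constant pool word has to contain. The C sketch below works through that arithmetic for ARM mode only. It is an illustration under stated assumptions: the function name and its parameters are hypothetical, addresses are modeled as plain 32-bit integers, and Thumb mode is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

enum { ARM_INSN_SIZE = 4 };   /* bytes per ARM-mode instruction */

/* desc:    address of the TLS descriptor (its GOT entry)
   insn:    address of the instruction at .L1 that consumes the word
   is_call: nonzero for the out-of-line `bl' case,
            zero for the inline `add rx, pc, rt' case
   Returns the value the constant pool word must hold so that adding
   it to lr (call case) or pc (inline case) yields desc. */
uint32_t tlsdesc_pool_word(uint32_t desc, uint32_t insn, int is_call)
{
    assert(insn % ARM_INSN_SIZE == 0);
    /* lr is one instruction past the bl; pc reads two instructions
       past the instruction that uses it. */
    uint32_t base = insn + (is_call ? 1 : 2) * ARM_INSN_SIZE;
    return desc - base;       /* arithmetic is modulo 2^32 */
}
```

For a descriptor at 0x20000 referenced from an instruction at 0x10000, this gives 0xfffc for the out-of-line case and 0xfff8 for the inline case: the same descriptor and the same instruction address need words that differ by one instruction.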
The result of the tlsdesc relocation must be adjusted to work in both
cases, even when they happen to be relaxed, which makes matters more
difficult: offsets that have to be different before relaxation need to
become the same after relaxation.

The solution that avoids the need for distinct relocation types for
inline and out-of-line trampolines is to provide the linker with
enough information for it to make the correct decision.  We thus emit
the relocation with an addend that provides the relative location of
the instruction that is going to use the result of the relocation.  If
it is a call instruction, presumed to be annotated with a tlscall
relocation, the linker resolves the relocation such that its result
added to the lr value set by the call instruction yields the address
of the TLS descriptor, i.e., it subtracts 4 from the addend.
Otherwise, it computes the relocation result such that adding it to
the pc value at the referenced instruction yields the address of the
TLS descriptor, i.e., it subtracts 8 from the addend.

The relocation is designed to work in Thumb mode as well.  To signal
to the linker that the TLS call sequence is in Thumb mode, the
compiler must add 1 to the addend of the tlsdesc relocation, so that
the linker makes the appropriate adjustments to the offsets it
computes.


Resolver Functions
==================

We describe below some of the alternatives that the dynamic loader may
use to fill in the resolver function portion of the TLS descriptor, as
well as the corresponding argument.  All names and data structures are
for illustrative purposes.  Implementation details may differ, and the
symbols are designed to be internal to the linker and thus not part of
the ABI.  Programs should not, and in fact, through the use of symbol
hiding mechanisms, should be unable to, refer to them directly.
__tls_get_addr(), however, remains available.
Except where otherwise noted, every resolver takes as its only
argument the address of the TLS descriptor in r0.

Static Specialization
---------------------

The resolver for static-TLS cases is quite simple: since the
descriptor pointed to by r0 already holds the desired value, all we
have to do is load it.

    _dl_tlsdesc_return:
            ldr     r0, [r0]
            bx      lr

Here again, processors without interworking support should, instead of
the bx lr instruction, use:

            mov     pc, lr

as their return mechanism.

Dynamic Specialization
----------------------

For the dynamic case, we suggest some data structures and describe
their expected use.  For that purpose, we provide an example piece of
code that cannot be used directly, because of the special calling
conventions for resolver functions (see section `Resolvers' Calling
Convention' below), but that details the behavior of the corresponding
dynamic TLS resolver function.

    struct tlsdesc
    {
      union
        {
          void *pointer;
          long value;
        } argument;
      ptrdiff_t (*resolver)(struct tlsdesc *);
    };

    typedef struct dl_tls_index
    {
      unsigned long int ti_module;
      unsigned long int ti_offset;
    } tls_index;

    struct tlsdesc_dynamic_arg
    {
      tls_index tlsinfo;
      size_t gen_count;
    };

    ptrdiff_t
    _dl_tlsdesc_dynamic(struct tlsdesc *tdp)
    {
      struct tlsdesc_dynamic_arg *td = tdp->argument.pointer;
      dtv_t *dtv = (dtv_t *)THREAD_DTV();

      /* Fast path: the DTV is recent enough and the module's TLS
         block has already been allocated for this thread.  */
      if (__builtin_expect (td->gen_count <= dtv[0].counter
                            && dtv[td->tlsinfo.ti_module].pointer.val
                               != TLS_DTV_UNALLOCATED,
                            1))
        return dtv[td->tlsinfo.ti_module].pointer.val
          + td->tlsinfo.ti_offset - __builtin_thread_pointer();

      /* Slow path: let __tls_get_addr update the DTV and allocate.  */
      return __tls_get_addr (&td->tlsinfo) - __builtin_thread_pointer();
    }

Weak Undefined Symbols
----------------------

For TLS symbols that turn out to be weak and undefined, the dynamic
loader can use yet another resolver function that returns the negated
value of $tp.
This preserves the expected semantics of undefined weak symbols, at
very little overhead even when compared with the incorrect result of
the relaxation of such sequences in the older model.

We thus recommend that references to TLS weak symbols that are not
locally defined be preserved as dynamic call sequences, since it is
possible that the symbol turns out to be defined at the time the
program runs, or that it becomes undefined at run time even if another
library used at link time provided a definition.

The exception to the suggestion above is when linking static
executables, where it can be determined for sure whether a symbol is
defined or not, since that cannot change at run time.  The linker may
then relax the code sequence such that it sets up r0 with the negated
value of TP, so that when TP is added to compute the actual address,
the result is 0.  The result is analogous for both inline and
out-of-line trampolines, apart from the extra nop instructions that
the linker may insert when the trampoline is inlined.  The effective
code sequence after the changes is as follows:

            ldr     r0, .Lt0
    .L1:    sub     r0, r0, $tp
    .Lt0:   .word   0x0

Lazy Specialization
-------------------

Lazy processing of relocations is possible with the new dynamic access
model.  A lazy resolver needs to obtain the relocation index and the
_GLOBAL_OFFSET_TABLE_ address in order to perform lazy relocations.
The relocation index can easily fit in the argument portion of the
descriptor, but loading the GOT address into a register prior to
calling the TLS resolver was deemed too much overhead, since it is
only necessary the first time a descriptor is used.

We have therefore introduced another per-module trampoline, which TLS
descriptors eligible for lazy relocation get as their resolver.  The
address of this trampoline is communicated to the dynamic loader by
means of a dynamic table entry: DT_TLSDESC_PLT.
This trampoline loads the address of the actual resolver from a GOT
entry, whose address is communicated to the dynamic loader with
another dynamic table entry: DT_TLSDESC_GOT.  The dynamic loader is
responsible for filling in the named GOT entry with the address of the
actual TLS lazy resolver, whose name, exemplified as
_dl_tlsdesc_lazy_resolver, can thus remain internal to the dynamic
loader.  Such a trampoline can be written as:

    _dl_tlsdesc_lazy_trampoline:
            stmfd   sp!, {r2}
            ldr     r1, .Lt0
            ldr     r2, [pc, #_GLOBAL_OFFSET_TABLE_ - . - 8 \
                         + _dl_tlsdesc_lazy_resolver(GOT)]
    .L1:    add     r1, pc, r1
            bx      r2
    .Lt0:   .word   _GLOBAL_OFFSET_TABLE_ - .L1 - 8

Note the unmatched saving of r2: we need scratch registers, and only
r1 is available (see section `Resolvers' Calling Convention' below),
so we specify that the lazy trampoline saves r2, and the actual
resolver is responsible for restoring it.

Also note the long address expression used to load r2.  It is an
alternate way to compute the relocated DT_TLSDESC_GOT-named address.
The value could be loaded from [r1, #_dl_tlsdesc_lazy_resolver(GOT)]
after r1 is set up, but this formulation avoids the dynamic dependency
on the GOT register being set up, possibly reducing the latency.  If
the immediate operand does not fit, the alternate r1-based formulation
can be used.

The numbers for the dynamic table entries were originally defined at
http://www.ic.unicamp.br/~oliva/writeups/TLS/RFC-TLSDESC-x86.txt

    #define DT_TLSDESC_PLT  0x6ffffef6  /* Location of PLT entry for
                                           TLS descriptor resolver
                                           calls.  */
    #define DT_TLSDESC_GOT  0x6ffffef7  /* Location of GOT entry used
                                           by TLS descriptor resolver
                                           PLT entry.  */

Upon call, _dl_tlsdesc_lazy_resolver may then resolve the relocation,
write both the function selected for this access model and its
argument back to the TLS descriptor, and finally restore r2 and jump
to the resolver address in the TLS descriptor.
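The dynamic loader's side of this arrangement can be sketched in C as follows. This is an illustration under stated assumptions, not the loader's actual code: dyn_entry is a simplified stand-in for a dynamic table entry, DT_NULL_TAG, find_dyn and install_lazy_resolver are hypothetical names, and the table values are taken to be offsets from the module's load base. Only the two DT_TLSDESC_* tag values come from the definitions above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DT_NULL_TAG     0            /* end-of-table marker (DT_NULL) */
#define DT_TLSDESC_PLT  0x6ffffef6
#define DT_TLSDESC_GOT  0x6ffffef7

/* Simplified stand-in for a dynamic table entry (hypothetical). */
struct dyn_entry {
    uintptr_t tag;
    uintptr_t val;
};

/* Return the entry with the given tag, or NULL if the table has none. */
const struct dyn_entry *find_dyn(const struct dyn_entry *dyn, uintptr_t tag)
{
    for (; dyn->tag != DT_NULL_TAG; dyn++)
        if (dyn->tag == tag)
            return dyn;
    return NULL;
}

/* Store the lazy resolver address in the GOT slot named by
   DT_TLSDESC_GOT and return the relocated address of the module's
   lazy trampoline (DT_TLSDESC_PLT).  Returns 0 when the module
   carries no such entries. */
uintptr_t install_lazy_resolver(const struct dyn_entry *dyn,
                                uintptr_t load_base, uintptr_t resolver)
{
    const struct dyn_entry *plt = find_dyn(dyn, DT_TLSDESC_PLT);
    const struct dyn_entry *got = find_dyn(dyn, DT_TLSDESC_GOT);

    if (plt == NULL || got == NULL)
        return 0;
    assert(resolver != 0);
    *(uintptr_t *)(load_base + got->val) = resolver;
    return load_base + plt->val;
}
```

The returned trampoline address is what a loader along these lines would store as the resolver of each lazily relocated descriptor, so the resolver's own name never has to be visible outside the loader.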
Since there is no way to atomically update the TLS descriptor, the
lazy resolver must take special care to access it correctly, to avoid
race conditions.  Before making any changes to the descriptor, or even
accessing its argument, the resolver should acquire a global dynamic
loader lock and check that the resolver address in the TLS descriptor
has not changed.  If it has, it should release the lock and proceed to
restoring r2 and jumping to the new resolver address.  Otherwise, it
should set the resolver address in the TLS descriptor to a hold
function that makes any other thread that attempts to use this TLS
descriptor wait until relocation is complete.

Storing the final value of the TLS descriptor also needs care: the
resolver field must be set to its final value after the argument gets
its final value, such that if any thread attempts to use the
descriptor before it gets its final value, it still goes to the hold
function.

Once the hold function is in place, it would be safe to release the
lock and use some internal condition-variable signaling mechanism to
wake up any threads blocked on it.  However, since the dynamic loader
most likely requires a lock to be held while accessing its internal
data structures to resolve the relocation, a simpler implementation
simply holds the lock until the relocation is completed and the TLS
descriptor fully updated.  The hold function, in this case, simply
acquires the lock, releases it and jumps to the resolver address in
the TLS descriptor.


Resolvers' Calling Convention
=============================

In order to reduce run-time penalties for relaxed sequences and for
the most common non-relaxed ones, the resolver functions need to save
all registers they modify, including those that are usually
call-clobbered.  This can be accomplished, for example, by means of an
ldm/stm pair.
Three registers are exceptions to this rule, namely r0, expected to
return the value we need; r1, expected (but not required) to hold the
throw-away resolver address; and the processor flags, which would be
too expensive to save and restore.


Relocations
===========

We add the following new relocation symbols:

    #define R_ARM_TLS_GOTDESC   90
    #define R_ARM_TLS_CALL      91
    #define R_ARM_TLS_DESCSEQ   92
    #define R_ARM_TLS_DESC      93

All of them are REL relocations, the first three being static ones.

R_ARM_TLS_GOTDESC is emitted by the assembler as (tlsdesc), getting as
addend an incomplete offset that allows the linker to find out which
instruction is expected to use the result of this relocation.  The
linker may subtract a constant amount (#-8 for ARM mode) in case of
relaxation to Initial Exec, or make an adjustment depending on the
call path for the dynamic models (#-4 for a call to the trampoline and
#-8 for an inline trampoline).  For the Local Exec model, the addend
is completely ignored.  In dynamic cases, this relocation is resolved
to the address of the GOT entry in which the TLS descriptor is stored.
For the static linking case of Weak Undefined symbols, the linker
simply resolves the relocation to 0x0.

R_ARM_TLS_CALL is emitted by the assembler as (tlscall).  It gets no
addend, and is resolved by the linker to a suitable immediate that
allows the bl instruction to jump to the address of the trampoline.
In case of a relaxation, the linker may:

 * turn the branch instruction into a ldr r0, [pc,r0] (IE relaxation)
 * turn the branch instruction into a nop (LE relaxation)
 * turn the branch instruction into a sub r0,r0,$tp (Weak Undefined
   Symbol static linking)

R_ARM_TLS_DESCSEQ is emitted by the assembler as .tlsdescseq.  It gets
no addend and is used to change instructions into more suitable ones
in case of a relaxation to a more efficient access model.
When encountering this relocation, the linker may:

 * turn the `add rA, pc, rB' instruction into a `mov rA, rB', if A
   and B are different, and a nop otherwise (LE relaxation)
 * turn the `add rA, pc, rB' instruction into `sub rA, rB, $tp' (Weak
   Undefined Symbol static linking relaxation)
 * turn the `ldr rA, [rB, #4]' and `blx rC' instructions into nops
   (LE relaxation and Weak Undefined Symbol relaxation for static
   linking)
 * turn the `ldr rA, [rB, #4]' instruction into `ldr rA, [rB]' (IE
   relaxation)
 * turn the `blx rA' instruction into `mov r0, rA' (IE relaxation)

R_ARM_TLS_DESC is emitted by the linker in response to
R_ARM_TLS_GOTDESC and/or R_ARM_TLS_CALL and gets no addend.  When lazy
relocations are not allowed, the linker fills both GOT entries with
zero, and it is the job of the loader to place the specialized
function's parameter in the first entry and the function itself in the
second.  When lazy relocations are allowed, the linker puts the
relocation index in the first GOT entry and fills the second entry
with zero.  The first entry is in fact the parameter of
_dl_tlsdesc_lazy_trampoline.  The local address of that function is
already provided by the module by way of the DT_TLSDESC_PLT entry in
the dynamic table, so the linker does not need to set up the second
entry.  The job of the dynamic loader simply consists of applying a
relative relocation to the address of the lazy trampoline and storing
it in the second TLS descriptor entry.


Revision History
================

0.2.2 2008-03-15: Fix typos.
0.2.1 2006-04-07: Revise relaxation rules for inlined TLS desc call
                  sequences.  Explain how far the register allocation
                  freedom goes.
0.2   2006-04-06: Avoid the use of ldrd/strd.  Global reformatting.
0.1.1 2006-03-29: First `published' specification.