The FR-V thread-local storage ABI

Alexandre Oliva
Aldy Hernandez

Version 1.0
2004-12-10

Introduction
============

This document describes the extensions to the FRV-FDPIC ABI for supporting thread-local storage (TLS). It is meant to be read as an extension to [2] and it follows the format and terminology used in [1].

Run-Time Handling of TLS
========================

Register GR29 is reserved as the thread context pointer. This requires the kernel to preserve the value of this register. GR29 cannot be used for any other purpose throughout the program, even if the program itself does not use thread-local storage.

The TLS data structures on FRV are specified according to variant I, with the following exceptions:

- The thread context pointer is biased: it points 2048 bytes past the beginning of the TCB, such that the pointer to the DTV may be loaded from the TCB with a single instruction, while (almost) doubling the range directly addressable with the 12-bit signed literal offset available in load and store instructions, which executables can take advantage of to access local variables in TLS;

- Pointers to modules' TLS areas are all biased by +2032 bytes such that, in the main executable, the thread pointer can be used directly as the biased pointer to its TLS area; and

- The contents of DTV (Dynamic Thread Vector) entries are unspecified; they may contain offsets from the thread pointer, instead of absolute addresses. It is also unspecified whether they are biased. Since the DTV cannot be directly accessed by ABI-specified mechanisms (the location of the pointer to it in the TCB is left unspecified), its internal format, and even its very existence, are to be regarded as internal implementation details, subject to change without notice.

The 16 bytes starting at gr29-2048 are reserved for use by the TLS implementation, so TLS data for the main executable starts at gr29-2032.
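
The bias arithmetic described above can be checked with a small model. The following Python sketch is illustrative only: the function names and the sample TCB address are assumptions made for this example and are not part of the ABI; only the constants 2048, 2032 and 16, and the signed 12-bit offset range, come from the text.

```python
TCB_BIAS = 2048        # gr29 points this many bytes past the start of the TCB
TLS_AREA_BIAS = 2032   # bias applied to pointers to a module's TLS area
TCB_RESERVED = 16      # bytes at gr29-2048 reserved for the TLS implementation


def thread_pointer(tcb_start):
    """Biased thread context pointer (the value kept in gr29)."""
    return tcb_start + TCB_BIAS


def exe_tls_start(gr29):
    """Address of the first byte of the main executable's TLS data."""
    return gr29 - TLS_AREA_BIAS


def directly_addressable_exe_tls_bytes():
    """Executable TLS bytes reachable from gr29 with a signed 12-bit offset."""
    lowest = -TLS_AREA_BIAS   # offsets below this fall in the reserved 16 bytes
    highest = 2047            # largest signed 12-bit offset
    return highest - lowest + 1


# 0x10000 is an arbitrary sample TCB address chosen for the example.
gr29 = thread_pointer(0x10000)
```

Under these assumptions, the main executable's TLS data begins 16 bytes after the start of the TCB, and 4080 bytes of it are reachable with a single load or store relative to gr29.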
gr29 must always be aligned to at least a 16-byte boundary. If the TLS section in the main executable requires additional alignment, it is the gr29-2032 address that will be arranged to satisfy the alignment requirements. gr29-2048 will therefore also be aligned to a 16-byte boundary.

The TLS implementation may reserve additional space below gr29-2048, for example, to hold the data structure that represents the thread that uses the thread control block. Should such data structures require alignment stricter than 16 bytes, gr29-2032 may end up getting more strictly aligned than hereby specified, such that the offset between gr29 and this data structure, whose size and alignment are known to the thread library, is constant.

calling conventions
------------------------------------

On most ports, the __tls_get_addr() function is defined as part of the ABI, and explicitly called to determine the address of thread-local variables. The function traditionally obtains from its arguments the following information: a module ID and an offset. It typically returns the address computed by adding the offset to the TLS block allocated for the module within the running thread.

Specific features of the FR-V architecture, such as register+register addressing modes, and of the FDPIC ABI, such as the use of function descriptors, enable significant optimizations for the most common cases in which this function would be used, if we introduce alternate entry points and specialized calling conventions.

Therefore, __tls_get_addr() is not defined as part of this ABI. Its presence is not required, and programs are not supposed to call it. Different, more efficient mechanisms are introduced to implement similar functionality.

Consider that in the general dynamic TLS access model, a function call is necessary to obtain the address of a thread-local variable.
However, in most cases, the variable happens to be in the portion of memory that is allocated at the time of creation of every thread, at a constant offset from the thread context pointer (static TLS). Even in cases that require dynamic allocation of the TLS for a module, if the variable is accessed often enough, we can avoid a significant fraction of the function call overhead, namely the need to spill all registers not preserved across function calls, by adopting custom calling conventions that preserve many more registers.

Another advantage of the custom calling conventions is that they almost remove the penalty for generating code for the general dynamic access model, where another access model that doesn't require calls would have been enough, even more so if the linker performs relaxations that actually remove the call.

The calling conventions and expected behavior are implemented using an alternate, unnamed pair of entry points, one used for the static TLS case, one used for the dynamic TLS case. The actual entry points are defined internally to the dynamic loader or, in the case of static executables, the first entry point is defined by the linker, and the second is never used. For clarity purposes, we are going to refer to these entry points as , with angle brackets being used to imply that it is not an actual symbol name.

The idea is that the two entry points are interchangeable as far as the compiler and the linker are concerned: when the linker can't tell for sure whether the referenced symbol is in static or dynamic TLS, it will be up to the dynamic loader to select the most appropriate entry point. For this reason, it is essential to define a common ABI for implementations.

inputs:

GR9: The idea is that the GR8,GR9 register pair is loaded from a TLS descriptor from the GOT.
GR8 is expected to be the selected entry point, but the value is not used by itself, so it's legitimate to not set it; GR9, on the other hand, holds additional information needed by the entry point. In the static TLS case, it's the TLS offset of the referenced variable. In the dynamic TLS case, it's a pointer to a data structure allocated by the dynamic loader holding information such as the GOT address to be used by the dynamic loader, the module ID of the TLS variable, its TLS module offset, and any other information that the implementation of the dynamic function might need.

GR15: The GOT pointer for the module containing the TLS descriptor from which the address was loaded. It can often be used if the instruction that calls the entry point is optimized into a load, but the function itself might rely on its value.

GR29: Biased pointer to the TCB for the current thread.

outputs:

GR9: Offset from GR29 to the address returned by __tls_get_addr(), or that would be returned by it should it have been called.

GR8: May be modified freely.

All other registers shall be preserved by functions that follow the ABI.

behavior:

The static entry point just returns, since the correct value was already loaded into GR9 by its caller.

The behavior of the dynamic entry point is described with the following pseudo-code:

    If the requested module has already been resolved for the current thread:
        Set GR9 to the DTV entry for the module, plus the offset for the variable;
        Return.
    Save registers;
    Load the arguments for __tls_get_addr() and call it;
    Set GR9 to the difference between the returned value and the thread context pointer;
    Restore registers (other than GR9 and GR8);
    Return.

Additional entry point implementations may be introduced, as long as they do not make assumptions on inputs not warranted by the ABI specification, and provide the outputs strictly as specified above.
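
The behavior described above can be modeled in a few lines. The Python sketch below is an assumption-laden illustration rather than the ABI's implementation: the `Thread` and `DynamicArg` classes, the DTV represented as a dictionary of offsets from gr29, and the `sample_tls_get_addr` allocator are all hypothetical, and register saving and restoring is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    gr29: int                                # biased thread context pointer
    dtv: dict = field(default_factory=dict)  # module ID -> block offset from gr29 (assumed layout)


@dataclass
class DynamicArg:
    module: int  # module ID of the TLS variable
    offset: int  # offset of the variable within the module's TLS block


def static_entry(gr9, thread):
    """Static TLS: the caller already loaded the TLS offset into GR9."""
    return gr9


def dynamic_entry(arg, thread, tls_get_addr):
    """Dynamic TLS: return the offset from gr29 to the variable."""
    block = thread.dtv.get(arg.module)
    if block is not None:  # module already resolved for this thread
        return block + arg.offset
    addr = tls_get_addr(thread, arg.module, arg.offset)  # slow path
    return addr - thread.gr29


calls = []  # records slow-path invocations


def sample_tls_get_addr(thread, module, offset):
    """Hypothetical allocator that places each module block at a fixed address."""
    calls.append(module)
    block_addr = 0x20000 + 0x100 * module
    thread.dtv[module] = block_addr - thread.gr29
    return block_addr + offset


thread = Thread(gr29=0x10800)
arg = DynamicArg(module=2, offset=8)
first = dynamic_entry(arg, thread, sample_tls_get_addr)   # takes the slow path
second = dynamic_entry(arg, thread, sample_tls_get_addr)  # served from the DTV
```

In this model both calls yield the same offset, and adding it to gr29 gives the address of the variable; the slow path runs only once per module and thread.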
The current specification is intended to make room for lazy resolution of TLS descriptor relocations, but this feature is not defined in the current specification.

TLS Access Models
-----------------

There are two main ways to access thread-local storage, the dynamic and the static model. All other models described here fall into these two categories. The FRV provides four models, all derived from the aforementioned two models: general dynamic, local dynamic, initial exec, local exec. Different models are used to provide as much performance as possible.

General Dynamic TLS Model
-------------------------

The general dynamic TLS model can be used everywhere. Compilers will generate code with this model by default and only use a more restrictive model when it is more efficient or when told explicitly to use another model. The generated code for this model does not assume that the module number or variable offset is known at link or compile time. The values for the module ID and the TLS block offset are determined by the dynamic linker at run-time, and passed to the entry point through a pointer in GR9. Upon return, GR9 holds the offset from the thread context pointer to the variable for the current thread. It is desirable to avoid this model whenever possible, as it is the slowest.

In the following code fragments, the code shown is determining the address of a thread-local variable x:

    extern __thread int x;
    &x;

General form
------------

Here, we call a (pseudo-)function to determine the offset for the variable. As before every call, the GR15 register must hold the GOT pointer for the current module; unlike other calls, GR15 must be set prior to the call instruction (i.e., not in the same VLIW pack). Also unlike other calls, GR15 and most other registers are preserved; only GR8 and GR9 can be modified by the callee without being preserved, and LR is modified by the call instruction itself.

    call #gettlsoff(x)

The call instruction may be relaxed into a load instruction, so its VLIW packing must enable such replacement.

The call instruction above requests the linker to generate the following sequence of instructions in the PLT, and to adjust the call such that this PLT entry is called:

    plt(gettlsoff(x)):
        sethi.p #gottlsdeschi(x), gr8
        setlo #gottlsdesclo(x), gr8
        ldd @(gr15, gr8), gr8
        jmpl @(gr8, gr0)

Shorter versions of the PLT entry can be used, if the offsets are known to fit in 16 bits:

    plt(gettlsoff(x)):
        setlos #gottlsdesclo(x), gr8
        ldd @(gr15, gr8), gr8
        jmpl @(gr8, gr0)

or even 12 bits:

    plt(gettlsoff(x)):
        lddi @(gr15, #gottlsdesc12(x)), gr8
        jmpl @(gr8, gr0)

Like PLT entries for regular function calls, the PLT entries above can also be inlined into the caller, resulting in code sequences such as:

    sethi.p #gottlsdeschi(x), gr8
    setlo #gottlsdesclo(x), gr8
    ldd #tlsdesc(x)@(gr15, gr8), gr8
    calll #gettlsoff(x)@(gr8, gr0)

The instructions can be freely scheduled, and they can use whatever registers they like, as long as GR9 holds the value loaded from the second word of the TLS descriptor when the first instruction in the callee is executed. The callee of such a sequence must preserve all registers other than GR8 and GR9; LR is modified by the call instruction itself.

If the PLT is inlined, another register holding the GOT pointer may be used instead of GR15 in the ldd or lddi instructions, but GR15 must be set to the GOT pointer value before or in parallel with the calll instruction. This is less stringent than the non-inlined case, which requires GR15 to be set before, not in parallel with, the call instruction.

#tlsdesc and #gettlsoff are annotations that enable the linker to optimize the code when the dynamic model is not necessary.

Upon return from any of the calls above, gr9+gr29 yields the address of variable x or sx. For example, the value of x can be obtained or modified using a load or store with the addressing mode @(gr29,gr9).

#gottlsdeschi and #gottlsdesclo denote the high and low 16 bits of the offset from the GOT pointer to the GOT entry containing a TLS descriptor; #gottlsdesc12 denotes the low 12 bits. When requested to create a gettlsoff PLT entry, or when resolving the relocations denoted by these expressions, the linker will automatically create such a GOT entry:

    <_GLOBAL_OFFSET_TABLE_+#gottlsdesc(x)>:
        .dword #(x)

Note that # is not well-formed assembly, it's just an arbitrary notation to convey the fact that we're going to have the corresponding relocations associated to the given 64-bit value.

Local Dynamic TLS Model
-----------------------

The local dynamic TLS model is an optimization of the general dynamic TLS model. The compiler can generate code following this model if it can recognize that the thread-local variable is defined in the same object it is referenced in. This includes thread-local variables with file scope or variables which are defined to be protected or hidden.

Since a thread-local variable is defined by the module ID and the offset in the TLS block of that module, in the case of variables which are known to be referenced and defined in the same object, the offsets are known at link-time, but the module ID isn't. Since the module ID is not known, it is necessary to use general dynamic call sequences to obtain the module biased base pointer, and then add the constant offsets known at link time to such base pointer.

The code generated for the general dynamic and the local dynamic models is so different that the optimizations have to be done by the compiler, not the linker.

The following code is used to obtain the biased base pointer for the module:

    call #gettlsoff(0)

This is equivalent in all regards to #gettlsoff(zero_offset), where zero_offset denotes a variable whose biased offset is zero.
The PLT entry that would be generated can be inlined:

    sethi.p #gottlsdeschi(0), gr14
    setlo #gottlsdesclo(0), gr14
    ldd #tlsdesc(0)@(gr15, gr14), gr8
    calll #gettlsoff(0)@(gr8, gr0)

The calls above end up calling the entry point, so, after they return, gr9+gr29 yields the biased base pointer for the module.

In the following code fragments, the code shown determines the address of two thread-local variables:

    static __thread int x1;
    static __thread int x2;
    &x1;
    &x2;

Compute the biased base pointer for the module in gr7:

    add gr9, gr29, gr7

and then use it as a base pointer to access other variables.

    sethi.p #tlsmoffhi(x1), gr8
    setlo #tlsmofflo(x1), gr8
    ;;; @(gr7, gr8) is x1
    ...
    sethi.p #tlsmoffhi(x2), gr10
    setlo #tlsmofflo(x2), gr10
    ;;; @(gr7, gr10) is x2

#tlsmoff represents the offset of the variable into the module.

Instead of computing TLS module offsets using such long instruction sequences, it is possible to perform the access, or compute the address, with a single instruction:

    ld @(gr7, #tlsmoff12(x1)), gr12
    ;;; gr12 now holds the value last stored in x1

    addi gr7, #tlsmoff12(x2), gr13
    ;;; gr13 now holds the thread-local address of x2

Since thread-local sections tend to be small (seldom more than tens of bytes), it is recommended that code using 12-bit offsets be emitted by the compiler by default, and that a command-line option be introduced to enable the use of larger TLS sections.

Initial Exec TLS Model
----------------------

The initial exec TLS model can be applied unconditionally when generating the executable itself (i.e. when compiling code without the options to emit code suitable for shared libraries: -fPIC or -fpic). This optimization is usable if the variables accessed are known to be in one of the modules available at program start, and if the programmer chooses to use the static access model.

The generated code will not call the entry point, which means that deferred allocation of memory for the TLS blocks is not possible in this model.
It is still possible however, to defer allocation for dynamically loaded modules.

With this optimization, for each variable there would be a run-time relocation for a GOT entry which instructs the dynamic linker to compute the offset from the TCB.

In the following code fragments, the code shown is determining the address of a thread-local variable x:

    extern __thread int x;
    &x;

Assuming GR15 holds the PIC pointer, the following instruction computes the offset from the thread context pointer to variable x:

    ldi @(gr15, #gottlsoff12(x)), gr9

Ideally, the sequence above should never be generated by the compiler, or generated only with -fpic, since it assumes GOT offsets fit into 12 bits. The longer form with sethi/setlo should be emitted for -fPIC. The linker may relax out-of-line `call #gettlsoff' instructions to the above, if it doesn't cause GOT overflows.

For -fPIC, the following sequence should be generated instead, once again assuming GR15 is the PIC pointer:

    sethi.p #gottlsoffhi(x), gr14
    setlo #gottlsofflo(x), gr14
    ld #tlsoff(x)@(gr15, gr14), gr9

When relaxing an inlined general dynamic call sequence, the ldd instruction to gr8 and gr9 is replaced with the ld instruction above, and the call is replaced with a nop.

The GOT entry used by the initial exec model is:

    _GLOBAL_OFFSET_TABLE_+#gottlsoff(x):
        .word #(x)

When linking an executable, PLT entries generated in response to call #gettlsoff instructions can use static TLS instead, referencing TLS offsets in the GOT, instead of TLS descriptors:

    plt(gettlsoff(sx)):
        sethi.p #gottlsoffhi(sx), gr8
        setlo #gottlsofflo(sx), gr8
        ld @(gr15, gr8), gr9
        ret

Shorter versions for 16- or 12-bit offsets from the GOT can be used for executables as well.

Local Exec TLS Model
--------------------

This model is an optimization for the local dynamic model. It can be used only for code in the executable itself, and only when the variables accessed are in the executable itself.
These restrictions mean that, in this model, the TLS block can be addressed relative to the thread pointer. It also means that we always use the first TLS block (the one for the executable), and the size of the other TLS blocks is irrelevant for address computations.

The following code sequences implement the fragment below:

    static __thread int x;
    &x;

After the two instructions below, gr10+gr29 represents the address of variable x:

    sethi.p #tlsmoffhi(x), gr10
    setlo #tlsmofflo(x), gr10

When relaxing from the Local Dynamic model, the code that calls the entry point can be simplified to:

    setlos #0, gr9

When linking an executable, and referencing a local symbol, PLT entries generated in response to call #gettlsoff instructions can use the local exec access model, computing the TLS offsets directly instead of using TLS descriptors. The result is computed into gr9, where the calling conventions expect the offset:

    plt(gettlsoff(sx)):
        sethi.p #tlsmoffhi(sx), gr9
        setlo #tlsmofflo(sx), gr9
        ret

Shorter versions for 16- or 12-bit offsets from the GOT can be used for executables as well.

Relocations
===========

Some relocations are available in 12, HI and LO variants. The 12 variant, to be used as the immediate offset to load or store instructions, resolves to the least significant 12 bits. The HI and LO variants resolve to the most and least significant 16 bits, respectively, and they're to be used as operands to sethi and setlo. Even though LO could be used as the operand to setlos, this is not recommended, since it silently truncates the value to 16 bits.

Some relocations can only be applied to a limited set of instructions, so as to enable certain optimizations.

The optimizations are described assuming fields will not overflow. If they do, the linker is advised to not perform the transformation, but, if it does, it must print an error message and offer a command-line option to disable the transformation.
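
The 12, HI and LO variants described above amount to simple bit-field extraction. The Python sketch below is an illustration under stated assumptions: the helper names are made up for this example, and `setlos` is modeled as loading a sign-extended 16-bit immediate, which is what makes the truncation of a LO value silent.

```python
def reloc_hi(value):
    """Most significant 16 bits of a 32-bit value (operand to sethi)."""
    return (value >> 16) & 0xFFFF


def reloc_lo(value):
    """Least significant 16 bits (operand to setlo)."""
    return value & 0xFFFF


def reloc_12(value):
    """Least significant 12 bits (immediate offset of a load or store)."""
    return value & 0xFFF


def fits_signed(value, bits):
    """True if value is representable as a signed field of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def sign_extend(field, bits):
    """Interpret an unsigned bit field of the given width as a signed value."""
    sign = 1 << (bits - 1)
    return (field ^ sign) - sign


def sethi_setlo(value):
    """Register contents after sethi #hi, setlo #lo, as a signed 32-bit value."""
    return sign_extend((reloc_hi(value) << 16) | reloc_lo(value), 32)


def setlos(value):
    """Register contents after setlos with the LO variant of value."""
    return sign_extend(reloc_lo(value), 16)
```

For a value such as 0x12345, `sethi_setlo` reproduces the full value while `setlos` yields 0x2345, so a linker following the overflow rule above would first check `fits_signed(value, 16)` before substituting a `setlos`.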

Given a set of optimizations that can be applied to relocations referencing a symbol+offset, the linker must perform either all or none of them, i.e., the semantics of the sequences depicted in the Linker Optimizations section below must not be broken by transformations in some instructions but not in others that belong to the same logical group of instructions.

The following relocations are available in relocatable objects, but never as dynamic relocations:

25 R_FRV_GETTLSOFF

This relocation forces a PLT entry to be generated, that references a TLS descriptor for the symbol. The TLS descriptor is implicitly generated in the GOT. The relocation, applied to a call instruction, causes it to call this PLT entry. If the symbol number is zero, it's a reference to the TLS base address for the module, including the bias.

This relocation must be associated with a call instruction.

If an executable is being linked, and the referenced symbol binds locally, the call instruction may be replaced with:

    setlos #tlsmoff(symbol+addend), gr9

If an executable is being linked, but the referenced symbol does not bind locally (or, optionally, if it does but the substitution above would overflow), the call instruction may be replaced with:

    ldi @(gr15, #gottlsoff12(symbol+addend)), gr9

A GOT entry with the TLS offset is implicitly generated in this case.

27 R_FRV_GOTTLSDESC12
28 R_FRV_GOTTLSDESCHI
29 R_FRV_GOTTLSDESCLO

These relocations resolve to the GOT offset for a TLS descriptor of the symbol. The TLS descriptor is implicitly generated in the GOT. If the symbol number is zero, it's a reference to the TLS descriptor for the module, including the bias.

The GOTTLSDESC12 relocation must be associated with an `lddi' instruction. GOTTLSDESCHI must be associated with a `sethi' instruction. GOTTLSDESCLO must be associated with one of `setlo' or `setlos'.

When linking an executable, and the referenced symbol binds locally, the `sethi', `setlo' and `setlos' instructions MAY be replaced with `nop's. An `lddi' instruction such as:

    lddi @(grB, #gottlsdesc12(symbol+offset)), grC

MAY be replaced with:

    setlos #tlsmofflo(symbol+offset), gr<C+1>

When linking an executable, and one of these relocations references a symbol that does not bind locally, the immediate operand of `sethi', `setlo' and `setlos' instructions must be replaced with the value for a corresponding GOTTLSOFF relocation. An `lddi' instruction may be replaced with `ldi', the immediate offset replaced with the value for the corresponding GOTTLSOFF relocation, and the destination register replaced with the odd-numbered register following the destination register of `lddi'. A GOT entry holding the TLS offset for the symbol is implicitly generated.

30 R_FRV_TLSMOFF12
31 R_FRV_TLSMOFFHI
32 R_FRV_TLSMOFFLO

These relocations resolve to the offset from the biased base address for the module to the address of the thread-local symbol. The symbol must bind locally in the module.

33 R_FRV_GOTTLSOFF12
34 R_FRV_GOTTLSOFFHI
35 R_FRV_GOTTLSOFFLO

These relocations resolve to the GOT offset for an entry holding the offset from the thread pointer to the thread-local symbol. They cause a TLSOFF entry to be created in the GOT.

The GOTTLSOFF12 relocation must be associated with an `ldi' instruction. GOTTLSOFFHI must be associated with a `sethi' instruction. GOTTLSOFFLO must be associated with one of `setlo' or `setlos'.

When linking an executable, and the referenced symbol binds locally, `sethi', `setlo' and `setlos' instructions may be replaced with `nop's. An `ldi' instruction such as:

    ldi @(grB, #gottlsoff12(symbol+offset)), grC

MAY be replaced with:

    setlos #tlsmofflo(symbol+offset), grC

The following relocations are do-nothing annotations, used for linker relaxations.

37 R_FRV_TLSDESC_RELAX

A relocation that must be attached to ldd instructions that load the entry point's function descriptor from a TLS descriptor, except for those that get a GOTTLSDESC12 relocation. It indicates that, in @(grX, grY), grX holds the GOT address, and grY holds the offset from grX to the TLS descriptor for symbol+addend.

When linking an executable, and the referenced symbol binds locally, the `ldd' instruction may be replaced from:

    ldd #tlsdesc(symbol+addend)@(grX, grY), grC

to

    setlos #tlsmofflo(symbol+addend), gr<C+1>

When linking an executable, and the referenced symbol does not bind locally, the `ldd' instruction may be replaced with `ld', with the destination register replaced with the odd-numbered register following the destination register of `ldd'.

When the referenced TLS descriptor (or offset, after the relaxation above) is within the 12-bit-addressable range in the GOT, the load instruction may be turned into an immediate load instruction.

38 R_FRV_GETTLSOFF_RELAX

A relocation that must be attached to calll instructions that call the entry point for symbol+addend.

When linking an executable, the `calll' instruction may be replaced with a `nop'.

39 R_FRV_TLSOFF_RELAX

A relocation that must be attached to `ld' instructions that load the TLS offset for a symbol+addend from the GOT. It indicates that, in @(grX, grY), grX holds the GOT address, and grY holds the offset from it to the GOT entry containing the TLS offset for symbol+addend.

When linking an executable, if the referenced symbol binds locally, the `ld' instruction can be replaced with:

    setlos #tlsmofflo(symbol+addend), grC

When the referenced TLS offset is within the 12-bit-addressable range in the GOT, the load instruction may be turned into an immediate load instruction.

40 R_FRV_TLSMOFF

This relocation resolves to the offset from the biased base address for the module to the address of the thread-local symbol. The symbol must be defined locally in the module.
This relocation is typically only used in debug information. The assembler generates it in response to assembly code such as:

    .picptr tlsmoff(symbol+addend)

The following relocations are only available as dynamic relocations:

26 R_FRV_TLSDESC_VALUE

A 64-bit relocation that resolves to an entry point followed by the argument to be passed to it in GR9 such that it computes the TLS offset for the symbol+offset referenced in the relocation. The in-place addend is stored in the second word. The contents of the first word prior to relocation are reserved for future extensions.

36 R_FRV_TLSOFF

A 32-bit offset from the thread pointer to a thread-local symbol.

Linker Optimizations
====================

We summarize below the relaxations the linker may apply to TLS relocations.

The relaxable instructions can be scheduled freely, and they can use registers other than those used in the sample code snippets above. The only requirement is that, on call sites (call #gettlsoff or calll #gettlsoff), the custom calling conventions of the entry point are satisfied.

We have to be careful in the transformations to make sure that we don't assume too much about what's in each register, since a compiler may have chosen different registers and added copy instructions.

In the code snippets below, we use symbolic register names such as grA, grB and grC. If the compiler may have introduced copies between instructions, we add ' to the register name (e.g., grA', grA''), to indicate it's the same register as the one mentioned before, or a copy thereof.

The packing bits are only shown below to illustrate the expected common use. The linker should retain them where they appear, and the original code must be emitted in such a way that the substitutions below don't generate illegal packing combinations.

In some cases, instructions become nops.
With significant additional effort, we could get sufficient information to the linker to enable it to actually eliminate such instructions from the instruction stream in some cases that don't require them due to VLIW packing. However, due to the increased demands from the assembler in retaining symbolic information in object files to enable such code-length changes, and since it is possible to obtain the improved sequences by providing the compiler with additional symbol-locality information, we recommend simple substitution of instructions for nops.

General Dynamic to Initial Exec
-------------------------------

Instead of having to emit a full TLS descriptor for the variable in the GOT, we only need a TLS offset GOT entry. In the first case, we may be unable to perform the optimization if #gottlsoff12 overflows.

From:
    call #gettlsoff(x)
To:
    ldi @(gr15, #gottlsoff12(x)), gr9

From:
    sethi.p #gottlsdeschi(x), grA
    setlo #gottlsdesclo(x), grA'
    ldd #tlsdesc(x)@(grB, grA''), grC
    calll #gettlsoff(x)@(grC', gr0)
To:
    sethi.p #gottlsoffhi(x), grA
    setlo #gottlsofflo(x), grA'
    ld #tlsoff(x)@(grB, grA''), gr<C+1>
    nop

From:
    setlos #gottlsdesclo(x), grA
    ldd #tlsdesc(x)@(grB, grA'), grC
    calll #gettlsoff(x)@(grC', gr0)
To:
    setlos #gottlsofflo(x), grA
    ld #tlsoff(x)@(grB, grA'), gr<C+1>
    nop

From:
    lddi.p @(grB, #gottlsdesc12(x)), grC
    calll #gettlsoff(x)@(grC', gr0)
To:
    ldi.p @(grB, #gottlsoff12(x)), gr<C+1>
    nop

One might be tempted to replace the nops with `mov gr<C+1>, gr9', but at the point of the call, the function descriptor that would have been loaded to grC/gr<C+1> must have been moved to gr8/gr9, even if gr8 is not used in the address for the call instruction, because gr8 and gr9 are defined as part of the custom calling conventions of the entry point.

General or Local Dynamic To Local Exec
--------------------------------------

Here we manage to get rid of all dynamic relocations in (almost) all cases, by taking advantage of the fact that thread-local symbols of the executable are at fixed offsets from gr29.
In the Local Dynamic case, instead of a symbol x, we'll have a constant N (expected to be zero). This constant is the TLS module offset itself, so #tlsmofflo(N) becomes #lo(N) and #tlsmoffhi(N) becomes #hi(N).

From:
    call #gettlsoff(x)
To:
    setlos #tlsmofflo(x), gr9

or, if the TLS module offset of x exceeds a 16-bit signed value, before giving up, we may still try the following substitution, which still requires one dynamic relocation, but saves a TLS PLT entry and most of the TLS descriptor space in the GOT.

    ldi @(gr15, #gottlsoff12(x)), gr9

Alternatively, we can optimize the generated PLT entry such that it computes gr9 using a longer sequence, or one that loads a TLSOFF from the GOT.

From:
    sethi.p #gottlsdeschi(x), grA
    setlo #gottlsdesclo(x), grA'
    ldd #tlsdesc(x)@(grB, grA''), grC
    calll #gettlsoff(x)@(grC', gr0)
To:
    nop.p
    nop
    setlos #tlsmofflo(x), gr<C+1>
    nop

From:
    setlos #gottlsdesclo(x), grA
    ldd #tlsdesc(x)@(grB, grA'), grC
    calll #gettlsoff(x)@(grC', gr0)
To:
    nop
    setlos #tlsmofflo(x), gr<C+1>
    nop

From:
    lddi.p @(grB, #gottlsdesc12(x)), grC
    calll #gettlsoff(x)@(grC', gr0)
To:
    setlos.p #tlsmofflo(x), gr<C+1>
    nop

If tlsmoff(x) exceeds a signed 16-bit value, instead of:

    setlos #tlsmofflo(x), gr<C+1>
    nop ;; that replaced the calll instruction

we emit, in all cases but the first:

    sethi #tlsmoffhi(x), gr<C+1>
    setlo #tlsmofflo(x), gr9

Initial Exec To Local Exec
--------------------------

Here we attempt to get rid of the TLS offset GOT entry, so as to eliminate the need for a dynamic relocation, but we can't always do it. If the symbol TLS offset fits in 16 bits, we can always do it. Otherwise, we need global information to decide how to handle each individual relaxable relocation, so the recommendation is that such cases not be optimized.
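
The rule stated above reduces to a range check on the TLS module offset. The Python sketch below shows one way a linker might express it for the single-instruction `ldi` form; the function names and the string representation of instructions are assumptions made for illustration, not a description of any actual linker.

```python
def fits_signed16(value):
    """True if value is representable as a signed 16-bit immediate."""
    return -0x8000 <= value <= 0x7FFF


def relax_initial_exec_ldi(symbol, tlsmoff, base="grB", dest="grC"):
    """Relax `ldi @(base, #gottlsoff12(symbol)), dest` to local exec if possible.

    The original instruction is returned unchanged when the TLS module
    offset does not fit in 16 bits, following the recommendation that such
    cases be left unoptimized.
    """
    if fits_signed16(tlsmoff):
        return f"setlos #tlsmofflo({symbol}), {dest}"
    return f"ldi @({base}, #gottlsoff12({symbol})), {dest}"
```

With a small offset the GOT load becomes a `setlos` and the GOT entry is no longer needed; with a large one the load, and therefore the dynamic relocation, stays in place.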

TLS offset of symbol x fits in 16 bits (signed):

From:
    sethi.p #gottlsoffhi(x), grA
    setlo #gottlsofflo(x), grA'
    ld #tlsoff(x)@(grB, grA''), grC
To:
    nop.p
    nop
    setlos #tlsmofflo(x), grC

From:
    setlos #gottlsofflo(x), grA
    ld #tlsoff(x)@(grB, grA'), grC
To:
    nop
    setlos #tlsmofflo(x), grC

From:
    ldi @(grB, #gottlsoff12(x)), grC
To:
    setlos #tlsmofflo(x), grC

If the TLS module offset of x exceeds 16 bits, we can't optimize the last case. As for the other cases, we might apply the following transformations. However, since the cases below require global analysis on the input operands of the ld instruction, we recommend that they not be applied. Since TLS segments are generally very small (tens of bytes, far less than 64 kilobytes), the absence of such a transformation should not be a major problem.

The transformations below are ones that can only be performed when disambiguation is possible. One must bear in mind, however, that it is possible for a single ld #tlsoff instruction to take index register inputs from both sethi/setlo and setlos sets, with copying and control flow making it an undecidable problem in the general case.

We could have disambiguated at the assembly level, using new annotations such as #gottlsofflos, #tlsofflos and #tlsoffhilo in setlos and ld, generating different relocations that would enable the linker to immediately tell which case is in use. However, this case would only be useful for TLS objects defined in the main executable that didn't fit in the initial 65520 bytes of the TLS segment, and that were not defined in the same translation unit that references them, or that were not declared as binding locally (e.g., of hidden or protected visibility) in other translation units that reference them. We therefore dismissed it as a non-issue, and decided to leave such cases unoptimized.

From: (assuming tlsoffhilo in addition to, or instead of tlsoff)
    sethi.p #gottlsoffhi(x), grA
    setlo #gottlsofflo(x), grA'
    ld #tlsoffhilo(x)@(grB, grA''), grC
To:
    sethi.p #tlsmoffhi(x), grA
    setlo #tlsmofflo(x), grA'
    mov grA'', grC

or, if grA'' and grC were required to be the same register, you could use:

    nop.p
    sethi #tlsmoffhi(x), grA'
    setlo #tlsmofflo(x), grA'' ;; grA'' and grC must be the same

From: (assuming gottlsofflos and tlsofflos instead of gottlsofflo and tlsoff, and that grA' and grC are required to be the same register)
    setlos #gottlsofflos(x), grA
    ld #tlsofflos(x)@(grB, grA'), grC
To:
    sethi #tlsmoffhi(x), grA
    setlo #tlsmofflo(x), grA' ;; grA' and grC must be the same

Envisioned Extensions
=====================

The design was created to enable lazy resolution of R_FRV_TLSDESC_VALUE relocations, but this optimization is so far unspecified and not implemented. This optimization could significantly speed up the start up of programs that use dynamic libraries that rely on many TLS variables, since the cost of performing their relocations would be avoided for variables that are not actually referenced, and delayed to runtime otherwise.

Although this optimization would require an ABI extension to specify the expected behavior for an R_FRV_TLSDESC_VALUE in a lazy relocation section, since this relocation is not specified as a lazy relocation in the current specification, it is expected that such a change could be implemented in a fully backward-compatible way.

An optimization to the Dynamic Thread Vector management is also envisioned. Under the current specification, it is possible to arrange for the DTV to not contain entries for static TLS entries, since they are never going to be used. This also enables the dynamic thread vector to be allocated on demand, instead of at thread start-up, which would enable programs that do not use dynamic TLS (typically, programs that do not dlopen() libraries that have TLS) to run without ever allocating dynamic thread vectors.
This could significantly speed up thread creation time. Since the internal format of the DTV, and even its very existence, are unspecified in the ABI, such optimizations can be implemented in a fully backward-compatible way.

Revision History
================

2007-02-17 (1.0): Added copyright notice and license.

2004-12-10 (0.22): Added 32-bit R_FRV_TLSMOFF relocation.

References
==========

[1] "ELF Handling For Thread-Local Storage" version 0.20, Ulrich Drepper, Red Hat, 2003.

[2] "The FR-V FDPIC ABI" version 1.0, Red Hat, 2004.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. http://creativecommons.org/licenses/by-sa/3.0/

2011-05-17 Alexandre Oliva

    * Relicensed from OPL1.0 with further restrictions to CC BY-SA.