Thread-Local Storage Descriptors for the ARM platform
		     Revision 0.2.2 - 2008-03-15
		  An updated version is available at
     http://www.codesourcery.com/publications/RFC-TLSDESC-ARM.txt

	    Glauber de Oliveira Costa <glommer@gmail.com>
     Alexandre Oliva <oliva@lsd.ic.unicamp.br,aoliva@redhat.com>
				   
       With the special help of 3 trained soccer player monkeys
				   

Rationale and TLS Descriptors
=============================

Accessing TLS variables in dlopened modules usually happens by means
of a call to __tls_get_addr(), which is, predictably enough,
responsible for getting the real address of the variable, given its
module index and offset.

Currently, when either the Local Dynamic or the General Dynamic model
of TLS access is chosen, a call to __tls_get_addr() is involved in
resolving the thread-local address.  However, the dynamic loader has
enough information to detect whether the variable being accessed lives
in the static TLS block (initially loaded), or can only be accessed by
way of the Dynamic Thread Vector (DTV) (dlopen).  This knowledge would
enable it to select specialized functions that may be able to access
the variable in a more efficient way, given a suitable mechanism for
it to record such selections.

Such a mechanism is built by storing the specialized function,
together with its argument, in two consecutive GOT entries: the
argument in the first one, and the specialized function in the second
one.  This data structure laid out in the GOT is given the name Thread
Local Storage Descriptor or, simply put, TLS Descriptor.

The linker may also relax the code sequence to Local or Initial Exec
models.


Proposed Dynamic Access
=======================

Currently, the dynamic access model is as follows:

	ldr	r0, .Lt0
.L1:	add	r0, pc, r0
	bl	__tls_get_addr(PLT)

.Lt0:
	.word	foo(tlsgd) + (. - .L1 - 8)

At the end of the call sequence, the r0 register holds the address of
the desired variable, and its contents can be accessed by just issuing
a simple load instruction like:

	ldr	r0, [r0]

In our proposal, code for dynamic access changes slightly, turning out
to be:

	ldr	r0, .Lt0
.L1:	bl	variable(tlscall)

.Lt0:
	.word	variable(tlsdesc) + (. - .L1)

One major difference from the current dynamic models is that, at the
end of the code sequence, the r0 register no longer holds the address
of the variable, but rather its tp-relative offset.  This way, one
must issue an

	ldr	r0, [$tp, r0]

in order to get its contents.
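
The difference between the two results can be modelled on a host
machine.  The following is only a sketch under stated assumptions:
struct fake_tls_block stands in for one thread's TLS area, its address
plays the role of $tp, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one thread's TLS area; the thread pointer
   ($tp in the text) is modelled as the address of this block.  */
struct fake_tls_block
{
  int unrelated;
  int foo;			/* the "TLS variable" being accessed */
};

/* Current model: the call sequence yields an absolute address.  */
void *
current_model_result (struct fake_tls_block *tp)
{
  return &tp->foo;
}

/* Proposed model: the call sequence yields a tp-relative offset.  */
ptrdiff_t
proposed_model_result (struct fake_tls_block *tp)
{
  return (char *) &tp->foo - (char *) tp;
}

/* Equivalent of `ldr r0, [$tp, r0]': add the offset back to tp.  */
int
load_via_tp (struct fake_tls_block *tp, ptrdiff_t off)
{
  assert (off >= 0 && (size_t) off < sizeof *tp);
  return *(int *) ((char *) tp + off);
}
```

Adding the offset returned by proposed_model_result() to the modelled
thread pointer gives the same address that current_model_result()
returns directly.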

The tlsdesc relocation in .Lt0 gives the pc-relative address of the
TLS descriptor representing the thread-local variable we're interested
in.  The addend of the relocation is incomplete in order to properly
allow linker relaxations (see 'addend adjustments' below for a
rationale) and must be adjusted at link time.

In the case of the Local Dynamic access model, to avoid the definition
of new relocations, the linker defines, for all modules that have a
TLS section, a hidden per-module symbol called _TLS_MODULE_BASE_ that
denotes the beginning of the TLS section for that module.

In both Local and General Dynamic cases, in the absence of
relaxations, tlscall is resolved to a call to a trampoline provided by
the linker.

Our proposal for the trampoline is:

__tls_trampoline:
	add	r0, lr, r0
	ldr	r1, [r0, #4]	// load the resolver address
	bx	r1		// jumps to it, passing it r0

Also, note that the last branch defines a tail-call.  This is expected
to make things simpler, as the resolver function can then rely on the
link register to hold the return address without needing to know where
the branch came from.  The resolver function may then return by
issuing a:

	bx	lr

on architecture versions that support both ARM and Thumb; otherwise:

	mov	pc, lr
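
What the trampoline does can be paraphrased in C.  This is a host-side
model, not the trampoline itself: struct tlsdesc_model, the function
names and the use of plain integers for lr and r0 are assumptions made
for illustration, and the resolver offset is whatever the host
compiler lays out rather than the #4 used on ARM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tlsdesc_model;
typedef ptrdiff_t resolver_fn (struct tlsdesc_model *);

/* Two consecutive words, argument first and resolver second, as in
   the GOT layout described in the text.  */
struct tlsdesc_model
{
  intptr_t argument;
  resolver_fn *resolver;
};

/* Model of __tls_trampoline.  LR is the return address set by the bl,
   R0 the lr-relative offset loaded from the constant pool.  */
ptrdiff_t
tls_trampoline_model (uintptr_t lr, intptr_t r0)
{
  /* add r0, lr, r0: recover the descriptor address.  */
  struct tlsdesc_model *desc
    = (struct tlsdesc_model *) (lr + (uintptr_t) r0);

  assert (desc->resolver != NULL);
  /* ldr r1, [r0, #4]; bx r1: tail-call the resolver with r0 = desc.  */
  return desc->resolver (desc);
}

/* Hypothetical resolver that simply returns the stored argument.  */
ptrdiff_t
return_argument (struct tlsdesc_model *desc)
{
  return desc->argument;
}
```

Because the resolver is entered by a tail call, its return value goes
straight back to the code that issued the bl, which is why the model
simply returns the result of the indirect call.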


Relaxations
===========

The result of relaxing our proposed dynamic sequence to Initial Exec is:

	ldr	r0, .Lt0
.L1:	ldr	r0, [pc, r0]

.Lt0:
	.word	foo(gottpoff) + (. - .L1 - 8)

The branch and link instruction turns into a load.  The addend of the
relocation must be adjusted by the linker so as to provide the correct
offset for the load relative to the instruction at .L1.  The resulting
sequence is pretty much what the current Initial Exec model
specifies.

And by relaxing to Local Exec, we get:

	ldr	r0, .Lt0
	nop

.Lt0:
	.word	foo(tpoff)

Besides the nop instruction, executed instead of the branch and link
one, the model is exactly what the current Local Exec model states.
When relaxing to this access model, the addend is completely ignored,
as the tpoff relocation does not resolve to a pc-relative address.

Since our proposed dynamic access models return a TP offset, rather
than an absolute address, the relaxation to the above sequences is
much simpler, since we do not need to fit an additional instruction
that adds TP to the short sequences above.


Inlining the Trampoline
=======================

To provide the ability of inlining the trampoline, for those who would
choose performance despite the increase of code size, the compiler
should be able to generate an instruction sequence that does the same
job as the trampoline would have otherwise done.  Such an instruction
sequence may be:

	ldr	rt, .Lt1
.L1:	add	rx, pc, rt
	ldr	ry, [rx, #4]
	[mov	r0, rx]
	blx	ry
.Lt1:
	.word	variable(tlsdesc) + (. - .L1)

Note that the addend is also incomplete here.

In order to keep the ability to relax the code sequence, the
instructions must be annotated, as follows:

	ldr	rt, .Lt1
.tlsdescseq variable
.L1:	add	rx, pc, rt
.tlsdescseq variable
	ldr	ry, [rx, #4]
	[mov	r0, rx]
.tlsdescseq variable
	blx	ry
.Lt1:
	.word	variable(tlsdesc) + (. - .L1)

Note that we do not force the use of specific registers other than r0
for the argument to the resolver (see resolver functions below),
granting the compiler the ability to choose the best possible register
allocation.  There is no requirement that the instructions be issued
in this particular sequence either, or that no other instructions be
interspersed, or even that the values not be reused when it makes
sense.  It is even permitted for different registers to be used where
the specification above implies a single register to be used, if the
value is copied from one to the other.

Shown below is the code sequence after the relaxation for the Initial
Exec model:

	ldr	rt, .Lt1
.L1:	add	rx, pc, rt
	ldr	ry, [rx]
	[mov	r0, rx]
	mov	r0, ry
.Lt1:
	.word	variable(gottpoff) + (. - .L1 - 8)

In the Local Exec model, there is no need to either add back the pc or
load the variable value from any place, so the sequence turns into:

	ldr	rt, .Lt1
.L1:	mov	rx, rt
	nop
	[mov	r0, rx]
	nop

.Lt1:
	.word	variable(tpoff)

Also, the mov instruction can be turned into a nop in case the source
and destination registers happen to be the same.

In both models, the last branch instruction is replaced, as there is
no need to issue any function call: it becomes a mov into r0 in the
Initial Exec sequence, and a nop in the Local Exec sequence.


Addend Adjustments
==================

Depending on whether the trampoline is inlined or not, we use
different methods to compute the absolute address of a TLS descriptor.

The out-of-line trampoline adds lr to the relative address it is
passed in r0, formerly loaded from the tlsdesc constant pool entry,
where lr contains an address that is one instruction past the bl
instruction annotated with the tlscall relocation, whereas the inline
trampoline adds pc to the relative address loaded from the tlsdesc
constant pool entry, where pc contains an address that is two
instructions past the address of the instruction that refers to it.

The result of the tlsdesc relocation must be adjusted to work in both
cases, even when they happen to be relaxed, which makes matters more
difficult as the offsets that have to be different before relaxation
need to become the same after relaxation.

The solution that avoids the need for distinct relocation types for
inline and out-of-line trampolines is to provide the linker with
enough information for it to make the correct decision.  We thus emit
the relocation with an addend that provides the relative location of
the instruction that is going to use the result of the relocation.

If it is a call instruction, presumed to be annotated with a tlscall
relocation, the linker resolves the relocation such that its result
added to the lr value set by the call instruction yields the address
of the TLS descriptor, i.e., it subtracts 4 from the addend.

Otherwise, it computes the relocation result in such a way that adding
its result with the pc value at the referenced instruction yields the
address of the TLS descriptor, i.e., it subtracts 8 from the addend.
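
The two ARM-mode adjustments can be checked with a little arithmetic.
The sketch below is one reading of the rules above, not linker code:
the function and enumerator names are hypothetical, addresses are
plain 32-bit integers, and Thumb mode is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* How the instruction that consumes the relocation result reaches the
   TLS descriptor (hypothetical names, ARM mode only).  */
enum tlsdesc_user
{
  USER_TLSCALL_BL,	/* bl annotated with tlscall; trampoline adds lr */
  USER_INLINE_ADD	/* inline `add rx, pc, rt'; adds pc */
};

/* DESC: address of the TLS descriptor in the GOT.
   PLACE: address of the .word carrying the tlsdesc relocation.
   ADDEND: the incomplete addend, i.e. PLACE minus the address of the
   instruction that uses the result.
   Returns the 32-bit value stored at PLACE.  */
uint32_t
tlsdesc_reloc_value (uint32_t desc, uint32_t place, uint32_t addend,
		     enum tlsdesc_user user)
{
  uint32_t adjust = (user == USER_TLSCALL_BL) ? 4 : 8;

  /* A Thumb sequence would be flagged by adding 1 to the addend; this
     sketch only handles ARM mode.  */
  assert ((addend & 3) == 0);
  return desc - place + addend - adjust;
}
```

For a bl at 0x8000 whose constant pool word is at 0x8010 (addend 0x10)
and a descriptor at 0x10400, the stored value is 0x83fc; adding lr
(0x8004) gives 0x10400.  With an inline add at 0x8000 the stored value
is 0x83f8, and adding pc (0x8008) gives the same descriptor address.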

The relocation is designed to work in Thumb mode as well.  To signal
the linker that the TLS call sequence is in Thumb mode, the compiler
must add 1 to the addend of the tlsdesc relocation, such that the
linker makes the appropriate adjustments to the offsets it computes.

Resolver Functions
==================

We describe below some of the alternatives that the dynamic loader may
use to fill in the resolver function portion of the TLS descriptor, as
well as the corresponding argument.

Also, all names and data structures are for illustrative purposes.
Implementation details may differ, and symbols are designed to be
internal to the linker and thus, not part of the ABI.  Programs should
not, and in fact, through the use of symbol hiding mechanisms, should
be unable to, directly refer to them.  __tls_get_addr(), however,
remains available.

Except where otherwise noted, every resolver takes as its only
argument the address of the TLS descriptor in r0.

Static Specialization
---------------------

The resolver for static-TLS cases is quite simple: since the
descriptor pointed to by r0 already holds the desired value, all we
have to do is load it.

_dl_tlsdesc_return:
	ldr	r0, [r0]
	bx	lr

Here again, processors without interworking support should, instead of
the bx lr instruction, use:

	mov	pc, lr

as their returning mechanism.

Dynamic Specialization
----------------------

For the dynamic case, we suggest some data structures and describe
their expected use.  For that purpose, we provide an example piece of
code that cannot be used directly, because of special calling
conventions for resolver functions (see section Resolvers' Calling
Convention below), but that details the behavior of the corresponding
dynamic TLS resolver function.

struct tlsdesc
{
  union
  {
    void *pointer;
    long value;
  } argument;
  ptrdiff_t (*resolver)(struct tlsdesc *);
};

typedef struct dl_tls_index
{
  unsigned long int ti_module;
  unsigned long int ti_offset;
} tls_index;

struct tlsdesc_dynamic_arg
{
  tls_index tlsinfo;
  size_t gen_count;
};

ptrdiff_t
_dl_tlsdesc_dynamic(struct tlsdesc *tdp)
{
	struct tlsdesc_dynamic_arg *td = tdp->argument.pointer;
	dtv_t *dtv = (dtv_t *)THREAD_DTV();
	if (__builtin_expect (td->gen_count <= dtv[0].counter
			      && dtv[td->tlsinfo.ti_module].pointer.val
				 != TLS_DTV_UNALLOCATED,
			      1))
		return dtv[td->tlsinfo.ti_module].pointer.val +
			td->tlsinfo.ti_offset - __builtin_thread_pointer();

	return __tls_get_addr (&td->tlsinfo) - __builtin_thread_pointer();
}


Weak Undefined Symbols
----------------------

For TLS symbols that turn out to be weak and undefined, the dynamic
loader can use yet another resolver function that returns the negated
value of $tp.  This preserves the expected semantics of undefined weak
symbols, at very little overhead even when compared with the incorrect
result of the relaxation of such sequences in the older model.

We thus recommend that references to TLS weak symbols that are not
locally defined be preserved as dynamic call sequences, since it is
possible that the symbol turns out to be defined at the time the
program runs, or that it becomes undefined at run time even if another
library used at link time provided a definition.

The exception to the suggestion above is when linking static
executables, where it can be determined for sure whether a symbol is
defined or not, since that cannot change at run time.  The linker may
then relax the code sequence such that it sets up r0 with the negated
value of TP, such that when TP is added to compute the actual address,
the result is 0.

The result is analogous for both inline and out-of-line trampolines,
disregarding only the extra nop instructions that the linker may
insert in the inline case.  The effective code sequence after changes
is as follows:

	ldr	r0, .Lt0
.L1:	sub	r0, r0, $tp

.Lt0:
	.word	0x0
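
The arithmetic behind this sequence is small enough to model
directly.  The sketch below assumes unsigned wrap-around arithmetic on
pointer-sized integers; the function names are hypothetical and the
thread pointer value in the demo is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the value left in r0 for an undefined weak TLS symbol: the
   negated thread pointer instead of a real tp-relative offset.  */
uintptr_t
weak_undef_offset (uintptr_t tp)
{
  return (uintptr_t) 0 - tp;
}

/* What the caller does with any resolver result: add tp back.  */
uintptr_t
address_from_offset (uintptr_t tp, uintptr_t offset)
{
  return tp + offset;
}

/* Usage: whatever tp is, the computed address is 0, so a test such as
   `&weak_symbol != 0' behaves as it does for non-TLS weak symbols.  */
void
weak_undef_demo (void)
{
  uintptr_t tp = 0x40001000u;	/* arbitrary example value */

  assert (address_from_offset (tp, weak_undef_offset (tp)) == 0);
}
```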

Lazy Specialization
-------------------

Lazy processing of relocations is possible with the new dynamic access
model.  A lazy resolver needs to obtain the relocation index and the
_GLOBAL_OFFSET_TABLE_ address in order to perform lazy relocations.
The relocation index can easily fit in the argument portion of the
descriptor, but loading the GOT address in a register prior to calling
the TLS resolver was deemed as too much overhead, since it is only
necessary for the first time a descriptor is used.

We have therefore introduced another per-module trampoline, which TLS
descriptors eligible for lazy relocation get as their resolver.  The
address of this trampoline is communicated to the dynamic loader by
means of a dynamic table entry: DT_TLSDESC_PLT.  It loads the address
of the actual resolver from a GOT entry, whose address is communicated
to the dynamic loader with another dynamic table entry:
DT_TLSDESC_GOT.  The dynamic loader is responsible for filling in the
named GOT entry with the address of the actual TLS lazy resolver,
whose name, exemplified as _dl_tlsdesc_lazy_resolver, can thus remain
internal to the dynamic loader.

Such a trampoline can be written as:

_dl_tlsdesc_lazy_trampoline:

	stmfd	sp!,{r2}
	ldr	r1,.Lt0
	ldr	r2,[pc, #_GLOBAL_OFFSET_TABLE_ - . - 8 \
			+ _dl_tlsdesc_lazy_resolver(GOT)]
.L1:	add	r1, pc, r1
	bx	r2

.Lt0:	.word	_GLOBAL_OFFSET_TABLE_ - .L1 - 8

Note the unmatched saving of r2: we need scratch registers, and only
r1 is available (see section Resolvers' Calling Convention below), so
we specify that the lazy trampoline saves r2, and the actual resolver
is responsible for restoring it.

Also note the long address expression used to load r2.  It is an
alternate way to compute the relocated DT_TLSDESC_GOT-named address.
It could be loaded from [r1, #_dl_tlsdesc_lazy_resolver(GOT)] after r1
is set up, but this formulation avoids the dynamic dependency on the
GOT register being set up, possibly reducing the latency.  If the
immediate operand does not fit, an alternate r1-based formulation can
be used.

The numbers for the dynamic table entries were originally defined at
http://www.ic.unicamp.br/~oliva/writeups/TLS/RFC-TLSDESC-x86.txt

#define DT_TLSDESC_PLT  0x6ffffef6      /* Location of PLT entry for
                                           TLS descriptor resolver
                                           calls.  */
#define DT_TLSDESC_GOT  0x6ffffef7      /* Location of GOT entry used
                                           by TLS descriptor resolver
                                           PLT entry.  */

Upon call, _dl_tlsdesc_lazy_resolver may then resolve the relocation,
write both the function selected for this access model and its
argument back to the TLS descriptor, and finally restore r2 and jump
to the resolver address in the TLS descriptor.

Since there is no way to atomically update the TLS descriptor, the
lazy resolver must take special care to access it correctly, to avoid
race conditions.  Before making any changes to the descriptor, or even
accessing its argument, the resolver should acquire a global dynamic
loader lock and check that the resolver address in the TLS descriptor
has not changed.  If it has, it should release the lock and proceed to
restoring r2 and jumping to the new resolver address.  Otherwise, it
should set the resolver address in the TLS descriptor to a hold
function that will get any other threads that attempt to use this TLS
descriptor to wait until relocation is complete.

Storing the final value of the TLS descriptor also needs care: the
resolver field must be set to its final value after the argument gets
its final value, such that if any thread attempts to use the
descriptor before it gets its final value, it still goes to the hold
function.

Once the hold function is in place, it would be safe to release the
lock and use some internal condition-variable signaling mechanism to
wake up any threads blocked on it.  However, since the dynamic loader
most likely requires a lock to be held while accessing its internal
data structures to resolve the relocation, a simpler implementation
simply holds the lock until the relocation is completed and the TLS
descriptor fully updated.  The hold function, in this case, simply
acquires the lock, releases it and jumps to the resolver address in
the TLS descriptor.
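
The simpler variant, in which the lock is held until the descriptor is
fully updated, might look like the sketch below.  It is a model under
several assumptions rather than loader code: a POSIX mutex stands in
for the global dynamic loader lock, C11 atomics stand in for the
ordered stores to the resolver field, model_lookup() is a made-up
placeholder for the real relocation processing, and the saving and
restoring of r2 is not modelled at all.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct lazy_desc;
typedef ptrdiff_t lazy_resolver_fn (struct lazy_desc *);

struct lazy_desc
{
  intptr_t argument;	/* relocation index, later the final value */
  _Atomic (lazy_resolver_fn *) resolver;
};

static pthread_mutex_t loader_lock = PTHREAD_MUTEX_INITIALIZER;

ptrdiff_t model_lazy (struct lazy_desc *);

/* Final resolver used in this sketch: return the stored offset.  */
ptrdiff_t
model_return (struct lazy_desc *d)
{
  return d->argument;
}

/* Hold function: wait for the relocating thread, then retry.  */
ptrdiff_t
model_hold (struct lazy_desc *d)
{
  lazy_resolver_fn *r;

  pthread_mutex_lock (&loader_lock);
  pthread_mutex_unlock (&loader_lock);
  r = atomic_load (&d->resolver);
  assert (r != model_hold);	/* relocation must be complete by now */
  return r (d);
}

/* Hypothetical stand-in for resolving relocation number RELOC_INDEX.  */
static intptr_t
model_lookup (intptr_t reloc_index)
{
  return 0x100 + 8 * reloc_index;
}

ptrdiff_t
model_lazy (struct lazy_desc *d)
{
  lazy_resolver_fn *r;

  pthread_mutex_lock (&loader_lock);
  r = atomic_load (&d->resolver);
  if (r != model_lazy)
    {
      /* Another thread got here first: use what it installed.  */
      pthread_mutex_unlock (&loader_lock);
      return r (d);
    }
  atomic_store (&d->resolver, model_hold);	/* park other threads */
  d->argument = model_lookup (d->argument);	/* argument first...  */
  atomic_store (&d->resolver, model_return);	/* ...resolver last */
  pthread_mutex_unlock (&loader_lock);
  return model_return (d);
}
```

The order of the last two stores is the point of the sketch: a thread
that observes model_return in the resolver field is guaranteed to see
the final argument as well, and a thread that arrives earlier ends up
in model_hold() and blocks on the lock.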


Resolvers' Calling Convention
=============================

In order to reduce run-time penalties for relaxed sequences and for
the most common non-relaxed ones, the resolver functions need to save
all registers they modify, including the usually call-clobbered ones.
This can be accomplished, for example, by means of an ldm/stm pair.
Three registers are exceptions to this rule, namely r0, expected to
return the value we need; r1, expected (but not required) to hold the
throw-away resolver address; and the processor flags, which would be
too expensive to save and restore.


Relocations
===========

We add the following new relocation symbols:

#define			R_ARM_TLS_GOTDESC 90
#define			R_ARM_TLS_CALL	  91
#define			R_ARM_TLS_DESCSEQ 92
#define			R_ARM_TLS_DESC    93

All of them are REL relocations, the first three being static ones.

R_ARM_TLS_GOTDESC is emitted by the assembler as (tlsdesc), getting as
addend an incomplete offset that allows the linker to find out which
instruction is expected to use the result of this relocation.  The
linker may subtract a constant amount (#-8 for ARM mode) in case of
relaxation to Initial Exec, or make an adjustment depending on the
call path for the dynamic models (#-4 for a call to the trampoline and
#-8 for an inline trampoline).  For the Local Exec model, the addend
is completely ignored.  In dynamic cases, this relocation is resolved
to the address of the GOT entry in which the TLS descriptor is stored.
For the static linking case of Weak Undefined symbols, the linker
simply resolves the relocation to 0x0.

R_ARM_TLS_CALL is emitted by the assembler as (tlscall).  It gets no
addend, and is resolved by the linker to a suitable immediate that may
allow the bl instruction to jump to the address of the trampoline.  In
case of a relaxation, the linker may:

   * turn the branch instruction into a ldr r0, [pc,r0] (IE
     relaxation)

   * turn the branch instruction into a nop (LE relaxation)

   * turn the branch instruction into a sub r0,r0,$tp (Weak Undefined
     Symbol static linking)
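
The first two rewrites can be sketched as operations on 32-bit ARM
instruction words.  The encodings used are the standard ARM-mode ones
for an unconditional bl, `ldr r0, [pc, r0]' and `mov r0, r0' (used
here as the nop); the function and macro names are hypothetical, and
the weak undefined case is omitted because $tp is not a concrete
register in this document.

```c
#include <assert.h>
#include <stdint.h>

enum tls_relax_model { RELAX_IE, RELAX_LE };

#define ARM_LDR_R0_PC_R0  0xe79f0000u	/* ldr r0, [pc, r0] */
#define ARM_NOP_MOV_R0_R0 0xe1a00000u	/* mov r0, r0 */

/* Nonzero if INSN is an unconditional ARM-mode `bl'.  */
int
is_arm_bl (uint32_t insn)
{
  return (insn & 0xff000000u) == 0xeb000000u;
}

/* Rewrite the bl carrying R_ARM_TLS_CALL for the chosen model.  */
uint32_t
relax_tlscall (uint32_t insn, enum tls_relax_model model)
{
  assert (is_arm_bl (insn));
  return model == RELAX_IE ? ARM_LDR_R0_PC_R0 : ARM_NOP_MOV_R0_R0;
}
```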

R_ARM_TLS_DESCSEQ is emitted by the assembler as .tlsdescseq.  It gets
no addend and is used to change instructions into more suitable ones
in case of a relaxation for a more efficient access model.  When
encountering this relocation, the linker may:

   * turn the `add rA, pc, rB' instruction into a `mov rA, rB', if A
     and B are different, and a nop otherwise (LE relaxation)

   * turn the `add rA, pc, rB' instruction into `sub rA, rB, $tp'
     (Weak undefined Symbol static linking relaxation)

   * turn the `ldr rA, [rB, #4]' and `blx rC' instructions into nops
     (LE relaxation and Weak Undefined Symbol relaxation for static
     linking)

   * turn the `ldr rA, [rB, #4]' instruction into `ldr rA, [rB]' (IE
     Relaxation)

   * turn the `blx rA' instruction into `mov r0, rA' (IE Relaxation)
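
Three of these rewrites are shown below as operations on instruction
words, again as a sketch rather than linker code.  It assumes ARM-mode
instructions with condition AL and no shift, uses `mov r0, r0' as the
nop, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ARM_NOP 0xe1a00000u	/* mov r0, r0 */

/* LE: `add rA, pc, rB' becomes `mov rA, rB', or a nop when A == B.  */
uint32_t
relax_add_le (uint32_t insn)
{
  uint32_t ra = (insn >> 12) & 0xf;
  uint32_t rb = insn & 0xf;

  assert ((insn & 0xffff0ff0u) == 0xe08f0000u);	/* add rA, pc, rB */
  if (ra == rb)
    return ARM_NOP;
  return 0xe1a00000u | (ra << 12) | rb;
}

/* IE: `ldr rA, [rB, #4]' becomes `ldr rA, [rB]'.  */
uint32_t
relax_ldr_ie (uint32_t insn)
{
  assert ((insn & 0xfff00fffu) == 0xe5900004u);	/* ldr rA, [rB, #4] */
  return insn & ~0xfffu;			/* clear the offset */
}

/* IE: `blx rA' becomes `mov r0, rA'.  */
uint32_t
relax_blx_ie (uint32_t insn)
{
  assert ((insn & 0xfffffff0u) == 0xe12fff30u);	/* blx rA */
  return 0xe1a00000u | (insn & 0xf);
}
```

For instance, `add r1, pc, r2' (0xe08f1002) becomes `mov r1, r2'
(0xe1a01002), and `blx r2' (0xe12fff32) becomes `mov r0, r2'
(0xe1a00002).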

R_ARM_TLS_DESC is emitted by the linker in response to
R_ARM_TLS_GOTDESC and/or R_ARM_TLS_CALL and gets no addend.  When lazy
relocations are not allowed, the linker fills both GOT entries with
zero; it is the job of the loader to place the specialized function's
parameter in the first entry, and the function itself in the second.
When lazy relocations are allowed, the linker puts the relocation
index in the first GOT entry, and fills the second entry with zero.
The first GOT entry is in fact the parameter of
_dl_tlsdesc_lazy_trampoline.  The local address of this function is
already provided by the module by way of the DT_TLSDESC_PLT entry in
the dynamic table, so the linker does not need to set it up.  The job
of the dynamic loader simply consists of applying a relative
relocation to the address of the lazy trampoline and storing it in the
second TLS descriptor entry.
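
The division of labour between linker and loader described above can
be summarized in a few lines of C.  This is an illustrative model
only: struct got_tlsdesc, both function names and the load_bias
parameter are assumptions, and the "relative relocation" is modelled
as adding the module's load address to the DT_TLSDESC_PLT value.

```c
#include <assert.h>
#include <stdint.h>

/* One TLS descriptor as it sits in the GOT of a 32-bit ARM module.  */
struct got_tlsdesc
{
  uint32_t argument;	/* first GOT entry */
  uint32_t resolver;	/* second GOT entry */
};

/* Link time: initial contents written by the static linker.  */
void
linker_init_tlsdesc (struct got_tlsdesc *d, int lazy, uint32_t reloc_index)
{
  d->argument = lazy ? reloc_index : 0;
  d->resolver = 0;
}

/* Load time, lazy case: point the descriptor at the module's lazy
   trampoline.  LOAD_BIAS is the module's load address and TLSDESC_PLT
   the value of its DT_TLSDESC_PLT entry (both hypothetical inputs).  */
void
loader_prepare_lazy (struct got_tlsdesc *d, uint32_t load_bias,
		     uint32_t tlsdesc_plt)
{
  assert (d->resolver == 0);	/* the linker left this entry empty */
  d->resolver = load_bias + tlsdesc_plt;
}
```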


Revision History
================

0.2.2 2008-03-15: Fix typos.

0.2.1 2006-04-07: Revise relaxation rules for inlined TLS desc call
sequences.  Explain how far the register allocation freedom goes.

0.2   2006-04-06: Avoid the use of ldrd/strd.  Global reformatting.

0.1.1 2006-03-29: First `published' specification.