The design and implementation of a better ThreadLocal<T>

time to read 7 min | 1291 words

I talked about finding a major issue with ThreadLocal<T> and the impact it had on long-lived, large-scale production environments. I’m not sure why ThreadLocal<T> is implemented the way it is, but it seems to me that it was never meant to be used with tens of thousands of instances and thousands of threads. Even then, the GC pauses issue is something you wouldn’t expect to see just by reading the code. So we had to do better, and this gives me a chance to do something relatively rare: talk about a complete feature implementation in detail. I don’t usually get to do this; features are usually far too big for me to cover in real detail.

I’m also interested in feedback on this post. I usually break a topic like this into multiple posts in a series, but I wanted to try putting it all in one place. The downside is that it may be too long and detailed to read in one sitting. Please let me know your thoughts on the matter; it would be very helpful.

Before we get started, let’s talk about the operating environment and what we are trying to achieve:

  1. Running on .NET Core.
  2. Need to support tens of thousands of instances (I don’t like it, but fixing that issue is going to take a lot longer).
  3. No shared state between instances.
  4. The cost of a ThreadLocal instance is related to the number of thread values it holds, nothing else.
  5. Should automatically clean up after itself when a thread is closed.
  6. Should automatically clean up after itself when a ThreadLocal instance is disposed.
  7. Can access all the values across all threads.
  8. Play nicely with the GC.

That is quite a list, I have to admit. There are a lot of separate concerns that we have to take into account, but the implementation turned out to be relatively small. First, let’s show the code, and then we can discuss how it answers the requirements.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed partial class LightThreadLocal<T> : IDisposable
{
    [ThreadStatic]
    private static CurrentThreadState _state;

    private ConcurrentDictionary<CurrentThreadState, T> _values =
        new ConcurrentDictionary<CurrentThreadState, T>(ReferenceEqualityComparer<CurrentThreadState>.Default);

    private readonly Func<T> _generator;

    public LightThreadLocal(Func<T> generator = null)
    {
        _generator = generator;
    }

    public ICollection<T> Values => _values.Values;

    public bool IsValueCreated => _state != null && _values.ContainsKey(_state);

    public T Value
    {
        get
        {
            // Lazily create this thread's state and make sure it knows about us,
            // so the thread can clean up after itself when it dies.
            (_state ??= new CurrentThreadState()).Register(this);
            if (_values.TryGetValue(_state, out var v) == false &&
                _generator != null)
            {
                v = _generator();
                _values[_state] = v;
            }
            return v;
        }
        set
        {
            (_state ??= new CurrentThreadState()).Register(this);
            _values[_state] = value;
        }
    }

    public void Dispose()
    {
        GC.SuppressFinalize(this);
        var copy = _values;
        _values = null;
        if (copy == null)
            return; // already disposed

        // Other threads may still be adding or removing entries concurrently,
        // so we keep going until the dictionary is actually empty.
        while (copy.Count > 0)
        {
            foreach (var kvp in copy)
            {
                if (copy.TryRemove(kvp.Key, out var item) &&
                    item is IDisposable d)
                {
                    d.Dispose();
                }
            }
        }
    }

    ~LightThreadLocal()
    {
        Dispose();
    }
}

This shows the LightThreadLocal<T> class, but it is missing the CurrentThreadState, which we’ll discuss in a bit. In terms of the data model, we have a concurrent dictionary indexed by a CurrentThreadState instance that is held in a thread static variable. The code also allows you to provide a generator, which is used to create a default value on a thread’s first access.
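To make the shape of the API concrete, here is a minimal usage sketch (the per-thread buffer scenario is just an illustration, not something from the actual codebase):

using System;
using System.Threading.Tasks;

// Hypothetical usage: one scratch buffer per thread, created lazily by the generator.
var perThreadBuffer = new LightThreadLocal<byte[]>(() => new byte[4096]);

Parallel.For(0, 8, i =>
{
    // The first access on each thread invokes the generator and stores the result
    // in the dictionary, keyed by this thread's CurrentThreadState.
    byte[] buffer = perThreadBuffer.Value;
    buffer[0] = (byte)i;

    // Assigning Value simply overwrites this thread's entry.
    perThreadBuffer.Value = buffer;
});

// Disposes any per-thread values that implement IDisposable (not the case for byte[]).
perThreadBuffer.Dispose();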

The first design decision is the key for the dictionary. I thought about using Thread.CurrentThread or the thread id. Using the thread id as the key is dangerous, because thread ids may be reused, and that is a case of a nasty^nasty bug. Yes, that is a nasty bug raised to the power of nasty. I can just imagine trying to debug something like that; it would be a nightmare. As for Thread.CurrentThread, there we won’t have reused instances, so that is fine, but we do need to keep track of additional information for our purposes, so we can’t just use the thread instance itself. Therefore, we created our own class to keep track of the state.
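The dictionary is constructed with ReferenceEqualityComparer<CurrentThreadState>.Default, which isn’t shown in this post. .NET 5 and later ship a non-generic System.Collections.Generic.ReferenceEqualityComparer; on older runtimes you would roll your own. Here is a minimal sketch of what such a generic comparer might look like (this is my assumption about its shape, not the actual class from the codebase):

using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Compares keys by reference identity only, ignoring any Equals/GetHashCode overrides.
public sealed class ReferenceEqualityComparer<T> : IEqualityComparer<T> where T : class
{
    public static readonly ReferenceEqualityComparer<T> Default = new ReferenceEqualityComparer<T>();

    public bool Equals(T x, T y) => ReferenceEquals(x, y);

    // RuntimeHelpers.GetHashCode returns the identity-based hash code.
    public int GetHashCode(T obj) => RuntimeHelpers.GetHashCode(obj);
}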

All instances of LightThreadLocal are going to share the same thread static value. However, that value is kept as small as possible; its only purpose is to let us index into each instance’s dictionary. This means that except for the shared thread static state, there is no interaction between different LightThreadLocal instances. If we have a lot of such instances, we use a lot less space and won’t degrade performance over time.

I also implemented explicit disposal of the values, as well as a finalizer. There is some song and dance around the disposal to make sure it plays nicely with concurrent cleanup from a dying thread (see later), but that is pretty much it.

There really isn’t much to do here, right? Except that the real magic happens in the CurrentThreadState.

public partial class LightThreadLocal<T> : IDisposable
{
    public sealed class CurrentThreadState
    {
        private readonly HashSet<WeakReferenceToLightThreadLocal> _parents
            = new HashSet<WeakReferenceToLightThreadLocal>();

        public void Register(LightThreadLocal<T> parent)
        {
            _parents.Add(new WeakReferenceToLightThreadLocal(parent));
        }

        ~CurrentThreadState()
        {
            // Runs once the owning thread has exited: remove (and dispose) this
            // thread's value from every LightThreadLocal instance that is still alive.
            foreach (var parent in _parents)
            {
                if (parent.TryGetTarget(out var liveParent) == false)
                    continue;

                var copy = liveParent._values;
                if (copy == null) // the parent was disposed concurrently
                    continue;

                if (copy.TryRemove(this, out var value)
                    && value is IDisposable d)
                {
                    d.Dispose();
                }
            }
        }
    }
}

Not that much magic, huh? :-)

We keep a set of the LightThreadLocal instances that have registered a value for this thread, and we have a finalizer that will be called once the thread dies. The finalizer goes over all the LightThreadLocal instances that used this thread and removes the values registered for it. Note that this may run concurrently with LightThreadLocal.Dispose, so we have to be a bit careful (the careful bit happens in LightThreadLocal.Dispose).
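The reason this works at all is that a [ThreadStatic] field only keeps its object alive for as long as the owning thread is alive. Once the thread exits, the CurrentThreadState becomes unreachable and its finalizer becomes eligible to run. Here is a small sketch that demonstrates the effect (the Canary class is purely illustrative):

using System;
using System.Threading;

public static class ThreadStaticFinalizationDemo
{
    private sealed class Canary
    {
        // Runs on the finalizer thread once the owning thread has exited
        // and a GC notices the object is no longer reachable.
        ~Canary() => Console.WriteLine("finalized after the thread died");
    }

    [ThreadStatic]
    private static Canary _canary;

    public static void Main()
    {
        var thread = new Thread(() => _canary = new Canary());
        thread.Start();
        thread.Join(); // the thread is gone, and so is its thread-static slot

        GC.Collect();
        GC.WaitForPendingFinalizers(); // prints: finalized after the thread died
    }
}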

There is one thing here that deserves attention, though: the WeakReferenceToLightThreadLocal class. Here it is in all its glory:

public partial class LightThreadLocal<T> : IDisposable
{
    private sealed class WeakReferenceToLightThreadLocal : IEquatable<WeakReferenceToLightThreadLocal>
    {
        private readonly WeakReference<LightThreadLocal<T>> _weak;
        private readonly int _hashCode;

        public bool TryGetTarget(out LightThreadLocal<T> target)
        {
            return _weak.TryGetTarget(out target);
        }

        public WeakReferenceToLightThreadLocal(LightThreadLocal<T> instance)
        {
            _hashCode = instance.GetHashCode();
            _weak = new WeakReference<LightThreadLocal<T>>(instance);
        }

        public bool Equals(WeakReferenceToLightThreadLocal other)
        {
            if (ReferenceEquals(null, other)) return false;
            if (ReferenceEquals(this, other)) return true;
            if (_hashCode != other._hashCode)
                return false;
            if (_weak.TryGetTarget(out var x) == false ||
                other._weak.TryGetTarget(out var y) == false)
                return false;
            return ReferenceEquals(x, y);
        }

        public override bool Equals(object obj)
        {
            if (ReferenceEquals(null, obj)) return false;
            if (ReferenceEquals(this, obj)) return true;
            if (obj.GetType() != this.GetType()) return false;
            return Equals((WeakReferenceToLightThreadLocal)obj);
        }

        public override int GetHashCode()
        {
            return _hashCode;
        }
    }
}

This is basically a wrapper around WeakReference that allows us to get a stable hash value even after the referenced instance has been collected. The reason we use it is that we need to reference the LightThreadLocal from the CurrentThreadState, and if we held a strong reference, that would prevent the LightThreadLocal instance from ever being collected. It also means that in terms of the complexity of the object graph, we have only forward references with no cycles, cross references, etc. That should be a fairly simple object graph for the GC to walk through, which is the whole point of what I’m trying to do here.

Oh, we also need to support accessing all the values across all threads, but that is so trivial I don’t think I need to talk about it. Each LightThreadLocal has its own concurrent dictionary, so we can just use its Values property and get the right result.
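For completeness, here is what that typically looks like. The per-thread counter is a hypothetical example, and int[1] is used so the value handed out by the getter can be mutated in place:

using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical example: per-thread counters, aggregated through Values.
var counter = new LightThreadLocal<int[]>(() => new int[1]);

Parallel.For(0, 1000, _ => counter.Value[0]++);

// Each thread that touched the counter contributed one entry to Values.
int total = counter.Values.Sum(c => c[0]);
Console.WriteLine(total); // 1000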

We aren’t done yet, though. There are still certain things that I didn’t handle. For example, if we have a lot of LightThreadLocal instances, their registrations would accumulate in the thread static state, leading to large memory usage. We want to automatically clean these up when a LightThreadLocal instance goes away. That turns out to be somewhat of a challenge. There are a few issues here:

  • We can’t do that from the LightThreadLocal.Dispose / finalizer. That would mean that we have to guard against concurrent data access, and that would impact the common path.
  • We don’t want to create a reference from the LightThreadLocal to the CurrentThreadState; that would lead to a more complex data structure and may lead to slow GC.

Instead of holding references to the real objects, we introduce two new ones: a local state and a global state.

private class GlobalState
{
    public int Disposed;

    public readonly HashSet<LocalState> UsedThreads
        = new HashSet<LocalState>(ReferenceEqualityComparer<LocalState>.Default);

    public void Dispose()
    {
        Interlocked.Exchange(ref Disposed, 1);
        foreach (var localState in UsedThreads)
        {
            Interlocked.Increment(ref localState.ParentsDisposed);
        }
    }
}

private class LocalState
{
    public int ParentsDisposed;
}

The global state exists at the level of the LightThreadLocal instance, while the local state exists at the level of each thread. The local state is just a number, indicating whether there are any disposed parents. The global state holds the local state of all the threads that interacted with the given LightThreadLocal instance. By introducing these classes, we break apart the object references. The LightThreadLocal isn’t holding (directly or indirectly) any reference to the CurrentThreadState, and the CurrentThreadState only holds a weak reference to the LightThreadLocal.

Finally, we need to actually make use of this state, and we do that by calling GlobalState.Dispose() when the LightThreadLocal is disposed. That marks all the threads that interacted with it as having a disposed parent. Crucially, we don’t need to do anything else there. All the rest happens in the CurrentThreadState, on its own thread. Here is what this looks like:

public void Register(LightThreadLocal<T> parent)
{
    // Let the parent know this thread is using it, and remember the parent
    // (weakly) so we can clean up when this thread dies.
    parent._globalState.UsedThreads.Add(_localState);
    _parents.Add(new WeakReferenceToLightThreadLocal(parent));

    int parentsDisposed = _localState.ParentsDisposed;
    if (parentsDisposed > 0)
    {
        RemoveDisposedParents(parentsDisposed);
    }
}

private void RemoveDisposedParents(int parentsDisposed)
{
    var toRemove = new List<WeakReferenceToLightThreadLocal>();
    foreach (var local in _parents)
    {
        if (local.TryGetTarget(out var target) == false || target._globalState.Disposed != 0)
        {
            toRemove.Add(local);
        }
    }

    foreach (var remove in toRemove)
    {
        _parents.Remove(remove);
    }

    // Mark the disposals we just handled as processed for this thread.
    Interlocked.Add(ref _localState.ParentsDisposed, -parentsDisposed);
}

Whenever the Register method is called (which happens whenever we use the LightThreadLocal.Value property), we register our own thread with the global state of the LightThreadLocal instance and then check whether we have been notified of a disposal. If so, we clean up our own state in RemoveDisposedParents.
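The post doesn’t show the disposal side of this wiring, but putting the pieces together, the LightThreadLocal.Dispose from the beginning of the post would also need to notify the global state. Something along these lines; this is a sketch under that assumption, and the _globalState field placement is inferred from the Register code above rather than shown directly:

public sealed partial class LightThreadLocal<T> : IDisposable
{
    private readonly GlobalState _globalState = new GlobalState();

    public void Dispose()
    {
        GC.SuppressFinalize(this);

        // Notify every thread that ever registered with this instance that a parent
        // was disposed; each thread prunes its own _parents set on its next Register call.
        _globalState.Dispose();

        // ... then remove and dispose the per-thread values, as shown earlier ...
    }
}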

This closes down all the loopholes in the usage that I can think of, at least for now.

This is currently going through our testing infrastructure, but I thought it was an interesting feature: small enough to talk about, but complex enough that there are multiple competing requirements to consider and non-trivial aspects to work with.