Enhanced C#
Language of your choice: library documentation
|
A hash-trie data structure for use inside other data structures. More...
A hash-trie data structure for use inside other data structures.
InternalSet{T} is a dual-mode mutable/immutable "hash trie", which is a kind of tree that is built from the hashcode of the items it contains. It supports fast cloning, and is suitable as a persistent data structure.
InternalSet<T> is not designed to be used by itself, but as a building block for other data structures. It has no Count property because it does not know its own size; the outer data structure must track the size if the size is needed. The lack of a Count property allows an empty InternalSet to use a mere single word of memory!
This is my second implementation of InternalSet. The original version used memory very efficiently for reference types, but required boxing for value types; this version needs more memory, but is moderately faster in most cases and supports value types without boxing. I estimate that InternalSet (the second version) uses roughly the same amount of memory as HashSet{T} (actually more or less depending on the number of items in the set, and on the hashcode distribution.)
Collection classes based on InternalSet are most efficient for small sets, but if you always need small sets then a simple wrapper around HashSet would suffice. In fact, despite my best efforts, this data type rarely outperforms HashSet, but this is because HashSet{T} is quite fast, not because InternalSet{T} is slow. Still, there are several reasons to consider using a collection class based on InternalSet instead of HashSet{T}:
InternalSet is not efficient for Ts that are expensive to compare; unlike standard .NET collections, this data structure does not store the hashcode of each item inside the collection. The memory saved by not storing the hashcode compensates for the extra memory that InternalSet
tends to require due to its structure.
As I was saying, this data structure is inspired by Clojure's PersistentHashMap. Whereas PersistentHashMap uses nodes of size 32, I chose to use nodes of size 16 in order to increase space efficiency for small sets; for some reason I tend to design programs that use many small collections and a few big ones, so I tend to prefer designs that stay efficient at small sizes.
So InternalSet is a tree of nodes, with each level of the tree representing 4 bits of the hashcode. Slots in the root node are selected based on bits 0 to 3 of the hashcode, slots in children of the root are selected based on bits 4 to 7 of the hashcode, and so forth. Here's a diagram:
_root* * IsFrozen=true | | +---------+---------+----+----+---------+---------+ | | | | | | 0x2 0x3 0x6 0x7 0x9 0xF | | | +--+--+ | +--+--+ | | | | | 0x13 0x73 0x57 0x09 0x59
Each of the 12 nodes on this diagram has 16 slots for items of type T, and the 4 nodes that have children have 16 additional slots for references to children. The numbers on the nodes represent their role in the tree; for example:
Technically, this data structure has O(log N) time complexity for search, insertion and removal. However, it's a base-16 logarithm and maxes out at 8 levels, so it is faster than typical O(log N) algorithms that are base-2. At smaller sizes, its speed is similar to a conventional hashtable, and some operations are still efficient at large sizes, too.
Unlike InternalList{T}, new InternalSet<T>()
is a valid empty set. Moreover, because the root node is never changed after it is created (unless you modify it while it is frozen), all copies of an InternalSet{T} represent the same set unless the set is frozen with CloneFreeze; see Thaw() for more information.
The neatest feature of this data structure is fast cloning and subtree sharing. You can call CloneFreeze to freeze/clone the trie in O(1) time; this freezes the root node (a transitive property that implicitly affects all children), but still permits the hashtrie to be modified by copying nodes on-demand. Thus the trie is actually frozen, but copy-on-write behavior provides the illusion that it is still editable.
This data structure is designed to support classes that contain mutable data, so that it can be used to construct dictionaries; that is, it allows T values that have an immutable "key" part and a mutable "value" part. Call Find to retrieve the value associated with a key, and call Add with replaceIfPresent=true to change the "value" associated with a key. The Map{K,V} and MMap{K,V} classes rely on this feature to implement a dictionary.
How it works: I call this data structure a "hash-trie" because it blends properties of hashtables and tries. It places items into a tree by taking their hashcode and dividing it into 8 groups of 4 bits, starting at the least significant bits. Each group of 4 bits is used to select a location in the tree/trie, and each node of the tree always has 16 items (and 16 children, if it has any children at all.) For example, consider a tree with 7 items that have the following hash codes:
The top level of the trie represents the lowest 4 bits of the hashcode. Since each node has 16 items, 7 items can usually fit in a single node, but in this case there are too many hashcodes that end with "1", causing a node split:
|0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F| _root ==> _items | |!|!|M|!| | | | | | | |K| | | | _children | |*| | | | | | | | | | | | | | |
|0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F| * child node ==> _items |O|P| | | |N| | | |L| |J| | | | | _children (null)
("!" represents the deleted flag, which indicates that an item was once present at this location.)
The second level of the trie represents bits 4-7, which is the second- last hex digit. You can see, for example, that the second-last digit of N is 5, therefore N is stored at index 5 of the child node.
In case hashcodes of different objects collide at a particular digit, adjacent array elements can be used to hold the different objects that share the same 4-bit sub-hashcode; this is a bounded-time variation on the linearly-probed hashtable. In this example, both O and P have zero as their second-last digit. Assuming O is added first, it takes slot [0]; then P takes slot [1]. Up to 3 adjacent entries can be used for a given hashcode; therefore, when searching for an entry it is necessary to search up to 4 locations in each node: the preferred location, plus 3 adjacent locations.
For example, support we search for an item X that is not in the set and has hashcode 0xCCA9A241. In that case, the Find methods starts with the least-significant digit, 1. This points us to the child slot; an invariant of our hashtrie is that if there is a child node, all items with the corresponding sub-hashcode must be placed in the child node. Therefore it is impossible, for example, that X could be located at index 2 of the root node; the existence of the child node guarantees that it is not there. So the Find method looks inside the child node, at index 4 (the second-last digit of X's hashcode) and finds nothing. It also looks at indexes 5, 6, and 7, comparing N to X in the process. Since none of these slots contain X, the Find method returns false.
Something unfortunate happens if five or more objects have the same hashcode: it forces the tree to have maximum depth. Since a particular hashcode can only be repeated four times in a single node, upon adding a fifth item with the same hashcode, child nodes are created for all 8 digits of the hashcode. At the 8th level, a special node type is allocated that contains, in addition to the usual 16 slots, a list of "overflow slots" holds items that cannot fit in the normal slots due to excessive collisions. All of this has a substantial memory penalty; to avoid this problem, use a better hash function that does not create false collisions.
If there are more than 16 items that share the same 28 lower-order bits, the overflow area on the 8th level node will expand to hold all of these items; this is the only way that a node can have more than 16 items.
Fast cloning works by setting the "IsFrozen" flag on the root node. When a node is frozen, all its children are frozen implicitly; since the children are not marked right away, the CloneFreeze method can return immediately. The frozen flag will be propagated from parents to children lazily, when the tree is modified later.
To "thaw" a node, a copy is made of that node and all of its parents. For example, suppose that the following tree is frozen and cloned:
_root* * IsFrozen=true | | +---------+---------+----+----+---------+---------+ | | | | | | 0x2 0x3 0x6 0x7 0x9 0xF | | | +--+--+ | +--+--+ | | | | | 0x13 0x73 0x57 0x09 0x59
Remember, only the root's IsFrozen flag is set at first; all other nodes do not have the frozen flag yet.
Now suppose that an item is added to node 0x9 (e.g. something with hashcode 0x39 could go in this node). Before the new item can be placed in node 0x9, it must be thawed. To thaw it, an unfrozen copy is made, leaving the original untouched. The copy is not frozen, but it does point to the same frozen children (0x09 and 0x59), so a for-loop sets the IsFrozen flag of each child. Then, the new item is added to the copy of node 0x9. Next, the _root is also unfrozen by making a copy of it with IsFrozen=false
. Again, a for-loop sets the IsFrozen flag of each frozen child, and then child slot [9] in the root is replaced with the new copy of 0x9 (which has the new item).
This concludes the thawing process. So at this point, just two nodes are actually unfrozen, and the modified tree looks like this:
! Unfrozen copy _root! * IsFrozen=true | | +---------+---------+----+----+---------+---------+ | | | | | | 0x2* 0x3* 0x6* 0x7* 0x9! 0xF* | | | +--+--+ | +--+--+ | | | | | 0x13 0x73 0x57 0x09* 0x59*
There are 12 nodes here and 2 have been copied. The other 10 nodes are still shared between the modified tree and the clone. Next, if you add an item to node 0x6, only that one node has to be thawed; the root has already been thawed and there is no need to make another copy of it. Due to the random nature of hashcodes, it is probable that as you modify the set after cloning it, it is typical for each modification to require approximately one node to be thawed, until the majority of the nodes have been thawed.
InternalSet does not thaw unnecessarily. If you try to remove an item that is not present, none of the tree will be thawed. If you add an item that is already present in a frozen node (and you do not ask for replacement), that node will not be thawed. Contains and Find never cause thawing.
I am not aware whether a data structure quite like this has been described in the comp-sci literature or not (although it probably has). If you see something like this in a paper, let me know.
When attempting to insert a new item in a node, the first available empty slot will be used; and when searching for an item, the search stops at an empty slot. For example, suppose that the root node contains these items:
|0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F| _root ==> _items |A| |C| |E|F| | |I| |K|L| |N| | |
Now suppose that you are searching for, or adding, or an item 'D' whose hashcode ends with '3'. Slot 3 is empty, and this data structure works in such a way that the search for 'D' can end immediately with a result of 'false', or it can be added at slot 2 immediately without comparing 'D' with slots 4, 5 and 6 which (if 2 were not empty) might already contain 'D'.
The reasoning behind this rule is that if 'D' already existed in the set, slot 2 should not be empty; since it is empty, 'D' must not be in the set already. However, deletions could violate this logic. For example, imagine that we add two items, first 'd' and then 'D', which both have a hashcode that ends in '3'. Then the node would look like this:
|0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F| _root ==> _items |A| |C|d|E|F|D| |I| |K|L| |N| | |
Next, you delete 'd'. Imagine that this leaves the node in the following state:
|0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F| _root ==> _items |A| |C| |E|F|D| |I| |K|L| |N| | |
Now 'D' is left outside its 'home' location of 3. If you then attempt to add 'D' to the set, a duplicate copy would be added at position '3'! Or if you search for 'D' instead, the result would be 'false' even though D is present in the set.
I thought of two solutions to this problem; the first was to 'fix' the node after a deletion so that 'D' would move from slot 6 to 3. But there's a big problem with this solution because InternalSet{T}.Enumerator has a RemoveCurrent()
method which is supposed to delete the current item and move to the next one. If the node had to be rearranged in response to a deletion, it would be very difficult to guarantee that the enumerator still returns each item in the set exactly once.
The second solution, which I actually implemented, puts a special "deleted" marker in slot 3 (denoted ! on the first diagram). This marker forces the search routine to compare the item being added or searched for with other slots beyond the current one, but otherwise it behaves like an empty slot.
There is a third solution–always check all four possible slots. But the comparison is not always cheap, so InternalSet{T} does not use this solution unless you are using null
as the value of the IEqualityComparer{T}.
Since InternalSet{T} can hold any value of type T, the "deleted" and "empty/in use" indicators cannot physically be stored in the slots of type T. Instead, these indicators are stored separately, with 16 bits for "deleted" flags and 16 bits for "used" flags.
During a normal delete operation, if a node has no children and is using only one or two slots after an item is deleted, the parent is checked for empty slots to find out whether the child is really necessary. If there are enough free slot(s) in the parent node, the remaining items in the child are transferred back back to the parent and the child is deleted (the reference to it is cleared to null).
Unfortunately, this behavior is not available when you call Enumerator.RemoveCurrent. In order to maintain the integrity of the enumerator, a child node will not be deleted during a call to RemoveCurrent
unless the node is completely empty after the removal. Consequently, the tree will use extra memory if you remove most, but not all, items from the set using RemoveCurrent
.
By the way, unlike the original implementation, this version of InternalSet allows 'null' to be a member of the set.
Interesting fact: it is possible for two sets to be equal (contain the same items), and yet for those items to be enumerated in different orders in the two sets.
Nested classes | |
struct | Enumerator |
Public static fields | |
static readonly InternalSet< T > | Empty = new InternalSet<T> { _root = FrozenEmptyRoot() } |
An empty set. More... | |
static readonly IEqualityComparer< T > | DefaultComparer = typeof(IReferenceComparable).IsAssignableFrom(typeof(T)) ? null : EqualityComparer<T>.Default |
This is EqualityComparer{T}.Default, or null if T implements IReferenceComparable. More... | |
static readonly OnFoundExisting | AddIfNotPresent = _IgnoreExisting_ |
static readonly OnFoundExisting | AddOrReplace = _ReplaceExisting_ |
static readonly OnFoundExisting | RemoveMode = _DeleteExisting_ |
static Enumerator | _setOperationEnumerator |
Properties | |
bool | IsRootFrozen [get] |
bool | HasRoot [get] |
Public Member Functions | |
InternalSet (IEnumerable< T > list, IEqualityComparer< T > comparer, out int count) | |
InternalSet (IEnumerable< T > list, IEqualityComparer< T > comparer) | |
int | GetSetHashCode (IEqualityComparer< T > comparer) |
InternalSet< T > | CloneFreeze () |
Freezes the hashtrie so that any further changes require paths in the tree to be copied. More... | |
void | Thaw () |
Thaws a frozen root node by duplicating it, or creates the root node if the set doesn't have one. More... | |
bool | Add (ref T item, IEqualityComparer< T > comparer, bool replaceIfPresent) |
Tries to add an item to the set, and retrieves the existing item if present. More... | |
bool | Remove (ref T item, IEqualityComparer< T > comparer) |
Removes an item from the set. More... | |
delegate bool | OnFoundExisting (ref Node slots, int i, T item) |
bool | Find (ref T item, IEqualityComparer< T > comparer) |
Enumerator | GetEnumerator () |
IEnumerator< T > IEnumerable< T >. | GetEnumerator () |
System.Collections.IEnumerator System.Collections.IEnumerable. | GetEnumerator () |
void | CopyTo (T[] array, int arrayIndex) |
void | Clear () |
bool | Contains (T item, IEqualityComparer< T > comparer) |
int | Count () |
int | UnionWith (IEnumerable< T > other, IEqualityComparer< T > thisComparer, bool replaceIfPresent) |
Adds the contents of 'other' to this set. More... | |
int | UnionWith (InternalSet< T > other, IEqualityComparer< T > thisComparer, bool replaceIfPresent) |
int | IntersectWith (InternalSet< T > other, IEqualityComparer< T > otherComparer) |
Removes all items from this set that are not present in 'other'. More... | |
int | IntersectWith (ISet< T > other) |
int | IntersectWith (IEnumerable< T > other, IEqualityComparer< T > comparer) |
Removes all items from this set that are not present in 'other'. More... | |
int | ExceptWith (IEnumerable< T > other, IEqualityComparer< T > thisComparer) |
Removes all items from this set that are present in 'other'. More... | |
int | ExceptWith (InternalSet< T > other, IEqualityComparer< T > thisComparer) |
int | SymmetricExceptWith (InternalSet< T > other, IEqualityComparer< T > thisComparer) |
int | SymmetricExceptWith (IEnumerable< T > other, IEqualityComparer< T > comparer, bool xorDuplicates=true) |
Modifies the current set to contain only elements that were present either in this set or in the other collection, but not both. More... | |
bool | IsSubsetOf (ISet< T > other, int myMinCount) |
Returns true if all items in this set are present in the other set. More... | |
bool | IsSubsetOf (InternalSet< T > other, IEqualityComparer< T > otherComparer) |
bool | IsSubsetOf (IEnumerable< T > other, IEqualityComparer< T > comparer, int myMinCount=0) |
bool | IsSupersetOf (IEnumerable< T > other, IEqualityComparer< T > thisComparer, int myMaxCount=int.MaxValue) |
Returns true if all items in the other set are present in this set. More... | |
bool | IsSupersetOf (InternalSet< T > other, IEqualityComparer< T > thisComparer) |
bool | Overlaps (IEnumerable< T > other, IEqualityComparer< T > thisComparer) |
Returns true if this set contains at least one item from 'other'. More... | |
bool | Overlaps (InternalSet< T > other, IEqualityComparer< T > thisComparer) |
bool | IsProperSubsetOf (ISet< T > other, int myExactCount) |
Returns true if all items in this set are present in the other set, and the other set has at least one item that is not in this set. More... | |
bool | IsProperSubsetOf (IEnumerable< T > other, IEqualityComparer< T > comparer, int myExactCount) |
Returns true if all items in this set are present in the other set, and the other set has at least one item that is not in this set. More... | |
bool | IsProperSupersetOf (ISet< T > other, IEqualityComparer< T > thisComparer, int myExactCount) |
Returns true if all items in the other set are present in this set, and this set has at least one item that is not in the other set. More... | |
bool | IsProperSupersetOf (IEnumerable< T > other, IEqualityComparer< T > comparer, int myExactCount) |
Returns true if all items in the other set are present in this set, and this set has at least one item that is not in the other set. More... | |
bool | SetEquals (ISet< T > other, int myExactCount) |
Returns true if this set and the other set have the same items. More... | |
bool | SetEquals (IEnumerable< T > other, IEqualityComparer< T > comparer, int myExactCount) |
Returns true if this set and the other set have the same items. More... | |
int | CountMemory (int sizeOfT) |
Measures the total size of all objects allocated to this collection, in bytes, including the size of InternalSet{T} itself (which is one word). More... | |
int | CountMemory (int sizeOfT, out InternalSetStats stats) |
Measures the total size of all objects allocated to this collection, in bytes, and counts the number of nodes of different types. More... | |
Static Public Member Functions | |
static Node | FrozenEmptyRoot () |
static int | Adj (int i, int n) |
static bool | Equals (T value, ref T item, IEqualityComparer< T > comparer) |
static uint | GetHashCode (T item, IEqualityComparer< T > comparer) |
static void | PropagateFrozenFlag (Node parent, Node child) |
static void | ReplaceChild (ref Node slots, int iHome, Node newChild) |
static bool | TryRemoveChild (ref Node slots, int iHome, Node child) |
static bool | _IgnoreExisting_ (ref Node slots, int i, T item) |
static bool | _ReplaceExisting_ (ref Node slots, int i, T item) |
static bool | _DeleteExisting_ (ref Node slots, int i, T item) |
static bool | AddOrRemove (ref Node slots, ref T item, uint hc, IEqualityComparer< T > comparer, OnFoundExisting mode) |
static bool | OnFoundInOverflow (ref Node slots, int i, ref T item, OnFoundExisting mode, T existing) |
static int | SelectBucketToSpill (Node slots, int i0, IEqualityComparer< T > comparer) |
static void | Spill (Node parent, int i0, IEqualityComparer< T > comparer) |
static Enumerator | SetOperationEnumerator () |
|
inline |
Tries to add an item to the set, and retrieves the existing item if present.
References Loyc.Collections.AddIfNotPresent, and Loyc.Collections.AddOrReplace.
|
inline |
Freezes the hashtrie so that any further changes require paths in the tree to be copied.
This is an O(1) operation. It causes all existing copies of this InternalSet{T}, as well as any other copies you make in the future, to become independent of one another so that modifications to one copy do not affect any of the others.
To unfreeze the hashtrie, simply modify it as usual with (for example) a call to Add or Remove, or call Thaw. Frozen parts of the trie are copied on-demand.
|
inline |
Measures the total size of all objects allocated to this collection, in bytes, including the size of InternalSet{T} itself (which is one word).
sizeOfT | Size of each T. C# provides no way to get this number so it must be supplied as a parameter. If T is a reference type such as String, IntPtr.Size tells you the size of each reference; please note that this method is does not look "inside" each T, it just measures the "shallow" size of the collection. For instance, if this is a set of strings, then CountMemory(IntPtr.Size) is the size of the set including the references to the strings, but not including the strings themselves. |
|
inline |
Measures the total size of all objects allocated to this collection, in bytes, and counts the number of nodes of different types.
|
inline |
Removes all items from this set that are present in 'other'.
other | The set whose members should be removed from this set. |
thisComparer | The comparer for this set (not for 'other', which is simply enumerated). |
References Loyc.Collections.Remove.
|
inline |
Removes all items from this set that are not present in 'other'.
other | The set whose members should be kept in this set. |
otherComparer | The comparer for 'other' (not for this set, which is simply enumerated). |
|
inline |
Removes all items from this set that are not present in 'other'.
other | The set whose members should be kept in this set. |
This method is costly if 'other' is not a set; a temporary set will be constructed to answer the query. Also, this overload has the same subtle assumption as the other overload.
|
inline |
Returns true if all items in this set are present in the other set, and the other set has at least one item that is not in this set.
This implementation assumes that if the two sets use different definitions of equality (different IEqualityComparer{T}s), that neither set contains duplicates from the point of view of the other set. If this rule is broken–meaning, if either of the sets were constructed with the comparer of the other set, that set would shrink– then the results of this method are unreliable. If both sets use the same comparer, though, you have nothing to worry about.
|
inline |
Returns true if all items in this set are present in the other set, and the other set has at least one item that is not in this set.
This method is costly if 'other' is not a set; a temporary set will be constructed to answer the query. Also, this overload has the same subtle assumption as the other overload.
|
inline |
Returns true if all items in the other set are present in this set, and this set has at least one item that is not in the other set.
This implementation assumes that if the two sets use different definitions of equality (different IEqualityComparer{T}s), that neither set contains duplicates from the point of view of the other set. If this rule is broken–meaning, if either of the sets were constructed with the comparer of the other set, that set would shrink– then the results of this method are unreliable. If both sets use the same comparer, though, you have nothing to worry about.
|
inline |
Returns true if all items in the other set are present in this set, and this set has at least one item that is not in the other set.
This method is costly if 'other' is not a set; a temporary set will be constructed to answer the query. Also, this overload has the same subtle assumption as the other overload.
|
inline |
Returns true if all items in this set are present in the other set.
myMinCount | Specifies the minimum number of items that this set contains (use 0 if unknown) |
|
inline |
Returns true if all items in the other set are present in this set.
|
inline |
Returns true if this set contains at least one item from 'other'.
|
inline |
Removes an item from the set.
|
inline |
Returns true if this set and the other set have the same items.
This implementation assumes that if the two sets use different definitions of equality (different IEqualityComparer{T}s), that neither set contains duplicates from the point of view of the other set. If this rule is broken–meaning, if either of the sets were constructed with the comparer of the other set, that set would shrink– then the results of this method are unreliable. If both sets use the same comparer, though, you have nothing to worry about.
|
inline |
Returns true if this set and the other set have the same items.
This method is costly if 'other' is not a set; a temporary set will be constructed to answer the query. Also, this overload has the same subtle assumption as the other overload.
|
inline |
Modifies the current set to contain only elements that were present either in this set or in the other collection, but not both.
xorDuplicates | Controls this function's behavior in case 'other' contains duplicates. If xorDuplicates is true, an even number of duplicates has no overall effect and an odd number is treated the same as if there were a single instance of the item. Setting xorDuplicates to false is costly, since a temporary set is constructed in order to eliminate any duplicates. The same comparer is used for the temporary set as for this set. |
Returns the change in set size (positive if items were added, negative if items were removed)
References Loyc.Collections.Add, and Loyc.Collections.Remove.
|
inline |
Thaws a frozen root node by duplicating it, or creates the root node if the set doesn't have one.
Since InternalSet{T} is a structure rather than a class, it's not immediately obvious what the happens when you copy it with the '=' operator. The InternalList{T} structure, for example, it is unsafe to copy (in general) because as the list length changes, the two (or more) copies immediately go "out of sync" because each copy has a separate Count property and a separate array pointer–and yet they will share the same array, at least temporarily, which can produce strange results.
It is mostly safe to copy InternalSet instances, however, because they only contain a single piece of data (a reference to the root node), and the root node only changes in two situations:
In the second case, when you have frozen a set with CloneFreeze(), all existing copies are frozen, and further changes affect only the specific copy that you change. You can also call Thaw() if you need to make copies that are kept in sync, without actually modifying the set first.
This method has no effect if the root node is already thawed.
|
inline |
Adds the contents of 'other' to this set.
thisComparer | The comparer for this set (not for 'other', which is simply enumerated). |
replaceIfPresent | If items in 'other' match items in this set, this flag causes those items in 'other' to replace the items in this set. |
References Loyc.Collections.Add.
|
static |
This is EqualityComparer{T}.Default, or null if T implements IReferenceComparable.
|
static |
An empty set.
This property comes with a frozen, empty root node, which Set{T} uses as an "initialized" flag.