Shareability
The idleness of the past holiday weeks brought to me a much needed insight: how important scalable shareability is for Enumerable<T> (if such an interface were to be of any use for high-compositionality scenarios).
Here is a simple example in which an enumerable named e is shared between two independent compositions:
Enumerable<T> e = ... some code ... Enumerable<U> uE = e.map(t2uMapper); Enumerable<V> vE = e.map(t2vMapper);
for(U u: uE) {
System.out.println("My Us: " + u + " ..."); } for(V v: vE) {
System.out.println("My Vs: " + v + " ..."); }
The goal is to ensure that both loops work properly and, more importantly, the whole thing is scalable: if we have a chain of hundreds of thousands of operations peppered with thousands of points where enumerables are being shared between independent pipelines, all the resulting enumerables (which may come and go at runtime and number in the range of millions or more) still work properly, efficiently and do not overflow the stack (assuming we have enough heap).
High composability and (by consequence) high shareability do not exist in .NET LINQ.
Without shareability, if enumerable A is composed with map() to produce enumerable B and then with another map() to produce C, then getting the iterator for C is easy:
Assuming reversed composition is present, then getting the iterator is straightforward:
Implementing reversed composition is not that arcane: since the internal support is based on a pipeline, .reversedOp() pushes the same operator as .op() but at the beginning of the pipeline instead of the end.
Because of of space consumption. If we have N elements from leaf to source, caching requires O(N) locations which is not that bad for a single node. Applying caching on each node raises space consumption to O(N2) which is not acceptable.
Limiting the cache to, say, 10000 elements brings the order back to O(N) but with a high constant: O(10000*N). As N can be in the range of hundreds of thousands, event the order constant counts.
One “solution” would be to keep a map at the choice points that tells:
If Enumerable<T> is a smarter, compose-able kind of Iterable<T>, then it cannot fare worse than that in one of its most powerful capacities: to produce, on demand, independent enumerating sessions that represent nothing else than set abstractions in an almost purely mathematical style.
What is shareability?
It is the ability to be shared as a separate “image”, while each “shared” image retains the properties of the original. In the case of Enumerable<T> this means participating to multiple operational compositions and still doing the expected job efficiently in terms of computation as well as memory and stack consumption.Here is a simple example in which an enumerable named e is shared between two independent compositions:
Enumerable<T> e = ... some code ... Enumerable<U> uE = e.map(t2uMapper); Enumerable<V> vE = e.map(t2vMapper);
for(U u: uE) {
System.out.println("My Us: " + u + " ..."); } for(V v: vE) {
System.out.println("My Vs: " + v + " ..."); }
What is the support for it?
Before discussing how to implement this in the EnumJ library, let us see what support we have out there for such a thing.Java 8 streams
The support for (any) shareability in Java 8 streams is none, since streams are non-shareable by design.Enumerator<T> of EnumJ
By default, EnumJ enumerators are not shareable but the library exposes the ShareableEnumerator<T> class which offers some shareability support, with the following caveats:- the shared access has to be somewhat coordinated otherwise the number of buffered elements can overflow the heap
- the Enumerator<T>.asShareable() operator which creates ShareableEnumerator<T> instances is not highly compose-able. This limits the number of sharing points we can have in a pipeline to a few thousands at best
IEnumerable<T> in .NET
.NET offers powerful LINQ operators that do pretty much everything that Java 8 streams and EnumJ enumerators do and in addition are shareable. Their shareability is marred by the same stack-overflow problem, though. Since they don’t pipe properly, one cannot compose 100000 times the Select() operator (for example) and avoid stack overflow.High composability and (by consequence) high shareability do not exist in .NET LINQ.
The problem
The problem is that, with shareability, the pipeline of operators is not a pipeline any more: it is an acyclic graph (a forest of trees that share branches and leaves).Without shareability, if enumerable A is composed with map() to produce enumerable B and then with another map() to produce C, then getting the iterator for C is easy:
- extract the enumerator of A
- apply the first map() to get enumerator of B
- apply the second map() to enumerator of B to get enumerator of C
The solution
“Thanks” to shareability, any node in the pipeline can have more than one successor which makes forward walking a moot case. Shareability does not change the uniqueness of the predecessor, though. This uniqueness can be exploited to build the iterator efficiently by walking the graph backwards.First solution: double traversal
The first solution is less optimal, but simpler:- traverse the graph backwards and collect the operators efficiently, in forward order (by pushing the operators at the front of a linked list, for example)
- get the enumerator of the source
- traverse the linked list forwards and apply the operators on the enumerator
- return the final enumerator as the iterator of the composed enumerable
Second solution: single traversal with reversed enumerator composition
The second solution is a little bit more involved, but more efficient in terms of computation and memory consumption. It requires to endow the classes that implement Enumerator<T> with reversed composition (explained below).Assuming reversed composition is present, then getting the iterator is straightforward:
- create a late-binding enumerator to return as a result
- traverse the graph backwards and apply the reversed counterparts of the operators encountered along the way
- when reaching the origin or an operator without reversed counterpart:
- extract the enumerator of the source or predecessor
- bind the enumerator to the late-binding enumerator from 1)
- return the late-binding enumerator as the result of the whole Enumerable<T>.enumerator() call
Reversed composition
Reversed composition is a kind of composition that produces the same result when the operators are applied in reverse. That is, .reversedSkip().reversedMap() is the same as .map().skip().Implementing reversed composition is not that arcane: since the internal support is based on a pipeline, .reversedOp() pushes the same operator as .op() but at the beginning of the pipeline instead of the end.
Comparison
Both solutions are O(N) in both computational complexity and memory consumption, but the second solution is better:- double traversal involves two traversals which may be too expensive for many operators (here we are speaking of hundreds of thousands of operators)
- double traversal requires a list to store the operators in the correct order before applying them. This may be too expensive:
- If the operators stay in the list then they occur both in the list and in the pipeline of the enumerator being constructed. This wastes memory
- If the operators get removed as they get applied to the enumerator being constructed, traversal is slower. This saves memory but wastes time
Bad optimisation: caching the list
Caching may seem tempting to optimise the first solution: given that the graph does not change from leaves upwards (although it may extend from leaves downwards) the linked list of operators does not change. So, why not caching it?Because of of space consumption. If we have N elements from leaf to source, caching requires O(N) locations which is not that bad for a single node. Applying caching on each node raises space consumption to O(N2) which is not acceptable.
Limiting the cache to, say, 10000 elements brings the order back to O(N) but with a high constant: O(10000*N). As N can be in the range of hundreds of thousands, event the order constant counts.
Another bad optimisation: resolve confusion with a hashed map
We showed that sharing confuses a forwards approach for Enumerable<T>.enumerator() because the algorithm doesn’t know which path to choose at the sharing point. If we have A –> B –> C –> D and A –> B –> E –> F, which way do we take at B when generating the enumerator for D: B –> C or B –> E?One “solution” would be to keep a map at the choice points that tells:
- if you want to generate for D, then choose B –> C
- if you want to generate for F, then choose B –> E
Why all the fuss
Because, unlike Iterator<T>, Iterable<T> is shareable by design.If Enumerable<T> is a smarter, compose-able kind of Iterable<T>, then it cannot fare worse than that in one of its most powerful capacities: to produce, on demand, independent enumerating sessions that represent nothing else than set abstractions in an almost purely mathematical style.
Comments
Post a Comment