Saturday 7 February 2015

Enumerators in Java 8: when streams aren’t enough

I decided to write a small library for enumerators in Java 8. What I call an enumerator is, in fact, a special kind of iterator endowed with properties such as compose-ability, much like Java 8 streams and .NET LINQ operators already have.

I think of enumerators as a specialized type of generator which, in turn, is a watered-down version of a coroutine.

The intent is obvious: with enumerators, generating potentially infinite sequences of potentially infinite data structures of a potentially unlimited set of topologies defined entirely at runtime becomes a reality. This is a luxury that programmers in high-level functional or logic languages have been enjoying for decades.

Aren’t Java 8 streams enough?

No, they aren’t. Streams are a great addition brought in by Java 8 and they blend wonderfully with Java 8 lambda expressions. Moreover, they have strong parallel support, which makes them ideal for exploiting multi-core architectures with minimal effort.

However, there are scenarios where they are either too strong or insufficient:

  • They have been designed with parallelism in mind. While this is a good thing in general, parallelism support makes them quite heavy for very simple tasks where parallelism is not needed.
  • Their parallel nature makes them lack certain operations which are quintessentially sequential (a sketch of one of them, takeWhile, follows this list):
    • Stream<T>.takeWhile(Predicate<T>) which takes elements while a predicate evaluates to true.

      OBS: we may also call this Stream<T>.limitWhile(Predicate<T>) if we are to stay in line with Stream<T>.limit(long)

    • Stream<T>.zip(Stream<T>) which binds together two streams and produces a stream of pairs
    • Stream<T>.map(BiFunction<Integer,T,R>) which models Select<T,R>(this IEnumerable<T>, Func<T,int,R>) from .NET LINQ

  • They lack pipelining, which makes them unsuitable for compositions in large numbers.

    In the case of Stream.concat(Stream<T>, Stream<T>), for example, the JDK documentation warns the user against repeated concatenation as it may lead to deep call stacks or even stack overflows. This is caused by the lack of pipelining, itself a consequence of the parallel nature of streams. I detail pipelining in a section below.

  • They are not lazy in their arguments. For example, we have Stream.concat(Stream<T>, Stream<T>) but there is no Stream.concat(Supplier<Stream<T>>, Supplier<Stream<T>>). As I explained in a previous post, being super-lazy is sometimes necessary. Java 8 streams, as they are implemented today, are lazy in the values they yield but not lazy in their input parameters.
  • They are rather hard to customize. Because of their parallel nature, they don’t rely on iterators (which are strictly sequential) but on spliterators (the name is a contraction of split iterators). Spliterators offer support for splitting the iterated sequence to allow parallelism, but customizing them is not the easiest thing to do.
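
As an aside, sequential operations like takeWhile are not hard to build over plain iterators. Here is a minimal sketch (the helper name and shape are mine, not part of the JDK): it yields elements while the predicate holds, then stops for good.

import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

public final class TakeWhileSketch {

    // Yields elements from source while the predicate holds, then stops.
    public static <T> Iterator<T> takeWhile(Iterator<T> source, Predicate<? super T> p) {
        return new Iterator<T>() {
            private T next;        // buffered element, valid only when ready is true
            private boolean ready; // true when next holds an unconsumed element
            private boolean done;  // true once the predicate has failed

            @Override
            public boolean hasNext() {
                if (ready) return true;
                if (done || !source.hasNext()) return false;
                T candidate = source.next();
                if (!p.test(candidate)) {
                    done = true;   // stop for good, even if later elements would pass
                    return false;
                }
                next = candidate;
                ready = true;
                return true;
            }

            @Override
            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                ready = false;
                return next;
            }
        };
    }
}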

What operations to support?

Enumerators need to satisfy some characteristics to be really useful:

  • Compatibility with Java iterators. In fact, Enumerator<T> should be an interface extending Iterator<T>.
  • Ease of creation / conversion. Converting an enumerator from and to iterators, iterables, streams, spliterators and data suppliers should be easy and straightforward, in both eager and lazy form.
  • Fluent compose-ability. Just like streams, it should be easy to compose operations fluently, as in Enumerator.on(1, 2, 3).map(i -> i*3).limit(2). In other words, programming with enumerators should be as pleasant as programming with Java 8 streams or .NET LINQ operators.
  • Support for pipelining. This is an important feature that Java 8 streams currently don’t support.

    To exemplify, the following code produces in en an enumerator of three million elements that will not overflow the stack:

    Enumerator<Integer> en = Enumerator.empty();
    for(int i = 0; i < 1000000; ++i) {
        en = en.concat(Enumerator.on(i, i+1, i+2));
    }

    If you try this with streams, you end up with one million stream objects calling into each other, resulting in a stack of one million frames when getting the last three elements. A big kaboom (stack overflow) is guaranteed (see the sketch after this list).

  • Stream operations and .NET LINQ operators together. Enumerators should support all the stream and .NET LINQ operations that make sense for them, in the same fluent syntax. When there is a name or convention clash between Java 8 streams and .NET LINQ operators, the Java 8 conventions should take precedence.
  • Support for controlled element extraction. This is another important feature that neither Java 8 streams nor .NET LINQ operators support. Controlled element extraction is the ability to build an enumerator from a set of iterators of the same type according to a given choice algorithm. I detail controlled element extraction in a section below.
  • Efficient resource management. Composed enumerators should dump (dereference, allow garbage collection of) any resource as soon as it is exhausted or no longer necessary. For example, Enumerator<T>.concat(first).concat(second).concat(third) should dump first once first yields its last element, then dump second once second yields its last element, finally dumping third when third yields its last element.

    This is crucial when enumerators get composed at runtime, leading to potentially infinite topologies. If early release of resources doesn’t take place, exhausted enumerator objects linger in memory even though they have no more data to offer.
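
To make the stream comparison above concrete, here is what the equivalent stream-based loop would look like; per the JDK's own warning about repeated concatenation, running it should end in the promised StackOverflowError:

import java.util.stream.Stream;

public final class StreamConcatOverflow {
    public static void main(String[] args) {
        Stream<Integer> s = Stream.empty();
        for (int i = 0; i < 1000000; ++i) {
            s = Stream.concat(s, Stream.of(i, i + 1, i + 2));
        }
        // The terminal operation has to walk one million nested streams:
        // expect a StackOverflowError here.
        System.out.println(s.count());
    }
}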

Pipelining

Enumerator pipelining is the ability to transform a sequence of enumerator operations into a single enumerator that applies the operations in a pipeline.

Let’s consider the following example:

 
Enumerator.rangeInt(0, 10)
          .map(i -> i * 3)
          .filter(i -> 0 == i % 2)
          .map(i -> i + 5)
          .filter(i -> i < 20)

This obviously produces the sequence 5, 11, 17. But let’s have a look at how it produces that sequence:

  1. It produces an enumerator from 0 to 9 – THEN -
  2. It produces an enumerator that takes every element from 1) and it multiplies it by 3 – THEN -
  3. It produces an enumerator that takes every element from 2) and it rejects the odd ones – THEN -
  4. It produces an enumerator that takes every element from 3) and it adds 5 to it – THEN -
  5. It produces an enumerator that takes every element from 4) and it rejects the ones above 19

In other words, for four operations upon the original enumerator it creates four extra enumerators. But it doesn’t stop there: getting elements out of the final enumerator involves a lot of work, too:

  1. Getting an element from 5) requires an element from 4) – BUT -
  2. Getting an element from 4) requires an element from 3) – BUT -
  3. Getting an element from 3) requires an element from 2) – BUT -
  4. Getting an element from 2) requires an element from 1) which is the original element

This means that to obtain each final element we call five functions nested in each other. This doesn’t matter for a small number of compositions (five nested calls amount to almost nothing) but if the compositions are generated dynamically at runtime and they count in the thousands, tens of thousands or even a million (as we saw above), then we may end up with a lot of intermediate enumerator objects and with huge call stacks.

While storing one million objects in memory is not a big deal nowadays, having one million nested function calls still is.

The problem can be alleviated by observing that the above operations don’t need extra enumerators: they can be internalized and pipelined inside a special enumerator that knows how to apply them in sequence. More precisely, this means that:

  • Any operation like Enumerator.map(), Enumerator.filter() and the like does not produce a plain Enumerator but a PipelineEnumerator (a special type of enumerator that extends Enumerator with pipelining capabilities)
  • Any operation like PipelineEnumerator.map(), PipelineEnumerator.filter() and the like does not produce a new PipelineEnumerator but internalizes the applied operator (UnaryOp<T> for map(), Predicate<T> for filter()) within an internal pipeline and returns the same PipelineEnumerator object to the caller
  • Extracting an element from a PipelineEnumerator means extracting an element from its source, applying all the operations in the pipeline upon it and, at the end, if the element is still returnable, returning it in its final form

With pipelining, the piece of code from above produces not four extra enumerators and a stack of four plus one nested calls, but a single pipeline enumerator with four extra operators internalized in its pipeline. Extracting an element from the pipelined enumerator will not produce a call stack of depth 5 (five), but of depth up to 2 (two): one frame for the pipeline enumerator and one for the source or for the pipelined operators (depending on the position in the pipeline).

Most importantly, if we have one million extra operations the stack stays at the same depth of 2 (two) and the danger of stack overflow is driven away.

OBS: the size of the internal pipeline will still be one million. But as shown above, one million objects in memory is feasible while a call stack of one million frames is not.
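
To make the idea concrete, here is a minimal sketch of a pipelining enumerator over plain iterators (my illustration, not the library’s actual code): operations accumulate in a list and are applied iteratively, so the call stack stays flat no matter how many map()/filter() calls were chained.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.function.Predicate;

final class PipelineSketch<E> implements Iterator<E> {

    private static final Object SKIP = new Object(); // marks filtered-out elements

    private final Iterator<?> source;
    private final List<Function<Object, Object>> stages = new ArrayList<>();
    private Object nextElem = SKIP;

    private PipelineSketch(Iterator<?> source) {
        this.source = source;
    }

    static <E> PipelineSketch<E> over(Iterator<E> source) {
        return new PipelineSketch<>(source);
    }

    // map() creates no new enumerator: it internalizes the function ...
    @SuppressWarnings("unchecked")
    <R> PipelineSketch<R> map(Function<? super E, ? extends R> f) {
        stages.add(x -> f.apply((E) x));
        return (PipelineSketch<R>) (PipelineSketch<?>) this;
    }

    // ... and filter() internalizes the predicate, SKIP rejecting an element.
    @SuppressWarnings("unchecked")
    PipelineSketch<E> filter(Predicate<? super E> p) {
        stages.add(x -> p.test((E) x) ? x : SKIP);
        return this;
    }

    @Override
    public boolean hasNext() {
        while (nextElem == SKIP && source.hasNext()) {
            Object x = source.next();
            for (Function<Object, Object> stage : stages) { // iterative: flat stack
                x = stage.apply(x);
                if (x == SKIP) break; // rejected by a filter stage
            }
            nextElem = x;
        }
        return nextElem != SKIP;
    }

    @Override
    @SuppressWarnings("unchecked")
    public E next() {
        if (!hasNext()) throw new NoSuchElementException();
        E result = (E) nextElem;
        nextElem = SKIP;
        return result;
    }
}

Every map() and filter() call returns the same object, so a chain of one million operations still means a single enumerator and a call stack of constant depth.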

While not necessarily a pipelining operation, concatenation can go the same route: a concat() operation may produce a PipelineConcatEnumerator, the difference being that a PipelineConcatEnumerator accumulates the iterators to concatenate (it pipelines them) and keeps track of the current iterator when extracting elements. When the current iterator is exhausted, it moves on to the next iterator in the pipeline, and so on.
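
The same idea for concatenation, again as a sketch with hypothetical names: iterators queue up in a deque instead of nesting, exhausted sources get dropped, and the stack depth stays constant.

import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class ConcatSketch<E> implements Iterator<E> {

    private final ArrayDeque<Iterator<E>> sources = new ArrayDeque<>();

    // concat() does not nest enumerators: it just queues the new source.
    ConcatSketch<E> concat(Iterator<E> next) {
        sources.addLast(next);
        return this;
    }

    @Override
    public boolean hasNext() {
        while (!sources.isEmpty() && !sources.peekFirst().hasNext()) {
            sources.pollFirst(); // dump the exhausted iterator, allowing GC
        }
        return !sources.isEmpty();
    }

    @Override
    public E next() {
        if (!hasNext()) throw new NoSuchElementException();
        return sources.peekFirst().next();
    }
}

Note how dropping exhausted iterators from the deque also delivers the early resource release described earlier.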

Controlled element extraction

Controlled element extraction is the ability to build an enumerator from a set of existing iterators by choosing at runtime what element to extract next. The signature of the operation is obvious:


public static <E> Enumerator<E> choiceOf(IntSupplier indexSupplier,
                                         Iterator<E> first,
                                         Iterator<E> second,
                                         Iterator<E>... rest);

The algorithm for next() of the resulting choice enumerator is simple:

  1. Call indexSupplier.getAsInt() and retrieve the index of the iterator to extract from next. If the index is out of bounds, bring it into bounds by a modulo operation
  2. If that particular iterator has elements, return its next element – OTHERWISE -
  3. Go round robin until you find an iterator that has a next element and return that element

The algorithm for hasNext() is equally simple:

  • If any iterator in the set has elements (hasNext() returns true) then return true – OTHERWISE -
  • Return false
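
Here is a minimal sketch of these two algorithms over plain iterators (my illustration, using a List instead of varargs):

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntSupplier;

public final class ChoiceSketch {

    public static <E> Iterator<E> choiceOf(IntSupplier indexSupplier,
                                           List<Iterator<E>> sources) {
        return new Iterator<E>() {
            @Override
            public boolean hasNext() {
                // True if any participant still has elements.
                return sources.stream().anyMatch(Iterator::hasNext);
            }

            @Override
            public E next() {
                // Bring the supplied index into bounds by modulo.
                int i = Math.floorMod(indexSupplier.getAsInt(), sources.size());
                // If the chosen iterator is exhausted, go round robin.
                for (int tried = 0; tried < sources.size(); ++tried) {
                    Iterator<E> it = sources.get((i + tried) % sources.size());
                    if (it.hasNext()) return it.next();
                }
                throw new NoSuchElementException();
            }
        };
    }
}

Feeding it something like new java.util.Random(42)::nextInt as the index supplier gives the controlled randomized extraction discussed below.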

While this kind of extraction may not seem like a big deal, it actually is:

  • It is a form of generalized / intertwined concatenation. Indeed, if indexSupplier returns 0 while first has elements, then returns 1 while second has elements and so on, then the resulting enumerator is nothing but the concatenation of the participants
  • It allows controlled randomized generation. If indexSupplier has a controlled randomizer behind it, then the resulting enumerator will pick elements in a controlled random manner. If this is combined with map() and the participants themselves are built the same way, this allows for generating a potentially infinite sequence of objects of potentially infinite (but controlled) diversity.

    The ability to do this has gigantic implications when trying to make recursive algorithms robust (parsers, code generators, tree- or heap-walkers, to name a few). Producing test data for that kind of algorithm is almost as hard as writing the algorithm itself, as building the test data requires a similar type of reasoning, but in reverse.

    Controlled element extraction upon enumerators that have themselves been created by controlled element extraction makes this easy and intuitive. OBS: if this seems arcane, it is. I might dedicate an entire blog post to this topic in the future.

What about element insertion?

Enumerators do not support element insertion. They are, in fact, highly compose-able abstractions over sequences whose elements are produced successively. This compose-ability makes the case for element insertion moot.

Why all the fuss

I started this diversion on enumerators because I needed to generate infinite data structures of an infinite (but controlled) variety and structure. I use that kind of data to test the project I am currently working on (and am not disclosing yet). Though designed for testing purposes, enumerators quickly proved useful for the project itself.

Where and when?

I do not intend to keep this enumerator library hidden. Currently I host it privately on BitBucket but I’ll soon make it public and open source and deploy it to Maven Central. It is fully implemented and tested with 100% coverage (if we are to trust the JaCoCo tool) but it lacks proper Javadocs and I haven’t decided upon a licence yet.

In other words, final touches and bureaucratic details prevent me from bringing it to the fore tomorrow or from giving a timeline.

What’s next

The next project of this kind might be Java 8 flows, i.e. a more flexible way to deal with Java 8 streams.
