Saturday, 9 May 2015

Caching enumerations


Enumerators are particularly well suited to specifying sets of elements by composing constraints, generators or any combination thereof. Enumerators have a declarative flavour that almost matches that of mathematics. For example, a set definition like this one:
A = { x in N | 0 <= x <= 10 and x is even }
can nicely be expressed as:
Enumerator<Integer> e = Enumerator.rangeInt(0, 11).filter(x -> 0 == x % 2);
or, in Enumerable parlance:
Enumerable<Integer> e = Enumerable.rangeInt(0, 11).filter(x -> 0 == x % 2);
the difference being that an Enumerable<E> can participate in for(each) statements and can be enumerated upon more than once.
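For instance, the same enumerable can drive several for(each) loops, one after another (a minimal usage sketch, assuming Enumerable<E> implements Iterable<E>, as its for(each) support suggests):

Enumerable<Integer> evens = Enumerable.rangeInt(0, 11).filter(x -> 0 == x % 2);
for (int x : evens) { System.out.print(x + " "); }   // 0 2 4 6 8 10
for (int x : evens) { System.out.print(x + " "); }   // a second, independent enumeration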

The problem

All is great for a few compositions, but enumerators are designed to support large numbers of compositions, in the range of hundreds of thousands or for as long as memory lasts. The sheer existence of that many operations implies that a single original element undergoes hundreds of thousands of transformations before coming out as an enumerated result.
There is nothing we can do: if the operations are really needed then they must be carried out, period.
The problem arises when we have to enumerate repeatedly over a sequence that doesn’t change: if the transformations have no side effects, then re-computing the same result means that hundreds of thousands of transformations are applied uselessly, over and over. Such a waste!
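To see the waste concretely, here is a small sketch in which the filter predicate stands in for a long chain of transformations and a counter measures how often it runs; every extra enumeration of the unchanged sequence pays the full price again (the counter is instrumentation added for this example only, not part of the library):

import java.util.concurrent.atomic.AtomicLong;

AtomicLong predicateCalls = new AtomicLong();
Enumerable<Integer> evens =
        Enumerable.rangeInt(0, 1_000)
                  .filter(x -> { predicateCalls.incrementAndGet();  // count the work done
                                 return 0 == x % 2; });
for (int x : evens) { }   // predicateCalls.get() == 1_000
for (int x : evens) { }   // predicateCalls.get() == 2_000: the same work, done all over again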
The solution comes in the form of the age-old method of caching.

The solution

Caching, when done carefully, saves a great deal of time by:
  • storing locally what is remote
  • storing in memory what is on slow media
  • storing for fast retrieval what is computed too slowly
Our case falls in the third category. Before showing implementation details, let us first overview a couple of sub-optimal solutions.

Bad solution: early collection

The simplest way to cache is to collect the elements into a list (or an equivalent structure) and then reuse the list. This is bad for several reasons:
  • element collection is an eager operation which breaks the lazy nature of enumerators
  • element collection is a terminal operation which breaks the operator pipeline thus introducing superfluous boilerplate code
  • element collection is an all-or-nothing operation thus being prone to storing too much
These points are not a big deal when composing just a few operators and/or when the original set is small. But Enumerator<E> exists specifically for large sets and many compositions, which makes early collection of elements a bad candidate for cache implementation.
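For reference, the early-collection approach looks roughly like this (a sketch; the list is filled by hand to make the eagerness obvious):

import java.util.ArrayList;
import java.util.List;

Enumerable<Integer> evens =
        Enumerable.rangeInt(0, 1_000_000).filter(x -> 0 == x % 2);
// eager and all-or-nothing: the whole sequence is materialised up front
List<Integer> cache = new ArrayList<>();
for (int x : evens) {
    cache.add(x);
}
// later enumerations reuse the list, but the lazy pipeline is gone
for (int x : cache) { /* ... */ }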

Sub-optimal solution: enumerator sharing

Another method for caching is sharing enumerators via the ShareableEnumerator<E> class. Given that each resulting SharingEnumerator<E> works independently while the origin is enumerated only once, caching is assured. In fact, this sharing mechanism is based on an efficient internal caching scheme: the cached elements get evicted as soon as no sharing enumerator needs them anymore.
Despite its appeal, this solution is still not good enough:
  • it is based on Enumerator<E> and not Enumerable<E>: caching ends when the whole enumeration ends and does not survive across enumerating sessions, so it remains a strictly local and volatile affair. This forfeits persistence, which is the main advantage of caching
  • it requires that sharing ends before enumeration starts: no new sharing enumerator can be spawned while an existing sharing enumerator enumerates. This is in contrast with enumerations over a list, for example, where we can start an enumeration while another one is under way
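In code, the sharing approach looks roughly like this (a sketch only: the asShareable() and share() names are my shorthand here and may differ from the actual ShareableEnumerator<E> API):

// spawn all sharing enumerators first ...
ShareableEnumerator<Integer> origin =
        Enumerator.rangeInt(0, 1_000)
                  .filter(x -> 0 == x % 2)
                  .asShareable();             // assumed operator name
Enumerator<Integer> first  = origin.share();  // assumed method name
Enumerator<Integer> second = origin.share();
// ... then enumerate: the origin is traversed once and cached elements are
// evicted as soon as every sharing enumerator has consumed them
while (first.hasNext())  { first.next(); }
while (second.hasNext()) { second.next(); }
// spawning another sharing enumerator at this point is not allowed:
// sharing had to end before enumeration started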

Good solution: a cached enumerable

The best solution is to have another type of enumerable, a cached enumerable, which:
  • is lazy: a cached enumerable, being an enumerable, is lazy by design
  • is persistent across enumerating sessions: a cached enumerable, being an enumerable, persists across enumerating sessions. In fact, it creates them by spawning enumerators
  • allows sharing and enumerating to intertwine: a cached enumerable, when spawning enumerators, would store the elements internally in a sequence and then have the returned enumerators iterate over that sequence instead of the source. This way the source gets enumerated only once and the spawned enumerators can emerge and disappear at will. No separation is imposed between sharing and enumerating
  • allows clearing the cache: a cached enumerable must be able to drop its cache on demand, usually when the original source of data has changed
  • supports limits: a cached enumerable should be able to limit the size of its cache and to gracefully abandon caching when the limit is reached.

    Abandoning caching is a must because enumerators cannot cache partially: a partial cache would involve enumerating the cached elements twice in order to reach what’s after, because enumerators don’t have random/indexed access. For obvious reasons, enumerating twice will hit performance worse than not caching at all.

    On the other hand, the user should be allowed to resize the limit so that caching can resume under better odds. This requires notifying the user when the limit is hit. The Observable pattern is the best candidate for implementing such behaviour.
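    A hypothetical sketch of that notification mechanism (illustrative only, not the actual EnumJ API):

import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// The cache keeps a list of observers; when its size hits the limit it
// notifies them and falls back to plain, uncached enumeration. An observer
// may react by raising the limit so that caching can be resumed.
class CacheLimitSubject {
    private final List<LongConsumer> observers = new ArrayList<>();

    void addObserver(LongConsumer observer) {
        observers.add(observer);
    }

    void limitReached(long limit) {
        observers.forEach(o -> o.accept(limit));
    }
}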
All the requirements can be met in the EnumJ library by a new public class aptly named CachedEnumerable<E>, a new .asCached() operator and a clever internal enumerator class that knows how to deal with the intricacies of caching while enumerating.
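Usage would then be as simple as appending the new operator to the pipeline (a sketch, assuming CachedEnumerable<E> is itself an Enumerable<E>, as described above):

Enumerable<Integer> evens =
        Enumerable.rangeInt(0, 1_000_000)
                  .filter(x -> 0 == x % 2)
                  .asCached();
for (int x : evens) { /* the first pass enumerates the source and fills the cache */ }
for (int x : evens) { /* the second pass replays the cached elements */ }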
I will reveal more implementation details of CachedEnumerable<E> in a future post.

Why all the fuss

Because the cost of applying the composed operations can be truly prohibitive. As Enumerable<E> allows for repeated enumerations, being able to cache when many operators and/or large sets of elements are involved is an absolute must.
