Thursday, 30 July 2015

The power of documentation

This blog post is motivated by the fact that I’ve just published the latest draft of the JavaDoc documentation for the EnumJ library.

Why document?

An old piece of software wisdom says that we should document what we write because we will still want to understand it six months later.

Very true: even the best-written software can slip from memory, and well-written documentation helps recover the original intent of the code we are staring at now. In this respect Java really shines: the JavaDoc tool is very easy to use and is nicely integrated with Maven and the major IDEs.

But there is one more reason for documenting thoroughly and that is … code review.

Self code reviews

The power of code reviews is well known – especially when accompanied by adequate tools.

Nevertheless, they still suffer from unequal knowledge of the code being reviewed: one developer is the author, the other isn’t. The reviewer’s suggestions may therefore be uninformed and, sometimes, of lesser value. Perhaps the best-informed reviews are self code reviews, in which the developer reviews his/her own code, but in this case the whole process suffers from self-validating bias.

This is when documentation comes into play. If the developer documents critically what he/she has written, then the whole documentation process also becomes a code review – and a very powerful one.

For illustration purposes, I’ll show a few significant improvements I’ve implemented quickly over the past few days – all of them being a result of reflecting one more time upon the code while documenting.

Note: this may require consulting the documentation mentioned above. If the next three sections seem too foreign, just read the titles and jump to the Conclusion below.

Better ShareableEnumerator<E>

This is a really significant improvement. Before, SharingEnumerator<E> was a public class iterating over a list of SharedElementWrapper<E>. The whole thing was complex and error-prone.

Now SharedElementWrapper<E> no longer exists and SharingEnumerator<E> is package-private. Moreover, ShareableEnumerator<E> internally uses a CachedEnumerable<E>, which not only provides a simple mechanism for sharing (the shared buffer is the caching buffer) but also lets the spawned sharing enumerators act concurrently, something that was not possible before.
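
To make the “shared buffer is the caching buffer” idea concrete, here is a minimal sketch using only the standard library. It is not EnumJ’s code: SharedBuffer and share() are hypothetical names invented for this illustration, and the synchronization is deliberately coarse.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch: one source iterator shared through a growing cache. */
final class SharedBuffer<E> {
    private final Iterator<E> source;
    private final List<E> cache = new ArrayList<>();

    SharedBuffer(Iterator<E> source) {
        this.source = source;
    }

    /** Pulls from the source until the cache covers index, if possible. */
    private synchronized boolean fill(int index) {
        while (cache.size() <= index && source.hasNext()) {
            cache.add(source.next());
        }
        return index < cache.size();
    }

    private synchronized E get(int index) {
        return cache.get(index);
    }

    /** Each call returns an independent iterator over the same cached elements. */
    Iterator<E> share() {
        return new Iterator<E>() {
            private int position; // private cursor; the cache itself is shared

            @Override
            public boolean hasNext() {
                return fill(position);
            }

            @Override
            public E next() {
                if (!fill(position)) {
                    throw new NoSuchElementException();
                }
                return get(position++);
            }
        };
    }
}
```

Each sharing iterator keeps only its own position, so the iterators can advance at different speeds while the source is consumed exactly once. A real implementation would also have to decide when cached elements can be released, which this sketch ignores.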

Aggregated map operators

This is a significant improvement, too. Before, each map() call created a new MapPipeProcessor<E,R> in the pipeline.

As one can easily see, AbstractPipeProcessor<E,R> (and its children) is a costly class, with many methods to discern the state of the operators in a uniform way. This is absolutely necessary to flatten the operations and obtain the massive scalability of EnumJ over Java 8 streams, but it comes at a cost: the state must be explicitly represented in each pipe processor.

Calling all the AbstractPipeProcessor<E,R> methods is unnecessary when we know that many map() operations are applied immediately one after another. As map() is the most frequent Stream-like operation, it makes sense to aggregate each sequence of map() calls into a single one, i.e. to produce a single MapPipeProcessor<E,R> for any sequence of adjacent map() compositions, no matter how long.

This speeds up the processing tremendously, because the composition is internal to MapPipeProcessor<E,R> and not managed by PipeEnumerator<E>.
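
The aggregation can be sketched with plain function composition. The code below is an illustration under my own assumptions, not EnumJ’s implementation: FusingPipeline, its nested MapStage and the DROPPED marker are hypothetical names, and the untyped Object signatures only keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical pipeline that fuses adjacent map() calls into one stage. */
final class FusingPipeline {
    /** Marker returned by a stage that drops the element. */
    static final Object DROPPED = new Object();

    /** A stage holding one mapping function, possibly composed of many. */
    private static final class MapStage implements Function<Object, Object> {
        Function<Object, Object> mapper;

        MapStage(Function<Object, Object> mapper) {
            this.mapper = mapper;
        }

        @Override
        public Object apply(Object value) {
            return mapper.apply(value);
        }
    }

    private final List<Function<Object, Object>> stages = new ArrayList<>();

    FusingPipeline map(Function<Object, Object> f) {
        Function<Object, Object> last =
                stages.isEmpty() ? null : stages.get(stages.size() - 1);
        if (last instanceof MapStage) {
            // Adjacent map(): compose into the existing stage, add nothing.
            MapStage stage = (MapStage) last;
            stage.mapper = stage.mapper.andThen(f);
        } else {
            stages.add(new MapStage(f));
        }
        return this;
    }

    FusingPipeline filter(Predicate<Object> p) {
        // A non-map stage breaks the run of adjacent map() calls.
        stages.add(value -> p.test(value) ? value : DROPPED);
        return this;
    }

    int stageCount() {
        return stages.size();
    }

    Object process(Object value) {
        Object current = value;
        for (Function<Object, Object> stage : stages) {
            current = stage.apply(current);
            if (current == DROPPED) {
                return DROPPED;
            }
        }
        return current;
    }
}
```

With this sketch, two map() calls followed by a filter() and another map() yield three stages rather than four: the pipeline only ever sees one stage per run of adjacent maps, and the per-element work inside a run is a chain of plain function calls.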

Improved onceOnly() processing

The Enumerable<E> interface has a onceOnly() method which tells whether the enumerable produces at most one enumerator. It is essential to know this: enumerators are not repeatable, so an enumerable encapsulating an enumerator must be “once only”.

As an Enumerable<E> may be the result of massive compositions of a diverse set of iterables, some “once only” and some not, it is necessary to traverse the graph backwards to calculate the value of this flag. Caching is important, so Lazy<T> (extending Apache Commons’ LazyInitializer<T>) is key. The problem is that checking the value may trigger costly collateral operations, so it may be necessary to check whether a Lazy<T> is initialized without initializing it.

Now Lazy<T> has an isInitialized() method that returns the state of initialization efficiently, without context switches. My previous article explains the mechanics of it.
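
For readers without the previous article at hand, here is one way such a check can work. This is a standard-library sketch, not the actual Lazy<T> (which extends Apache Commons’ LazyInitializer<T>): LazySketch is a hypothetical name, and the volatile flag is my assumption about how a lock-free check could be done.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a lazy holder with a cheap initialization check. */
final class LazySketch<T> {
    private final Supplier<T> supplier;
    private volatile boolean initialized; // read without taking any lock
    private T value;

    LazySketch(Supplier<T> supplier) {
        this.supplier = supplier;
    }

    /** Computes the value on first use; later calls return the cached value. */
    T get() {
        if (!initialized) {
            synchronized (this) {
                if (!initialized) {
                    value = supplier.get();
                    // Set the flag after the value so readers that see
                    // initialized == true also see the value.
                    initialized = true;
                }
            }
        }
        return value;
    }

    /** Reports the state without triggering the (possibly costly) supplier. */
    boolean isInitialized() {
        return initialized;
    }
}
```

The check is a single volatile read, so it never blocks and never runs the supplier. A caller computing a flag such as onceOnly() can therefore ask whether a cached value already exists before deciding to traverse anything.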

Conclusion

These are three improvements to a library that has been extensively tested for correctness and performance, all three coming solely from reviewing the code with a critical eye while writing its documentation. The process is so powerful that I dare say these improvements (by no means trivial) would otherwise have taken weeks or even months of direct usage after release to surface, with the disadvantage that, once released, the code is harder to change.

That is why I maintain that writing extensive documentation is not just a way of explaining what a piece of software does: it can also serve as a tool for a second, deep and highly revealing self code review of the very same piece of software.
