Using iterators to generate data

Russ Weeks Fri, 29 Aug 2014 16:11:21 -0700

There are plenty of examples of using custom iterators to filter or combine
data at either the cell level or the row level. In these cases, the amount
of data coming out of the iterator is less than the amount going in. What
about going the other direction, using a custom iterator to generate new
data based on the contents of a cell or a row? I guess this is also what a
combiner does but bear with me...


The immediately obvious use case is parsing. Suppose one cell in my row
holds an XML document. I'd like to configure an iterator with an XPath
expression to pull a field out of the document, so that I can leverage the
distributed processing of the cluster instead of parsing the doc on the
scanner side.
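
A minimal sketch of the per-cell parsing step described above, using only
the JDK's javax.xml.xpath package. The class name, method name and sample
document are hypothetical, and the iterator wrapper itself is not shown;
the assumption is that an iterator would call something like this on the
cell's value and emit the result as a new cell.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathFieldExtractor {

    // Hypothetical helper: evaluates an XPath expression against an XML
    // document held as a string (for example, a cell value) and returns
    // the string value of the first matching node.
    public static String extract(String xml, String expression)
            throws XPathExpressionException {
        // XPath objects are not guaranteed thread-safe, so build one per call.
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XML document stored in one cell of the row.
        String cellValue = "<order><id>42</id><customer>acme</customer></order>";
        System.out.println(extract(cellValue, "/order/customer"));
    }
}
```

The expression would presumably be passed in as an iterator option, so the
same code could pull out different fields per scan.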

I'm sure there are constraints or things to watch out for. Does anybody
have any recommendations here? For instance, would the generated cells
have to be in the same row as the input cells?

I'm using MapReduce to satisfy all these use cases right now but I'm
interested to know how much of my code could be ported to Iterators.

Thanks!
-Russ
