There are plenty of examples of using custom iterators to filter or combine data at either the cell level or the row level. In these cases, the amount of data coming out of the iterator is less than the amount going in. What about going the other direction: using a custom iterator to generate new data based on the contents of a cell or a row? I guess this is also what a combiner does, but bear with me...
The immediately obvious use case is parsing. Suppose one cell in my row holds an XML document. I'd like to configure an iterator with an XPath expression to pull a field out of the document, so that I can leverage the distributed processing of the cluster instead of parsing the doc on the scanner side. I'm sure there are constraints or things to watch out for. Does anybody have any recommendations here? For instance, would the generated cells have to be in the same row as the input cells? I'm using MapReduce to satisfy all these use cases right now, but I'm interested to know how much of my code could be ported to iterators.

Thanks!
-Russ
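
P.S. To make the question concrete, here is a minimal sketch of the per-cell step I have in mind, written as plain Java with the JDK's built-in XPath support. It deliberately leaves out the iterator API itself; the class name `XPathFieldExtractor` and the method `extract` are hypothetical, and where the extracted value would be written (same row, new column?) is exactly the open question.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: the transformation an iterator would apply to one cell value.
// This is not part of any iterator framework; it only shows the parsing step.
public class XPathFieldExtractor {

    // Evaluates the XPath expression against the XML text held in a cell and
    // returns the string value of the first match (empty string if nothing matches).
    public static String extract(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Covers both a malformed expression and unparseable XML.
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Example cell value and expression, both made up for illustration.
        String cellValue = "<order><id>42</id><customer>alice</customer></order>";
        System.out.println(extract(cellValue, "/order/customer"));
    }
}
```

In an iterator, the expression would presumably arrive through the iterator's configuration options, and the returned string would become the value of a generated cell.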
