Suppose you have a workflow like the following (hopefully not too
uncommon):
* "main table" - billions of rows, each with order of magnitude 100 columns
* Ternary classifier that annotates each row in the main table with one
of three labels - say A, B, and C. This analytic also writes every
label to an index table of the form:
label : rowId
to facilitate lookups by label.
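
For concreteness, the classifier's index writes look roughly like this
(just a sketch - the table name, "idx" column family, and empty value
are illustrative, not my exact schema):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// After classifying a main-table row, write one index entry keyed by
// the label, with the main-table rowId in the column qualifier.
// (In practice the BatchWriter is of course reused across rows.)
void writeIndexEntry(Connector connector, String label, String rowId)
    throws Exception {
  BatchWriter writer =
      connector.createBatchWriter("indexTable", new BatchWriterConfig());
  Mutation m = new Mutation(new Text(label));  // row = label (A, B, or C)
  m.put(new Text("idx"), new Text(rowId), new Value(new byte[0]));
  writer.addMutation(m);
  writer.close();
}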
Now, suppose you want to run another analytic over all rows with label
A, preferably using MapReduce. It seems the options are:
1. Create a scanner that retrieves all A entries from the index table;
convert those row IDs into ranges for an AccumuloInputFormat job; launch
a map-only MapReduce job (sketched after this list). Con: the driver
program needs enough memory to hold the entire range list.
2. A MapReduce job over the index table, with a reduce phase in which
each reducer receives a collection of row IDs, retrieves its assigned
rows from the main table, and runs over them (sketched at the end of
this message).
3. Run over the entire main table with a naive filter that checks the
classification label. Con: touches every row, most of which won't match.
4. AccumuloMultiTableInputFormat, Filters/Iterators - these don't seem
appropriate here.
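
To make #1 concrete, here is roughly what I mean (only a sketch against
the 1.5-style mapreduce API; "indexTable", "mainTable", and the column
layout are placeholders, and I've omitted the usual setConnectorInfo /
setZooKeeperInstance calls):

import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.mapreduce.Job;

// Scan the index table for label A, turn each rowId into a single-row
// Range, and hand the whole list to AccumuloInputFormat. The driver
// holds every Range in memory - hence the memory concern above.
List<Range> collectRangesForLabel(Connector connector, String label)
    throws Exception {
  Scanner scanner = connector.createScanner("indexTable", Authorizations.EMPTY);
  scanner.setRange(Range.exact(label));  // all index entries for this label
  List<Range> ranges = new ArrayList<Range>();
  for (Entry<Key,Value> entry : scanner) {
    // the main-table rowId is stored in the column qualifier
    ranges.add(Range.exact(entry.getKey().getColumnQualifier().toString()));
  }
  return ranges;
}

void configureJob(Job job, List<Range> ranges) throws Exception {
  AccumuloInputFormat.setInputTableName(job, "mainTable");
  AccumuloInputFormat.setRanges(job, ranges);  // the potentially huge list
  job.setInputFormatClass(AccumuloInputFormat.class);
}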
Option #2 seems ideal, and option #1 could possibly work too. But I
want to make sure I'm not missing something, since it doesn't seem
possible to set up a workflow where the index table is scanned, row IDs
are retrieved, and those IDs are then fed as the input of a second
MapReduce job over a different table (obviously one could create a
BatchScanner from the IDs anywhere). Are there any examples that cover
this? Or does anyone have suggestions for setting up such a workflow?
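
To make the question concrete, here is roughly the reduce phase I'm
imagining for option #2 (again only a sketch - the instance name,
credentials, table names, query-thread count, and the map-side
partitioning of row IDs are all placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Each reducer receives a batch of main-table rowIds (emitted by a map
// over the index table) and fetches the corresponding rows itself with
// a BatchScanner.
public class LabelledRowReducer extends Reducer<Text,Text,Text,Text> {

  @Override
  protected void reduce(Text partition, Iterable<Text> rowIds, Context context)
      throws IOException, InterruptedException {
    try {
      // (in real code the Connector would be created once in setup())
      Connector connector = new ZooKeeperInstance("instance", "zk1:2181")
          .getConnector("user", new PasswordToken("secret"));
      List<Range> ranges = new ArrayList<Range>();
      for (Text rowId : rowIds) {
        // toString() copies the value, since Hadoop reuses Text objects
        ranges.add(Range.exact(rowId.toString()));
      }
      BatchScanner scanner =
          connector.createBatchScanner("mainTable", Authorizations.EMPTY, 4);
      scanner.setRanges(ranges);
      for (Entry<Key,Value> row : scanner) {
        // ... run the second analytic over each fetched row ...
        context.write(new Text(row.getKey().getRow()), new Text("processed"));
      }
      scanner.close();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}

The map phase would just scan the index table's A entries and emit the
row IDs, partitioned across reducers however makes sense.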
Another answer might very well be that this is a wacky table/indexing
setup - I'm very open to hearing that. But to a naive Accumulo user, an
index table like this seems reasonable; I believe a similar pattern is
covered in the manual.