Suppose you have a workflow like the following (hopefully not too uncommon):

* "main table" - billions of rows, each with order of magnitude 100 columns
* A ternary classifier that annotates each row in the main table with one of three labels: A, B, or C. This analytic also writes every label to an index table of the form:

label : rowId

to facilitate lookups of all rows carrying a given label.
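
For concreteness, here is roughly how the classifier might populate that index table. This is a minimal sketch; the table name, the empty column family, and the connector setup are placeholders rather than the actual schema:

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class LabelIndexer {
      // One index entry per classified row: row = label, colQual = rowId.
      public static void indexRow(Connector conn, String label, String rowId)
          throws Exception {
        BatchWriter bw = conn.createBatchWriter("index_table", new BatchWriterConfig());
        Mutation m = new Mutation(new Text(label));
        m.put(new Text(""), new Text(rowId), new Value(new byte[0]));
        bw.addMutation(m);
        bw.close();
      }
    }

(In practice you would keep one long-lived BatchWriter rather than opening one per row.)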

Now, suppose you want to run another analytic over all rows with label A, preferably using MapReduce. It seems the options are:

1. Create a scanner that retrieves all A entries from the index table; add those row IDs as ranges to an AccumuloInputFormat job; launch a MapReduce job with a single map phase (first sketch after this list). Con: the driver program needs a large amount of memory to hold the entire range list.

2. A MapReduce job over the index table, with a reduce phase in which each reducer receives a collection of row IDs, retrieves its assigned rows from the main table, and runs the analytic over them (second sketch after this list).

3. Run over the entire main table with a naive filter that checks the classification type. Con: it hits every row, most of which won't match.

4. AccumuloMultiTableInputFormat, or server-side filters/iterators - neither seems appropriate here.
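
To make option #1 concrete, here is a rough sketch of the driver-side setup. It assumes the 1.5/1.6-era mapreduce API; the instance, user, password, and table names are placeholders. The range list built here is exactly what consumes driver memory when label A matches many rows:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.mapreduce.Job;

    public class OptionOneDriver {
      public static void configure(Connector conn, Job job) throws Exception {
        // Pull every rowId labeled "A" out of the index table.
        Scanner scanner = conn.createScanner("index_table", Authorizations.EMPTY);
        scanner.setRange(new Range("A")); // row = label

        List<Range> ranges = new ArrayList<Range>();
        for (Entry<Key,Value> e : scanner) {
          // colQual holds the main-table rowId
          ranges.add(new Range(e.getKey().getColumnQualifier().toString()));
        }

        // Hand the ranges to an input format over the main table. The
        // whole list sits in driver memory, hence the con noted above.
        job.setInputFormatClass(AccumuloInputFormat.class);
        AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"));
        AccumuloInputFormat.setZooKeeperInstance(job, "instance", "zk1:2181");
        AccumuloInputFormat.setInputTableName(job, "main_table");
        AccumuloInputFormat.setScanAuthorizations(job, Authorizations.EMPTY);
        AccumuloInputFormat.setRanges(job, ranges);
      }
    }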
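
And a hedged sketch of option #2: the mappers run over the index table (with the driver restricting the input to row "A" via setRanges as above), spread the row IDs across a fixed number of reducers, and each reducer batch-scans its slice of the main table. Connection details, table names, and the partition count are all placeholders:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IndexToMainJob {
      static final int NUM_PARTITIONS = 10; // also passed to setNumReduceTasks

      public static class IndexMapper extends Mapper<Key,Value,Text,Text> {
        @Override
        protected void map(Key key, Value value, Context ctx)
            throws IOException, InterruptedException {
          // colQual holds the main-table rowId; bucket it by hash so each
          // reducer only ever sees its own slice of the IDs.
          String rowId = key.getColumnQualifier().toString();
          int part = (rowId.hashCode() & Integer.MAX_VALUE) % NUM_PARTITIONS;
          ctx.write(new Text(Integer.toString(part)), new Text(rowId));
        }
      }

      public static class FetchReducer extends Reducer<Text,Text,Text,Text> {
        private Connector conn;

        @Override
        protected void setup(Context ctx) throws IOException {
          try {
            // Placeholders; a real job would read these from the Configuration.
            conn = new ZooKeeperInstance("instance", "zk1:2181")
                .getConnector("user", new PasswordToken("secret"));
          } catch (Exception e) {
            throw new IOException(e);
          }
        }

        @Override
        protected void reduce(Text part, Iterable<Text> rowIds, Context ctx)
            throws IOException, InterruptedException {
          // Only this reducer's slice of the rowIds is held in memory.
          List<Range> ranges = new ArrayList<Range>();
          for (Text rowId : rowIds)
            ranges.add(new Range(rowId.toString()));

          try {
            BatchScanner bs = conn.createBatchScanner("main_table", Authorizations.EMPTY, 4);
            bs.setRanges(ranges);
            for (Entry<Key,Value> e : bs) {
              // The real analytic would run here; echo the cells for brevity.
              ctx.write(new Text(e.getKey().getRow()), new Text(e.getValue().toString()));
            }
            bs.close();
          } catch (TableNotFoundException e) {
            throw new IOException(e);
          }
        }
      }
    }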

Option #2 seems ideal, with option #1 possibly workable too. But I want to make sure I'm not missing something: it doesn't seem possible to set up a workflow where the index table is scanned, row IDs are retrieved, and those IDs are then fed to a second MapReduce job that reads a different table via MapReduce (obviously one could create a BatchScanner from those inputs anywhere). Are there any examples that cover this? Or does anyone have suggestions on how to set up such a workflow?

Another answer might well be that this is a wacky table/indexing setup, and I'm very open to hearing that. But to a naive Accumulo user an index table seems reasonable - I believe the manual covers the pattern as well.
