Suppose you have a workflow like the following (hopefully not too
uncommon):
* "main table" - billions of rows, each with order of magnitude 100 columns
* Ternary classifier that annotates each row in the main table with one
of three labels - say A, B, and C. This analytic also writes every
label to an index table of the form:
label : rowId
to facilitate lookups by label.
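
For concreteness, the classifier's index writes look roughly like this
(just a sketch - the table name, "idx" column family, and empty value
are illustrative, not my exact schema):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// After classifying a main-table row, write one index entry keyed by
// the label, with the main-table rowId in the column qualifier.
// (In practice the BatchWriter is of course reused across rows.)
void writeIndexEntry(Connector connector, String label, String rowId)
    throws Exception {
  BatchWriter writer =
      connector.createBatchWriter("indexTable", new BatchWriterConfig());
  Mutation m = new Mutation(new Text(label));  // row = label (A, B, or C)
  m.put(new Text("idx"), new Text(rowId), new Value(new byte[0]));
  writer.addMutation(m);
  writer.close();
}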
Now, suppose you want to run another analytic over all rows with label
A, preferably using MapReduce. It seems the options are:
1. Create a scanner that retrieves all A entries from the index table;
convert those row IDs into ranges for an AccumuloInputFormat job; launch
a map-only MapReduce job (sketched after this list). Con: the driver
program needs enough memory to hold the entire range list.
2. A MapReduce job over the index table, with a reduce phase in which
each reducer receives a collection of row IDs, retrieves its assigned
rows from the main table, and runs over them (sketched at the end of
this message).
3. Run over the entire main table with a naive filter that checks the
classification label. Con: touches every row, most of which won't match.
4. AccumuloMultiTableInputFormat, Filters/Iterators - these don't seem
appropriate here.
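
To make #1 concrete, here is roughly what I mean (only a sketch against
the 1.5-style mapreduce API; "indexTable", "mainTable", and the column
layout are placeholders, and I've omitted the usual setConnectorInfo /
setZooKeeperInstance calls):

import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.mapreduce.Job;

// Scan the index table for label A, turn each rowId into a single-row
// Range, and hand the whole list to AccumuloInputFormat. The driver
// holds every Range in memory - hence the memory concern above.
List<Range> collectRangesForLabel(Connector connector, String label)
    throws Exception {
  Scanner scanner = connector.createScanner("indexTable", Authorizations.EMPTY);
  scanner.setRange(Range.exact(label));  // all index entries for this label
  List<Range> ranges = new ArrayList<Range>();
  for (Entry<Key,Value> entry : scanner) {
    // the main-table rowId is stored in the column qualifier
    ranges.add(Range.exact(entry.getKey().getColumnQualifier().toString()));
  }
  return ranges;
}

void configureJob(Job job, List<Range> ranges) throws Exception {
  AccumuloInputFormat.setInputTableName(job, "mainTable");
  AccumuloInputFormat.setRanges(job, ranges);  // the potentially huge list
  job.setInputFormatClass(AccumuloInputFormat.class);
}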
Option #2 seems ideal, and option #1 could possibly work too. But I
want to make sure I'm not missing something, since it doesn't seem
possible to set up a workflow where the index table is scanned, row IDs
are retrieved, and those IDs are then fed as the input of a second
MapReduce job over a different table (obviously one could create a
BatchScanner from the IDs anywhere). Are there any examples that cover
this? Or does anyone have suggestions for setting up such a workflow?
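
To make the question concrete, here is roughly the reduce phase I'm
imagining for option #2 (again only a sketch - the instance name,
credentials, table names, query-thread count, and the map-side
partitioning of row IDs are all placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Each reducer receives a batch of main-table rowIds (emitted by a map
// over the index table) and fetches the corresponding rows itself with
// a BatchScanner.
public class LabelledRowReducer extends Reducer<Text,Text,Text,Text> {

  @Override
  protected void reduce(Text partition, Iterable<Text> rowIds, Context context)
      throws IOException, InterruptedException {
    try {
      // (in real code the Connector would be created once in setup())
      Connector connector = new ZooKeeperInstance("instance", "zk1:2181")
          .getConnector("user", new PasswordToken("secret"));
      List<Range> ranges = new ArrayList<Range>();
      for (Text rowId : rowIds) {
        // toString() copies the value, since Hadoop reuses Text objects
        ranges.add(Range.exact(rowId.toString()));
      }
      BatchScanner scanner =
          connector.createBatchScanner("mainTable", Authorizations.EMPTY, 4);
      scanner.setRanges(ranges);
      for (Entry<Key,Value> row : scanner) {
        // ... run the second analytic over each fetched row ...
        context.write(new Text(row.getKey().getRow()), new Text("processed"));
      }
      scanner.close();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}

The map phase would just scan the index table's A entries and emit the
row IDs, partitioned across reducers however makes sense.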
Another answer might very well be that this is a wacky table/indexing
setup - I'm very open to hearing that. But to a naive Accumulo user, an
index table like this seems reasonable; I believe a similar pattern is
covered in the manual.