I'm new to Accumulo and have inherited a project to extract all data from
Accumulo (assembled as a "document" by RowID) into another web service.

So I started by "scanning" all the data and building a
"document" based on the RowID, ColumnFamily and Value, then sending
this "document" to the service.
Example data:
RowID_1 createdDate "2015-01-01:00:00:01 UTC"
RowID_1 data "this is a test"
RowID_1 title "My test title"

RowID_2 createdDate "2015-01-01:12:01:01 UTC"
RowID_2 data "this is test 2"
RowID_2 title "My test2 title"
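
To make the "document" idea concrete, here is a minimal sketch in plain Java of the grouping I have in mind. It does not use the Accumulo client API at all: the Entry class, the assemble method and sampleEntries are hypothetical names made up for this illustration, and the entries are just the example rows above held in memory.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DocumentAssembler {
    // Hypothetical stand-in for one scanned entry: RowID, ColumnFamily, Value.
    static final class Entry {
        final String rowId;
        final String family;
        final String value;

        Entry(String rowId, String family, String value) {
            this.rowId = rowId;
            this.family = family;
            this.value = value;
        }
    }

    // Groups entries into one "document" per RowID, keyed by ColumnFamily.
    // Insertion order is preserved, so documents come out in the order
    // their RowIDs are first seen.
    static Map<String, Map<String, String>> assemble(List<Entry> entries) {
        Map<String, Map<String, String>> docs = new LinkedHashMap<>();
        for (Entry e : entries) {
            docs.computeIfAbsent(e.rowId, k -> new LinkedHashMap<>())
                .put(e.family, e.value);
        }
        return docs;
    }

    // The example data from this post, as in-memory entries.
    static List<Entry> sampleEntries() {
        return Arrays.asList(
            new Entry("RowID_1", "createdDate", "2015-01-01:00:00:01 UTC"),
            new Entry("RowID_1", "data", "this is a test"),
            new Entry("RowID_1", "title", "My test title"),
            new Entry("RowID_2", "createdDate", "2015-01-01:12:01:01 UTC"),
            new Entry("RowID_2", "data", "this is test 2"),
            new Entry("RowID_2", "title", "My test2 title"));
    }

    public static void main(String[] args) {
        // Print each assembled document; in the real job this is where
        // the document would be sent to the web service.
        assemble(sampleEntries()).forEach(
            (rowId, doc) -> System.out.println(rowId + " -> " + doc));
    }
}
```

In the real extraction the entries would come from a scan rather than a fixed list, but the grouping step would look roughly like this.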


So my table is pretty simple,  RowID, ColumnFamily and Value (no

I need to process one billion "OLD" unique RowIDs (a year's worth of data)
on a live system that is ingesting "new data" at a rate of about 4 million
RowIDs a day.
i.e. I need to process data from September 2015 - September 2016, not
worrying about new data coming in.

So I'm thinking I need to run multiple processes to extract ALL the data in
this "data range" to be more efficient.
Also, it may allow me to run the processes at a lower priority and at
off-hours of the day when traffic is less.

My issue is how do I specify the "range" to scan, and how do I specify.

1. Is using the "createdDate" a good idea? If so, how would I specify the
range for it?

2. How about the TimestampFilter? If I set the start and end to span
one day (about 4 million unique RowIDs),
will this get me all ColumnFamilies and Values for a given RowID? Or could I
miss something because its timestamp
was the next day? I don't really understand timestamps with respect to Accumulo.

3. Does a map-reduce job make sense? If so, how would I specify it?
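
To illustrate the worry behind question 2, here is a small plain-Java simulation. It is not the Accumulo TimestampFilter itself: it assumes every entry carries its own timestamp and that a filter keeps only entries whose timestamp falls inside a [start, end) window. The TimedEntry class, the inWindow method and the timestamps are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TimestampWindowSketch {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    // Hypothetical stand-in for one entry plus the timestamp stored with it.
    static final class TimedEntry {
        final String rowId;
        final String family;
        final long timestampMillis;

        TimedEntry(String rowId, String family, long timestampMillis) {
            this.rowId = rowId;
            this.family = family;
            this.timestampMillis = timestampMillis;
        }
    }

    // Keeps only entries whose timestamp lies in [start, end).
    static List<TimedEntry> inWindow(List<TimedEntry> entries, long start, long end) {
        List<TimedEntry> kept = new ArrayList<>();
        for (TimedEntry e : entries) {
            if (e.timestampMillis >= start && e.timestampMillis < end) {
                kept.add(e);
            }
        }
        return kept;
    }

    // One RowID whose last column was written just after midnight (made-up times).
    static List<TimedEntry> sampleEntries() {
        return Arrays.asList(
            new TimedEntry("RowID_1", "createdDate", DAY_MILLIS - 2000),
            new TimedEntry("RowID_1", "data", DAY_MILLIS - 1000),
            new TimedEntry("RowID_1", "title", DAY_MILLIS + 500));
    }

    public static void main(String[] args) {
        // Filter to the first day only and print which columns survive.
        for (TimedEntry e : inWindow(sampleEntries(), 0L, DAY_MILLIS)) {
            System.out.println(e.rowId + " " + e.family);
        }
    }
}
```

Under those assumptions the one-day window returns only two of the three columns for RowID_1, which is exactly the kind of partial "document" I am afraid of producing.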


