I'm new to Accumulo and have inherited a project that extracts all data
from Accumulo (assembled as a "document" per RowID) into another web service.
I started with SimpleReadClient.java to "scan" all the data and build a
"document" from the RowID, ColumnFamily and Value, then send
this "document" to the service.
RowID     ColumnFamily   Value
RowID_1   createdDate    "2015-01-01:00:00:01 UTC"
RowID_1   data           "this is a test"
RowID_1   title          "My test title"
RowID_2   createdDate    "2015-01-01:12:01:01 UTC"
RowID_2   data           "this is test 2"
RowID_2   title          "My test2 title"
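To make the "document" step concrete, here is a minimal sketch of the
grouping, assuming the scan hands back entries sorted by RowID (which is how
Accumulo scanners return keys). The class RowDocumentBuilder, the Cell
stand-in and the "id" field are hypothetical names used only for this
illustration; the real code would read the row, family and value from each
scanned Key/Value pair instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Accumulo: groups (row, family, value)
// triples into one "document" per RowID.
final class RowDocumentBuilder {

    // Stand-in for one scanned entry (row, column family, value).
    static final class Cell {
        final String row;
        final String family;
        final String value;

        Cell(String row, String family, String value) {
            this.row = row;
            this.family = family;
            this.value = value;
        }
    }

    // Assumes cells arrive sorted by row, so a change in row marks the
    // start of a new document. The "id" key is an assumption and would
    // clash with a column family that is itself named "id".
    static List<Map<String, String>> group(Iterable<Cell> cells) {
        List<Map<String, String>> docs = new ArrayList<>();
        String currentRow = null;
        Map<String, String> current = null;
        for (Cell c : cells) {
            if (!c.row.equals(currentRow)) {
                current = new LinkedHashMap<>();
                current.put("id", c.row);
                docs.add(current);
                currentRow = c.row;
            }
            current.put(c.family, c.value);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
            new Cell("RowID_1", "createdDate", "2015-01-01:00:00:01 UTC"),
            new Cell("RowID_1", "data", "this is a test"),
            new Cell("RowID_1", "title", "My test title"),
            new Cell("RowID_2", "createdDate", "2015-01-01:12:01:01 UTC"),
            new Cell("RowID_2", "data", "this is test 2"),
            new Cell("RowID_2", "title", "My test2 title"));
        // Each map is one "document" that would be sent to the service.
        for (Map<String, String> doc : group(cells)) {
            System.out.println(doc);
        }
    }
}
```

Each map produced this way corresponds to one RowID from the sample rows
above.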
So my table is pretty simple, RowID, ColumnFamily and Value (no
I need to process one billion "OLD" unique RowIDs (a year's worth of data)
on a live system that is ingesting "new data" at a rate of about 4 million
RowIDs a day.
That is, I need to process data from September 2015 to September 2016,
without worrying about new data coming in.
So I'm thinking I need to run multiple processes to extract ALL the data in
this "data range" more efficiently.
That may also allow me to run the processes at a lower priority and during
off-hours, when traffic is lower.
My issue is how do I specify the "range" to scan, and how do I specify.
1. Is using the "createdDate" a good idea? If so, how would I specify the
range for it?
2. How about the TimestampFilter? If I set the start and end to span one
day (about 4 million unique RowIDs),
will this get me all ColumnFamilies and Values for a given RowID? Or could I
miss something because its timestamp
was the next day? I don't really understand timestamps with respect to Accumulo.
3. Does a map-reduce job make sense? If so, how would I specify it?
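
For question 2, this is a minimal sketch of how I imagine carving the year
into one-day windows, one per process or run. It only computes the bounds,
assuming the table's timestamps are milliseconds since the epoch and that
days are cut at UTC midnight; both are assumptions on my part. DayWindows
and split are hypothetical names, and the resulting longs are what I would
expect to hand to a timestamp filter as start and end. It does not settle
whether every cell of a RowID falls inside the same window.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Accumulo: splits a date range into
// one-day windows expressed as epoch milliseconds.
final class DayWindows {

    // Returns one {startInclusive, endExclusive} pair per UTC day, for
    // every day from first to last (both inclusive).
    static List<long[]> split(LocalDate first, LocalDate last) {
        List<long[]> windows = new ArrayList<>();
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            long start = d.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
            long end = d.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
            windows.add(new long[] {start, end});
        }
        return windows;
    }

    public static void main(String[] args) {
        // Example bounds only; the exact first and last day are assumptions.
        List<long[]> windows =
            split(LocalDate.of(2015, 9, 1), LocalDate.of(2015, 9, 3));
        for (long[] w : windows) {
            System.out.println(w[0] + " .. " + w[1]);
        }
    }
}
```

Because each window's end is the next window's start, using an inclusive
start and an exclusive end keeps any one timestamp from landing in two
windows.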