All, I'm new to Accumulo and have inherited a project to extract all the data from Accumulo (assembled as a "document" by RowID) into another web service.
So I started with SimpleReadClient.java to "scan" all the data, built a "document" from the RowID, ColumnFamily, and Value, and sent that "document" to the service. Example data:

    ID       CF           CV
    RowID_1  createdDate  "2015-01-01:00:00:01 UTC"
    RowID_1  data         "this is a test"
    RowID_1  title        "My test title"
    RowID_2  createdDate  "2015-01-01:12:01:01 UTC"
    RowID_2  data         "this is test 2"
    RowID_2  title        "My test2 title"
    ...

So my table is pretty simple: RowID, ColumnFamily, and Value (no ColumnQualifier).

I need to process one billion "old" unique RowIDs (a year's worth of data) on a live system that is ingesting new data at a rate of about 4 million RowIDs a day. That is, I need to process data from September 2015 through September 2016, without worrying about new data coming in.

So I'm thinking I need to run multiple processes to extract all the data in this date range, to be more efficient. It would also let me run the processes at a lower priority and during off-hours, when traffic is lighter.

My issue is how to specify the "range" to scan:

1. Is using the "createdDate" column a good idea? If so, how would I specify a range on it?
2. How about the TimestampFilter? If I set the start and end to span one day (about 4 million unique RowIDs), will that get me all ColumnFamilies and Values for a given RowID, or could I miss something because its timestamp fell on the next day? I don't really understand timestamps with respect to Accumulo.
3. Does a MapReduce job make sense? If so, how would I specify it?

Thanks,
Bob
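P.S. For reference, here is the document-assembly step I described above as a stripped-down sketch. It is plain Java with no Accumulo client code: the String[] triples stand in for the (Key, Value) entries a real Scanner would return, and the class and method names are mine, not anything from SimpleReadClient.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of building a "document" per RowID from scanned entries.
// Each entry is {rowId, columnFamily, value}; a real Accumulo scan
// returns entries sorted by RowID, so all of a row's columns arrive
// together and grouping into a map like this is straightforward.
public class DocumentAssembler {

    public static Map<String, Map<String, String>> assemble(String[][] entries) {
        Map<String, Map<String, String>> docs = new LinkedHashMap<>();
        for (String[] e : entries) {
            docs.computeIfAbsent(e[0], k -> new LinkedHashMap<>())
                .put(e[1], e[2]); // columnFamily -> value
        }
        return docs;
    }

    public static void main(String[] args) {
        String[][] scanned = {
            {"RowID_1", "createdDate", "2015-01-01:00:00:01 UTC"},
            {"RowID_1", "data",        "this is a test"},
            {"RowID_1", "title",       "My test title"},
            {"RowID_2", "createdDate", "2015-01-01:12:01:01 UTC"},
        };
        Map<String, Map<String, String>> docs = assemble(scanned);
        System.out.println(docs.size());                      // 2 documents
        System.out.println(docs.get("RowID_1").get("title")); // My test title
    }
}
```

Each resulting document (one map of ColumnFamily to Value per RowID) is what gets posted to the web service.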