Re: Data Replication

2016-10-15 Thread Yamini Joshi
So HDFS replication is for durability while Accumulo replication is for availability? I'm
assuming that the client is unaware of the replicated instance and queries
the DB with no knowledge of which instance/table will return the result.

Best regards,
Yamini Joshi

On Thu, Oct 13, 2016 at 11:46 AM, Josh Elser wrote:

> I'm not familiar with MongoDB. Perhaps someone else can confirm this for
> you.
>
> Yamini Joshi wrote:
>
>> So, can I say that if I have a table split across nodes (i.e. num
>> tablets > 1) and HDFS replication in my system, it is sort of equivalent
>> to a sharded and replicated mongo architecture?
>>
>> Best regards,
>> Yamini Joshi
>>
>> On Thu, Oct 13, 2016 at 11:06 AM, Josh Elser wrote:
>>
>> The Accumulo (Data Center) Replication feature is for having
>> multiple active Accumulo clusters all containing the same data.
>>
>> HDFS provides replication as a means for durability of the data it
>> is storing. The files that Accumulo creates on one HDFS instance are
>> replicated by HDFS. This does not help if your entire cluster becomes
>> unavailable. That is what the Accumulo data center replication feature
>> solves.
>>
>> While both can be called "replication", they serve very different
>> purposes.
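>>
>> As a minimal sketch, replication for one table can be switched on
>> through the Java API; the property names here are my reading of the
>> replication chapter you linked, so verify them against your version:
>>
>>     // Assumes an existing Connector "conn" and a replication peer
>>     // named "peerCluster" already defined in the site configuration.
>>     // Mark the table so its new data is queued for replication.
>>     conn.tableOperations().setProperty("mytable",
>>         "table.replication", "true");
>>     // Point it at a (hypothetical) table id "2" on the peer cluster.
>>     conn.tableOperations().setProperty("mytable",
>>         "table.replication.target.peerCluster", "2");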
>>
>>
>> Yamini Joshi wrote:
>>
>> Hello
>>
>> I was going through some Accumulo docs and found out about
>> replication.
>> To enable replication, one needs to make some config settings as
>> described in
>> https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/replication.txt.
>> I cannot seem to grasp the difference between this replication
>> conf and
>> the replication at the HDFS level. What exactly is the use case for
>> replication? Are the replicated instances visible to the clients?
>>
>> Best regards,
>> Yamini Joshi
>>
>>
>>


Fwd: Extracting ALL Data using multiple java processes

2016-10-15 Thread Bob Cook
All,

I'm new to accumulo and inherited this project to extract all data from
accumulo (assembled as a "document" by RowID) into another web service.

So I started with SimpleReadClient.java to "scan" all data, and built a
"document" based on the RowID, ColumnFamily and Value, then sent this
"document" to the service.
Example data.
RowID ColumnFamily Value
RowID_1 createdDate "2015-01-01:00:00:01 UTC"
RowID_1 data "this is a test"
RowID_1 title "My test title"

RowID_2 createdDate "2015-01-01:12:01:01 UTC"
RowID_2 data "this is test 2"
RowID_2 title "My test2 title"

...

So my table is pretty simple: RowID, ColumnFamily and Value (no
ColumnQualifier).
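
For concreteness, here is a simplified sketch of the kind of scan that
groups entries by RowID into a "document" (the table name, Connector
setup, and sendToService helper are placeholders):

    // Assumes an existing Connector "conn" and the usual Accumulo client
    // imports (Scanner, Key, Value, Authorizations).
    Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);

    String currentRow = null;
    Map<String, String> document = new HashMap<>();
    for (Map.Entry<Key, Value> entry : scanner) {
        String row = entry.getKey().getRow().toString();
        if (currentRow != null && !row.equals(currentRow)) {
            sendToService(currentRow, document);  // hypothetical helper
            document = new HashMap<>();
        }
        currentRow = row;
        // ColumnFamily -> Value, since there is no ColumnQualifier
        document.put(entry.getKey().getColumnFamily().toString(),
                     entry.getValue().toString());
    }
    if (currentRow != null) {
        sendToService(currentRow, document);
    }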

I need to process one billion "old" unique RowIDs (a year's worth of data)
on a live system that is ingesting new data at a rate of about 4 million
RowIDs a day.
That is, I need to process data from September 2015 - September 2016, not
worrying about new data coming in.

So I'm thinking I need to run multiple processes to extract ALL the data in
this "date range" to be more efficient.
Also, it may allow me to run the processes at a lower priority and at
off-hours of the day when traffic is lighter.
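
One way to carve the table into chunks for separate worker processes is to
build Ranges from the table's existing split points (a sketch; names and
the range assignment are placeholders):

    // Assumes the same Connector "conn"; Text is org.apache.hadoop.io.Text.
    Collection<Text> splits = conn.tableOperations().listSplits("mytable");
    List<Range> ranges = new ArrayList<>();
    Text previous = null;
    for (Text split : splits) {
        ranges.add(new Range(previous, false, split, true));
        previous = split;
    }
    ranges.add(new Range(previous, false, null, true));  // last split to end of table

    // Each worker runs a plain Scanner over its assigned ranges; a Scanner
    // returns entries in sorted order, so a row's columns stay adjacent and
    // the grouping loop above still works.
    Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);
    scanner.setRange(assignedRange);  // hypothetical: one of this worker's ranges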

My issue is how to specify the "range" to scan, and what to specify it with.

1. Is using the "createdDate" a good idea? If so, how would I specify the
range for it?

2. How about the TimestampFilter?  If I set the start and end to span one
day (about 4 million unique RowIDs), will this get me all ColumnFamilies
and Values for a given RowID?  Or could I miss something because its
timestamp was the next day?  I don't really understand timestamps with
respect to Accumulo.
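
A sketch of how I think the TimestampFilter would be attached to a scan
(the day boundaries and iterator priority are placeholders):

    // Built-in filter: org.apache.accumulo.core.iterators.user.TimestampFilter.
    // It filters on the Accumulo timestamp of each Key/Value entry, which (as I
    // understand it) is per entry, not per row, so a row whose columns were
    // written on different days could come back incomplete.
    long startOfDayMillis = 1443657600000L;  // e.g. 2015-10-01 00:00 UTC (placeholder)
    long endOfDayMillis = startOfDayMillis + 24L * 60 * 60 * 1000;

    IteratorSetting ts = new IteratorSetting(50, "tsfilter", TimestampFilter.class);
    TimestampFilter.setStart(ts, startOfDayMillis, true);  // inclusive start
    TimestampFilter.setEnd(ts, endOfDayMillis, true);      // inclusive end
    scanner.addScanIterator(ts);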

3. Does a map-reduce job make sense?  If so, how would I set it up?
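
If map-reduce does make sense, I imagine the job setup would look roughly
like this (a sketch using AccumuloInputFormat; the instance, credentials,
and DocumentMapper are placeholders):

    // Assumes the usual Hadoop MapReduce and Accumulo mapreduce imports.
    Job job = Job.getInstance(new Configuration(), "extract-documents");
    job.setInputFormatClass(AccumuloInputFormat.class);
    AccumuloInputFormat.setConnectorInfo(job, "extractUser", new PasswordToken("secret"));
    AccumuloInputFormat.setZooKeeperInstance(job, ClientConfiguration.loadDefault()
        .withInstance("myInstance").withZkHosts("zk1:2181"));
    AccumuloInputFormat.setInputTableName(job, "mytable");
    AccumuloInputFormat.setScanAuthorizations(job, Authorizations.EMPTY);
    // Optionally restrict the job to the ranges or timestamp window above:
    // AccumuloInputFormat.setRanges(job, ranges);
    // AccumuloInputFormat.addIterator(job, ts);

    job.setMapperClass(DocumentMapper.class);  // hypothetical Mapper<Key, Value, ...>
    job.setNumReduceTasks(0);                  // map-only extract
    job.waitForCompletion(true);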


Thanks,

Bob