The s3a Hadoop FileSystem isn't robust enough to meet the requirements
Accumulo has for guaranteeing no data loss around write-ahead logs.
You can use the ExportSnapshot tool for Accumulo to get an immutable
"picture" of a table. The expectation is that you would use DistCp to
copy the files referenced by this snapshot to some other "cold" storage.
The downside of this approach is that each snapshot is a full copy.
There is no such thing as an incremental snapshot.
Hypothetically, you could build some additional logic to avoid
re-copying a file to your cold storage (all Accumulo files are
immutable, thus if Snapshot1 already referenced fileA, you wouldn't need
to re-copy fileA if Snapshot2 also references it). This is left as an
exercise to the user :)
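For what it's worth, the bookkeeping for that could look roughly like the sketch below. It is only a sketch: the file paths and the files_to_copy helper are made up for illustration, and it assumes you can get a plain list of file paths out of each snapshot and that a path seen once never changes contents (which is the immutability property mentioned above). The copy itself, DistCp or otherwise, is not shown.

```python
from typing import Iterable, List, Set


def files_to_copy(snapshot_files: Iterable[str], already_copied: Set[str]) -> List[str]:
    """Return the files in this snapshot that cold storage does not hold yet."""
    # Assumption: files are immutable, so an already-copied path never
    # needs to be copied again.
    return sorted(set(snapshot_files) - already_copied)


# Hypothetical file listings for two successive snapshots.
snapshot1 = ["/accumulo/tables/1/t-0001/fileA.rf", "/accumulo/tables/1/t-0001/fileB.rf"]
snapshot2 = ["/accumulo/tables/1/t-0001/fileA.rf", "/accumulo/tables/1/t-0001/fileC.rf"]

copied: Set[str] = set()  # paths already present in cold storage

first_batch = files_to_copy(snapshot1, copied)
copied.update(first_batch)  # record them once the copy has succeeded

second_batch = files_to_copy(snapshot2, copied)
copied.update(second_batch)

print(first_batch)   # everything in Snapshot1
print(second_batch)  # only what Snapshot2 added; fileA is skipped
```

You would need to persist the "copied" set somewhere durable between runs, and only update it after a copy has actually completed.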
On 10/3/17 4:40 PM, Christopher wrote:
Hi Mike. This is a great question. Accumulo has several options for backup.
Accumulo is backed by HDFS for persisting its data on disk. It may be
possible to use S3 directly at this layer. I'm not sure what the current
state is for doing something like this, but a brief Googling for "HDFS
on S3" shows a few historical projects which may still be active and mature.
Accumulo also has a replication feature to automatically mirror live
ingest to a pluggable external receiver, which could be a backup service
you've written to store data in S3. Recovery would depend on how you
store the data in S3. You could also implement an ingest system which
stores data to a backup as well as to Accumulo, to handle both live and
bulk ingest.
Accumulo also has an "exporttable" feature, which exports the metadata
for a table, along with a list of files in HDFS for you to back up to S3
(or another file system). Recovery involves using the "importtable"
feature which recreates the metadata, and bulk importing the files after
you've moved them from your backup location back onto HDFS.
This is just a rough outline of 3 possible solutions. I don't know which
(if any) would match your requirements best. There may be many other
solutions as well.
On Tue, Oct 3, 2017 at 4:10 PM <[email protected]> wrote:
Please forgive the newbie question. What options are there for
backup and recovery of Accumulo data?
Ideally I would like something that would replicate to S3 in
real time.