One aspect of creating RFiles for bulk import into Accumulo that I don't recall being mentioned before is the ability to archive them for future use.
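Since bulk import moves the staged files into the table's directory, archiving is just a matter of copying them aside first. A minimal sketch of that idea, assuming the 1.x importDirectory API; every path below is a made-up placeholder:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ArchiveThenImport {
        // Copy the staged RFiles aside, then bulk-import the originals.
        public static void run(Connector conn, String table) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path staged   = new Path("/ingest/rfiles");           // output of the RFile job (hypothetical)
            Path archive  = new Path("/archive/rfiles-20151006"); // kept for future re-import (hypothetical)
            Path failures = new Path("/ingest/failures");         // must exist and be empty

            // Copy, don't move: importDirectory consumes the source files.
            FileUtil.copy(fs, staged, fs, archive, false, conf);
            fs.mkdirs(failures);

            conn.tableOperations().importDirectory(table, staged.toString(),
                    failures.toString(), false);
        }
    }

Re-ingesting later is then just another importDirectory call pointed at a fresh copy of the archived files; since bulk import consumes its input, keep the archive itself read-only.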
On Tue, Oct 6, 2015 at 10:25 PM, Russ Weeks <[email protected]> wrote:
> Hi, Dylan,
>
> Yeah, writing RFiles instead of using BatchWriters
> (AccumuloFileOutputFormat vs. AccumuloOutputFormat) for efficiency and
> atomicity of ingest ("improved" atomicity if that even makes sense).
>
> I'm thinking about the NFS gateway just because the system that's
> producing the CSV is kind of a black box to me. It doesn't speak Hadoop, as
> Christopher alluded to, and I can't control its output format, but I can
> direct its output to a filesystem that it perceives to be local.
>
> My options are either an NFS write direct to HDFS via the gateway, or an
> NFS write to a conventional filesystem that I control, followed by some
> sort of inotify-driven migration from that server to HDFS.
>
> -Russ
>
> On Tue, Oct 6, 2015 at 6:12 PM Dylan Hutchison <[email protected]> wrote:
>
>> Hi Russ,
>> I'm curious what you have in mind. Are you looking for a solution more
>> efficient than running clients that read the CSV files and open
>> BatchWriters?
>>
>> Regards, Dylan
>>
>> On Tue, Oct 6, 2015 at 4:56 PM, Christopher <[email protected]> wrote:
>>
>>> I haven't tried it, but it sounds like a cool use case. Might be a good
>>> alternative to distcp, more interoperable with tools which don't speak
>>> hadoop.
>>>
>>> On Tue, Oct 6, 2015, 18:41 Russ Weeks <[email protected]> wrote:
>>>
>>>> I hope this isn't too off-topic. Any opinions re. its
>>>> completeness/quality/reliability?
>>>>
>>>> (The use case is, CSV files -> NFS -> HDFS -> Spark -> RFiles ->
>>>> Accumulo. Relevance established!)
>>>>
>>>> Thanks,
>>>> -Russ
>>>
>>
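Re: the Spark -> RFiles step in the quoted thread, here's roughly what I'd expect that job to look like. This is only a sketch: it assumes the 1.x mapreduce AccumuloFileOutputFormat, a made-up three-field CSV (row, qualifier, value), and a constant column family; it sorts on plain strings before building Keys because Key isn't java.io.Serializable and would otherwise break the shuffle.

    import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class CsvToRFiles {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext();

            // Sort on "row\0qualifier" strings; with a constant column family
            // this matches Key order, and strings shuffle cleanly.
            JavaPairRDD<Key, Value> kvs = sc.textFile("hdfs:///ingest/csv")
                .mapToPair(line -> {
                    String[] f = line.split(",", 3); // row,qualifier,value (made up)
                    return new Tuple2<>(f[0] + "\0" + f[1], f[2]);
                })
                .sortByKey()
                .mapToPair(t -> {
                    String[] k = t._1().split("\0", 2);
                    Key key = new Key(new Text(k[0]), new Text("cf"), new Text(k[1]));
                    return new Tuple2<>(key, new Value(t._2().getBytes()));
                });

            // One sorted RFile per partition, covering disjoint key ranges.
            kvs.saveAsNewAPIHadoopFile("hdfs:///ingest/rfiles",
                    Key.class, Value.class, AccumuloFileOutputFormat.class);
            sc.stop();
        }
    }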

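And for the inotify-driven migration option: on the JVM, NIO's WatchService (backed by inotify on Linux) plus copyFromLocalFile gets most of the way there. A sketch under the assumption that the producer renames finished files into the landing directory, so ENTRY_CREATE only ever fires on complete CSVs; both directories are hypothetical:

    import java.nio.file.FileSystems;
    import java.nio.file.Paths;
    import java.nio.file.StandardWatchEventKinds;
    import java.nio.file.WatchEvent;
    import java.nio.file.WatchKey;
    import java.nio.file.WatchService;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CsvMigrator {
        public static void main(String[] args) throws Exception {
            java.nio.file.Path landing = Paths.get("/srv/nfs/csv"); // local NFS export (made up)
            FileSystem hdfs = FileSystem.get(new Configuration());

            WatchService watcher = FileSystems.getDefault().newWatchService();
            // Fires when a file is created or renamed into the landing dir.
            landing.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

            while (true) {
                WatchKey key = watcher.take(); // blocks until something lands
                for (WatchEvent<?> ev : key.pollEvents()) {
                    java.nio.file.Path name = (java.nio.file.Path) ev.context();
                    hdfs.copyFromLocalFile(
                            new Path(landing.resolve(name).toString()),
                            new Path("/ingest/csv/" + name)); // HDFS staging dir (made up)
                }
                key.reset();
            }
        }
    }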