Hi, Dylan,
Yeah, the idea is to write RFiles instead of using BatchWriters
(AccumuloFileOutputFormat vs. AccumuloOutputFormat) for efficiency and
atomicity of ingest ("improved" atomicity, if that even makes sense).
I'm thinking about the NFS gateway just because the system that's producing
the CSV is kind of a black box to me. It doesn't speak Hadoop, as
Christopher alluded to, and I can't control its output format, but I can
direct its output to a filesystem that it perceives to be local.
My options are either an NFS write directly to HDFS via the gateway, or an
NFS write to a conventional filesystem that I control, followed by some
sort of inotify-driven migration from that server to HDFS.
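For the second option, here's a minimal sketch of the migration step. The
directory names are made up, it shells out to `hdfs dfs -put`, and it polls
rather than blocking on inotify events (a real deployment would use
inotify-tools or the watchdog library instead of a sleep loop):

```python
import os
import subprocess
import time


def migrate_new_files(local_dir, hdfs_dir, seen, put=None):
    """Copy files that appeared in local_dir since the last scan to hdfs_dir.

    `put` defaults to shelling out to `hdfs dfs -put`; it's injectable so
    the copy step can be swapped out or stubbed.
    """
    if put is None:
        put = lambda src, dst: subprocess.check_call(
            ["hdfs", "dfs", "-put", src, dst])
    migrated = []
    for name in sorted(os.listdir(local_dir)):
        if name in seen:
            continue
        path = os.path.join(local_dir, name)
        if not os.path.isfile(path):
            continue
        put(path, hdfs_dir + "/" + name)
        seen.add(name)
        migrated.append(name)
    return migrated


def watch(local_dir, hdfs_dir, interval=5.0):
    """Poll local_dir forever; an inotify-based version would block on
    filesystem events instead of sleeping."""
    seen = set()
    while True:
        migrate_new_files(local_dir, hdfs_dir, seen)
        time.sleep(interval)
```

One caveat either way: a poll (or an inotify CREATE event) can observe a
file the producer is still writing, so in practice you'd trigger on
CLOSE_WRITE, or have the producer write to a temp name and rename when done.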
-Russ
On Tue, Oct 6, 2015 at 6:12 PM Dylan Hutchison <[email protected]> wrote:
> Hi Russ,
> I'm curious what you have in mind. Are you looking for a solution more
> efficient than running clients that read the CSV files and open
> BatchWriters?
>
> Regards, Dylan
>
> On Tue, Oct 6, 2015 at 4:56 PM, Christopher <[email protected]> wrote:
>
>> I haven't tried it, but it sounds like a cool use case. Might be a good
>> alternative to distcp, more interoperable with tools that don't speak
>> Hadoop.
>>
>> On Tue, Oct 6, 2015, 18:41 Russ Weeks <[email protected]> wrote:
>>
>>> I hope this isn't too off-topic. Any opinions re. its
>>> completeness/quality/reliability?
>>>
>>> (The use case: CSV files -> NFS -> HDFS -> Spark -> RFiles ->
>>> Accumulo. Relevance established!)
>>>
>>> Thanks,
>>> -Russ
>>>
>>
>