Mark, JoeS, JoeW,

I have gone through Mark's comment in
https://issues.apache.org/jira/browse/NIFI-25 and tend to agree ... I am
also trying to see how AWS Lambda can fit into the picture ...

--

I'm not sure about the ListS3. I can definitely see the value of it.
However, it requires that the processor maintain a significant amount of
state about what it has seen. This is not cluster friendly at all. It also
requires continually pulling a potentially huge listing to see if anything
has changed.

I think we should instead push users to configure S3 to add a notification
to SQS when a new object is placed in an S3 bucket. We can then have a
GetSQS processor to detect that an item was added and then fetch the
contents via GetS3/FetchS3/RetrieveS3. This is a much more scalable
approach and handles backpressure well.

--
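To make the SQS route concrete, here is a minimal sketch of what the step
between GetSQS and the S3 fetch has to extract from each message. It assumes
the SQS message body is a standard S3 event notification (a JSON document
with a "Records" array, where each record carries s3.bucket.name and a
URL-encoded s3.object.key). The function name, bucket and key below are
made up for illustration; this is not NiFi code.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_notification(body):
    """Return (bucket, key) pairs for object-created records in one message body."""
    message = json.loads(body)
    results = []
    # Messages without a "Records" array simply yield an empty list.
    for record in message.get("Records", []):
        # Only react to newly created objects, not deletes or other events.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification (space becomes "+").
        key = unquote_plus(s3["object"]["key"])
        results.append((bucket, key))
    return results


# Hypothetical message body, trimmed to the fields used above.
sample_body = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/my+file.csv", "size": 1024},
            },
        }
    ]
})

for bucket, key in s3_objects_from_notification(sample_body):
    print(bucket, key)
```

In a flow, the same two values would be pulled out with JSON path
expressions and handed to the fetch processor as attributes, so no listing
state has to be kept on the NiFi side.
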

I notice https://issues.apache.org/jira/browse/NIFI-840 (Create ListS3
processor) has been around for some time.  Let me know your thoughts on when
we can have ListS3 and/or if any help is needed.

Naga Vijayapuram


On Wed, Dec 2, 2015 at 12:31 PM, Naga Vijay <[email protected]> wrote:

> Mark,
>
> Thanks for the pointer on SQS.
>
> I am thinking that it would help in having a higher level processor for
> distcp to cover both HDFS and S3 as source/sink.
>
> Naga Vijayapuram
>
>
> On Wed, Dec 2, 2015 at 9:48 AM, Mark Payne <[email protected]> wrote:
>
>> We certainly can do the reverse case - sync S3 with HDFS. With S3, as Joe
>> S mentioned, we really should have a ListS3
>> but currently do not (we do have a ListHDFS though). Typically the use
>> case that I've used with S3 is to set up S3 to notify
>> when an object arrives via SQS. Then have GetSQS get that notification
>> and then pull the data via FetchS3Object.
>> So you could fairly easily set up a GetSQS -> EvaluateJSONPath ->
>> FetchS3Object -> PutHDFS. That would require that SQS be set up though to
>> notify you when new objects arrive.
>>
>> On Dec 2, 2015, at 12:24 PM, Naga Vijay <[email protected]> wrote:
>>
>> Joe Witt & Joe Skora,
>>
>> Thanks for thinking about this.  Yes, it would serve as a great
>> example/template (as would the reverse case).
>>
>> Naga Vijayapuram
>>
>>
>> On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <[email protected]> wrote:
>>
>>> @JoeW,
>>>
>>> It looks like we need to add a ListS3 processor in addition to the
>>> Multipart Upload management that I'm looking into now.  Extending
>>> ListFileTransfer for S3 shouldn't be too bad.
>>>
>>> JoeS
>>>
>>> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <[email protected]> wrote:
>>>
>>>> Hello
>>>>
>>>> So we have FetchS3 and PutHDFS and a series of interesting in between
>>>> processes to help.  So that would get you most of the way there.  How
>>>> to get the listing/know what to pull from S3?  That part I'm not sure
>>>> about.
>>>>
>>>> This would make for a great example/template for us to post (as would
>>>> the reverse case).
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <[email protected]> wrote:
>>>> > Hello,
>>>> >
>>>> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need
>>>> > to write a processor for it?
>>>> >
>>>> > Thanks
>>>> > Naga
>>>>
>>>
>>>
>>
>>
>
