Naga,

I like your idea/flow very much.  We should definitely put this up as
an example template with documentation on why/how it works.

Joe

On Thu, Dec 3, 2015 at 11:33 AM, Naga Vijay <nagah...@gmail.com> wrote:
> Mark, JoeS, JoeW,
>
> I have gone through Mark's comment in
> https://issues.apache.org/jira/browse/NIFI-25 and tend to agree ... I am
> also trying to see how AWS Lambda can fit into the picture ...
>
> --
>
> I'm not sure about the ListS3. I can definitely see the value of it.
> However, it requires that the processor maintain a significant amount of
> state about what it has seen. This is not cluster-friendly at all. It also
> requires continually pulling a potentially huge listing to see if anything
> has changed.
>
> I think we should instead push users to configure S3 to add a notification
> to SQS when a new object is placed in an S3 bucket. We can then have a
> GetSQS processor to detect that an item was added and then fetch the
> contents via GetS3/FetchS3/RetrieveS3. This is a much more scalable approach
> and handles backpressure well.
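The bucket-side setup described above happens outside NiFi. A minimal sketch of the notification configuration, assuming placeholder bucket and queue names (the boto3 call in the comment is the AWS SDK API for applying it; verify the details against the SDK docs):

```python
# Sketch of wiring an S3 bucket to an SQS queue so that GetSQS sees
# new-object events. The bucket name, account ID, region, and queue
# name below are all placeholders.
notification_config = {
    "QueueConfigurations": [
        {
            "QueueArn": "arn:aws:sqs:us-east-1:123456789012:nifi-s3-events",
            "Events": ["s3:ObjectCreated:*"],
        }
    ]
}

# With the AWS SDK (boto3) this would be applied roughly as:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_notification_configuration(
#       Bucket="my-bucket",
#       NotificationConfiguration=notification_config,
#   )
print(notification_config["QueueConfigurations"][0]["Events"])
```

The `s3:ObjectCreated:*` event covers puts, copies, and completed multipart uploads, so the queue receives a message for every new object regardless of how it arrived.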
>
> --
>
> I notice https://issues.apache.org/jira/browse/NIFI-840 (Create ListS3
> processor) has been around for some time.  Let me know your thoughts on when
> we can have ListS3 and/or if any help is needed.
>
> Naga Vijayapuram
>
>
> On Wed, Dec 2, 2015 at 12:31 PM, Naga Vijay <nagah...@gmail.com> wrote:
>>
>> Mark,
>>
>> Thanks for the pointer on SQS.
>>
>> I am thinking it would help to have a higher-level DistCp-style processor
>> covering both HDFS and S3 as source/sink.
>>
>> Naga Vijayapuram
>>
>>
>> On Wed, Dec 2, 2015 at 9:48 AM, Mark Payne <marka...@hotmail.com> wrote:
>>>
>>> We certainly can do the reverse case - sync S3 with HDFS. With S3, as Joe
>>> S mentioned, we really should have a ListS3
>>> but currently do not (we do have a ListHDFS, though). Typically the use
>>> case that I've used with S3 is to set up S3 to notify
>>> when an object arrives via SQS. Then have GetSQS get that notification
>>> and then pull the data via FetchS3Object.
>>> So you could fairly easily set up a GetSQS -> EvaluateJSONPath ->
>>> FetchS3Object -> PutHDFS. That would require that SQS be set up, though,
>>> to notify you when new objects arrive.
>>>
>>> On Dec 2, 2015, at 12:24 PM, Naga Vijay <nagah...@gmail.com> wrote:
>>>
>>> Joe Witt & Joe Skora,
>>>
>>> Thanks for thinking about this.  Yes, it would serve as a great
>>> example/template (as would the reverse case).
>>>
>>> Naga Vijayapuram
>>>
>>>
>>> On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <jsk...@gmail.com> wrote:
>>>>
>>>> @JoeW,
>>>>
>>>> It looks like we need to add a ListS3 processor in addition to the
>>>> Multipart Upload management that I'm looking into now.  Extending
>>>> ListFileTransfer for S3 shouldn't be too bad.
>>>>
>>>> JoeS
>>>>
>>>> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>>>>
>>>>> Hello
>>>>>
>>>>> So we have FetchS3 and PutHDFS and a series of interesting in between
>>>>> processes to help.  So that would get you most of the way there.  How
>>>>> to get the listing/know what to pull from S3?  That part I'm not sure
>>>>> about.
>>>>>
>>>>> This would make for a great example/template for us to post (as would
>>>>> the reverse case).
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <nagah...@gmail.com> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need
>>>>> > to write
>>>>> > a processor for it?
>>>>> >
>>>>> > Thanks
>>>>> > Naga
>>>>
>>>>
>>>
>>>
>>
>
