Naga, I like your idea/flow very much. We should definitely put this up as an example template with documentation on why/how it works.
Joe

On Thu, Dec 3, 2015 at 11:33 AM, Naga Vijay <nagah...@gmail.com> wrote:
> Mark, JoeS, JoeW,
>
> I have gone through Mark's comment in
> https://issues.apache.org/jira/browse/NIFI-25 and tend to agree ... I am
> also trying to see how AWS Lambda can fit into the picture ...
>
> --
>
> I'm not sure about the ListS3. I can definitely see the value of it.
> However, it requires that the processor maintain a significant amount of
> state about what it has seen. This is not cluster friendly at all. It also
> requires continually pulling a potentially huge listing to see if anything
> has changed.
>
> I think we should instead push users to configure S3 to add a notification
> to SQS when a new object is placed in an S3 bucket. We can then have a
> GetSQS processor to detect that an item was added and then fetch the
> contents via GetS3/FetchS3/RetrieveS3. This is a much more scalable approach
> and handles backpressure well.
>
> --
>
> I notice https://issues.apache.org/jira/browse/NIFI-840 (Create ListS3
> processor) has been around for some time. Let me know your thoughts on when
> we can have ListS3 and/or if any help is needed.
>
> Naga Vijayapuram
>
>
> On Wed, Dec 2, 2015 at 12:31 PM, Naga Vijay <nagah...@gmail.com> wrote:
>>
>> Mark,
>>
>> Thanks for the pointer on SQS.
>>
>> I am thinking that it would help to have a higher-level processor for
>> distcp to cover both HDFS and S3 as source/sink.
>>
>> Naga Vijayapuram
>>
>>
>> On Wed, Dec 2, 2015 at 9:48 AM, Mark Payne <marka...@hotmail.com> wrote:
>>>
>>> We certainly can do the reverse case - sync S3 with HDFS. With S3, as Joe
>>> S mentioned, we really should have a ListS3
>>> but currently do not (we do have a ListHDFS though). Typically the use
>>> case that I've used with S3 is to set up S3 to notify
>>> when an object arrives via SQS. Then have GetSQS get that notification
>>> and then pull the data via FetchS3Object.
>>> So you could fairly easily set up a GetSQS -> EvaluateJsonPath ->
>>> FetchS3Object -> PutHDFS. That would require that SQS be set up though to
>>> notify you when new objects arrive.
>>>
>>> On Dec 2, 2015, at 12:24 PM, Naga Vijay <nagah...@gmail.com> wrote:
>>>
>>> Joe Witt & Joe Skora,
>>>
>>> Thanks for thinking about this. Yes, it would serve as a great
>>> example/template (as would the reverse case).
>>>
>>> Naga Vijayapuram
>>>
>>>
>>> On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <jsk...@gmail.com> wrote:
>>>>
>>>> @JoeW,
>>>>
>>>> It looks like we need to add a ListS3 processor in addition to the
>>>> Multipart Upload management that I'm looking into now. Extending
>>>> ListFileTransfer for S3 shouldn't be too bad.
>>>>
>>>> JoeS
>>>>
>>>> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>>>>
>>>>> Hello
>>>>>
>>>>> So we have FetchS3 and PutHDFS and a series of interesting in-between
>>>>> processors to help. So that would get you most of the way there. How
>>>>> to get the listing/know what to pull from S3? That part I'm not sure
>>>>> about.
>>>>>
>>>>> This would make for a great example/template for us to post (as would
>>>>> the reverse case).
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <nagah...@gmail.com> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need
>>>>> > to write
>>>>> > a processor for it?
>>>>> >
>>>>> > Thanks
>>>>> > Naga
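
For anyone following the GetSQS -> EvaluateJsonPath -> FetchS3Object -> PutHDFS flow discussed in this thread, here is a rough sketch of what the middle step has to extract from each SQS message. It is plain Python rather than NiFi configuration, and it assumes the message body is a standard S3 event notification delivered directly to SQS. The function name, bucket name, and object key are made up for illustration. Inside NiFi, the same fields would presumably be pulled with JSONPath expressions along the lines of $.Records[0].s3.bucket.name and $.Records[0].s3.object.key, though that mapping is an assumption and not something confirmed in the thread.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_message_body: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for newly created objects in one SQS message.

    Hypothetical helper: assumes the body is an S3 event notification
    sent straight to SQS (not wrapped by another service).
    """
    event = json.loads(sqs_message_body)
    objects = []
    # Messages without a "Records" list (e.g. S3's initial test event) yield nothing.
    for record in event.get("Records", []):
        # Only react to object-creation events such as "ObjectCreated:Put".
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification, so decode them
        # before handing them to whatever fetches the object.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Illustrative message body; the bucket and key are invented for this sketch.
SAMPLE_BODY = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/my+file.csv", "size": 1024},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/old.csv"},
            },
        },
    ]
})

if __name__ == "__main__":
    for bucket, key in extract_s3_objects(SAMPLE_BODY):
        print(bucket, key)
```

Each (bucket, key) pair is what the fetch step needs in order to pull the object before it is written to HDFS. Because the listing comes from notifications, no processor has to remember which objects it has already seen, which is the point Mark makes against ListS3 above.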