Mark, JoeS, JoeW,

I have gone through Mark's comment in https://issues.apache.org/jira/browse/NIFI-25 and tend to agree ... I am also trying to see how AWS Lambda can fit into the picture ...

--
I'm not sure about the ListS3. I can definitely see the value of it. However, it requires that the processor maintain a significant amount of state about what it has seen. This is not cluster friendly at all. It also requires continually pulling a potentially huge listing to see if anything has changed. I think we should instead push users to configure S3 to add a notification to SQS when a new object is placed in an S3 bucket. We can then have a GetSQS processor to detect that an item was added and then fetch the contents via GetS3/FetchS3/RetrieveS3. This is a much more scalable approach and handles backpressure well.
--

I notice https://issues.apache.org/jira/browse/NIFI-840 (Create ListS3 processor) has been around for some time. Let me know your thoughts on when we can have ListS3 and/or if any help is needed.

Naga Vijayapuram

On Wed, Dec 2, 2015 at 12:31 PM, Naga Vijay <[email protected]> wrote:

> Mark,
>
> Thanks for the pointer on SQS.
>
> I am thinking that it would help to have a higher level processor for
> distcp to cover both HDFS and S3 as source/sink.
>
> Naga Vijayapuram
>
>
> On Wed, Dec 2, 2015 at 9:48 AM, Mark Payne <[email protected]> wrote:
>
>> We certainly can do the reverse case - sync S3 with HDFS. With S3, as Joe
>> S mentioned, we really should have a ListS3 but currently do not (we do
>> have a ListHDFS though). Typically the use case that I've used with S3 is
>> to set up S3 to notify when an object arrives via SQS. Then have GetSQS
>> get that notification and then pull the data via FetchS3Object.
>> So you could fairly easily set up a GetSQS -> EvaluateJSONPath ->
>> FetchS3Object -> PutHDFS. That would require that SQS be set up though to
>> notify you when new objects arrive.
>>
>> On Dec 2, 2015, at 12:24 PM, Naga Vijay <[email protected]> wrote:
>>
>> Joe Witt & Joe Skora,
>>
>> Thanks for thinking about this. Yes, it would serve as a great
>> example/template (as would the reverse case).
>>
>> Naga Vijayapuram
>>
>>
>> On Tue, Dec 1, 2015 at 11:05 PM, Joe Skora <[email protected]> wrote:
>>
>>> @JoeW,
>>>
>>> It looks like we need to add a ListS3 processor in addition to the
>>> Multipart Upload management that I'm looking into now. Extending
>>> ListFileTransfer for S3 shouldn't be too bad.
>>>
>>> JoeS
>>>
>>> On Wed, Dec 2, 2015 at 12:04 AM, Joe Witt <[email protected]> wrote:
>>>
>>>> Hello
>>>>
>>>> So we have FetchS3 and PutHDFS and a series of interesting in between
>>>> processes to help. So that would get you most of the way there. How
>>>> to get the listing/know what to pull from S3? That part I'm not sure
>>>> about.
>>>>
>>>> This would make for a great example/template for us to post (as would
>>>> the reverse case).
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Tue, Dec 1, 2015 at 10:36 PM, Naga Vijay <[email protected]> wrote:
>>>> > Hello,
>>>> >
>>>> > Is there a processor to DistCp from Amazon S3 to HDFS, or do I need to write
>>>> > a processor for it?
>>>> >
>>>> > Thanks
>>>> > Naga
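
For reference, the GetSQS -> EvaluateJSONPath -> FetchS3Object flow discussed in the thread depends on pulling the bucket name and object key out of each S3 event notification. Below is a minimal Python sketch of that extraction step only, outside NiFi. It assumes the SQS message body is the S3 event notification JSON delivered directly to the queue (not wrapped by SNS); the function name `extract_s3_objects` and the sample bucket and key are hypothetical, chosen for illustration.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_message_body: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for newly created objects in one notification."""
    event = json.loads(sqs_message_body)
    objects = []
    # Messages without a "Records" list (for example a test message sent when
    # the notification is first configured) yield no objects.
    for record in event.get("Records", []):
        # Only react to object-creation events; skip removals and others.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in notifications (spaces become "+").
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


# Hypothetical sample notification, trimmed to the fields used above.
sample = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/my+file.csv", "size": 1024},
            },
        }
    ]
})

print(extract_s3_objects(sample))
```

In the NiFi flow, the same two fields would be lifted into FlowFile attributes by the JSON path step and then handed to FetchS3Object, so no listing state has to be kept by the processor.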
