Mark, it is my understanding that S3 does not publish an object's key until the write to S3 is complete. With that in mind, would you comment on the necessity of employing PutS3Object=>SQS=>GetSQS=>EvaluateXPath=>FetchS3 if that is true? If the file is only published to consumers once it is fully written, under what circumstance might *any* read operation attempt to read an S3 object before it is fully there?
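For reference on that chain: the message GetSQS pulls down is an S3 event notification, which is a JSON document. Below is a minimal sketch of extracting the bucket and key from one, assuming the standard `Records` format of S3 event notifications; the bucket and key names are made up for illustration.

```python
import json

# Example S3 event notification body as delivered to SQS when an
# object finishes being written (hypothetical bucket/key names).
notification = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "my-bucket"},
            "object": {"key": "myFile.csv", "size": 1024}
        }
    }]
})

def extract_objects(body):
    """Pull (bucket, key) pairs out of an S3 event notification body."""
    msg = json.loads(body)
    return [(r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
            for r in msg.get("Records", [])]

print(extract_objects(notification))  # [('my-bucket', 'myFile.csv')]
```

In a NiFi flow this extraction step is what sits between GetSQS and FetchS3, however it is implemented.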
Thanks in advance for helping me better understand.

-Jim

On Wed, Aug 16, 2017 at 10:22 AM, Mark Payne <[email protected]> wrote:

> Andy,
>
> The ScanAttribute processor allows you to match one or more attributes against a dictionary.
>
> Consuming data that is still being written is always a tough problem to tackle. We've seen people take many different approaches to this. One approach is to have the producer of the data use a "dot naming" convention, where they write to a file named .myFile.csv and then rename it to myFile.csv when done. This is often the easiest approach if you control the producers as well.
>
> A more S3-centric approach is to configure the S3 bucket so that when data is finished being written to the bucket, S3 can send a notification to SQS. Then you can use GetSQS to get this notification, use EvaluateXPath (for instance) to extract the information needed, and then use FetchS3.
>
> Thanks
> -Mark
>
> On Aug 16, 2017, at 10:13 AM, Andy Loughran <[email protected]> wrote:
>
> Hi Mark,
>
> Yeah, I think that's what I have now. The issue being that I could end up with a duplicate of a file.
>
> I guess I could use the DetectDuplicate processor to make sure that I de-dupe the FlowFiles before I increment the counter. The issue here is that I want the latest available FlowFile to replace one if it exists (users could update a file's contents before a batch is complete).
>
> Given there are 5 'types', is there a processor that allows me to match a 'type' attribute against a dictionary?
>
> On Wed, 16 Aug 2017 at 15:07 Mark Payne <[email protected]> wrote:
>
>> Hi Andy and welcome to the community!
>>
>> I think that what you're doing here seems very reasonable. If you want to wait for 5 'like flowfiles' instead of just 5 flowfiles, you should be able to use the "Signal Counter Name" of the Wait processor.
>> For example, if your UpdateAttribute processor creates a 'type' and a 'batch' attribute, you can set the Wait processor's Signal Counter Name to "${type}" or to "${type}${batch}", depending on how you want to group them together. This will wait until you reach 5 flowfiles with the same "type" attribute (or the same combination of "type" and "batch" attributes), according to what you set as the Signal Counter Name.
>>
>> Does this make sense?
>>
>> Thanks
>> -Mark
>>
>> > On Aug 16, 2017, at 9:55 AM, Andy Loughran <[email protected]> wrote:
>> >
>> > Hey everyone,
>> >
>> > This is my first post.
>> >
>> > I'm building out a pipeline with NiFi, but am stuck on an architectural decision around some fairly basic design. I think I'm stuck because I'm operating on the wrong paradigm, but the application receiving my flow is the limitation in this context.
>> >
>> > I'm using ListS3 to poll for CSV files. There need to be 5 different types of file uploaded with a unique batch identifier for them to be released. I'm using UpdateAttribute to rip the type and batch from the filename, then using Wait to hold the batch.
>> >
>> > At the moment, though, I'm holding until a batch has 5 files, rather than 5 files with each attribute type matching the expected types.
>> >
>> > Is this the wrong way to be thinking about this problem, or does this sound like a good use case for NiFi, but using a better combination of processors? If anyone could give me guidance or point me toward an example template for batch processing, I'd be grateful.
>> >
>> > Look forward to helping out in the community where I can.
>> >
>> > Thanks,
>> >
>> > Andy
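The counter behaviour Mark describes can be sketched outside NiFi. This is a toy model of the Wait/Notify signal counting, not NiFi's actual implementation; the counter name mimics a "${type}${batch}" expression with made-up values, and the target count of 5 follows the thread.

```python
from collections import defaultdict

TARGET_SIGNAL_COUNT = 5  # five files per batch, per the thread

counters = defaultdict(int)

def signal(counter_name):
    """Toy model of Wait/Notify: increment the named counter and
    report whether it has reached the target, i.e. whether the
    batch would be released."""
    counters[counter_name] += 1
    return counters[counter_name] >= TARGET_SIGNAL_COUNT

# Five flowfiles whose "${type}${batch}" evaluates to the same
# counter name (hypothetical type "invoices", batch "42"):
results = [signal("invoices42") for _ in range(5)]
print(results)  # [False, False, False, False, True]
```

Flowfiles with a different type or batch would feed a different counter, so they never release each other's batch.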

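For completeness, the "dot naming" hand-off Mark mentions can be sketched as a producer-side routine. This is a minimal sketch assuming a POSIX filesystem, where a rename within one filesystem is atomic; the function name and paths are hypothetical.

```python
import os

def publish_atomically(data: bytes, final_path: str) -> None:
    """Write to a hidden '.name' temporary file, then rename it to
    its final name, so a poller never sees a half-written file
    (rename is atomic on POSIX filesystems)."""
    directory, name = os.path.split(final_path)
    tmp_path = os.path.join(directory, "." + name)
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before publishing
    os.rename(tmp_path, final_path)
```

The consuming side then only has to ignore dot-prefixed filenames when listing the directory.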