Mark, it is my understanding that S3 does not publish an object's key until the write to S3 is complete. With that in mind, would you comment on the necessity of employing PutS3Object=>SQS=>GetSQS=>EvaluateXPath=>FetchS3 if that is true? If the file is only published to consumers once it is fully written, under what circumstance might *any* read operation attempt to read an S3 object before it is fully there?
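For reference on that chain: the message GetSQS pulls down is an S3 event notification, which is a JSON document. Below is a minimal sketch of extracting the bucket and key from one, assuming the standard `Records` format of S3 event notifications; the bucket and key names are made up for illustration.

```python
import json

# Example S3 event notification body as delivered to SQS when an
# object finishes being written (hypothetical bucket/key names).
notification = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "my-bucket"},
            "object": {"key": "myFile.csv", "size": 1024}
        }
    }]
})

def extract_objects(body):
    """Pull (bucket, key) pairs out of an S3 event notification body."""
    msg = json.loads(body)
    return [(r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
            for r in msg.get("Records", [])]

print(extract_objects(notification))  # [('my-bucket', 'myFile.csv')]
```

In a NiFi flow this extraction step is what sits between GetSQS and FetchS3, however it is implemented.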
Thanks in advance for helping me better understand.

-Jim

On Wed, Aug 16, 2017 at 10:22 AM, Mark Payne <[email protected]> wrote:

> Andy,
>
> The ScanAttribute processor allows you to match one or more attributes against a dictionary.
>
> Consuming data that is still being written is always a tough problem to tackle. We've seen people take many different approaches to this. One approach is to have the producer of the data use a "dot naming" convention, where they write to a file named .myFile.csv and then rename it to myFile.csv when done. This is often the easiest approach if you control the producers as well.
>
> A more S3-centric approach is to configure the S3 bucket so that when data is finished being written to the bucket, S3 can send a notification to SQS. Then you can use GetSQS to get this notification, use EvaluateXPath (for instance) to extract the information needed, and then use FetchS3.
>
> Thanks
> -Mark
>
> On Aug 16, 2017, at 10:13 AM, Andy Loughran <[email protected]> wrote:
>
> Hi Mark,
>
> Yeah, I think that's what I have now. The issue being that I could end up with a duplicate of a file.
>
> I guess I could use the DetectDuplicate processor to make sure that I de-dupe the FlowFiles before I increment the counter. The issue here is that I want the latest available FlowFile to replace one if it exists (users could update a file's contents before a batch is complete).
>
> Given there are 5 'types', is there a processor that allows me to match a 'type' attribute against a dictionary?
>
> On Wed, 16 Aug 2017 at 15:07 Mark Payne <[email protected]> wrote:
>
>> Hi Andy and welcome to the community!
>>
>> I think that what you're doing here seems very reasonable. If you want to wait for 5 'like flowfiles' instead of just 5 flowfiles, you should be able to use the "Signal Counter Name" of the Wait processor.
>> For example, if your UpdateAttribute processor creates a 'type' and a 'batch' attribute, you can set the Wait processor's Signal Counter Name to "${type}" or to "${type}${batch}", depending on how you want to group them together. This will wait until you reach 5 flowfiles with the same "type" attribute (or the same combination of "type" and "batch" attributes), according to what you set as the Signal Counter Name.
>>
>> Does this make sense?
>>
>> Thanks
>> -Mark
>>
>> > On Aug 16, 2017, at 9:55 AM, Andy Loughran <[email protected]> wrote:
>> >
>> > Hey everyone,
>> >
>> > This is my first post.
>> >
>> > I'm building out a pipeline with NiFi, but am stuck on an architectural decision around some fairly basic design. I think I'm stuck because I'm operating on the wrong paradigm, but the application receiving my flow is the limitation in this context.
>> >
>> > I'm using ListS3 to poll for CSV files. There need to be 5 different types of file uploaded with a unique batch identifier for them to be released. I'm using UpdateAttribute to rip the type and batch from the filename, then using Wait to hold the batch.
>> >
>> > At the moment, though, I'm holding until a batch has 5 files, rather than 5 files with each attribute type matching the expected types.
>> >
>> > Is this the wrong way to be thinking about this problem, or does this sound like a good use case for NiFi, but using a better combination of processors? If anyone could give me guidance or point me toward an example template for batch processing, I'd be grateful.
>> >
>> > Look forward to helping out in the community where I can.
>> >
>> > Thanks,
>> >
>> > Andy
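The counter behaviour Mark describes can be sketched outside NiFi. This is a toy model of the Wait/Notify signal counting, not NiFi's actual implementation; the counter name mimics a "${type}${batch}" expression with made-up values, and the target count of 5 follows the thread.

```python
from collections import defaultdict

TARGET_SIGNAL_COUNT = 5  # five files per batch, per the thread

counters = defaultdict(int)

def signal(counter_name):
    """Toy model of Wait/Notify: increment the named counter and
    report whether it has reached the target, i.e. whether the
    batch would be released."""
    counters[counter_name] += 1
    return counters[counter_name] >= TARGET_SIGNAL_COUNT

# Five flowfiles whose "${type}${batch}" evaluates to the same
# counter name (hypothetical type "invoices", batch "42"):
results = [signal("invoices42") for _ in range(5)]
print(results)  # [False, False, False, False, True]
```

Flowfiles with a different type or batch would feed a different counter, so they never release each other's batch.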

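For completeness, the "dot naming" hand-off Mark mentions can be sketched as a producer-side routine. This is a minimal sketch assuming a POSIX filesystem, where a rename within one filesystem is atomic; the function name and paths are hypothetical.

```python
import os

def publish_atomically(data: bytes, final_path: str) -> None:
    """Write to a hidden '.name' temporary file, then rename it to
    its final name, so a poller never sees a half-written file
    (rename is atomic on POSIX filesystems)."""
    directory, name = os.path.split(final_path)
    tmp_path = os.path.join(directory, "." + name)
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before publishing
    os.rename(tmp_path, final_path)
```

The consuming side then only has to ignore dot-prefixed filenames when listing the directory.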