Hey Mark,

Really appreciate your thoughts - thank you for taking the time.

The issue isn't that we can't wait for the files to be written (we're not
dealing with huge volumes) - but that a 'batch ready' status is dependent
on five parties all uploading separate files.  Once we have the complete
set, then we can begin.
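
To sketch the gating logic I'm after (the five type names below are placeholders, not our real ones):

```python
# Illustrative sketch only: a batch is "ready" once every one of the five
# expected file types has arrived for that batch id.
REQUIRED_TYPES = {"orders", "customers", "products", "payments", "refunds"}

def batch_ready(received_files):
    """received_files: iterable of (batch_id, file_type) tuples."""
    seen = {}
    for batch_id, file_type in received_files:
        seen.setdefault(batch_id, set()).add(file_type)
    # A batch is ready only when its set of types covers all required types.
    return {batch: types >= REQUIRED_TYPES for batch, types in seen.items()}

uploads = [
    ("batch-001", "orders"), ("batch-001", "customers"),
    ("batch-001", "products"), ("batch-001", "payments"),
    ("batch-001", "refunds"),
    ("batch-002", "orders"),
]
print(batch_ready(uploads))  # batch-001 is complete, batch-002 is not
```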

Is the 'dot naming' convention only workable when the upload process is user-controlled?

I'm still unsure how to manage a previously ingested FlowFile; if a new
duplicate is ingested, how do I overwrite the existing file, given that
it'll have its own UUID - or do I need to use a processor to work this one
out?
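
Roughly the behaviour I'm after, keyed on the file's logical identity from the filename rather than on the FlowFile's UUID (the filename convention below is made up for illustration):

```python
import uuid

# Cache keyed on the *logical* identity (batch, type) parsed from the
# filename, not on the FlowFile's UUID, so a re-upload of the same logical
# file replaces the earlier entry instead of sitting alongside it.
cache = {}  # (batch, file_type) -> (flowfile_uuid, filename)

def ingest(filename):
    stem = filename.rsplit(".", 1)[0]        # e.g. "batch001_orders"
    batch, file_type = stem.split("_", 1)    # assumed naming convention
    flowfile_uuid = str(uuid.uuid4())        # every ingest gets a fresh UUID
    cache[(batch, file_type)] = (flowfile_uuid, filename)  # overwrite

ingest("batch001_orders.csv")
first_uuid = cache[("batch001", "orders")][0]
ingest("batch001_orders.csv")                # user re-uploads the same file
second_uuid = cache[("batch001", "orders")][0]
assert first_uuid != second_uuid             # new FlowFile replaced the old
assert len(cache) == 1                       # still one logical entry
```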

Regards,

Andy

On Wed, 16 Aug 2017 at 15:22 Mark Payne <[email protected]> wrote:

> Andy,
>
> The ScanAttribute processor allows you to match 1 or more attributes
> against a dictionary.
>
> Consuming data that is still being written is always a tough problem to
> tackle. We've seen people
> take many different approaches to this. One approach is to have the
> producer of the data use a
> "dot naming" convention, where they write to a file named .myFile.csv and
> then rename it to myFile.csv when done. This is often the easiest approach
> if you control the producers as well.
>
> A more S3-centric approach is to configure the S3 bucket so that when data
> is finished being
> written to the bucket, S3 can send a notification to SQS. Then you can use
> GetSQS to get this notification, use a processor such as EvaluateJsonPath
> (S3 event notifications are JSON) to extract the information needed, and
> then use FetchS3Object.
>
> Thanks
> -Mark
>
>
> On Aug 16, 2017, at 10:13 AM, Andy Loughran <[email protected]> wrote:
>
> Hi Mark,
>
> Yeah, I think that's what I have now.  The issue being that I could end up
> with a duplicate of a file.
>
> I guess I could use the DetectDuplicate processor to make sure that I
> de-dupe the FlowFiles before I increment the counter.  The issue here is
> that I want the latest FlowFile to replace an existing one when it's a
> duplicate (users could update a file's contents before a batch is complete).
>
> Given there are 5 'types', is there a processor that allows me to match a
> 'type' attribute against a dictionary?
>
> On Wed, 16 Aug 2017 at 15:07 Mark Payne <[email protected]> wrote:
>
>> Hi Andy and welcome to the community!
>>
>> I think that what you're doing here seems very reasonable. If you want to
>> wait for 5 'like flowfiles' instead of
>> just 5 flowfiles, you should be able to use the "Signal Counter Name" of
>> the Wait processor. For example,
>> if your UpdateAttribute processor creates a 'type' and a 'batch'
>> attribute, you can set the Wait processor's
>> Signal Counter Name to "${type}" or to "${type}${batch}", depending on
>> how you want to group them together.
>> This will wait until you reach 5 flowfiles with the same "type" attribute
>> (or combination of "type" and "batch" attributes),
>> according to what you set as the Signal Counter Name.
>>
>> Does this make sense?
>>
>> Thanks
>> -Mark
>>
>> > On Aug 16, 2017, at 9:55 AM, Andy Loughran <[email protected]> wrote:
>> >
>> > Hey everyone,
>> >
>> > This is my first post.
>> >
>> > I'm building out a pipeline with NiFi, but am stuck on an architectural
>> decision around some fairly basic design.  I think I'm stuck because I'm
>> operating on the wrong paradigm, but the application receiving my flow is
>> the limitation in this context.
>> >
>> > I'm using ListS3 to poll for CSV files.  Five different types of file,
>> sharing a unique batch identifier, need to be uploaded before the batch can
>> be released.  I'm using UpdateAttribute to extract the type and batch from
>> the filename, then the Wait processor to hold the batch.
>> >
>> > At the moment though, I'm holding until a batch has 5 files, rather
>> than 5 files with each attribute type matching the expected types.
>> >
>> > Is this the wrong way to be thinking about this problem, or does this
>> sound like a good use case for NiFi, just needing a better combination of
>> processors?  If anyone could give me guidance or point me toward an example
>> template for a batch process, I'd be grateful.
>> >
>> > Look forward to helping out in the community where I can.
>> >
>> > Thanks,
>> >
>> > Andy
>>
>>
>
