Re: Only get file when a set exists.

Martijn Dekkers Wed, 30 May 2018 18:32:02 -0700

Hi Koji, Many thanks for your continued assistance!


> - 1 file per second is relatively low in terms of traffic, it should
> be processed fine with 1 thread
> - A flow like this, which is stateful across different parts of the
> flow works at best with single thread, because using multiple threads
> would cause race condition or concurrency issue if there's any
> implementation error
>

Yes, we had similar thoughts.


> - Based on above, I strongly recommend to NOT increase "concurrent
> tasks". If you see FlowFiles staying in a wait queue, then there must
> be different issue
>

We don't see many flowfiles stuck in a wait queue. I ran a test over a few
hours yesterday that simulates the way in which these files would appear
(we would have 4 of "ext1" show up every second, and the "ext2" can show up
a few seconds later, and not always in the same order), and we found perhaps
6 flowfiles stuck in a wait queue.


> - Also, using concurrent tasks number like 400 is too much in general
> for all processors. I recommend to increment it as 2, 3, 4 .. up to 8
> or so, only if you see the clear benefit by doing so
>

Indeed, thanks for the suggestion. Once we have the logic finished and
tested we will have to optimise this Flow. The next step is to try to load
the required processors into MiNiFi, as this will be running on many
systems with limited capacity. If we don't manage with MiNiFi, we will
still be good, but we prefer to have the smaller footprint and ease of
management we can obtain with MiNiFi.


> - The important part of this flow is extracting 'groupId' and 'type'
> from file names. Regular Expression needs to be configured properly.
> - I recommend using https://regex101.com/ to test your Regular
> Expression to see whether it can extract correct groupId and type
>

Yes, we have tested our RegExes for this extensively.


>
> Lastly, regardless of how many files should be there for 'ext1' and
> 'ext2', the flow structure is simple as below.
> Let's say there should be 8 files to start processing those.
> 4 x ex1, and 4 ex2 in your case, but let's think it as 8 file types.
> And I assume the types are known, meaning, static, not dynamically change.
>

Correct, the format is <groupID><type>.<ext> where:

- groupId is unique for each set of 8
- type has 4 variants (ab, cd, ef, gh), the same type-set for each ext
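
To make the naming scheme concrete, here is a minimal Python sketch of the
kind of pattern we mean. It is only an illustration, not our actual
expression: the underscore separator and the literal extension names "ext1"
and "ext2" are assumptions, and parse_filename is a hypothetical helper.

```python
import re

# Assumed layout: <groupId>_<type>.<ext>
# The "_" separator and the literal "ext1"/"ext2" names are placeholders.
FILENAME_RE = re.compile(
    r"(?P<group_id>.+)_(?P<type>ab|cd|ef|gh)\.(?P<ext>ext1|ext2)"
)


def parse_filename(name):
    """Return (group_id, type, ext), or None if the name does not match."""
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        return None
    return m.group("group_id"), m.group("type"), m.group("ext")


if __name__ == "__main__":
    for candidate in ("file_123_456_ab.ext1", "file_123_456_zz.ext1"):
        print(candidate, "->", parse_filename(candidate))
```

Using fullmatch rejects names with trailing characters, which matters when
every file in a set has to be identified exactly.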

> So, the rule is, "a set of files consists of 8 files, and a set should
> wait to be processed until all 8 files are ready", that's all.
>

For our use case it is important that we have exact "positive
identification" of each file.


> Then, the flow should be designed like below:
> 1. List files, each file will be sent as a FlowFile
>

Correct - we have several different listfiles for other sections of the
flow, we are actually collecting many different sets, all variants of the
above. However, those are far simpler (sets of 2 - ext1 and ext2 only)


> 2. Extract groupId and type from filename
>

Correct


> 3. Route FlowFiles into two branches, let's call these 'Notify' branch
> and 'Wait' branch, and pass only 1 type for a set to Wait-branch, and
> the rest 7 types to Notify-branch
>

OK, we currently route all "ext1" files to the Notify branch and all "ext2"
files to the Wait branch.


> At Notify branch (for the rest 7 types FlowFile, e.g. type 2, 3, 4 ... 8)
>

As mentioned, we only have 4 distinct types.


> 1. Notify that the type for a group has arrived.
> 2. Discard the FlowFile, because there's nothing to do with it in this
> branch
>



> At Wait branch (for the type 1 FlowFile)
> 1. Wait for type 2 for the groupId.
> 2. Wait for type 3 for the groupId, type 4, 5 and so on
> 3. After passing Wait for type 8, it can guarantee that all 8 files
> are available (unless there is any other program deleting those)
> 4. Get actual file content using FetchFile, and process it
>

Apart from having the same 4 types for each extension (rather than 8
distinct types), this is configured as you describe.
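
For reference, the rule itself ("release a set only once all 8 files have
been seen") can be modelled outside NiFi in a few lines of Python. This is
only an in-memory, single-threaded sketch of the rule, not how the
Wait/Notify processors or the DistributedMapCache work internally;
SetTracker and the "ext1"/"ext2" names are hypothetical.

```python
from collections import defaultdict

TYPES = ("ab", "cd", "ef", "gh")
EXTS = ("ext1", "ext2")  # placeholder extension names
# A complete set is every (type, ext) combination: 4 x 2 = 8 files.
EXPECTED = frozenset((t, e) for t in TYPES for e in EXTS)


class SetTracker:
    """Tracks which members of each group have arrived (single-threaded)."""

    def __init__(self):
        self._seen = defaultdict(set)

    def arrive(self, group_id, file_type, ext):
        """Record one file; return True when this file completes its set."""
        key = (file_type, ext)
        if key not in EXPECTED:
            raise ValueError(f"unexpected member {key!r} for group {group_id!r}")
        seen = self._seen[group_id]
        seen.add(key)  # duplicates are harmless, the set ignores them
        if seen == EXPECTED:
            # Forget the group so a later arrival would start a fresh set.
            del self._seen[group_id]
            return True
        return False

    def pending(self):
        """Map each incomplete group to the members still missing."""
        return {g: sorted(EXPECTED - s) for g, s in self._seen.items()}


if __name__ == "__main__":
    tracker = SetTracker()
    for ext in EXTS:
        for file_type in TYPES:
            done = tracker.arrive("file_123_456", file_type, ext)
            print(file_type, ext, "complete" if done else "waiting")
```

Arrival order does not matter here, which matches our case where "ext2"
files can show up later and not always in the same order. The pending()
view is the equivalent of looking for flowfiles left in a wait queue.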


> I hope it helps.
>
>
It does, thanks. I will extract this portion of the flow, sanitise, and
send it along - easier to see than to describe :)



> Thanks,
> Koji


Thank you so much once again!

Martijn




>
> On Wed, May 30, 2018 at 6:10 PM, Martijn Dekkers <[email protected]> wrote:
> > Hey Pierre,
> >
> > Yes, we suspected as much, but we are only seeing this with the Wait
> > processor. Possibly because that is the only "blocking" we have in this
> > flow.
> >
> > Thanks for the clarification, much appreciated!
> >
> > Martijn
> >
> > On 30 May 2018 at 10:30, Pierre Villard <[email protected]> wrote:
> >>
> >> I'll let Koji give more information about the Wait/Notify, he is clearly
> >> the expert here.
> >>
> >> I'm just jumping in regarding your "and when viewing the queue, the dialog
> >> states that the queue is empty.". You're seeing this behavior because, even
> >> though the UI shows some flow files in the queue, the flow files are
> >> currently locked in the session of the running processor and you won't see
> >> flow files currently processed in a session when listing a queue. If you
> >> stop the processor, the session will be closed and you'll be able to list
> >> the queue and see the flow files.
> >>
> >> I recall discussions in the past to improve the UX for this. Not sure we
> >> have a JIRA for it though...
> >>
> >> Pierre
> >>
> >> 2018-05-30 8:26 GMT+02:00 Martijn Dekkers <[email protected]>:
> >>>
> >>> Hi Koji,
> >>>
> >>> Thank you for responding. I had adjusted the run schedule to closely
> >>> mimic our environment. We are expecting about 1 file per second or so.
> >>> We are also seeing some random "orphans" sitting in a wait queue every
> >>> now and again that don't trigger a debug message, and when viewing the
> >>> queue, the dialog states that the queue is empty.
> >>>
> >>> We found the random "no signal found" issue to be significantly decreased
> >>> when we increase the "concurrent tasks" to something large - currently set
> >>> to 400 for all wait and notify processors.
> >>>
> >>> I do need to mention that our requirements had changed since you made the
> >>> template, in that we are looking for a set of 8 files - 4 x "ext1" and 4 x
> >>> "ext2" both with the same pattern: <groupid><type (4 of these)>.ext1 or ext2
> >>>
> >>> We found that the best way to make this work was to add another
> >>> wait/notify pair, each processor coming after the ones already there, with a
> >>> simple counter against the groupID.
> >>>
> >>> I will export a template for you, many thanks for your help - I just need
> >>> to spend some time sanitising the various fields etc.
> >>>
> >>> Many thanks once again for your kind assistance.
> >>>
> >>> Martijn
> >>>
> >>> On 30 May 2018 at 08:14, Koji Kawamura <[email protected]> wrote:
> >>>>
> >>>> Hi Martjin,
> >>>>
> >>>> In my template, I was using 'Run Schedule' as '5 secs' for the Wait
> >>>> processors to avoid overusing CPU resource. However, if you expect
> >>>> more throughput, it should be lowered.
> >>>> Changed Run Schedule to 0 sec, and I passed 100 group of files (400
> >>>> files because 4 files are 1 set in my example), they reached to the
> >>>> expected goal of the flow without issue.
> >>>>
> >>>> If you can share your flow and example input file volume (hundreds of
> >>>> files were fine in my flow), I may be able to provide more useful
> >>>> comment.
> >>>>
> >>>> Thanks,
> >>>> Koji
> >>>>
> >>>> On Wed, May 30, 2018 at 2:08 PM, Martijn Dekkers
> >>>> <[email protected]> wrote:
> >>>> > Hi Koji,
> >>>> >
> >>>> > I am seeing many issues to get this to run reliably. When running this with
> >>>> > a few flowfiles at a time, and stepping through by switching processors on
> >>>> > and off it works mostly fine, but running this at volume I receive many
> >>>> > errors about "no release signal found"
> >>>> >
> >>>> > I have tried to fix this in a few different ways, but the issue keeps coming
> >>>> > back. This is also not consistent at all - different wait processors will
> >>>> > block different flowfiles at different times, without changing any
> >>>> > configuration. Stop/Start the flow, and different queues will fill up. Do
> >>>> > you have any ideas what could be causing this behavior? I checked the
> >>>> > DistributedMapCache Server/Client components, and they all appear to be
> >>>> > working OK.
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> > Martijn
> >>>> >
> >>>> > On 28 May 2018 at 05:11, Koji Kawamura <[email protected]> wrote:
> >>>> >>
> >>>> >> Hi Martin,
> >>>> >>
> >>>> >> Alternative approach is using Wait/Notify processors.
> >>>> >> I have developed similar flow using those before, and it will work
> >>>> >> with your case I believe.
> >>>> >> A NiFi flow template is available here.
> >>>> >> https://gist.github.com/ijokarumawak/06b3b071eeb4d10d8a27507981422edd
> >>>> >>
> >>>> >> Hope this helps,
> >>>> >> Koji
> >>>> >>
> >>>> >>
> >>>> >> On Sun, May 27, 2018 at 11:48 PM, Andrew Grande <[email protected]> wrote:
> >>>> >> > Martijn,
> >>>> >> >
> >>>> >> > Here's an idea you could explore. Have the ListFile processor work as usual
> >>>> >> > and create a custom component (start with a scripting one to prototype)
> >>>> >> > grouping the filenames as needed. I don't know if the number of files in a
> >>>> >> > set is different every time, so trying to be more robust.
> >>>> >> >
> >>>> >> > Once you group and count the set, you can transfer the names to the success
> >>>> >> > relationship. Ignore otherwise and wait until the set is full.
> >>>> >> >
> >>>> >> > Andrew
> >>>> >> >
> >>>> >> >
> >>>> >> > On Sun, May 27, 2018, 7:29 AM Martijn Dekkers
> >>>> >> > <[email protected]>
> >>>> >> > wrote:
> >>>> >> >>
> >>>> >> >> Hello all,
> >>>> >> >>
> >>>> >> >> I am trying to work out an issue with little success.
> >>>> >> >>
> >>>> >> >> I need to ingest files generated by some application. I can only ingest
> >>>> >> >> these files when a specific set exists. For example:
> >>>> >> >>
> >>>> >> >> file_123_456_ab.ex1
> >>>> >> >> file_123_456_cd.ex1
> >>>> >> >> file_123_456_ef.ex1
> >>>> >> >> file_123_456_gh.ex1
> >>>> >> >> file_123_456.ex2
> >>>> >> >>
> >>>> >> >> Only when a set like that exists should I pick them up into the Flow. The
> >>>> >> >> parts I am looking for to "group" would be "ab.ex1", "cd.ex1", "ef.ex1",
> >>>> >> >> "gh.ex1", ".ex2".
> >>>> >> >>
> >>>> >> >> I tried to do this with some expression, but couldn't work it out.
> >>>> >> >>
> >>>> >> >> What would be the best way to achieve this?
> >>>> >> >>
> >>>> >> >> Many thanks!
> >>>> >
> >>>> >
> >>>
> >>>
> >>
> >
>
