Re: Only get file when a set exists.

2018-06-06 Thread Koji Kawamura
JFYI, updated the template on Gist.
https://gist.github.com/ijokarumawak/06b3b071eeb4d10d8a27507981422edd

I personally prefer the 2nd work-around, setting 'Releasable FlowFile
Count' to 0, because the resulting FlowFiles will have more informative
attributes as evidence.


Re: Only get file when a set exists.

2018-06-06 Thread Koji Kawamura
Hi Martijn,

Thanks for sharing new information.
Here are a couple of things to help with debugging.

# Debug Notify branch
1. Stop the Wait branch so that you can debug the Notify branch on its
own. The Wait processor deletes a cache entry when it thinks there is no
need to keep it any longer.
2. Make sure Notify has both its 'success' and 'failure' relationships
connected. Connect both to LogAttribute or some other stopped
processor, so that FlowFiles stay in the queue between it and Notify.
3. Confirm that every FlowFile that arrived is passed to the 'success'
relationship. This confirms that Notify actually sent every notification
with the expected notification identifier.
4. Check that all expected keys exist in Redis.
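
Step 4 can also be scripted. The following is only a minimal sketch, assuming the cache keys are the plain signal identifiers (for example the groupID values) and that the client object offers an `exists(key)` call like the one in redis-py. `InMemoryCache` and `find_missing_signals` are hypothetical names used only for this illustration; with a real Redis you would pass your own client object instead.

```python
class InMemoryCache:
    """Hypothetical stand-in for a Redis client, for trying the check offline."""

    def __init__(self, keys):
        self._keys = set(keys)

    def exists(self, key):
        # Like redis-py's exists(): the number of the given keys that exist.
        return 1 if key in self._keys else 0


def find_missing_signals(client, signal_ids):
    """Return the signal identifiers that have no key in the cache."""
    return [sid for sid in signal_ids if not client.exists(sid)]


cache = InMemoryCache({"123", "125"})
print(find_missing_signals(cache, ["123", "124", "125"]))  # prints ['124']
```

Any identifier reported here is one for which Notify passed the FlowFile to 'success' but no key is present, which is the symptom described in this thread.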

# Debug Notify branch 2
If you can't stop the Wait branch for some reason, add
FetchDistributedMapCache right after Notify. This verifies that the
signal was successfully written to Redis.

If we can confirm that the Notify branch works without issue, then I'd
suspect that the key, once written, somehow gets deleted by the Wait
processors.

There is a race condition between Wait and Notify.
My hypothesis is as follows, assuming 'type1.ext1' is routed to the Wait
branch and the other 7 types are routed to the Notify branch:
1. Notify receives 'type1.ext2', and the signal becomes {type1.ext2=1}.
2. The Wait for 'type1.ext2' gets the signal, whose counts are {type1.ext2=1}.
3. Simultaneously, the Notify for 'type2.ext1' notifies, and the signal is
updated to {type1.ext2=1, type2.ext1=1}.
4. The Wait for 'type1.ext2' processes the signal. Since the 'type1.ext2'
count has reached the target of 1, it decrements that count to 0. Then it
deletes the key, because it thinks the signal is done: the copy it
fetched in step 2 has no counts left in it.
5. The Wait for 'type2.ext1' fetches the key, but the entry has already
been deleted, so the FlowFile gets stuck in the 'wait' relationship.

If that's the case, changing 'Release Signal Identifier' from groupId to
groupId.type can avoid the conflict.
Alternatively, setting 'Releasable FlowFile Count' to 0 can stop Wait
from deleting the cache key.
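
To make the hypothesis easier to follow, here is a small simulation of steps 1 to 5. It is a sketch of the hypothesis only, not of how NiFi actually implements Wait and Notify: the cache is a plain dict, and `notify`, `wait_fetch`, `wait_process` and `replay` are hypothetical helper names. The per-type signal identifier is modelled as groupId plus the counter name, which is also an assumption.

```python
import copy

cache = {}  # stands in for the distributed map cache: signal id -> counts


def notify(signal_id, counter):
    """Increment one counter of a signal, creating the signal if needed."""
    counts = cache.setdefault(signal_id, {})
    counts[counter] = counts.get(counter, 0) + 1


def wait_fetch(signal_id):
    """Wait reads a snapshot; Notify updates made afterwards are not in it."""
    return copy.deepcopy(cache.get(signal_id))


def wait_process(signal_id, snapshot, counter, target=1):
    """Release if the counter reached the target, based on the snapshot only."""
    if snapshot is None or snapshot.get(counter, 0) < target:
        return "wait"
    snapshot[counter] -= target
    if all(count == 0 for count in snapshot.values()):
        cache.pop(signal_id, None)  # snapshot looks finished: delete the key
    else:
        cache[signal_id] = snapshot  # otherwise write the remaining counts back
    return "success"


def replay(signal_id_for):
    """Replay steps 1-5; signal_id_for maps a counter name to a signal id."""
    cache.clear()
    a, b = "type1.ext2", "type2.ext1"
    notify(signal_id_for(a), a)                          # step 1
    snapshot = wait_fetch(signal_id_for(a))              # step 2
    notify(signal_id_for(b), b)                          # step 3
    first = wait_process(signal_id_for(a), snapshot, a)  # step 4
    latest = wait_fetch(signal_id_for(b))
    second = wait_process(signal_id_for(b), latest, b)   # step 5
    return first, second


print(replay(lambda counter: "123"))             # shared signal id (groupId)
print(replay(lambda counter: "123." + counter))  # one signal id per type
```

With the shared identifier the second Wait ends up in 'wait' because the key is gone; with one identifier per type both Waits reach 'success'.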

Thanks,
Koji




Re: Only get file when a set exists.

2018-06-05 Thread Martijn Dekkers
Hi Koji,

Some more information from debugging.

I have today deployed Redis since that gives me an easy interface to check
the existence of keys, and found that for files that end up stuck in the
wait queues, the provenance in the Notify queue shows the relevant flowfile
as having arrived, but the relevant key in Redis shows as (nil)

Files that have been processed successfully show a "good" key in Redis.

Thanks,

Martijn


Re: Only get file when a set exists.

2018-06-04 Thread Martijn Dekkers
Hello Koji,

Many thanks, apologies for the delay in responding - I had to work on some
different tasks.

I have followed your advice and have configured a flow accordingly, and on
the whole the logic works. However, I still see the issue where a set will
be stuck in the wait queue. I have tracked it down to the instance where
there is a longer delay between the arrival of ext1 and ext2 files. If I
pause the appropriate processor that gates the ext2 files, that set will
get stuck. If all files come through roughly at a similar time, we see no
issues, and the flow happily runs.

I am not quite sure about the best way to debug this. I have looked at the
attributes in provenance, and notice that the relevant counter for the
specific wait processor isn't updated. I am not sure how I can check the
status of the distributed map cache to see if this might be responsible.

I can share my flowfile, but I would have to email it to you directly.
Unfortunately I cannot share the flowfile publicly, and sanitising it to
the extent that I can publicly share it would be difficult.

Oh, we are using 1.6

Many thanks,

Martijn


Re: Only get file when a set exists.

2018-05-31 Thread Koji Kawamura
BTW, which version are you using? I hope it is 1.4.0 or higher. There
was an issue that would affect your use case:
https://issues.apache.org/jira/browse/NIFI-4028


Re: Only get file when a set exists.

2018-05-31 Thread Koji Kawamura
Hi Martijn,

I used the filename increment pattern based on your first filename example:
file_123_456_ab.ex1
I increment the 456 part. If the format has changed, that's fine.

Your current configuration looks like this:
- Given a filename: file_123_type3.ext1
- matched with a RegEx: ^file_(\d{3}_)(\w+)\.ext1$
- groupID will be: 123_ (including the underscore)
- counterName will be: type3

I was suggesting including the extension in the counterName.
How about changing the RegEx to:
- RegEx: ^file_(\d+)_(\w+\.ext\d)$
- groupID will be: 123
- counterName will be: type3.ext1

Then you can route type1.ext1 to the Wait branch and the other 7 to Notify.
In the Wait branch, you need 7 Wait processors.
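
The difference between the two expressions can be tried outside NiFi. This is a rough sketch using Python's `re` module, assuming the Expression Language `replaceFirst` behaves like Java's `String.replaceFirst`, which returns the input unchanged when the pattern does not match. `extract`, `OLD` and `NEW` are hypothetical names for this illustration.

```python
import re

OLD = r'^file_(\d{3}_)(\w+)\.ext1$'  # current expression
NEW = r'^file_(\d+)_(\w+\.ext\d)$'   # suggested expression


def extract(pattern, filename):
    """Mimic replaceFirst(pattern, '$1') and replaceFirst(pattern, '$2')."""
    # If the pattern does not match, re.sub returns the filename unchanged.
    group_id = re.sub(pattern, r'\1', filename, count=1)
    counter_name = re.sub(pattern, r'\2', filename, count=1)
    return group_id, counter_name


for name in ("file_123_type3.ext1", "file_123_type3.ext2"):
    print(name, extract(OLD, name), extract(NEW, name))
```

Under that assumption, the current expression only matches the .ext1 files, so for an .ext2 file both attributes would end up holding the whole filename, while the suggested expression yields 123 and type3.ext2.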

It would be faster to debug if you could share your flow template.

Thanks,
Koji

On Thu, May 31, 2018 at 3:15 PM, Martijn Dekkers  wrote:
> Thank you Koji,
>
> I have tried once again, using your updated example. Unfortunately, things
> still get stuck at the first Wait processors' wait queue.
> I did notice that the format of the files your example generates is
> different. I will try to clarify:
>
> - 8 files in total:
>
> -- file_123_type1.ext1
> -- file_123_type1.ext2
>
> -- file_123_type2.ext1
> -- file_123_type2.ext2
>
> -- file_123_type3.ext1
> -- file_123_type3.ext2
>
> -- file_123_type4.ext1
> -- file_123_type4.ext2
>
> For each set of 8 files, "file_123" increments, so the first set of 8 is
> "file_123", and the next set is "file_124" and so on.
>
> When I look at your example, I notice that at the final step (LogAttribute
> after the FetchFile set) the filenames are file_123_ number>.ex(1|2)
>
> My UpdateAttribute before the Notify branch is configured as:
> groupID - ${fileName:replaceFirst('^file_(\d{3}_)(\w+)\.ext1$','$1')}
> counterName - ${fileName:replaceFirst('^file_(\d{3}_)(\w+)\.ext1$','$2')}
>
> The UpdateAttribute before the Wait branch is configured as:
> groupID - ${fileName:replaceFirst('^file_(\d{3}_)(\w+)\.ext1$','$1')}
>
> The 4 Wait processors in the Wait branch are configured as:
> All Wait processors:
> Release Signal Identifier - ${groupID}
>
> For each individual Wait processor:
> Signal Counter Name - type1
> Signal Counter Name - type2
> Signal Counter Name - type3
> Signal Counter Name - type4
>
> I am a bit stumped. The best success we had was a configuration with a
> RouteonAttribute sending each of type1|type2|type3|type4 to their own Wait
> processor, and a similar config for the Notify branch, followed by a final
> Wait/Notify pair that simply ensures we have the correct amount of sets.
>
> This configuration did exactly what we want, but unfortunately we had random
> flowfiles stuck in the waitqueue for no apparent reason.
>
> Thanks,
>
> Martijn
>
>
>
> On 31 May 2018 at 05:23, Koji Kawamura  wrote:
>>
>> The order of arrival does not matter.
>>
>> Wait processor has 'Expiration Duration' configuration, defaults to 10
>> min. Please adjust it according to your needs, the longest period to
>> wait for a delayed file.
>> If a FlowFile exceeds the duration, it will be sent to 'expired'
>> relationship, and can be treated differently, e.g. write ERROR log
>>
>> > If we have a longer wait for a file, we'd like processing for the next
>> > groupid to still be able to continue.
>>
>> In order to achieve that, use Wait Mode = 'Transfer to wait
>> relationship', and the 'wait' relationship should be configured to use
>> FirstInFirstOutPrioritizer.
>> If FirstInFirstOutPrioritizer is not set, the same FlowFile will be
>> processed again while it blocks other FlowFiles.
>> With FirstInFirstOutPrioritizer, the processed FlowFile will be
>> re-queued at the end of wait queue.
>>
>> I've updated my example to make it more realistic, by adding delay for
>> certain set and type.
>> https://gist.github.com/ijokarumawak/06b3b071eeb4d10d8a27507981422edd
>>
>> Thanks,
>> Koji
>>
>> On Thu, May 31, 2018 at 10:56 AM, Martijn Dekkers
>>  wrote:
>> > Cool, that will make things a lot simpler. Does it matter that the ext2
>> > files arrive in random order? Sometimes there can be a very long delay
>> > in some of them showing up, and we have some concerns about the overall
>> > flow blocking. If we have a longer wait for a file, we'd like processing
>> > for the next groupid to still be able to continue.
>> >
>> > Thank you for your help (and for writing Wait/Notify!!)
>> >
>> > Martijn
>> >
>> > On 31 May 2018 at 03:49, Koji Kawamura  wrote:
>> >>
>> >> Glad to hear that was helpful.
>> >>
>> >> "4 same type for each extension", can be treated as "8 distinct types"
>> >> if an extension is included in a type.
>> >> ab.ex1, cd.ex1, ef.ex1, gh.ex1, ab.ex2, cd.ex2, ef.ex2, gh.ex2
>> >>
>> >> Then route only 'ab.ex1' (or whichever but just 1 of them) to the Wait
>> >> branch, and the rest to Notify branch.
>> >> That will simplify the flow, if I'm not missing any other requirement.
>> >>
>> >> Thanks!
>> >> Koji
>> >>
>> >> On Thu, May 31, 2018 at 10:30 AM, Martijn Dekkers
>> >>  wrote:
>> >> > Hi Koji, Many thanks 

Re: Only get file when a set exists.

2018-05-31 Thread Martijn Dekkers
Thank you Koji,

I have tried once again, using your updated example. Unfortunately, things
still get stuck at the first Wait processor's wait queue.
I did notice that the format of the files your example generates is
different. I will try to clarify:

- 8 files in total:

-- file_123_type1.ext1
-- file_123_type1.ext2

-- file_123_type2.ext1
-- file_123_type2.ext2

-- file_123_type3.ext1
-- file_123_type3.ext2

-- file_123_type4.ext1
-- file_123_type4.ext2

For each set of 8 files, "file_123" increments, so the first set of 8 is
"file_123", and the next set is "file_124" and so on.
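
The naming scheme above can be written down as a small check of which sets are complete. This is only an illustrative sketch outside NiFi, assuming exactly the eight names listed above per set; `EXPECTED`, `NAME` and `complete_sets` are hypothetical names.

```python
import re
from collections import defaultdict

# The eight members every set must contain: type1..type4 with ext1 and ext2.
EXPECTED = {f"type{t}.ext{e}" for t in range(1, 5) for e in (1, 2)}
NAME = re.compile(r'^file_(\d+)_(type\d\.ext\d)$')


def complete_sets(filenames):
    """Return the sorted set numbers for which all eight files are present."""
    seen = defaultdict(set)
    for name in filenames:
        match = NAME.match(name)
        if match:  # names that do not follow the scheme are ignored
            seen[match.group(1)].add(match.group(2))
    return sorted(g for g, members in seen.items() if members == EXPECTED)


files = [f"file_123_{member}" for member in EXPECTED]
files += ["file_124_type1.ext1", "file_124_type1.ext2"]  # incomplete set
print(complete_sets(files))  # prints ['123']
```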

When I look at your example, I notice that at the final step (LogAttribute
after the FetchFile set) the filenames are file_123_.ex(1|2)

My UpdateAttribute before the Notify branch is configured as:
groupID - ${fileName:replaceFirst('^file_(\d{3}_)(\w+)\.ext1$','$1')}
counterName - ${fileName:replaceFirst('^file_(\d{3}_)(\w+)\.ext1$','$2')}

The UpdateAttribute before the Wait branch is configured as:
groupID - ${fileName:replaceFirst('^file_(\d{3}_)(\w+)\.ext1$','$1')}

The 4 Wait processors in the Wait branch are configured as:
All Wait processors:
Release Signal Identifier - ${groupID}

For each individual Wait processor:
Signal Counter Name - type1
Signal Counter Name - type2
Signal Counter Name - type3
Signal Counter Name - type4

I am a bit stumped. The best success we had was a configuration with a
RouteOnAttribute sending each of type1|type2|type3|type4 to their own Wait
processor, and a similar config for the Notify branch, followed by a final
Wait/Notify pair that simply ensures we have the correct number of sets.

This configuration did exactly what we want, but unfortunately we had
random flowfiles stuck in the waitqueue for no apparent reason.

Thanks,

Martijn



On 31 May 2018 at 05:23, Koji Kawamura  wrote:

> The order of arrival does not matter.
>
> Wait processor has 'Expiration Duration' configuration, defaults to 10
> min. Please adjust it according to your needs, the longest period to
> wait for a delayed file.
> If a FlowFile exceeds the duration, it will be sent to 'expired'
> relationship, and can be treated differently, e.g. write ERROR log
>
> > If we have a longer wait for a file, we'd like processing for the next
> groupid to still be able to continue.
>
> In order to achieve that, use Wait Mode = 'Transfer to wait
> relationship', and the 'wait' relationship should be configured to use
> FirstInFirstOutPrioritizer.
> If FirstInFirstOutPrioritizer is not set, the same FlowFile will be
> processed again while it blocks other FlowFiles.
> With FirstInFirstOutPrioritizer, the processed FlowFile will be
> re-queued at the end of wait queue.
>
> I've updated my example to make it more realistic, by adding delay for
> certain set and type.
> https://gist.github.com/ijokarumawak/06b3b071eeb4d10d8a27507981422edd
>
> Thanks,
> Koji
>
> On Thu, May 31, 2018 at 10:56 AM, Martijn Dekkers
>  wrote:
> > Cool, that will make things a lot simpler. Does it matter that the ext2
> > files arrive in random order? Sometimes there can be a very long delay in
> > some of them showing up, and we have some concerns about the overall flow
> > blocking. If we have a longer wait for a file, we'd like processing for
> the
> > next groupid to still be able to continue.
> >
> > Thank you for your help (and for writing Wait/Notify!!)
> >
> > Martijn
> >
> > On 31 May 2018 at 03:49, Koji Kawamura  wrote:
> >>
> >> Glad to hear that was helpful.
> >>
> >> "4 same type for each extension", can be treated as "8 distinct types"
> >> if an extension is included in a type.
> >> ab.ex1, cd.ex1, ef.ex1, gh.ex1, ab.ex2, cd.ex2, ef.ex2, gh.ex2
> >>
> >> Then route only 'ab.ex1' (or whichever but just 1 of them) to the Wait
> >> branch, and the rest to Notify branch.
> >> That will simplify the flow, if I'm not missing any other requirement.
> >>
> >> Thanks!
> >> Koji
> >>
> >> On Thu, May 31, 2018 at 10:30 AM, Martijn Dekkers
> >>  wrote:
> >> > Hi Koji, Many thanks for your continued assistance!
> >> >
> >> >>
> >> >> - 1 file per second is relatively low in terms of traffic, it should
> >> >> be processed fine with 1 thread
> >> >> - A flow like this, which is stateful across different parts of the
> >> >> flow works at best with single thread, because using multiple threads
> >> >> would cause race condition or concurrency issue if there's any
> >> >> implementation error
> >> >
> >> >
> >> > Yes, we had similar thoughts.
> >> >
> >> >>
> >> >> - Based on above, I strongly recommend to NOT increase "concurrent
> >> >> tasks". If you see FlowFiles staying in a wait queue, then there must
> >> >> be different issue
> >> >
> >> >
> >> > We don't see many flowfiles stuck in a wait queue, I ran a test over a
> >> > few
> >> > hours yesterday that simulates the way in which these files would
> appear
> >> > (we
> >> > would have 4 of "ext1" show up every second, and the "ext2" can show
> up
> >> > a
> >> > few seconds later, and not always in 

Re: Only get file when a set exists.

2018-05-30 Thread Koji Kawamura
The order of arrival does not matter.

The Wait processor has an 'Expiration Duration' setting, which defaults
to 10 min. Please adjust it to the longest period you are willing to
wait for a delayed file.
If a FlowFile exceeds the duration, it is sent to the 'expired'
relationship and can be treated differently, e.g. by writing an ERROR log.

> If we have a longer wait for a file, we'd like processing for the next 
> groupid to still be able to continue.

In order to achieve that, use Wait Mode = 'Transfer to wait
relationship', and the 'wait' relationship should be configured to use
FirstInFirstOutPrioritizer.
If FirstInFirstOutPrioritizer is not set, the same FlowFile will be
processed again while it blocks other FlowFiles.
With FirstInFirstOutPrioritizer, the processed FlowFile will be
re-queued at the end of wait queue.
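
The effect of re-queuing at the end can be illustrated with a toy model. This is only a sketch of the idea described above, not NiFi's actual scheduling code; the function name run_wait_queue, the group names, and the max_steps limit are all made up for illustration.

```python
from collections import deque


def run_wait_queue(queue, ready, requeue_at_end, max_steps=10):
    """Toy model of a Wait processor polling its 'wait' queue.

    queue: FlowFile names in arrival order.
    ready: names whose release signal has already arrived.
    requeue_at_end: True models FirstInFirstOutPrioritizer, where a
        FlowFile that still has to wait goes to the back of the queue.
    Returns the names released within max_steps polls.
    """
    pending = deque(queue)
    released = []
    for _ in range(max_steps):
        if not pending:
            break
        flowfile = pending.popleft()
        if flowfile in ready:
            released.append(flowfile)
        elif requeue_at_end:
            pending.append(flowfile)      # goes to the back of the queue
        else:
            pending.appendleft(flowfile)  # picked up again on the next poll
    return released


queue = ["group_123", "group_124", "group_125"]
ready = {"group_124", "group_125"}  # group_123 is still missing a file

print(run_wait_queue(queue, ready, requeue_at_end=False))  # []
print(run_wait_queue(queue, ready, requeue_at_end=True))   # ['group_124', 'group_125']
```

In this model the delayed group_123 keeps the head of the queue when it is not re-queued at the end, so the later groups never get a turn; with re-queuing, they are released while group_123 keeps waiting.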

I've updated my example to make it more realistic, by adding delay for
certain set and type.
https://gist.github.com/ijokarumawak/06b3b071eeb4d10d8a27507981422edd

Thanks,
Koji

On Thu, May 31, 2018 at 10:56 AM, Martijn Dekkers
 wrote:
> Cool, that will make things a lot simpler. Does it matter that the ext2
> files arrive in random order? Sometimes there can be a very long delay in
> some of them showing up, and we have some concerns about the overall flow
> blocking. If we have a longer wait for a file, we'd like processing for the
> next groupid to still be able to continue.
>
> Thank you for your help (and for writing Wait/Notify!!)
>
> Martijn
>
> On 31 May 2018 at 03:49, Koji Kawamura  wrote:
>>
>> Glad to hear that was helpful.
>>
>> "4 same type for each extension", can be treated as "8 distinct types"
>> if an extension is included in a type.
>> ab.ex1, cd.ex1, ef.ex1, gh.ex1, ab.ex2, cd.ex2, ef.ex2, gh.ex2
>>
>> Then route only 'ab.ex1' (or whichever but just 1 of them) to the Wait
>> branch, and the rest to Notify branch.
>> That will simplify the flow, if I'm not missing any other requirement.
>>
>> Thanks!
>> Koji
>>
>> On Thu, May 31, 2018 at 10:30 AM, Martijn Dekkers
>>  wrote:
>> > Hi Koji, Many thanks for your continued assistance!
>> >
>> >>
>> >> - 1 file per second is relatively low in terms of traffic, it should
>> >> be processed fine with 1 thread
>> >> - A flow like this, which is stateful across different parts of the
>> >> flow works at best with single thread, because using multiple threads
>> >> would cause race condition or concurrency issue if there's any
>> >> implementation error
>> >
>> >
>> > Yes, we had similar thoughts.
>> >
>> >>
>> >> - Based on above, I strongly recommend to NOT increase "concurrent
>> >> tasks". If you see FlowFiles staying in a wait queue, then there must
>> >> be different issue
>> >
>> >
>> > We don't see many flowfiles stuck in a wait queue, I ran a test over a
>> > few
>> > hours yesterday that simulates the way in which these files would appear
>> > (we
>> > would have 4 of "ext1" show up every second, and the "ext2" can show up
>> > a
>> > few seconds later, and not always in the same order) and we found
>> > perhaps 6
>> > flowfiles stuck in a wait queue.
>> >
>> >>
>> >> - Also, using concurrent tasks number like 400 is too much in general
>> >> for all processors. I recommend to increment it as 2, 3, 4 .. up to 8
>> >> or so, only if you see the clear benefit by doing so
>> >
>> >
>> > Indeed, thanks for the suggestion. Once we have the logic finished and
>> > tested we will have to optimise this Flow. The next step is to try to
>> > load
>> > the required processors into MiNiFy, as this will be running on many
>> > systems
>> > with limited capacity. If we don't manage with MiNiFy, we will still be
>> > good, but we prefer to have the smaller footprint and ease of management
>> > we
>> > can obtain with MiNiFy.
>> >
>> >>
>> >> - The important part of this flow is extracting 'groupId' and 'type'
>> >> from file names. Regular Expression needs to be configured properly.
>> >> - I recommend using https://regex101.com/ to test your Regular
>> >> Expression to see whether it can extract correct groupId and type
>> >
>> >
>> > Yes, we have tested our RegExes for this extensively
>> >
>> >>
>> >>
>> >> Lastly, regardless of how many files should be there for 'ext1' and
>> >> 'ext2', the flow structure is simple as below.
>> >> Let's say there should be 8 files to start processing those.
>> >> 4 x ex1, and 4 ex2 in your case, but let's think it as 8 file types.
>> >> And I assume the types are known, meaning, static, not dynamically
>> >> change.
>> >
>> >
>> > Correct, the format is . where:
>> >
>> > - groupId is unique for each set of 8
>> > - type has 4 variants (ab, cd, ef, gh), the same type-set for each ext
>> >
>> >> So, the rule is, "a set of files consists of 8 files, and a set should
>> >> wait to be processed until all 8 files are ready", that's all.
>> >
>> >
>> > For our use case it is important that we have positive identification
>> > that
>> > we have exact "positive identification" of each file.
>> >
>> >>
>> >> 

Re: Only get file when a set exists.

2018-05-30 Thread Martijn Dekkers
Cool, that will make things a lot simpler. Does it matter that the ext2
files arrive in random order? Sometimes there can be a very long delay in
some of them showing up, and we have some concerns about the overall flow
blocking. If we have a longer wait for a file, we'd like processing for the
next groupid to still be able to continue.

Thank you for your help (and for writing Wait/Notify!!)

Martijn

On 31 May 2018 at 03:49, Koji Kawamura  wrote:

> Glad to hear that was helpful.
>
> "4 same type for each extension", can be treated as "8 distinct types"
> if an extension is included in a type.
> ab.ex1, cd.ex1, ef.ex1, gh.ex1, ab.ex2, cd.ex2, ef.ex2, gh.ex2
>
> Then route only 'ab.ex1' (or whichever but just 1 of them) to the Wait
> branch, and the rest to Notify branch.
> That will simplify the flow, if I'm not missing any other requirement.
>
> Thanks!
> Koji
>
> On Thu, May 31, 2018 at 10:30 AM, Martijn Dekkers
>  wrote:
> > Hi Koji, Many thanks for your continued assistance!
> >
> >>
> >> - 1 file per second is relatively low in terms of traffic, it should
> >> be processed fine with 1 thread
> >> - A flow like this, which is stateful across different parts of the
> >> flow works at best with single thread, because using multiple threads
> >> would cause race condition or concurrency issue if there's any
> >> implementation error
> >
> >
> > Yes, we had similar thoughts.
> >
> >>
> >> - Based on above, I strongly recommend to NOT increase "concurrent
> >> tasks". If you see FlowFiles staying in a wait queue, then there must
> >> be different issue
> >
> >
> > We don't see many flowfiles stuck in a wait queue, I ran a test over a
> few
> > hours yesterday that simulates the way in which these files would appear
> (we
> > would have 4 of "ext1" show up every second, and the "ext2" can show up a
> > few seconds later, and not always in the same order) and we found
> perhaps 6
> > flowfiles stuck in a wait queue.
> >
> >>
> >> - Also, using concurrent tasks number like 400 is too much in general
> >> for all processors. I recommend to increment it as 2, 3, 4 .. up to 8
> >> or so, only if you see the clear benefit by doing so
> >
> >
> > Indeed, thanks for the suggestion. Once we have the logic finished and
> > tested we will have to optimise this Flow. The next step is to try to
> load
> > the required processors into MiNiFy, as this will be running on many
> systems
> > with limited capacity. If we don't manage with MiNiFy, we will still be
> > good, but we prefer to have the smaller footprint and ease of management
> we
> > can obtain with MiNiFy.
> >
> >>
> >> - The important part of this flow is extracting 'groupId' and 'type'
> >> from file names. Regular Expression needs to be configured properly.
> >> - I recommend using https://regex101.com/ to test your Regular
> >> Expression to see whether it can extract correct groupId and type
> >
> >
> > Yes, we have tested our RegExes for this extensively
> >
> >>
> >>
> >> Lastly, regardless of how many files should be there for 'ext1' and
> >> 'ext2', the flow structure is simple as below.
> >> Let's say there should be 8 files to start processing those.
> >> 4 x ex1, and 4 ex2 in your case, but let's think it as 8 file types.
> >> And I assume the types are known, meaning, static, not dynamically
> change.
> >
> >
> > Correct, the format is . where:
> >
> > - groupId is unique for each set of 8
> > - type has 4 variants (ab, cd, ef, gh), the same type-set for each ext
> >
> >> So, the rule is, "a set of files consists of 8 files, and a set should
> >> wait to be processed until all 8 files are ready", that's all.
> >
> >
> > For our use case it is important that we have positive identification
> that
> > we have exact "positive identification" of each file.
> >
> >>
> >> Then, the flow should be designed like below:
> >> 1. List files, each file will be sent as a FlowFile
> >
> >
> > Correct - we have several different listfiles for other sections of the
> > flow, we are actually collecting many different sets, all variants of the
> > above. However, those are far simpler (sets of 2 - ext1 and ext2 only)
> >
> >>
> >> 2. Extract groupId and type from filename
> >
> >
> > Correct
> >
> >>
> >> 3. Route FlowFiles into two branches, let's call these 'Notify' branch
> >> and 'Wait' branch, and pass only 1 type for a set to Wait-branch, and
> >> the rest 7 types to Notify-branch
> >
> >
> > OK, we currently split Notify branch to "all ext1" and wait branch to
> "all
> > ext2"
> >
> >>
> >> At Notify branch (for the rest 7 types FlowFile, e.g. type 2, 3, 4 ...
> 8)
> >
> >
> > As mentioned, we only have 4 distinct types.
> >
> >>
> >> 1. Notify that the type for a group has arrived.
> >> 2. Discard the FlowFile, because there's nothing to do with it in this
> >> branch
> >
> >
> >
> >>
> >> At Wait branch (for the type 1 FlowFile)
> >> 1. Wait for type 2 for the groupId.
> >> 2. Wait for type 3 for the groupId, type 4, 5 and so on
> >> 3. After passing Wait 

Re: Only get file when a set exists.

2018-05-30 Thread Koji Kawamura
Glad to hear that was helpful.

"4 same type for each extension", can be treated as "8 distinct types"
if an extension is included in a type.
ab.ex1, cd.ex1, ef.ex1, gh.ex1, ab.ex2, cd.ex2, ef.ex2, gh.ex2

Then route only 'ab.ex1' (or whichever but just 1 of them) to the Wait
branch, and the rest to Notify branch.
That will simplify the flow, if I'm not missing any other requirement.
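
As a rough sketch of that routing rule: the file name layout file_<groupId>_<type>.<ext> and the classify helper below are assumptions for illustration only; in the real flow this would be done with UpdateAttribute and RouteOnAttribute.

```python
WAIT_TYPE = "ab.ex1"  # the single type routed to the Wait branch


def classify(filename):
    """Return (group_id, full_type, branch) for one file name.

    Assumes names look like 'file_123_ab.ex1': everything before the
    last underscore is the group, the rest is the type plus extension.
    """
    stem, ext = filename.rsplit(".", 1)
    group_id, short_type = stem.rsplit("_", 1)
    full_type = f"{short_type}.{ext}"
    branch = "wait" if full_type == WAIT_TYPE else "notify"
    return group_id, full_type, branch


names = [f"file_123_{t}.{e}"
         for e in ("ex1", "ex2")
         for t in ("ab", "cd", "ef", "gh")]

for name in names:
    print(name, "->", classify(name))

branches = [classify(name)[2] for name in names]
print(branches.count("wait"), "to Wait,", branches.count("notify"), "to Notify")
# 1 to Wait, 7 to Notify
```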

Thanks!
Koji

On Thu, May 31, 2018 at 10:30 AM, Martijn Dekkers
 wrote:
> Hi Koji, Many thanks for your continued assistance!
>
>>
>> - 1 file per second is relatively low in terms of traffic, it should
>> be processed fine with 1 thread
>> - A flow like this, which is stateful across different parts of the
>> flow works at best with single thread, because using multiple threads
>> would cause race condition or concurrency issue if there's any
>> implementation error
>
>
> Yes, we had similar thoughts.
>
>>
>> - Based on above, I strongly recommend to NOT increase "concurrent
>> tasks". If you see FlowFiles staying in a wait queue, then there must
>> be different issue
>
>
> We don't see many flowfiles stuck in a wait queue, I ran a test over a few
> hours yesterday that simulates the way in which these files would appear (we
> would have 4 of "ext1" show up every second, and the "ext2" can show up a
> few seconds later, and not always in the same order) and we found perhaps 6
> flowfiles stuck in a wait queue.
>
>>
>> - Also, using concurrent tasks number like 400 is too much in general
>> for all processors. I recommend to increment it as 2, 3, 4 .. up to 8
>> or so, only if you see the clear benefit by doing so
>
>
> Indeed, thanks for the suggestion. Once we have the logic finished and
> tested we will have to optimise this Flow. The next step is to try to load
> the required processors into MiNiFy, as this will be running on many systems
> with limited capacity. If we don't manage with MiNiFy, we will still be
> good, but we prefer to have the smaller footprint and ease of management we
> can obtain with MiNiFy.
>
>>
>> - The important part of this flow is extracting 'groupId' and 'type'
>> from file names. Regular Expression needs to be configured properly.
>> - I recommend using https://regex101.com/ to test your Regular
>> Expression to see whether it can extract correct groupId and type
>
>
> Yes, we have tested our RegExes for this extensively
>
>>
>>
>> Lastly, regardless of how many files should be there for 'ext1' and
>> 'ext2', the flow structure is simple as below.
>> Let's say there should be 8 files to start processing those.
>> 4 x ex1, and 4 ex2 in your case, but let's think it as 8 file types.
>> And I assume the types are known, meaning, static, not dynamically change.
>
>
> Correct, the format is . where:
>
> - groupId is unique for each set of 8
> - type has 4 variants (ab, cd, ef, gh), the same type-set for each ext
>
>> So, the rule is, "a set of files consists of 8 files, and a set should
>> wait to be processed until all 8 files are ready", that's all.
>
>
> For our use case it is important that we have positive identification that
> we have exact "positive identification" of each file.
>
>>
>> Then, the flow should be designed like below:
>> 1. List files, each file will be sent as a FlowFile
>
>
> Correct - we have several different listfiles for other sections of the
> flow, we are actually collecting many different sets, all variants of the
> above. However, those are far simpler (sets of 2 - ext1 and ext2 only)
>
>>
>> 2. Extract groupId and type from filename
>
>
> Correct
>
>>
>> 3. Route FlowFiles into two branches, let's call these 'Notify' branch
>> and 'Wait' branch, and pass only 1 type for a set to Wait-branch, and
>> the rest 7 types to Notify-branch
>
>
> OK, we currently split Notify branch to "all ext1" and wait branch to "all
> ext2"
>
>>
>> At Notify branch (for the rest 7 types FlowFile, e.g. type 2, 3, 4 ... 8)
>
>
> As mentioned, we only have 4 distinct types.
>
>>
>> 1. Notify that the type for a group has arrived.
>> 2. Discard the FlowFile, because there's nothing to do with it in this
>> branch
>
>
>
>>
>> At Wait branch (for the type 1 FlowFile)
>> 1. Wait for type 2 for the groupId.
>> 2. Wait for type 3 for the groupId, type 4, 5 and so on
>> 3. After passing Wait for type 8, it can guarantee that all 8 files
>> are available (unless there is any other program deleting those)
>> 4. Get actual file content using FetchFile, and process it
>
>
> Besides the "4 same types for each extension", this is configured as you
> describe.
>
>>
>> I hope it helps.
>>
>
> It does, thanks. I will extract this portion of the flow, sanitise, and send
> it along - easier to see than to describe :)
>
>
>>
>> Thanks,
>> Koji
>
>
> Thank you so much once again!
>
> Martijn
>
>
>
>>
>>
>> On Wed, May 30, 2018 at 6:10 PM, Martijn Dekkers 
>> wrote:
>> > Hey Pierre,
>> >
>> > Yes, we suspected as much, but we are only seeing this with the Wait
>> > processor. Possibly because that is the only 

Re: Only get file when a set exists.

2018-05-30 Thread Martijn Dekkers
Hi Koji, Many thanks for your continued assistance!


> - 1 file per second is relatively low in terms of traffic, it should
> be processed fine with 1 thread
> - A flow like this, which is stateful across different parts of the
> flow works at best with single thread, because using multiple threads
> would cause race condition or concurrency issue if there's any
> implementation error
>

Yes, we had similar thoughts.


> - Based on above, I strongly recommend to NOT increase "concurrent
> tasks". If you see FlowFiles staying in a wait queue, then there must
> be different issue
>

We don't see many flowfiles stuck in a wait queue, I ran a test over a few
hours yesterday that simulates the way in which these files would appear
(we would have 4 of "ext1" show up every second, and the "ext2" can show up
a few seconds later, and not always in the same order) and we found perhaps
6 flowfiles stuck in a wait queue.


> - Also, using concurrent tasks number like 400 is too much in general
> for all processors. I recommend to increment it as 2, 3, 4 .. up to 8
> or so, only if you see the clear benefit by doing so
>

Indeed, thanks for the suggestion. Once we have the logic finished and
tested we will have to optimise this Flow. The next step is to try to load
the required processors into MiNiFy, as this will be running on many
systems with limited capacity. If we don't manage with MiNiFy, we will
still be good, but we prefer to have the smaller footprint and ease of
management we can obtain with MiNiFy.


> - The important part of this flow is extracting 'groupId' and 'type'
> from file names. Regular Expression needs to be configured properly.
> - I recommend using https://regex101.com/ to test your Regular
> Expression to see whether it can extract correct groupId and type
>

Yes, we have tested our RegExes for this extensively


>
> Lastly, regardless of how many files should be there for 'ext1' and
> 'ext2', the flow structure is simple as below.
> Let's say there should be 8 files to start processing those.
> 4 x ex1, and 4 ex2 in your case, but let's think it as 8 file types.
> And I assume the types are known, meaning, static, not dynamically change.
>

Correct, the format is . where:

- groupId is unique for each set of 8
- type has 4 variants (ab, cd, ef, gh), the same type-set for each ext

> So, the rule is, "a set of files consists of 8 files, and a set should
> wait to be processed until all 8 files are ready", that's all.
>

For our use case it is important that we have exact "positive
identification" of each file.


> Then, the flow should be designed like below:
> 1. List files, each file will be sent as a FlowFile
>

Correct - we have several different listfiles for other sections of the
flow, we are actually collecting many different sets, all variants of the
above. However, those are far simpler (sets of 2 - ext1 and ext2 only)


> 2. Extract groupId and type from filename
>

Correct


> 3. Route FlowFiles into two branches, let's call these 'Notify' branch
> and 'Wait' branch, and pass only 1 type for a set to Wait-branch, and
> the rest 7 types to Notify-branch
>

OK, we currently split Notify branch to "all ext1" and wait branch to "all
ext2"


> At Notify branch (for the rest 7 types FlowFile, e.g. type 2, 3, 4 ... 8)
>

As mentioned, we only have 4 distinct types.


> 1. Notify that the type for a group has arrived.
> 2. Discard the FlowFile, because there's nothing to do with it in this
> branch
>



> At Wait branch (for the type 1 FlowFile)
> 1. Wait for type 2 for the groupId.
> 2. Wait for type 3 for the groupId, type 4, 5 and so on
> 3. After passing Wait for type 8, it can guarantee that all 8 files
> are available (unless there is any other program deleting those)
> 4. Get actual file content using FetchFile, and process it
>

Besides the "4 same types for each extension", this is configured as you
describe.


> I hope it helps.
>
>
It does, thanks. I will extract this portion of the flow, sanitise, and
send it along - easier to see than to describe :)



> Thanks,
> Koji


Thank you so much once again!

Martijn




>
> On Wed, May 30, 2018 at 6:10 PM, Martijn Dekkers 
> wrote:
> > Hey Pierre,
> >
> > Yes, we suspected as much, but we are only seeing this with the Wait
> > processor. Possibly because that is the only "blocking" we have in this
> > flow.
> >
> > Thanks for the clarification, much appreciated!
> >
> > Martijn
> >
> > On 30 May 2018 at 10:30, Pierre Villard 
> wrote:
> >>
> >> I'll let Koji give more information about the Wait/Notify, he is clearly
> >> the expert here.
> >>
> >> I'm just jumping in regarding your "and when viewing the queue, the
> dialog
> >> states that the queue is empty.". You're seeing this behavior because,
> even
> >> though the UI shows some flow files in the queue, the flow files are
> >> currently locked in the session of the running processor and you won't
> see
> >> flow files currently processed in a 

Re: Only get file when a set exists.

2018-05-30 Thread Koji Kawamura
Hi Martijn,

Thanks for elaborating on your requirement. Here are a few comments:

- 1 file per second is relatively low traffic; it should be processed
fine with 1 thread
- A flow like this, which is stateful across different parts of the
flow, works best with a single thread, because multiple threads can
cause race conditions or concurrency issues if there is any
implementation error
- Based on the above, I strongly recommend NOT increasing "concurrent
tasks". If you see FlowFiles staying in a wait queue, then there must
be a different issue
- Also, a concurrent tasks number like 400 is too much in general
for all processors. I recommend incrementing it as 2, 3, 4 .. up to 8
or so, and only if you see a clear benefit from doing so
- The important part of this flow is extracting 'groupId' and 'type'
from file names. The Regular Expression needs to be configured properly.
- I recommend using https://regex101.com/ to test your Regular
Expression to see whether it extracts the correct groupId and type

Lastly, regardless of how many files there should be for 'ext1' and
'ext2', the flow structure is as simple as below.
Let's say there should be 8 files before processing can start.
4 x ex1 and 4 x ex2 in your case, but let's think of them as 8 file types.
And I assume the types are known, meaning static, not changing dynamically.
So, the rule is, "a set of files consists of 8 files, and a set should
wait to be processed until all 8 files are ready", that's all.

Then, the flow should be designed like below:
1. List files, each file will be sent as a FlowFile
2. Extract groupId and type from filename
3. Route FlowFiles into two branches, let's call these the 'Notify' branch
and the 'Wait' branch, and pass only 1 type per set to the Wait branch, and
the remaining 7 types to the Notify branch

At Notify branch (for the remaining 7 types of FlowFile, e.g. type 2, 3, 4 ... 8)
1. Notify that the type for a group has arrived.
2. Discard the FlowFile, because there's nothing to do with it in this branch

At Wait branch (for the type 1 FlowFile)
1. Wait for type 2 for the groupId.
2. Wait for type 3 for the groupId, type 4, 5 and so on
3. After passing Wait for type 8, it can guarantee that all 8 files
are available (unless there is any other program deleting those)
4. Get actual file content using FetchFile, and process it
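
The two branches can be sketched in a few lines, with a plain dictionary standing in for the distributed cache. This is a simplified model, not the processors' implementation: the names signals, notify and wait_passes are made up, and the chain of seven Wait processors is collapsed into a single check over all seven counters.

```python
from collections import defaultdict

TYPES = [f"{t}.{e}" for e in ("ex1", "ex2") for t in ("ab", "cd", "ef", "gh")]
WAIT_TYPE = TYPES[0]      # "ab.ex1" travels down the Wait branch
NOTIFY_TYPES = TYPES[1:]  # the other 7 types only send signals

# Stand-in for the cache: signal id (groupId) -> counter name -> count.
signals = defaultdict(lambda: defaultdict(int))


def notify(group_id, file_type):
    """Notify branch: record that one type of a group has arrived."""
    signals[group_id][file_type] += 1


def wait_passes(group_id):
    """Wait branch: pass only when all 7 other types have been notified."""
    counters = signals.get(group_id, {})
    return all(counters.get(t, 0) >= 1 for t in NOTIFY_TYPES)


# Six of the seven other types arrive, in arbitrary order.
for file_type in reversed(NOTIFY_TYPES[1:]):
    notify("123", file_type)
print(wait_passes("123"))  # False, "cd.ex1" is still missing

notify("123", NOTIFY_TYPES[0])
print(wait_passes("123"))  # True, the set of 8 is complete
```

Only after the last check passes would the type 1 FlowFile move on to FetchFile.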

I hope it helps.

Thanks,
Koji


On Wed, May 30, 2018 at 6:10 PM, Martijn Dekkers  wrote:
> Hey Pierre,
>
> Yes, we suspected as much, but we are only seeing this with the Wait
> processor. Possibly because that is the only "blocking" we have in this
> flow.
>
> Thanks for the clarification, much appreciated!
>
> Martijn
>
> On 30 May 2018 at 10:30, Pierre Villard  wrote:
>>
>> I'll let Koji give more information about the Wait/Notify, he is clearly
>> the expert here.
>>
>> I'm just jumping in regarding your "and when viewing the queue, the dialog
>> states that the queue is empty.". You're seeing this behavior because, even
>> though the UI shows some flow files in the queue, the flow files are
>> currently locked in the session of the running processor and you won't see
>> flow files currently processed in a session when listing a queue. If you
>> stop the processor, the session will be closed and you'll be able to list
>> the queue and see the flow files.
>>
>> I recall discussions in the past to improve the UX for this. Not sure we
>> have a JIRA for it though...
>>
>> Pierre
>>
>> 2018-05-30 8:26 GMT+02:00 Martijn Dekkers :
>>>
>>> Hi Koji,
>>>
>>> Thank you for responding. I had adjusted the run schedule to closely
>>> mimic our environment. We are expecting about 1 file per second or so.
>>> We are also seeing some random "orphans" sitting in a wait queue every
>>> now and again that don't trigger a debug message, and when viewing the
>>> queue, the dialog states that the queue is empty.
>>>
>>> We found the random "no signal found" issue to be significantly decreased
>>> when we increase the "concurrent tasks" to something large - currently set
>>> to 400 for all wait and notify processors.
>>>
>>> I do need to mention that our requirements had changed since you made the
>>> template, in that we are looking for a set of 8 files - 4 x "ext1" and 4 x
>>> "ext2" both with the same pattern: .ext1 or ext2
>>>
>>> We found that the best way to make this work was to add another
>>> wait/notify pair, each processor coming after the ones already there, with a
>>> simple counter against the groupID.
>>>
>>> I will export a template for you, many thanks for your help - I just need
>>> to spend some time sanitising the various fields etc.
>>>
>>> Many thanks once again for your kind assistance.
>>>
>>> Martijn
>>>
>>> On 30 May 2018 at 08:14, Koji Kawamura  wrote:

>>>> Hi Martjin,
>>>>
>>>> In my template, I was using 'Run Schedule' as '5 secs' for the Wait
>>>> processors to avoid overusing CPU resource. However, if you expect
>>>> more throughput, it should be lowered.
>>>> Changed Run Schedule to 0 sec, and I passed 100 group of

Re: Only get file when a set exists.

2018-05-30 Thread Pierre Villard
I'll let Koji give more information about the Wait/Notify, he is clearly
the expert here.

I'm just jumping in regarding your "and when viewing the queue, the dialog
states that the queue is empty.". You're seeing this behavior because, even
though the UI shows some flow files in the queue, the flow files are
currently locked in the session of the running processor and you won't see
flow files currently processed in a session when listing a queue. If you
stop the processor, the session will be closed and you'll be able to list
the queue and see the flow files.

I recall discussions in the past to improve the UX for this. Not sure we
have a JIRA for it though...

Pierre

2018-05-30 8:26 GMT+02:00 Martijn Dekkers :

> Hi Koji,
>
> Thank you for responding. I had adjusted the run schedule to closely mimic
> our environment. We are expecting about 1 file per second or so.
> We are also seeing some random "orphans" sitting in a wait queue every now
> and again that don't trigger a debug message, and when viewing the queue,
> the dialog states that the queue is empty.
>
> We found the random "no signal found" issue to be significantly decreased
> when we increase the "concurrent tasks" to something large - currently set
> to 400 for all wait and notify processors.
>
> I do need to mention that our requirements had changed since you made the
> template, in that we are looking for a set of 8 files - 4 x "ext1" and 4 x
> "ext2" both with the same pattern: .ext1 or ext2
>
> We found that the best way to make this work was to add another
> wait/notify pair, each processor coming after the ones already there, with
> a simple counter against the groupID.
>
> I will export a template for you, many thanks for your help - I just need
> to spend some time sanitising the various fields etc.
>
> Many thanks once again for your kind assistance.
>
> Martijn
>

Re: Only get file when a set exists.

2018-05-30 Thread Martijn Dekkers
Hi Koji,

Thank you for responding. I had adjusted the run schedule to closely mimic
our environment. We are expecting about 1 file per second or so.
We are also seeing some random "orphans" sitting in a wait queue every now
and again that don't trigger a debug message, and when viewing the queue,
the dialog states that the queue is empty.

We found the random "no signal found" issue to be significantly decreased
when we increase the "concurrent tasks" to something large - currently set
to 400 for all wait and notify processors.

I do need to mention that our requirements have changed since you made the
template: we are now looking for a set of 8 files - 4 x "ext1" and 4 x
"ext2", both with the same pattern: .ext1 or ext2

We found that the best way to make this work was to add another wait/notify
pair, each processor coming after the ones already there, with a simple
counter against the groupID.
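
A minimal sketch of the "simple counter against the groupID" idea, assuming the 4 x "ext1" plus 4 x "ext2" requirement described above. The class and method names (GroupCounter, notify, is_complete) are hypothetical and model the counting in memory only; they are not NiFi API, and the real flow keeps its counts in the distributed cache.

```python
from collections import defaultdict

# Assumed requirement from the message: 4 x "ext1" and 4 x "ext2" per group.
REQUIRED = {"ext1": 4, "ext2": 4}


class GroupCounter:
    """Toy in-memory stand-in for a per-group signal counter (not NiFi API)."""

    def __init__(self, required=REQUIRED):
        self.required = dict(required)
        # counts[group_id][ext] -> number of files seen so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def notify(self, group_id, ext):
        """Record one arrived file for the group, like one notification."""
        self.counts[group_id][ext] += 1

    def is_complete(self, group_id):
        """True once every extension has reached its required count."""
        seen = self.counts.get(group_id, {})
        return all(seen.get(ext, 0) >= n for ext, n in self.required.items())
```

A waiting step would then release a group only when is_complete(group_id) returns True, and keep holding it otherwise.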

I will export a template for you, many thanks for your help - I just need
to spend some time sanitising the various fields etc.

Many thanks once again for your kind assistance.

Martijn

On 30 May 2018 at 08:14, Koji Kawamura  wrote:

> Hi Martijn,
>
> In my template, I set 'Run Schedule' to '5 secs' for the Wait
> processors to avoid overusing CPU resources. However, if you expect
> more throughput, it should be lowered.
> I changed Run Schedule to 0 sec and passed 100 groups of files (400
> files, because 4 files are 1 set in my example); they reached the
> expected goal of the flow without issue.
>
> If you can share your flow and example input file volume (hundreds of
> files were fine in my flow), I may be able to provide more useful
> comments.
>
> Thanks,
> Koji
>


Re: Only get file when a set exists.

2018-05-29 Thread Martijn Dekkers
Hi Koji,

I am having many issues getting this to run reliably. When running this with
a few flowfiles at a time, and stepping through by switching processors on
and off, it works mostly fine, but when running this at volume I receive many
errors about "no release signal found".

I have tried to fix this in a few different ways, but the issue keeps
coming back. This is also not consistent at all - different wait processors
will block different flowfiles at different times, without changing any
configuration. Stop/Start the flow, and different queues will fill up. Do
you have any ideas what could be causing this behavior? I checked the
DistributedMapCache Server/Client components, and they all appear to be
working OK.

Thanks,

Martijn



Re: Only get file when a set exists.

2018-05-27 Thread Martijn Dekkers
Hi Koji, that looks excellent, many thanks!

I did consider wait/notify, but wasn't quite sure - not used wait/notify
before. Many thanks for putting the example together, I will give this a
try.

Martijn



Re: Only get file when a set exists.

2018-05-27 Thread Koji Kawamura
Hi Martijn,

An alternative approach is to use the Wait/Notify processors.
I have developed a similar flow using those before, and I believe it
will work for your case.
A NiFi flow template is available here:
https://gist.github.com/ijokarumawak/06b3b071eeb4d10d8a27507981422edd

Hope this helps,
Koji




Re: Only get file when a set exists.

2018-05-27 Thread Andrew Grande
Martijn,

Here's an idea you could explore. Have the ListFile processor work as usual
and create a custom component (start with a scripting one to prototype)
grouping the filenames as needed. I don't know if the number of files in a
set is different every time, so I am trying to be more robust.

Once you group and count the set, you can transfer the names to the success
relationship. Ignore otherwise and wait until the set is full.
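
The group-and-count idea above could be prototyped roughly as follows. This is only a sketch under stated assumptions: the five suffixes come from the example in the original post, the function names (split_suffix, complete_sets) are hypothetical, and the code works on a plain list of names rather than on NiFi FlowFiles.

```python
from collections import defaultdict

# Suffixes that make up one full set, taken from the example in the original post.
REQUIRED_SUFFIXES = ("_ab.ex1", "_cd.ex1", "_ef.ex1", "_gh.ex1", ".ex2")


def split_suffix(name):
    """Return (group, suffix) for a known suffix, or None if nothing matches."""
    for suffix in REQUIRED_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)], suffix
    return None


def complete_sets(names):
    """Group names by prefix and return only the groups that hold a full set."""
    groups = defaultdict(dict)
    for name in names:
        parsed = split_suffix(name)
        if parsed is None:
            continue  # not part of any set; ignore it
        group, suffix = parsed
        groups[group][suffix] = name
    return {
        group: sorted(found.values())
        for group, found in groups.items()
        if len(found) == len(REQUIRED_SUFFIXES)
    }
```

Names in the returned groups would go to the success relationship; everything else stays behind until a later listing completes its set.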

Andrew



Only get file when a set exists.

2018-05-27 Thread Martijn Dekkers
Hello all,

I am trying to work out an issue with little success.

I need to ingest files generated by some application. I can only ingest
these files when a specific set exists. For example:

file_123_456_ab.ex1
file_123_456_cd.ex1
file_123_456_ef.ex1
file_123_456_gh.ex1
file_123_456.ex2

Only when a set like that exists should I pick them up into the Flow. The
parts I am looking for to "group" would be "ab.ex1", "cd.ex1", "ef.ex1",
"gh.ex1", ".ex2".
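
For illustration, one way to pull the group and the "part" out of such names is a regular expression. This is a sketch only: the pattern is an assumption fitted to the five example names above, and the names NAME_RE and group_and_part are hypothetical.

```python
import re

# Assumed pattern, fitted to the five example names only.
NAME_RE = re.compile(
    r"^(?P<group>file_\d+_\d+)(?:_(?P<tag>[a-z]+))?(?P<ext>\.ex\d+)$"
)


def group_and_part(name):
    """Return (group, part), e.g. part "ab.ex1" or ".ex2"; None if no match."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    # The tag is absent for names like file_123_456.ex2.
    part = (m.group("tag") or "") + m.group("ext")
    return m.group("group"), part
```

All five example names share the group "file_123_456", which is the value a flow could use as the grouping key.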

I tried to do this with some expression, but couldn't work it out.

What would be the best way to achieve this?

Many thanks!