Hello!

I am working on the following use case - I have a flow which is listing the 
json files in a folder (between 200-300), then splitting each file to high 
level 'message' (up to 2000 messages/file) and then splitting each high level 
message into low level messages (up to 60low level messages per high level 
message). The low level messages are then processed and inserted into a 
database. The flow then moves to the next folder once all files in the current 
folder are split into lower level messages and inserted into a database. This 
flow is quite data intensive and quickly the queues between processors get full 
although the back pressure threshold  is set to 20000 flowfiles and the  most 
computationally intensive processors are scheduled with 2 or more threads.

To do that I have implemented three pairs of Wait-Notify processors. To do that 
I have followed this great tutorial 
http://ijokarumawak.github.io/nifi/2017/02/02/nifi-notify-batch/ The Wait 
processors are connected to the respective original SplitJSON connection as 
follows: the first is Wait processor when the list of files is split to files, 
the second - when each file is split into high level 'messages' and the third - 
when the high level messages are split into low level messages. The 
corresponding Notify processors are as follows:
- Notify for the low level messages is after they are inserted into the 
database.
- Notify for the high level messages is connected to the third Wait processor 
(the one waiting for the low level messages to be processed)
- The last Notify is connected to the Wait processor which is waiting for the 
high level messages in each file to be processed.
I have also created one Distributed MapCacheServer with the respective Clients 
for each Wait-Notify pair.
This set up worked fine for approximately a day while I was developing it and 
the flow was moving to the next folders.

After that I changed the input source and it stopped releasing files from the 
second Wait-Notify pair (the only difference between the two data sources is 
that the second one has larger files). The Wait-Notify pair dealing with the 
low level messages continued to release flow files but the one which was 
working on the high level messages did not - the high level messages Wait 
processor was not sending any files to the following processor. I have read 
about the wait queue prioritizers and changed the priority to both Wait-Notify 
pairs to FIFO - this worked, meaning flowfiles were released from the 
high-level messages Wait-Notify pair, for some time and then it stopped - no 
more files were released. I have also read about wait penalty duration and 
experimented with different values but with no success.

I have also started looking into the wait_buffer_count and 
relesable_flow_file_count properties of the Wait processor but can't fully 
understand how they should be used. I would expect that their values should be 
chosen depending on the incoming data and have tried values between 20000 and 
100000 but it does not seem to help releasing files from the high level 
messages Wait processor. I have also looked into the Notify signal_buffer_count 
and thought that it should be connected to the values for Wait 
wait_buffer_count somehow for the balanced work of the Wait-Notify pair.

As it was working with the first data source it feels like property 
configuration issue but I am stuck now and have no more ideas where to look at.
Any suggestions and hints  about the interaction of the different properties 
would be greatly appreciated.

Many thanks in advance!

Valentina

Reply via email to