Hi Mark,
We have many List/Get processors that run via CRON. Some systems
export data to disk every hour, but they can't block read access to
the files while writing them. So NiFi can pull the same file multiple times
and try to delete it while it is still being written. But we know that the
export only takes 10 minutes. Therefore we use a CRON schedule to get files between
0 0 15-55 * *
We have similar issues with other systems that only provide data, or are only
accessible, at specific time slots.

To John:
Could you use a Notify/Wait gate? A Wait processor
blocks flowfiles in front of the MergeContent processor, and in another flow a
GenerateFlowFile and a Notify processor open the gate (the Wait processor).
After the MergeContent you could have another Notify processor to close the gate
again.
This way, many flowfiles would arrive at the MergeContent processor at
the same time.
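
To make the gate idea concrete, here is a toy sketch in plain Python. It is
not NiFi code: the Gate class and its wait, notify_open and notify_close
methods are made-up names that only mimic the roles of the Wait and Notify
processors, and the join at the end is just a stand-in for MergeContent.

```python
from collections import deque


class Gate:
    """Toy model of a Wait processor controlled by Notify signals (hypothetical)."""

    def __init__(self):
        self.is_open = False
        self.waiting = deque()  # flowfiles held back while the gate is closed

    def wait(self, flowfile):
        """Hold the flowfile while the gate is closed; pass it through otherwise."""
        if self.is_open:
            return [flowfile]
        self.waiting.append(flowfile)
        return []

    def notify_open(self):
        """Open the gate and release everything queued so far, in arrival order."""
        self.is_open = True
        released = list(self.waiting)
        self.waiting.clear()
        return released

    def notify_close(self):
        """Close the gate again, e.g. once the merge has happened."""
        self.is_open = False


gate = Gate()
held = [gate.wait(f"flowfile-{i}") for i in range(3)]  # gate closed: nothing passes
batch = gate.notify_open()                             # all three released together
merged = ",".join(batch)                               # stand-in for MergeContent
gate.notify_close()                                    # later flowfiles wait again
```

The point of the sketch is only that the merge step sees the whole batch at
once instead of a trickle of flowfiles.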

Kind regards
Jens M. Kofoed


On 22 Feb 2023 at 15:24, Mark Payne <[email protected]> wrote:

Interesting. Thanks for that feedback Harald. It might make sense to be
more surgical about this, disabling it for MergeContent, for example,
instead of all interflow processors.

Thanks
-Mark


On Feb 22, 2023, at 5:42 AM, Dobbernack, Harald (Key-Work) <
[email protected]> wrote:


Just responding to this part:

You should not be using CRON driven for any processors in the middle of a
flow. In fact, we really should probably just disable that all together.

Please don't disable this! We actually use CRON for some of our PutSFTP
processors, as there are service times for these SFTP servers that we are
supposed to respect by not using them, or during which the SFTP will actually
not be available... Of course we could also route to a Wait processor when we
arrive at a time at which the destination should not be called, but it is much
simpler to be able to tell the processor in the middle of the flow when not
to run.


-----Original Message-----

From: Mark Payne <[email protected]>

Sent: Tuesday, 21 February 2023 21:37

To: [email protected]; John McGinn <[email protected]>

Subject: Re: Processor with cron scheduling in middle of flow




John,


You should not be using CRON driven for any processors in the middle of a
flow. In fact, we really should probably just disable that all together.

In fact, it’s exceedingly rare that you’d want anything other than
Timer-Driven with a Run Schedule of 0 sec.

MergeContent will not create any merged output on its first iteration after
it’s scheduled to run. It requires at least a second iteration before
anything is transferred. Its algorithm has evolved over time, and it may
well have happened to work previously but it’s really not being configured
as intended.


When you say that you’re retrieving data from a few sources and then
“merges that all back into a single file” - does that mean that you started
with one FlowFile, split it up, and then want to re-assemble the data after
performing enrichment? If so you’ll want to use a Merge Strategy of
Defragment.
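
As a rough illustration of what reassembly by fragment means, here is a small
Python sketch. It is not how MergeContent is implemented; it only assumes the
standard fragment.identifier, fragment.index and fragment.count attributes
that NiFi split processors write, and models a FlowFile as a plain dict. The
defragment and frag helpers are hypothetical names.

```python
from collections import defaultdict


def defragment(flowfiles):
    """Reassemble fragments into their originals.

    Each flowfile is a dict with 'attributes' and 'content'. A group is
    emitted only once all 'fragment.count' pieces have arrived.
    """
    groups = defaultdict(dict)
    for ff in flowfiles:
        attrs = ff["attributes"]
        index = int(attrs["fragment.index"])
        groups[attrs["fragment.identifier"]][index] = ff

    merged = {}
    for identifier, parts in groups.items():
        any_part = next(iter(parts.values()))
        expected = int(any_part["attributes"]["fragment.count"])
        if len(parts) == expected:
            # Concatenate in fragment.index order, regardless of arrival order.
            merged[identifier] = "".join(parts[i]["content"] for i in sorted(parts))
    return merged


def frag(identifier, index, count, content):
    """Build a toy flowfile carrying the three fragment attributes."""
    return {
        "attributes": {
            "fragment.identifier": identifier,
            "fragment.index": str(index),
            "fragment.count": str(count),
        },
        "content": content,
    }


# Group "a" is complete (and arrives out of order); group "b" is still missing a piece.
result = defragment([frag("a", 1, 2, "lo"), frag("a", 0, 2, "hel"), frag("b", 0, 2, "x")])
```

Only complete groups come out, which is why every enriched piece has to keep
its fragment attributes on the way to the merge.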


If you’re trying to just bring in some data and merge it together by
correlation attribute, then Bin Packing makes sense. Here, you have a few
properties that you can use to try to get the best bin packing. In short, a
bin will be merged when any of these conditions is met:


- The Minimum Group Size is reached AND the Minimum Number of Entries is met

- The Maximum Group Size OR the Maximum Number of Entries is met

- A bin has been sitting for “Max Bin Age” amount of time

- If a correlation attribute is used, and a FlowFile comes in that can’t go
into any bin, it will evict the oldest.
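
The first three conditions above can be written down as a small predicate.
This is only a sketch of the rules as listed here, not MergeContent's actual
code: BinSettings and merge_reason are hypothetical names, the order in which
the checks run is an assumption, and the eviction case for correlation
attributes is left out.

```python
from dataclasses import dataclass


@dataclass
class BinSettings:
    min_size: int        # Minimum Group Size, in bytes
    min_entries: int     # Minimum Number of Entries
    max_size: int        # Maximum Group Size, in bytes
    max_entries: int     # Maximum Number of Entries
    max_bin_age: float   # Max Bin Age, in seconds


def merge_reason(size, entries, age, settings):
    """Return which listed condition would trigger a merge, or None if none does."""
    if size >= settings.max_size or entries >= settings.max_entries:
        return "maximum reached"
    if age >= settings.max_bin_age:
        return "max bin age"
    if size >= settings.min_size and entries >= settings.min_entries:
        return "minimums reached"
    return None


settings = BinSettings(min_size=1_000, min_entries=10,
                       max_size=10_000, max_entries=100, max_bin_age=300.0)

# A small, old bin is merged because of its age, not because it is full.
reason = merge_reason(size=500, entries=5, age=301.0, settings=settings)
```

With low minimums, "minimums reached" fires almost immediately, which is one
way to end up with many small merged files.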


If you’re seeing bins smaller than expected, you can look at the Data
Provenance for the merged FlowFile, and it will tell you exactly which of
the conditions above triggered the data to be merged. This may help to
adjust these settings.


Hope this is helpful.


Thanks

-Mark



On Feb 17, 2023, at 1:39 PM, John McGinn via users <[email protected]>
wrote:


Hello,


NiFi 1.19.0 - I need some help in trying to make my idea work, or figure
out the better way to do this.


I've got a flow that retrieves data from a few data sources, enhances
individual flow files, converts attributes to CSV and then merges that all
back into a single file. It takes roughly 20 minutes for the process to run
from start to the MergeContent part, so when I do it manually, I stop the
MergeContent processor until all flowfiles are in the queue waiting, and
then I start the MergeContent processor. (Run One Time doesn't work for
some reason.) That works fine, manually.


When I try to put cron scheduling in, it never kicks off. For instance, the
initial processor in the flow has a cron schedule of the top of the hour
(0 0 * * * ?). I then put 25 past the hour for MergeContent (0 25 * * * ?).
When I start the flow, the flowfiles are generated and queue up in front of
MergeContent by 25 minutes past the hour, but the MergeContent never kicks
off.
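
For reference, NiFi's CRON driven scheduling uses Quartz-style expressions,
whose fields are seconds, minutes, hours, day of month, month and day of week.
The short Python sketch below just labels the fields of the two expressions
above; label_fields is a hypothetical helper, and it deliberately ignores the
optional seventh (year) field.

```python
QUARTZ_FIELDS = ("seconds", "minutes", "hours", "day_of_month", "month", "day_of_week")


def label_fields(expression):
    """Map each field of a six-field Quartz-style cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(QUARTZ_FIELDS):
        raise ValueError(f"expected {len(QUARTZ_FIELDS)} fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


source_schedule = label_fields("0 0 * * * ?")   # second 0 of minute 0, every hour
merge_schedule = label_fields("0 25 * * * ?")   # second 0 of minute 25, every hour

for name, schedule in (("source", source_schedule), ("merge", merge_schedule)):
    print(name, schedule)
```

Both schedules fire once per hour, so each processor is triggered a single
time per hour.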


I added a correlation attribute recently and removed the cron entry, but
the MergeContent just creates small bunches of merged files.


I even attempted to put a cron on the AttributesToCSV with a maximum bin
age on the MergeContent, since it takes less than a minute for the
AttributesToCSV to process the flowfiles at that point, but the cron didn't
kick off there either.


Any ideas on how to get this to work?


Thanks,

John




Harald Dobbernack


Key-Work Consulting GmbH | Kriegsstr. 100 | 76133 | Karlsruhe | Germany |
www.key-work.de<https://www.key-work.de> | Privacy policy<
https://www.key-work.de/de/footer/datenschutz.html>

Phone: +49-721-78203-264 | E-Mail: [email protected]


Key-Work Consulting GmbH, Karlsruhe, HRB 108695, HRG Mannheim

Management: Petra Wotring
