Geoffrey,

At a high level, if you're splitting multiple times and then trying to re-assemble everything, then yes, I think your thought process is correct. But you've no doubt seen how complex and cumbersome this approach can be. It can also result in extremely poor performance. So much so that when I began creating a series of YouTube videos on NiFi Anti-Patterns, the first anti-pattern that I covered was the splitting and re-merging of data [1].
Generally, this should be an absolute last resort, and Record-oriented processors should be used instead of splitting the data up and re-merging it. If you need to perform REST calls, you could do that with LookupRecord, and either use the RestLookupService or, if that doesn't fit the bill exactly, use the ScriptedLookupService and write a small script in Groovy or Python that performs the REST call for you and returns the results (a rough Groovy sketch is at the very bottom of this message, below your quoted note). Or perhaps ScriptedTransformRecord would be more appropriate - hard to tell without knowing the exact use case. Obviously, your mileage may vary, but switching the data flow to use record-oriented processors, if possible, would typically yield a flow that is much simpler and throughput that is at least an order of magnitude better.

But if for whatever reason you do end up being stuck with the split/merge approach, the key would likely be to consider backpressure heavily. MergeContent's Defragment strategy can only complete a bin once every fragment of a given split is available to it. So if you have backpressure set to 10,000 FlowFiles (the default) and you're trying to merge together data that comes from many different upstream splits, you can certainly end up in a situation like this, where you don't have all of the data from a given split queued up for MergeContent.

Hope this helps!
-Mark

[1] https://www.youtube.com/watch?v=RjWstt7nRVY


On Feb 24, 2021, at 4:59 PM, Greene (US), Geoffrey N <[email protected]> wrote:

I'm having some trouble with multiple splits/merges. Here's the idea:

Big data -> split 1 -> save all the fragment.* attributes into variables -> split 2 -> save all the fragment.* attributes

Split 1
|
Save fragment.* attributes into split1.fragment.*
|
Split 2
|
Save fragment.* attributes into split2.fragment.* attributes
|
(More processing)
|
Split 3
|
Save fragment.* attributes into split3.fragment.* attributes
|
(other stuff)
|
Restore split3.fragment.* attributes to fragment.*
|
Merge 3, using defragment strategy
|
Restore split2.fragment.* attributes to fragment.*
|
Merge 2, using defragment strategy
|
Restore split1.fragment.* attributes to fragment.*
|
Merge 1, using defragment strategy

Am I thinking about this correctly? It seems like sometimes, NiFi is unable to do a merge on some of the split data (errors like "there are 50 fragments, but we only found one"). Is it possible that I need to do some prioritization in the queues? I have noticed that things do back up and the queues seem to fill up as it's going through (several of the splits need to perform REST calls and processing, which can take time). Maybe the issue is that one fragment "slips" through before the others have even been processed far enough. Is there an approved way to do this?

Thanks for the help!
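
P.S. To make the ScriptedLookupService idea a little more concrete, here is a rough, untested Groovy sketch of the kind of script I had in mind. The endpoint URL, the "id" coordinate key, and the "name" field in the JSON response are all made up for illustration - you'd substitute whatever your REST service actually returns - and the class/variable wiring follows the usual convention for the scripted components, so double-check it against the ScriptedLookupService documentation for your NiFi version.

import org.apache.nifi.controller.AbstractControllerService
import org.apache.nifi.lookup.LookupFailureException
import org.apache.nifi.lookup.LookupService

import groovy.json.JsonSlurper

class RestCallLookupService extends AbstractControllerService implements LookupService<String> {

    // LookupRecord passes in the coordinates you configure on the processor;
    // "id" is just an example key.
    Set<String> getRequiredKeys() {
        return ['id'] as Set
    }

    Class getValueType() {
        return String
    }

    Optional<String> lookup(Map coordinates) throws LookupFailureException {
        def id = coordinates['id']
        try {
            // Hypothetical endpoint - replace with your real REST service.
            def conn = new URL('https://example.com/api/items/' + id).openConnection()
            conn.setRequestProperty('Accept', 'application/json')
            def json = new JsonSlurper().parse(conn.inputStream)
            // "name" is a made-up field in the response JSON.
            return Optional.ofNullable(json?.name as String)
        } catch (Exception e) {
            throw new LookupFailureException('REST lookup failed for id ' + id + ': ' + e.message)
        }
    }
}

// The scripted controller services pick up the instance from a binding variable;
// for ScriptedLookupService I believe it is 'lookupService' - verify against the docs.
lookupService = new RestCallLookupService()

On the LookupRecord side you'd then point the Lookup Service property at this ScriptedLookupService, add a dynamic property named "id" whose value is the RecordPath of the field to look up, and configure where the result gets written into the record - again, check the processor documentation for the exact property names.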
