Geoffrey,

At a high level, if you're splitting multiple times and then trying to re-assemble everything, then yes, I think your thought process is correct. But you've no doubt seen how complex and cumbersome this approach can be. It can also result in extremely poor performance. So much so that when I began creating a series of YouTube videos on NiFi Anti-Patterns, the first anti-pattern that I covered was the splitting and re-merging of data [1].
Generally, this should be an absolute last resort, and Record-oriented processors should be used instead of splitting the data up and re-merging it. If you need to perform REST calls, you could do that with LookupRecord, and either use the RestLookupService or, if that doesn't fit the bill exactly, use the ScriptedLookupService and write a small script in Groovy or Python that performs the REST call for you and returns the results (a rough Groovy sketch is at the very bottom of this message, below your quoted note). Or perhaps ScriptedTransformRecord would be more appropriate - hard to tell without knowing the exact use case. Obviously, your mileage may vary, but switching the data flow to use record-oriented processors, if possible, would typically yield a flow that is much simpler and throughput that is at least an order of magnitude better.

But if for whatever reason you do end up being stuck with the split/merge approach, the key would likely be to consider backpressure heavily. MergeContent's Defragment strategy can only complete a bin once every fragment of a given split is available to it. So if you have backpressure set to 10,000 FlowFiles (the default) and you're trying to merge together data that comes from many different upstream splits, you can certainly end up in a situation like this, where you don't have all of the data from a given split queued up for MergeContent.

Hope this helps!
-Mark

[1] https://www.youtube.com/watch?v=RjWstt7nRVY


On Feb 24, 2021, at 4:59 PM, Greene (US), Geoffrey N <[email protected]> wrote:

I'm having some trouble with multiple splits/merges. Here's the idea:

Big data -> split 1 -> save all the fragment.* attributes into variables -> split 2 -> save all the fragment.* attributes

Split 1
|
Save fragment.* attributes into split1.fragment.*
|
Split 2
|
Save fragment.* attributes into split2.fragment.* attributes
|
(More processing)
|
Split 3
|
Save fragment.* attributes into split3.fragment.* attributes
|
(other stuff)
|
Restore split3.fragment.* attributes to fragment.*
|
Merge 3, using defragment strategy
|
Restore split2.fragment.* attributes to fragment.*
|
Merge 2, using defragment strategy
|
Restore split1.fragment.* attributes to fragment.*
|
Merge 1, using defragment strategy

Am I thinking about this correctly? It seems like sometimes, NiFi is unable to do a merge on some of the split data (errors like "there are 50 fragments, but we only found one"). Is it possible that I need to do some prioritization in the queues? I have noticed that things do back up and the queues seem to fill up as it's going through (several of the splits need to perform REST calls and processing, which can take time). Maybe the issue is that one fragment "slips" through before the others have even been processed far enough. Is there an approved way to do this?

Thanks for the help!
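
P.S. To make the ScriptedLookupService idea a little more concrete, here is a rough, untested Groovy sketch of the kind of script I had in mind. The endpoint URL, the "id" coordinate key, and the "name" field in the JSON response are all made up for illustration - you'd substitute whatever your REST service actually returns - and the class/variable wiring follows the usual convention for the scripted components, so double-check it against the ScriptedLookupService documentation for your NiFi version.

import org.apache.nifi.controller.AbstractControllerService
import org.apache.nifi.lookup.LookupFailureException
import org.apache.nifi.lookup.LookupService

import groovy.json.JsonSlurper

class RestCallLookupService extends AbstractControllerService implements LookupService<String> {

    // LookupRecord passes in the coordinates you configure on the processor;
    // "id" is just an example key.
    Set<String> getRequiredKeys() {
        return ['id'] as Set
    }

    Class getValueType() {
        return String
    }

    Optional<String> lookup(Map coordinates) throws LookupFailureException {
        def id = coordinates['id']
        try {
            // Hypothetical endpoint - replace with your real REST service.
            def conn = new URL('https://example.com/api/items/' + id).openConnection()
            conn.setRequestProperty('Accept', 'application/json')
            def json = new JsonSlurper().parse(conn.inputStream)
            // "name" is a made-up field in the response JSON.
            return Optional.ofNullable(json?.name as String)
        } catch (Exception e) {
            throw new LookupFailureException('REST lookup failed for id ' + id + ': ' + e.message)
        }
    }
}

// The scripted controller services pick up the instance from a binding variable;
// for ScriptedLookupService I believe it is 'lookupService' - verify against the docs.
lookupService = new RestCallLookupService()

On the LookupRecord side you'd then point the Lookup Service property at this ScriptedLookupService, add a dynamic property named "id" whose value is the RecordPath of the field to look up, and configure where the result gets written into the record - again, check the processor documentation for the exact property names.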
