I am trying to design a flow where I search against an Elasticsearch index, do some processing on each document in the result set, and then merge the processed documents into a single JSON document. I want to make sure that the merged document contains ALL of the results that were found by the search. I have been considering the following approaches:

1. Perform the search using the `JsonQueryElasticsearch` processor. Route the `hits` relationship to the processor where I will process each search result, then send each processed file into a `MergeContent` processor to be joined together.

2. Perform the search using the `QueryElasticsearchHttp` processor. Choose a page size that seems reasonable for my search document sizes, and set the target for the query processor to flow file. Route the `success` relationship to the processor where I will process each search result, then send each processed file into a `MergeContent` processor to be joined together.

3. Implement a custom processor that performs the search using my own Java code. Include the processing I want to do in that processor, sending each processed file out into a `MergeContent` processor to be joined together.
The challenge for me is understanding how best to merge the search results using option 1 or 2. With the custom processor of option 3, I can merge using the Defragment strategy as long as my processor sets the `fragment.identifier`, `fragment.index`, and `fragment.count` attributes. But I'd rather not implement and maintain a custom processor if one of the built-in processors can do the job.

Option 2 does not seem to give me a good way of merging the results. I would have to use the bin-packing strategy, even though I should know the hit count coming out of the search processor, and I would probably have to rely on a max bin age or something similar to tell the merge when it has everything it needs. That seems like a messy and possibly undependable way of making sure all results for a given search end up merged together.

Option 1 looks like it could merge a little more easily, but only if I set the `Split up search results` property to `No` and then pass the resulting JSON through a `SplitJson` processor, which sets the attributes needed for a defragmentation-based merge further downstream. Forcing my search output into a single flow file just so it can be cleanly split and merged later feels a little artificial. But perhaps that is the best way?

Are there any standard or best practices for doing this kind of thing?

Thanks
-Tim
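For reference, here is a rough sketch (plain Python rather than NiFi code, purely illustrative) of the attribute contract I understand MergeContent's Defragment strategy to expect: every fragment of one logical group carries the same `fragment.identifier`, its own `fragment.index`, and a shared `fragment.count`, and the merge completes once all `fragment.count` fragments with that identifier have arrived:

```python
import uuid

def tag_fragments(results):
    """Simulate stamping the fragment.* attributes that a Defragment-based
    merge uses to reassemble a complete set of flow files."""
    fragment_id = str(uuid.uuid4())  # shared by every fragment of one search
    total = len(results)
    tagged = []
    for index, doc in enumerate(results):
        attributes = {
            "fragment.identifier": fragment_id,  # groups fragments together
            "fragment.index": str(index),        # position within the group
            "fragment.count": str(total),        # how many to wait for
        }
        tagged.append((attributes, doc))
    return tagged

# Three processed search hits from one query all share an identifier
# and declare the same total count.
fragments = tag_fragments(['{"id": 1}', '{"id": 2}', '{"id": 3}'])
assert all(attrs["fragment.count"] == "3" for attrs, _ in fragments)
```

This is the contract that a custom processor (option 3) would have to honor by hand, and that `SplitJson` provides automatically in the split-then-merge variant of option 1.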
