I am trying to design a flow where I search against an Elasticsearch index, do 
some processing on each document in the result set, and then merge the 
resulting processed documents into a single JSON document. I want to make sure 
that the merged document contains ALL of the results that were found by the 
search. I have been considering the following approaches:
1. Perform the search using the `JsonQueryElasticsearch` processor. Route the 
`hits` relationship to the processor where I will process each search result. 
Then send each processed file into a `MergeContent` processor to be joined 
together.
2. Perform the search using the `QueryElasticsearchHttp` processor. Choose a 
page size that seems reasonable for my search document sizes, and set the 
target for the query processor to flow file. Route the `success` relationship 
to the processor where I will process each search result. Then send each 
processed file into a `MergeContent` processor to be joined together.
3. Implement a custom processor that performs the search using my own Java 
code. Include the processing I want to do in that processor, sending each 
processed file out into a `MergeContent` processor to be joined together.

The challenge for me is understanding how to best merge the search results 
using options 1 or 2. Using the custom processor approach of option 3, I can 
merge using the defragment strategy if my processor sets the 
`fragment.identifier`, `fragment.index`, and `fragment.count` attributes. But 
I’d rather not implement and maintain a custom processor if one of the built-in 
processors can work.
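
For what it's worth, here is the bookkeeping I understand the Defragment 
strategy to perform, sketched in plain Java (this is not the NiFi API — the 
class and method names are mine, just to illustrate the idea): every hit gets 
a shared `fragment.identifier`, its own `fragment.index`, and the total 
`fragment.count`, and the merge fires only once all `fragment.count` pieces 
with the same identifier have arrived, joining them in index order.

```java
import java.util.*;

// Toy model of option 3: a custom processor stamps one fragment per search
// hit, and a defragment-style merge holds fragments until fragment.count of
// them share a fragment.identifier, then joins them in fragment.index order.
public class DefragmentSketch {

    record Fragment(String identifier, int index, int count, String content) {}

    // Emit one fragment per hit; all share a freshly generated identifier.
    static List<Fragment> fragmentHits(List<String> hits) {
        String id = UUID.randomUUID().toString();   // fragment.identifier
        List<Fragment> out = new ArrayList<>();
        for (int i = 0; i < hits.size(); i++) {
            out.add(new Fragment(id, i, hits.size(), hits.get(i)));
        }
        return out;
    }

    // Merge only when every fragment for the identifier has arrived,
    // regardless of the order in which they arrived.
    static Optional<String> defragment(List<Fragment> received) {
        if (received.isEmpty() || received.size() < received.get(0).count()) {
            return Optional.empty();                // bin is not complete yet
        }
        return Optional.of(received.stream()
                .sorted(Comparator.comparingInt(Fragment::index))
                .map(Fragment::content)
                .reduce((a, b) -> a + "," + b)
                .map(joined -> "[" + joined + "]")
                .orElse("[]"));
    }

    public static void main(String[] args) {
        List<Fragment> frags =
                fragmentHits(List.of("{\"id\":1}", "{\"id\":2}", "{\"id\":3}"));
        Collections.shuffle(frags);   // arrival order is not guaranteed
        System.out.println(defragment(frags).orElseThrow());
    }
}
```

The appeal of this approach is that completeness is decided by the count 
rather than by elapsed time, so no `Max Bin Age` timeout is needed.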

It seems like option 2 would not give me a reliable way to merge the results. 
I'd have to use MergeContent's bin-packing strategy, even though the hit count 
should be known as the results leave the search processor, and I'd probably 
have to use a max bin age or something similar to tell the merge when it has 
everything it needs. That seems like a messy and possibly undependable way of 
making sure all results for a given search are merged together.

It seems like option 1 could be made to merge a little more easily, but only 
if I set the `Split up search results` property to `No` and then pass the 
resulting JSON through a `SplitJson` processor, which will set the attributes 
needed for a defragmentation-based merge further downstream. It feels a little 
artificial to force my search output into a single flow file just so it can be 
cleanly split and merged later. But perhaps that is the best way?
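
If I understand the docs correctly, with splitting disabled the `hits` output 
is a single flow file containing one JSON array of hit objects, something like 
this (the index name and field values here are made up):

```json
[
  { "_index": "my-index", "_id": "1", "_source": { "field": "value1" } },
  { "_index": "my-index", "_id": "2", "_source": { "field": "value2" } }
]
```

`SplitJson` with a JsonPath expression of `$.*` would then emit one flow file 
per array element and write `fragment.identifier`, `fragment.index`, and 
`fragment.count` on each split, which as far as I can tell is exactly what 
`MergeContent`'s Defragment strategy needs.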

Are there any standard or best practices for doing this kind of thing? 

Thanks

-Tim
