Hello

Run Schedule should be 0 sec, so the processor runs whenever there is work queued rather than once every 5 seconds.

50 should be the minimum number of records.

5 seconds is the Max Bin Age it sounds like you want.
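
Putting that together with the values you already have, the configuration would look roughly like this (property names as they appear in the MergeRecord processor's configuration dialog; Run Schedule is on the Scheduling tab, the rest under Properties):

Scheduling:
  Run Schedule                 0 sec
Properties:
  Merge Strategy               Bin-Packing Algorithm
  Minimum Number of Records    50
  Maximum Number of Records    100
  Max Bin Age                  5 sec
  Maximum Number of Bins       1

The idea is to let the minimum record count and the bin age decide when a bin is flushed, instead of having the 5 second run schedule limit how often the processor can pull records off the queue.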

Start with these changes and let us know what you're seeing.

Thanks

On Fri, Dec 2, 2022 at 10:12 PM Richard Beare <[email protected]>
wrote:

> Hi,
> I'm having a great deal of trouble configuring the MergeRecord processor
> to deliver reasonable performance and I'm not sure where to look to correct
> it. One of my upstream processors requires a single record per flowfile,
> but I'd like to create larger flowfiles before passing to the next stage.
> The flowfiles are independent at this stage so there's no special
> processing required of the merging. I'd like to create flowfiles of about
> 50 to 100 records.
>
> I have two tests, both running on the same NiFi system. One uses synthetic
> data, the other the production data. The performance of the MergeRecord
> processor on the synthetic data is as I'd expect, and I can't figure out
> why the production data is so much slower. Here's the
> configuration:
>
> MergeRecord has the following settings: timer driven, 1 concurrent task, 5
> second run schedule, bin-packing merge strategy, min records = 1, max
> records = 100, max bin age = 4.5 secs, maximum number of bins = 1.
>
> In the case of synthetic data, the typical flowfile size is in the range of
> 2 to 7 KB.
>
> The size of flowfiles for the production case is smaller - typically
> around 1KB.
>
> The structure in the tests is slightly different. Synthetic is (note that
> I've removed the text part):
>
> [ {
>   "sampleid" : 1075,
>   "typeid" : 98,
>   "dct" : "2020-01-25T21:40:25.515Z",
>   "filename" : "__tmp/txt/mtsamples-type-98-sample-1075.txt",
>   "document" : "Text removed - typically a few hundred words",
>   "docid" : "9"
> } ]
>
> Production is:
> [ {
>   "doc_id" : "5.60622895E8",
>   "doc_text" : " Text deleted - typically a few hundred words",
>   "processing_timestamp" : "2022-11-27T23:56:35.601Z",
>   "metadata_x_ocr_applied" : true,
>   "metadata_x_parsed_by" :
> "org.apache.tika.parser.DefaultParser;org.apache.tika.parser.microsoft.rtf.RTFParser;org.apache.tika.parser.AutoDetectParser",
>   "metadata_content_type" : "application/rtf",
>   "metadata_page_count" : null,
>   "metadata_creation_date" : null,
>   "metadata_last_modified" : null
> } ]
>
>
>
> I load up the queue feeding the MergeRecord processor with several hundred
> individual flowfiles and activate it.
>
> The synthetic data is nicely placed into chunks of 100, with any remainder
> being flushed in a smaller chunk.
>
> The production data is generally bundled into groups of 6 records,
> sometimes fewer. It certainly never gets close to 100 records.
>
> Any ideas as to what I should look at to track down the difference?
>
> Thanks
>