Hi,
I'm having a great deal of trouble getting the MergeRecord processor to
deliver reasonable performance, and I'm not sure where to look to
correct it. One of my upstream processors requires a single record per
flowfile, but I'd like to build larger flowfiles before passing them to
the next stage. The flowfiles are independent at this point, so the
merge needs no special processing. I'd like to produce flowfiles of
roughly 50 to 100 records.

I have two tests, both running on the same NiFi system. One uses
synthetic data, the other production data. The performance of the
MergeRecord processor on the synthetic data is as I'd expect, and I
can't figure out why the production data is so much slower. Here's the
configuration:

MergeRecord has the following settings: timer driven, 1 concurrent task,
5 second run schedule, bin-packing merge strategy, min records = 1, max
records = 100, max bin age = 4.5 seconds, maximum number of bins = 1.
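
For reference, here's how those settings read on the processor's
scheduling and properties tabs (property names as they appear in the
standard MergeRecord processor):

Scheduling Strategy: Timer driven
Concurrent Tasks: 1
Run Schedule: 5 sec
Merge Strategy: Bin-Packing Algorithm
Minimum Number of Records: 1
Maximum Number of Records: 100
Max Bin Age: 4.5 sec
Maximum Number of Bins: 1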

For the synthetic data the typical flowfile size is in the range of 2
to 7 KB.

The flowfiles in the production case are smaller - typically around
1 KB.

The record structure differs slightly between the two tests. The
synthetic records look like this (note that I've removed the text part):

[ {
  "sampleid" : 1075,
  "typeid" : 98,
  "dct" : "2020-01-25T21:40:25.515Z",
  "filename" : "__tmp/txt/mtsamples-type-98-sample-1075.txt",
  "document" : "Text removed - typically a few hundred words",
  "docid" : "9"
} ]

Production is:
[ {
  "doc_id" : "5.60622895E8",
  "doc_text" : " Text deleted - typically a few hundred words",
  "processing_timestamp" : "2022-11-27T23:56:35.601Z",
  "metadata_x_ocr_applied" : true,
  "metadata_x_parsed_by" :
"org.apache.tika.parser.DefaultParser;org.apache.tika.parser.microsoft.rtf.RTFParser;org.apache.tika.parser.AutoDetectParser",
  "metadata_content_type" : "application/rtf",
  "metadata_page_count" : null,
  "metadata_creation_date" : null,
  "metadata_last_modified" : null
} ]



I load the queue feeding the MergeRecord processor with several hundred
individual flowfiles and start the processor.
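
(In case it helps anyone reproduce this: the synthetic side of the test
is trivial to set up. Below is a minimal sketch of a generator for the
single-record payloads, assuming the files are picked up by a
GetFile-style ingest into the queue; the directory, record count, and
values are illustrative, not my actual setup.)

import json
import os
from datetime import datetime, timezone

OUT_DIR = "__tmp/json"  # hypothetical directory watched by a GetFile processor
os.makedirs(OUT_DIR, exist_ok=True)

# Write several hundred single-record payloads mirroring the synthetic
# schema above; the document text is placeholder filler.
for i in range(500):
    record = [{
        "sampleid": 1000 + i,
        "typeid": 98,
        "dct": datetime.now(timezone.utc)
                       .isoformat(timespec="milliseconds")
                       .replace("+00:00", "Z"),
        "filename": f"__tmp/txt/mtsamples-type-98-sample-{1000 + i}.txt",
        "document": "placeholder " * 300,  # a few hundred words of text
        "docid": str(i),
    }]
    with open(os.path.join(OUT_DIR, f"sample-{i}.json"), "w") as f:
        json.dump(record, f)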

The synthetic data is neatly merged into bins of 100 records, with any
remainder flushed in a smaller final flowfile.

The production data is generally bundled into groups of about 6
records, sometimes fewer. It certainly never gets close to 100 records.

Any ideas as to what I should look at to track down the difference?

Thanks
