Hello,

Run Schedule should be 0.
50 should be the Minimum Number of Records, and 5 seconds is the Max Bin Age it sounds like you want. Start with these changes and let us know what you're seeing.

Thanks

On Fri, Dec 2, 2022 at 10:12 PM Richard Beare <[email protected]> wrote:
> Hi,
> I'm having a great deal of trouble configuring the MergeRecord processor
> to deliver reasonable performance, and I'm not sure where to look to
> correct it. One of my upstream processors requires a single record per
> flowfile, but I'd like to create larger flowfiles before passing to the
> next stage. The flowfiles are independent at this stage, so there's no
> special processing required of the merging. I'd like to create flowfiles
> of about 50 to 100 records.
>
> I have two tests, both running on the same NiFi system. One uses
> synthetic data, the other the production data. The performance of the
> MergeRecord processor for the synthetic data is as I'd expect, and I
> can't figure out why the production data is so much slower. Here's the
> configuration:
>
> MergeRecord has the following settings: timer driven, 1 concurrent task,
> 5 second run schedule, bin-packing merge strategy, min records = 1, max
> records = 100, max bin age = 4.5 secs, maximum number of bins = 1.
>
> In the case of synthetic data the typical flowfile size is in the range
> 2 to 7 KB.
>
> The size of flowfiles for the production case is smaller - typically
> around 1 KB.
>
> The structure in the tests is slightly different.
> Synthetic is (note that I've removed the text part):
>
> [ {
>   "sampleid" : 1075,
>   "typeid" : 98,
>   "dct" : "2020-01-25T21:40:25.515Z",
>   "filename" : "__tmp/txt/mtsamples-type-98-sample-1075.txt",
>   "document" : "Text removed - typically a few hundred words",
>   "docid" : "9"
> } ]
>
> Production is:
>
> [ {
>   "doc_id" : "5.60622895E8",
>   "doc_text" : " Text deleted - typically a few hundred words",
>   "processing_timestamp" : "2022-11-27T23:56:35.601Z",
>   "metadata_x_ocr_applied" : true,
>   "metadata_x_parsed_by" : "org.apache.tika.parser.DefaultParser;org.apache.tika.parser.microsoft.rtf.RTFParser;org.apache.tika.parser.AutoDetectParser",
>   "metadata_content_type" : "application/rtf",
>   "metadata_page_count" : null,
>   "metadata_creation_date" : null,
>   "metadata_last_modified" : null
> } ]
>
> I load up the queue feeding the MergeRecord processor with several
> hundred individual flowfiles and activate it.
>
> The synthetic data is nicely placed into chunks of 100, with any
> remainder being flushed in a smaller chunk.
>
> The production data is generally bundled into groups of 6 records,
> sometimes fewer. Certainly it never gets close to 100 records.
>
> Any ideas as to what I should look at to track down the difference?
>
> Thanks
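
For reference, the suggested changes would leave the MergeRecord processor configured roughly as sketched below (property names as they appear in the NiFi processor's Scheduling and Properties tabs; values other than those stated above are assumptions carried over from the original setup):

```
Run Schedule (Scheduling tab):  0 sec            # run whenever work is available
Merge Strategy:                 Bin-Packing Algorithm
Minimum Number of Records:      50               # was 1; don't emit tiny bins
Maximum Number of Records:      100
Max Bin Age:                    5 sec            # was 4.5 sec; flush stragglers
Maximum Number of Bins:         1                # unchanged from original setup
```

With min records = 1 and a 5-second run schedule, each trigger was free to merge whatever handful of flowfiles happened to be queued; raising the minimum to 50 makes a bin wait for enough records, with Max Bin Age as the fallback flush.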
