Gautier, I'm not certain exactly what is wrong, but as an experiment, please try setting "Max Number of Bins" to a value greater than 1 (2 might be enough). My suspicion is that when 100% of the allowed bins are in use, the processor forces out the oldest bin on every run. Because you allow only 1 bin, that bin is always the oldest, so it gets merged before it reaches your configured minimums, which would explain the small output flow files. If 2 bins were allowed, you would actively fill 1 and keep 1 available but never used.
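
For reference, a minimal sketch of what I'd try on your existing merge processor, assuming you are using MergeRecord (exact property display names may differ slightly between NiFi versions):

    Minimum Number of Records : 1000000
    Minimum Bin Size          : 200 MB
    Maximum Bin Size          : 250 MB
    Maximum Number of Bins    : 2        (was 1; gives the processor a spare bin)
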
You did not ask, but you might also consider using a sequence of two merge processors: one to bin single records into bundles of 1,000 or 10,000, followed by a second to reach 1,000,000 records / 128 MB. This will help reduce the number of simultaneous flow files and keep the flow rate steadier. A rough sketch of that two-stage setup is below the quoted message.

On Fri, Aug 24, 2018 at 7:27 AM Gautier DARAS <[email protected]> wrote:

> Hi,
>
> We are setting up a flow that ingests small records (each consisting of one
> short CSV line) from Kafka, then treats / filters / routes / merges / compresses
> them before writing to HDFS.
>
> We do not understand well how the merge processor should be set up, as it
> does not work as we expect.
>
> We want it to merge records into flow files that will, in the end, fill our
> HDFS blocks (128 MB for now).
>
> Here are the merge processor parameters:
>
> Min Number of Records : 1000000
> Min Bin Size : 200 MB
> Max Bin Size : 250 MB
> Max Number of Bins : 1
>
> In the settings we left the Concurrent Tasks parameter at 1.
>
> As you can see, we set a higher Max Bin Size than what we want because
> the flow files are compressed by a further processor.
>
> But we observed that while we specify a Max Number of Bins of 1 and a
> Min Bin Size of 200 MB, the resulting behaviour does not respect those
> parameters: it creates two small flow files (around 25 MB) at a time
> while the queue contains enough elements to fill one with 128 MB.
>
> So our question is whether there is a way to configure our processors to
> achieve our goal: filling HDFS with files of around 120 MB.
>
> Thanks in advance,
>
> Gautier.
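
To make the two-stage idea above concrete, here is a rough sketch of how the two merge processors could be configured. The record counts and bin counts are illustrative starting points, not tuned values, and the property names assume MergeRecord:

    First merge (small bundles, many bins):
      Minimum Number of Records : 10000
      Maximum Number of Records : 10000
      Maximum Number of Bins    : 10

    Second merge (final HDFS-sized files):
      Minimum Number of Records : 1000000
      Minimum Bin Size          : 200 MB
      Maximum Bin Size          : 250 MB
      Maximum Number of Bins    : 2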
