Gautier,

I'm not certain exactly what is wrong, but as an experiment, please try
setting "Max Number of Bins" to a value greater than 1 (2 might be enough).
My suspicion is that when you are using 100% of the allowed bins, the
processor attempts to process the oldest bin every time.  Because you allow
only 1 bin, that bin is always the oldest.  If 2 bins were allowed, you would
use 1 and keep 1 available but never used.
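
Concretely, something like this (a rough sketch reusing your other values;
exact property display names may differ slightly in your NiFi version):

    Min Number of Records : 1,000,000
    Min Bin Size          : 200 MB
    Max Bin Size          : 250 MB
    Max Number of Bins    : 2

If the suspicion above is right, this keeps one bin filling while the other
stays available, so the processor is never at 100% bin usage and is not
forced to flush a partially filled bin.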

You did not ask, but you might also consider using a sequence of two merge
processors: one to bin single records into bundles of 1,000 or 10,000,
followed by a second to reach the 1,000,000-record / 128 MB target.  This
will help reduce the number of simultaneous flowfiles and keep the flow rate
steadier.
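
For illustration only (assuming MergeRecord, or MergeContent with the
equivalent properties; names and values approximate):

    MergeRecord #1 (pre-bundle the tiny records):
      Min Number of Records : 10,000
      Max Number of Records : 10,000
      Max Number of Bins    : 2 or more

    MergeRecord #2 (build the final HDFS-sized file):
      Min Number of Records : 1,000,000
      Min Bin Size          : 200 MB
      Max Bin Size          : 250 MB
      Max Number of Bins    : 2 or more

The first stage collapses many tiny flowfiles into a smaller number of
medium-sized ones, so the second stage has far fewer flowfiles to track
while it accumulates the 128 MB target.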


On Fri, Aug 24, 2018 at 7:27 AM Gautier DARAS <[email protected]>
wrote:

> Hi,
>
> we're setting up a flow that ingests small records (each consisting of one
> short CSV line) from Kafka, then treats / filters / routes / merges /
> compresses them before writing to HDFS.
>
> We do not understand how the merge processor should be set up, as it does
> not behave as we expect.
>
> We want it to merge records into flowfiles that will, in the end, fill our
> HDFS blocks (128 MB for now).
>
> Here are the merge processor parameters:
>
> Min Number of Records : 1,000,000
> Min Bin Size : 200 MB
> Max Bin Size : 250 MB
> Max Number of Bins : 1
> Under Settings, we left the Concurrent Tasks parameter at 1.
>
> As you can see, we set a higher Max Bin Size than what we actually want,
> because the flowfiles are compressed by a downstream processor.
>
> But we observed that although we specify a Max Number of Bins of 1 and a
> Min Bin Size of 200 MB, the resulting behavior does not respect those
> parameters: it creates two small flowfiles (around 25 MB) at a time, even
> though the queue contains enough elements to fill one of 128 MB.
>
> So our question is whether there is a way to configure our processors to
> achieve our goal: filling HDFS with files of around 120 MB.
>
> Thanks in advance,
>
> Gautier.
>
>
