Hi,

we're setting up a flow that ingests small records (each one short CSV line) from Kafka, then treats / filters / routes / merges / compresses them before writing to HDFS.
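
For context, the chain looks roughly like this (the processor names here are only illustrative of the standard NiFi ones; our treat/filter steps are our own):

    ConsumeKafka -> (treat / filter) -> RouteOnAttribute -> MergeRecord -> CompressContent -> PutHDFS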

We do not fully understand how the merge processor should be set up, as it does not behave the way we expect.

We want it to merge records into flowfiles that will, in the end, fill our HDFS blocks (128 MB for now).

Here are the merge processor parameters:

Minimum Number of Records: 1000000
Minimum Bin Size: 200 MB
Maximum Bin Size: 250 MB
Maximum Number of Bins: 1
In the Scheduling settings we left the Concurrent Tasks parameter at 1.
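
Spelled out as they would appear in the processor's Properties tab (assuming this is NiFi's MergeRecord; any property not listed above is, to our knowledge, still at its default):

    Merge Strategy            : Bin-Packing Algorithm   (default)
    Minimum Number of Records : 1000000
    Maximum Number of Records : (untouched; default is 1000?)
    Minimum Bin Size          : 200 MB
    Maximum Bin Size          : 250 MB
    Max Bin Age               : (not set)
    Maximum Number of Bins    : 1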

As you can see, we set the Maximum Bin Size higher than our actual target because the flowfiles are compressed by a downstream processor (we assume compression roughly halves the size, so a 200-250 MB uncompressed bin should land near our ~120 MB target).

However, we observe that even though we specify a Maximum Number of Bins of 1 and a Minimum Bin Size of 200 MB, the resulting behaviour barely respects those parameters: the processor emits two small flowfiles (around 25 MB each) at a time, while the incoming queue holds more than enough elements to fill a single 128 MB file.

So our question is: is there a way to configure our processors to achieve our goal of filling HDFS with files of around 120 MB?

Thanks in advance,

Gautier.
