Hi Bryan,

My ultimate goal is to create the smallest number of files of the largest size on HDFS. In this case I have 32 types of data coming in from a single topic, marked by an attribute, and my binning is by <type/YYYY/mm>. If, for example, I get a stream containing data from these 32 types spread out over the last 2 months, I seem to need 32 (types) x 2 (months) x 3 (nodes) x 32 MB (size limit) = 6 GB in my queue before a merge gets triggered. This effect is compounded when my next step bins at 64 MB (12 GB queue) and then 128 MB (24 GB queue needed) thereafter.
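The backlog arithmetic above can be sketched as follows (the figures are the ones assumed in this thread: 32 types, 2 months of bins, 3 nodes, 32 MB minimum bundle size):

```python
# Sketch of the pre-merge backlog described above: every <type/YYYY/mm>
# bin on every node must reach the minimum bundle size before it merges,
# so the worst-case queued data is the product of all four factors.
types = 32          # distinct data types on the topic
months = 2          # months spanned by the incoming stream
nodes = 3           # NiFi cluster nodes
min_bundle_mb = 32  # MergeContent minimum bundle size, in MB

backlog_mb = types * months * nodes * min_bundle_mb
print(backlog_mb)         # 6144 MB, i.e. ~6 GB queued before all bins can merge
print(backlog_mb * 2)     # 12288 MB at the 64 MB stage
print(backlog_mb * 4)     # 24576 MB at the 128 MB stage
```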
The merge only gets triggered on data residing on the same node that MergeContent is running on. So while there is effectively more than enough data in the queue to merge, it will hold out until there is enough on a single node. This is fine for a data type with enough volume, but some types are bigger than others, and sometimes there is enough data in the total queue but not enough on any single node, forcing the data to flow through only by age-out, resulting in small files on HDFS down the line.

Flume opens a file in HDFS and keeps appending until max size or max time is reached. I'm looking for similar or better functionality in NiFi, resulting in few, large files in HDFS.

Rob

On Mon, Sep 17, 2018 at 10:56 PM Bryan Bende <[email protected]> wrote:
> Hello,
>
> I'm not sure I follow... wouldn't it be more efficient to merge
> multiple files in parallel across the cluster?
>
> If you had to converge them all to one node, then this doesn't seem
> much different than just having a stand-alone NiFi, which would go
> against needing a cluster to achieve the desired throughput.
>
> -Bryan
>
> On Mon, Sep 17, 2018 at 4:02 PM Rob Verkuylen <[email protected]> wrote:
> >
> > I really want to replace Flume with NiFi, so for the simplest use case I
> > basically have Kafka -> UpdateAttribute -> MergeContent (32 -> 64 -> 128 MB) -> PutHDFS.
> >
> > I need to run in cluster mode to get the throughput I need, but run into
> > the problem that flowfiles assigned to nodes are only merged on those nodes,
> > effectively dividing my merge efficiency by the number of NiFi nodes.
> >
> > Is there a workaround for this issue?
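The per-node hold-out Rob describes can be illustrated with a small sketch (this is not NiFi code; the per-node byte counts are hypothetical):

```python
# Illustrative sketch: a bin that is large enough cluster-wide can still
# fail to reach MergeContent's minimum size on any single node, because
# each node's MergeContent only sees the flowfiles queued on that node.
MIN_BUNDLE_MB = 32  # assumed minimum bundle size configured on MergeContent

# Hypothetical queued megabytes for one <type/YYYY/mm> bin, per node
queued_mb = {"node1": 15, "node2": 14, "node3": 13}

cluster_total = sum(queued_mb.values())
mergeable_nodes = [n for n, mb in queued_mb.items() if mb >= MIN_BUNDLE_MB]

print(cluster_total >= MIN_BUNDLE_MB)  # True: 42 MB in total, enough overall
print(mergeable_nodes)                 # []: no single node can merge,
                                       # so the bin only flows via age-out
```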
