Hi Bryan,

My ultimate goal is to create the smallest number of files of the largest size on HDFS. In this case I have 32 types of data coming in from a single topic, marked by an attribute, and my binning is by <type/YYYY/mm>. If, for example, I get a stream containing data from these 32 types spread out over the last 2 months, I seem to need 32 (types) x 2 (months) x 3 (nodes) x 32 MB (size limit) = 6 GB in my queue before a merge gets triggered. This effect is compounded when my next step bins at 64 MB (12 GB queue) and then 128 MB (24 GB queue needed) thereafter.
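The backlog arithmetic above can be sketched as follows (the figures are the ones assumed in this thread: 32 types, 2 months of bins, 3 nodes, 32 MB minimum bundle size):

```python
# Sketch of the pre-merge backlog described above: every <type/YYYY/mm>
# bin on every node must reach the minimum bundle size before it merges,
# so the worst-case queued data is the product of all four factors.
types = 32          # distinct data types on the topic
months = 2          # months spanned by the incoming stream
nodes = 3           # NiFi cluster nodes
min_bundle_mb = 32  # MergeContent minimum bundle size, in MB

backlog_mb = types * months * nodes * min_bundle_mb
print(backlog_mb)         # 6144 MB, i.e. ~6 GB queued before all bins can merge
print(backlog_mb * 2)     # 12288 MB at the 64 MB stage
print(backlog_mb * 4)     # 24576 MB at the 128 MB stage
```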
The merge only gets triggered on data residing on the same node that MergeContent is running on. So while there is effectively more than enough data in the queue to merge, it will hold out until there is enough on a single node. This is fine for a data type with enough volume, but some types are bigger than others, and sometimes there is enough data in the total queue but not enough on any single node, forcing the data to flow through only by age-out, resulting in small files on HDFS down the line.

Flume opens a file in HDFS and keeps appending until max size or max time is reached. I'm looking for similar or better functionality in NiFi, resulting in few, large files in HDFS.

Rob

On Mon, Sep 17, 2018 at 10:56 PM Bryan Bende <[email protected]> wrote:
> Hello,
>
> I'm not sure I follow... wouldn't it be more efficient to merge
> multiple files in parallel across the cluster?
>
> If you had to converge them all to one node, then this doesn't seem
> much different than just having a stand-alone NiFi, which would go
> against needing a cluster to achieve the desired throughput.
>
> -Bryan
>
> On Mon, Sep 17, 2018 at 4:02 PM Rob Verkuylen <[email protected]> wrote:
> >
> > I really want to replace Flume with NiFi, so for the simplest use case I
> > basically have Kafka -> UpdateAttribute -> MergeContent (32 -> 64 -> 128 MB) -> PutHDFS.
> >
> > I need to run in cluster mode to get the throughput I need, but run into
> > the problem that flowfiles assigned to nodes are only merged on those nodes,
> > effectively dividing my merge efficiency by the number of NiFi nodes.
> >
> > Is there a workaround for this issue?
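The per-node hold-out Rob describes can be illustrated with a small sketch (this is not NiFi code; the per-node byte counts are hypothetical):

```python
# Illustrative sketch: a bin that is large enough cluster-wide can still
# fail to reach MergeContent's minimum size on any single node, because
# each node's MergeContent only sees the flowfiles queued on that node.
MIN_BUNDLE_MB = 32  # assumed minimum bundle size configured on MergeContent

# Hypothetical queued megabytes for one <type/YYYY/mm> bin, per node
queued_mb = {"node1": 15, "node2": 14, "node3": 13}

cluster_total = sum(queued_mb.values())
mergeable_nodes = [n for n, mb in queued_mb.items() if mb >= MIN_BUNDLE_MB]

print(cluster_total >= MIN_BUNDLE_MB)  # True: 42 MB in total, enough overall
print(mergeable_nodes)                 # []: no single node can merge,
                                       # so the bin only flows via age-out
```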
