Suyog,

I suspect you are struggling to find MergeContent settings that
achieve an optimal balance between latency and number of files?

If that is the case, there are a few ways of solving this issue. Matt's
approach is great and very popular among Hadoop users, but it is not the
only one:


Without going into vendor-specific options, another way of solving this
is to use a staging folder in HDFS, and then use NiFi to grab and
concatenate the files via

GetHDFS -> MergeContent -> PutHDFS

In summary you would have a flow like this:



Realtime Pipeline:

Listen* (whatever protocol you are using) ->
MergeContent_with_low_latency_settings -> PutHDFS_to_staging_folder

Ideally you would name the staging folder after the hour, minute or
whatever interval you want to concatenate on (e.g.
hdfs:/sensor/staging_data/2016/08/31/00 )

Your real-time apps would point to the staging folder.
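
To make the bucketing concrete, here is a minimal Python sketch (not NiFi
code) of how message timestamps would map onto hourly staging folders. The
function names are hypothetical, and it assumes UTC and the example root
path used above; your flow may use the local time zone instead.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Root taken from the example path in this mail; adjust to your layout.
STAGING_ROOT = "hdfs:/sensor/staging_data"


def staging_folder(epoch_ms: int) -> str:
    """Return the hourly staging folder for a timestamp in epoch milliseconds (UTC assumed)."""
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"{STAGING_ROOT}/{ts:%Y/%m/%d/%H}"


def group_by_folder(epoch_ms_values):
    """Group timestamps by the staging folder they would be written to."""
    buckets = defaultdict(list)
    for ms in epoch_ms_values:
        buckets[staging_folder(ms)].append(ms)
    return dict(buckets)
```

Every message that arrives within the same hour lands in the same folder,
which is what lets the second pipeline pick up a whole hour at once.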




Concatenation Pipeline:

GetHDFS_from_staging_folder -> MergeContent -> PutHDFS_to_warm_store

On your GetHDFS_from_staging_folder you set:

* Directory field to use an Expression Language statement that resolves to
the previous hour's folder, something like =>

hdfs:/sensor/staging_data/${now():toNumber():minus(3600000):format('yyyy/MM/dd/HH')}
  (this assumes an hourly concatenation; adjust to the right balance of
files / buckets)

* Batch Size => use something larger, so you can fetch a large number of
small files per iteration
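
As a sanity check on the Directory expression, the same arithmetic can be
sketched in plain Python (this is not NiFi code; the function name and
parameters are hypothetical): take the current time, subtract 3600000 ms,
and format the result as yyyy/MM/dd/HH.

```python
from datetime import datetime, timedelta


def previous_bucket_dir(now: datetime,
                        base: str = "hdfs:/sensor/staging_data",
                        lag_ms: int = 3_600_000) -> str:
    """Return the staging folder that closed lag_ms milliseconds before `now`."""
    target = now - timedelta(milliseconds=lag_ms)
    # %Y/%m/%d/%H mirrors the 'yyyy/MM/dd/HH' pattern in the expression
    return f"{base}/{target.strftime('%Y/%m/%d/%H')}"
```

For example, a run at 01:15 on 2016-08-31 would read from the 2016/08/31/00
folder, i.e. the hour that has already finished filling up.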

Your PutHDFS_to_warm_store destination would again be set dynamically
based on time.





Hope this helps




On Thu, Sep 8, 2016 at 9:06 AM, Matt Burgess <[email protected]> wrote:

> Suyog,
>
> If MergeContent is not working out, you could put a Hadoop client on the
> NiFi node, or a NiFi instance on a Hadoop cluster. In the latter case you
> can put a Remote Process Group on the edge node NiFi and an Input Port on
> the Hadoop cluster NiFi, then send the files from the edge to the cluster.
> On the Hadoop NiFi you can use PutHDFS to place the small files, then
> ExecuteStreamCommand to execute a "hadoop fs -cat" command to bring all the
> small files together for more efficient processing. I realize it's not
> ideal but could be a viable workaround until the aforementioned Jiras get
> resolved.
>
> Regards,
> Matt
>
>
> > On Sep 7, 2016, at 12:54 PM, Kulkarni, Suyog <[email protected]>
> wrote:
> >
> > Thanks Matt.
> > Any recommendation for a workaround to achieve this? We are currently
> getting hundreds of sensor messages/minute that we are ingesting into
> Hadoop (for further analysis) using PutHDFS processor. But instead of
> creating hundreds of small message files in HDFS, we would like to have
> them saved as one large daily or weekly file. We successfully tested the
> MergeContent processor (to merge the message data and periodically write
> one big file) but the latency it introduces is not acceptable. What are
> some other options that we can try?
> >
> > Suyog Kulkarni
> > [email protected]
> >
> >
> > -----Original Message-----
> > From: Matt Burgess [mailto:[email protected]]
> > Sent: Wednesday, September 07, 2016 12:30 PM
> > To: [email protected]
> > Subject: Re: Appending files in Hadoop with PutHDFS ...
> >
> > Suyog,
> >
> > PutHDFS does not support appending files at the moment. I believe the
> Jira you mentioned is NIFI-958 [1], which is marked Resolved but should be
> Closed as duplicate. This case was split into two others,
> > NIFI-1321 for PutFile [2] and NIFI-1322 for PutHDFS [3]. The latter is
> not resolved or being actively worked on, and the former appears to have
> been abandoned in favor of an AppendLog processor.
> >
> > Regards,
> > Matt
> >
> > [1] https://issues.apache.org/jira/browse/NIFI-958
> > [2] https://issues.apache.org/jira/browse/NIFI-1321
> > [3] https://issues.apache.org/jira/browse/NIFI-1322
> >
> >> On Wed, Sep 7, 2016 at 12:24 PM, Kulkarni, Suyog <
> [email protected]> wrote:
> >> Hi,
> >>
> >>
> >>
> >> I just wanted to find out if PutHDFS now supports appending files in
> >> HDFS or not. I noticed there was a Jira with status “Resolved” for
> >> this, but I wanted to know which version has this feature or if there
> >> is any patch available for this. Also would like to know if anyone has
> >> tried it successfully or not. We are currently running version 0.6.
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Suyog Kulkarni
> >>
> >> [email protected]
> >>
> >>
> >>
> >>
> >>
> >>
> >> This email transmission and any accompanying attachments may contain
> >> CSX privileged and confidential information intended only for the use
> >> of the intended addressee. Any dissemination, distribution, copying or
> >> action taken in reliance on the contents of this email by anyone other
> >> than the intended recipient is strictly prohibited. If you have
> >> received this email in error please immediately delete it and notify
> >> sender at the above CSX email address. Sender and CSX accept no
> >> liability for any damage caused directly or indirectly by receipt of
> this email.
> >
> >
> >
> >
>
