Thanks for your help Matt and Andre. We will try out your proposed solutions.

Regards,
Suyog Kulkarni
Email: [email protected]<mailto:[email protected]>

From: Andre [mailto:[email protected]]
Sent: Wednesday, September 07, 2016 8:54 PM
To: [email protected]
Subject: Re: Appending files in Hadoop with PutHDFS ...

Suyog,

I suspect you are struggling to get MergeContent into a configuration that 
achieves an optimal balance between latency and the number of files?

If that is the case, there are a few ways of solving this issue. Matt's is a 
great one and very popular among Hadoop users, but not the only one:


Without going into vendor-specific approaches, another possible way of solving 
this is to use a staging folder in HDFS, and then use NiFi to grab and 
concatenate the files via

GetHDFS -> MergeContent -> PutHDFS

In summary you would have a flow like this:



Realtime Pipeline:

Listen* (whatever protocol you are using) -> 
MergeContent_with_low_latency_settings -> PutHDFS_to_staging_folder

Ideally you would name the staging folder after the hour, minute, or whatever 
interval you want to concatenate on (e.g. hdfs:/sensor/staging_data/2016/08/31/00 )

Your real-time apps would point to the staging folder.




Concatenation Pipeline:

GetHDFS_from_staging_folder -> MergeContent -> PutHDFS_to_warm_store

On your GetHDFS_from_staging_folder you set:

* Directory field => use Expression Language to build the path, e.g.
hdfs:/sensor/staging_data/${now():toNumber():minus(3600000):format('yyyy/MM/dd/HH')}
  (this assumes an hourly concatenation; adjust to strike the right balance of 
files / buckets)

* Batch Size => use a larger value, so you can fetch a large number of small 
files per iteration

Your PutHDFS_to_warm_store destination would again be set dynamically based 
on time.
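As an illustration of what that Expression Language resolves to, the sketch 
below computes the "one hour ago" bucket path in shell (using GNU date; the 
timestamp is fixed so the output is predictable, whereas NiFi would use the 
current time, and format() uses the node's local time zone rather than UTC):

```shell
# Pretend ${now():toNumber()} returned this epoch-millis value
# (2016-08-31 01:30:00 UTC), so the result is predictable.
now_ms=1472607000000
# minus(3600000) milliseconds, converted to epoch seconds for date(1)
hour_ago_s=$(( now_ms / 1000 - 3600 ))
# format('yyyy/MM/dd/HH') equivalent (UTC here for reproducibility)
bucket=$(date -u -d "@${hour_ago_s}" +%Y/%m/%d/%H)
echo "hdfs:/sensor/staging_data/${bucket}"
# -> hdfs:/sensor/staging_data/2016/08/31/00
```

The hour-level bucket means each GetHDFS run picks up a directory that is no 
longer being written to by the realtime pipeline.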





Hope this helps




On Thu, Sep 8, 2016 at 9:06 AM, Matt Burgess 
<[email protected]> wrote:
Suyog,

If MergeContent is not working out, you could put a Hadoop client on the NiFi 
node, or a NiFi instance on a Hadoop cluster. In the latter case you can put a 
Remote Process Group on the edge node NiFi and an Input Port on the Hadoop 
cluster NiFi, then send the files from the edge to the cluster. On the Hadoop 
NiFi you can use PutHDFS to place the small files, then ExecuteStreamCommand to 
execute a "hadoop fs -cat" command to bring all the small files together for 
more efficient processing. I realize it's not ideal but could be a viable 
workaround until the aforementioned Jiras get resolved.
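The "hadoop fs -cat" merge pattern Matt describes can be simulated locally 
with plain cat, as in the sketch below; the commented hadoop command shows 
the same idea on the cluster (the HDFS paths there are hypothetical examples):

```shell
# Create a throwaway directory with a few "small files", then concatenate
# them into one larger file -- the same shape as the HDFS workaround.
demo=$(mktemp -d)
printf 'reading-1\n' > "$demo/part-001"
printf 'reading-2\n' > "$demo/part-002"
cat "$demo"/part-* > "$demo/merged.txt"
# On the cluster, the equivalent (hypothetical paths):
# hadoop fs -cat /sensor/staging/2016/08/31/00/* | hadoop fs -put - /sensor/warm/2016-08-31-00.txt
cat "$demo/merged.txt"
```

Note that `hadoop fs -cat ... | hadoop fs -put - ...` streams everything 
through the client, so it trades network traffic for the reduction in 
NameNode metadata from fewer files.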

Regards,
Matt


> On Sep 7, 2016, at 12:54 PM, Kulkarni, Suyog 
> <[email protected]> wrote:
>
> Thanks Matt.
> Any recommendation for a workaround to achieve this? We are currently getting 
> hundreds of sensor messages/minute that we are ingesting into Hadoop (for 
> further analysis) using PutHDFS processor. But instead of creating hundreds 
> of small message files in HDFS, we would like to have them saved as one large 
> daily or weekly file. We successfully tested the MergeContent processor (to 
> merge the message data and periodically write one big file) but the latency 
> it introduces is not acceptable. What are some other options that we can try?
>
> Suyog Kulkarni
> [email protected]<mailto:[email protected]>
>
>
> -----Original Message-----
> From: Matt Burgess [mailto:[email protected]]
> Sent: Wednesday, September 07, 2016 12:30 PM
> To: [email protected]<mailto:[email protected]>
> Subject: Re: Appending files in Hadoop with PutHDFS ...
>
> Suyog,
>
> PutHDFS does not support appending files at the moment. I believe the Jira 
> you mentioned is NIFI-958 [1], which is marked Resolved but should be Closed 
> as duplicate. This case was split into two others,
> NIFI-1321 for PutFile [2] and NIFI-1322 for PutHDFS [3]. The latter is not 
> resolved or being actively worked on, and the former appears to have been 
> abandoned in favor of an AppendLog processor.
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-958
> [2] https://issues.apache.org/jira/browse/NIFI-1321
> [3] https://issues.apache.org/jira/browse/NIFI-1322
>
>> On Wed, Sep 7, 2016 at 12:24 PM, Kulkarni, Suyog 
>> <[email protected]> wrote:
>> Hi,
>>
>>
>>
>> I just wanted to find out if PutHDFS now supports appending files in
>> HDFS or not. I noticed there was a Jira with status “Resolved” for
>> this, but I wanted to know which version has this feature or if there
>> is any patch available for this. Also would like to know if anyone has
>> tried it successfully or not. We are currently running version 0.6.
>>
>>
>>
>> Thanks,
>>
>> Suyog Kulkarni
>>
>> [email protected]<mailto:[email protected]>
>>
>>
>>
>>
>>
>>
>> This email transmission and any accompanying attachments may contain
>> CSX privileged and confidential information intended only for the use
>> of the intended addressee. Any dissemination, distribution, copying or
>> action taken in reliance on the contents of this email by anyone other
>> than the intended recipient is strictly prohibited. If you have
>> received this email in error please immediately delete it and notify
>> sender at the above CSX email address. Sender and CSX accept no
>> liability for any damage caused directly or indirectly by receipt of this 
>> email.
>
>
>
