Best practice approach to set HDFS filename based on attributes

Andre Mon, 19 Oct 2015 03:23:52 -0700

Hi there,

I'm considering implementing a lambda architecture using NiFi where, as
usual, one data path goes into HDFS while another data path goes into
Spark/Flink/whatever. Before I get to the streaming section of the
pipeline, however, I want to plan a decent file-saving strategy.


I've noticed the filename property for PutHDFS isn't exposed via the UI.
However, as is very well documented here
(https://kisstechdocs.wordpress.com/2015/01/15/creating-a-limited-failure-loop-in-nifi/),
I can change the attribute using other processors (e.g.
UpdateAttribute) before the FlowFile reaches the PutHDFS processor.

This suggests to me that I could, for example, have a pipeline that
looks pretty much like this:

1. ListenHTTP => captures attribute LogSrc from HTTP request header LogSrc

2. MergeContent  => where Correlation Attribute Name = LogSrc /
Attribute Strategy = Keep Only Common Attributes

3. UpdateAttribute => Updates the filename attribute so that it is now
data-${now():format('yyyyMMdd')}.log (e.g. data-20151019.log )

4. PutHDFS => Directory = /${LogSrc}/${now():format('yyyy/MM/dd')}
(e.g. /mydevice/2015/10/19/)


This, I believe, would result in a file named
/mydevice/2015/10/19/data-20151019.log
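
To make the expected outcome concrete, here is a rough sketch in plain
Python of what I expect steps 2 to 4 to do. This is not NiFi code: the
function names merge_by_log_src and hdfs_target are made up for
illustration, FlowFiles are modelled as simple attribute dicts, and I am
assuming that "Keep Only Common Attributes" keeps only the attributes
that carry the same value on every FlowFile in a bundle.

```python
from collections import defaultdict
from datetime import datetime


def merge_by_log_src(flowfiles):
    """Group attribute dicts by LogSrc; keep only attributes shared by the whole group."""
    groups = defaultdict(list)
    for attrs in flowfiles:
        groups[attrs["LogSrc"]].append(attrs)

    merged = {}
    for log_src, members in groups.items():
        first = members[0]
        # Assumed behaviour of "Keep Only Common Attributes":
        # drop any attribute whose value differs within the bundle.
        merged[log_src] = {
            key: value
            for key, value in first.items()
            if all(other.get(key) == value for other in members)
        }
    return merged


def hdfs_target(attrs, now):
    """Mimic steps 3 and 4: derive the filename and directory from one timestamp."""
    filename = "data-" + now.strftime("%Y%m%d") + ".log"        # format('yyyyMMdd')
    directory = "/" + attrs["LogSrc"] + "/" + now.strftime("%Y/%m/%d")  # format('yyyy/MM/dd')
    return directory + "/" + filename


flowfiles = [
    {"LogSrc": "mydevice", "filename": "a1"},
    {"LogSrc": "mydevice", "filename": "b2"},
]
merged = merge_by_log_src(flowfiles)
path = hdfs_target(merged["mydevice"], datetime(2015, 10, 19, 3, 23))
print(path)  # /mydevice/2015/10/19/data-20151019.log
```

Note that the sketch uses a single timestamp for both parts of the path,
whereas in the flow now() would be evaluated once in UpdateAttribute and
again in PutHDFS, so around midnight the two could in principle land on
different days.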


Now the question:

I know I could skip step 3 if I accepted NiFi-determined filenames,
but I wonder: is this the best way of achieving the file naming
described above?

On a side note: Could anyone point me to the section of the code that
defines the current naming convention? :-)

I thank you in advance
