RE: ‘On primary node’ strategy of GetHDFS maybe not working

彭光裕 Wed, 12 Aug 2015 00:07:38 -0700

Thank you Joe again.

ListHDFS(primary node) + FetchHDFS(cluster) is a good idea. I’ll remove GetHDFS 
and try this suggestion later.


By the way, as you can see in the attached picture, GetHDFS shows only 1 
enqueued but 2 dequeued, so CompressContent shows 2 in and 2 out. As a result, 
the file content is twice what I expected.

I’m sure the “keep source file” property is false, but the duplicate pulls 
still happen.

However, for distributed processing, ListHDFS and FetchHDFS would be 
better than GetHDFS on the primary node alone.

Thanks again,

Roland.
From: Joe Witt [mailto:[email protected]]
Sent: Wednesday, August 12, 2015 10:43 AM
To: [email protected]
Subject: Re: ‘On primary node’ strategy of GetHDFS maybe not working

Hello

GetHDFS pulls the file from HDFS then deletes the original.  Race conditions 
are possible, though they seem unlikely if only the primary node is doing 
the pull.  It is likely better at this point to use the 'ListHDFS' 
processor followed by the 'FetchHDFS' processor.  You can run the ListHDFS 
processor on a single node (primary node) and then send the listing results 
across the cluster using site-to-site if necessary and from there use FetchHDFS.
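
As a rough illustration of the list-once, fetch-everywhere idea (this is not 
NiFi code), the sketch below assigns each listed path to exactly one node. The 
function name, the node names, and the round-robin policy are all assumptions 
made for the example; site-to-site may distribute work differently.

```python
from itertools import cycle


def distribute_listing(paths, nodes):
    """Assign each listed path to exactly one node, round-robin.

    Hypothetical helper: the listing is produced once (on one node),
    so no path is handed to two different fetchers.
    """
    assignment = {node: [] for node in nodes}
    for path, node in zip(paths, cycle(nodes)):
        assignment[node].append(path)
    return assignment


if __name__ == "__main__":
    listing = ["/data/a.txt", "/data/b.txt", "/data/c.txt"]
    print(distribute_listing(listing, ["node1", "node2"]))
```

Because the listing step runs in one place, each path appears in exactly one 
node's work list, which is what removes the competition between nodes.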

All that is probably overkill though.  First step is to figure out why you are 
seeing duplication.  Is NiFi unable to delete the original file?  Please make 
sure that "keep source file" is false on GetHDFS.  If it is true then NiFi would 
keep pulling it.  However, by using ListHDFS and FetchHDFS you can pull in an 
idempotent manner.  For that case you use a Distributed Cache Service which 
shares state about listings seen across the cluster.
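
To make the idempotent listing idea concrete, here is a minimal sketch under 
stated assumptions. It is not the NiFi Distributed Cache Service API: the 
SharedSeenCache class, its add_if_absent method, and list_new_files are 
hypothetical names, and a plain in-memory set stands in for state that would 
really be shared across the cluster.

```python
class SharedSeenCache:
    """Stand-in for shared cluster state (hypothetical; in-memory only)."""

    def __init__(self):
        self._seen = set()

    def add_if_absent(self, key):
        """Record key and return True only the first time it is seen."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def list_new_files(directory_listing, cache):
    """Return only the paths that have not been listed before."""
    return [path for path in directory_listing if cache.add_if_absent(path)]


if __name__ == "__main__":
    cache = SharedSeenCache()
    print(list_new_files(["/data/a", "/data/b"], cache))
    # A later listing that still contains the old files yields only new ones.
    print(list_new_files(["/data/a", "/data/b", "/data/c"], cache))
```

Listing the same directory twice emits each path only once, so a file that 
was not deleted at the source is not pulled again. A real shared cache would 
also need the check-and-record step to be atomic across nodes, which this 
single-process sketch does not attempt.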

Please let us know if this helps or if you would like more pointers.  This is 
of course a really common use case so if we need to better document the pattern 
we're happy to do so.

Thanks
Joe

On Tue, Aug 11, 2015 at 9:31 PM, 彭光裕 
<[email protected]> wrote:
Hi,

My flow has a GetHDFS processor. My problem is that I always get many copies 
of the same output files from this processor, whether the scheduling strategy 
is ‘On primary node’ or ‘Timer Driven’. I thought ‘On primary node’ would get 
only one copy from HDFS, but it doesn’t.
My working environment is a NiFi cluster with two worker nodes. I suspect the 
‘On primary node’ strategy of GetHDFS is not working, so all the nodes invoke 
GetHDFS and a race condition occurs.

Any advice is welcome. Thank you!

Roland.


