Hello

GetHDFS pulls the file from HDFS and then deletes the original.  A race
condition is possible, though it seems unlikely if only the primary node is
doing the pull.  At this point it is likely better to use the 'ListHDFS'
processor followed by the 'FetchHDFS' processor.  You can run ListHDFS on a
single node (the primary node), send the listing results across the cluster
using site-to-site if necessary, and use FetchHDFS from there.
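
To make the list-then-fetch split concrete, here is a minimal Python sketch.
It is not NiFi code: the `distribute` function, the node names, and the file
paths are hypothetical, and round-robin is only a stand-in for however the
listing actually gets spread across the cluster.  The point is that one node
produces the listing and each path is handed to exactly one fetcher.

```python
from itertools import cycle


def distribute(paths, nodes):
    """Assign each listed path to exactly one node, round-robin.

    Hypothetical helper: only one node lists, so each path appears in the
    listing once and therefore gets fetched by a single node.
    """
    assignment = {node: [] for node in nodes}
    for path, node in zip(paths, cycle(nodes)):
        assignment[node].append(path)
    return assignment


# Listing produced on a single node (made-up paths).
listing = ["/data/a.csv", "/data/b.csv", "/data/c.csv"]
plan = distribute(listing, ["node1", "node2"])
```

Because no path is assigned to two nodes, the fetch step cannot race with
itself the way several independent pullers can.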

All that is probably overkill though.  The first step is to figure out why
you are seeing duplication.  Is NiFi unable to delete the original file?
Please make sure the "Keep Source File" property on GetHDFS is false.  If it
is true, NiFi will keep pulling the same file.  By using ListHDFS and
FetchHDFS, however, you can pull in an idempotent manner.  In that case you
use a Distributed Cache Service, which shares state about the listings
already seen across the cluster.
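
As a rough illustration of what "idempotent" means here, the Python sketch
below filters a listing against remembered state.  It is an assumption-laden
toy, not how ListHDFS is implemented: the `new_entries` function is
hypothetical, a plain dict stands in for the shared cache, and keying on
path plus modification time is just one reasonable choice.

```python
def new_entries(listing, seen):
    """Return paths not yet handled, recording them in `seen`.

    listing: iterable of (path, modified_time) tuples.
    seen:    dict standing in for the shared cache, mapping each path to
             the newest modified_time already handled (hypothetical).
    """
    fresh = []
    for path, mtime in listing:
        # Emit a path only if it is new or has changed since last time.
        if path not in seen or seen[path] < mtime:
            seen[path] = mtime
            fresh.append(path)
    return fresh


seen = {}
first = new_entries([("/data/a.csv", 1), ("/data/b.csv", 2)], seen)
second = new_entries([("/data/a.csv", 1), ("/data/b.csv", 2)], seen)
third = new_entries([("/data/a.csv", 3), ("/data/b.csv", 2)], seen)
```

Running the same listing twice yields nothing the second time, so repeating
the list step does not produce duplicate fetches; only a file that is new
or modified is emitted again.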

Please let us know if this helps or if you would like more pointers.  This
is of course a really common use case so if we need to better document the
pattern we're happy to do so.

Thanks
Joe

On Tue, Aug 11, 2015 at 9:31 PM, 彭光裕 <[email protected]> wrote:

> hi,
>
>
>
> My flow has a GetHDFS processor. My question is that I always get many
> copies of the same output files from this processor, whether the
> scheduling strategy is ‘On primary node’ or ‘Timer Driven’. I thought ‘On
> primary node’ would pull only one copy from HDFS, but it doesn’t.
>
> My working environment is a NiFi cluster with two worker nodes. I guess the
> ‘On primary node’ strategy of GetHDFS may not be working, so all the nodes
> invoke GetHDFS and the race condition happens.
>
> Any advice would be welcome, thank you!
>
>
>
> Roland.
