Hi again,

I need to reprocess all my files after we discovered a problem. My folder 
contains 3,906,135 JSON files (590GB total size).
I tried the ListFile strategy. It works fine on a small subset, but on the 
whole dataset not a single flowfile was queued after many hours of waiting.

Is it normal for it to take this long before anything is queued?

I am using the following settings:

  Tracking Timestamps,
  no recurse,
  file filter is set to the default ([^\.].*),
  the minimum size is 0 B and the minimum age is 0 s,
  track performance is off,
  max number of files is set to 5,000,000,
  max disk op time is 10 s,
  max directory listing time is 3 hours

Am I doing something wrong? My server is quite capable, with 512GB of RAM and 
128 cores.
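
One way to separate filesystem cost from NiFi overhead is to time a bare 
directory listing outside NiFi. The following is a minimal Python sketch, not 
anything NiFi runs: the function names and the .json suffix check are 
assumptions made for illustration. It skips names starting with a dot, which 
is what the default file filter above does.

```python
import os
import time


def count_matching_files(directory, suffix=".json"):
    """Stream directory entries and count regular files ending in `suffix`.

    os.scandir yields entries lazily, so the full listing is never held in
    memory. Names starting with a dot are skipped, like the default filter.
    """
    count = 0
    total_bytes = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.startswith("."):
                continue
            if entry.name.endswith(suffix) and entry.is_file(follow_symlinks=False):
                count += 1
                total_bytes += entry.stat(follow_symlinks=False).st_size
    return count, total_bytes


def timed_listing(directory, suffix=".json"):
    """Return (file count, total bytes, elapsed seconds) for one listing pass."""
    start = time.monotonic()
    count, total_bytes = count_matching_files(directory, suffix)
    elapsed = time.monotonic() - start
    return count, total_bytes, elapsed
```

Running timed_listing on the real folder should give a rough idea of how long 
a single pass over 3.9 million entries takes on that disk, which can then be 
compared with the listing-time settings above.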

Thanks

Jean-Sébastien Vachon
Co-Founder & Architect
Brizo Data, Inc.
www.brizodata.com
________________________________
From: Jean-Sebastien Vachon <[email protected]>
Sent: Thursday, February 18, 2021 8:59 AM
To: [email protected] <[email protected]>
Subject: Re: Questions about the GetFile processor

OK thanks

I missed that part of the documentation. Stupid me

Jean-Sébastien Vachon
Co-Founder & Architect
Brizo Data, Inc.
www.brizodata.com
________________________________
From: Arpad Boda <[email protected]>
Sent: Thursday, February 18, 2021 8:46 AM
To: [email protected] <[email protected]>
Subject: Re: Questions about the GetFile processor

GetFile has no persistence.
Actually it has, but it's called your hard drive. :)

If you take a look at the documentation:
Keep Source File - "If true, the file is not deleted after it has been copied 
to the Content Repository; this causes the file to be picked up continually and 
is useful for testing purposes. If not keeping original NiFi will need write 
permissions on the directory it is pulling from otherwise it will ignore the 
file."

You can see that it's going to get the same files over and over again unless 
you configure it to delete the already processed ones.
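
The effect can be illustrated with a toy model of a stateless poller. This is 
a hedged Python sketch and not GetFile's actual code: poll_once, its 
parameters, and the alphabetical ordering are assumptions chosen for the 
example. Because nothing is remembered between passes, the same batch comes 
back every time unless the source files are deleted.

```python
import os


def poll_once(directory, batch_size=10, keep_source_file=True):
    """One polling pass with no memory of earlier passes.

    Picks up to `batch_size` regular files (alphabetical order is an
    assumption of this sketch) and deletes them only when
    `keep_source_file` is False.
    """
    names = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    picked = []
    for name in names[:batch_size]:
        path = os.path.join(directory, name)
        with open(path, "rb") as handle:
            handle.read()  # stand-in for copying the content elsewhere
        picked.append(name)
        if not keep_source_file:
            os.remove(path)
    return picked
```

With keep_source_file=True, two consecutive calls return the same names; with 
False, each call returns the next batch until the directory is empty.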

The reason I suggested the combination above is that ListFile can be triggered 
once; the metadata (filenames) is then stored in your queue, and FetchFile can 
process the files later.
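
As a rough illustration of that list-then-fetch split, here is a small Python 
sketch. It is not how NiFi implements the processors: list_files, fetch_next, 
and the in-memory deque are assumptions standing in for the listing step, the 
fetch step, and the queue between them. The point is that only paths are 
queued, so the queue stays small while its length shows how much work is left.

```python
import os
from collections import deque


def list_files(directory):
    """List step: enqueue one lightweight record (the path) per regular file."""
    queue = deque()
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                queue.append(entry.path)
    return queue


def fetch_next(queue):
    """Fetch step: take one path off the queue and read that file's content."""
    path = queue.popleft()
    with open(path, "rb") as handle:
        return path, handle.read()
```

After list_files has run once, len(queue) is the number of files still to be 
fetched, and it drops by one on every fetch_next call.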

On Thu, Feb 18, 2021 at 2:39 PM Jean-Sebastien Vachon 
<[email protected]> wrote:
OK, I understand your point. Sorry (early morning) 😉

I am kind of stuck with the GetFile processor for now. Is there a way to know 
how many files are left to process?

Will it go on forever, or will it stop streaming once all files have been 
processed? (There are no new files in the folder; everything was there at the 
beginning.)

Thanks

Jean-Sébastien Vachon
Co-Founder & Architect
Brizo Data, Inc.
www.brizodata.com
________________________________
From: Jean-Sebastien Vachon <[email protected]>
Sent: Thursday, February 18, 2021 8:34 AM
To: [email protected] <[email protected]>
Subject: Re: Questions about the GetFile processor

Thanks for your comment. However, I can't queue everything, as the total size 
of the data is around 560GB.
Right now, I am using a GetFile processor and it has been running for a few 
days. Judging by my endpoint, it should be done pretty soon, but data is still 
streaming in at the same rate. So I was wondering whether the processor 
remembers every single file it has already processed, or whether it simply 
goes through all the files alphabetically or in whatever order it decides.

Thanks

Jean-Sébastien Vachon
Co-Founder & Architect
Brizo Data, Inc.
www.brizodata.com
________________________________
From: Arpad Boda <[email protected]>
Sent: Thursday, February 18, 2021 8:29 AM
To: [email protected] <[email protected]>
Subject: Re: Questions about the GetFile processor

You can use the combination of ListFile and FetchFile.
In the queue between the two you are going to see the number of (flow)files 
left to be processed.

On Thu, Feb 18, 2021 at 2:14 PM Jean-Sebastien Vachon 
<[email protected]> wrote:
Hi all,

If I configure a GetFile processor to list all JSON files under a given folder, 
will it stop sending flowfiles once it has processed all files?
My folder contains thousands of files and the processor reads them in small 
batches (10) every 30s.

Is there a way to know how many files are left to process?

Thanks

Jean-Sébastien Vachon
Co-Founder & Architect
Brizo Data, Inc.
www.brizodata.com
