Hello

Yeah, when there are a ton of files (50k or more) in a single directory, listing performance is *horrible*. If you can split them across some subdirectories, it will go a lot faster.
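One way to do the split outside of NiFi is a small one-off script. The sketch below is only an illustration under stated assumptions: the function name `shard_directory`, the `bucket_` prefix, and the default of 256 buckets are all made up for this example, and it assumes the files can be moved in place on the same filesystem. It hashes each file name so the files spread roughly evenly across the subdirectories.

```python
import hashlib
import os
import shutil


def shard_directory(src_dir: str, num_buckets: int = 256) -> int:
    """Move every regular file directly under src_dir into a bucket subdirectory.

    Hypothetical helper, not part of NiFi. Returns the number of files moved.
    """
    # Snapshot the file names first so that moving files does not
    # interfere with the directory iteration.
    with os.scandir(src_dir) as entries:
        names = [e.name for e in entries if e.is_file(follow_symlinks=False)]

    width = len(str(num_buckets - 1))  # zero-pad bucket names to equal length
    moved = 0
    for name in names:
        # Hash the file name so buckets fill evenly regardless of naming scheme.
        digest = hashlib.md5(os.fsencode(name)).hexdigest()
        bucket = int(digest, 16) % num_buckets
        bucket_dir = os.path.join(src_dir, f"bucket_{bucket:0{width}d}")
        os.makedirs(bucket_dir, exist_ok=True)
        shutil.move(os.path.join(src_dir, name), os.path.join(bucket_dir, name))
        moved += 1
    return moved
```

With about 3.9 million files, 256 buckets would leave roughly 15k files per subdirectory, which is under the 50k mark mentioned above. ListFile would then need recursion turned on (or one ListFile per subdirectory) to see the moved files.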
Thanks

On Fri, Feb 26, 2021 at 7:30 PM Jean-Sebastien Vachon <[email protected]> wrote:

> Hi again,
>
> I need to reprocess all my files after we discovered a problem. My folder
> contains 3,906,135 JSON files (590GB total size).
> I tried the ListFile strategy, and it works fine on a small subset, but on
> the whole dataset not a single flow was queued after many hours of waiting.
>
> Is it normal that it takes so long to do something?
>
> I am using the following settings:
>
> Tracking Timestamps,
> no recurse,
> file filter is set to the default ([^\.].*),
> the minimal size is 0b and the min age is 0s,
> track performance is off,
> max number of files is set to 5,000,000
> max disk op time is 10 s
> max directory listing time is 3 hours
>
> Am I doing something wrong? My server is quite capable with 512GB of RAM
> and 128 cores.
>
> Thanks
>
> *Jean-Sébastien Vachon*
> Co-Founder & Architect
>
> *Brizo Data, Inc. www.brizodata.com
> <https://outlook.office365.com/mail/options/mail/messageContent/www.brizodata.com>*
> ------------------------------
> *From:* Jean-Sebastien Vachon <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:59 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> OK thanks
>
> I missed that part of the documentation. Stupid me
>
> ------------------------------
> *From:* Arpad Boda <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:46 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> GetFile has no persistence.
> Actually it has, but it's called your hard drive. :)
>
> If you take a look at the documentation:
> *Keep Source File - *"If true, the file is not deleted after it has been
> copied to the Content Repository; this causes the file to be picked up
> continually and is useful for testing purposes. If not keeping original
> NiFi will need write permissions on the directory it is pulling from
> otherwise it will ignore the file."
>
> You can see that it's going to get the same files over and over again
> unless you configure it to delete the already processed ones.
>
> The reason I suggested the combination above is that ListFile can be
> triggered once, the metadata (filenames) is stored in your queue, and
> FetchFile can process the files later.
>
> On Thu, Feb 18, 2021 at 2:39 PM Jean-Sebastien Vachon <[email protected]> wrote:
>
> OK I understand your point.. sorry (early morning) 😉
>
> I am kind of stuck with the GetFile processor for now. Is there a way to
> know how many files are left to process?
>
> Will it go on forever, or will it stop streaming once all files have been
> processed? (There are no new files in the folder... everything was there at
> the beginning.)
>
> Thanks
>
> ------------------------------
> *From:* Jean-Sebastien Vachon <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:34 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> Thanks for your comment. However, I can't queue everything as the total
> size of the data is around 560GB.
> Right now, I am using a GetFile processor and it has been running for a
> few days. If I look at my end point, it looks like it should be done pretty
> soon, but data is still streaming in at the same rate, so I was wondering
> if the processor remembers every single file it has already processed or
> if it is simply going through all the files alphabetically or in whatever
> order it decides.
>
> Thanks
>
> ------------------------------
> *From:* Arpad Boda <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:29 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> You can use the combination of ListFile and FetchFile.
> In the queue between the two you are going to see the number of
> (flow)files left to be processed.
>
> On Thu, Feb 18, 2021 at 2:14 PM Jean-Sebastien Vachon <[email protected]> wrote:
>
> Hi all,
>
> If I configure a GetFile processor to list all JSON files under a given
> folder, will it stop sending flows once it has processed all files?
> My folder contains thousands of files and the processor reads them in
> small batches (10) every 30s.
>
> Is there a way to know how many files are left to process?
>
> Thanks
