Using the Record Writer will also be much better, as you won't output one flow file per listed file. You'll have one flow file with one record per listed file, and you can then chain multiple SplitRecord processors so that the number of flow files at any one point always remains manageable.
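To illustrate the idea of splitting in stages, here is a minimal sketch in plain Python. It is not NiFi code: `split_records` is a hypothetical helper that only mimics what a SplitRecord processor does with a fixed number of records per split, and the file names and split sizes are made up for the example.

```python
from typing import Iterator, List, Sequence


def split_records(records: Sequence[str], records_per_split: int) -> Iterator[List[str]]:
    """Yield consecutive chunks holding at most records_per_split records each."""
    if records_per_split <= 0:
        raise ValueError("records_per_split must be positive")
    for start in range(0, len(records), records_per_split):
        yield list(records[start:start + records_per_split])


# One "flow file" holding one record per listed file (hypothetical names).
listing = [f"file_{i}.json" for i in range(25)]

# Stage 1: coarse split, 10 records per flow file.
stage1 = list(split_records(listing, 10))

# Stage 2: fine split, applied to a single stage-1 flow file at a time,
# so only a small number of flow files exists at any one point.
stage2_first = list(split_records(stage1[0], 2))

print(len(stage1), [len(chunk) for chunk in stage1])
print(len(stage2_first), stage2_first[0])
```

With back pressure on the queue between the two stages, the fine split only runs on a few coarse chunks at a time, which is what keeps the flow file count bounded.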
On Sat, Feb 27, 2021 at 07:19, Jean-Sebastien Vachon <[email protected]> wrote:

> Thanks for the hint
>
> ------------------------------
> *From:* Joe Witt <[email protected]>
> *Sent:* Friday, February 26, 2021 10:13:20 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> Hello
>
> Yeah when there are a ton (50k or more) of files in a directory
> performance is *horrible*. If you can put them into some subdirs to
> divide it up then it will go a lot faster.
>
> Thanks
>
> On Fri, Feb 26, 2021 at 7:30 PM Jean-Sebastien Vachon <[email protected]> wrote:
>
> Hi again,
>
> I need to reprocess all my files after we discovered a problem. My folder
> contains 3,906,135 JSON files (590GB total size).
> I tried the ListFile strategy, and it works fine on a small subset but on
> the whole dataset not a single flow file was queued after many hours of waiting.
>
> Is it normal that it takes so long to do something?
>
> I am using the following settings:
>
> Tracking Timestamps,
> no recurse,
> file filter is set to the default ([^\.].*),
> the minimal size is 0b and the min age is 0s,
> track performance is off,
> max number of files is set to 5,000,000
> max disk op time is 10 s
> max directory listing time is 3 hours
>
> Am I doing something wrong? My server is quite capable, with 512GB of RAM
> and 128 cores.
>
> Thanks
>
> *Jean-Sébastien Vachon*
> Co-Founder & Architect
> *Brizo Data, Inc.* www.brizodata.com
>
> ------------------------------
> *From:* Jean-Sebastien Vachon <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:59 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> OK thanks
>
> I missed that part of the documentation.
> Stupid me
>
> ------------------------------
> *From:* Arpad Boda <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:46 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> GetFile has no persistence.
> Actually it has, but it's called your hard drive. :)
>
> If you take a look at the documentation:
> *Keep Source File - *"If true, the file is not deleted after it has been
> copied to the Content Repository; this causes the file to be picked up
> continually and is useful for testing purposes. If not keeping original
> NiFi will need write permissions on the directory it is pulling from
> otherwise it will ignore the file."
>
> You can see that it's going to get the same files over and over again
> unless you configure it to delete the already processed ones.
>
> The reason I suggested the combination above is that ListFile can be
> triggered once, the metadata (filenames) are stored in your queue and
> FetchFile can process them later.
>
> On Thu, Feb 18, 2021 at 2:39 PM Jean-Sebastien Vachon <[email protected]> wrote:
>
> OK I understand your point.. sorry (early morning) 😉
>
> I am kind of stuck with the GetFile processor for now. Is there a way to
> know how many files are left to process?
>
> Will it go on forever? Or will it stop streaming once all files have been
> processed? (There are no new files in the folder... everything was there at
> the beginning.)
>
> Thanks
>
> ------------------------------
> *From:* Jean-Sebastien Vachon <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:34 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> Thanks for your comment. However, I can't queue everything as the total
> size of the data is around 560GB.
> Right now, I am using a GetFile processor and it has been running for a
> few days. If I look at my end point, it looks like it should be done pretty
> soon, but data is still streaming in at the same rate, so I was wondering
> if the processor remembers every single file it has already processed or
> if it is simply going through all the files alphabetically or in whatever
> order it decides.
>
> Thanks
>
> ------------------------------
> *From:* Arpad Boda <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:29 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> You can use the combination of ListFile and FetchFile.
> In the queue between the two you are going to see the number of
> (flow)files left to be processed.
>
> On Thu, Feb 18, 2021 at 2:14 PM Jean-Sebastien Vachon <[email protected]> wrote:
>
> Hi all,
>
> If I configure a GetFile processor to list all JSON files under a given
> folder, will it stop sending flow files once it has processed all files?
> My folder contains thousands of files and the processor reads them in
> small batches (10) every 30s.
>
> Is there a way to know how many files are left to process?
>
> Thanks
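For Joe's suggestion in the quoted thread (dividing the files into subdirectories), a rough sketch could look like the following. Everything here is an assumption for illustration: the helper names `bucket_for` and `shard_directory`, the bucket count, and the choice of a CRC32 of the file name to pick a bucket. It moves files, so it should be tried on a copy first, and ListFile would then need Recurse Subdirectories enabled to see the moved files.

```python
import os
import shutil
import tempfile
import zlib


def bucket_for(name: str, buckets: int) -> str:
    """Map a file name to a stable, zero-padded bucket directory name."""
    return format(zlib.crc32(name.encode("utf-8")) % buckets, "04d")


def shard_directory(root: str, buckets: int = 256) -> int:
    """Move every regular file directly under root into root/<bucket>/."""
    # Snapshot the names first so that moving files does not disturb the scan.
    with os.scandir(root) as entries:
        names = [e.name for e in entries if e.is_file(follow_symlinks=False)]
    for name in names:
        target_dir = os.path.join(root, bucket_for(name, buckets))
        os.makedirs(target_dir, exist_ok=True)
        shutil.move(os.path.join(root, name), os.path.join(target_dir, name))
    return len(names)


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_root:
    for i in range(20):
        with open(os.path.join(demo_root, f"doc_{i}.json"), "w") as fh:
            fh.write("{}")
    moved = shard_directory(demo_root, buckets=4)
    children = os.listdir(demo_root)
    left_at_top = [n for n in children if os.path.isfile(os.path.join(demo_root, n))]
    subdirs = [n for n in children if os.path.isdir(os.path.join(demo_root, n))]
    total_after = sum(len(files) for _, _, files in os.walk(demo_root))

print(moved, len(left_at_top), len(subdirs), total_after)
```

Hashing the name rather than using its first characters is a design choice to keep the buckets roughly even when file names share a common prefix.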
