Hello

Yeah, when a directory holds a ton of files (50k or more), listing
performance is *horrible*. If you can divide them up into subdirectories,
it will go a lot faster.
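
One way to do the split could be sketched like this in Python. It is an illustration only: the function names, the separate destination directory, and the two-character hash prefix are all assumptions made for the example, not anything NiFi provides.

```python
import hashlib
import os
import shutil


def shard_name(filename: str, width: int = 2) -> str:
    """Return a stable bucket name: the first `width` hex chars of a SHA-256 of the file name."""
    return hashlib.sha256(filename.encode("utf-8")).hexdigest()[:width]


def shard_directory(src_dir: str, dest_dir: str, width: int = 2) -> int:
    """Move every regular file in src_dir into dest_dir/<bucket>/ and return the number moved.

    dest_dir is assumed to be a different directory from src_dir, so the
    directory being scanned never gains new entries while we iterate.
    """
    moved = 0
    # os.scandir yields entries lazily instead of building one huge list.
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # skip subdirectories, symlinks, and other non-files
            bucket = os.path.join(dest_dir, shard_name(entry.name, width))
            os.makedirs(bucket, exist_ok=True)
            shutil.move(entry.path, os.path.join(bucket, entry.name))
            moved += 1
    return moved
```

With width=2 there are at most 256 buckets, so about 3.9 million files would come out to roughly 15,000 per subdirectory. ListFile would then presumably need to recurse into the subdirectories to pick the files up.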

Thanks

On Fri, Feb 26, 2021 at 7:30 PM Jean-Sebastien Vachon <
[email protected]> wrote:

> Hi again,
>
> I need to reprocess all my files after we discovered a problem. My folder
> contains 3,906,135 JSON files (590GB total size).
> I tried the ListFile strategy, and it works fine on a small subset but on
> the whole dataset not a single flow was queued after many hours of waiting.
>
> Is it normal that it takes so long to do something?
>
> I am using the following settings:
>
>   Tracking Timestamps,
>   no recurse,
>   file filter is set to the default ([^\.].*),
>   the minimal size is 0b and the min age is 0s,
>   track performance is off,
>   max number of files is set to 5,000,000
>   max disk op time is 10 s
>   max directory listing time is 3 hours
>
> Am I doing something wrong? My server is quite capable, with 512GB of RAM
> and 128 cores.
>
> Thanks
>
>
> *Jean-Sébastien Vachon *
> Co-Founder & Architect
>
>
> *Brizo Data, Inc.* www.brizodata.com
> ------------------------------
> *From:* Jean-Sebastien Vachon <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:59 AM
>
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> OK thanks
>
> I missed that part of the documentation. Stupid me
>
> ------------------------------
> *From:* Arpad Boda <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:46 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> GetFile has no persistence.
> Actually it has, but it's called your hard drive. :)
>
> If you take a look at the documentation:
> *Keep Source File - *"If true, the file is not deleted after it has been
> copied to the Content Repository; this causes the file to be picked up
> continually and is useful for testing purposes. If not keeping original
> NiFi will need write permissions on the directory it is pulling from
> otherwise it will ignore the file."
>
> You can see that it's going to get the same files over and over again
> unless you configure it to delete the already processed ones.
>
> The reason I suggested the combination above is that ListFile can be
> triggered once, the metadata (filenames) is stored in your queue, and
> FetchFile can process the files later.
>
> On Thu, Feb 18, 2021 at 2:39 PM Jean-Sebastien Vachon <
> [email protected]> wrote:
>
> OK I understand your point.. sorry (early morning) 😉
>
> I am kind of stuck with the GetFile processor for now. Is there a way to
> know how many files are left to process?
>
> Will it go on forever, or will it stop streaming once all files have been
> processed? (There are no new files in the folder; everything was there at
> the beginning.)
>
> Thanks
>
> ------------------------------
> *From:* Jean-Sebastien Vachon <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:34 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> Thanks for your comment. However, I can't queue everything as the total
> size of the data is around 560GB.
> Right now, I am using a GetFile processor and it has been running for a
> few days. Judging by my endpoint, it should be done pretty soon, but data
> is still streaming in at the same rate. So I was wondering whether the
> processor remembers every single file it has already processed, or whether
> it simply goes through all the files alphabetically or in whatever order
> it decides.
>
> Thanks
>
> ------------------------------
> *From:* Arpad Boda <[email protected]>
> *Sent:* Thursday, February 18, 2021 8:29 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Questions about the GetFile processor
>
> You can use the combination of ListFile and FetchFile.
> In the queue between the two you are going to see the number of
> (flow)files left to be processed.
>
> On Thu, Feb 18, 2021 at 2:14 PM Jean-Sebastien Vachon <
> [email protected]> wrote:
>
> Hi all,
>
> If I configure a GetFile processor to list all JSON files under a given
> folder, will it stop sending flows once it has processed all files?
> My folder contains thousands of files and the processor reads them in
> small batches (10) every 30s.
>
> Is there a way to know how many files are left to process?
>
> Thanks
>
>
>
