Hi Mario,

I just pushed a fix (29bdea) that improves how we check whether a file is a zim page or not. When indexing, at most 50 characters are now read from the start of each file. If your large files are not "line based" (so that reading the first line results in a very long read), this should fix the issue.
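To illustrate the idea, here is a minimal sketch of such a bounded check. It is not the code from the commit: the names `looks_like_zim_page` and `write_sample` and both constants are hypothetical, and the sketch assumes that a zim page starts with the header `Content-Type: text/x-zim-wiki`.

```python
import os
import tempfile

# Assumption: a zim page begins with this header line. The actual check in
# the commit may look for something different.
ZIM_HEADER = "Content-Type: text/x-zim-wiki"
MAX_PROBE_CHARS = 50  # upper bound on characters read per file


def looks_like_zim_page(path, max_chars=MAX_PROBE_CHARS):
    """Return True if the first `max_chars` characters start with the header."""
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as fh:
            # read(n) returns at most n characters; unlike readline() it does
            # not keep going until it finds a newline.
            head = fh.read(max_chars)
    except OSError:
        return False  # unreadable files are simply treated as non-pages
    return head.startswith(ZIM_HEADER)


def write_sample(folder, name, text):
    """Hypothetical helper: write a sample file and return its path."""
    path = os.path.join(folder, name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        page = write_sample(folder, "Home.txt", ZIM_HEADER + "\n\nSome text\n")
        # A large attachment without any newline: one very long "line".
        blob = write_sample(folder, "dump.txt", "x" * 5_000_000)
        print("page:", looks_like_zim_page(page))
        print("blob:", looks_like_zim_page(blob))
```

With a check along these lines the cost per file no longer depends on how long its first line is, which is why files without line breaks should stop slowing down the indexer.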
Regards,
Jaap

On Sat, Apr 24, 2021 at 10:06 AM Mario Bezzi <[email protected]> wrote:

> Hi Jaap, thank you for your help on this.
>
> To give you some more details: Of the 3000+ files which size sums up to
> 2GB, the top 500 account for 1.6GB. Among these the average size is 3.5MB,
> and each of the top three is in the 250MB range.
>
> Please let me know if there is anything I can do to help testing your fix,
> mario
>
> On 4/23/21 2:55 PM, Jaap Karssenberg wrote:
>
> Hi Mario,
>
> That is not the result I hoped for :( I will need to generate some
> random large text files to test & debug on my end.
>
> Regards,
>
> Jaap
>
>
> On Fri, Apr 23, 2021 at 12:59 PM Mario Bezzi <[email protected]> wrote:
>
>> I think I submitted my request circa 2014 under the previous bug tracking
>> system - was it hosted by Ubuntu-one? - but yes, the idea is similar.
>>
>> I just downloaded the development version, extracted it into a temporary
>> folder, and ran it via the ./zim.py command.
>>
>> Indexing took some 15 minutes. Below a snapshot of what top was saying
>> about the execution.
>>
>> top - 12:45:28 up 3 days, 16:12,  1 user,  load average: 1.87, 1.92, 2.48
>> Tasks: 356 total,   3 running, 353 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 13.0 us,  5.4 sy,  0.0 ni, 81.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>> MiB Mem :  31658.1 total,    320.9 free,  19312.0 used,  12025.3 buff/cache
>> MiB Swap:    976.0 total,      0.0 free,    976.0 used.  10085.6 avail Mem
>>
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>  159310 mario_b+  20   0  771220  80184  43420 R 100.0   0.2 *14:42.13 zim.py*
>>
>> Please let me know if there is more I can do.
>>
>> Thank you,
>> mario
>>
>> On 4/23/21 11:25 AM, Jaap Karssenberg wrote:
>>
>> Yes that explains, those large files will have a big impact on the
>> indexer.
>>
>> You are referring to this issue: Make indexer ignore text files that are
>> not zim pages · Issue #907 · zim-desktop-wiki/zim-desktop-wiki (github.com)
>> <https://github.com/zim-desktop-wiki/zim-desktop-wiki/issues/907> which
>> is fixed in the development branch and will be in the next release.
>>
>> With that fix the indexer will read the first line of each file to decide
>> whether it is a zim file or not, and if not it will not try to access the
>> contents.
>>
>> Would be great if you have a chance to test the development branch and
>> see whether it works in practice for your case !
>>
>> -- Jaap
>>
>>
>> On Thu, Apr 22, 2021 at 7:32 PM Mario Bezzi <[email protected]> wrote:
>>
>>> The folder contains 3118 ".txt" files, for a total of 2GB of data. Some
>>> large txt files are attachments. A long time ago I submitted a request to
>>> avoid indexing these. Not sure it has been fulfilled though.
>>>
>>> Thank you,
>>> mario
>>>
>>> On 4/8/21 7:32 PM, Jaap Karssenberg wrote:
>>>
>>> Can you indicate how big your notebook folder is? Either an extreme
>>> case, or some bug making it take much longer than needed.
>>>
>>> Op do 8 apr. 2021 15:59 schreef Mario Bezzi <[email protected]>:
>>>
>>>> Thanks Jaap, I was not aware of this.
>>>>
>>>> To give you an idea, I just restarted Zim, and indexing kept a
>>>> processor 100% busy for 13 minutes to come to an end. It was nice if this
>>>> could be avoided.
>>>>
>>>> Thank you,
>>>> mario
>>>>
>>>> On 4/8/21 10:06 AM, Jaap Karssenberg wrote:
>>>>
>>>> The indexing is not used for searching alone, it is also needed to e.g.
>>>> present the page tree in the side pane and to track links
>>>>
>>>> Op do 8 apr. 2021 09:34 schreef Mario Bezzi <[email protected]>:
>>>>
>>>>> Hello,
>>>>>
>>>>> I may be the only one, but with my quite large notebooks I do find the
>>>>> search function impractical, and for this reason I never use it. Still,
>>>>> when it starts, Zim goes crazy for a long time indexing, and I came to
>>>>> the conclusion that this is normal.
>>>>>
>>>>> If this is the case, I would like to file a requirement to add the
>>>>> ability to make indexing optional.
>>>>>
>>>>> Thank you,
>>>>> mario
>>>>>
>>>>> _______________________________________________
>>>>> Mailing list: https://launchpad.net/~zim-wiki
>>>>> Post to     : [email protected]
>>>>> Unsubscribe : https://launchpad.net/~zim-wiki
>>>>> More help   : https://help.launchpad.net/ListHelp

