[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2024-03-27 Thread Stefan Brüns
https://bugs.kde.org/show_bug.cgi?id=380456

Stefan Brüns  changed:

   What|Removed |Added

 CC||stefan.bruens@rwth-aachen.de

--- Comment #22 from Stefan Brüns  ---
(In reply to tagwerk19 from comment #21)
> Created attachment 143869 [details]
> pdftotext results from
> https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC
> 
> (In reply to Adam Fontenot from comment #20)
> > ... The file, in their view, is pathological ...
> Applying a modicum of patience, running:
> 
> nice -19 pdftotext QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
> 
> took 37 hours on a machine with 16GB memory 8-]
> 
> The process gradually ate memory, reaching 10 GB. There wasn't an obvious
> impact on performance - but I would expect you'd see that bite when reaching
> the limits/starting to swap.

The long runtime is caused by an algorithmically poor implementation, i.e.
O(n^2) where e.g. O(n log n) is sufficient. The huge memory footprint is caused
by a problematic data arrangement and overly greedy pre-/overallocation.

I have filed two MRs [1],[2] for poppler. With both applied, the extraction
runs in ~50 seconds on my 3-year-old laptop, with a peak memory consumption of
1.8 GByte.

[1] https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/1514  
[2] https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/1515
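
To illustrate the complexity point only: the following toy sketch is not
Poppler's code, and the function names are made up for this example. It
computes the smallest gap between coordinates twice, once by comparing every
pair (O(n^2)) and once by sorting first so that only neighbours need comparing
(O(n log n)). With millions of "words" on one page, that difference is the
difference between hours and seconds.

```python
def smallest_gap_quadratic(xs):
    """Smallest distance between any two values, comparing every pair: O(n^2)."""
    best = float("inf")
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            best = min(best, abs(xs[i] - xs[j]))
    return best


def smallest_gap_sorted(xs):
    """Same result, but after sorting only adjacent values matter: O(n log n)."""
    ordered = sorted(xs)
    return min((b - a for a, b in zip(ordered, ordered[1:])), default=float("inf"))
```

Both functions return the same value for the same input; only the amount of
work differs as the input grows.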

-- 
You are receiving this mail because:
You are watching all bug changes.

[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2021-11-23 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #21 from tagwer...@innerjoin.org ---
Created attachment 143869
  --> https://bugs.kde.org/attachment.cgi?id=143869&action=edit
pdftotext results from
https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC

(In reply to Adam Fontenot from comment #20)
> ... The file, in their view, is pathological ...
Applying a modicum of patience, running:

nice -19 pdftotext QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

took 37 hours on a machine with 16GB memory 8-]

The process gradually ate memory, reaching 10 GB. There wasn't an obvious
impact on performance - but I would expect you'd see that bite when reaching
the limits/starting to swap.

Attaching the output file - just in case anyone else wants to see the result.

When I moved the source file to an indexed folder, it was picked up by baloo
and indexed by baloo_file_extractor. It similarly took 37 hours and 10.1 GB.

Alas, I wasn't quick enough to notice what happened to the baloo_file_extractor
memory usage when the indexing finished - the process terminated (and released
its memory) once it had nothing more to do.

The details of the index records:

$ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
1546b2fc01 64513 1394354
Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
[/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf]
Mtime: 1637335759 2021-11-19T16:29:19
Ctime: 1637335813 2021-11-19T16:30:13
Cached properties:
Title: R Graphics Output
Document Generated By: R 3.6.0
Page Count: 1
Creation Date: 2019-09-13T11:01:30.000Z

Internal Info
Terms: 0 10 15 20 5 Mapplication Mpdf T5 X15-graphics
X15-output X15-r X17-3.6.0 X17-r X18-1 X24-2019-09-13T11:01:30Z a1 a2 b1 b2 c
graphics output qagr qchr qkel qpal r vcf − ●
File Name Terms: Fpdf Fqmvqwhpuqke7retn5f9tisea7
XAttr Terms:
generator: 3.6.0 r
pageCount: 1
title: graphics output r
creationDate: 2019-09-13T11:01:30Z

and...

$ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt
140a61fc01 64513 1313377
Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt
[/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt]
Mtime: 1637519014 2021-11-21T19:23:34
Ctime: 1637519014 2021-11-21T19:23:34
Cached properties:
Line Count: 4352

Internal Info
Terms: 0 10 15 20 5 Mplain Mtext T5 T8 X20-4352 a1 a2 b1 b2
c qagr qchr qkel qpal vcf − ●
File Name Terms: Fqmvqwhpuqke7retn5f9tisea7 Ftxt
XAttr Terms:
lineCount: 4352

So, for this instance, not a lot of indexable text but the metadata was
recognised (in the PDF, it was not extracted to the text) and it was possible
to search for the title:

$ baloosearch "R Graphics Output"

or...

$ baloosearch title:"R Graphics Output"
/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

I think with enough RAM and patience baloo can cope with even this pathological
test case, but the requirement definitely _is_ "enough RAM and patience". It
would certainly make sense to be able to say to baloo_file_extractor "give up
after 10 minutes" and flag the file as failed.
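
A minimal sketch of the "give up after N minutes" idea, written in Python
rather than Baloo's C++ and not taken from Baloo. The function name
`extract_with_timeout` and its return value are hypothetical; only
`subprocess.run` and its `timeout` argument are real standard-library API.
On timeout, `subprocess.run` kills the child before raising, so the caller
only has to record the failure.

```python
import subprocess


def extract_with_timeout(cmd, timeout_s):
    """Run an extractor command with a wall-clock limit.

    Returns (ok, text). ok is False if the command timed out or exited
    with a non-zero status; the caller can then flag the file as failed.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return False, ""
    return done.returncode == 0, done.stdout
```

A call such as `extract_with_timeout(["pdftotext", "some.pdf", "-"], 600)`
would correspond to the 10-minute limit suggested above.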

I'll update Bug 400704, which has become a collection point for these
misbehavin' reports. See:

https://bugs.kde.org/show_bug.cgi?id=400704#c31

and onwards.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2021-11-17 Thread Adam Fontenot
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #20 from Adam Fontenot  ---
(In reply to tagwerk19 from comment #19)
> I would still suspect memory use rather than CPU as the underlying reason.
It's quite possible that you're right about that. I do know the game is
sensitive to available memory, possibly because it runs on the internal Intel
graphics chip.

> > ... baloo_file_extractor is calling out to an external library
> > (poppler), and that library is consuming an endlessly growing amount of
> > memory (from 1-3 GB before I've killed it). It's probably safe to say that
> > this memory usage is in the form of anonymous mappings which can't be
> > reclaimed. Baloo *must* take that into account and kill the extractor
> > process if it begins affecting system resources.
> That's a *lot* of memory for a "pdf to text" conversion 8-]
Yes, especially for a random 20 MB PDF I didn't even remember existed.

> You see the baloo_file_extractor RAM usage go up during the extraction and
> not come down when it is finished?
I have never been able to leave it for long enough to finish extracting from
the file. It's possible I'd even get an out of RAM hang before then. The
Poppler devs estimate at least 7 GB of RAM would be needed to extract text from
this file. I even tested their pdftotext command on a system with plenty of
RAM, and even then the issue is that it simply takes too long. I've left it
running for over an hour on this one file before, and never seen it complete.

Moreover, they insist that it's not a bug on their end. The file, in their
view, is pathological and the only reasonable solution is not to try to extract
text from it. I think I understand that perspective: it's not every day that
you come across a PDF with millions of "words" on a single page. So it's on
Baloo to bail out if the process takes too long or consumes too much RAM.
Here's the bug report I filed with them if you want to follow that
conversation: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1173

> Could you see the culprit file in "System Settings > Search" (recent
> releases of baloo show the progress of the indexing there) or when running
> "balooctl monitor"?
Unfortunately, I don't remember. I do remember using lsof and friends to check
that it was the only file Baloo had open. I may not have realized at the time
that that feature had been added to the Baloo KCM.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2021-11-17 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #19 from tagwer...@innerjoin.org ---
(In reply to Adam Fontenot from comment #18)
> Hmm, even assuming this is true, does the process suspend if the user is on
> battery? An otherwise idle system consuming 100% of a core for hours on end
> is sure to annoy the user even if it doesn't interfere with other processes.
I'm pretty confident about the CPU priority and I know that baloo is aware that
it is on battery (and avoids content indexing). What happens in your case, I'm
afraid I don't know.

> I'd also point out that I discovered this issue (after several years of
> being vaguely aware of "baloo problems") when I saw stuttering in a full
> screen game. 
I would still suspect memory use rather than CPU as the underlying reason.
There are situations where baloo is building a large transaction and requires
lots of memory, there's a summary starting
https://bugs.kde.org/show_bug.cgi?id=400704#c31. It's quite possible for
systems to "hit the mud" in these cases.

> If I'm not mistaken, that's just for internal Baloo memory usage, right?
I'd say yes, the cases I've looked at were when indexing large text files and
writing the results to the index.

> ... baloo_file_extractor is calling out to an external library
> (poppler), and that library is consuming an endlessly growing amount of
> memory (from 1-3 GB before I've killed it). It's probably safe to say that
> this memory usage is in the form of anonymous mappings which can't be
> reclaimed. Baloo *must* take that into account and kill the extractor
> process if it begins affecting system resources.
That's a *lot* of memory for a "pdf to text" conversion 8-]

You see the baloo_file_extractor RAM usage go up during the extraction and not
come down when it is finished?

> In this case, it's a graph of some scientific data. Plotting scientific data
> to PDF or SVG (which both can have extractable text) is very common. In any
> case, it shouldn't be on the user to determine which files are causing
> problems (I had to use strace!) and exclude them.
Understood.

Could you see the culprit file in "System Settings > Search" (recent releases
of baloo show the progress of the indexing there) or when running "balooctl
monitor"?

In your use case, you could save your plots to a folder that is not indexed.
Yes, I know, it shouldn't be up to the user, but in this case it would do as a
workaround...

>  A file indexer should "just work".
Yup,  I think there's general agreement on that :-)


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2021-11-16 Thread Adam Fontenot
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #18 from Adam Fontenot  ---
(In reply to tagwerk19 from comment #17)
> If you look at htop, you'll see that baloo_file and baloo_file_extractor run
> with minimum priority. They'll yield to nearly everything that wants a CPU.
> They should take all the time they need without annoying anything else
Hmm, even assuming this is true, does the process suspend if the user is on
battery? An otherwise idle system consuming 100% of a core for hours on end is
sure to annoy the user even if it doesn't interfere with other processes.

I'd also point out that I discovered this issue (after several years of being
vaguely aware of "baloo problems") when I saw stuttering in a full screen game.
Alt-tabbing to htop showed baloo_file_extractor at 100%. Baloo may in theory
yield to other processes, but it didn't prevent me from seeing issues.

> Memory usage is different, baloo "memory maps" the index and pulls pages
> from disc to memory as needed, they'll be "forgotten" again if the RAM is
> needed (and the pages have not been modified). You might see that baloo_file
> / baloo_file_extractor use a lot of memory but that can be "just cache".
If I'm not mistaken, that's just for internal Baloo memory usage, right? In my
case, baloo_file_extractor is calling out to an external library (poppler), and
that library is consuming an endlessly growing amount of memory (from 1-3 GB
before I've killed it). It's probably safe to say that this memory usage is in
the form of anonymous mappings which can't be reclaimed. Baloo *must* take that
into account and kill the extractor process if it begins affecting system
resources.

> I'm tempted to say that if this is a application generated file with
> little/no human readable information in it (that happens to be a PDF) it
> would make sense to have an application specific mimetype for it. Then that
> can be added to baloo's "exclude filters" list. I suspect though that if the
> file is generated by a script, that might not be possible.
In this case, it's a graph of some scientific data. Plotting scientific data to
PDF or SVG (which both can have extractable text) is very common. In any case,
it shouldn't be on the user to determine which files are causing problems (I
had to use strace!) and exclude them. A file indexer should "just work".


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2021-11-16 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=380456

tagwer...@innerjoin.org changed:

   What|Removed |Added

 CC||tagwer...@innerjoin.org

--- Comment #17 from tagwer...@innerjoin.org ---
(In reply to Adam Fontenot from comment #16)
> ... it is completely unreasonable for a file indexer to ever make a user's
> system unusable. Any time it takes baloo_file_extractor more than 30 seconds
> to pull the text out of a file, or it starts using more than 10% of the
> user's total RAM, it should be instantly killed and the file blacklisted.
> Only the file name (not contents) should be available to search results ...
OO. Ouch!

If you look at htop, you'll see that baloo_file and baloo_file_extractor run
with minimum priority. They'll yield to nearly everything that wants a CPU.
They should take all the time they need without annoying anything else.

Memory usage is different, baloo "memory maps" the index and pulls pages from
disc to memory as needed, they'll be "forgotten" again if the RAM is needed
(and the pages have not been modified). You might see that baloo_file /
baloo_file_extractor use a lot of memory but that can be "just cache".
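
As a rough illustration of what "memory mapping" a file means (a generic
Python sketch, not Baloo's or LMDB's code; `read_mapped` is a made-up helper):
the file is mapped read-only and the kernel pages data in when a slice is
touched. Such clean, file-backed pages can simply be dropped again under
memory pressure, which is why they show up as memory use but behave like
cache.

```python
import mmap


def read_mapped(path, offset, length):
    """Read a slice of a (non-empty) file through a read-only memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]
```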

The kicker is when indexing and you're building a *large* transaction, that
might take a lot of memory (possibly, alas, stretching to swap). If you kill
the process before the commit is done, you're condemning yourself to repeat the
work.  On a system with Out Of Memory (OOM) protections, you might hit this.
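
The cost of killing a process before its commit can be shown with a small
stand-in. This sketch uses Python's `sqlite3` module purely for illustration
(Baloo's index is built on LMDB, not SQLite), and `index_in_batches` and its
`fail_after` parameter are hypothetical. Work done since the last commit is
lost when the run is interrupted; smaller batches bound that loss.

```python
import sqlite3


def index_in_batches(conn, docs, batch_size, fail_after=None):
    """Insert docs, committing every batch_size rows.

    fail_after simulates the process being killed after that many inserts.
    Returns the number of rows that survived.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS terms (doc TEXT)")
    conn.commit()
    try:
        for n, doc in enumerate(docs, start=1):
            if fail_after is not None and n > fail_after:
                raise RuntimeError("simulated kill")
            conn.execute("INSERT INTO terms VALUES (?)", (doc,))
            if n % batch_size == 0:
                conn.commit()
        conn.commit()
    except RuntimeError:
        # Everything since the last commit is discarded.
        conn.rollback()
    return conn.execute("SELECT COUNT(*) FROM terms").fetchone()[0]
```

With ten documents and a simulated kill after the seventh, one big
transaction keeps nothing, while batches of three keep the first six rows.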

You can see a little of what's happening (the switching between reading the
source files and writing the updates to the index) with iotop.

> ... Surprisingly, the Poppler devs say there's nothing wrong with Poppler here
> (despite the fact that their pdftotext tool hangs for over an hour on this
> file). That's because the R script which generated it is apparently using
> the "I" character repeatedly as part of a graph. I don't know why R does
> that, but it does ...
I'm tempted to say that if this is an application-generated file with little/no
human-readable information in it (that happens to be a PDF), it would make sense
to have an application-specific mimetype for it. Then that can be added to
baloo's "exclude filters" list. I suspect though that if the file is generated
by a script, that might not be possible.

> So in general, while there *may* be specific bugs with Baloo that need
> fixing or some crazy files that perhaps "shouldn't" exist, the probable
> cause of this problem for *most* users is that Baloo simply doesn't give up
> on trying to index a file when it really, really should.
Baloo does have a mechanism for flagging files as "failed" - "balooctl failed"
will list them. I think that needs more love...


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2021-11-16 Thread Adam Fontenot
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #16 from Adam Fontenot  ---
I actually filed an upstream bug with Poppler for its handling of the specific
PDF file I was seeing issues with.
https://gitlab.freedesktop.org/poppler/poppler/-/issues/1173

Surprisingly, the Poppler devs say there's nothing wrong with Poppler here
(despite the fact that their pdftotext tool hangs for over an hour on this
file). That's because the R script which generated it is apparently using the
"I" character repeatedly as part of a graph. I don't know why R does that, but
it does.

Quoting the dev response:

> whether this bug is fixed or not baloo needs to understand that extracting
> the text of a pdf file can take forever, and thus give up after X
> seconds/minutes

Obviously this is not going to correspond to everyone's issues, but it's an
interesting example of the point I made:

> it is completely unreasonable for a file indexer to ever make a user's system
> unusable. Any time it takes baloo_file_extractor more than 30 seconds to pull
> the text out of a file, or it starts using more than 10% of the user's total
> RAM, it should be instantly killed and the file blacklisted. Only the file
> name (not contents) should be available to search results.

So in general, while there *may* be specific bugs with Baloo that need fixing
or some crazy files that perhaps "shouldn't" exist, the probable cause of this
problem for *most* users is that Baloo simply doesn't give up on trying to
index a file when it really, really should.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2021-10-14 Thread DDR
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #15 from DDR  ---
I second this. It's a bit absurd to just run it with no resource limits,
internally or externally.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2021-10-14 Thread Adam Fontenot
https://bugs.kde.org/show_bug.cgi?id=380456

Adam Fontenot  changed:

   What|Removed |Added

 CC||adam.m.fontenot+kde@gmail.com

--- Comment #14 from Adam Fontenot  ---
This is still a really common issue. I don't know that I've ever spoken to
someone who uses KDE + Baloo with the out of the box settings who hasn't run
into it. I mean, just for a starting sample:

https://old.reddit.com/r/kde/comments/j77j16/can_we_please_have_kde_disable_baloo_by_default/
https://old.reddit.com/r/kde/comments/kzdoux/baloo_should_be_suspended_when_the_system_is_in/
https://old.reddit.com/r/kde/comments/lgg0su/how_is_baloo_doing_these_days/
https://old.reddit.com/r/kde/comments/o6w0ly/whats_wrong_with_baloo/
https://old.reddit.com/r/kde/comments/pc4wk1/baloo_file_extr_extreme_cpu_usage/

This is just the first five issues I could find with people talking about this
*exact* problem, but the list goes on and on. The oldest complaint in that list
is only a year old.

More than a complaint, I have a proposal: it is completely unreasonable for a
file indexer to ever make a user's system unusable. Any time it takes
baloo_file_extractor more than 30 seconds to pull the text out of a file, or it
starts using more than 10% of the user's total RAM, it should be instantly
killed and the file blacklisted. Only the file name (not contents) should be
available to search results.
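
One way such limits could be imposed from outside the extractor is sketched
here in Python for Linux. This is not Baloo's code: `ram_fraction_bytes` and
`run_capped` are hypothetical names, and capping the address space with
`RLIMIT_AS` is just one possible mechanism, chosen as an assumption for this
example. The `resource` and `subprocess` calls themselves are standard
library.

```python
import os
import resource
import subprocess


def ram_fraction_bytes(fraction):
    """Bytes corresponding to a fraction of physical RAM (Linux sysconf names)."""
    total = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return int(total * fraction)


def run_capped(cmd, timeout_s, max_bytes):
    """Run cmd with an address-space cap and a wall-clock limit.

    Returns True only if cmd exits with status 0 within both limits.
    """
    def limit_memory():
        # Runs in the child just before exec; never raise the soft limit
        # above an existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    try:
        done = subprocess.run(cmd, preexec_fn=limit_memory, timeout=timeout_s,
                              stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False  # the child has already been killed by subprocess.run()
    return done.returncode == 0
```

The proposal above would correspond to something like
`run_capped(cmd, 30, ram_fraction_bytes(0.10))`; a `False` result is the
point at which the file would be blacklisted and only its name indexed.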

Moreover, some kind of heuristic is desperately needed to tell Baloo that a
file can't be usefully indexed. Baloo is happy to use a ton of memory and hard
disk space to index files that are - for most purposes - random binary data.

Just as an example: I have a PDF that contains no meaningful text at all (it's
a plot automatically generated from some technical data). It's only 20 MB. Yet
baloo_file_extractor hung on this file for a *long* time, probably more than
half an hour, with RAM use up over 1 GB. It continued using 100% of one CPU
core despite the fact that I was trying to run a full screen game at the time.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2019-10-17 Thread Savor d'Isavano
https://bugs.kde.org/show_bug.cgi?id=380456

Savor d'Isavano  changed:

   What|Removed |Added

 CC||anohigisa...@gmail.com

--- Comment #13 from Savor d'Isavano  ---
As of baloo 5.63.0, the issue persists.

Memory consumption increases by ~2MB/s. CPU consumption is also considerable.

Luckily I noticed the CPU fan spinning noisily and disabled baloo before it was
too late to save my work (>9GB memory at the time).

See this screencast:
https://vimeo.com/366988108


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2019-01-07 Thread soredake
https://bugs.kde.org/show_bug.cgi?id=380456

soredake  changed:

   What|Removed |Added

 CC||fds...@krutt.org


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-12-05 Thread Reuben
https://bugs.kde.org/show_bug.cgi?id=380456

Reuben  changed:

   What|Removed |Added

 CC||reube...@yahoo.com


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-12-04 Thread Johannes Tiemer
https://bugs.kde.org/show_bug.cgi?id=380456

Johannes Tiemer  changed:

   What|Removed |Added

Version|5.34.0  |5.52.0
   Platform|Neon Packages   |Archlinux Packages

--- Comment #12 from Johannes Tiemer  ---
Hey everybody,
since installing the update to version 5.52 on my computer (Arch current), the
baloo file indexer began showing unwanted behaviour again. All directories with
large text files were blacklisted beforehand.
The behaviour found is as before: RAM usage explodes within roughly a minute to
fill all of 16GB. Since I have no swap, it then freezes my machine by clogging
the RAM … playing "nice" with RAM may be a thing too.

What I found with "balooctl monitor":
- it seemed to plainly ignore that it should _not_ index the Windows partition
that I have mounted into my home for convenience
- it seems to begin expanding in RAM while reporting that it is
"checking for obsolete index entries"

I let baloo completely recreate its index over a few days when I realised it is
misbehaving again. See my above comment for earlier indexSize:
---
$ balooctl indexSize
Actual Size: 32,88 GiB
Expected Size: 22,85 GiB

        PostingDB:     2,31 GiB    81.336 %
       PositionDB:     1,48 GiB    51.905 %
         DocTerms:     3,71 GiB   130.294 %
 DocFilenameTerms:    57,95 MiB     1.989 %
    DocXattrTerms:          0 B     0.000 %
           IdTree:     7,63 MiB     0.262 %
       IdFileName:    40,62 MiB     1.394 %
          DocTime:    18,80 MiB     0.645 %
          DocData:    53,23 MiB     1.827 %
ContentIndexingDB:          0 B     0.000 %
      FailedIdsDB:          0 B     0.000 %
          MTimeDB:    10,35 MiB     0.355 %
---


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-09-29 Thread Johannes Tiemer
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #11 from Johannes Tiemer  ---
After excluding the above mentioned folders with lots of small files, baloo
stopped its memory eating behavior. Scanning for file numbers and sizes and
then warning might be a simple safeguard maybe?
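
A pre-scan of that kind could look roughly like the following Python sketch.
It is only an illustration of the suggestion, not anything Baloo does; the
function names and both thresholds are invented for the example.

```python
import os


def scan_tree(root):
    """Return (file_count, total_bytes) for everything below root."""
    count = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
                count += 1
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return count, total


def needs_warning(root, max_files=50_000, max_bytes=10 * 1024**3):
    """True if the tree exceeds either (hypothetical) threshold."""
    count, total = scan_tree(root)
    return count > max_files or total > max_bytes
```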


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-09-28 Thread Johannes Tiemer
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #10 from Johannes Tiemer  ---
I checked. It took baloo_file_extractor 9 minutes (according to uptime) to fill
13GiB RAM, where it pretty much exclusively (as far as I could tell) operated
on my Archive disk, which, among others, contains lots of txt/csv-files (large
datasets, probably a little below 100GiB) and my email backups from thunderbird
which are a mess of many tens of thousands of small files.

I blacklisted a part of it now and will report back once I find out something
new about its behaviour.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-09-27 Thread Johannes Tiemer
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #9 from Johannes Tiemer  ---
D'oh, forgot to mention: CPU load is 100% on one single core until I kill the
process.

I'll remember to look into what it's indexing when I boot next time.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-09-27 Thread Johannes Tiemer
https://bugs.kde.org/show_bug.cgi?id=380456

Johannes Tiemer  changed:

   What|Removed |Added

 CC||jtie...@gmail.com

--- Comment #8 from Johannes Tiemer  ---
I have the same issue on a current Arch Linux (baloo 5.50, kernel 4.18.9). I
only noticed it after the last upgrade a week ago, due to the machine having a
long uptime. Earlier versions of baloo ran just fine, and I don't recall
making huge changes to the data to be indexed.

Component/Version: baloo_file_extractor/5.50
Platform: Arch Linux
Kernel: 4.18.9

Issue: After KDE Startup baloo_file_extractor uses ever more RAM until it
stalls the machine to a freeze once it uses all available RAM. Not even
switching to another tty to kill the process is possible anymore.

Steps to reproduce: Start Plasma session and wait while watching the RAM use
applet or htop. It takes around 3-4 minutes after startup to fill what is
available of 16GB RAM.

Remedy: Sending SIGTERM ends baloo_file_extractor and frees the RAM. baloo
doesn't restart it.

Context Info: I found some sources that claim that .vdi files are a problem for
baloo, so I excluded my VM directory from the search (and some others). It did
not change anything about the issue.

[user@machine ~]$ balooctl status
Baloo File Indexer is running
Indexer state: Indexing file content
Indexed 91496 / 91624 files
Current size of index is 31,19 GiB

[user@machine ~]$ balooctl indexSize
Actual Size: 31,19 GiB
Expected Size: 14,32 GiB

        PostingDB:   616,27 MiB    25.946 %
       PositionDB:     1,96 GiB    84.367 %
         DocTerms:     2,37 GiB   101.981 %
 DocFilenameTerms:    11,44 MiB     0.482 %
    DocXattrTerms:          0 B     0.000 %
           IdTree:     1,48 MiB     0.062 %
       IdFileName:     8,32 MiB     0.350 %
          DocTime:     3,87 MiB     0.163 %
          DocData:    12,64 MiB     0.532 %
ContentIndexingDB:    12,00 KiB     0.000 %
      FailedIdsDB:          0 B     0.000 %
          MTimeDB:     3,34 MiB     0.141 %

[user@machine ~]$ uname -a
Linux pica 4.18.9-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 19 21:19:17 UTC 2018
x86_64 GNU/Linux

[user@machine ~]$ pikaur -Qi baloo
Name            : baloo
Version         : 5.50.0-1
Description     : A framework for searching and managing metadata
Architecture    : x86_64
URL             : https://community.kde.org/Frameworks
Licenses        : LGPL
Groups          : kf5
Provides        : None
Depends On      : kfilemetadata  kidletime  kio  lmdb
Optional Deps   : qt5-declarative: QML bindings [installed]
Required By     : baloo-widgets  gwenview  plasma-desktop  plasma-mediacenter
Optional For    : plasma-workspace
Conflicts With  : None
Replaces        : None
Installed Size  : 2,41 MiB
Packager        : Antonio Rojas
Build Date      : Mon 03 Sep 2018 16:26:53 CEST
Install Date    : Mon 10 Sep 2018 02:47:54 CEST
Install Reason  : Installed as a dependency for another package
Install Script  : No
Validated By    : Signature


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-09-11 Thread Allan Andersen
https://bugs.kde.org/show_bug.cgi?id=380456

Allan Andersen  changed:

   What|Removed |Added

 Status|UNCONFIRMED |CONFIRMED
 Ever confirmed|0   |1
 CC||alland...@gmail.com

--- Comment #7 from Allan Andersen  ---
Same issue here. It is really annoying: it uses CPU and lots of memory.

balooctl status
Baloo File Indexer is running
Indexer state: Idle
Indexed 98426 / 157132 files
Current size of index is 7,81 GiB

Process: baloo_file_extractor is using 9.2 GiB memory!

It's a developer machine with several thousand files.

Linux aa-Precision-3510 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05
UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-03-07 Thread DDR
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #6 from DDR  ---
> Please report memory usage when indexing is done. I'm really curious to see 
> that.
About 1.1GiB, ~5% of the available system memory. Very reasonable.

> Did you kill it? Probably not. I've encountered this many times.
No, not as of the report. I did shortly after - it was that, or it killed me by
swapping anything useful to disk.

> Please clarify 'unresponsive': Did you have to Ctrl-C?
Yes. I think balooctl was waiting for baloo_file_extractor to provide some
information, but the extractor never would. I think it was busy extracting. I
don't have a way to Ctrl-C the file extractor, but I think when I send the
equivalent signal it shuts down just fine. ("End Process" in System Monitor.)

> With an index of that size searching might be a little slow. And your even 
> half-way done :)
> Not sure, but I have the feeling baloo wasn't designed for this and you're 
> overburdening it.

Searching is still lightning fast. It seems it was designed very well in that
regard.
I was definitely overburdening it. I feel it really should have known better
than to try to index a tremendous plain-text file, though. It is enthusiastic,
but it bit off significantly more than it could chew. The actual search index
for the forum the database dump was from takes over a week to rebuild on the
server; I imagine the more generalised search tool would be absolutely doomed
in that endeavour. That, and a week of solid uptime is quite rare for me.


> I'm just trying to imagine what will happen when you enter 'const' in  
> KRunner/Milou.
Would Dolphin's ctrl-f suffice? It's up to 1158 folders and 115308 files.
Somewhat amazingly, although the search results took a few minutes to populate,
Dolphin itself is still perfectly responsive and I can scroll through the files
just fine. Typing to select a file works both perfectly and instantly. Memory
use remained unremarkably low throughout the whole process, and didn't really
change when I exited Dolphin.

All 374579 files have now finished indexing. The current size of the index is
11.08 GiB.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-03-07 Thread DDR
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #5 from DDR  ---
OK, so I have just discovered the magic of ls -la /proc/1234/fd, where 1234 is
the pid of baloo_file_extractor. 
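
The same check can be scripted. What follows is only a minimal sketch, assuming a Linux /proc filesystem; the function names open_files and path_targets are made up for illustration and are not part of baloo. It reads the symlinks under /proc/<pid>/fd, which is the information ls -la shows, and then keeps only the targets that look like filesystem paths.

```python
import os


def open_files(pid, proc_root="/proc"):
    """Return the symlink targets found in <proc_root>/<pid>/fd, ordered by fd number."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = []
    for name in sorted(os.listdir(fd_dir), key=int):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


def path_targets(targets):
    """Keep only absolute paths, dropping entries such as 'socket:[...]' or 'pipe:[...]'."""
    return [t for t in targets if t.startswith("/")]
```

Calling path_targets(open_files(1234)), with 1234 replaced by the real pid, should list the file the extractor currently has open among its other descriptors. Reading another user's /proc/<pid>/fd needs matching permissions.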

baloo_file_extractor was busy on a 1.5GiB text file,
production-aria-tables.sql, and then got stuck on its backup. I added these
files to the ignore list, in File Search — System Settings, and the indexer has
gotten on with life and is indexing the last few files it needs to.
Unfortunately, as the file is a database dump of mlpforums.com, I cannot share
it for reproduction due to confidentiality issues. Perhaps a partial dump of
the kde bugs database would suffice for that purpose.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-02-01 Thread Michael Heidelbach
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #4 from Michael Heidelbach  ---
(In reply to DDR from comment #2)
> Commands such as ` balooctl index * ` are unresponsive until I've killed the
> process.
Please clarify 'unresponsive': Did you have to Ctrl-C?
You did this while indexing was in progress. 
$ balooctl index *
is probably waiting for indexing to finish before queuing another batch.
And even then most likely you'll only get a lot of 'indexing done' messages.


> Baloo File Indexer is running
> Indexer state: Indexing file content
> Indexed 356513 / 374337 files
Please keep your cool, let indexing finish in peace. It's nearly done :-)
What kind of files are being indexed? See Comment #3.
> 
> Let me know if I can provide any more information.
> 
> Thanks!


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-02-01 Thread Michael Heidelbach
https://bugs.kde.org/show_bug.cgi?id=380456

Michael Heidelbach  changed:

   What|Removed |Added

 CC||ottw...@gmail.com

--- Comment #3 from Michael Heidelbach  ---
(In reply to Gaël de Chalendar (aka Kleag) from comment #0)
> 1. Install KDE Neon on a machine with several thousand files, for example a
> development machine

baloo is having a hard time indexing plain text, because there are so many
terms to extract. Also the backend database is memory based so I would expect
memory consumption to rise during the process.
Please report memory usage when indexing is done. I'm really curious to see
that.
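
One way to take that measurement, as a rough sketch only: on Linux the resident memory of a process is reported on the VmRSS line of /proc/<pid>/status, and the process name in /proc/<pid>/comm is truncated to 15 characters. The helper names below (rss_kib, pids_named, report) are hypothetical and not part of baloo.

```python
import os


def rss_kib(status_text):
    """Parse the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # e.g. kernel threads have no VmRSS line


def pids_named(name, proc_root="/proc"):
    """Yield PIDs whose comm matches name (the kernel truncates comm to 15 chars)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name[:15]:
                    yield int(entry)
        except OSError:
            continue  # the process exited while we were scanning


def report(name="baloo_file_extractor", proc_root="/proc"):
    """Return {pid: resident memory in GiB} for every process with that name."""
    result = {}
    for pid in pids_named(name, proc_root):
        try:
            with open(os.path.join(proc_root, str(pid), "status")) as f:
                kib = rss_kib(f.read())
        except OSError:
            continue
        if kib is not None:
            result[pid] = kib / (1024 * 1024)
    return result
```

Running report() once indexing has finished gives a figure that can be pasted into the bug report alongside the balooctl status output.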

> Baloo File Indexer is running
> Indexer state: Inactif
> Indexed 361894 / 1513487 files
This is strange: There are a lot of files left to be indexed, but the indexer
itself is idle?
Did you kill it? Probably not. I've encountered this many times. 
Anyway this behaviour definitely is worth scrutinizing. I'll do it when I'm
more familiar with baloo's code.

For the time being, occasionally it helps to restart baloo with
$ balooctl stop
ensure baloo_file and baloo_file_extractor are not running
$ balooctl start
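
The three steps above could be scripted roughly as follows. This is a sketch under assumptions, not a tested tool: it only uses the balooctl stop and balooctl start commands shown above, checks for the two process names through /proc/<pid>/comm (Linux only, names truncated to 15 characters), and the function names, timeout and polling interval are arbitrary choices made for illustration.

```python
import os
import subprocess
import time


def is_running(name, proc_root="/proc"):
    """Return True if some process has this name in /proc/<pid>/comm."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name[:15]:
                    return True
        except OSError:
            continue  # the process exited while we were scanning
    return False


def restart_baloo(is_running=is_running, run=subprocess.run,
                  timeout=30.0, poll=0.5, sleep=time.sleep):
    """Stop baloo, wait until both processes are gone, then start it again.

    Returns False (without starting) if the processes are still alive
    after `timeout` seconds.
    """
    run(["balooctl", "stop"], check=False)
    waited = 0.0
    while any(is_running(n) for n in ("baloo_file", "baloo_file_extractor")):
        if waited >= timeout:
            return False
        sleep(poll)
        waited += poll
    run(["balooctl", "start"], check=False)
    return True
```

If restart_baloo() returns False, the extractor did not exit on its own and has to be ended by hand before starting baloo again.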

> Current size of index is 6,28 Gio
With an index of that size searching might be a little slow. And you're not
even half-way done :)
Not sure, but I have the feeling baloo wasn't designed for this and you're
overburdening it.
I'm just trying to imagine what will happen when you enter 'const' in
KRunner/Milou.


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-01-27 Thread DDR
https://bugs.kde.org/show_bug.cgi?id=380456

--- Comment #2 from DDR  ---
Comment on attachment 110170
  --> https://bugs.kde.org/attachment.cgi?id=110170
Upon killing baloo_file_extractor, I suddenly have a lot more free memory.

baloo_file_extractor always seems to use about 16gb of memory, allocated fairly
quickly after I start my computer. Commands such as ` balooctl index * ` are
unresponsive until I've killed the process.

I'm running Ubuntu 17.10, which is up-to-date as of today. (2018-01-27)

I don't think the index is the issue: even if it were held entirely in memory,
it wouldn't account for half the problem.
$ balooctl indexSize
Actual Size: 6.80 GiB
Expected Size: 5.04 GiB

When the memory usage is high (before the cliff in the attached image):
$ balooctl status
Baloo File Indexer is running
^C

After I kill baloo_file_extractor (after the cliff in the attached image):
$ balooctl status
Baloo File Indexer is running
Indexer state: Indexing file content
Indexed 356513 / 374337 files
Current size of index is 11.08 GiB

Let me know if I can provide any more information.

Thanks!


[frameworks-baloo] [Bug 380456] Suspected memory leak in baloo_file_extractor

2018-01-27 Thread DDR
https://bugs.kde.org/show_bug.cgi?id=380456

DDR  changed:

   What|Removed |Added

 CC||robertsdavid...@gmail.com

--- Comment #1 from DDR  ---
Created attachment 110170
  --> https://bugs.kde.org/attachment.cgi?id=110170
Upon killing baloo_file_extractor, I suddenly have a lot more free memory.
