[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-02 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #51 from Kai Krakow  ---
(In reply to tagwerk19 from comment #50)
> (In reply to Kai Krakow from comment #49) 
> 
> Thank you for the insights. That's a lot more to think about...

You're welcome.

> What we had previously, of BTRFS presenting "another different" device
> number on reboot and Baloo's initial scan for changes not committing at
> intervals, that was a catastrophic combination.

I initiated the idea to somehow fold the UUID into the dev/inode fields, I just
hadn't had the time to explore the source code. Luckily, someone finally
implemented that idea. \o/

> I think, in *general*,
> nowadays Baloo does not demand so much memory. Maybe though I should check
> when a lot of files have been deleted and Baloo has to catch up. How Baloo
> might behave when it is squeezed for memory (rather than being the culprit),
> that's something new to think about...

Yeah exactly. I think one remaining issue is when system performance suffers
not because baloo uses too much memory but because it becomes squeezed into too
little memory.

Thanks for testing it. I'm currently running fine with `MemoryLow=512M` and no
high limit, seems to work great so far even with games running and while
streaming, using btrfs on bcache with hdds.

With that configuration, more baloo memory has been pushed into swap - but it
was never reclaimed, so it's probably inactive memory anyway and should be in
swap.

I'd recommend looking into the "below" tool (an alternative implementation to
"atop"): it tracks memory usage via cgroups and thus can also tell you the
accounted cache memory of a process group, whereas "htop" or "top" only show
resident process memory without accounting for caches.
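
For anyone who wants to look at the same cgroup accounting by hand, here is a
minimal sketch. It assumes cgroup v2, where each group exposes a `memory.stat`
file made of `key value` lines (`anon` for anonymous memory, `file` for page
cache). The function name and the sample numbers are made up for illustration.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup v2 memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


# Illustrative sample only; real values come from the memory.stat file in the
# service's cgroup directory under /sys/fs/cgroup (exact path varies per system).
sample = "anon 104857600\nfile 314572800\nkernel_stack 1048576\n"
stats = parse_memory_stat(sample)
cache_share = stats["file"] / (stats["anon"] + stats["file"])
print(f"anon={stats['anon']} file={stats['file']} cache share={cache_share:.0%}")
```

The point is only that the `file` part is charged to the group even though
"top" would never show it as process memory.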

-- 
You are receiving this mail because:
You are watching all bug changes.

[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-02 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #50 from tagwer...@innerjoin.org ---
(In reply to Kai Krakow from comment #49) 

Thank you for the insights. That's a lot more to think about...

I'm not sure I'd worry if Baloo releases memory slowly, provided it releases
it. I'm pretty sure I didn't see Baloo taking much more than the MemoryHigh
value. When I "pushed it" the value went just a little above the limit but the
process seemed to slow (markedly). This was on an otherwise idle system so
Baloo could have taken more memory but didn't. That seems to fit with the
freedesktop documentation.

What we had previously, of BTRFS presenting "another different" device number
on reboot and Baloo's initial scan for changes not committing at intervals,
that was a catastrophic combination. I think, in *general*, nowadays Baloo does
not demand so much memory. Maybe though I should check when a lot of files have
been deleted and Baloo has to catch up. How Baloo might behave when it is
squeezed for memory (rather than being the culprit), that's something new to
think about...


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-02 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #49 from Kai Krakow  ---
(In reply to tagwerk19 from comment #47)
> That said, I've not noticed the percentages behaving differently than a
> defined amount of memory (so, I think a "MemoryHigh=25%" on a 2GB system
> acts the same as a "MemoryHigh=512M"). It would be interesting if this was
> not the case...

No, it should not be different. In the end, it boils down to fixed numbers as
can be seen in `systemctl --user status kde-baloo.service`:

# Memory: 136.4M (high: 2.0G available: 1.8G peak: 1.3G swap: 18.4M swap peak:
18.5M zswap: 6.9M)
(where "high" is calculated from percentage to this number)
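
As a rough illustration of that conversion, a small sketch. The helper name is
hypothetical and the rounding is simplified; the exact rounding systemd applies
may differ, and the RAM a machine reports is usually a bit less than the
nominal size.

```python
def percent_to_bytes(percent, total_ram_bytes):
    """Turn a percentage setting such as MemoryHigh=25% into an absolute byte count."""
    return total_ram_bytes * percent // 100


GIB = 1024 ** 3
MIB = 1024 ** 2

# 25% of a nominal 2 GiB machine, expressed in MiB.
high = percent_to_bytes(25, 2 * GIB)
print(high // MIB, "MiB")
```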

> I'm pretty sure the "MemoryHigh" is a soft limit, I was able to push the
> usage to just a little above the given limit but the process slowed down
> (markedly, when I pushed hard), I think when trying to claim more memory.

MemoryHigh is not a limit. MemoryHigh is an indicator/hint for the kernel
beyond which value the process group is considered to use a high amount of
memory. Now, if the kernel needs memory, it will reclaim memory from process
groups first that are most above their "MemoryHigh" value. Thus, if you make
this very high, the kernel will decide to reclaim memory from baloo last
because it is potentially the only service with a very high "MemoryHigh" value.
The service is allowed to go beyond that memory usage just fine as long as the
kernel has no need for other memory.

This is also a problem in itself if mixing non-memory constrained services with
ones that have this setting. The system becomes very unbalanced unless you
adjust it for your very own needs. Leaving `MemoryHigh` empty instructs cgroups
to balance memory usage in a fair way.

> I
> tried the same with MemoryMax but this seemed to be a hard limit.

Yes, it is a hard limit. Allocations beyond `MemoryMax` will force the process
group to swap out inactive memory of itself.

> I tried
> setting a MemoryHigh with a slightly higher MemoryMax but it didn't seem to
> bring any benefits, the MemoryHigh on its own seemed to be quite effective
> at limiting Baloo's memory use.

Cannot work. `MemoryHigh` is not a limit, it's a hint for the memory management
when to consider a service to use "too much" memory, so services beyond that
allocation will be considered for reclaim first. Or viewed from the other side:
`MemoryHigh` is a type of resident memory protection which the kernel _tries_
to guarantee to the service (`MemoryMin` will force the guarantee).

> I'm also pretty sure that even with a defined MemoryHigh, Baloo releases
> memory when other processes require it.

Yes, because it's a "soft guarantee", not a "soft limit". If the kernel has no
more other processes to reclaim memory from, it will start reclaiming memory
from `MemoryHigh` services below their setting, in order of priority that makes
sense in that context. Baloo itself surely will free memory, too, by itself,
and that's reflected here. Those memory allocations also include cache which
baloo cannot directly control. The kernel tracks page cache ownership per
cgroup, and accounts that to memory usage, too.

I'm pretty sure `MemoryHigh` has been set so low (512M) because someone
considered it a soft limit, but it's a soft guarantee. And setting it too low
will hint the kernel to reclaim memory from such processes first - and that's
usually cache before swapping memory out (depends on vm.swappiness).

Thus, `MemoryHigh` should be set to a proper value that includes the active
working set of memory pages for the process + a good value for cache to let it
do its job without stuttering the desktop. That settles around 1.5G of memory
for me. But OTOH, that means, we will also protect its memory even if baloo is
idle (and may be fine with using less than 1.5G RAM including caches). Thus I
think, it may be a better idea to completely leave that value empty, maybe only
include it as a comment with explanation so users can tune it easily with
`systemctl --user edit kde-baloo.service`.
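
For reference, `systemctl --user edit` stores its result as a drop-in file,
typically `~/.config/systemd/user/kde-baloo.service.d/override.conf`. The
sketch below only renders the text of such a drop-in; the helper name is made
up, and the values are examples from this thread, not recommendations. Note
that in a drop-in a limit set by the shipped unit has to be lifted explicitly,
for which systemd accepts the special value `infinity`.

```python
def render_dropin(settings):
    """Render a systemd drop-in with a [Service] section from a dict of directives."""
    lines = ["[Service]"]
    lines += [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


# Example values taken from this thread, not a recommendation.
# "infinity" lifts a limit that the shipped unit file may set.
text = render_dropin({"MemoryLow": "512M", "MemoryHigh": "infinity"})
print(text)
```

After editing, `systemctl --user status kde-baloo.service` shows which values
are actually in effect.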

> Certainly, there was dropping and rereading of clean pages when Baloo was
> closing on the limit. That was visible in iotop. I noticed in "pathological
> cases", indexing large quantities of data and having to manage very many
> dirty pages, pushed Baloo to swap and performance very clearly suffered
> (even when the rest of the system has sufficient space). I think it's
> (likely) worthwhile adding the MemorySwapMax=0 for Baloo to stop it reaching
> that point (although only if MemoryHigh is reasonable). The argument being
> an OOM for Baloo is (likely) better than it swapping. This is a value
> judgement though...

`MemoryHigh` is a two-edged sword... You can stop the observed behavior by
using a high value like `50%`. But then, when baloo becomes idle again, it will
only slowly give memory back to other parts of the system unless the system is
already under memory pressure. IMHO, it's not a knob for services which switch
between work and 

[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-02 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #48 from tagwer...@innerjoin.org ---
(In reply to Martin Steigerwald from comment #0)
> 3. If you like to kick it beyond any sanity:
>- have it go at the results of git clone 
> https://github.com/danielmiessler/SecLists.git
>- here it eats the resources of a quite potent laptop with 16 GiB of RAM 
> as if there was no tomorrow.
For completeness, I thought it would be worth revisiting this. I downloaded and
extracted master.zip (660MB download, extracts to about 2GB).

I tested in a guest VM, Neon User edition, BTRFS, 8 GB RAM. I ended up setting
MemoryHigh=50% and MemorySwapMax=0. I think with a *very* *large* number of
different words, as is definitely the case here, you are going to be hammering
Baloo and it is going to need space. It reached 3.8G memory usage and bounced
around for a while just under that level, dropping down again when the indexing
finished. Rebooting the system did not trigger any reindexing (I didn't set up
any snapshots). I didn't manually unzip files within the bundle (the zip
bombs). I don't know if Baloo ever tried to unpack zipped files; if so, these
could have been nasty.

It looks OK.

I had six files where Baloo refused with an "Invalid Encoding". The indexing
completed in around 10 min. The index file was about 2G. The system remained
fully usable during the indexing.

I think we can say "Thankfully, that's fixed" and the call can remain closed
:-)


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-01 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #47 from tagwer...@innerjoin.org ---
(In reply to Kai Krakow from comment #46)
> ... Just my 2 cents, no need to re-open ...
Thank you!

My experience is empirical, I tested out various combinations to see how they
behaved. Means, of course, that I might have noted the behaviour at one moment
in the implementation of the systemd limits / OOM behaviours. I'm pretty sure
that behaviours were changing in the period I was testing...

That said, I've not noticed the percentages behaving differently than a defined
amount of memory (so, I think a "MemoryHigh=25%" on a 2GB system acts the same
as a "MemoryHigh=512M"). It would be interesting if this was not the case...

I'm pretty sure the "MemoryHigh" is a soft limit, I was able to push the usage
to just a little above the given limit but the process slowed down (markedly,
when I pushed hard), I think when trying to claim more memory. I tried the same
with MemoryMax but this seemed to be a hard limit. I tried setting a MemoryHigh
with a slightly higher MemoryMax but it didn't seem to bring any benefits, the
MemoryHigh on its own seemed to be quite effective at limiting Baloo's memory
use.

I'm also pretty sure that even with a defined MemoryHigh, Baloo releases memory
when other processes require it. 

Certainly, there was dropping and rereading of clean pages when Baloo was
closing on the limit. That was visible in iotop. I noticed in "pathological
cases", indexing large quantities of data and having to manage very many dirty
pages, pushed Baloo to swap and performance very clearly suffered (even when
the rest of the system has sufficient space). I think it's (likely) worthwhile
adding the MemorySwapMax=0 for Baloo to stop it reaching that point (although
only if MemoryHigh is reasonable). The argument being an OOM for Baloo is
(likely) better than it swapping. This is a value judgement though...

I think we should stay with the systemd caps on memory use, there were so many
bugs/reports about Baloo slugging performance before the change. Those calls
have dropped to a very few (and are more about the limits probably being too
low)

As I said, this is very much the result of "from the outside in", empirical
testing.  Not something I can argue is right, particularly given your
information (thank you once again), just something I've seen.


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-01 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #46 from Kai Krakow  ---
(In reply to tagwerk19 from comment #45)
> Yes, there's been a handful of bug reports where I've "blamed" the 512M
> limit.
> 
> I tentatively recommend "MemoryHigh=25%". I don't suppose many people run on
> systems with 2G RAM (even as VM's) and having a percentage means Baloo gets
> a lot more room to breathe on systems with 8G.
> 
> I think "MemoryHigh=40%" is still quite reasonable and I would also include
> a "MemorySwapMax=0" to forestall swapping (which does seem to cause problems)

`MemorySwapMax=0` can lead to OOM situations, even for other processes, if the
only swap victim would be baloo. It is fine to swap out some dead pages which
the process never uses.

And we should be careful with percentages: `MemoryHigh` sets a priority at
which the kernel memory manager chooses processes to reduce memory usage (by
discarding caches, flushing writeback, or swapping anonymous memory). We actually
want this to happen, we should be careful to not make baloo the last resort by
accidentally giving it the highest memory priority.

If we want to limit it, `MemoryMax` should be set (then baloo will never get
more memory). But `MemoryHigh` should be set to a reasonable minimum we want to
protect for the process so it can make forward progress. Setting it too high
creates an inverse effect for other important processes of the desktop. It is the
lower bound of what is considered high memory usage before making memory
available to other processes. Memory is taken away from processes with the
highest `MemoryHigh` last.

As an idea, baloo could watch `/proc/pressure/memory` and if latencies go high,
it could pause for a while and flush its own caches. One cannot try to emulate
such a behavior with `MemoryHigh`.
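
To make the idea a bit more concrete, here is a minimal sketch of how such a
check could look. It relies on the text format of `/proc/pressure/memory`
(lines starting with `some` or `full`, followed by `avg10=`, `avg60=`,
`avg300=` and `total=` fields). The function names, the 10% threshold and the
sample values are assumptions for illustration, not anything baloo implements.

```python
def parse_pressure(text):
    """Parse PSI text ('some avg10=... avg60=... avg300=... total=...') into nested dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def should_pause(text, threshold=10.0):
    """Hypothetical policy: pause indexing when the 'some' avg10 stall share exceeds threshold."""
    return parse_pressure(text).get("some", {}).get("avg10", 0.0) > threshold


# Made-up sample; a real indexer would read /proc/pressure/memory periodically.
sample = (
    "some avg10=12.50 avg60=4.10 avg300=1.02 total=123456\n"
    "full avg10=3.00 avg60=1.00 avg300=0.25 total=65432\n"
)
print(should_pause(sample))
```

An indexer could call something like `should_pause()` between batches and
sleep or flush when it returns true.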

Maybe the memory limits should be removed completely, and rather let the kernel
do the job using mgLRU (which could be recommended for distributions if it
works fine), and then let baloo watch memory pressure instead to throttle
itself. The problem is not with baloo reading files and using CPU, it's already
highly optimized here. The problem is with how the database uses memmap, so
it's directly competing with important desktop processes needing resident
memory (it's not designed to compete with other processes for memory). Using
memory pressure, we could mark selected memory ranges as "not needed" and flush
unwritten data early so it becomes available.

I had no more problems with baloo until it added the `MemoryHigh=512M`
parameter, so I added another one to force 2G instead. Which makes me wonder if
we need that parameter at all.

Just my 2 cents, no need to re-open.


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-01 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

tagwer...@innerjoin.org changed:

   What|Removed |Added

 Status|NEEDSINFO   |RESOLVED
 Resolution|WAITINGFORINFO  |FIXED

--- Comment #45 from tagwer...@innerjoin.org ---
(In reply to Kai Krakow from comment #44)
> ... `MemoryHigh=512M` is quite aggressive. It can lead to swap storms
> and cache thrashing of baloo under memory pressure ...
> ...
> I personally used `MemoryHigh=2G` to fix this for my system...
Yes, there's been a handful of bug reports where I've "blamed" the 512M limit.

I tentatively recommend "MemoryHigh=25%". I don't suppose many people run on
systems with 2G RAM (even as VM's) and having a percentage means Baloo gets a
lot more room to breathe on systems with 8G.

I think "MemoryHigh=40%" is still quite reasonable and I would also include a
"MemorySwapMax=0" to forestall swapping (which does seem to cause problems)

> No objections, it makes sense.
Will close then...


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-01 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #44 from Kai Krakow  ---
(In reply to tagwerk19 from comment #43)
> I think the dust has probably settled here after:
> https://invent.kde.org/frameworks/baloo/-/merge_requests/131
> and cherrypicked for KF5
> https://invent.kde.org/frameworks/baloo/-/merge_requests/169

These work fine for me, I'm actually using baloo again in production.

> There's also been
>  https://invent.kde.org/frameworks/baloo/-/merge_requests/121

Actually, `MemoryHigh=512M` is quite aggressive. It can lead to swap storms and
cache thrashing of baloo under memory pressure because the process itself is
already 512M big, this leaves no space for caching which is important for baloo
to work with proper performance (consider that memory cgroups also account
cache usage). Especially the sub process baloo_file is hurt by this a lot while
indexing new files.

I personally used `MemoryHigh=2G` to fix this for my system - but this
parameter really depends a lot on the system environment. The service shows a
peak usage of 1.3G with almost no swap usage (less than 30M), so
`MemoryHigh=1536M` may be fine.

> and
>  https://invent.kde.org/frameworks/baloo/-/merge_requests/148

No objections, it makes sense.


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-01 Thread soredake
https://bugs.kde.org/show_bug.cgi?id=404057

soredake  changed:

   What|Removed |Added

 CC|katyaberezy...@gmail.com|


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2024-07-01 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

tagwer...@innerjoin.org changed:

   What|Removed |Added

 Status|CONFIRMED   |NEEDSINFO
 Resolution|--- |WAITINGFORINFO

--- Comment #43 from tagwer...@innerjoin.org ---
I think the dust has probably settled here after:
https://invent.kde.org/frameworks/baloo/-/merge_requests/131
and cherrypicked for KF5
https://invent.kde.org/frameworks/baloo/-/merge_requests/169

There's also been
 https://invent.kde.org/frameworks/baloo/-/merge_requests/121
and
 https://invent.kde.org/frameworks/baloo/-/merge_requests/148

Any need to keep this issue open or can we close it?


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2023-01-01 Thread Dennis Schridde
https://bugs.kde.org/show_bug.cgi?id=404057

Dennis Schridde  changed:

   What|Removed |Added

  Latest Commit||https://bugs.kde.org/show_bug.cgi?id=442453


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2023-01-01 Thread Dennis Schridde
https://bugs.kde.org/show_bug.cgi?id=404057

Dennis Schridde  changed:

   What|Removed |Added

 CC||devuran...@gmx.net


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2022-06-12 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

euphonis...@outlook.com changed:

   What|Removed |Added

 CC||euphonis...@outlook.com


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2022-05-29 Thread Moritz Herrmann
https://bugs.kde.org/show_bug.cgi?id=404057

Moritz Herrmann  changed:

   What|Removed |Added

 CC||moritzherrmann09+kde.org@gm
   ||ail.com


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2022-01-12 Thread Joachim Wagner
https://bugs.kde.org/show_bug.cgi?id=404057

Joachim Wagner  changed:

   What|Removed |Added

 CC||jwag...@computing.dcu.ie

--- Comment #42 from Joachim Wagner  ---
In addition to btrfs's mount-time allocated device numbers, the device numbers
of dm-crypt devices can also be an issue for users with multiple such devices
as the device minor numbers are not stable across restarts.

(I assume the numbers depend on the timing of luksOpen for each device, further
assuming backing devices are probed in parallel. This may be specific to the
Linux distribution and whether the same passphrase is used for the devices.
When I find the time, I'll create new LUKS key slots with substantially
different iter-time to test this.)
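
For anyone who wants to check this on their own system, a minimal sketch that
prints the device number the kernel reports for a path, split into major and
minor. Running it before and after a reboot for a path on each mount shows
whether the numbers are stable. The helper name is made up.

```python
import os


def device_numbers(path):
    """Return (st_dev, major, minor, inode) for the file system object at path."""
    st = os.stat(path)
    return st.st_dev, os.major(st.st_dev), os.minor(st.st_dev), st.st_ino


# "." is just an example; use a path on the mount you are interested in.
dev, major, minor, ino = device_numbers(".")
print(f"st_dev={dev} major={major} minor={minor} inode={ino}")
```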


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2021-12-27 Thread Massimiliano L
https://bugs.kde.org/show_bug.cgi?id=404057

Massimiliano L  changed:

   What|Removed |Added

   See Also||https://bugs.kde.org/show_bug.cgi?id=419302


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2021-08-22 Thread Aetf
https://bugs.kde.org/show_bug.cgi?id=404057

Aetf <7437...@gmail.com> changed:

   What|Removed |Added

 CC||7437...@gmail.com


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2021-04-28 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

tagwer...@innerjoin.org changed:

   What|Removed |Added

 CC||tagwer...@innerjoin.org

--- Comment #41 from tagwer...@innerjoin.org ---
(In reply to Kai Krakow from comment #40)
> Further research confirms: btrfs has unstable device ids because it exposes
> subvolumes as virtual block devices without their own device node in /dev.
This has resurfaced in Bug 402154.
There are several reports related to openSUSE - BTRFS with multiple subvols
https://bugs.kde.org/show_bug.cgi?id=402154#c12


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2021-04-28 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

tagwer...@innerjoin.org changed:

   What|Removed |Added

   See Also||https://bugs.kde.org/show_bug.cgi?id=402154


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2021-02-16 Thread soredake
https://bugs.kde.org/show_bug.cgi?id=404057

soredake  changed:

   What|Removed |Added

 CC||ndrzj1...@relay.firefox.com


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2019-11-24 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=404057

gwar...@gmail.com changed:

   What|Removed |Added

 CC||gwar...@gmail.com


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2019-10-12 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #40 from Kai Krakow  ---
Further research confirms: btrfs has unstable device ids because it exposes
subvolumes as virtual block devices without their own device node in /dev.
Thus, device id numbers are allocated dynamically at runtime by the kernel.
The same happens for NFS and FUSE file systems; the latter usually even have
unstable inode numbers.

So currently, baloo is actually only safe to use on ext2/3/4 and xfs as the
most prominent examples. On many other filesystems it will reindex files, and
will even be unable to return proper results because reverse mapping of DocIds
to filesystem paths is unreliable.

This problem is deeply baked into the design of Baloo.

My idea of merging device id and ino into one 64-bit integer wouldn't need
much modification to the existing code/storage format in theory. But apparently
this would make using the reverse mapping impossible because the functions
couldn't extract the device id from the docid to convert it back to a mount
path.

Additionally, btrfs shares the UUID between all subvolumes, so using it would
produce duplicate inode numbers, which would confuse baloo. Btrfs has UUID_SUB
instead.

After all, it seems we'd need a specialized GUID generator per filesystem type.
Since GUID formats may differ widely, I'd suggest creating a registry table
inside the database, much like my counted ID idea outlined above. Each
new GUID would simply be registered in the table with a monotonically
increasing number which can be used as the device ID for DocId. Still, we'd
need to expand the DocId, but temporarily it would do.
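
A rough sketch of the registry idea, kept in memory here instead of in the
database. The class and function names are made up, and the DocId layout
(32 bits of device id in the low half, 32 bits of inode in the high half) is
an assumption for illustration only. It shows that the reverse mapping from a
DocId back to the device id keeps working with registered numbers, and also
why the DocId would still need to grow for 64-bit inodes.

```python
class DeviceRegistry:
    """Maps file system identifiers (e.g. UUID_SUB strings) to small stable counters."""

    def __init__(self):
        self._ids = {}

    def device_id(self, fs_guid):
        # Register unseen GUIDs with the next number; known ones keep theirs.
        if fs_guid not in self._ids:
            self._ids[fs_guid] = len(self._ids) + 1
        return self._ids[fs_guid]


def make_doc_id(device_id, inode):
    """Pack a 32-bit device id and a 32-bit inode into one 64-bit value (assumed layout)."""
    if device_id >= 1 << 32 or inode >= 1 << 32:
        raise ValueError("does not fit the assumed 32+32 bit layout")
    return (inode << 32) | device_id


def split_doc_id(doc_id):
    """Reverse mapping: recover (device_id, inode) from a packed DocId."""
    return doc_id & 0xFFFFFFFF, doc_id >> 32


registry = DeviceRegistry()
dev_a = registry.device_id("uuid-sub-a")  # made-up identifiers
dev_b = registry.device_id("uuid-sub-b")
doc_id = make_doc_id(dev_b, inode=257)
print(dev_a, dev_b, split_doc_id(doc_id))
```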


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2019-10-12 Thread Martin Steigerwald
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #39 from Martin Steigerwald  ---
Good morning Kai. Overnight I think I got your idea about UUID->CounterID
mapping table. Sounds like an approach that can work. About… the several
databases thing… I get the disadvantages you mention. Of course it could use
PolicyKit to ask for permission to create a folder… however… usually Baloo is
also *per* user, not *per* filesystem. I have two users with different Baloo
databases on my laptop. One within ecryptfs layered on top of BTRFS.


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2019-10-11 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #38 from Kai Krakow  ---
(In reply to Martin Steigerwald from comment #35)
> I like the idea to use one DB per filesystem. This way you can save the
> complete filesystem UUID and/or other identifying information *once* and use
> the full 64 bit for the inode number thing.

Such things are always difficult (the same for your hidden ID file in the root
of an FS): It requires permissions you may not have, thus you may decide to use
the most top-level directory you have write permissions to. And at that point
I'd define that the storage path is undefined.

Other solution: Name the index files by UUID and store it at a defined
location. We'll have other problems now:

Do you really want multiple multi-GB LMDB files mmapped at once in RAM?
With even more chaotic random access patterns (and their potential to push your
precious cache out of memory)? Also, multi-file databases are hard to sync with
each other: At some point we may need to ensure integrity between all the
files. This won't end well. I'm all in for a single-DB approach.


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2019-10-11 Thread Martin Steigerwald
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #37 from Martin Steigerwald  ---
It is probably too late for me or I am not into programming enough at the
moment, to fully understand what you propose. :) May be something to bring to
the suitable Phabricator tasks, maybe

Overhaul Baloo database scheme: https://phabricator.kde.org/T9805


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2019-10-11 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #36 from Kai Krakow  ---
Oh, nice... Sometimes it helps to talk about a few things.

I could think of the following solution:

Add another UUID->CounterID mapping table to the database, that is easy to
achieve. Every time we encounter a new UUID, we assign it a CounterID one above
the maximum value in the DB and use that as a file system identifier.

We can now bit-reverse the CounterID so that the least-significant bits switch
position with the highest. The result will be XOR'ed with the 64-bit inode
number. Et voila: There's a DocID.
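
The bit trick can be sketched as follows. The function names are made up, and
this is only an illustration of the arithmetic, not tested against baloo's
storage format. A small CounterID ends up in the topmost bits after reversal,
so the result stays unique only as long as inode numbers do not reach into
those same top bits.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def reverse_bits64(value):
    """Reverse the bit order of a 64-bit unsigned integer."""
    return int(f"{value & MASK64:064b}"[::-1], 2)


def make_doc_id(counter_id, inode):
    """XOR the bit-reversed CounterID with the 64-bit inode number."""
    return reverse_bits64(counter_id) ^ (inode & MASK64)


doc_id = make_doc_id(counter_id=1, inode=257)
print(hex(doc_id))

# With a known CounterID, the inode can be recovered by XOR-ing again.
print(doc_id ^ reverse_bits64(1))
```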

What do you think?


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2019-10-11 Thread Martin Steigerwald
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #35 from Martin Steigerwald  ---
I like the idea to use one DB per filesystem. This way you can save the
complete filesystem UUID and/or other identifying information *once* and use
the full 64 bit for the inode number thing.


[frameworks-baloo] [Bug 404057] Uses an insane amount of memory (RSS/PSS) writing a *ton* of data while re-indexing unchanged files

2019-10-11 Thread Martin Steigerwald
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #34 from Martin Steigerwald  ---
I would not worry all that much about redoing the database for a format change.
In the end, what we have now is that it re-indexes anyway. And yes, I have a multi
device BTRFS. Actually a BTRFS RAID 1.

I wonder whether Baloo would be better off by using filesystem UUID together
with 64 bit inode number as an identifier. However using a complete filesystem
UUID may need too much storage, I don't know.

Another idea would be to mark each indexed file with a kind of ID or timestamp
using extended attributes. However, that might get lost as well and it won't
work on filesystems not supporting those.

A third idea would be to write a 32-bit identifier as a filesystem ID into a
hidden file below the root directory of the filesystem or a hidden
subdirectory of it. This would at least avoid using an identifier that could
change. It would not solve the issue that 32 bits are not enough to store a
64-bit inode number. However, this might be the easiest change to apply in the
short term.
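
A sketch of this third idea in Python, under stated assumptions: the file name
`.baloo-fsid` and the helper `load_or_create_fs_id` are made up for
illustration, and the identifier is simply a random 32-bit number written once.

```python
import os
import secrets
import struct

ID_FILE_NAME = ".baloo-fsid"  # hypothetical name, not used by Baloo


def load_or_create_fs_id(mount_point: str) -> int:
    """Return the 32-bit ID stored below mount_point, creating it if absent."""
    path = os.path.join(mount_point, ID_FILE_NAME)
    try:
        with open(path, "rb") as f:
            data = f.read(4)
        if len(data) == 4:
            return struct.unpack(">I", data)[0]
    except FileNotFoundError:
        pass
    # No usable ID file yet: pick a random 32-bit value and persist it.
    fs_id = secrets.randbits(32)
    with open(path, "wb") as f:
        f.write(struct.pack(">I", fs_id))
    return fs_id
```

The ID stays the same across reboots and mount order changes, but it requires
write access to the filesystem root and can be lost if the file is deleted.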

2019-10-11 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #33 from Kai Krakow  ---
Also, LMDB is totally the wrong tool when using 32-bit systems because your
index cannot grow beyond a certain size before crashing baloo.

I'm not sure if 32-bit systems are still a thing - but if they are, the
decision for LMDB was clearly wrong.

2019-10-11 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #32 from Kai Krakow  ---
(In reply to Martin Steigerwald from comment #30)
> I believe this bug is at
> least about two or three independent issues, but as you told, let's have it
> about the re-indexing files thing. I bet getting rid of needlessly
> re-indexing files will be one of the most effective to sort out performance
> issues with Baloo. I changed bug subject accordingly.

That's why I started a task in Phabricator to better distinguish which problem
is which:
https://phabricator.kde.org/T11859

And that's why I suggested a few comments ago to concentrate on fixing the
re-indexing bug first.

But it turned out to be not that easy. Together with the other performance
issues and especially the tight coupling of mmap, access patterns, and
available memory, I think it's worth rethinking if LMDB is still the right
tool:

mmap introduces a lot of problems and there's no easy way around it. There are
too many things to think of to optimize access patterns then. It's unlikely to
be done anytime soon when looking at the problems exposed by the database
scheme already.

LMDB seems to be designed around the idea of being the only user of system RAM,
or at least of using only a very small part of it (which may not be that small
if you have huge amounts of RAM). That's unlikely to be the situation on
systems where baloo is used.

Bad design choices have already been made and been meshed deeply into the
database scheme, which makes it difficult to migrate existing installations.

BTW: Following up on your comment #2 was totally unintentional but actually I
followed up and explained why that happens. ;-)

2019-10-11 Thread Kai Krakow
https://bugs.kde.org/show_bug.cgi?id=404057

--- Comment #31 from Kai Krakow  ---
As far as I understand from researching the discussions in phabricator, this
problem won't be easy to fix as it is baked into the design decision that
defined the database scheme.

Based on the fact that the DocId (which is used to find if a file still exists,
or changed, or just moved unmodified) is a 64-bit value which is actually made
of a 32-bit st_dev (device id) and 32-bit ino (inode number), I see two
problems here:

1. btrfs uses 64-bit inode numbers. At some point, the numbers will overflow
the 32-bit field and the baloo database becomes confused.

2. multi-dev btrfs (and I think you also use that, as I do) may have an
unstable st_dev number across reboots, resulting in changed DocIds every once
in a while after a reboot.
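
Both problems can be reproduced with a few lines of Python. This is only a
sketch: `legacy_doc_id` is a hypothetical helper, the device numbers are
made-up values, and the order in which the two 32-bit halves are packed is an
assumption, since all that is stated above is that each half is 32 bits wide.

```python
MASK32 = 0xFFFFFFFF


def legacy_doc_id(st_dev: int, st_ino: int) -> int:
    """Pack a truncated 32-bit inode and a 32-bit device id into 64 bits."""
    # Assumed layout: inode in the upper half, device id in the lower half.
    return ((st_ino & MASK32) << 32) | (st_dev & MASK32)


# Problem 1: two inodes that differ only above bit 32 get the same DocId.
small_ino = 5
large_ino = 5 + (1 << 32)
collides = legacy_doc_id(0x2B, small_ino) == legacy_doc_id(0x2B, large_ino)

# Problem 2: the same file gets a new DocId when st_dev changes on reboot.
changed = legacy_doc_id(0x2B, small_ino) != legacy_doc_id(0x2C, small_ino)
```

Here `collides` and `changed` are both true, which matches the two failure
modes: distinct files mapped to one DocId, and one unchanged file appearing
as a new document.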

The phabricator discussions point out that other (primarily network) file
systems suffer the same problem: They have unstable st_dev values maybe even
after reconnect. User space file systems even depend on mount order for the
st_dev value.

Changing this is no easy task: it would require a format change which either
invalidates your DB or needs a migration. So I'm currently evaluating if it
makes sense to switch to a key/value store that doesn't rely on mmap (as it has
clear downsides on a typical desktop system). This would make it easy to change
the database scheme in the same step, as the index would have to be recreated
anyway. I'm currently digging my nose into Facebook's RocksDB; it looks mostly
good except that it was optimized solely around flash-based storage.

2019-10-11 Thread Martin Steigerwald
https://bugs.kde.org/show_bug.cgi?id=404057

Martin Steigerwald  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Uses an insane amount of    |Uses an insane amount of
                   |memory (RSS/PSS) and writes |memory (RSS/PSS) writing a
                   |a *ton* of data  while      |*ton* of data while
                   |indexing                    |re-indexing unchanged files

--- Comment #30 from Martin Steigerwald  ---
(In reply to Kai Krakow from comment #28)
> (In reply to Kai Krakow from comment #19)
[…]
> Following up with more details:
> 
> The problem seems to be the following:
> 
> After reboot, the indexer finds all files as changed. For every file in the
> index, it will log to stdout/stderr:
> 
> "path to file" id seems to have changed. Perhaps baloo was not running, and
> this file was deleted + re-created

Kai, please see my comment #2 of this bug report:

https://bugs.kde.org/show_bug.cgi?id=404057#c2

I got exactly the same message. I have just been rereading what I wrote, as I
was no longer aware of it.

So yes, that is indeed an important issue here. I believe this bug is at least
about two or three independent issues, but as you said, let's have it about the
re-indexing files thing. I bet getting rid of needlessly re-indexing files will
be one of the most effective ways to sort out performance issues with Baloo. I
changed bug subject accordingly.
