[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2024-01-23 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #20 from tagwer...@innerjoin.org ---
(In reply to Frank Steinmetzger from comment #19)
> It’s all maildir, but with over 100k files. ^^
A hurried google of "maildir format" gives me that it holds one message per
file, with the format like .eml. At least kmimetypefinder gives
"message/rfc822". I think Bug 460882 would still apply and you could be writing
loads of "random" strings (from encoded attachments, whatever) and repeatedly
rewriting the entries for "common terms".

If each of your messages has a "Subject" line, a search for "Subject" will
retrieve them all. The database record for "Subject" will have been rewritten,
with a commit, after each batch of files indexed. That will be a lot of
rewriting. Baloo knows this is an issue and batches up and indexes 40 files at
a time to cut down on the amount of rewriting required. I suppose, for loads of
small files, it could batch up more...
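
As a rough illustration of the arithmetic above (a sketch, not Baloo's own code): with one commit per batch of 40 files, a maildir of 100,000 messages needs about 2,500 batches, so the record for a term that occurs in every file, such as "Subject", would be rewritten about 2,500 times. The function name `batches_needed` and the larger batch size of 400 are made up for this example.

```shell
# Hypothetical helper: number of batches (and therefore commits) needed
# for a given file count and batch size, using ceiling division.
batches_needed() {
    files=$1
    batch=$2
    echo $(( (files + batch - 1) / batch ))
}

# 100k files at 40 per batch (the batch size mentioned above)
batches_needed 100000 40    # prints 2500
# The same files with a hypothetical batch size of 400
batches_needed 100000 400   # prints 250
```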

-- 
You are receiving this mail because:
You are watching all bug changes.

[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2024-01-23 Thread Frank Steinmetzger
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #19 from Frank Steinmetzger  ---
(In reply to tagwerk19 from comment #18)

> Watch out for indexing email files, particularly those encoded or with
> attachments. For .eml files see Bug 460882; .mbox files can be absolutely
> massive.

It’s all maildir, but with over 100k files. ^^
I’ve had enough at one point and figured there must be something wrong with my
database. So I moved it away and reindexed everything. Seeing that it indexed a
lot more files, I think that the database has been in a very old state for
quite some time and baloo tried to update it ever since.

There is one problem though: the write volume is very bad. The final database
file is maybe around 16 GB (the defunct database was 18 GB), but the write
volume during indexing was a multiple of that, at least 100 GB.
So I symlinked ~/.local/share/baloo to a ramdisk.


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2024-01-22 Thread Eumel
https://bugs.kde.org/show_bug.cgi?id=438074

Eumel  changed:

   What|Removed |Added

 CC||blaueshawaiih...@gmx.de


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2024-01-04 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #18 from tagwer...@innerjoin.org ---
(In reply to Frank Steinmetzger from comment #17)
> Indexer state: Suspended
I wonder what does that nowadays... There used to be a "balooctl suspend" but I
think that's been removed.

> ... constant read I/O of 150 MB/s for at least half an hour after login ...
and

> Current size of index is 18.22 GiB 
Gut feeling here is that the systemd limits on RAM are cutting in on you, have
a look at what:

systemctl --user status kde-baloo

says. The unit file limits Baloo's RAM use to 512 MB. When Baloo hits that
limit it will drop clean pages from its cache so it can load others. You see
Baloo slow down and spend a *load* of time and energy reading.

My personal view is that 512 MB is somewhat strict, 50% works for me (together
with stopping Baloo using swap)

MemoryHigh=50%
MemorySwapMax=0

Watch out for indexing email files, particularly those encoded or with
attachments. For .eml files see Bug 460882; .mbox files can be absolutely
massive.
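
One way to apply the two settings above is a drop-in file for the user unit. This is only a sketch: it assumes the unit is `kde-baloo.service` (as in the `systemctl` command above) and that the standard systemd user drop-in directory is used. The function name `write_baloo_override` and the file name `override.conf` are illustrative.

```shell
# Hypothetical helper: write a drop-in with the two limits into the given
# directory. For a real setup the directory would be
# ~/.config/systemd/user/kde-baloo.service.d (assumed unit name).
write_baloo_override() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
MemoryHigh=50%
MemorySwapMax=0
EOF
}

# Dry run into a scratch directory so that nothing is changed by accident
scratch=$(mktemp -d)
write_baloo_override "$scratch/kde-baloo.service.d"
cat "$scratch/kde-baloo.service.d/override.conf"
```

After writing the file to the real location, a `systemctl --user daemon-reload` and a restart of the unit would be needed before the new limits apply.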


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2024-01-04 Thread Frank Steinmetzger
https://bugs.kde.org/show_bug.cgi?id=438074

Frank Steinmetzger  changed:

   What|Removed |Added

 CC||dev-...@felsenfleischer.de

--- Comment #17 from Frank Steinmetzger  ---
I’ve also been observing this problem for quite some time now. Thankfully, it
does not slow down my PC, even though it is 9½ years old. But I see it in the
system monitor applets in my panel that there is constant read I/O of 150 MB/s
for at least half an hour after login.

I ran balooshow -x on a file before and after the last two reboots and the
output was identical.
I don’t run btrfs. My system uses ext4 on LVM on LUKS. The output of lsblk also
remained unchanged across boots.
I’m not sure what else to check. I’ve issued balooctl suspend a few minutes
ago, but the indexer still chugs along. Eventually I killed the extractor with
killall.

```
~ LC_ALL=C time balooctl status
Baloo File Indexer is running
Indexer state: Suspended
Total files indexed: 176,898
Files waiting for content indexing: 74,305
Files failed to index: 0
Current size of index is 18.22 GiB

real    0m49,878s
user    0m0,013s
sys     0m0,004s
```

Addendum:
Reading this thread, I found out about `balooctl monitor` and started it, then
resumed the indexing. The monitor first printed some email files and there was
minor system load. Then a few seconds nothing and then the I/O load started
again, but the monitor has not shown any new filenames since.

Operating System: Arch Linux 
KDE Plasma Version: 5.27.10
KDE Frameworks Version: 5.113.0
Qt Version: 5.15.11
Kernel Version: 6.6.9-arch1-1 (64-bit)
Graphics Platform: X11
Processors: 4 × Intel® Core™ i5-4590 CPU @ 3.30GHz
Memory: 30.8 GiB of RAM


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2023-01-02 Thread Seqularise
https://bugs.kde.org/show_bug.cgi?id=438074

Seqularise  changed:

   What|Removed |Added

 CC||seqular...@outlook.com


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2023-01-01 Thread Dennis Schridde
https://bugs.kde.org/show_bug.cgi?id=438074

Dennis Schridde  changed:

   What    |Removed |Added

   CC      |        |devuran...@gmx.net
   See Also|        |https://bugs.kde.org/show_bug.cgi?id=456108,
           |        |https://bugs.kde.org/show_bug.cgi?id=402154


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2022-06-28 Thread skierpage
https://bugs.kde.org/show_bug.cgi?id=438074

skierpage  changed:

   What|Removed |Added

 CC||skierp...@gmail.com

--- Comment #16 from skierpage  ---
In bug 456108 I have a similar problem with baloo constantly reindexing 12 of
my files, but all of them have modification time of Jan 1 1970 (0 seconds in
Unix epoch), or earlier than that. The reporter here says
> mtime and ctime match in both files
implying this is a different problem, so I filed a separate bug.

> .doc files are sometimes indexed, sometimes not.
> ...
> So my impression is that some of the file extractors don't work as expected?
I'm documenting every baloo limitation I come across at
https://community.kde.org/Baloo#Indexing_limitations


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-11 Thread Stefan Brüns
https://bugs.kde.org/show_bug.cgi?id=438074

Stefan Brüns  changed:

   What    |Removed                      |Added

   Assignee|stefan.bruens@rwth-aachen.de |baloo-bugs-n...@kde.org


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-11 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #15 from tagwer...@innerjoin.org ---
(In reply to Martin Tlustos from comment #12)
> But I looked through a couple of folders now checking with "balooctl index *
> | grep 'different file types'" and found that .odg, .odp, .zip, .sem, .kra,
> .ppt are always reindexed, so it is a file-type problem.
> 
> .doc files are sometimes indexed, sometimes not. If I open a doc file that
> wasn't indexed before in libreoffice and resave it, it is indexed
> successfully.
> 
> So my impression is that some of the file extractors don't work as expected?
It sounds like it.

If I look in 
/usr/share/mime/packages/freedesktop.org.xml
It seems that Qt flags .odg and .odp files as 'zipped' files. Nothing specific
for the other filetypes - but it might be worth seeing if one or the other
starts with
PK\003\004 

I notice there's a new bug, Bug 438455, mentioning .doc (as opposed to .docx)
files.
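
To check whether a file starts with the ZIP signature `PK\003\004` mentioned above, something like the following could be used. It is a sketch: the function name `starts_with_zip_magic` is made up, and it only looks at the first four bytes. It says nothing about how Baloo or Qt actually decide the file type.

```shell
# Hypothetical helper: succeed if the first four bytes of the file are
# 'P' 'K' 0x03 0x04, i.e. the ZIP local file header signature.
starts_with_zip_magic() {
    magic=$(head -c 4 -- "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "504b0304" ]
}

# Report on each file given on the command line
for f in "$@"; do
    if starts_with_zip_magic "$f"; then
        printf '%s: ZIP signature found\n' "$f"
    else
        printf '%s: no ZIP signature\n' "$f"
    fi
done
```

Running it over the file types that are always reindexed (.odg, .odp, .zip, .kra and so on) would show which of them share the signature.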


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-11 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #14 from Martin Tlustos  ---
A new error showed up:
11.06.21 09:58  dolphin kf.kio.widgets: Plugin "baloofilepropertiesplugin" is
using the deprecated loading style. Please port it to JSON loading.
11.06.21 09:58  dolphin kf.kio.widgets: Plugin "baloofilepropertiesplugin" is
using the deprecated loading style. Please port it to JSON loading.

Maybe that is the reason for some of the filetypes not being indexed?


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-11 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #13 from Martin Tlustos  ---
One more thing: "balooctl check" shows one .odp file as being reindexed,
while "balooctl index" on the same file says it is already indexed. Strange...


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-11 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #12 from Martin Tlustos  ---
Ok, did balooctl index * in one of the folders... Some of the files are
skipped, because they are already indexed, some are indexed (the same ones that
show up when doing balooctl check). No additional errors are shown.
Journald has two warnings and four errors:
11.06.21 08:59  baloo_file_extractor  "Error: Unknown font tag 'ZaDb'"
11.06.21 08:59  baloo_file_extractor  "Error: Unknown font tag 'ZaDb'"
11.06.21 08:59  baloo_file_extractor  "Error: Unknown font tag 'ZaDb'"
11.06.21 08:59  baloo_file_extractor  "Error: Unknown font tag 'ZaDb'"
11.06.21 09:02  baloo_file_extractor  Invalid document structure (meta.xml is missing)
11.06.21 09:02  baloo_file_extractor  Invalid document structure (meta.xml is missing)
The two "meta.xml" messages came from one specific folder where two faulty odt
documents were found. After fixing those, the "meta.xml missing" messages were
gone, but there were still files in that folder that were reindexed, so these
were not the cause of the problem.

But I looked through a couple of folders now checking with "balooctl index * |
grep 'different file types'" and found that .odg, .odp, .zip, .sem, .kra, .ppt
are always reindexed, so it is a file-type problem.

.doc files are sometimes indexed, sometimes not. If I open a doc file that
wasn't indexed before in libreoffice and resave it, it is indexed successfully.

So my impression is that some of the file extractors don't work as expected?


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-10 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #11 from tagwer...@innerjoin.org ---
(In reply to Martin Tlustos from comment #9)
> All in all it looked pretty normal to me...
So I was overly optimistic :-/

However, on the basis that you'd copied the folder and found you'd copied the
problem I still suspect that one or more of the files is tripping up the
indexing. The question is how to find it.

Maybe see if there's anything in the logs (with journalctl and look for
"baloo_fil" entries)? Manually indexing files and seeing if there are any
errors (with "balooctl index ...")?

Not sure what more to suggest. Sorry.


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-10 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #10 from Martin Tlustos  ---
Ah, sorry, no, these are all backup files that are excluded by default...


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-10 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #9 from Martin Tlustos  ---
OK, here's what I did: I ran balooshow -x * > balooshowinfo.txt in one of the
affected folders where there aren't too many files (but more than 40, so one
batch at least) and checked the contents.
Sample content for an image:
11a07640802 2050 18483044 samplefilename.jpg
[/home/whatever/whatever/whatever/samplefilename.jpg]
Mtime: 0 1970-01-01T01:00:00
Ctime: 1623231112 2021-06-09T11:31:52
Cached properties:
Width: 2459
Height: 3531

Internal Info
Terms: Mimage Mjpeg T4 X26-2459 X27-3531 
File Name Terms: Fjpg samplefilename 
XAttr Terms: 
height: 3531
width: 2459
width: 2459

Some had more info, e.g. tags.
Same for PNGs.

Some of the PDFs had similar index entries, some had text extracted. I suspect
those without extracted text were image PDFs.

All in all it looked pretty normal to me, only there were a few entries with
this at the end:
"no index information found". Could those be the culprits?


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-09 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #8 from tagwer...@innerjoin.org ---
(In reply to Martin Tlustos from comment #7)
> Copying the content of an afflicted folder to a new folder doesn't help, the
> new folder is reindexed as well.
So you've copied the problem - and it seems to be "a file problem" :-)

I know that baloo_file_extractor deals with batches of files, 40 at a time. I
don't know if it commits "what it's learned" after each file or at the end of
the batch but I can imagine that it's committed at the end of the batch.

If there's a (bad enough) failure indexing one of the files, it may be that no
"content index" information for the batch is written to the index. The indexing
of these files is incomplete so baloo, after a "balooctl check", tries again.

It should be that baloo recognises such failures, "balooctl status" does give a
count of "Files failed to index". Maybe that's not working as it should.

Anyway, it might be possible to see some evidence...

For one of the files that are repeatedly reindexed, have a look with "balooshow
-x ..." and what's listed under the "Internal Info":

If this is very basic ("Terms", "File Name Terms", "XAttr Terms"),
these are what "baloo_file" writes during its initial scan.

If you see a longer list of "Terms", words that appear within the
document, or possibly a "Width:" and "Height:" for an image (could
be loads of different fields for an image file), then this is
information collected and written by "baloo_file_extractor".

So compare what "balooshow -x .. " gives for your repeatedly reindexed files -
and compare that to what "balooshow -x" gives for files that have been indexed
OK.

My guess is if you see only "basic information" then something's failed and the
data's not been committed. After that it might be a question of trying to
"narrow down" the file/files that are causing the problem.


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-09 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #7 from Martin Tlustos  ---
Ok, some more testing...

Copying the content of an afflicted folder to a new folder doesn't help, the new
folder is reindexed as well.
stat and balooshow don't see any differences in files in those folders after
reboots (I only checked one. The original test file was on a different account,
but I redid it in my own account as well).

A new test file created in one of these folders is NOT reindexed, so this
indicates that it actually is a file problem.

This is just a normal home folder on a separate HDD with ext4 formatting. The OS
is on a different SSD. No snap, no encryption. Different file types are
affected, like png, jpg, pdf, doc, odt...


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-08 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #6 from tagwer...@innerjoin.org ---
So...

It's not all your files, just some folders.

Baloo has accurate modification/change times - these haven't changed,
and the device number/inode also hasn't changed, but a "balooctl check"
still thinks that the files need (re)indexing.

More random thoughts...

When you did the test with "stat testfile.txt" it was in one of
these "odd" folders?

The folders are not encrypted folders or related to snaps (which might
be mounted in a different way)

It's not a particular filetype that is giving trouble? (Dunno if baloo
worries if a file "was" plain text and then "seems to be" something else)

Do you get anything strange if you do a
   baloosearch ...oneofthefunnyfiles...
Do you get a single hit or multiple hits?

After that I think the next boring, pedestrian, troubleshooting step is to copy
some of the troublesome folders (copying with all metadata "cp -a ...") and see
if you copy the problem as well. Perhaps try copying to a new user.

I will also say, thank you for your patience...


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-08 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #5 from Martin Tlustos  ---
mtime and ctime match in both files.
I did purge the database some time ago for that very reason, but it didn't
help.
I do have a few symlinks, but they are not in any of the folders baloo checks
(only in /.trash, /.var/app, /.local, /.config and /.mozilla).
It's always the same folders that are checked, so it could be a problem with
those folders, but I didn't see any problems with folder settings or
permissions.

Btw, the same thing happens if I do balooctl check. baloo checks around 750
files (some of which haven't been changed in years), and baloo_file_extractor
writes up to 20MB/s for about two minutes.


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-07 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #4 from tagwer...@innerjoin.org ---
(In reply to Martin Tlustos from comment #3)
> Ok., did the test as suggested. No changes in device number or inode. 
It was a bit of a guess - but it would have explained the reindexing.

A possible thing to look at is whether the modified and changed times that
balooshow gives for your testfile.txt (the Mtime and Ctime), match those that
stat gives (Modify and Change times).

I'm guessing you've purged the database and started "from zero". Does the same
thing happen if you create a new user?

Could there be any confusion caused by symbolic links within your $HOME?

Have to say, I've not seen this issue so there's some guesswork involved
here...
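
For the Mtime/Ctime comparison suggested above, the values from `stat` can be printed as seconds since the epoch, which appears to be the form of the first number on the `Mtime:` and `Ctime:` lines in the balooshow output quoted elsewhere in this thread. A sketch, assuming GNU coreutils `stat`; the function name `stat_times` is made up.

```shell
# Hypothetical helper: print modification and change time of a file as
# seconds since the epoch (GNU stat: %Y = mtime, %Z = ctime).
stat_times() {
    echo "Mtime: $(stat -c %Y -- "$1")"
    echo "Ctime: $(stat -c %Z -- "$1")"
}

# Print the times for each file given on the command line
for f in "$@"; do
    stat_times "$f"
done
```

The two lines can then be compared by eye with the corresponding lines from "balooshow -x" for the same file.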


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-07 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #3 from Martin Tlustos  ---
OK, I did the test as suggested. No changes in device number or inode. 
I did 
stat testfile.txt >statinfo.txt and 
balooshow -x testfile.txt >balooinfo.txt,

restarted
did stat testfile.txt > statinfo-new.txt 
and 
balooshow -x testfile.txt > balooinfo-new.txt

And compared statinfo.txt with statinfo-new.txt and balooinfo.txt with
balooinfo-new.txt. No differences.


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-07 Thread Martin Tlustos
https://bugs.kde.org/show_bug.cgi?id=438074

--- Comment #2 from Martin Tlustos  ---
Well, the first thing I tried was exiting and reinitiating my normal user
account session, and baloo started reindexing. Again...
Anyway, I will try the test you suggested to find whether the device number has
changed.


[frameworks-baloo] [Bug 438074] baloo reindexing files on every start

2021-06-04 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=438074

tagwer...@innerjoin.org changed:

   What|Removed |Added

 CC||tagwer...@innerjoin.org

--- Comment #1 from tagwer...@innerjoin.org ---
You say "Neon" rather than, say, "openSUSE"; however, it might be worth looking
at:
https://bugs.kde.org/show_bug.cgi?id=402154#c12

The issue in that case is that baloo expects the device number / inode for
files to be stable (not change every reboot). With certain
filesystems/distributions the devno can change, with remote filesystems it
seems that the inode can also change.

Try the test with "stat" and "balooshow -x" and see what you see.
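
A small sketch of the "stat" half of that test, assuming GNU coreutils `stat`; the function name `dev_inode` is made up for this example.

```shell
# Hypothetical helper: print device number and inode of a file
# (GNU stat: %d = device number in decimal, %i = inode number).
dev_inode() {
    stat -c 'dev=%d inode=%i' -- "$1"
}

# Print one line per file given on the command line
for f in "$@"; do
    printf '%s  %s\n' "$(dev_inode "$f")" "$f"
done
```

Saving this output before and after a reboot and comparing the two with `diff` would show whether either number has moved.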
