Re: Baloo - Not Indexing everything by default
On Thu, Oct 16, 2014 at 2:15 PM, Martin Gräßlin mgraess...@kde.org wrote:

On Thursday 16 October 2014 13:20:57 Vishesh Handa wrote:

Hey guys

While Baloo performs better than Nepomuk. It does have its share of problems - mostly large text files, and high IO usage. Additionally, users on linux often seem to have the craziest files. Currently, we do not index plain text files which do not have a `.txt` extension, because otherwise we land up indexing genome data and other strange files. (Actual bugs)

the txt being genome data doesn't surprise me[1], but I find it sad that now txt is disabled by default (I use them quite a lot for blog posts). As genome data is really huge wouldn't it make sense to go rather for file size or abort the indexing if it's obvious random gibberish?

Or skip it if it looks like csv or is mostly numbers?

___
Plasma-devel mailing list
Plasma-devel@kde.org
https://mail.kde.org/mailman/listinfo/plasma-devel
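The "mostly numbers" idea could be sketched roughly as follows. This is only an illustration, not anything Baloo does: the function name, the set of numeric punctuation and the 0.6 threshold are all assumptions. Note that it would not catch genome data as such, since sequence files consist of letters.

```python
def looks_mostly_numeric(sample: str, threshold: float = 0.6) -> bool:
    """Return True when most non-whitespace characters in the sample are
    digits or numeric punctuation. The threshold is an arbitrary assumption."""
    chars = [c for c in sample if not c.isspace()]
    if not chars:
        return False
    numeric = sum(1 for c in chars if c.isdigit() or c in ".,;-+")
    return numeric / len(chars) >= threshold
```

An indexer could run such a check on the first few kilobytes of a file and skip the content (but still index the file name) when it returns True.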
Re: Baloo - Not Indexing everything by default
On Fri, Oct 17, 2014 at 2:11 AM, Ömer Fadıl USTA omeru...@gmail.com wrote:

Couldn't we add a .baloo file to specify attributes for the current directory, file or subdirectories about not indexing?

For example, a .baloo file:

skip_all
skip_if_greater 1m
skip_if_smaller 50k
skip_ext txt jpg

With a --subdirs flag like:

Skip_ext --subdirs txt jpg
Skip_all --subdirs

This does seem like something I would want at some point. Except that instead of having a .baloo file, I would like us to use the xattr of the folder. Maybe both?

Anyway, this would still only be for advanced users. I've added a TODO about this to todo.kde.org. Hopefully, we can get a more advanced newcomer to try and hack on this.

--
Vishesh Handa
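To make the proposed .baloo format concrete, here is a minimal parsing sketch. The format itself is only a proposal from this thread, so everything below is hypothetical: the function and class names, the case-insensitive directive matching, and the reading of `k`/`m` as binary multiples are assumptions, and the sketch records the --subdirs flag without acting on it.

```python
from dataclasses import dataclass

# Assumption: "k", "m", "g" suffixes are binary multiples.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert strings such as '50k' or '1m' to a byte count."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


@dataclass
class Rule:
    name: str            # lower-cased directive, e.g. "skip_ext"
    args: list           # remaining arguments, e.g. ["txt", "jpg"]
    subdirs: bool = False  # True when --subdirs was given


def parse_baloo_file(content: str) -> list:
    """Parse one directive per line; blank lines are ignored."""
    rules = []
    for line in content.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        rest = tokens[1:]
        rules.append(Rule(
            name=tokens[0].lower(),
            args=[t for t in rest if t != "--subdirs"],
            subdirs="--subdirs" in rest,
        ))
    return rules


def should_skip(rules: list, filename: str, size: int) -> bool:
    """Return True if any rule excludes a file with this name and size."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    for rule in rules:
        if rule.name == "skip_all":
            return True
        if rule.name == "skip_if_greater" and size > parse_size(rule.args[0]):
            return True
        if rule.name == "skip_if_smaller" and size < parse_size(rule.args[0]):
            return True
        if rule.name == "skip_ext" and ext in rule.args:
            return True
    return False
```

The same rule objects could equally be filled from extended attributes on the folder, which is the storage Vishesh says he would prefer.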
Re: Baloo - Not Indexing everything by default
On 17.10.2014 18:24, Vishesh Handa wrote:

About gibberish. It's hard to figure out what gibberish is. I think I'll add some code that we only index the first 20 characters of each word. That should help to a certain extent.

Define word - Chinese and Japanese (unless mostly kana) often don't have whitespace between words.

--
Vishesh Handa

Cheers,
Eike
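A minimal sketch of the 20-character idea, assuming that a "word" is simply a whitespace-separated token (which is exactly the assumption Eike questions). The function name and constant are hypothetical and not taken from Baloo.

```python
MAX_TERM_LENGTH = 20  # limit mentioned in the thread


def index_terms(text: str, max_len: int = MAX_TERM_LENGTH) -> list:
    """Split on whitespace and keep only the first max_len characters
    of each token."""
    return [word[:max_len] for word in text.split()]
```

With whitespace splitting, a long run of Chinese or Japanese text without spaces becomes a single token, so everything after its first 20 characters would be dropped from the index. A tokenizer that is aware of the script would be needed to avoid that.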
Re: Baloo - Not Indexing everything by default
Vishesh Handa wrote:

...

Instead, we could only index -
* $HOME - Not including any subfolders.
* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user directories.

+1

Yes, please!

-- Rex
Re: Re: Baloo - Not Indexing everything by default
On Friday 17 October 2014 18:24:48 Vishesh Handa wrote:

On Thu, Oct 16, 2014 at 2:15 PM, Martin Gräßlin mgraess...@kde.org wrote:

the txt being genome data doesn't surprise me[1], but I find it sad that now txt is disabled by default (I use them quite a lot for blog posts). As genome data is really huge wouldn't it make sense to go rather for file size or abort the indexing if it's obvious random gibberish?

We currently have a hard limit of 50mb on 'text/plain' files. However this does not include log files, which have a separate mimetype. Perhaps it would really be good to reduce it to about 5 mb.

I think 5 MB could be a better limit. That's a huuuge text document. Someone with such a huge document probably doesn't need Baloo to get to it, or the keywords will already be found in the first - what is that? 1000 pages?

Cheers
Martin
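The size cap discussed here could look roughly like this. It is a sketch under assumptions: the MIME type is guessed from the file name via Python's standard `mimetypes` module (an indexer may well inspect file content instead), "5 MB" is read as 5 * 1024 * 1024 bytes, and the function name is made up for illustration.

```python
import mimetypes

TEXT_SIZE_LIMIT = 5 * 1024 * 1024  # proposed limit; the current one is 50 MB


def within_text_limit(filename: str, size: int,
                      limit: int = TEXT_SIZE_LIMIT) -> bool:
    """Apply the size cap only to files detected as text/plain;
    other types (including logs with their own type) pass through."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime == "text/plain":
        return size <= limit
    return True
```

Because the cap is tied to one MIME type, files with a separate type, such as the log files mentioned above, would need their own limit.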
Re: Baloo - Not Indexing everything by default
On Thu, Oct 16, 2014 at 1:20 PM, Vishesh Handa m...@vhanda.in wrote:

Hey guys

While Baloo performs better than Nepomuk. It does have its share of problems - mostly large text files, and high IO usage. Additionally, users on linux often seem to have the craziest files. Currently, we do not index plain text files which do not have a `.txt` extension, because otherwise we land up indexing genome data and other strange files. (Actual bugs)

I've been thinking about actually disabling the file indexing by default. However, that might be too radical. Instead, we could only index -
* $HOME - Not including any subfolders.
* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user directories.

Gnome Tracker actually does something quite similar.

Comments?

Hi Vishesh,

First of all, please don't even consider turning it off by default. Baloo might have some issues (I can't really speak for the Plasma 5.x experience since I'm still on KDE 4.xx), but having a desktop search application is really worth a lot! It's just a difficult area to get right.

Also, I don't think you should limit the indexer to the XDG folders. I personally have quite some data in $HOME, specifically in subfolders.

So it sounds like the indexer job needs to learn some new tricks then. Here's an idea off the top of my head that might work.

1. Delay indexing the content of any file that an application already has open. I don't know which C++/Qt function you can use for this, but the idea here is to not index the files/folders that you see when you type iostat. Store the list for later indexing. To still have some data there, just index the filenames.
2. What's left is everything _not_ currently in iostat, which you could potentially index. Here you should filter out any file types that you don't have indexers for.
3. Now you're left with a list of files which could potentially be indexed. Here you should probably filter out those that are bigger than 5MB.
4. Everything that's left: index!
5. While indexing, periodically look at iostat to see if any process (other than the Baloo processes) has new files/folders open. Like a compile job. If something like that is detected then you should probably follow step 1 again :)

This would be a nice indexer, right? The only problem I see here is doing an iostat call in code. I have no clue if there is a POSIX C/C++ function for that.
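The steps above can be sketched as a pure planning function. This is an illustration under assumptions, not Baloo code: all names are hypothetical, and the set of "busy" paths is passed in as an argument because how to obtain it is the open question in the message. (iostat reports per-device statistics rather than open files, so some other source, such as the kind of listing lsof gives, would be needed.)

```python
from dataclasses import dataclass

SIZE_LIMIT = 5 * 1024 * 1024  # the 5 MB cutoff from step 3, read as MiB


@dataclass
class Candidate:
    path: str
    size: int
    mimetype: str


def plan_indexing(candidates, busy_paths, supported_mimetypes,
                  size_limit=SIZE_LIMIT):
    """Return (index_now, deferred, skipped) lists of paths."""
    index_now, deferred, skipped = [], [], []
    for c in candidates:
        if c.path in busy_paths:
            # Step 1: open elsewhere; index the name only and retry later.
            deferred.append(c.path)
        elif c.mimetype not in supported_mimetypes:
            # Step 2: no indexer available for this type.
            skipped.append(c.path)
        elif c.size > size_limit:
            # Step 3: too large.
            skipped.append(c.path)
        else:
            # Step 4: index the content.
            index_now.append(c.path)
    return index_now, deferred, skipped
```

Step 5 then amounts to calling `plan_indexing` again on the remaining and deferred files with a refreshed `busy_paths` set.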
Baloo - Not Indexing everything by default
Hey guys

While Baloo performs better than Nepomuk, it does have its share of problems - mostly large text files, and high IO usage. Additionally, users on Linux often seem to have the craziest files. Currently, we do not index plain text files which do not have a `.txt` extension, because otherwise we end up indexing genome data and other strange files. (Actual bugs)

I've been thinking about actually disabling the file indexing by default. However, that might be too radical. Instead, we could only index -
* $HOME - Not including any subfolders.
* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user directories.

Gnome Tracker actually does something quite similar.

Comments?

--
Vishesh Handa
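The proposed default scope can be expressed as a small path check. This is a sketch with assumptions: the function name is hypothetical, and the directory names are hard-coded English defaults, whereas real XDG user directories are configurable and often localized.

```python
from pathlib import PurePosixPath

# Assumption: default English names; real setups may differ.
XDG_DIR_NAMES = ("Desktop", "Documents", "Videos", "Pictures", "Music")


def in_default_scope(path: str, home: str) -> bool:
    """True for entries directly in $HOME (non-recursive) and for
    anything below one of the listed user directories (recursive)."""
    p = PurePosixPath(path)
    h = PurePosixPath(home)
    if p.parent == h:
        return True
    return any((h / name) in p.parents for name in XDG_DIR_NAMES)
```

For example, with home set to /home/u, a file /home/u/notes.txt and a file /home/u/Documents/a/b.odt are in scope, while /home/u/kf5/src/main.cpp is not.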
Re: Baloo - Not Indexing everything by default
On Thursday 16 October 2014, Vishesh Handa wrote:

* $HOME - Not including any subfolders.
* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user directories.

Gnome Tracker actually does something quite similar.

Comments?

+1

I tend to prefer a whitelist of what to index instead of a blacklist of what not to index.

--
Marco Martin
Re: Baloo - Not Indexing everything by default
Seems a shame to do this now that things are working so well. Since the .txt-only change, Baloo hasn't bothered me at all.

David
Re: Baloo - Not Indexing everything by default
On Thursday, 16 October 2014, 13:20:57, Vishesh Handa wrote:

Hey guys

Hi Vishesh,

While Baloo performs better than Nepomuk. It does have its share of problems - mostly large text files, and high IO usage. Additionally, users on linux often seem to have the craziest files. Currently, we do not index plain text files which do not have a `.txt` extension, because otherwise we land up indexing genome data and other strange files. (Actual bugs)

How about limiting the size for problematic files? I.e. only smaller text files?

Here Baloo runs quite well. But I'd like it to also index *.txt files.

Anything else that can be done to make it more efficient? In my experience it's already a lot more efficient than Nepomuk. It indexed a lot of text files here, about a million or more. My mails, that is :).

I've been thinking about actually disabling the file indexing by default. However, that might be too radical. Instead, we could only index -
* $HOME - Not including any subfolders.
* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user directories.

Gnome Tracker actually does something quite similar.

Hmmm, I actually don't use these, except for an images folder. I store my files in the categories / directories I want. I usually don't sort by file type, but by purpose. Okay, I have an images folder, but mostly for Digikam, and music and audio meditations I have already split into two main directories.

Thus, for me, the above structure just doesn't fit.

Comments?

I'd rather like Baloo to be *intelligent* about errors, i.e.: if an indexer fails on a file, skip it next time. Optionally, at some point present a list of the files it failed to index to the user, maybe via a non-intrusive summary notification at the end of an indexing cycle. And report each failed file just once in it. Extra points for offering to report a bug with the file. But that is a bit difficult, because it may well be a private file the user does not want to share.

Actually I'd also like to have advanced configuration options. On my Debian the settings are very simplistic: I can just say where not to search. No extension list, no file size restrictions, no nothing. I think this could help users who have problems with extra large text files.

But… I think advanced error handling, i.e. not trying a file that is known to fail again and again and again, might be able to remove the need for further configuration options.

I'd like it to scan for text files and source files though. Just probably with some delay… to avoid I/O load during git checkout or compile runs. Right now I do not seem to be able to set anything.

I'd also like to see what file types it actually indexes. I wonder whether it indexes OpenDocument files, for example, or PDF files. It seems that it finds fewer of my files than Nepomuk did. OK, but PDF it seems to find.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
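The "skip files that failed before, report each failure once" behaviour could be sketched like this. It is an in-memory illustration with hypothetical names. A real indexer would presumably persist the failure list, and might want to retry a file after it changes; neither is shown here.

```python
class FailureLog:
    """Remembers files an extractor failed on and which of those
    have already been reported to the user."""

    def __init__(self):
        self._failed = set()
        self._reported = set()

    def should_try(self, path: str) -> bool:
        return path not in self._failed

    def record_failure(self, path: str) -> None:
        self._failed.add(path)

    def pending_report(self) -> list:
        """Return failures not yet reported, and mark them as reported."""
        new = sorted(self._failed - self._reported)
        self._reported.update(new)
        return new


def run_cycle(paths, extract, log: FailureLog) -> list:
    """Run one indexing cycle; `extract` is any callable that raises
    on failure. Returns the paths that were indexed successfully."""
    indexed = []
    for path in paths:
        if not log.should_try(path):
            continue  # known to fail, do not retry
        try:
            extract(path)
        except Exception:
            log.record_failure(path)
        else:
            indexed.append(path)
    return indexed
```

At the end of a cycle, `log.pending_report()` gives the list for a single summary notification; calling it again in a later cycle returns only failures that are new since then.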
Re: Baloo - Not Indexing everything by default
On Thursday, 16 October 2014, 13:27:02, Marco Martin wrote:

On Thursday 16 October 2014, Vishesh Handa wrote:

* $HOME - Not including any subfolders.
* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user directories.

Gnome Tracker actually does something quite similar.

Comments?

+1

i tend to prefer a whitelist on what to index instead of a blacklist of what to not index

+1 :) (not a dev though).

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: Baloo - Not Indexing everything by default
Hi,

* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user directories.

The only reason I actually index my kf5 folder with all the git clones in it is that Dolphin doesn't properly fall back when searching non-indexed locations, and then it won't find anything (in fact, I'm still a heavy user of KFind because of these shortcomings).

If that were smarter, we could safely limit Baloo to the folders you mentioned.

Cheers,
Kai Uwe
Re: Baloo - Not Indexing everything by default
On 16.10.2014 13:20, Vishesh Handa wrote:

Comments?

I understand the pragmatic motivation behind it, but it seems like a strange step to me. The idea behind indexing is that you can find things regardless of location, so you don't need to be aware of where things are. By making the index selective by location, you reintroduce that need and subvert the usefulness of indexing.

I suppose there's an argument to be made for indexing to be opt-in, though.

Cheers,
Eike
Re: Baloo - Not Indexing everything by default
On Thursday 16 October 2014 13:20:57 Vishesh Handa wrote:

Hey guys

While Baloo performs better than Nepomuk. It does have its share of problems - mostly large text files, and high IO usage. Additionally, users on linux often seem to have the craziest files. Currently, we do not index plain text files which do not have a `.txt` extension, because otherwise we land up indexing genome data and other strange files. (Actual bugs)

the txt being genome data doesn't surprise me[1], but I find it sad that now txt is disabled by default (I use them quite a lot for blog posts). As genome data is really huge wouldn't it make sense to go rather for file size or abort the indexing if it's obvious random gibberish?

Restricting to the XDG dirs is certainly something which could be done, but I also find this unfortunate - my setup is older than those dirs ;-)

Cheers
Martin

[1] Having worked in a lab which did genome sequence analysis and using Plasma on all systems.
Re: Baloo - Not Indexing everything by default
On Thursday 16 October 2014 14:15:15, Martin Gräßlin wrote:

genome data is really huge wouldn't it make sense to go rather for file size or abort the indexing if it's obvious random gibberish?

As the person who mentioned this first (hey, I'm famous ;), I'm guessing that limiting on file size would work in principle. For reference on the sizes, these kinds of files range from tens of M to a few G.

Perhaps a size cutoff would work without having to give up indexing everything (which IMO is a nice feature and shouldn't be disabled).

--
Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79
Re: Baloo - Not Indexing everything by default
On Thursday, 16 October 2014, 14:20:06, Luca Beltrame wrote:

On Thursday 16 October 2014 14:15:15, Martin Gräßlin wrote:

genome data is really huge wouldn't it make sense to go rather for file size or abort the indexing if it's obvious random gibberish?

As the person who mentioned this first (hey, I'm famous ;), I'm guessing that limiting on file size would work in principle. For reference on the sizes, these kind of files range from tens of M to a few G.

Perhaps a size cutoff would work without no longer indexing everything (which IMO is a nice feature and shouldn't be disabled).

Could limiting on file size also be done like this: just index the first, say, 100 KiB or so of a file - instead of not indexing it at all? And in search results perhaps include a hint that it has only been partially indexed?

Or would that be worse than not indexing at all in that case?

For my file index I currently have:

martin@merkaba:~/.local/share/baloo LANG=C du -sch file/* | sort -rh
1.2G total
638M file/position.DB
250M file/postlist.DB
160M file/termlist.DB
103M file/fileMap.sqlite3
2.5M file/fileMap.sqlite3-wal
19M file/record.DB
4.0K file/termlist.baseB
4.0K file/termlist.baseA
4.0K file/record.baseB
4.0K file/record.baseA
4.0K file/postlist.baseB
4.0K file/postlist.baseA
4.0K file/iamchert
32K file/fileMap.sqlite3-shm
12K file/position.baseB
12K file/position.baseA
0 file/flintlock

That's less than the last Nepomuk index:

martin@merkaba:~/.kde/share/apps/nepomuk/repository/main/data/virtuosobackend LANG=C du -sch * | sort -rh
3.1G total
3.1G soprano-virtuoso.db
2.1M soprano-virtuoso.log
8.0K soprano-virtuoso-temp.db
20K missed_flush.txt
0 soprano-virtuoso.trx
0 soprano-virtuoso.pxa
0 soprano-virtuoso.lock

And as it's still performant, I wouldn't care if it indexed some nice *.txt or source files :). Actually I think I would like to be able to do full-text search in these.

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
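The partial-indexing idea (read only the first 100 KiB and remember that the file was truncated) might look like this. It is a sketch with hypothetical names; the 100 KiB figure is the one floated above, and the function works on any binary stream so that it stays independent of how files are opened.

```python
import io

PARTIAL_LIMIT = 100 * 1024  # "the first say 100 KiB"


def read_for_indexing(stream, limit: int = PARTIAL_LIMIT):
    """Read at most `limit` bytes from a binary stream.

    Returns (data, partial), where partial is True when the stream
    held more than `limit` bytes, so a search result can carry a
    'partially indexed' hint."""
    data = stream.read(limit + 1)  # one extra byte reveals truncation
    partial = len(data) > limit
    return data[:limit], partial


# Usage with an in-memory stand-in for a 200 KiB file:
sample = io.BytesIO(b"x" * (200 * 1024))
head, partial = read_for_indexing(sample)
```

Cutting at a fixed byte offset can split a multi-byte character, so the head would have to be decoded leniently (for example with `errors="ignore"`) before tokenizing.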
Re: Baloo - Not Indexing everything by default
As for text files, in the Linux world people don't usually use the .txt extension, especially when writing something like vimwiki or similar. I guess capping the size is a somewhat better solution (1-5 MB is good enough).

As for the folder limitation, that doesn't sound good. People usually organize files in their own way; unless we are on some mobile phone that doesn't expose a filesystem interface (where the interface forces people to use those locations), it doesn't work.

I wonder if Baloo could somehow estimate whether some directory is problematic, and give the user a warning about that. And could Baloo even index large text files partially? Then it would never guess wrong.

On Thu, Oct 16, 2014 at 8:15 AM, Martin Gräßlin mgraess...@kde.org wrote:

On Thursday 16 October 2014 13:20:57 Vishesh Handa wrote:

Hey guys

While Baloo performs better than Nepomuk. It does have its share of problems - mostly large text files, and high IO usage. Additionally, users on linux often seem to have the craziest files. Currently, we do not index plain text files which do not have a `.txt` extension, because otherwise we land up indexing genome data and other strange files. (Actual bugs)

the txt being genome data doesn't surprise me[1], but I find it sad that now txt is disabled by default (I use them quite a lot for blog posts). As genome data is really huge wouldn't it make sense to go rather for file size or abort the indexing if it's obvious random gibberish?

Restricting to the XDG dirs is certainly something which could be done, but I also find this unfortunate - my setup is older than those dirs ;-)

Cheers
Martin

[1] Having worked in a lab which did genome sequence analysis and using Plasma on all systems.
Re: Baloo - Not Indexing everything by default
Couldn't we add a .baloo file to specify attributes for the current directory, file or subdirectories about not indexing?

For example, a .baloo file:

skip_all
skip_if_greater 1m
skip_if_smaller 50k
skip_ext txt jpg

With a --subdirs flag like:

Skip_ext --subdirs txt jpg
Skip_all --subdirs

On Oct 16, 2014 2:21 PM, Vishesh Handa m...@vhanda.in wrote:

Hey guys

While Baloo performs better than Nepomuk. It does have its share of problems - mostly large text files, and high IO usage. Additionally, users on linux often seem to have the craziest files. Currently, we do not index plain text files which do not have a `.txt` extension, because otherwise we land up indexing genome data and other strange files. (Actual bugs)

I've been thinking about actually disabling the file indexing by default. However, that might be too radical. Instead, we could only index -
* $HOME - Not including any subfolders.
* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user directories.

Gnome Tracker actually does something quite similar.

Comments?

--
Vishesh Handa