Re: Baloo - Not Indexing everything by default

2014-10-17 Thread Todd Rme
On Thu, Oct 16, 2014 at 2:15 PM, Martin Gräßlin mgraess...@kde.org wrote:
 On Thursday 16 October 2014 13:20:57 Vishesh Handa wrote:
 Hey guys

 While Baloo performs better than Nepomuk. It does have its share of
 problems - mostly large text files, and high IO usage. Additionally, users
 on linux often seem to have the craziest files. Currently, we do not index
 plain text files which do not have a `.txt` extension, because otherwise we
 land up indexing genome data and other strange files. (Actual bugs)

 the txt being genome data doesn't surprise me[1], but I find it sad that now
 txt is disabled by default (I use them quite a lot for blog posts). As genome
 data is really huge wouldn't it make sense to go rather for file size or abort
 the indexing if it's obvious random gibberish?

Or skip it if it looks like CSV or is mostly numbers?
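
A rough sketch of what such a check could look like, in Python for brevity. Nothing here is Baloo code: the function name `looks_like_data` and both thresholds are assumptions picked for illustration, and the check would only run on a small sample from the start of the file. Note that it would not catch sequence data made of letters, so it complements a size limit rather than replacing it.

```python
import re

# Matches integers, decimals and simple scientific notation.
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?")

def looks_like_data(sample, max_numeric_ratio=0.5, min_delimited_ratio=0.9):
    """Guess whether a text sample is tabular/numeric data rather than prose.

    Hypothetical heuristic; both thresholds are illustrative guesses.
    """
    lines = [line for line in sample.splitlines() if line.strip()]
    if not lines:
        return False

    # Signal 1: more than half of the tokens are plain numbers.
    tokens = [t for t in re.split(r"[\s,;]+", sample) if t]
    numeric = sum(1 for t in tokens if NUMBER.fullmatch(t))
    if tokens and numeric / len(tokens) > max_numeric_ratio:
        return True

    # Signal 2: nearly every line has the same non-zero number of
    # delimiters, which is typical for CSV/TSV exports.
    if len(lines) >= 3:
        for delimiter in (",", "\t", ";"):
            counts = [line.count(delimiter) for line in lines]
            most_common = max(set(counts), key=counts.count)
            if most_common > 0:
                if counts.count(most_common) / len(lines) >= min_delimited_ratio:
                    return True
    return False
```

Ordinary prose with the occasional comma falls through both checks, because the comma count varies from line to line.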
___
Plasma-devel mailing list
Plasma-devel@kde.org
https://mail.kde.org/mailman/listinfo/plasma-devel


Re: Baloo - Not Indexing everything by default

2014-10-17 Thread Vishesh Handa
On Fri, Oct 17, 2014 at 2:11 AM, Ömer Fadıl USTA omeru...@gmail.com wrote:

 Couldnt we add a .baloo file to specify for attributes for current
 directory, file or subdirectories about not indexing ?
 For example of a .baloo file
 skip_all
 skip_if_greater 1m
 skip_if_smaller 50k
 skip_ext txt jpg
 With --subdirs flag like
 Skip_ext --subdirs txt jpg
 Skip_all --subdirs


This does seem like something I would want at some point, except that
instead of having a .baloo file, I would like us to use the xattrs of the
folder. Maybe both? Anyway, this would still only be for advanced users.

I've added a TODO about this to todo.kde.org. Hopefully, we can get a more
advanced newcomer to try and hack on this.

-- 
Vishesh Handa


Re: Baloo - Not Indexing everything by default

2014-10-17 Thread Eike Hein



On 17.10.2014 18:24, Vishesh Handa wrote:

About gibberish. It's hard to figure out what gibberish is. I think I'll
add some code that we only index the first 20 characters of each word.
That should help to a certain extent.


Define word - Chinese and Japanese (unless mostly kana) often
don't have whitespace between words.
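
To make the concern concrete, here is a minimal sketch of the truncation idea under the simplest possible assumption, namely that a "word" is whatever sits between whitespace. The names `MAX_TERM_LENGTH` and `truncated_terms` are made up for this example and say nothing about how Baloo tokenizes.

```python
MAX_TERM_LENGTH = 20  # the figure proposed above; characters, not bytes

def truncated_terms(text, limit=MAX_TERM_LENGTH):
    """Split on whitespace and keep at most `limit` characters of each token."""
    return [token[:limit] for token in text.split()]
```

With whitespace splitting, a long run of gibberish is cut down to one 20-character term, which is the intended effect. A Japanese sentence written without spaces is also a single token, though, so only its first 20 characters would be indexed. A real implementation would presumably need a proper word segmenter for such scripts.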




--
Vishesh Handa


Cheers,
Eike


Re: Baloo - Not Indexing everything by default

2014-10-17 Thread Rex Dieter
Vishesh Handa wrote:

 ... Instead, we could only index -
 
 * $HOME - Not including any subfolders.
 * Desktop, Documents, Videos, Pictures and Music. All of these are xdg
 user directories.

+1 Yes, please!

-- Rex



Re: Re: Baloo - Not Indexing everything by default

2014-10-17 Thread Martin Gräßlin
On Friday 17 October 2014 18:24:48 Vishesh Handa wrote:
 On Thu, Oct 16, 2014 at 2:15 PM, Martin Gräßlin mgraess...@kde.org wrote:
  the txt being genome data doesn't surprise me[1], but I find it sad that
  now
  txt is disabled by default (I use them quite a lot for blog posts). As
  genome
  data is really huge wouldn't it make sense to go rather for file size or
  abort
  the indexing if it's obvious random gibberish?
 
 We currently have a hard limit of 50mb on 'text/plain' files. However this
 does not include log files, which have a separate mimetype, Perhaps it
 would really be good to reduce it to about 5 mb.

I think 5 MB could be a better limit. That's a huuuge text document. Someone
with such a huge document probably does not need Baloo to get to it, or the
keywords will already be found in the first - what is that? 1000 pages?
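
A back-of-the-envelope check of that page count. The figures for characters per page and bytes per character are assumptions (a dense printed page, mostly ASCII text), not measurements.

```python
LIMIT_BYTES = 5 * 1024 * 1024  # the proposed 5 MB cap, read as 5 MiB
CHARS_PER_PAGE = 3000          # assumption: a densely printed page
BYTES_PER_CHAR = 1             # assumption: mostly ASCII text in UTF-8

pages = LIMIT_BYTES // (CHARS_PER_PAGE * BYTES_PER_CHAR)
print(pages)
```

Under these assumptions the cap works out to roughly 1,700 pages, so the guess of 1000 pages is in the right range.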

Cheers
Martin



Re: Baloo - Not Indexing everything by default

2014-10-17 Thread Mark Gaiser
On Thu, Oct 16, 2014 at 1:20 PM, Vishesh Handa m...@vhanda.in wrote:
 Hey guys

 While Baloo performs better than Nepomuk. It does have its share of problems
 - mostly large text files, and high IO usage. Additionally, users on linux
 often seem to have the craziest files. Currently, we do not index plain text
 files which do not have a `.txt` extension, because otherwise we land up
 indexing genome data and other strange files. (Actual bugs)

 I've been thinking about actually disabling the file indexing by default.
 However, that might be too radical. Instead, we could only index -

 * $HOME - Not including any subfolders.
 * Desktop, Documents, Videos, Pictures and Music. All of these are xdg user
 directories.

 Gnome Tracker actually does something quite similar.

 Comments?

Hi Vishesh,

First of all, please don't even consider turning it off by default.
Baloo might have some issues (I can't really speak for the Plasma 5.x
experience since I'm still on KDE 4.xx), but having a desktop search
application is really worth a lot! It's just a difficult area to get
right.

Also, I don't think you should limit the indexer to the XDG folders. I
personally have quite some data in $HOME, specifically in subfolders.

So it sounds like the indexer job needs to learn some new tricks then.
Here's an idea off the top of my head that might work.

1. Delay indexing the content of files that any application already has
open. I don't know which C++/Qt function you can use for this, but the idea
here is to not index the files/folders that you see when you type
iostat. Store the list for later indexing. To still have some data
there, just index the filenames.

2. What's left is everything _not_ currently in iostat which you could
potentially index. Here you should filter out any file types that you
don't have indexers for.

3. Now you're left with a list of files which could potentially be
indexed. Here you should probably filter out those that are bigger
than 5 MB.

4. Everything that's left: index!

5. While indexing, periodically look at iostat to see if any process
(other than the Baloo processes) has new files/folders open, like a
compile job. If something like that is detected, then you should
probably follow step 1 again :)


This would be a nice indexer, right?
The only problem I see here is doing an iostat call in code. I have
no clue if there is a POSIX C/C++ function for that.
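
A note on the tooling: iostat reports per-device I/O statistics, not files. The list of files that processes have open is what lsof prints, and on Linux the same information can be read from the `/proc/<pid>/fd` symlinks without calling an external tool. The sketch below shows steps 1 to 4 on that basis, in Python for brevity. It is Linux-only, the names `open_files` and `split_candidates` are invented for this example, and selecting indexers by file extension is a simplification.

```python
import os

def open_files(proc_root="/proc"):
    """Return the set of filesystem paths currently open by visible processes.

    Linux-specific: reads the /proc/<pid>/fd symlinks. Processes whose fd
    directory cannot be read (other users, already exited) are skipped.
    """
    paths = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target.startswith("/"):
                paths.add(target)  # ignores sockets, pipes, anon inodes
    return paths

MAX_SIZE = 5 * 1024 * 1024  # the 5 MB cut-off from step 3

def split_candidates(candidates, busy, supported_exts, max_size=MAX_SIZE):
    """Apply steps 1-3 and return (index_now, index_later)."""
    index_now, index_later = [], []
    for path in candidates:
        if path in busy:
            index_later.append(path)  # step 1: some application has it open
        elif os.path.splitext(path)[1].lower() not in supported_exts:
            continue  # step 2: no indexer for this file type
        elif os.path.getsize(path) > max_size:
            continue  # step 3: too large
        else:
            index_now.append(path)  # step 4: index it
    return index_now, index_later
```

Step 5 would then amount to calling `open_files()` again every so often and moving newly busy paths from the first list to the second.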


Baloo - Not Indexing everything by default

2014-10-16 Thread Vishesh Handa
Hey guys

While Baloo performs better than Nepomuk, it does have its share of
problems - mostly large text files and high IO usage. Additionally, users
on Linux often seem to have the craziest files. Currently, we do not index
plain text files which do not have a `.txt` extension, because otherwise we
land up indexing genome data and other strange files. (Actual bugs)

I've been thinking about actually disabling the file indexing by default.
However, that might be too radical. Instead, we could only index -

* $HOME - Not including any subfolders.
* Desktop, Documents, Videos, Pictures and Music. All of these are xdg user
directories.
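
For illustration only, the rule above could be expressed roughly as follows (Python for brevity). The function name `should_index` is made up, and the folder names are the default English ones, which is an assumption: a real implementation would read the actual locations from `~/.config/user-dirs.dirs`.

```python
from pathlib import Path

XDG_DIR_NAMES = ("Desktop", "Documents", "Videos", "Pictures", "Music")

def should_index(path, home, xdg_dirs=None):
    """True if the file is directly in `home` or anywhere below an XDG user dir."""
    path, home = Path(path), Path(home)
    if xdg_dirs is None:
        # Assumption: default English names located directly under $HOME.
        xdg_dirs = [home / name for name in XDG_DIR_NAMES]
    if path.parent == home:
        return True  # $HOME itself, subfolders not included
    return any(Path(d) in path.parents for d in xdg_dirs)
```

So `~/notes.txt` and `~/Documents/a/b.odt` would be indexed, while `~/src/kf5/a.cpp` would not.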

Gnome Tracker actually does something quite similar.

Comments?

--
Vishesh Handa


Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Marco Martin
On Thursday 16 October 2014, Vishesh Handa wrote:

 * $HOME - Not including any subfolders.
 * Desktop, Documents, Videos, Pictures and Music. All of these are xdg user
 directories.
 
 Gnome Tracker actually does something quite similar.
 
 Comments?

+1
I tend to prefer a whitelist of what to index over a blacklist of what
not to index.


-- 
Marco Martin


Re: Baloo - Not Indexing everything by default

2014-10-16 Thread David Edmundson
Seems a shame to do this now that things are working so well.
Since the .txt-only change, Baloo hasn't bothered me at all.

David


Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Martin Steigerwald
On Thursday, 16 October 2014, 13:20:57, Vishesh Handa wrote:
 Hey guys

Hi Vishesh,

 While Baloo performs better than Nepomuk. It does have its share of
 problems - mostly large text files, and high IO usage. Additionally, users
 on linux often seem to have the craziest files. Currently, we do not index
 plain text files which do not have a `.txt` extension, because otherwise we
 land up indexing genome data and other strange files. (Actual bugs)

How about limiting the size for problematic files, i.e. indexing only smaller
text files? Here Baloo runs quite well. But I'd like it to also index *.txt files.

Is there anything else that can be done to make it more efficient? In my
experience it's already a lot more efficient than Nepomuk. It indexed a lot of
text files here, about a million or more. My mails, that is :).
 
 I've been thinking about actually disabling the file indexing by default.
 However, that might be too radical. Instead, we could only index -
 
 * $HOME - Not including any subfolders.
 * Desktop, Documents, Videos, Pictures and Music. All of these are xdg user
 directories.
 
 Gnome Tracker actually does something quite similar.

Hmmm, I actually don't use these, except for an images folder. I store my files
in categories / directories I want. I usually don't sort by file type, but by
purpose - okay, I have an images folder, but mostly for Digikam, and music and
audio meditations I have already split into two main directories. Thus, for
me, the above structure just doesn't fit.
 
 Comments?

I'd rather like Baloo to be *intelligent* about errors, i.e.:

If an indexer fails on a file, skip it next time. Optionally, at some point
present a list of the files it failed to index to the user, maybe via a
non-intrusive summary notification at the end of an indexing cycle, and report
each failed file just once in it.

Extra points for offering to report a bug with the file. But that is a bit
difficult, because it may well be a private file the user does not want to share.

Actually, I'd also like to have advanced configuration options. On my Debian the
settings are very simplistic: I can just say where not to search. No extension
list, no file size restrictions, no nothing. I think this could help users who
have problems with extra large text files.

But… I think advanced error handling, i.e. not trying a file that is known
to fail again and again and again, might be able to circumvent the need for
further configuration options.
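
A minimal sketch of that kind of error memory, in Python for brevity. Everything here is hypothetical: the class name `FailedFiles`, the JSON store, and the assumption that an indexer failure shows up as an exception (a crashing extractor process would need to be detected differently). A file is retried only after its modification time changes.

```python
import json
import os

class FailedFiles:
    """Remember files an indexer failed on, keyed by path and modification time."""

    def __init__(self, store_path):
        self.store_path = store_path
        try:
            with open(store_path, encoding="utf-8") as fh:
                self.failed = json.load(fh)
        except (OSError, ValueError):
            self.failed = {}  # no store yet, or unreadable: start empty

    def should_skip(self, path):
        # Retry only if the file changed since the recorded failure.
        return self.failed.get(path) == os.path.getmtime(path)

    def record_failure(self, path):
        self.failed[path] = os.path.getmtime(path)
        with open(self.store_path, "w", encoding="utf-8") as fh:
            json.dump(self.failed, fh)

def index_all(paths, indexer, failures):
    """Run `indexer` on each path; return the files that newly failed this cycle."""
    newly_failed = []
    for path in paths:
        if failures.should_skip(path):
            continue  # known to fail and unchanged since then
        try:
            indexer(path)
        except Exception:
            failures.record_failure(path)
            newly_failed.append(path)  # appears once, in this cycle's summary
    return newly_failed
```

The list returned by `index_all` is what an end-of-cycle summary notification could show, and since known failures are skipped on later runs, each file is reported only once.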

I'd like to scan it for text files and source files though. Just probably with
some delay… to avoid I/O load during git checkout or compile runs. Right now
I do not seem to be able to set anything. I'd also like to see what file types
it actually indexes. I wonder whether it indexes OpenDocument files, for
example, or PDF files. Judging from my files, it seems to find less than
Nepomuk. OK, but PDF it seems to find.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Martin Steigerwald
On Thursday, 16 October 2014, 13:27:02, Marco Martin wrote:
 On Thursday 16 October 2014, Vishesh Handa wrote:
  * $HOME - Not including any subfolders.
  * Desktop, Documents, Videos, Pictures and Music. All of these are xdg
  user
  directories.
  
  Gnome Tracker actually does something quite similar.
  
  Comments?
 
 +1
 i tend to prefer a whitelist on what to index instead of a blacklist of what
 to not index

+1 :) (not a dev though).

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Kai Uwe Broulik
Hi,

 * Desktop, Documents, Videos, Pictures and Music. All of these are xdg user
 directories.

The only reason I actually index my kf5 folder with all the git clones in it
is that Dolphin doesn't properly fall back when searching non-indexed
locations, and then it won't find anything (in fact, I'm still a heavy user of
KFind because of these shortcomings). If that were smarter, we could safely
limit Baloo to the folders you mentioned.

Cheers,
Kai Uwe


Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Eike Hein



On 16.10.2014 13:20, Vishesh Handa wrote:

Comments?


I understand the pragmatic motivation behind it, but it seems
like a strange step to me. The idea behind indexing is that you
can find things regardless of location, so you don't need to be
aware of where things are. By making the index selective by
location, you reintroduce that need and subvert the usefulness
of indexing.

I suppose there's an argument to be made for indexing to be
opt-in, though.


Cheers,
Eike


Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Martin Gräßlin
On Thursday 16 October 2014 13:20:57 Vishesh Handa wrote:
 Hey guys
 
 While Baloo performs better than Nepomuk. It does have its share of
 problems - mostly large text files, and high IO usage. Additionally, users
 on linux often seem to have the craziest files. Currently, we do not index
 plain text files which do not have a `.txt` extension, because otherwise we
 land up indexing genome data and other strange files. (Actual bugs)

The txt being genome data doesn't surprise me[1], but I find it sad that now
txt is disabled by default (I use them quite a lot for blog posts). As genome
data is really huge, wouldn't it make more sense to go by file size, or to abort
the indexing if it's obviously random gibberish?

Restricting to the XDG dirs is certainly something which could be done, but I
also find this unfortunate - my setup is older than those dirs ;-)

Cheers
Martin

[1] Having worked in a lab which did genome sequence analysis and using Plasma 
on all systems.



Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Luca Beltrame
On Thursday, 16 October 2014, 14:15:15, Martin Gräßlin wrote:

 genome data is really huge wouldn't it make sense to go rather for file
 size or abort the indexing if it's obvious random gibberish?

As the person who mentioned this first (hey, I'm famous ;), I'm guessing that
limiting on file size would work in principle.

For reference on the sizes, these kinds of files range from tens of MB to a few
GB. Perhaps a size cutoff would do the job without having to give up indexing
everything (which IMO is a nice feature and shouldn't be disabled).

-- 
Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79



Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Martin Steigerwald
On Thursday, 16 October 2014, 14:20:06, Luca Beltrame wrote:
 On Thursday, 16 October 2014, 14:15:15, Martin Gräßlin wrote:
  genome data is really huge wouldn't it make sense to go rather for file
  size or abort the indexing if it's obvious random gibberish?
 
 As the person who mentioned this first (hey, I'm famous ;), I'm guessing
 that limiting on file size would work in principle.
 
 For reference on the sizes, these kind of files range from tens of M to a
 few G. Perhaps a size cutoff would work without no longer indexing
 everything (which IMO is a nice feature and shouldn't be disabled).

Could limiting on file size also be done like this:

Just index the first, say, 100 KiB or so of a file - instead of not indexing it
at all? And in the search results, perhaps include a hint that it has only been
partially indexed? Or would that be worse than not indexing at all in that case?
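
Roughly what I have in mind, as a sketch in Python. The names `PARTIAL_LIMIT` and `read_for_indexing` are invented for this example, and treating the content as UTF-8 is an assumption.

```python
PARTIAL_LIMIT = 100 * 1024  # 100 KiB, the figure suggested above

def read_for_indexing(path, limit=PARTIAL_LIMIT):
    """Return (text, partial): at most `limit` bytes decoded, plus a truncation flag."""
    with open(path, "rb") as fh:
        data = fh.read(limit + 1)  # one extra byte reveals whether more follows
    partial = len(data) > limit
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    text = data[:limit].decode("utf-8", errors="ignore")
    return text, partial
```

The `partial` flag is what would have to be stored alongside the document so that search results can carry the "only partially indexed" hint.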

For my file index I currently have:

martin@merkaba:~/.local/share/baloo LANG=C du -sch file/* | sort -rh
1.2G    total
638M    file/position.DB
250M    file/postlist.DB
160M    file/termlist.DB
103M    file/fileMap.sqlite3
2.5M    file/fileMap.sqlite3-wal
19M     file/record.DB
4.0K    file/termlist.baseB
4.0K    file/termlist.baseA
4.0K    file/record.baseB
4.0K    file/record.baseA
4.0K    file/postlist.baseB
4.0K    file/postlist.baseA
4.0K    file/iamchert
32K     file/fileMap.sqlite3-shm
12K     file/position.baseB
12K     file/position.baseA
0       file/flintlock

That's less than the last Nepomuk index:

martin@merkaba:~/.kde/share/apps/nepomuk/repository/main/data/virtuosobackend LANG=C du -sch * | sort -rh
3.1G    total
3.1G    soprano-virtuoso.db
2.1M    soprano-virtuoso.log
8.0K    soprano-virtuoso-temp.db
20K     missed_flush.txt
0       soprano-virtuoso.trx
0       soprano-virtuoso.pxa
0       soprano-virtuoso.lock


And as it's still performant, I wouldn't mind if it indexed some nice *.txt or
source files :). Actually, I think I would like to be able to full-text search in
these.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Weng Xuetian
As for text files, in the Linux world people don't usually use the .txt
extension, especially when writing something like vimwiki or something similar.

I guess capping the size is a somewhat better solution (1-5 MB is good enough).

And as for the folder limitation, that doesn't sound good. People usually
organize files in their own way. Unless we are on some mobile phone that
doesn't expose a filesystem interface (where the interface forces people to use
those locations), it doesn't work.

I wonder if Baloo could somehow estimate whether some directory is problematic,
and give the user a warning about that.

And could Baloo even index large text files partially? Then it would never
guess wrong.

On Thu, Oct 16, 2014 at 8:15 AM, Martin Gräßlin mgraess...@kde.org wrote:

 On Thursday 16 October 2014 13:20:57 Vishesh Handa wrote:
  Hey guys
 
  While Baloo performs better than Nepomuk. It does have its share of
  problems - mostly large text files, and high IO usage. Additionally,
 users
  on linux often seem to have the craziest files. Currently, we do not
 index
  plain text files which do not have a `.txt` extension, because otherwise
 we
  land up indexing genome data and other strange files. (Actual bugs)

 the txt being genome data doesn't surprise me[1], but I find it sad that
 now
 txt is disabled by default (I use them quite a lot for blog posts). As
 genome
 data is really huge wouldn't it make sense to go rather for file size or
 abort
 the indexing if it's obvious random gibberish?

 Restricting to the XDG dirs is certainly something which could be done,
 but I
 also find this unfortunate - my setup is older than those dirs ;-)

 Cheers
 Martin

 [1] Having worked in a lab which did genome sequence analysis and using
 Plasma
 on all systems.




Re: Baloo - Not Indexing everything by default

2014-10-16 Thread Ömer Fadıl USTA
Couldn't we add a .baloo file to specify attributes for the current
directory, files or subdirectories about what not to index?
For example, a .baloo file:
skip_all
skip_if_greater 1m
skip_if_smaller 50k
skip_ext txt jpg
With a --subdirs flag, like:
skip_ext --subdirs txt jpg
skip_all --subdirs
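
To show that such a file would be cheap to read, here is a small parser sketch in Python. The format exists only in this mail, so everything is an assumption: the function names, the case-insensitive directive names, and reading `k`/`m`/`g` as powers of 1024.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def parse_size(text):
    """Turn '50k' or '1m' into bytes; a bare number is taken as bytes."""
    text = text.lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def parse_baloo_rules(content):
    """Parse the hypothetical .baloo format into a list of rule dicts."""
    rules = []
    for line in content.splitlines():
        words = line.split()
        if not words:
            continue  # blank line
        name, args = words[0].lower(), words[1:]
        subdirs = "--subdirs" in args
        args = [a for a in args if a != "--subdirs"]
        if name == "skip_all":
            rule = {"rule": name}
        elif name in ("skip_if_greater", "skip_if_smaller"):
            rule = {"rule": name, "bytes": parse_size(args[0])}
        elif name == "skip_ext":
            rule = {"rule": name, "exts": {a.lower() for a in args}}
        else:
            raise ValueError(f"unknown directive: {words[0]}")
        rule["subdirs"] = subdirs  # whether the rule also applies below this directory
        rules.append(rule)
    return rules
```

The indexer would then evaluate the resulting rules for each file in the directory, and for files further down only those rules that have `subdirs` set.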
 On Oct 16, 2014 2:21 PM, Vishesh Handa m...@vhanda.in wrote:

 Hey guys

 While Baloo performs better than Nepomuk. It does have its share of
 problems - mostly large text files, and high IO usage. Additionally, users
 on linux often seem to have the craziest files. Currently, we do not index
 plain text files which do not have a `.txt` extension, because otherwise we
 land up indexing genome data and other strange files. (Actual bugs)

 I've been thinking about actually disabling the file indexing by default.
 However, that might be too radical. Instead, we could only index -

 * $HOME - Not including any subfolders.
 * Desktop, Documents, Videos, Pictures and Music. All of these are xdg
 user directories.

 Gnome Tracker actually does something quite similar.

 Comments?

 --
 Vishesh Handa


