Currently it is not possible, and that is not normal... I do not know
whether QDBM (which stores file names associated with keywords) can be
set to split a string like "ziegler-nichols" into "ziegler" and "nichols"
automatically for searching, or whether we need to split strings ourselves.


This question was raised before, regarding filenames with dashes and
underscores in them. That time, the reply was that in C code, dashes and
underscores often have a meaning. I think we'll need to be context-sensitive
in this case: regular documents and filenames usually require word
splitting, while source code usually doesn't. (However, in C, the string
"difference=alpha-beta" actually has three interesting lexemes, and a dash
should not create the lexeme "alpha-beta" here either.)
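
To make the context-sensitive idea concrete, here is a rough sketch in
Python (not Tracker's actual C code). The names tokenize, _WORD and _IDENT
are made up for illustration, and the two regular expressions are only an
assumption about what "word" and "lexeme" should mean in each context.

```python
import re

# Hypothetical helpers for illustration; not part of Tracker, QDBM or libstemmer.

# Prose and filenames: a word is a run of letters/digits, so both
# dashes and underscores act as separators.
_WORD = re.compile(r"[^\W_]+")

# Source code: underscores belong to identifiers, but "-", "=" and
# other operators still separate lexemes.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text, is_source_code=False):
    """Return the lowercased search terms found in text."""
    pattern = _IDENT if is_source_code else _WORD
    return [match.group(0).lower() for match in pattern.finditer(text)]
```

With this, tokenize("ziegler-nichols") gives the two words separately, and
tokenize("difference=alpha-beta", is_source_code=True) gives three lexemes
without ever producing "alpha-beta", while an identifier such as "my_var"
stays in one piece in source mode.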

What I also dislike about libstemmer (which aims to "reduce" strings to
their stems, to ignore plurals for instance) is that it does not ignore
accented characters, so if I have a file which contains "éléphant",
then "élephant" or "elephant" will not be found. "éléphant" is the
correct spelling, but it happens very often that French people miss
some accents or add superfluous ones... and it is the same problem in
other languages.


Unfortunately, this is not always applicable. For instance, in Swedish
there's a big difference between the words "öst" and "ost", which mean
"east" and "cheese", respectively. However, "café" is often spelled "cafe",
with the same meaning.
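
A small Python sketch shows both sides of the problem. It assumes accent
folding is done by Unicode decomposition followed by dropping combining
marks; fold_accents is a made-up name, and this is not something
libstemmer or Tracker provides.

```python
import unicodedata


def fold_accents(word):
    """Strip diacritics: decompose to NFD, then drop combining marks.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Folding makes "éléphant", "élephant" and "elephant" compare equal, and
likewise "café" and "cafe", which is what we want. But it also makes "öst"
and "ost" compare equal, which is exactly the Swedish collision described
above.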

I'm not sure at all how to handle this.
_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
