On Wed, Jun 18, 2008 at 02:49:48PM +0200, Sabbiolina wrote:
> www.google.com is treated only as a single word? Why not produce multiple
> tokens like www.google.com, www, ., google, ., com? (Obviously www and . can
> be nulled or stopworded.)
You wouldn't want to get the token ".". It's not a t
Sabbiolina,
you have two options:
1. Write your very own parser
2. Write a dictionary which breaks the host into parts
Fortunately, instead of writing the dictionary in option 2 yourself, you
can use our dict_regex dictionary
(http://vo.astronet.ru/arxiv/dict_regex.html).
Oleg
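
To make the idea of option 2 concrete, here is a minimal Python sketch of
what "breaking a host into parts" could produce. It is not dict_regex and
it is not a PostgreSQL dictionary; the function names split_host and
host_lexemes and the stop list STOP_LABELS are hypothetical, chosen only
for illustration.

```python
# Hypothetical stop list: labels that are too common to be worth indexing.
STOP_LABELS = {"www", "com"}


def split_host(host):
    """Return the full host followed by its dot-separated labels."""
    full = host.lower()
    # Drop empty strings so a trailing dot does not yield an empty token.
    labels = [label for label in full.split(".") if label]
    # dict.fromkeys removes duplicates while preserving order, so a
    # single-label host such as "localhost" is emitted only once.
    return list(dict.fromkeys([full] + labels))


def host_lexemes(host, stop=STOP_LABELS):
    """Return the tokens of split_host() that are not in the stop list."""
    return [token for token in split_host(host) if token not in stop]


if __name__ == "__main__":
    print(split_host("www.google.com"))
    print(host_lexemes("www.google.com"))
```

Keeping the full host as the first token means an exact search for
www.google.com still matches, while the extra token "google" makes the
shorter search match as well. The separator "." is never emitted as a token.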
On Wed, 18 Jun 2008, Sabbiolina wrote:
Hello,
I've seen that the default parser for the full-text search can identify
e-mail addresses, hosts, URLs… but I have a serious problem with it:
Suppose I index the following sentence: "the search engine I use the most is
www.google.com".
If I then search for "google", no result is found.
Instead