Re: [HACKERS] dot to be considered as a word delimiter?

2009-06-02 Thread Kenneth Marshall
On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> Sushant Sinha sushant...@gmail.com wrote:
>
> > I think that dot should be considered as a word delimiter because
> > when a dot is not followed by a space, most of the time it is a
> > typing error. Besides, there are not many valid English words that
> > have a dot in between.
>
> It's not treating it as an English word, but as a host name.
>
> select ts_debug('english', 'Mr.J.Sai Deepak');
>  ts_debug
> ---
>  (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
>  (blank,Space symbols, ,{},,)
>  (asciiword,Word, all ASCII,Deepak,{english_stem},english_stem,{deepak})
> (3 rows)
>
> You could run it through a dictionary which would deal with host
> tokens differently.  Just be aware of what you'll be doing to
> www.google.com if you run into it.
>
> I hope this helps.
>
> -Kevin

In our uses for full text indexing, it is much more important to
be able to find host names and URLs than to find mistyped names.
My two cents.

Cheers,
Ken

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dot to be considered as a word delimiter?

2009-06-02 Thread Sushant Sinha
Fair enough. I agree that there is a valid need for returning such tokens as
a host. But I think there is definitely a need to break it down into
individual words. This will help in cases when a document is missing a space
in between the words.


So what we can do is: return the entire compound word as Host and also break
it down into individual words. I can put up a patch for this if you guys
agree.

Returning multiple tokens for the same word is a feature of the text search
parser as explained in the documentation here:
http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html

Thanks,
Sushant.
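
The multiple-token behaviour referred to above can be checked with a
short query. This is only an illustrative sketch, not part of the
original message; it assumes a server with the built-in english
configuration and uses the sample string from the parsers documentation.

```sql
-- Ask the default parser how it tokenizes a URL-like string.
-- alias is the token type, token is the matched text.
SELECT alias, token, lexemes
FROM ts_debug('english', 'example.com/stuff/index.html');
```

According to that documentation, the result should contain a url token
for the whole string together with host and url_path tokens for its
parts, so overlapping tokens for one piece of text are already
something the parser produces. Exact output may differ between
versions.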




Re: [HACKERS] dot to be considered as a word delimiter?

2009-06-02 Thread Kevin Grittner
Sushant Sinha sushant...@gmail.com wrote:

> So what we can do is: return the entire compound word as Host and
> also break it down into individual words.

So, pretty much like we handle hyphenation?

-Kevin
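
For comparison, the existing handling of hyphenated words can be
inspected the same way as the earlier ts_debug call. This is a hedged
sketch rather than anything posted in the thread; the sample word is an
arbitrary choice.

```sql
-- Show how the default parser tokenizes a hyphenated compound.
SELECT alias, token, lexemes
FROM ts_debug('english', 'up-to-date');
```

On releases that use the token type names seen in this thread, the
output should list the whole compound (asciihword) as well as each
component (hword_asciipart), with the hyphens reported as blank tokens.
The proposal would have dotted strings treated analogously: the full
host token plus its individual parts.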



Re: [HACKERS] dot to be considered as a word delimiter?

2009-06-02 Thread Kenneth Marshall
On Tue, Jun 02, 2009 at 04:40:51PM -0400, Sushant Sinha wrote:
> Fair enough. I agree that there is a valid need for returning such tokens as
> a host. But I think there is definitely a need to break it down into
> individual words. This will help in cases when a document is missing a space
> in between the words.
>
> So what we can do is: return the entire compound word as Host and also break
> it down into individual words. I can put up a patch for this if you guys
> agree.
>
> Returning multiple tokens for the same word is a feature of the text search
> parser as explained in the documentation here:
> http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html
>
> Thanks,
> Sushant.

+1

Ken



Re: [HACKERS] dot to be considered as a word delimiter?

2009-06-01 Thread Kevin Grittner
Sushant Sinha sushant...@gmail.com wrote:

> I think that dot should be considered as a word delimiter because
> when a dot is not followed by a space, most of the time it is a
> typing error. Besides, there are not many valid English words that
> have a dot in between.

It's not treating it as an English word, but as a host name.

select ts_debug('english', 'Mr.J.Sai Deepak');
 ts_debug
---
 (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
 (blank,Space symbols, ,{},,)
 (asciiword,Word, all ASCII,Deepak,{english_stem},english_stem,{deepak})
(3 rows)
 
You could run it through a dictionary which would deal with host
tokens differently.  Just be aware of what you'll be doing to
www.google.com if you run into it.
 
I hope this helps.
 
-Kevin
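
To make the www.google.com warning concrete, here is a rough sketch of
one way to treat host tokens differently. It is not from the original
message and is cruder than a custom dictionary: it simply stops
indexing host tokens. The configuration name english_nohost is a
hypothetical name chosen for this example.

```sql
-- Hypothetical configuration name: english_nohost.
-- Work on a copy so the built-in english configuration stays untouched.
CREATE TEXT SEARCH CONFIGURATION english_nohost (COPY = english);

-- Remove the dictionary mapping for host tokens, so they are dropped
-- instead of being indexed.
ALTER TEXT SEARCH CONFIGURATION english_nohost
    DROP MAPPING FOR host;

-- Compare both configurations on text containing a real host name.
SELECT to_tsvector('english', 'see www.google.com')        AS with_host,
       to_tsvector('english_nohost', 'see www.google.com') AS without_host;
```

With the mapping dropped, the second column should no longer contain
www.google.com, which is the trade-off to keep in mind: any change to
how host tokens are handled applies to genuine host names as well as to
strings like Mr.J.Sai.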
