Re: [HACKERS] dot to be considered as a word delimiter?
On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> Sushant Sinha sushant...@gmail.com wrote:
> > I think that dot should be considered as a word delimiter because
> > when a dot is not followed by a space, most of the time it is a
> > typing error. Besides, there are not many valid English words that
> > have a dot in between.
>
> It's not treating it as an English word, but as a host name.
>
> select ts_debug('english', 'Mr.J.Sai Deepak');
>  ts_debug
> ---
>  (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
>  (blank,Space symbols, ,{},,)
>  (asciiword,Word, all ASCII,Deepak,{english_stem},english_stem,{deepak})
> (3 rows)
>
> You could run it through a dictionary which would deal with host
> tokens differently. Just be aware of what you'll be doing to
> www.google.com if you run into it.
>
> I hope this helps.
>
> -Kevin

In our uses for full text indexing, it is much more important to be
able to find host names and URLs than to find mistyped names. My two
cents.

Cheers,
Ken

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dot to be considered as a word delimiter?
Fair enough. I agree that there is a valid need for returning such
tokens as a host. But I think there is definitely a need to break it
down into individual words. This will help in cases when a document is
missing a space in between the words.

So what we can do is: return the entire compound word as Host and also
break it down into individual words. I can put up a patch for this if
you guys agree.

Returning multiple tokens for the same word is a feature of the text
search parser as explained in the documentation here:
http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html

Thanks,
Sushant.

On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall k...@rice.edu wrote:
> In our uses for full text indexing, it is much more important to be
> able to find host name and URLs than to find mistyped names. My two
> cents.
>
> Cheers,
> Ken
Re: [HACKERS] dot to be considered as a word delimiter?
Sushant Sinha sushant...@gmail.com wrote:
> So what we can do is: return the entire compound word as Host and
> also break it down into individual words.

So, pretty much like we handle hyphenation?

-Kevin
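
To make the proposal concrete, here is a minimal Python sketch of the behaviour being discussed: a dotted compound is emitted once as a whole host token and then again as its individual parts, in the same spirit as the whole-word-plus-parts handling of hyphenated words. This is not PostgreSQL code and not the proposed patch. The regular expressions are a simplified stand-in for the parser's host rule, and the token type name host_part is hypothetical.

```python
# Illustrative sketch only; the real parser is C code inside PostgreSQL.
import re

# Simplified stand-in for the parser's host rule: labels joined by dots.
HOST_RE = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+")
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Yield (type, token) pairs; a dotted compound yields itself plus its parts."""
    pos = 0
    while pos < len(text):
        host = HOST_RE.match(text, pos)
        if host:
            whole = host.group(0)
            yield ("host", whole)            # keep the compound searchable
            for part in whole.split("."):
                yield ("host_part", part)    # hypothetical token type
            pos = host.end()
            continue
        word = WORD_RE.match(text, pos)
        if word:
            yield ("word", word.group(0))
            pos = word.end()
            continue
        pos += 1                             # skip blanks and punctuation


if __name__ == "__main__":
    for token in tokenize("Mr.J.Sai Deepak"):
        print(token)
```

With this scheme a search for either the compound or one of its parts can match 'Mr.J.Sai'. The same rule also turns www.google.com into the extra parts www, google and com, which is the side effect Kevin cautions about.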
Re: [HACKERS] dot to be considered as a word delimiter?
On Tue, Jun 02, 2009 at 04:40:51PM -0400, Sushant Sinha wrote:
> Fair enough. I agree that there is a valid need for returning such
> tokens as a host. But I think there is definitely a need to break it
> down into individual words. This will help in cases when a document
> is missing a space in between the words.
>
> So what we can do is: return the entire compound word as Host and
> also break it down into individual words. I can put up a patch for
> this if you guys agree.
>
> Returning multiple tokens for the same word is a feature of the text
> search parser as explained in the documentation here:
> http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html
>
> Thanks,
> Sushant.

+1

Ken
Re: [HACKERS] dot to be considered as a word delimiter?
Sushant Sinha sushant...@gmail.com wrote:
> I think that dot should be considered as a word delimiter because
> when a dot is not followed by a space, most of the time it is a
> typing error. Besides, there are not many valid English words that
> have a dot in between.

It's not treating it as an English word, but as a host name.

select ts_debug('english', 'Mr.J.Sai Deepak');
 ts_debug
---
 (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
 (blank,Space symbols, ,{},,)
 (asciiword,Word, all ASCII,Deepak,{english_stem},english_stem,{deepak})
(3 rows)

You could run it through a dictionary which would deal with host
tokens differently. Just be aware of what you'll be doing to
www.google.com if you run into it.

I hope this helps.

-Kevin