Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Tom Lane
I wrote: (As an example, foo-beta1 is a numhword, with component tokens foo an aword and beta1 a numword. This is how it works now modulo the redefinition of the base categories.) Argh... need more caffeine. Obviously the component tokens would be apart_hword and numpart_hword. They'd be

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Tom Lane
I wrote: Maybe aword, word, and numword? Does the lack of response mean people are satisfied with that? Fleshing the proposal out to include the hyphenated-word categories: aword All ASCII letters word All letters according to iswalpha() numword Mixed letters and

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Michael Glaesemann
On Oct 23, 2007, at 10:42 , Tom Lane wrote: apart_hword Part of hyphenated word, all ASCII letters part_hword Part of hyphenated word, all letters numpart_hword Part of hyphenated word, mixed letters and digits Is there a rationale for using these instead of hword_apart,

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Tom Lane
Michael Glaesemann [EMAIL PROTECTED] writes: On Oct 23, 2007, at 10:42 , Tom Lane wrote: apart_hword Part of hyphenated word, all ASCII letters part_hword Part of hyphenated word, all letters numpart_hword Part of hyphenated word, mixed letters and digits Is there a rationale for

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes: I wrote: Maybe aword, word, and numword? Does the lack of response mean people are satisfied with that? Sorry, I had a couple responses partially written but never finished. If we were doing it from scratch I would suggest using longer names. At the least

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Alvaro Herrera
Tom Lane wrote: Michael Glaesemann [EMAIL PROTECTED] writes: On Oct 23, 2007, at 10:42 , Tom Lane wrote: apart_hword Part of hyphenated word, all ASCII letters part_hword Part of hyphenated word, all letters numpart_hword Part of hyphenated word, mixed letters and digits

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Alvaro Herrera
Gregory Stark wrote: Tom Lane [EMAIL PROTECTED] writes: I wrote: Maybe aword, word, and numword? Does the lack of response mean people are satisfied with that? Sorry, I had a couple responses partially written but never finished. If we were doing it from scratch I would suggest

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes: Gregory Stark wrote: If we were doing it from scratch I would suggest using longer names. At the least I would still suggest using ascii or asciiword instead of aword. +1 for asciiword; aword sounds too much like a word which is not the meaning I

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Alvaro Herrera
Tom Lane wrote: OK, so with that and Michael's suggestion we have asciiword word numword asciihword hword numhword hword_asciipart hword_part hword_numpart Sold? Sold here. -- Alvaro Herrera

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes: hword_asciipart hword_part hword_numpart Out of curiosity would the foo in foo-bär or the foo-beta1 be a hword_asciipart or a hword_part/hword_numpart? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes: Out of curiosity would the foo in foo-bär or the foo-beta1 be a hword_asciipart or a hword_part/hword_numpart? foo would be hword_asciipart independently of what was in the other parts of the hword. AFAICS this is what you want for the purpose, which is
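The naming scheme agreed on in this thread can be sketched in a few lines of Python. This is only an illustrative model, not the state machine in wparser_def.c: the function names `classify_word` and `classify_hyphenated` are hypothetical, Python's `str.isalpha()` and `str.isalnum()` stand in for iswalpha() and iswalnum(), and the rule used for the overall hyphenated-word type is an assumption inferred from the foo-beta1 example. Pure-digit tokens and everything else the real parser handles are left out of scope.

```python
from typing import List, Optional, Tuple


def classify_word(token: str) -> Optional[str]:
    """Return the proposed token type name for a single unhyphenated word."""
    if not token:
        return None
    if token.isascii() and token.isalpha():
        return "asciiword"   # all ASCII letters
    if token.isalpha():
        return "word"        # all letters (stand-in for iswalpha())
    if token.isalnum():
        return "numword"     # letters and digits (stand-in for iswalnum())
    return None              # not modelled in this sketch


# Each part is typed on its own, whatever the other parts contain.
PART_NAME = {
    "asciiword": "hword_asciipart",
    "word": "hword_part",
    "numword": "hword_numpart",
}


def classify_hyphenated(token: str) -> Optional[Tuple[str, List[Tuple[str, str]]]]:
    """Return (overall type, [(part, part type), ...]) for a hyphenated word."""
    parts = token.split("-")
    kinds = [classify_word(p) for p in parts]
    if len(parts) < 2 or None in kinds:
        return None
    # Assumed rule: the whole word takes the most general type among its parts.
    if all(k == "asciiword" for k in kinds):
        whole = "asciihword"
    elif all(k in ("asciiword", "word") for k in kinds):
        whole = "hword"
    else:
        whole = "numhword"
    return whole, [(p, PART_NAME[k]) for p, k in zip(parts, kinds)]


if __name__ == "__main__":
    print(classify_hyphenated("foo-beta1"))
    print(classify_hyphenated("foo-bär"))
```

In this model the foo in both foo-bär and foo-beta1 comes out as hword_asciipart, which matches the answer given above: a part is classified independently of the rest of the hyphenated word.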

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Michael Glaesemann
On Oct 23, 2007, at 12:09 , Alvaro Herrera wrote: Tom Lane wrote: OK, so with that and Michael's suggestion we have asciiword word numword asciihword hword numhword hword_asciipart hword_part hword_numpart Sold?

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Tom Lane
Michael Glaesemann [EMAIL PROTECTED] writes: Tom Lane wrote: asciiword word numword No huge preference, but I see benefit in what Gregory was saying re: asciiword, alphaword, alnumword. word itself is pretty general, while alphaword ties it much closer to its intended meaning. They've

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Tatsuo Ishii
Just for clarification. Are you going to make these changes in the 8.3 beta test period? -- Tatsuo Ishii SRA OSS, Inc. Japan If I am reading the state machine in wparser_def.c correctly, the three classifications of words that the default parser knows are lword Composed entirely of

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-23 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes: Just for clarification. Are you going to make these changes in the 8.3 beta test period? Yes, I committed them a couple hours ago. regards, tom lane

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-22 Thread Heikki Linnakangas
Alvaro Herrera wrote: Tom Lane wrote: ISTM that perhaps a more generally useful definition would be lword Only ASCII letters nlword Entirely letters per iswalpha(), but not lword word Entirely alphanumeric per iswalnum(), but not nlword

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-22 Thread Tatsuo Ishii
Alvaro Herrera wrote: Tom Lane wrote: ISTM that perhaps a more generally useful definition would be lword Only ASCII letters nlword Entirely letters per iswalpha(), but not lword word Entirely alphanumeric per iswalnum(), but not nlword

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-22 Thread Gregory Stark
Heikki Linnakangas [EMAIL PROTECTED] writes: Alvaro Herrera wrote: Tom Lane wrote: ISTM that perhaps a more generally useful definition would be lword Only ASCII letters nlword Entirely letters per iswalpha(), but not lword word Entirely

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-22 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes: Alvaro Herrera wrote: lword Entirely letters per iswalpha, with at least one ASCII nlword Entirely letters per iswalpha word Entirely alphanumeric per iswalnum, but not nlword I don't like this categorization

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-22 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes: Heikki Linnakangas [EMAIL PROTECTED] writes: I like the aword name more than lword, BTW. If we change the meaning of the classes, surely we can change the name as well, right? I'm not very familiar with the use case here. Is there a good reason to want

[HACKERS] Latin vs non-Latin words in text search parsing

2007-10-21 Thread Tom Lane
If I am reading the state machine in wparser_def.c correctly, the three classifications of words that the default parser knows are lword Composed entirely of ASCII letters nlword Composed entirely of non-ASCII letters (where letter is defined by iswalpha()) word

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-21 Thread Alvaro Herrera
Tom Lane wrote: ISTM that perhaps a more generally useful definition would be lword Only ASCII letters nlword Entirely letters per iswalpha(), but not lword word Entirely alphanumeric per iswalnum(), but not nlword (hence, includes at least one

Re: [HACKERS] Latin vs non-Latin words in text search parsing

2007-10-21 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: ISTM that perhaps a more generally useful definition would be lword Only ASCII letters nlword Entirely letters per iswalpha(), but not lword word Entirely alphanumeric per iswalnum(), but not