Re: [HACKERS] Latin vs non-Latin words in text search parsing
I wrote:
> (As an example, foo-beta1 is a numhword, with component tokens foo an
> aword and beta1 a numword.  This is how it works now modulo the
> redefinition of the base categories.)

Argh ... need more caffeine.  Obviously the component tokens would be
apart_hword and numpart_hword.  They'd be the others only if they were
*not* part of a hyphenated word.

			regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
Re: [HACKERS] Latin vs non-Latin words in text search parsing
I wrote:
> Maybe aword, word, and numword?

Does the lack of response mean people are satisfied with that?
Fleshing the proposal out to include the hyphenated-word categories:

	aword		All ASCII letters
	word		All letters according to iswalpha()
	numword		Mixed letters and digits (all iswalnum())
	ahword		Hyphenated word, all ASCII letters
	hword		Hyphenated word, all letters
	numhword	Hyphenated word, mixed letters and digits
	apart_hword	Part of hyphenated word, all ASCII letters
	part_hword	Part of hyphenated word, all letters
	numpart_hword	Part of hyphenated word, mixed letters and digits

(As an example, foo-beta1 is a numhword, with component tokens foo an
aword and beta1 a numword.  This is how it works now modulo the
redefinition of the base categories.)

I'm not totally thrilled with these short names for the hyphenation
categories, but they will seem at least somewhat familiar to users of
contrib/tsearch2, and it's probably not worth changing them just to
make them look prettier.

			regards, tom lane
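[Editor's note: the three base categories above can be sketched in Python. This is not code from the thread; `str.isalpha()`, `str.isascii()`, and `str.isalnum()` stand in for the C library's `iswalpha()`/`iswalnum()`, which is only an approximation since the real predicates are locale-dependent.]

```python
def classify_base(token: str) -> str:
    """Sketch of the proposed base token categories.

    Python string predicates approximate iswalpha()/iswalnum(); the
    parser's actual behavior depends on the locale.
    """
    if token.isalpha():
        # all letters; pure-ASCII letters get the narrower category
        return "aword" if token.isascii() else "word"
    if token.isalnum():
        # letters mixed with at least one digit
        return "numword"
    return "other"
```

Under these rules `foo` classifies as aword, `añadió` as word, and `beta1` as numword.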
Re: [HACKERS] Latin vs non-Latin words in text search parsing
On Oct 23, 2007, at 10:42 , Tom Lane wrote:
> 	apart_hword	Part of hyphenated word, all ASCII letters
> 	part_hword	Part of hyphenated word, all letters
> 	numpart_hword	Part of hyphenated word, mixed letters and digits

Is there a rationale for using these instead of hword_apart, hword_part
and hword_numpart?  I find the latter to be more readable as variable
names.  Or was your thought to be able to identify the content from the
first part of the variable name?

Michael Glaesemann
grzm seespotcode net
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Michael Glaesemann [EMAIL PROTECTED] writes:
> On Oct 23, 2007, at 10:42 , Tom Lane wrote:
>> 	apart_hword	Part of hyphenated word, all ASCII letters
>> 	part_hword	Part of hyphenated word, all letters
>> 	numpart_hword	Part of hyphenated word, mixed letters and digits

> Is there a rationale for using these instead of hword_apart,
> hword_part and hword_numpart?

Only that the category names were constructed that way in the contrib
module, and so this would seem familiar to existing tsearch2 users.
However, we are changing enough other details of the tsearch
configuration that maybe that's not a very strong consideration.

I have no objection in principle to choosing nicer names, except that
I would like to avoid a long-drawn-out discussion.  Is there general
approval of Michael's suggestion?

			regards, tom lane
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Tom Lane [EMAIL PROTECTED] writes:
> I wrote:
>> Maybe aword, word, and numword?

> Does the lack of response mean people are satisfied with that?

Sorry, I had a couple responses partially written but never finished.

If we were doing it from scratch I would suggest using longer names.
At the least I would still suggest using ascii or asciiword instead of
aword.

> Fleshing the proposal out to include the hyphenated-word categories:
> 	aword		All ASCII letters
> 	word		All letters according to iswalpha()
> 	numword		Mixed letters and digits (all iswalnum())

This does bring up another idea: using the ctype names.  They could be
named asciiword, alphaword, alnumword.  Frankly I don't think this is
any nicer than numword anyways.

> I'm not totally thrilled with these short names for the hyphenation
> categories, but they will seem at least somewhat familiar to users of
> contrib/tsearch2, and it's probably not worth changing them just to
> make them look prettier.

I tried thinking of better words for this and couldn't think of any.
The only other word for a hyphenated word I could think of is probably
compound, and the word for parts of a compound word is lexeme, but
that's certainly not going to be clearer (and technically it's not
quite right anyway).

So in short I would still suggest using ascii instead of just a, but
otherwise I think your suggestion is best.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Tom Lane wrote:
> Michael Glaesemann [EMAIL PROTECTED] writes:
>> On Oct 23, 2007, at 10:42 , Tom Lane wrote:
>>> 	apart_hword	Part of hyphenated word, all ASCII letters
>>> 	part_hword	Part of hyphenated word, all letters
>>> 	numpart_hword	Part of hyphenated word, mixed letters and digits

>> Is there a rationale for using these instead of hword_apart,
>> hword_part and hword_numpart?

> Only that the category names were constructed that way in the contrib
> module, and so this would seem familiar to existing tsearch2 users.
> However, we are changing enough other details of the tsearch
> configuration that maybe that's not a very strong consideration.
>
> I have no objection in principle to choosing nicer names, except that
> I would like to avoid a long-drawn-out discussion.  Is there general
> approval of Michael's suggestion?

+1

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Gregory Stark wrote:
> Tom Lane [EMAIL PROTECTED] writes:
>> I wrote:
>>> Maybe aword, word, and numword?

>> Does the lack of response mean people are satisfied with that?

> Sorry, I had a couple responses partially written but never finished.
> If we were doing it from scratch I would suggest using longer names.
> At the least I would still suggest using ascii or asciiword instead
> of aword.

+1 for asciiword; aword sounds too much like "a word", which is not the
meaning I think we're trying to convey.  It is a bit longer, but there
are longer names already so I don't think it's a problem.  (It's not
like it's something anyone needs to type often.)

-- 
Alvaro Herrera                      http://www.PlanetPostgreSQL.org/
"In the beginning of time there was disenchantment.  And there was
desolation.  And great was the uproar, and the glow of monitors and
the clatter of keys."  (Sean los Pájaros Pulentios, Daniel Correa)
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Alvaro Herrera [EMAIL PROTECTED] writes:
> Gregory Stark wrote:
>> If we were doing it from scratch I would suggest using longer names.
>> At the least I would still suggest using ascii or asciiword instead
>> of aword.

> +1 for asciiword; aword sounds too much like a word which is not the
> meaning I think we're trying to convey.

OK, so with that and Michael's suggestion we have

	asciiword
	word
	numword
	asciihword
	hword
	numhword
	hword_asciipart
	hword_part
	hword_numpart

Sold?

			regards, tom lane
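[Editor's note: for reference, here is the correspondence between the earlier short names and the names agreed above, assembled from this thread; the mapping itself is the editor's summary, not something posted to the list.]

```python
# earlier proposed name -> agreed name (per this thread's discussion)
RENAMES = {
    "aword":         "asciiword",        # ascii spelled out per Gregory/Alvaro
    "word":          "word",
    "numword":       "numword",
    "ahword":        "asciihword",
    "hword":         "hword",
    "numhword":      "numhword",
    "apart_hword":   "hword_asciipart",  # hword_ prefix per Michael
    "part_hword":    "hword_part",
    "numpart_hword": "hword_numpart",
}
```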
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Tom Lane wrote:
> OK, so with that and Michael's suggestion we have
> 	asciiword
> 	word
> 	numword
> 	asciihword
> 	hword
> 	numhword
> 	hword_asciipart
> 	hword_part
> 	hword_numpart
> Sold?

Sold here.

-- 
Alvaro Herrera               http://www.flickr.com/photos/alvherre/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support,
and lack of hesitation in answering a lost soul's question, I just
wished the rest of the mailing list could be like this."  (Fotis)
(http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Tom Lane [EMAIL PROTECTED] writes:
> 	hword_asciipart
> 	hword_part
> 	hword_numpart

Out of curiosity, would the foo in foo-bär or in foo-beta1 be a
hword_asciipart or a hword_part/hword_numpart?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Gregory Stark [EMAIL PROTECTED] writes:
> Out of curiosity would the foo in foo-bär or the foo-beta1 be a
> hword_asciipart or a hword_part/hword_numpart?

foo would be hword_asciipart independently of what was in the other
parts of the hword.  AFAICS this is what you want for the purpose,
which is to know which dictionary stack to push the token through.

			regards, tom lane
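[Editor's note: Tom's answer — each part classified on its own, the whole hyphenated word taking the most general category its parts require — can be sketched as follows. The helper names are hypothetical, the token-type names are the final ones from this thread, and Python string predicates approximate `iswalpha()`/`iswalnum()`.]

```python
def classify_part(part: str) -> str:
    # each hyphen-separated part is classified independently of its siblings
    if part.isalpha():
        return "hword_asciipart" if part.isascii() else "hword_part"
    return "hword_numpart" if part.isalnum() else "other"

def classify_hword(word: str) -> str:
    # the whole hyphenated word gets the most general category any part needs
    joined = word.replace("-", "")
    if joined.isalpha():
        return "asciihword" if joined.isascii() else "hword"
    return "numhword" if joined.isalnum() else "other"
```

So the foo in foo-bär stays hword_asciipart even though the word as a whole is an hword, and in foo-beta1 it stays hword_asciipart inside a numhword.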
Re: [HACKERS] Latin vs non-Latin words in text search parsing
On Oct 23, 2007, at 12:09 , Alvaro Herrera wrote:
> Tom Lane wrote:
>> OK, so with that and Michael's suggestion we have
>> 	asciiword
>> 	word
>> 	numword
>> 	asciihword
>> 	hword
>> 	numhword
>> 	hword_asciipart
>> 	hword_part
>> 	hword_numpart
>> Sold?

> Sold here.

No huge preference, but I see benefit in what Gregory was saying re:
asciiword, alphaword, alnumword.  word itself is pretty general, while
alphaword ties it much closer to its intended meaning.  They've got
pretty consistent lengths as well.  Maybe it leans too Hungarian.

I'll take your answer off the air :)

Michael Glaesemann
grzm seespotcode net
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Michael Glaesemann [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> 	asciiword
>> 	word
>> 	numword

> No huge preference, but I see benefit in what Gregory was saying re:
> asciiword, alphaword, alnumword.  word itself is pretty general,
> while alphaword ties it much closer to its intended meaning.  They've
> got pretty consistent lengths as well.  Maybe it leans too Hungarian.

I stuck with the previous proposal, mainly because I was already pretty
well into making the edits by the time I saw your message.  But I think
that with this definition word matches pretty well with everyone's
understanding of that, and the other two are supersets and subsets that
might have specific uses.

			regards, tom lane
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Just for clarification: are you going to make these changes in the 8.3
beta test period?
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> If I am reading the state machine in wparser_def.c correctly, the
> three classifications of words that the default parser knows are
>
> 	lword	Composed entirely of ASCII letters
> 	nlword	Composed entirely of non-ASCII letters (where letter
> 		is defined by iswalpha())
> 	word	Entirely alphanumeric (per iswalnum()), but not
> 		above cases
>
> This classification is probably sane enough for dealing with mixed
> Russian/English text --- IIUC, Russian words will come entirely from
> the Cyrillic alphabet, which has no overlap with ASCII letters.  But
> I'm thinking it'll be quite inconvenient for other European languages
> whose alphabets include the base ASCII letters plus other stuff such
> as accented letters.  They will have a lot of words that fall into
> the catchall word category, which will mean they have to index mixed
> alpha-and-number words in order to catch all native words.
>
> ISTM that perhaps a more generally useful definition would be
>
> 	lword	Only ASCII letters
> 	nlword	Entirely letters per iswalpha(), but not lword
> 	word	Entirely alphanumeric per iswalnum(), but not nlword
> 		(hence, includes at least one digit)
>
> However, I am no linguist and maybe I'm missing something.  Comments?
>
> 			regards, tom lane
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Tatsuo Ishii [EMAIL PROTECTED] writes:
> Just for clarification. Are you going to make these changes in the
> 8.3 beta test period?

Yes, I committed them a couple hours ago.

			regards, tom lane
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Alvaro Herrera wrote:
> Tom Lane wrote:
>> ISTM that perhaps a more generally useful definition would be
>> 	lword	Only ASCII letters
>> 	nlword	Entirely letters per iswalpha(), but not lword
>> 	word	Entirely alphanumeric per iswalnum(), but not nlword
>> 		(hence, includes at least one digit)

> ... I am not sure if there are any western european languages where
> words can only be formed with non-ascii chars.

There is at least in Swedish: ö (island) and å (river).  They're both a
bit special because they're just one letter each.

> 	lword	Entirely letters per iswalpha, with at least one ASCII
> 	nlword	Entirely letters per iswalpha
> 	word	Entirely alphanumeric per iswalnum, but not nlword

I don't like this categorization much more than the original.  The
distinction between lword and nlword is useless for most European
languages.  I suppose that Tom's argument that it's useful to
distinguish words made of purely ASCII characters in computer-oriented
stuff is valid, though I can't immediately think of a use case.  For
things like parsing a programming language, that's not really enough,
so you'd probably end up writing your own parser anyway.

I'm also not clear what the use case for the distinction between words
with digits or not is.  I don't think there's any natural language
where a word can contain digits, so it must be a computer-oriented
thing as well.

I like the aword name more than lword, BTW.  If we change the meaning
of the classes, surely we can change the name as well, right?

Note that the default parser is useless for languages like Japanese,
where words are not separated by whitespace, anyway.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Heikki Linnakangas wrote:
> [quoted message trimmed]
> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

Above is true, but that does not necessarily mean that Tsearch is not
used for Japanese at all.  I overcome the problem above by doing a
pre-process step which separates Japanese sentences into words divided
by white space.  I wish I could write a new parser which could do the
job for 8.4 or later...

Please change the word definition very carefully.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> ISTM that perhaps a more generally useful definition would be
>>> 	lword	Only ASCII letters
>>> 	nlword	Entirely letters per iswalpha(), but not lword
>>> 	word	Entirely alphanumeric per iswalnum(), but not nlword
>>> 		(hence, includes at least one digit)

>> ... I am not sure if there are any western european languages were
>> words can only be formed with non-ascii chars.

> There is at least in Swedish: ö (island) and å (river).  They're both
> a bit special because they're just one letter each.

For what it's worth, I did the same search last night and found three
French words including çà -- which admittedly is likely to be a noise
word.  Other dictionaries such as Italian and Irish also have
one-letter words like this.  The only other with multi-letter words is
actually Faroese, with íð and óð.

> I like the aword name more than lword, BTW.  If we change the meaning
> of the classes, surely we can change the name as well, right?

I'm not very familiar with the use case here.  Is there a good reason
to want to abbreviate these names?  I think I would expect ascii, word,
and token for the three categories Tom describes.

> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

I also wonder about languages like Arabic and Hindi, which do have
words, but I'm not sure if they use white space as simply as in latin
languages.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Alvaro Herrera wrote:
>> 	lword	Entirely letters per iswalpha, with at least one ASCII
>> 	nlword	Entirely letters per iswalpha
>> 	word	Entirely alphanumeric per iswalnum, but not nlword

> I don't like this categorization much more than the original.  The
> distinction between lword and nlword is useless for most European
> languages.

Right.  That's not an objection in itself, since you can just add the
same dictionary mappings to both token types, but the question is when
would such a distinction actually be useful?  AFAICS the only case
where it'd make sense to put different mappings on lword and nlword
with the above definitions is when dealing with Russian or similar
languages, where the entire alphabet is non-ASCII.  However, my
proposal (pure ASCII vs not pure ASCII) seems to work just as well for
that case as this proposal does.

> ... I'm also not clear what the use case for the distinction between
> words with digits or not is.  I don't think there's any natural
> languages where a word can contain digits, so it must be a
> computer-oriented thing as well.

Well, that's exactly why we *should* distinguish words-with-digits;
it's unlikely that any standard dictionary will do sane things with
them, so if you want to index them they need to go down a different
dictionary chain.

A more drastic change would be to not treat a string like beta1 as a
single token at all, so that the alphanumeric-word category would go
away entirely.  However, I'm disinclined to tinker with the parser
that much.  It's seen enough use in the contrib module that I'm
prepared to grant that the design is generally useful.  I'm just
worried that the subcategories of word need a bit of adjustment for
languages other than Russian and English.

			regards, tom lane
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Gregory Stark [EMAIL PROTECTED] writes:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> I like the aword name more than lword, BTW.  If we change the
>> meaning of the classes, surely we can change the name as well,
>> right?

> I'm not very familiar with the use case here.  Is there a good reason
> to want to abbreviate these names?  I think I would expect ascii,
> word, and token for the three categories Tom describes.

Please look at the first nine rows of the table here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html

It's not clear to me where we'd go with the names for the
hyphenated-word and hyphenated-word-part categories.  Also, ISTM that
we should use related names for these three categories, since they are
all considered valid parts of hyphenated words.

Another point: token is probably unreasonably confusing as a name for
a token type.  Is that a token token or a word token?

Maybe aword, word, and numword?

			regards, tom lane
[HACKERS] Latin vs non-Latin words in text search parsing
If I am reading the state machine in wparser_def.c correctly, the
three classifications of words that the default parser knows are

	lword	Composed entirely of ASCII letters
	nlword	Composed entirely of non-ASCII letters (where letter
		is defined by iswalpha())
	word	Entirely alphanumeric (per iswalnum()), but not
		above cases

This classification is probably sane enough for dealing with mixed
Russian/English text --- IIUC, Russian words will come entirely from
the Cyrillic alphabet, which has no overlap with ASCII letters.  But
I'm thinking it'll be quite inconvenient for other European languages
whose alphabets include the base ASCII letters plus other stuff such
as accented letters.  They will have a lot of words that fall into the
catchall word category, which will mean they have to index mixed
alpha-and-number words in order to catch all native words.

ISTM that perhaps a more generally useful definition would be

	lword	Only ASCII letters
	nlword	Entirely letters per iswalpha(), but not lword
	word	Entirely alphanumeric per iswalnum(), but not nlword
		(hence, includes at least one digit)

However, I am no linguist and maybe I'm missing something.  Comments?

			regards, tom lane
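[Editor's note: to make the difference concrete, here is a side-by-side sketch of the current and proposed rules. This is the editor's illustration, not code from the thread; Python's string predicates approximate `iswalpha()`/`iswalnum()` and ignore locale effects.]

```python
def classify_current(tok: str) -> str:
    # current parser: lword is all-ASCII letters, nlword all-non-ASCII
    # letters; a mix of the two falls through to the catchall "word"
    if tok.isalpha():
        if tok.isascii():
            return "lword"
        if not any(c.isascii() for c in tok):
            return "nlword"
    return "word" if tok.isalnum() else "other"

def classify_proposed(tok: str) -> str:
    # proposed: nlword is any all-letter token that isn't pure ASCII,
    # leaving "word" only for tokens containing at least one digit
    if tok.isalpha():
        return "lword" if tok.isascii() else "nlword"
    return "word" if tok.isalnum() else "other"
```

An accented word like carácter lands in the catchall word category today, but would be an nlword under the proposal; caracteres is an lword either way.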
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Tom Lane wrote:
> ISTM that perhaps a more generally useful definition would be
> 	lword	Only ASCII letters
> 	nlword	Entirely letters per iswalpha(), but not lword
> 	word	Entirely alphanumeric per iswalnum(), but not nlword
> 		(hence, includes at least one digit)
>
> However, I am no linguist and maybe I'm missing something.

I tend to agree with the need to redefine the categories.  I am not
sure I agree with this particular definition though.  I would think
that a latin word should include ASCII letters and accented letters,
and a non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 lignes)

I would think those would all fit in the latin word category.  This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 lignes)

I am not sure if there are any western european languages where words
can only be formed with non-ascii chars.  At least in Spanish, accents
tend to be rare.  However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 ligne)

I don't think this is much of a problem, this particular word being
(most likely) a stopword.

So, how about

	lword	Entirely letters per iswalpha, with at least one ASCII
	nlword	Entirely letters per iswalpha
	word	Entirely alphanumeric per iswalnum, but not nlword

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
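[Editor's note: Alvaro's counter-proposal flips which class catches accented words. A sketch under the same assumptions as before (editor's illustration; Python predicates approximate `iswalpha()`/`iswalnum()`):]

```python
def classify_alvaro(tok: str) -> str:
    # lword: all letters with at least one ASCII letter, so an accented
    # word like "carácter" still counts as latin; nlword: all letters,
    # none of them ASCII (e.g. Swedish "ö", French "à")
    if tok.isalpha():
        return "lword" if any(c.isascii() for c in tok) else "nlword"
    return "word" if tok.isalnum() else "other"
```

This keeps carácter and caracteres in the same category, at the cost (as Tom notes in his reply) of leaving no category for pure-ASCII words.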
Re: [HACKERS] Latin vs non-Latin words in text search parsing
Alvaro Herrera [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> ISTM that perhaps a more generally useful definition would be
>> 	lword	Only ASCII letters
>> 	nlword	Entirely letters per iswalpha(), but not lword
>> 	word	Entirely alphanumeric per iswalnum(), but not nlword

> ... how about
> 	lword	Entirely letters per iswalpha, with at least one ASCII
> 	nlword	Entirely letters per iswalpha
> 	word	Entirely alphanumeric per iswalnum, but not nlword

Hmm.  Then we have no category for entirely ASCII, which is an
interesting category at least from the English standpoint, and I think
also in a lot of computer-oriented contexts.  I think you may be
putting too much emphasis on the Latin aspect of the category name,
which I find to be a bit historical.  I'm not sure if it's too late to
consider renaming the categories; if we were willing to do that I'd
propose categories aword, naword, word, defined as above.

Another thing that bothers me about your suggestion is that (at least
in some locales) iswalpha will return true for things that are neither
ASCII letters nor accented versions of them, e.g. Cyrillic letters.
So I'm not sure the surprise factor is any less with your approach
than mine: you could still get lword for something decidedly not
Latin-derived.

			regards, tom lane