Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-30 Thread Josh Berkus

Ishii-san,


Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean?  What little I know about ideographic
languages suggests it wouldn't work well.  And surely the specific rules
in the Snowball project's English stemmer wouldn't work.


Your understanding is correct. The English stemmer would not work for
the non-English part of Japanese text.


That reminds me, don't you guys have your own full text search for 
Japanese?  Planning on merging it with the core code anytime soon?


--Josh

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-30 Thread Tatsuo Ishii
 Ishii-san,
 
  Ok, probably we need to copy the English stemming rule to the one for
  Japanese.
  Pardon my ignorance here, but is the concept of stemming even relevant
  to Japanese/Chinese/Korean?  What little I know about ideographic
  languages suggests it wouldn't work well.  And surely the specific rules
  in the Snowball project's English stemmer wouldn't work.
  
  Your understanding is correct. The English stemmer would not work for
  the non-English part of Japanese text.
 
 That reminds me, don't you guys have your own full text search for 
 Japanese?  Planning on merging it with the core code anytime soon?

No. Actually Japanese (the non-English part) does not need stemming at
all. However, since Japanese is an agglutinative language written
without spaces, we have to break a continuous Japanese string into
space-separated words. For example, we need to break:

todayisfine

into:

today is fine

(Of course the English words are just for non-Japanese speakers'
understanding; in reality they would be Japanese.)

For this we need a good dictionary and software. Fortunately there are
several open source packages for this purpose. I once wrote a
PostgreSQL C function invoking one of these packages to do the work,
and it works great with tsearch2.
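
The word-breaking step described above can be sketched with a greedy
longest-match over a word list. This is only a toy illustration in
Python, not the C function mentioned here: the segment function and
TOY_DICTIONARY are hypothetical names, and real segmenters rely on much
larger dictionaries and morphological analysis.

```python
def segment(text, dictionary):
    """Split an unspaced string into words by greedy longest match."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # No dictionary entry starts here: emit one character as-is.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy dictionary; English stands in for Japanese words.
TOY_DICTIONARY = {"today", "is", "fine"}

print(" ".join(segment("todayisfine", TOY_DICTIONARY)))
```

Greedy matching can pick the wrong split when dictionary words overlap,
which is one reason a good dictionary and dedicated software matter.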
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-25 Thread Mike Rylander

On 6/25/07, Tom Lane [EMAIL PROTECTED] wrote:

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers.  I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.



While I imagine that is probably true of many, if not most, my project
in particular would greatly benefit from the ability to mix stemmers.
I work with complex bibliographic data, which has language information
embedded within records.  This is not limited to the record level
either.  Individual fields within each bibliographic record can be in
different languages.

Especially in countries such as Canada (en_CA/fr_CA), where multilingual
software is a requirement for use in public institutions, the ability
to choose a stemmer and stop-word list at will for any particular
record will provide exactly the behavior needed.
The obvious generalization from Canada would be to support any mix of
languages supported by tsearch2.

I can certainly understand the benefit of making the default
configuration a simple locale to language map, but there are
definitely uses for searching using different stemmers/stop-lists even
within the same corpus/index.  So, as a datapoint for the discussion,
I would ask that the option of multiple languages per DB locale not be
removed if it can be at all avoided.
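
As a rough illustration of choosing a stop-word list per field rather
than per database, here is a minimal Python sketch. The STOP_WORDS
table, the index_terms helper, and the sample record are all
hypothetical, and the sketch only filters stop words; a real
configuration would also select a stemmer for each language.

```python
# Hypothetical, deliberately tiny stop-word lists keyed by language tag.
STOP_WORDS = {
    "en": {"the", "of", "and"},
    "fr": {"le", "la", "les", "de", "et"},
}


def index_terms(text, language):
    """Lowercase, split on whitespace, and drop that language's stop words."""
    stops = STOP_WORDS.get(language, set())
    return [word for word in text.lower().split() if word not in stops]


# One bibliographic record whose fields carry their own language tag.
record = [
    ("en", "The History of Canada"),
    ("fr", "Histoire de la ville de Québec"),
]

for language, field_text in record:
    print(language, index_terms(field_text, language))
```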

Thanks for listening (and for all the great work on getting tsearch
into core! :) ...

--
Mike Rylander



Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-25 Thread Tom Lane
Mike Rylander [EMAIL PROTECTED] writes:
 I can certainly understand the benefit of making the default
 configuration a simple locale to language map, but there are
 definitely uses for searching using different stemmers/stop-lists even
 within the same corpus/index.  So, as a datapoint for the discussion,
 I would ask that the option of multiple languages per DB locale not be
 removed if it can be at all avoided.

Nobody is proposing that --- the issue here is just how we set up the
default configuration.

regards, tom lane



Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-25 Thread Mike Rylander

On 6/25/07, Tom Lane [EMAIL PROTECTED] wrote:

Mike Rylander [EMAIL PROTECTED] writes:
 I can certainly understand the benefit of making the default
 configuration a simple locale to language map, but there are
 definitely uses for searching using different stemmers/stop-lists even
 within the same corpus/index.  So, as a datapoint for the discussion,
 I would ask that the option of multiple languages per DB locale not be
 removed if it can be at all avoided.

Nobody is proposing that --- the issue here is just how we set up the
default configuration.



Then I misunderstood.  Sorry for the noise, folks.

--
Mike Rylander



Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-24 Thread Tatsuo Ishii
 Tatsuo Ishii wrote:
 
  japanese '{ja_JP, C}'
  
  How would we know C -> japanese?
  
 You can't do that. You can't have different languages (not locales)
 mapping to the same 'tsearch language' because the stemmer doesn't know
 that a specific word is in english or japanese. So you have two options:
 (a) disable stemming (b) leave the language set to 'japanese' and see if
 it plays well.

Ok, probably we need to copy the English stemming rule to the one for
Japanese. I think the same thing (commonly used English mixed with the
local language) applies to Chinese and Korean.
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-24 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes:
 Ok, probably we need to copy the English stemming rule to the one for
 Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean?  What little I know about ideographic
languages suggests it wouldn't work well.  And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

 I think the same thing (commonly used English mixed with the
 local language) applies to Chinese and Korean.

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers.  I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.

regards, tom lane



Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-24 Thread Tatsuo Ishii
 Tatsuo Ishii [EMAIL PROTECTED] writes:
  Ok, probably we need to copy the English stemming rule to the one for
  Japanese.
 
 Pardon my ignorance here, but is the concept of stemming even relevant
 to Japanese/Chinese/Korean?  What little I know about ideographic
 languages suggests it wouldn't work well.  And surely the specific rules
 in the Snowball project's English stemmer wouldn't work.

Your understanding is correct. The English stemmer would not work for
the non-English part of Japanese text.

What I meant was the chunks of English text in Japanese.

  I think the same thing (commonly used English mixed with the
  local language) applies to Chinese and Korean.
 
 Well, it's not hard at all to find chunks of English text that have
 embedded bits of French, Spanish, or what-have-you, but that's not an
 argument for trying to intermix the stemmers.  I doubt that such simple
 bits of program could tell the language difference well enough to
 determine which stemming rules to apply.

For Japanese, it will be fairly simple: words in the 7-bit ASCII range
must be English (note that the most commonly used Japanese encodings,
such as EUC, do not allow mixing with ISO 8859).
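
A minimal Python sketch of that rule, assuming the text has already
been split into words: words made only of 7-bit ASCII characters go to
an English stemmer, and everything else passes through untouched. The
toy_english_stem function is a hypothetical stand-in, not the Snowball
stemmer, and the other names are illustrative as well.

```python
def is_ascii_word(word):
    """True if every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in word)


def toy_english_stem(word):
    # Hypothetical stand-in for a real English stemmer:
    # lowercase the word and strip a trailing plural "s".
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def normalize(words):
    """Stem ASCII words as English; leave all other words unchanged."""
    return [toy_english_stem(w) if is_ascii_word(w) else w for w in words]


print(normalize(["PostgreSQL", "databases", "日本語", "検索"]))
```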
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-23 Thread Euler Taveira de Oliveira
Tatsuo Ishii wrote:

 japanese '{ja_JP, C}'
 
 How would we know C -> japanese?
 
You can't do that. You can't have different languages (not locales)
mapping to the same 'tsearch language' because the stemmer doesn't know
that a specific word is in english or japanese. So you have two options:
(a) disable stemming (b) leave the language set to 'japanese' and see if
it plays well.


-- 
  Euler Taveira de Oliveira
  http://www.timbira.com/



Re: [Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-22 Thread Tatsuo Ishii
  How would this work for initdb with locale C?
 
  I'm worrying about that too.
 
 english '{en_GB, en_US, C}'
 
 I suppose that a locale name always has a dot separator, except the C
 locale, which is a well-known exception.

So would we have to write this?

japanese '{ja_JP, C}'

How would we know C -> japanese?

Also I'm wondering how we could handle texts including Japanese and
English. It's very common in Japan.
--
Tatsuo Ishii
SRA OSS, Inc. Japan



[Fwd: Re: [HACKERS] tsearch in core patch]

2007-06-22 Thread teodor

 How would this work for initdb with locale C?

 I'm worrying about that too.

english '{en_GB, en_US, C}'

I suppose that a locale name always has a dot separator, except the C
locale, which is a well-known exception.
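
A minimal Python sketch of such a lookup, assuming the part before the
dot identifies the language and territory. LANGUAGE_LOCALES and
language_for_locale are hypothetical names used for illustration only;
this is not the patch's actual data structure.

```python
# Hypothetical mapping from a tsearch language to the locales it covers.
LANGUAGE_LOCALES = {
    "english": {"en_GB", "en_US", "C"},
}


def language_for_locale(locale_name, mapping=LANGUAGE_LOCALES):
    """Return the language mapped to a locale name, or None if unmapped."""
    # Drop the encoding suffix after the dot, e.g. "en_US.UTF-8" -> "en_US".
    # "C" has no dot, so it is looked up unchanged.
    base = locale_name.split(".", 1)[0]
    for language, locales in mapping.items():
        if base in locales:
            return language
    return None


print(language_for_locale("en_US.UTF-8"))
print(language_for_locale("C"))
```

If "C" were listed under a second language as well, this lookup would
simply return whichever entry it reaches first, which is the ambiguity
raised elsewhere in this thread.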



