Re: [HACKERS] Extending range of to_tsvector et al

2012-10-01 Thread Tom Lane
john knightley john.knight...@gmail.com writes:
 On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott deni...@gmail.com wrote:
 So... perhaps LC_CTYPE=C is a possible workaround for you?

 LC_CTYPE would not be a workaround - this database needs to be in
 utf8, since the full text search is to be used for a MediaWiki.

You're confusing locale and encoding.  They are different things.

 Is this a bug that is being worked on?

No.  As I already tried to explain to you, this behavior is not
determined by Postgres, it's determined by the platform's locale
support.  You need to complain to your OS vendor.

regards, tom lane
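
A minimal sketch of the distinction, in Python rather than anything PostgreSQL-specific: the encoding only decides how a code point is stored as bytes, so UTF-8 stores every character discussed in this thread without complaint. Whether a given character counts as a letter is a separate question, and that is the one the locale (LC_CTYPE) answers. The sample code points and the helper name utf8_roundtrip are illustrative assumptions.

```python
# Encoding is only about bytes: every code point below is representable
# in UTF-8, whatever any locale thinks about it being a "letter".
samples = [chr(0x6211), chr(0x2662D), chr(0x2B582)]


def utf8_roundtrip(ch):
    """Return (number of UTF-8 bytes, whether decoding restores ch)."""
    data = ch.encode("utf-8")
    return len(data), data.decode("utf-8") == ch


for ch in samples:
    # BMP ideographs take 3 bytes, supplementary-plane ones take 4.
    print(f"U+{ord(ch):04X}", utf8_roundtrip(ch))
```

Nothing in this round trip says anything about character classification, which is why a UTF-8 database can still be created with a different LC_CTYPE.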


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread Dan Scott
On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 john.knight...@gmail.com wrote:
 When using to_tsvector, a number of newer Unicode characters and PUA
 characters are not included. How do I add the characters which I desire to
 be found?

I've just started digging into this code a bit, but from what I've
found src/backend/tsearch/wparser_def.c defines much of the parser
functionality, and in the area of Unicode includes a number of
comments like:

* with multibyte encoding and C-locale isw* function may fail or give
wrong result.
* multibyte encoding and C-locale often are used for Asian languages.
* any non-ascii symbol with multibyte encoding with C-locale is an
alpha character

... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
WCSTOMBS and TOWLOWER are available) to complicate testing scenarios
:)
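
Read literally, the last of those comments suggests a rule that can be sketched as below. This is a toy model in Python, not the code in wparser_def.c: the function name is_alpha is hypothetical, and Python's bundled Unicode tables stand in for the platform's isw* functions, which may well classify individual characters differently.

```python
import unicodedata


def is_alpha(ch: str, c_locale: bool) -> bool:
    """Toy model of the quoted comments, not the real parser."""
    if ord(ch) < 128:
        # ASCII is classified the same way in either case.
        return ch.isalpha()
    if c_locale:
        # Quoted rule: with a multibyte encoding and the C locale,
        # any non-ASCII symbol is treated as an alpha character.
        return True
    # Otherwise the platform's isw* functions decide. Python's own
    # Unicode database is used here only as a stand-in (assumption).
    return unicodedata.category(ch).startswith("L")


# U+F6498 is an assumed example of a private-use code point.
pua = chr(0xF6498)
print(is_alpha(pua, c_locale=True))   # True
print(is_alpha(pua, c_locale=False))  # False
```

Under this model the same private-use character is a letter in a C-locale database and a non-letter elsewhere, which is the kind of difference a test case would need to pin down per locale.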

Also note that src/test/regress/sql/tsearch.sql and
regress/sql/tsdicts.sql currently focus on English, ASCII-only data.

Perhaps this is a good opportunity for you to describe what your
environment looks like (OS, PostgreSQL version, encoding and locale
settings for the database) and show some sample to_tsquery() @@
to_tsvector() queries that don't behave the way you think they should
behave - and we could start building some test cases as a first step?

-- 
Dan Scott
Laurentian University




Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread john knightley
Dear Dan,

thank you for your reply.

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
a utf8 locale.

A short 5 line dictionary file is sufficient to test:-

raeuz
我们
𦘭𥎵
𪽖𫖂
󶒘󴮬

line 1 raeuz Zhuang word written using English letters and shows up
under ts_vector ok
line 2 我们 everyday Chinese word and shows up under ts_vector ok
line 3 𦘭𥎵 Zhuang word written using rather old Chinese characters
found in Unicode 3.1 which came in about the year 2000 and shows up
under ts_vector ok
line 4 𪽖𫖂 Zhuang word written using rather old Chinese characters
found in Unicode 5.2 which came in about the year 2009 but does not show
up under ts_vector
line 5 󶒘󴮬 Zhuang word written using rather old Chinese characters
found in PUA area of the font Sawndip.ttf but does not show up under
ts_vector (Font can be downloaded from
http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though included in a dictionary, do not get
accepted by ts_vector.

Regards
John





Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread Dan Scott
Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley
john.knight...@gmail.com wrote:
 The last two words even though included in a dictionary do not get
 accepted by ts_vector.

Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to
work using the default text search configuration (albeit with one
crucial note: I created the database with the lc_ctype=C
lc_collate=C options):

WORKING:

createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
foobar=# select ts_debug('󶒘󴮬');
ts_debug
----------------------------------------------------------------
 (word,Word, all letters,󶒘󴮬,{english_stem},english_stem,{󶒘󴮬})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

foobaz=# select ts_debug('󶒘󴮬');
ts_debug
--------------------------------
 (blank,Space symbols,󶒘󴮬,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?




Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread Tom Lane
john knightley john.knight...@gmail.com writes:
 The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
 a utf8 locale.

 A short 5 line dictionary file is sufficient to test:-

 raeuz
 我们
 𦘭𥎵
 𪽖𫖂
 󶒘󴮬

 line 1 raeuz Zhuang word written using English letters and shows up
 under ts_vector ok
 line 2 我们 everyday Chinese word and shows up under ts_vector ok
 line 3 𦘭𥎵 Zhuang word written using rather old Chinese characters
 found in Unicode 3.1 which came in about the year 2000 and shows up
 under ts_vector ok
 line 4 𪽖𫖂 Zhuang word written using rather old Chinese characters
 found in Unicode 5.2 which came in about the year 2009 but does not show
 up under ts_vector
 line 5 󶒘󴮬 Zhuang word written using rather old Chinese characters
 found in PUA area of the font Sawndip.ttf but does not show up under
 ts_vector (Font can be downloaded from
 http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

AFAIK there is nothing in Postgres itself that would distinguish, say,
𦘭 from 𪽖.  I think this must be down to
your platform's locale definition: it probably thinks that the former is
a letter and the latter is not.  You'd have to gripe to the locale
maintainers to get that fixed.

regards, tom lane




Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread john knightley
On Mon, Oct 1, 2012 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 AFAIK there is nothing in Postgres itself that would distinguish, say,
 𦘭 from 𪽖.  I think this must be down to
 your platform's locale definition: it probably thinks that the former is
 a letter and the latter is not.  You'd have to gripe to the locale
 maintainers to get that fixed.

 regards, tom lane

PostgreSQL in general does not usually distinguish but full text search does:-

 select ts_debug('𦘭 from 𪽖');

gives the result:-

 ts_debug
---
 (word,Word, all letters,𦘭,{english_stem},english_stem,{𦘭})
 (blank,Space symbols, ,{},,)
 (asciiword,Word, all ASCII,from,{english_stem},english_stem,{})
 (blank,Space symbols, 𪽖,{},,)
(4 rows)

Somewhere there is a dictionary or library that is based on roughly
Unicode 4.0, which includes 𦘭 (U+2662D) but not 𫖂 (U+2B582), which was
added in Unicode 5.2.

Also PUA characters are dropped in the same way by the full text
search, which is what Google does but which I do not wish to do.
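
One way to look at the three kinds of character involved is to print each one's plane and Unicode general category. The sketch below uses Python's bundled Unicode database purely as a stand-in; it is not the table the platform locale (and therefore the text search parser) consults, and U+F6498 is an assumed example of a supplementary private-use code point rather than a value quoted in this thread.

```python
import unicodedata

# Labels follow the descriptions given earlier in the thread.
samples = {
    "Unicode 3.1 ideograph": 0x2662D,
    "Unicode 5.2 ideograph": 0x2B582,
    "PUA character (assumed code point)": 0xF6498,
}


def describe(cp):
    """Plane, general category and letter status per Python's tables."""
    ch = chr(cp)
    return {
        "plane": cp >> 16,
        "category": unicodedata.category(ch),
        "isalpha": ch.isalpha(),
    }


# The Unicode version of Python's own database, for comparison with
# whatever version the platform locale data was built from.
print("Unicode database version:", unicodedata.unidata_version)
for label, cp in samples.items():
    print(f"{label}: U+{cp:04X}", describe(cp))
```

Private-use code points have general category Co in the Unicode database itself, so a locale built from that data would not be expected to report them as letters, whatever its Unicode version.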

Regards
John




Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread john knightley
On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott deni...@gmail.com wrote:
 So... perhaps LC_CTYPE=C is a possible workaround for you?

LC_CTYPE would not be a workaround - this database needs to be in
utf8, since the full text search is to be used for a MediaWiki. Is this a
bug that is being worked on?

Regards
John

