Re: [HACKERS] Extending range of to_tsvector et al
john knightley john.knight...@gmail.com writes:
> On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott deni...@gmail.com wrote:
>> So... perhaps LC_CTYPE=C is a possible workaround for you?

> LC_CTYPE would not be a workaround - this database needs to be in UTF-8; the full text search is to be used for a MediaWiki.

You're confusing locale and encoding. They are different things.

> Is this a bug that is being worked on?

No. As I already tried to explain to you, this behavior is not determined by Postgres, it's determined by the platform's locale support. You need to complain to your OS vendor.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
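The locale/encoding distinction drawn above can be illustrated with a small C sketch. It is not taken from PostgreSQL; the helper names utf8_decode and is_letter_in_locale are hypothetical, and the locale step assumes a platform where wchar_t values are Unicode code points (as on glibc). Decoding bytes into a code point is purely an encoding matter, while the "is this a letter?" question is answered by the platform's LC_CTYPE data.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

/*
 * Encoding step: decode one UTF-8 sequence into a Unicode code point.
 * Returns the number of bytes consumed, or 0 if the sequence is truncated
 * or structurally malformed. Minimal sketch: overlong forms and surrogates
 * are not rejected. Nothing here consults the locale.
 */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    size_t need;
    uint32_t c = s[0];

    if (c < 0x80)                { need = 1; }
    else if ((c & 0xE0) == 0xC0) { need = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { need = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { need = 4; c &= 0x07; }
    else return 0;               /* stray continuation or invalid lead byte */

    if (len < need)
        return 0;
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;            /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return need;
}

/*
 * Locale step: ask the platform whether a code point is a letter under the
 * named LC_CTYPE locale. Returns 1 or 0, or -1 if the locale is not
 * installed. Note that setlocale() changes process-wide state.
 * Assumption: wchar_t holds Unicode code points (true on glibc).
 */
static int is_letter_in_locale(const char *locale_name, uint32_t cp)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return iswalpha((wint_t) cp) != 0;
}
```

Calling is_letter_in_locale("en_US.UTF-8", cp) for the code points in question would show what the platform's locale data says about them; the answer can differ between operating systems and libc versions even though the UTF-8 decoding is identical everywhere.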
Re: [HACKERS] Extending range of to_tsvector et al
On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 john.knight...@gmail.com wrote:
> When using to_tsvector a number of newer unicode characters and pua characters are not included. How do I add the characters which I desire to be found?

I've just started digging into this code a bit, but from what I've found src/backend/tsearch/wparser_def.c defines much of the parser functionality, and in the area of Unicode includes a number of comments like:

 * with multibyte encoding and C-locale isw* function may fail or give wrong result.
 * multibyte encoding and C-locale often are used for Asian languages.
 * any non-ascii symbol with multibyte encoding with C-locale is an alpha character

... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if WCSTOMBS and TOWLOWER are available) to complicate testing scenarios :)

Also note that src/test/regress/sql/tsearch.sql and regress/sql/tsdicts.sql currently focus on English, ASCII-only data.

Perhaps this is a good opportunity for you to describe what your environment looks like (OS, PostgreSQL version, encoding and locale settings for the database) and show some sample to_tsquery() @@ to_tsvector() queries that don't behave the way you think they should behave - and we could start building some test cases as a first step?

-- 
Dan Scott
Laurentian University
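The rule described in the last of the quoted comments can be sketched roughly as follows. This is an illustration only, not the code in wparser_def.c: the function name sketch_is_alpha and the locale_is_c flag are hypothetical, and the non-C branch assumes wchar_t values are Unicode code points.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <wctype.h>

/*
 * Sketch of the classification rule from the quoted comments:
 *   - under the C locale with a multibyte encoding, the isw* functions
 *     cannot be trusted, so every non-ASCII code point is treated as a
 *     letter and ASCII goes through the plain isalpha();
 *   - under any other locale, the platform's iswalpha() decides.
 * Returns 1 for "letter", 0 otherwise.
 */
static int sketch_is_alpha(uint32_t cp, int locale_is_c)
{
    assert(cp <= 0x10FFFF);          /* must be a valid Unicode code point */

    if (locale_is_c) {
        if (cp > 0x7F)
            return 1;                /* non-ASCII: assumed to be a letter */
        return isalpha((int) cp) != 0;
    }
    /* Result depends entirely on the platform's locale definition. */
    return iswalpha((wint_t) cp) != 0;
}
```

Under such a rule a private-use or newly assigned code point would count as a letter whenever the C locale is in effect, whereas under a locale such as en_US.UTF-8 the outcome depends on whether the platform's locale data classifies that code point as alphabetic.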
Re: [HACKERS] Extending range of to_tsvector et al
Dear Dan,

thank you for your reply.

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with a UTF-8 locale.

A short 5-line dictionary file is sufficient to test:

raeuz
我们
昭厵
꽖떂
撘䮬

line 1: raeuz, a Zhuang word written using English letters; shows up under to_tsvector OK
line 2: 我们, an everyday Chinese word; shows up under to_tsvector OK
line 3: 昭厵, a Zhuang word written using rather old Chinese characters found in Unicode 3.1, which came in about the year 2000; shows up under to_tsvector OK
line 4: 꽖떂, a Zhuang word written using rather old Chinese characters found in Unicode 5.2, which came in about the year 2009; does not show up under to_tsvector
line 5: 撘䮬, a Zhuang word written using rather old Chinese characters found in the PUA area of the font Sawndip.ttf; does not show up under to_tsvector

(Font can be downloaded from http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though included in a dictionary, do not get accepted by to_tsvector.

Regards
John

On Mon, Oct 1, 2012 at 11:04 AM, Dan Scott deni...@gmail.com wrote:
> On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 john.knight...@gmail.com wrote:
>> When using to_tsvector a number of newer unicode characters and pua characters are not included. How do I add the characters which I desire to be found?
>
> I've just started digging into this code a bit, but from what I've found src/backend/tsearch/wparser_def.c defines much of the parser functionality, and in the area of Unicode includes a number of comments like:
>
>  * with multibyte encoding and C-locale isw* function may fail or give wrong result.
>  * multibyte encoding and C-locale often are used for Asian languages.
>  * any non-ascii symbol with multibyte encoding with C-locale is an alpha character
>
> ... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if WCSTOMBS and TOWLOWER are available) to complicate testing scenarios :)
>
> Also note that src/test/regress/sql/tsearch.sql and regress/sql/tsdicts.sql currently focus on English, ASCII-only data.
>
> Perhaps this is a good opportunity for you to describe what your environment looks like (OS, PostgreSQL version, encoding and locale settings for the database) and show some sample to_tsquery() @@ to_tsvector() queries that don't behave the way you think they should behave - and we could start building some test cases as a first step?
>
> --
> Dan Scott
> Laurentian University
Re: [HACKERS] Extending range of to_tsvector et al
Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley john.knight...@gmail.com wrote:
> Dear Dan,
>
> thank you for your reply.
>
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with a UTF-8 locale.
>
> A short 5-line dictionary file is sufficient to test:
>
> raeuz
> 我们
> 昭厵
> 꽖떂
> 撘䮬
>
> line 1: raeuz, a Zhuang word written using English letters; shows up under to_tsvector OK
> line 2: 我们, an everyday Chinese word; shows up under to_tsvector OK
> line 3: 昭厵, a Zhuang word written using rather old Chinese characters found in Unicode 3.1, which came in about the year 2000; shows up under to_tsvector OK
> line 4: 꽖떂, a Zhuang word written using rather old Chinese characters found in Unicode 5.2, which came in about the year 2009; does not show up under to_tsvector
> line 5: 撘䮬, a Zhuang word written using rather old Chinese characters found in the PUA area of the font Sawndip.ttf; does not show up under to_tsvector
>
> (Font can be downloaded from http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> The last two words, even though included in a dictionary, do not get accepted by to_tsvector.

Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to work using the default text search configuration (albeit with one crucial note: I created the database with the lc_ctype=C lc_collate=C options):

WORKING:

createdb --template=template0 --lc-ctype=C --lc-collate=C foobar

foobar=# select ts_debug('撘䮬');
 ts_debug
----------
 (word,Word, all letters,撘䮬,{english_stem},english_stem,{撘䮬})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
 lc_ctype
-------------
 en_US.UTF-8
(1 row)

foobaz=# select ts_debug('撘䮬');
 ts_debug
----------
 (blank,Space symbols,撘䮬,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?
Re: [HACKERS] Extending range of to_tsvector et al
john knightley john.knight...@gmail.com writes:
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with a UTF-8 locale.
>
> A short 5-line dictionary file is sufficient to test:
>
> raeuz
> 我们
> 昭厵
> 꽖떂
> 撘䮬
>
> line 1: raeuz, a Zhuang word written using English letters; shows up under to_tsvector OK
> line 2: 我们, an everyday Chinese word; shows up under to_tsvector OK
> line 3: 昭厵, a Zhuang word written using rather old Chinese characters found in Unicode 3.1, which came in about the year 2000; shows up under to_tsvector OK
> line 4: 꽖떂, a Zhuang word written using rather old Chinese characters found in Unicode 5.2, which came in about the year 2009; does not show up under to_tsvector
> line 5: 撘䮬, a Zhuang word written using rather old Chinese characters found in the PUA area of the font Sawndip.ttf; does not show up under to_tsvector
>
> (Font can be downloaded from http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

AFAIK there is nothing in Postgres itself that would distinguish, say, 昭 from 꽖. I think this must be down to your platform's locale definition: it probably thinks that the former is a letter and the latter is not. You'd have to gripe to the locale maintainers to get that fixed.

regards, tom lane
Re: [HACKERS] Extending range of to_tsvector et al
On Mon, Oct 1, 2012 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> john knightley john.knight...@gmail.com writes:
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with a UTF-8 locale.
>>
>> A short 5-line dictionary file is sufficient to test:
>>
>> raeuz
>> 我们
>> 昭厵
>> 꽖떂
>> 撘䮬
>>
>> line 1: raeuz, a Zhuang word written using English letters; shows up under to_tsvector OK
>> line 2: 我们, an everyday Chinese word; shows up under to_tsvector OK
>> line 3: 昭厵, a Zhuang word written using rather old Chinese characters found in Unicode 3.1, which came in about the year 2000; shows up under to_tsvector OK
>> line 4: 꽖떂, a Zhuang word written using rather old Chinese characters found in Unicode 5.2, which came in about the year 2009; does not show up under to_tsvector
>> line 5: 撘䮬, a Zhuang word written using rather old Chinese characters found in the PUA area of the font Sawndip.ttf; does not show up under to_tsvector
>>
>> (Font can be downloaded from http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> AFAIK there is nothing in Postgres itself that would distinguish, say, 昭 from 꽖. I think this must be down to your platform's locale definition: it probably thinks that the former is a letter and the latter is not. You'd have to gripe to the locale maintainers to get that fixed.
>
> regards, tom lane

PostgreSQL in general does not usually distinguish, but full text search does:

select ts_debug('昭 from 꽖');

gives the result:

 ts_debug
----------
 (word,Word, all letters,昭,{english_stem},english_stem,{昭})
 (blank,Space symbols, ,{},,)
 (asciiword,Word, all ASCII,from,{english_stem},english_stem,{})
 (blank,Space symbols, 꽖,{},,)
(4 rows)

Somewhere there is a dictionary or library that is based on roughly Unicode 4.0, which includes 昭 (U+2662D) but not 떂 (U+2B582), which is Unicode 5.2. PUA characters are also dropped in the same way by the full text search, which is what Google does but not what I want.

Regards
John
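The two code points cited in the message above fall in different CJK extension blocks: U+2662D is in CJK Unified Ideographs Extension B (added in Unicode 3.1) and U+2B582 is in Extension C (added in Unicode 5.2). The following C sketch shows one way to sort code points into those ranges and the private-use areas. The type and function names are hypothetical, and the table covers only the blocks relevant to this thread; it says nothing about how any particular locale classifies them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    CP_BMP,                  /* U+0000..U+FFFF, outside the PUA */
    CP_CJK_EXT_B,            /* U+20000..U+2A6DF, Unicode 3.1 */
    CP_CJK_EXT_C,            /* U+2A700..U+2B73F, Unicode 5.2 */
    CP_OTHER_SUPPLEMENTARY,  /* any other code point above U+FFFF */
    CP_PRIVATE_USE,          /* BMP PUA and planes 15/16 PUA */
    CP_INVALID               /* surrogates or beyond U+10FFFF */
} cp_region;

/* Sort a Unicode code point into one of the regions above. */
static cp_region classify_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return CP_INVALID;
    if ((cp >= 0xE000 && cp <= 0xF8FF) ||
        (cp >= 0xF0000 && cp <= 0xFFFFD) ||
        (cp >= 0x100000 && cp <= 0x10FFFD))
        return CP_PRIVATE_USE;
    if (cp >= 0x20000 && cp <= 0x2A6DF)
        return CP_CJK_EXT_B;
    if (cp >= 0x2A700 && cp <= 0x2B73F)
        return CP_CJK_EXT_C;
    return cp <= 0xFFFF ? CP_BMP : CP_OTHER_SUPPLEMENTARY;
}

/* Human-readable label for a region, for use in diagnostics. */
static const char *region_name(cp_region r)
{
    static const char *const names[] = {
        "BMP", "CJK Extension B", "CJK Extension C",
        "other supplementary", "private use", "invalid"
    };
    assert((size_t) r < sizeof names / sizeof names[0]);
    return names[r];
}
```

A table like this only records when a code point entered the standard. Whether a given system treats it as a letter still depends on the version of the character data its locale definitions were built from, which is consistent with the explanation given earlier in the thread.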
Re: [HACKERS] Extending range of to_tsvector et al
On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott deni...@gmail.com wrote:
> Hi John:
>
> On Sun, Sep 30, 2012 at 11:45 PM, john knightley john.knight...@gmail.com wrote:
>> Dear Dan,
>>
>> thank you for your reply.
>>
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with a UTF-8 locale.
>>
>> A short 5-line dictionary file is sufficient to test:
>>
>> raeuz
>> 我们
>> 昭厵
>> 꽖떂
>> 撘䮬
>>
>> line 1: raeuz, a Zhuang word written using English letters; shows up under to_tsvector OK
>> line 2: 我们, an everyday Chinese word; shows up under to_tsvector OK
>> line 3: 昭厵, a Zhuang word written using rather old Chinese characters found in Unicode 3.1, which came in about the year 2000; shows up under to_tsvector OK
>> line 4: 꽖떂, a Zhuang word written using rather old Chinese characters found in Unicode 5.2, which came in about the year 2009; does not show up under to_tsvector
>> line 5: 撘䮬, a Zhuang word written using rather old Chinese characters found in the PUA area of the font Sawndip.ttf; does not show up under to_tsvector
>>
>> (Font can be downloaded from http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>>
>> The last two words, even though included in a dictionary, do not get accepted by to_tsvector.
>
> Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to work using the default text search configuration (albeit with one crucial note: I created the database with the lc_ctype=C lc_collate=C options):
>
> WORKING:
>
> createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
>
> foobar=# select ts_debug('撘䮬');
>  ts_debug
> ----------
>  (word,Word, all letters,撘䮬,{english_stem},english_stem,{撘䮬})
> (1 row)
>
> NOT WORKING AS EXPECTED:
>
> foobaz=# SHOW LC_CTYPE;
>  lc_ctype
> -------------
>  en_US.UTF-8
> (1 row)
>
> foobaz=# select ts_debug('撘䮬');
>  ts_debug
> ----------
>  (blank,Space symbols,撘䮬,{},,)
> (1 row)
>
> So... perhaps LC_CTYPE=C is a possible workaround for you?

LC_CTYPE would not be a workaround - this database needs to be in UTF-8; the full text search is to be used for a MediaWiki.

Is this a bug that is being worked on?

Regards
John