(This is a cross-post from Stack Exchange, where it is not getting much traction.)
On my Mac install of PG:
```
=# select to_tsvector('english', 'abcd สวัสดี');
 to_tsvector
-------------
 'abcd':1
(1 row)
=# select * from ts_debug('hello สวัสดี');
   alias   |   description   | token  |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+--------+----------------+--------------+---------
 asciiword | Word, all ASCII | hello  | {english_stem} | english_stem | {hello}
 blank     | Space symbols   | สวัสดี | {}             |              |
(2 rows)
```
On my Linux install of PG:
```
=# select to_tsvector('english', 'abcd สวัสดี');
     to_tsvector
---------------------
 'abcd':1 'สวัสดี':2
(1 row)
=# select * from ts_debug('hello สวัสดี');
   alias   |    description    | token  |  dictionaries  |  dictionary  | lexemes
-----------+-------------------+--------+----------------+--------------+----------
 asciiword | Word, all ASCII   | hello  | {english_stem} | english_stem | {hello}
 blank     | Space symbols     |        | {}             |              |
 word      | Word, all letters | สวัสดี | {english_stem} | english_stem | {สวัสดี}
(3 rows)
```
So something is clearly different about the way tokenisation is
defined in PG on the two machines. My question is: how do I figure out
what is different, and how do I make my Mac install of PG work like the
Linux one?
On both installs:
```
# SHOW default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.english
(1 row)
# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)
```
So somehow this Mac install thinks that Thai letters are spaces. How
do I debug this and fix the "Space symbols" definition here?
Interestingly, this install works with Armenian but falls over when we
reach Hebrew:
```
=# select * from ts_debug('ԵԵԵ');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | ԵԵԵ   | {english_stem} | english_stem | {եեե}
(1 row)
=# select * from ts_debug('אאא');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | אאא   | {}           |            |
(1 row)
```
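In case it helps narrow this down, these are the other things I can think of comparing between the two installs. This is only a sketch of a checklist: I am assuming the difference, if any, would show up in the server build, the database's encoding and locale, or the parser attached to the `english` configuration, and I have not confirmed that any of these actually differ.

```
-- Server version and build platform
SELECT version();

-- Encoding and locale the current database was created with
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();

-- Parser used by the 'english' text search configuration
SELECT c.cfgname, p.prsname
FROM pg_ts_config c
JOIN pg_ts_parser p ON p.oid = c.cfgparser
WHERE c.cfgname = 'english';

-- Raw parser output with no dictionaries involved;
-- the tokid values map to the rows of ts_token_type('default')
SELECT * FROM ts_parse('default', 'hello สวัสดี');
SELECT * FROM ts_token_type('default');
```

If the first three queries return the same thing on both machines but `ts_parse` still assigns different token types, that would at least suggest the difference sits below the PG configuration level rather than in it.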