Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-11 Thread Artur Zakirov
On 11.02.2016 01:19, Tom Lane wrote: I wrote: Artur Zakirov writes: I think this is not a bug. It is a normal behavior. In Mac OS sscanf() with the %s format reads the string one character at a time. The size of letter 'х' is 2. And sscanf() separate it into two

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-11 Thread Artur Zakirov
On 11.02.2016 03:33, Tom Lane wrote: Artur Zakirov writes: [ tsearch_aff_parse_v1.patch ] I've pushed this with some corrections --- notably, I did not like the lack of buffer-overrun prevention, and it did the wrong thing if a line had more than one trailing

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Teodor Sigaev
It seems that *scanf() with the %s format occurs only here:
- check.c - get_bin_version()
- server.c - get_major_server_version()
- filemap.c - isRelDataFile()
- pg_backup_directory.c - _LoadBlobs()
- xlog.c - do_pg_stop_backup()
- mac.c - macaddr_in()
I think here sscanf() does not work with the

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Artur Zakirov
On 09.02.2016 20:13, Tom Lane wrote: I do not like this patch much. It is basically "let's stop using sscanf() because it seems to have a bug on one platform". There are at least two things wrong with that approach: 1. By my count there are about 80 uses of *scanf() in our code. Are we going

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Tom Lane
Artur Zakirov writes: > I agree that previous patch is wrong. Instead of using new > parse_ooaffentry() function maybe better to use sscanf() with %ls > format. The %ls format is used to read a wide character string. No, that way is going to give you worse portability

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Artur Zakirov
On 10.02.2016 18:51, Teodor Sigaev wrote: Hmm. Here, in src/backend/access/transam/xlog.c, read_tablespace_map() using %s in scanf looks suspicious. I don't fully understand it, but it looks like it tries to read the oid as a string. So it should be safe in the usual case. Next, _LoadBlobs() reads filename

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Tom Lane
I wrote: > Artur Zakirov writes: >> I think this is not a bug. It is a normal behavior. In Mac OS sscanf() >> with the %s format reads the string one character at a time. The size of >> letter 'х' is 2. And sscanf() separate it into two wrong characters. > That

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Larry Rosenman
On 2016-02-10 16:19, Tom Lane wrote: I wrote: Artur Zakirov writes: I think this is not a bug. It is a normal behavior. In Mac OS sscanf() with the %s format reads the string one character at a time. The size of letter 'х' is 2. And sscanf() separate it into two

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Tom Lane
Larry Rosenman writes: > On 2016-02-10 16:19, Tom Lane wrote: >> I looked into the OS X sources, and found that indeed you are right: >> *scanf processes the input a byte at a time, and applies isspace() to >> each byte separately, even when the locale is such that that's a >>
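The byte-at-a-time behaviour described above can be sketched in a few lines of C. This is only an illustration, not the OS X libc code: bytewise_isspace() and scan_token_bytewise() are hypothetical names, and the whitespace set (ASCII whitespace plus 0x85 and 0xA0) is an assumption taken from what the thread reports for UTF8-based locales. The letter 'х' (U+0445) is encoded in UTF-8 as the bytes 0xD1 0x85, so a scanner that tests each byte separately stops in the middle of the character.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for isspace() as the thread describes it for
 * UTF-8 locales on OS X: ASCII whitespace plus the bytes 0x85 and 0xA0.
 */
static int
bytewise_isspace(unsigned char c)
{
    return c == ' ' || (c >= 0x09 && c <= 0x0D) || c == 0x85 || c == 0xA0;
}

/*
 * Mimic "%s": skip leading whitespace bytes, then copy bytes into dst
 * until the next whitespace byte, the end of input, or a full buffer.
 * Returns the number of bytes stored, not counting the terminating NUL.
 */
static size_t
scan_token_bytewise(const char *src, char *dst, size_t dstsize)
{
    size_t      n = 0;

    if (dstsize == 0)
        return 0;
    while (*src && bytewise_isspace((unsigned char) *src))
        src++;
    while (*src && !bytewise_isspace((unsigned char) *src) && n + 1 < dstsize)
        dst[n++] = *src++;
    dst[n] = '\0';
    return n;
}
```

Feeding it the UTF-8 bytes of "ах" (0xD0 0xB0 0xD1 0x85) yields a three-byte token ending in a lone 0xD1 lead byte, which is the kind of truncated sequence that a later encoding check would reject as an invalid byte sequence for UTF8.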

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Tom Lane
Artur Zakirov writes: > [ tsearch_aff_parse_v1.patch ] I've pushed this with some corrections --- notably, I did not like the lack of buffer-overrun prevention, and it did the wrong thing if a line had more than one trailing space character. We still need to look at
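The two concerns raised here, buffer-overrun prevention and lines with more than one trailing space, can be illustrated with a small hand-written field splitter. This is a minimal sketch under stated assumptions, not the code that was pushed: ascii_isspace() and next_field() are hypothetical names, and treating only ASCII whitespace as a separator is a choice made for this example so that bytes of multibyte characters are never mistaken for spaces.

```c
#include <stddef.h>

/* Only ASCII whitespace separates fields; bytes >= 0x80 never do. */
static int
ascii_isspace(unsigned char c)
{
    return c == ' ' || (c >= 0x09 && c <= 0x0D);
}

/*
 * Copy the next whitespace-delimited field from *cursor into dst and
 * advance *cursor past it.  A field longer than the buffer is truncated
 * instead of overrunning dst (the cut may fall inside a multibyte
 * character; a real parser would have to handle that case).
 * Returns 1 if a field was found, 0 if only whitespace remained.
 */
static int
next_field(const char **cursor, char *dst, size_t dstsize)
{
    const char *p = *cursor;
    size_t      n = 0;

    if (dstsize == 0)
        return 0;
    while (*p && ascii_isspace((unsigned char) *p))
        p++;
    if (*p == '\0')
    {
        dst[0] = '\0';
        *cursor = p;
        return 0;
    }
    while (*p && !ascii_isspace((unsigned char) *p))
    {
        if (n + 1 < dstsize)    /* keep room for the terminating NUL */
            dst[n++] = *p;
        p++;
    }
    dst[n] = '\0';
    *cursor = p;
    return 1;
}
```

Calling next_field() repeatedly on a line such as "SFX A х 0" followed by several spaces and a newline returns the four fields and then 0, because any run of trailing whitespace is consumed before the end-of-line test.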

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Chapman Flack
On 02/10/16 17:19, Tom Lane wrote: > I also verified that in UTF8-based locales, isspace() thinks that 0x85 and > 0xA0, and no other high-bit-set values, are spaces. Not sure exactly why Unicode NEXT LINE (NEL) and NO-BREAK SPACE, respectively.

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Chapman Flack
On 02/10/16 23:55, Tom Lane wrote: > Yeah, I got that --- what seems squishier is that none of the other C1 > control characters are considered whitespace? That seems to be exactly the case: http://www.unicode.org/Public/5.2.0/ucd/PropList.txt 09..0D, 20, 85, and A0 are the only whitespace

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Tom Lane
Chapman Flack writes: > On 02/10/16 17:19, Tom Lane wrote: >> I also verified that in UTF8-based locales, isspace() thinks that 0x85 and >> 0xA0, and no other high-bit-set values, are spaces. Not sure exactly why > Unicode NEXT LINE (NEL) and NO-BREAK SPACE, respectively.

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Larry Rosenman
On 2016-02-10 17:00, Tom Lane wrote: Larry Rosenman writes: On 2016-02-10 16:19, Tom Lane wrote: I looked into the OS X sources, and found that indeed you are right: *scanf processes the input a byte at a time, and applies isspace() to each byte separately, even when the

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Tom Lane
Larry Rosenman writes: > If you want, file a bug at https://bugs.freebsd.org/bugzilla Probably not much point; the commit log shows pretty clearly that they have been thinking about the code's behavior with multibyte characters, so I assume they've intentionally decided to keep

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-09 Thread Tom Lane
Artur Zakirov writes: >> I think the NIImportOOAffixes() in spell.c should be corrected to avoid >> this bug. > I have attached a patch. It adds new functions parse_ooaffentry() and > get_nextentry() and fixes a couple comments. I do not like this patch much. It is

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-29 Thread Artur Zakirov
On 28.01.2016 17:42, Artur Zakirov wrote: On 27.01.2016 15:28, Artur Zakirov wrote: On 27.01.2016 14:14, Stas Kelvich wrote: Hi. I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene language filter? =) That can be reproduced with

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-28 Thread Artur Zakirov
On 27.01.2016 15:28, Artur Zakirov wrote: On 27.01.2016 14:14, Stas Kelvich wrote: Hi. I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene language filter? =) That can be reproduced with simpler test Stas The test program

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-27 Thread Artur Zakirov
On 27.01.2016 14:14, Stas Kelvich wrote: Hi. I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene language filter? =) That can be reproduced with simpler test Stas The test program was corrected. Now it uses wchar_t type. And

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-27 Thread Stas Kelvich
Hi. I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene language filter? =) That can be reproduced with simpler test Stas (attachment: test.c) > On 27 Jan 2016, at 13:59, Artur Zakirov

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-27 Thread Shulgin, Oleksandr
On Wed, Jan 27, 2016 at 10:59 AM, Artur Zakirov wrote: > Hello. > > When a user tries to create a text search dictionary for the Russian > language on Mac OS, the following error message is raised: > > CREATE EXTENSION hunspell_ru_ru; > + ERROR: invalid byte sequence

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-27 Thread Artur Zakirov
On 27.01.2016 13:46, Shulgin, Oleksandr wrote: Not sure why the file uses "SET KOI8-R" directive then? This directive is used only by the Hunspell program. PostgreSQL ignores this directive and assumes that the input affix and dictionary files are in the UTF-8 encoding. What error message do you