Re: [HACKERS] unaccent extension missing some accents

2011-11-10 Thread Bruce Momjian
Tom Lane wrote:
 J Smith dark.panda+li...@gmail.com writes:
  I've attached a patch against master for unaccent.c that uses swscanf
  along with char2wchar and wchar2char instead of sscanf directly to
  initialize the unaccent extension and it appears to fix the problem in
  both the master and 9.1 branches.
 
 swscanf doesn't seem like an acceptable approach: it's a function that
 is relied on nowhere else in PG, so it adds new portability risks of its
 own.  It doesn't exist on some platforms that we support (like the one
 I'm typing this message on) and there's no real good reason to assume
 that it's not broken in its own ways on others.
 
 If you really want to pursue this, I'd suggest parsing the line
 manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
 be very many more lines than what you've got here.
 
 However, the bigger picture is that OS X's UTF8 locales are broken
 through-and-through, and most of their other problems are not feasible
 to work around.  So basically you can't use them for anything
 interesting, and it's not clear that it's worth putting any time into
 solving individual problems.  In the particular case here, the issue
 presumably is that sscanf is relying on isspace() ... but we rely on
 isspace() directly, in quite a lot of places, so how much is it going
 to fix to dodge it right here?

If Apple's low-level code came from FreeBSD and NetBSD, how did they get
so broken?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-10 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Tom Lane wrote:
 However, the bigger picture is that OS X's UTF8 locales are broken
 through-and-through, and most of their other problems are not feasible
 to work around.

 If Apple's low-level code came from FreeBSD and NetBSD, how did they get
 so broken?

AFAIK, they're broken in the BSDen too, or at least were when Apple
branched off from whichever BSD they started from (which was years ago).
There may be a better solution available upstream by now, but it doesn't
appear to me that Apple has any interest in fixing this area.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread Tom Lane
J Smith dark.panda+li...@gmail.com writes:
 Alright, I wrote up another patch that uses strchr to parse out the
 lines of the unaccent.rules file, foregoing sscanf completely.
 Hopefully this looks a bit better than using swscanf.

I looked at this a bit and realized that sscanf is actually doing a
couple of critical things for us, which are lost in translation when
doing it like this:

1. It ignores whitespace other than the dividing tab.  If we don't
continue to do that, we'll likely break existing config files.

2. It ensures that src and trg each consist of at least one (nonblank)
character.  placeChar() is critically dependent on the assumption that
src is not empty.

However, after looking around a bit at the other tsearch config-file-
reading functions, I noted that they all use t_isspace() to identify
whitespace ... and that function in fact should be okay on OS X,
because it uses iswspace in multibyte encodings.

So it's fairly simple to improve this code to reject whitespace that
way.  I don't like the existing code anyway because of its potential
vulnerability to buffer overrun.  I'll fix it up and commit.

 As for the other problems with isspace and such on OSX, it might be
 worth looking at the python portability fixes.

If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
work), I might be tempted to try to do it that way, but I still fail
to see the point.  After reviewing the code I feel that unaccent needs
to be fixed because it's not consistent with the other tsearch config
file parsers, and not so much because it works or doesn't work on any
specific platform.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread Florian Pflug
On Nov7, 2011, at 17:46 , J Smith wrote:
 On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
 work), I might be tempted to try to do it that way, but I still fail
 to see the point.  After reviewing the code I feel that unaccent needs
 to be fixed because it's not consistent with the other tsearch config
 file parsers, and not so much because it works or doesn't work on any
 specific platform.
 
 Yeah, I never knew there was such a problem with OSX and UTF8 before
 running into it here but it's good to know.

Various issues with OSX and UTF-8 locales seems to come up quite often, yet
we're not really in a position to do anything about them.

Thus, I think we should warn about these issues and save people the trouble
of finding out about this the hard way. The only question is, what would be
a good place for such a warning?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread J Smith
On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I looked at this a bit and realized that sscanf is actually doing a
 couple of critical things for us, which are lost in translation when
 doing it like this:

 1. It ignores whitespace other than the dividing tab.  If we don't
 continue to do that, we'll likely break existing config files.

 2. It ensures that src and trg each consist of at least one (nonblank)
 character.  placeChar() is critically dependent on the assumption that
 src is not empty.

 However, after looking around a bit at the other tsearch config-file-
 reading functions, I noted that they all use t_isspace() to identify
 whitespace ... and that function in fact should be okay on OS X,
 because it uses iswspace in multibyte encodings.

 So it's fairly simple to improve this code to reject whitespace that
 way.  I don't like the existing code anyway because of its potential
 vulnerability to buffer overrun.  I'll fix it up and commit.

 As for the other problems with isspace and such on OSX, it might be
 worth looking at the python portability fixes.

 If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
 work), I might be tempted to try to do it that way, but I still fail
 to see the point.  After reviewing the code I feel that unaccent needs
 to be fixed because it's not consistent with the other tsearch config
 file parsers, and not so much because it works or doesn't work on any
 specific platform.


Yeah, I never knew there was such a problem with OSX and UTF8 before
running into it here but it's good to know. When I noticed the
unnaccent extension in more recent PostgreSQL versions, I figured it
would perform better than our current plperl-based accent stripping
function (which it surely does) and just noticed the results on my
machine were a little off, but our linux-based servers were fine and
dandy and yadda yadda yadda.

Anyways, lemme know if there's anything else I could help with or
could test and whatnot. Cheers.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread Tom Lane
J Smith dark.panda+li...@gmail.com writes:
 Anyways, lemme know if there's anything else I could help with or
 could test and whatnot. Cheers.

If you have time to check that the patch I just committed fixes your
problem, it'd be worth doing.  I did not test it on OS X ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread J Smith
On Mon, Nov 7, 2011 at 11:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 If you have time to check that the patch I just committed fixes your
 problem, it'd be worth doing.  I did not test it on OS X ...

Looks good to me, thanks.

Would it even really be worth it to look into any of the other locale
issues on OSX, given that PostgreSQL is now included in their default
installs starting with 10.7, or would this really be more of a case of
hoping Apple some day fixes the issue upstream? It doesn't seem like
that's going to happen any time soon, mind you, as apparently this is
a rather long-standing issue in their libc, but who knows. I know OSX
isn't exactly the most popular database server OS out there, but it
seems to be becoming more popular for developers these days at the
very least.

Anyways, cheers, and thanks again.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread Tom Lane
J Smith dark.panda+li...@gmail.com writes:
 Would it even really be worth it to look into any of the other locale
 issues on OSX, given that PostgreSQL is now included in their default
 installs starting with 10.7, or would this really be more of a case of
 hoping Apple some day fixes the issue upstream?

To my mind, the killer issue is the incorrect sorting --- there's
basically no way we can work around that.  If Apple were to fix that,
we might be able to nibble at the margins of the other stuff.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread J Smith
On Mon, Nov 7, 2011 at 11:53 AM, Florian Pflug f...@phlo.org wrote:

 Various issues with OSX and UTF-8 locales seems to come up quite often, yet
 we're not really in a position to do anything about them.

 Thus, I think we should warn about these issues and save people the trouble
 of finding out about this the hard way. The only question is, what would be
 a good place for such a warning?


Hmm, I suppose one place could be on initdb if initializing with a
UTF-8 locale, although that only really helps with fixed settings like
LC_COLLATE and LC_CTYPE and the defaults for the user-configurable
settings, right? For the configurable settings I guess logging as
warnings on server start up or when they're changed via SET. But then
there's all of the other places, like when using COLLATE and using
locale-aware functions and so on and so forth. It would be a lot to
cover, I'll bet.

I guess maybe initdb, start up and SET warnings and a note in the docs
would hopefully suffice?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
G'day list.

I've been messing around with the unaccent extension and I've noticed
that some of the characters listed in the unaccent.rules file aren't
actually being unaccented on my system.

Here are the system details and whatnot.

- OSX 10.7.2

- the server is compiled via macports. Tried using both gcc and llvm
4.2.1 compilers that come with the latest version of XCode.

- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried
building manually from the latest REL9_1_STABLE branch from git to
make sure macports wasn't the problem, but I'm getting the same
results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the
following warnings in the log file:

===
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 8 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
à  a

WARNING:  duplicate TO argument, use first one
CONTEXT:  line 57 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
Ġ  G

WARNING:  duplicate TO argument, use first one
CONTEXT:  line 144 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
Š  S

===

I've dug around through the unaccent code a bit and I've noticed that
the sscanf it does when reading the file is producing some odd output.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
Gah! Accidentally hit Send. Let me finish that last message before
sending this time!


G'day list.

I've been messing around with the unaccent extension and I've noticed
that some of the characters listed in the unaccent.rules file aren't
actually being unaccented on my system.

Here are the system details and whatnot.

- OSX 10.7.2

- the server is compiled via macports. Tried using both gcc and llvm
4.2.1 compilers that come with the latest version of XCode.

- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried
building manually from the latest REL9_1_STABLE branch from git to
make sure macports wasn't the problem, but I'm getting the same
results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the
following warnings in the log file:

===
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 8 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
à      a
       
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 57 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
Ġ      G
       
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 144 of configuration file
/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules:
Š      S
       
===

I've dug around through the unaccent.c code a bit and I've noticed
that the sscanf it does when reading the file is producing some odd
output. I've tried with a minimal example using the same sort of
sscanf code reading from the same unaccent.rules file, but the minimal
example doesn't produce the same output.

I put some elog debugging lines into unaccent.c and found that sscanf
sometimes reads the scanned line by finding only one byte for the for
the source character rather than the two required for the complete
UTF-8 code point. It appears that the following characters are causing
the problem, along with the code points and such:

'Å' = 'A' | c3,85 = 41
'à' = 'a' | c3,a0 = 61
'ą' = 'a' | c4,85 = 61
'Ġ' = 'G' | c4,a0 = 47
'Ņ' = 'N' | c5,85 = 4e
'Š' = 'S' | c5,a0 = 53

In each case, one byte was being read in the source string rather than
two, leading to the duplicate TO warnings above. This later leads to
the characters that produced the warning being ignored when unaccent
is called and left in the output.

I haven't been able to reproduce in a smaller example, and haven't
been able to reproduce on a CentOS server, so at this point I'm at a
loss as to the problem.

Anybody got any ideas?

Cheers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread Florian Pflug
On Nov6, 2011, at 18:43 , J Smith wrote:
 I put some elog debugging lines into unaccent.c and found that sscanf
 sometimes reads the scanned line by finding only one byte for the for
 the source character rather than the two required for the complete
 UTF-8 code point. It appears that the following characters are causing
 the problem, along with the code points and such:
 
 'Å' = 'A' | c3,85 = 41
 'à' = 'a' | c3,a0 = 61
 'ą' = 'a' | c4,85 = 61
 'Ġ' = 'G' | c4,a0 = 47
 'Ņ' = 'N' | c5,85 = 4e
 'Š' = 'S' | c5,a0 = 53
 
 In each case, one byte was being read in the source string rather than
 two, leading to the duplicate TO warnings above. This later leads to
 the characters that produced the warning being ignored when unaccent
 is called and left in the output.

What's the locale of the database you're seeing this in, and which charset
does it use?

I think scanf() uses isspace() and friends, and last time I looked the
locale definitions where all pretty bogus on OSX. So maybe scanf() somehow
decides that 0xA0 is whitespace.

 I haven't been able to reproduce in a smaller example, and haven't
 been able to reproduce on a CentOS server, so at this point I'm at a
 loss as to the problem.

Have you tried to set the same locale as postgres (using setlocale()) in
your tests?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
On Sun, Nov 6, 2011 at 1:18 PM, Florian Pflug f...@phlo.org wrote:

 What's the locale of the database you're seeing this in, and which charset
 does it use?

 I think scanf() uses isspace() and friends, and last time I looked the
 locale definitions where all pretty bogus on OSX. So maybe scanf() somehow
 decides that 0xA0 is whitespace.


Ah, that does it: the locale I was using in the test code was just
plain ol' C locale, while in the database it was en_CA.UTF-8. Changing
the locale in my test code produced the wonky results. I should have
figured it was a locale problem. Sure enough, in a UTF-8 locale, it
believes that both 0xa0 and 0x85 are spaces. Pretty wonky behaviour
indeed.

Apparently this is a known OSX issue that has its roots in and older
version of FreeBSD's libc I guess, eh? I've found various bug reports
that allude to the problem and they all seem to point that way.

I've attached a patch against master for unaccent.c that uses swscanf
along with char2wchar and wchar2char instead of sscanf directly to
initialize the unaccent extension and it appears to fix the problem in
both the master and 9.1 branches.

I haven't added any tests in the expected output file 'cause I'm not
exactly sure what I should be testing against, but I could take a
crack at that, too, if the patch looks reasonable and is usable.

Cheers.


0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread Tom Lane
J Smith dark.panda+li...@gmail.com writes:
 I've attached a patch against master for unaccent.c that uses swscanf
 along with char2wchar and wchar2char instead of sscanf directly to
 initialize the unaccent extension and it appears to fix the problem in
 both the master and 9.1 branches.

swscanf doesn't seem like an acceptable approach: it's a function that
is relied on nowhere else in PG, so it adds new portability risks of its
own.  It doesn't exist on some platforms that we support (like the one
I'm typing this message on) and there's no real good reason to assume
that it's not broken in its own ways on others.

If you really want to pursue this, I'd suggest parsing the line
manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
be very many more lines than what you've got here.

However, the bigger picture is that OS X's UTF8 locales are broken
through-and-through, and most of their other problems are not feasible
to work around.  So basically you can't use them for anything
interesting, and it's not clear that it's worth putting any time into
solving individual problems.  In the particular case here, the issue
presumably is that sscanf is relying on isspace() ... but we rely on
isspace() directly, in quite a lot of places, so how much is it going
to fix to dodge it right here?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
On 2011-11-06, at 7:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 swscanf doesn't seem like an acceptable approach: it's a function that
 is relied on nowhere else in PG, so it adds new portability risks of its
 own.  It doesn't exist on some platforms that we support (like the one
 I'm typing this message on) and there's no real good reason to assume
 that it's not broken in its own ways on others.

 If you really want to pursue this, I'd suggest parsing the line
 manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
 be very many more lines than what you've got here.

 However, the bigger picture is that OS X's UTF8 locales are broken
 through-and-through, and most of their other problems are not feasible
 to work around.  So basically you can't use them for anything
 interesting, and it's not clear that it's worth putting any time into
 solving individual problems.  In the particular case here, the issue
 presumably is that sscanf is relying on isspace() ... but we rely on
 isspace() directly, in quite a lot of places, so how much is it going
 to fix to dodge it right here?

regards, tom lane

There are some fixes for isspace and friend that I've seen python
using so perhaps in those cases a similar fix could be applied. For
instance, maybe something like the code around line 674 here:

http://svn.python.org/view/python/trunk/Include/pyport.h?revision=81029view=markup

Perhaps that would be suitable on OSX at least in the case of isspace
et al. As far as I can tell scanf doesn't seem to use isspace on my
system so that would only be a partial fix for this an whatever other
situations isspace is used in. (on a mobile now so I can't check a the
moment.)

This isn't really a huge deal for me but I'll try to get a chance to
write up a little parser anyways just for kicks.

Cheers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
Alright, I wrote up another patch that uses strchr to parse out the
lines of the unaccent.rules file, foregoing sscanf completely.
Hopefully this looks a bit better than using swscanf.

As for the other problems with isspace and such on OSX, it might be
worth looking at the python portability fixes. I played briefly with
the isspace and friends macros they have and they looked okay, but I
certainly can't speak for how well they'd work for the rest of the
PostgreSQL code base.

Cheers.


0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers