Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Alvaro Herrera
Gregory Stark wrote:
 
 Reading the commit message about the TZ encoding issue I'm curious why this
 isn't a more widespread problem. How does gettext know what encoding we want
 messages in? How do we prevent things like to_char(now(),'month') from
 producing strings in an encoding different from the database's encoding?

The PO files include encoding information, so it's easy for the server
to recode them from that to the server (or client) encoding, as
appropriate.

Of course, then it is up to the translator to get it right ... but I
think when he doesn't, people notice fairly quickly.

-- 
Alvaro Herrera  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Gregory Stark
Alvaro Herrera [EMAIL PROTECTED] writes:

 Gregory Stark wrote:
 
 Reading the commit message about the TZ encoding issue I'm curious why this
 isn't a more widespread problem. How does gettext know what encoding we want
 messages in? How do we prevent things like to_char(now(),'month') from
 producing strings in an encoding different from the database's encoding?

 The PO files include encoding information, so it's easy for the server
 to recode them from that to the server (or client) encoding, as
 appropriate.

So does the _() macro automatically recode it to the current server encoding?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Reading the commit message about the TZ encoding issue I'm curious why this
 isn't a more widespread problem. How does gettext know what encoding we want
 messages in? How do we prevent things like to_char(now(),'month') from
 producing strings in an encoding different from the database's encoding?

The short answer is it's all a house of cards, and if you troll
the archives you will find plenty of bug reports traceable to
misconfiguration in this area.  The recent attempt to enforce
that nl_langinfo(CODESET) matches the database encoding is a first
step towards making this more bulletproof, but we're finding out
that even that is harder than it looks.
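
As a rough illustration of that kind of check (a sketch only, not the
actual chklocale.c code), a program can ask the C library which codeset
LC_CTYPE implies and compare it loosely with the name it expects. The
helper names codeset_names_match and locale_codeset_is are hypothetical,
and ignoring case and punctuation is just one assumed way of bridging
spellings such as UTF-8 and utf8.

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Compare two codeset names, ignoring case and any non-alphanumeric
 * characters, so that "UTF-8" and "utf8" compare equal. */
int codeset_names_match(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;;)
    {
        while (*a && !isalnum((unsigned char) *a))
            a++;
        while (*b && !isalnum((unsigned char) *b))
            b++;
        if (*a == '\0' || *b == '\0')
            return *a == *b;        /* equal only if both ended together */
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
}

/* Return 1 if the codeset implied by `locale` for LC_CTYPE matches
 * `wanted`, 0 if it does not, and -1 if the locale cannot be selected. */
int locale_codeset_is(const char *locale, const char *wanted)
{
    if (setlocale(LC_CTYPE, locale) == NULL)
        return -1;
    return codeset_names_match(nl_langinfo(CODESET), wanted);
}
```

A caller might use locale_codeset_is("es_ES.utf8", "UTF8"); whether that
locale exists, and how its codeset is spelled, depends on the platform.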

regards, tom lane



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Alvaro Herrera
Gregory Stark wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
 
  Gregory Stark wrote:
  
  Reading the commit message about the TZ encoding issue I'm curious why this
  isn't a more widespread problem. How does gettext know what encoding we want
  messages in? How do we prevent things like to_char(now(),'month') from
  producing strings in an encoding different from the database's encoding?
 
  The PO files include encoding information, so it's easy for the server
  to recode them from that to the server (or client) encoding, as
  appropriate.
 
 So does the _() macro automatically recode it to the current server encoding?

Well, I'm not sure if it's _(), elog() or what, but it does get recoded.
If I have a different client_encoding and get a NOTICE, then both the
server and client get a message in the corresponding encoding.

In fact this is the reason for the most common PANIC: stack overflow
in elog.c error stack.  When a message needs to be recoded but the
recoding procedure errors out, it wants to report that and this one also
fails, you get infinite recursion and nothing can get reported.
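
The failure mode can be sketched in a few lines. This is only a toy
model, assuming a hypothetical report_message() whose recoding step
always fails; it is not elog.c. Where PostgreSQL ends in a PANIC, the
sketch breaks the cycle by printing the text unconverted once a depth
limit (an arbitrary value chosen here) is reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_REPORT_DEPTH 3          /* arbitrary limit for this sketch */

static int report_depth = 0;        /* current nesting of report_message() */
static int fallback_count = 0;      /* times recoding was skipped */

/* Stand-in for an encoding conversion that always fails. */
static int recode_message(const char *msg, char *out, size_t outlen)
{
    (void) msg;
    (void) out;
    (void) outlen;
    return -1;
}

/* Report a message, recoding it first.  A failed recoding is itself
 * reported, which is what makes unbounded recursion possible. */
void report_message(const char *msg)
{
    char buf[256];

    assert(msg != NULL);
    if (report_depth >= MAX_REPORT_DEPTH)
    {
        /* Too deep: emit the raw text instead of recursing again. */
        fallback_count++;
        fprintf(stderr, "%s\n", msg);
        return;
    }
    report_depth++;
    if (recode_message(msg, buf, sizeof buf) == 0)
        fprintf(stderr, "%s\n", buf);
    else
        report_message("could not convert message encoding");
    report_depth--;
}
```

Without the depth check, a single call such as report_message("hello")
would never return.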

-- 
Alvaro Herrera  http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 So does the _() macro automatically recode it to the current server encoding?

From the gettext manual:

---

gettext not only looks up a translation in a message catalog. It also
converts the translation on the fly to the desired output character
set. This is useful if the user is working in a different character set
than the translator who created the message catalog, because it avoids
distributing variants of message catalogs which differ only in the
character set.

The output character set is, by default, the value of nl_langinfo
(CODESET), which depends on the LC_CTYPE part of the current locale. But
programs which store strings in a locale independent way (e.g. UTF-8)
can request that gettext and related functions return the translations
in that encoding, by use of the bind_textdomain_codeset function.

---

We don't currently call bind_textdomain_codeset, in part because of the
lack of portability of names for codesets.
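
For reference, a call of the kind the manual describes might look like
the sketch below. The wrapper name and the domain name used in the usage
note are made up for illustration; the codeset string is passed through
as-is, which is exactly where the spelling-portability concern arises.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>

/* Ask gettext to return translations for `domain` in `codeset` rather
 * than in the codeset implied by LC_CTYPE.
 * Returns 0 on success, -1 if the C library refused the request. */
int request_message_codeset(const char *domain, const char *codeset)
{
    assert(domain != NULL && codeset != NULL);
    return bind_textdomain_codeset(domain, codeset) != NULL ? 0 : -1;
}
```

For example, request_message_codeset("demo", "UTF-8"). On systems where
gettext lives outside libc, linking with -lintl may be needed.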

regards, tom lane



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Gregory Stark
Alvaro Herrera [EMAIL PROTECTED] writes:

 Gregory Stark wrote:

 So does the _() macro automatically recode it to the current server encoding?

 Well, I'm not sure if it's _(), elog() or what, but it does get recoded.
 If I have a different client_encoding and get a NOTICE, then both the
 server and client get a message in the corresponding encoding.

Actually I was thinking about things like formatting.c which take localized
strings and return them as data which can end up in the database. If they're
in the wrong encoding then they'll be invalidly encoded strings in the
database.
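
To make the concern concrete: in a UTF8 database, a Latin-1 encoded
weekday name would fail a validity check like the one sketched here.
This is a minimal standalone validator written for illustration (the
name is_valid_utf8 is hypothetical), not the routine PostgreSQL uses.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the `len` bytes at `s` form well-formed UTF-8, else 0.
 * Overlong forms, surrogates and values above U+10FFFF are rejected. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < len)
    {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* range of 1st continuation */
        size_t n;                           /* continuation bytes needed */

        if (c < 0x80) { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) n = 1;
        else if (c >= 0xE0 && c <= 0xEF)
        {
            n = 2;
            if (c == 0xE0) lo = 0xA0;       /* no overlong 3-byte forms */
            if (c == 0xED) hi = 0x9F;       /* no UTF-16 surrogates */
        }
        else if (c >= 0xF0 && c <= 0xF4)
        {
            n = 3;
            if (c == 0xF0) lo = 0x90;       /* no overlong 4-byte forms */
            if (c == 0xF4) hi = 0x8F;       /* nothing above U+10FFFF */
        }
        else
            return 0;                       /* stray or invalid lead byte */

        if (len - i <= n)
            return 0;                       /* sequence is truncated */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return 0;
        for (size_t k = 2; k <= n; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        i += n + 1;
    }
    return 1;
}
```

The UTF-8 bytes of "miércoles" pass; the same word with a single 0xE9
byte for the accented letter, as Latin-1 would encode it, does not.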

 In fact this is the reason for the most common PANIC: stack overflow
 in elog.c error stack.  When a message needs to be recoded but the
 recoding procedure errors out, it wants to report that and this one also
 fails, you get infinite recursion and nothing can get reported.

Ouch


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Alvaro Herrera
Gregory Stark wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
 
  Gregory Stark wrote:
 
  So does the _() macro automatically recode it to the current server encoding?
 
  Well, I'm not sure if it's _(), elog() or what, but it does get recoded.
  If I have a different client_encoding and get a NOTICE, then both the
  server and client get a message in the corresponding encoding.
 
 Actually I was thinking about things like formatting.c which take localized
 strings and return them as data which can end up in the database. If they're
 in the wrong encoding then they'll be invalidly encoded strings in the
 database.

Oh, I didn't think of that.  Let me see if I can get an invalid string
into the database that way.

-- 
Alvaro Herrera  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Andrew Dunstan



Alvaro Herrera wrote:
  Actually I was thinking about things like formatting.c which take localized
  strings and return them as data which can end up in the database. If they're
  in the wrong encoding then they'll be invalidly encoded strings in the
  database.

 Oh, I didn't think of that.  Let me see if I can get an invalid string
 into the database that way.

I was quite certain when we closed most of these holes recently that we 
hadn't caught them all, so this wouldn't surprise me in the least.


cheers

andrew



Re: [HACKERS] Polymorphic arguments and composite types

2007-10-06 Thread Stephan Szabo
On Fri, 5 Oct 2007, Simon Riggs wrote:

 On Fri, 2007-10-05 at 11:24 -0700, Stephan Szabo wrote:
  On Fri, 5 Oct 2007, Simon Riggs wrote:
 
   On Fri, 2007-10-05 at 10:59 -0700, Stephan Szabo wrote:
On Fri, 5 Oct 2007, Simon Riggs wrote:
   
 On Fri, 2007-10-05 at 10:32 -0700, Stephan Szabo wrote:
  On Fri, 5 Oct 2007, Simon Riggs wrote:
 
   Because we already do exactly that here:
  
 select 1, (select col2 from c), 3;
  
   The inner select returns a ROW, yet we treat it as a single column
   value.
 
  The inner select does not return a row. It's not a row subquery, it's a
  scalar subquery.

 Thanks Stephan, Tom already explained that.

 My comments above were in response to "Why would you think that?"
   
Right, but I guess I couldn't see why you would consider that the same as
treating a rowtype as a scalar, because when I look at that my brain
converts that to a scalar subquery, so I guess I simply see the scalar.
If we supported select 1, (select 2,3), select 4 giving something like
(1,(2,3),4), I'd also have confusion over the case, but that should error.
  
   Well, my brain didn't... All I've said was that we should document it,
   to help those people that don't know their SQL standard as well as the
   best people on this list.
 
  Where would you document this beyond 4.2 though? While I don't exactly
  like the wording of 4.2.9, it seems like it's already trying to say that.

 Yeh, it does, but you're forgetting that my original complaint was that
 you couldn't use it in an ANY clause, which 4.2 does not exclude.
 Bearing in mind you can use a scalar subquery in lots of places, I
 thought it worth reporting.

Well, but I'd argue that we're now talking about separate issues.

The first is how scalar subqueries act, as far as not being a rowtype.

The second is related to the question of ANY and scalar subqueries
specifically.

The third is related to where you can use scalar subqueries.

 The ANY clause at 9.19.4 mentions a subquery, but doesn't say it can't
 be a scalar subquery; it doesn't restrict this to non-scalar subqueries.

While it's true that it isn't a scalar subquery (although it's not a
restriction on the kind of subquery, it's the definition of what (select
...) turn into when used there), I don't see how the text doesn't
basically say that op ANY (subquery returning a single array) works
the way it currently does.

I think it'd be more applicable to mention in the array one that using a
subquery as the right hand side turns it into the other form. I'm not
convinced it's necessary, but also I'd think that one general mention
would likely be better than separate ones in each of ANY and ALL.

It might be reasonable to try to note where subqueries are scalar
subqueries, but I think that'll be prone to being wrong or misinterpreted
as well.

 Searching in Arrays, 8.14.5 doesn't say it can't be a subquery either.

True, although I don't know if it's right to mention there since that
section appears to link to the other section saying that the other
section describes the method.

 Section 9.20.3 mentions ANY (array expression). The term array
 expression is not defined nor is there a link to where it is defined,
 nor is the term indexed.

I'm not sure why we're using a separate term for that.



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Alvaro Herrera
Andrew Dunstan wrote:


 Alvaro Herrera wrote:
 Actually I was thinking about things like formatting.c which take localized
 strings and return them as data which can end up in the database. If they're
 in the wrong encoding then they'll be invalidly encoded strings in the
 database.

 Oh, I didn't think of that.  Let me see if I can get an invalid string
 into the database that way.

 I was quite certain when we closed most of these holes recently that we 
 hadn't caught them all, so this wouldn't surprise me in the least.

It seems to work correctly:

alvherre=# drop table week;
DROP TABLE
alvherre=# create table week (a text);
CREATE TABLE
alvherre=# \encoding utf8
alvherre=# insert into week  select to_char(now()-'3 days'::interval, 'tmday');
INSERT 0 1
alvherre=# \encoding latin1
alvherre=# insert into week  select to_char(now()-'3 days'::interval, 'tmday');
INSERT 0 1
alvherre=# select * from week;
 a 
---
 miércoles
 miércoles
(2 lignes)

I tried on both a UTF8 and Latin1 terminal and it works OK in all cases.

-- 
Alvaro Herrera  http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Polymorphic arguments and composite types

2007-10-06 Thread Simon Riggs
On Sat, 2007-10-06 at 10:15 -0700, Stephan Szabo wrote:

  Yeh, it does, but you're forgetting that my original complaint was that
  you couldn't use it in an ANY clause, which 4.2 does not exclude.
  Bearing in mind you can use a scalar subquery in lots of places, I
  thought it worth reporting.
 
 Well, but I'd argue that we're now talking about separate issues.

It's simpler than that. I asked a question because the manual isn't
specific on my original point. I'll do a doc patch to make sure nobody
makes the same mistake I did and we record all the good points people
have made.

  Section 9.20.3 mentions ANY (array expression). The term array
  expression is not defined nor is there a link to where it is defined,
  nor is the term indexed.
 
 I'm not sure why we're using a separate term for that.

The term array expression is used in the manual, but not defined.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




[HACKERS] Windows and locales and UTF-8 (oh my)

2007-10-06 Thread Tom Lane
I've been learning much more than I wanted to know about $SUBJECT
since putting in the src/port/chklocale.c code to try to enforce
that our database encoding matches the system locale settings.
There's an ongoing thread in -patches that's been focused on
getting reasonable behavior from the point of view of the Far
Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)
Here's some more info from an off-list discussion with Dave Page:

--- Forwarded Messages

Date:Fri, 05 Oct 2007 20:54:04 +0100
From:Dave Page [EMAIL PROTECTED]
To:  Tom Lane [EMAIL PROTECTED]
Subject: Re: [CORE] 8.3beta1 Available ...

Dave Page wrote:
 Some further info on that - utf-8 on Windows is actually a
 pseudo-codepage (65001) which doesn't have NLS files, hence why we have
 to convert to utf-16 before sorting. Perhaps the utf-8/65001 name
 difference is the problem here. I'll knock up a quick test program when
 the kids have gone to bed.

So, my test prog (below) returns the following:

[EMAIL PROTECTED]:~$ ./setlc English_United Kingdom.65001
LC_COLLATE=English_United
Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United
Kingdom.65001;LC_NUMERIC=English_United
Kingdom.65001;LC_TIME=English_United Kingdom.65001

So everything other than LC_CTYPE is acceptable in UTF-8 on Windows -
and we already handle LC_CTYPE for UTF-8 on Windows through our UTF-8 ->
UTF-16 conversions internally.

Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?

Regards, Dave.


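#include <locale.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char *lc;

    /* Apply the locale named on the command line, if any. */
    if (argc > 1)
        setlocale(LC_ALL, argv[1]);

    /* Query (without changing) the resulting per-category settings. */
    lc = setlocale(LC_ALL, NULL);
    printf("%s\n", lc);
    return 0;
}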

--- Message 2

Date:Fri, 05 Oct 2007 23:32:36 +0100
From:Dave Page [EMAIL PROTECTED]
To:  Tom Lane [EMAIL PROTECTED]
Subject: Re: [CORE] 8.3beta1 Available ...

Tom Lane wrote:
 Dave Page [EMAIL PROTECTED] writes:
 So, my test prog (below) returns the following:
 
 [EMAIL PROTECTED]:~$ ./setlc English_United Kingdom.65001
 LC_COLLATE=English_United
 Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United
 Kingdom.65001;LC_NUMERIC=English_United
 Kingdom.65001;LC_TIME=English_United Kingdom.65001
 
 That's just frickin' weird ... and a bit scary.  There is a fair amount
 of code in PG that checks for lc_ctype_is_c and does things differently;
 one wonders if that isn't going to get misled by this behavior.  (Hmm,
 maybe this explains some of the "upper/lower doesn't work" reports we've
 been getting??)  Are you sure all variants of Windows act that way?

All the ones we support afaict.

 Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
 
 Is there something in Windows that constrains them to be all the same?
 If not this proposal seems just plain wrong :-(  But in any case I'd
 feel more comfortable having it look at LC_COLLATE.

They can all be set independently - it's just that there's no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'.

As a result, you can't set LC_CTYPE to .65001 because Windows knows it
can't handle ToUpper() or ToLower() etc. but you can use it to encode
messages and other text.

/D

--- End of Forwarded Messages

I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places.  ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
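
In outline, the proposed hack amounts to something like the following.
This is only a sketch of the idea: the function name and its explicit
parameters are invented for illustration, and the real lc_ctype_is_c()
obtains this information differently.

```c
#include <assert.h>
#include <string.h>

/* Decide whether the "C locale" fast paths may be used.
 * lc_ctype:            the LC_CTYPE name reported by setlocale()
 * db_encoding_is_utf8: nonzero if the database encoding is UTF-8
 * on_windows:          nonzero when running on Windows */
int ctype_is_c_sketch(const char *lc_ctype, int db_encoding_is_utf8,
                      int on_windows)
{
    assert(lc_ctype != NULL);

    /* Windows reports LC_CTYPE=C for the UTF-8 pseudo-codepage, so the
     * reported name cannot be trusted in that combination. */
    if (on_windows && db_encoding_is_utf8)
        return 0;

    return strcmp(lc_ctype, "C") == 0 || strcmp(lc_ctype, "POSIX") == 0;
}
```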

That still leaves me with a boatload of questions, though.  If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?

Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages and if so what happens?  If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?

One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001
to get a UTF8-compatible locale name.  That'd only work on Windows
but that seems the platform where we're most likely to see unsupportable
default encodings.
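
The name rewriting itself is simple string surgery. A minimal sketch,
assuming Windows-style names of the form Language_Country.codepage and a
hypothetical helper name; whether setlocale() then accepts the result is
a separate question that this code does not answer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Replace (or add) the code page suffix of a Windows-style locale name
 * with ".65001", the UTF-8 pseudo-codepage.
 * Returns 0 on success, -1 if `out` is too small. */
int with_utf8_codepage(const char *locale, char *out, size_t outlen)
{
    const char *dot;
    size_t base;
    int n;

    assert(locale != NULL && out != NULL);
    dot = strrchr(locale, '.');                 /* start of codepage, if any */
    base = dot ? (size_t) (dot - locale) : strlen(locale);
    n = snprintf(out, outlen, "%.*s.65001", (int) base, locale);
    return (n >= 0 && (size_t) n < outlen) ? 0 : -1;
}
```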

Comments?  I don't have a Windows 

Re: [HACKERS] Polymorphic arguments and composite types

2007-10-06 Thread Stephan Szabo

On Sat, 6 Oct 2007, Simon Riggs wrote:

 On Sat, 2007-10-06 at 10:15 -0700, Stephan Szabo wrote:

   Yeh, it does, but you're forgetting that my original complaint was that
   you couldn't use it in an ANY clause, which 4.2 does not exclude.
   Bearing in mind you can use a scalar subquery in lots of places, I
   thought it worth reporting.
 
  Well, but I'd argue that we're now talking about separate issues.

 It's simpler than that. I asked a question because the manual isn't
 specific on my original point. I'll do a doc patch to make sure nobody
 makes the same mistake I did and we record all the good points people
 have made.

   Section 9.20.3 mentions ANY (array expression). The term array
   expression is not defined nor is there a link to where it is defined,
   nor is the term indexed.
 
  I'm not sure why we're using a separate term for that.

 The term array expression is used in the manual, but not defined.

Right. I meant, if those are the only uses, why did we use a specific term
array expression rather than relying on saying that the expression given
must have array type.



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 I tried on both a UTF8 and Latin1 terminal and it works OK in all cases.

The cases that would be interesting involve to_char's locale-specific
format codes (eg Dy) along with LC_TIME settings that are deliberately
incompatible with the database encoding.  client_encoding is not relevant.

It's not real clear to me whether, on a Unix machine, there is even
supposed to be any difference between setting LC_TIME=es_ES.iso88591 and
setting it to es_ES.utf8.  Since nl_langinfo(CODESET) is supposedly
determined only by LC_CTYPE, you could argue that strftime's results
should be in that encoding regardless, and that the codeset component of
other LC_ variables should be ignored.  Some experimentation suggests
that at least in glibc it doesn't work that way, and that there is in
fact no principled way for you to find out what encoding strftime is
giving you :-(.

$ LANG=es_ES.utf8 date
sáb oct  6 14:11:30 EDT 2007
$ LANG=es_ES.iso88591 date
sáb oct  6 14:11:42 EDT 2007
$ LANG=en_US.iso88591 LC_TIME=es_ES.utf8 date
sáb oct  6 14:12:10 EDT 2007
$ LC_CTYPE=en_US.iso88591 LC_TIME=es_ES.utf8 date
sáb oct  6 14:12:34 EDT 2007
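
The same experiment can be run from C rather than through date(1). The
sketch below uses a hypothetical helper, weekday_name(), that formats a
weekday under a chosen LC_TIME setting. Note that nothing in the
strftime interface says which encoding the returned bytes are in, which
is the problem described above.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <time.h>

/* Write the full weekday name for `wday` (0 = Sunday) under the given
 * LC_TIME setting.  Returns 0 on success, -1 if the locale is not
 * available or the buffer is too small. */
int weekday_name(const char *lc_time, int wday, char *out, size_t outlen)
{
    struct tm tm;

    assert(lc_time != NULL && out != NULL);
    assert(wday >= 0 && wday <= 6);
    if (setlocale(LC_TIME, lc_time) == NULL)
        return -1;

    memset(&tm, 0, sizeof tm);
    tm.tm_mday = 1;             /* keep the struct plausible */
    tm.tm_wday = wday;          /* %A only looks at tm_wday */
    return strftime(out, outlen, "%A", &tm) > 0 ? 0 : -1;
}
```

Calling it with "es_ES.utf8" and then "es_ES.iso88591" (if both are
installed) and dumping the bytes would show whether the C library
encodes the two results differently.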

Perhaps a workable fix for this would be to try to mangle the LC_ settings
we pass to setlocale() so that they all have the same codeset component
(if any).  It looks like the convention of .foo being a codeset name
is fairly well standardized, even if the spelling of the codeset name is
not ...
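
Such mangling could be sketched as below, assuming the usual Unix name
layout language_TERRITORY.codeset@modifier. The helper name is
hypothetical, and the sketch does nothing about differing spellings of
the same codeset.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrite `locale` so that its codeset component is `codeset`, keeping
 * any @modifier.  "C" and "POSIX" are passed through unchanged.
 * Returns 0 on success, -1 if `out` is too small. */
int force_codeset(const char *locale, const char *codeset,
                  char *out, size_t outlen)
{
    const char *at, *dot;
    size_t base;
    int n;

    assert(locale != NULL && codeset != NULL && out != NULL);
    if (strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0)
        n = snprintf(out, outlen, "%s", locale);
    else
    {
        at = strchr(locale, '@');       /* optional modifier */
        dot = strchr(locale, '.');      /* optional codeset */
        if (dot != NULL && at != NULL && dot > at)
            dot = NULL;                 /* a '.' inside the modifier */
        base = dot ? (size_t) (dot - locale)
             : at  ? (size_t) (at - locale)
             : strlen(locale);
        n = snprintf(out, outlen, "%.*s.%s%s",
                     (int) base, locale, codeset, at ? at : "");
    }
    return (n >= 0 && (size_t) n < outlen) ? 0 : -1;
}
```

Applying it to every LC_ category with the codeset taken from LC_CTYPE
would give all categories the same codeset component.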

regards, tom lane



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Since nl_langinfo(CODESET) is supposedly determined only by LC_CTYPE, you
 could argue that strftime's results should be in that encoding regardless,

It seems to me we aren't actually using strftime any more in any case. We seem
to be using things like _("Monday") instead. Except in my tests I don't get
any French dates even when the server is started in French mode. I think we
just don't have localizations for those strings yet.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 Since nl_langinfo(CODESET) is supposedly determined only by LC_CTYPE, you
 could argue that strftime's results should be in that encoding regardless,

 It seems to me we aren't actually using strftime any more in any case.

Sorry, I was using strftime as a generic standin for everything that
LC_TIME affects.  Trace the usage of backend/utils/adt/pg_locale.c
to see what's really at stake there.

The practical issues would likely be things like type money using a
currency symbol that's given in the wrong encoding.
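
For instance, the currency symbol comes straight from the C library via
localeconv(), in whatever codeset the LC_MONETARY locale uses. The
sketch below (helper names are hypothetical) fetches it and applies a
crude safety test: pure ASCII reads the same in UTF-8 and Latin-1,
anything else would need to be checked or converted.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Return the currency symbol reported for `lc_monetary`, or NULL if
 * that locale cannot be selected.  The bytes are in the locale's own
 * codeset, which need not be the database encoding. */
const char *currency_symbol_for(const char *lc_monetary)
{
    assert(lc_monetary != NULL);
    if (setlocale(LC_MONETARY, lc_monetary) == NULL)
        return NULL;
    return localeconv()->currency_symbol;
}

/* Return 1 if every byte of `s` is 7-bit ASCII, else 0. */
int is_ascii_only(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if ((unsigned char) *s >= 0x80)
            return 0;
    return 1;
}
```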

And of course you did get the point that we already know a bogus
LC_MESSAGES setting leads directly to error-stack-overflow PANIC.

regards, tom lane



Re: [HACKERS] ECPG regression tests

2007-10-06 Thread Andrew Dunstan



Andrew Dunstan wrote:
 Magnus Hagander wrote:
  Bingo.

  With that, all the ECPG regression tests now pass on MSVC builds.

  Andrew - please enable it for the buildfarm :-)

 Yes, when I have had a chance to test it. Might be a day or so.

I finally managed to get this working after much wailing and gnashing of 
teeth and rending of hair. (Hint: if you don't put the PlatformSDK 
directories first in the INCLUDE and LIB lists bad and inexplicable 
things can happen.)


Pick up the latest version of run_build.pl in CVS if you want to run 
this in your buildfarm animal now.


A release will be forthcoming very soon.

cheers

andrew



Re: [HACKERS] Encoding and i18n

2007-10-06 Thread Euler Taveira de Oliveira
Gregory Stark wrote:

 It seems to me we aren't actually using strftime any more in any case. We seem
 to be using things like _("Monday") instead. Except in my tests I don't get
 any French dates even when the server is started in French mode. I think we
 just don't have localizations for those strings yet.
 
This was already discussed [1]. I proposed a patch (that was rejected)
because it calls setlocale() in every template pattern in to_char()
IIRC. I coded a patch to implement the setlocale() caching mechanism but
didn't send it. :( I'll take a look at this.

[1] http://archives.postgresql.org/pgsql-hackers/2006-11/msg00523.php
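
A caching scheme of that general kind might look like the sketch below.
It is not the unsent patch, whose details are not given here; the names
are invented, and the cache is only valid as long as nothing else
changes LC_TIME behind its back.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

static char cached_lc_time[128] = "";   /* last name successfully set */
static int  setlocale_calls = 0;        /* real setlocale() calls made */

/* Switch LC_TIME to `name`, skipping the setlocale() call when that
 * locale is already the one most recently set through this function.
 * Returns 0 on success, -1 on failure. */
int set_lc_time_cached(const char *name)
{
    assert(name != NULL);
    if (cached_lc_time[0] != '\0' && strcmp(cached_lc_time, name) == 0)
        return 0;                       /* already active, nothing to do */
    if (strlen(name) >= sizeof cached_lc_time)
        return -1;

    setlocale_calls++;
    if (setlocale(LC_TIME, name) == NULL)
        return -1;                      /* locale unchanged, cache kept */
    strcpy(cached_lc_time, name);
    return 0;
}
```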


-- 
  Euler Taveira de Oliveira
  http://www.timbira.com/
