Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Where'd you get the licence from?
None of that is in the licence I'm reading!

(http://www-306.ibm.com/software/globalization/icu/index.jsp)
(http://www-306.ibm.com/software/globalization/icu/license.jsp) 

... John

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Saturday, May 07, 2005 3:17 PM
> To: Bruce Momjian
> Cc: Palle Girgensohn; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU 
> 
> Bruce Momjian  writes:
> > Tom Lane wrote:
> >> Not until ICU is released under a BSD license ...
> 
> > Well, readline isn't BSD either, but we use it.  It is any 
> different?
> 
> Did you read the license?  Some of the more troubling bits:
> 
> : It is the understanding of INTERNATIONAL BUSINESS MACHINES 
> CORPORATION
> : that the purpose for which its publications are being reproduced is
> : accurate and true as stated in your attached request.
> 
> (er, which attached request would that be?)
> 
> : Permission to quote from or reprint IBM publications is 
> limited to the
> : purpose and quantities originally requested and must not be 
> construed as
> : a blanket license to use the material for other purposes or 
> to reprint
> : other IBM copyrighted material.
> 
> : IBM reserves the right to withdraw permission to reproduce 
> copyrighted
> : material whenever, in its discretion, it feels that the privilege of
> : reproducing its material is being used in a way detrimental to its
> : interest or the above instructions are not being followed 
> properly to
> : protect its copyright.
> 
> : IBM may have patents or pending patent applications covering subject
> : matter in this document. The furnishing of this document 
> does not give
> : you any license to these patents. You can send license inquiries, in
> : writing, to:
> 
> : For license inquiries regarding double-byte (DBCS) 
> information, contact
> : the IBM Intellectual Property Department in your country or send
> : inquiries, in writing, to:
> 
>   regards, tom lane
> 
> ---(end of 
> broadcast)---
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to 
> [EMAIL PROTECTED])
> 
> 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] rendezvous

2005-05-07 Thread Jochem van Dieten
On 5/7/05, Alvaro Herrera wrote:
> On Fri, May 06, 2005 at 03:30:10PM -0700, Joshua D. Drake wrote:
>> 
>> Rendezvous is the Apple network discovery protocol yes? That was renamed 
>> Bonjour by apple due to a Trademark problem.
> 
> Maybe we should name it Zeroconf.

Is the implemented protocol IETF ZeroConf or Apple Rendezvous? IIRC
there are differences in the requirements for TTLs and I have had some
experiences with Apple reserving local.arpa while the rest of the
world doesn't.

Jochem

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Simon Riggs
On Sat, 2005-05-07 at 14:52 +1000, Neil Conway wrote:
> Andrew Sullivan wrote:
> > Sure it is.  "Don't enable anything you don't need," is the first
> > security rule.  Everything is turned off by default.  If you want it,
> > enable it.
> 
> So would you have us disable all the non-essential builtin functions? 
> (Many of which have has security problems in the past.) What about the 
> builtin encoding conversions, non-btree indexes, or a myriad of features 
> that not all users need or use?

I support Andrew's comment, though might reword it to 
"Don't enable anything that gives users programmable features or user
exits by default".

You can't use the builtin encoding functions or non-btree indexes to
access things you are not supposed to.

Anything that is *always* there provides a platform for malware. 

I'm not really sure what is wrong with the CREATE LANGUAGE statement
anyway - it is dynamically accessible, so doesn't require changes that
effect other database instance users. I do understand the wish to make
the lives of admins easier, but this isn't a hard thing to do...

> What makes sense for the default configuration of an operating system 
> (which by nature must be hardened against attack) does not necessarily 
> make sense for a database system.

Security is everybody's job, not just the OS guys. Personally, I forget
that constantly, but the principle seems clear.

Best Regards, Simon Riggs



---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Neil Conway
Simon Riggs wrote:
I support Andrew's comment, though might reword it to 
"Don't enable anything that gives users programmable features or user
exits by default".
Users can already define SQL functions by default, which certainly 
provides "programmable features". I'm not quite sure what you mean by 
"user exits."

I guess I'm missing how pl/pgsql is a fundamentally greater security risk.
You can't use the builtin encoding functions or non-btree indexes to
access things you are not supposed to.
How can you use pl/pgsql to "access things you are not supposed to"?
-Neil
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] pgFoundry

2005-05-07 Thread Greg Stark

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:

> >>No that is public presentation of the project not project development.
> > I don't see that people are going to be able to participate in development
> > if they don't use the mailing lists.
> 
> I am not arguing that but public mailing lists are no place to track status of
> sub projects or tasks. They are for discussion.

What does it mean to "track" the status of something? How would the status
change except by discussion? What would be the point of announcing the status
of something without allowing people to comment?

I think you have a severely flawed idea of how free software development
proceeds. What you're describing sounds like something a manager of a
commercial project would want. Perhaps it's something the managers of the
people working on Postgres on behalf of some corporate sponsors might want but
in those cases I doubt they would want the information to be public anyways.

In the free software world there's no top-down management of the project with
managers issuing direction and expecting feedback reports. People only want
tools that make their lives easier. Not tools that make other people's lives
easier at the expense of their own convenience. The programmers are not
beholden to any corporate interests (other than their own sponsors, who
presumably are getting all the feedback they're looking for privately).

I'm rather surprised Postgres doesn't have a good bug tracking system. That's
something most projects find pretty essential. Strangely enough the reason
seems to be that Postgres really doesn't have many bugs... Unlike web browsers
or GUIs or most of the other free software projects out there, databases don't
tolerate bugs well. Any serious bug is cause for an immediate point release.
The only use for a bug tracking system would really be for tracking all those
pesky "IWBNI" bugs that never rise to urgent status.

But "tracking the status" of sub-projects is just not the kind of thing free
software people do. They send emails when they have something to say.

-- 
greg


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On fredag, maj 06, 2005 23.31.20 -0400 Tom Lane <[EMAIL PROTECTED]> wrote:
Bruce Momjian  writes:
Is this patch ready for application?
Not until ICU is released under a BSD license ...
It's not GPL anyway. Seems pretty much like the BSD license, at least more 
BSD-ish than GPL-ish.


/Palle
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On fredag, maj 06, 2005 22.57.59 -0400 Bruce Momjian 
 wrote:

Is this patch ready for application?
http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.d
iff.gz
The web site is:
	http://people.freebsd.org/~girgen/postgresql-icu/readme.html
I don't think so, not quite. I have not had any positive reports from linux 
users, this is only tested in a FreeBSD environment. I'd say it needs some 
more testing.

Also, apparently, ICU is installed by default in many linux distributions, 
and usually it is version 2.8. Some linux users have asked me if there are 
plans for a patch that works with ICU 2.8. That's probably a good idea. IBM 
and the ICU folks seem to consider 3.2 to be the stable version, older 
versions are hard to find on their sites, but most linux distributers seem 
to consider it too bleeding edge, even gentoo. I don't know why they don't 
agree.

I do have a few questions:
Why don't you use the lc_ctype_is_c() part of this test?
 	if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
Um, well, I didn't think about that. :)  What would be the locale in this 
case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide 
encoding, indeed. Then the strings will be handled like byte-wide chars. 
Yeah, it's a bug. I'll fix it! Thanks.

Why is so much code added, for example, in lower()?  The existing
multibyte code is much smaller, and lots of code is added in other
places too.
ICU uses UTF-16 internally, so all strings must be converted from the 
database encoding to UTF-16. Since that means the strings need to be 
copied, I took the same approach as in varlena.c:varstr_cmp(), where small 
strings use the heap and only larger strings use a palloc. Comments in 
varstr_cmp about performance made me use that approach.

Also, in the latest patch, I also added checks and logging for *every* 
status returned from ICU. I hope this will help debugging on debian, where 
previous version didn't work. That excessive status checking is hardly be 
necessary once the stuff is better tested.

I think the string copying and heap/palloc choices stands for most of the 
code bloat, together with the excessive status checking and logging.


Why do you need to add a mapping of encoding names from iana to our
names?
This was already answered by John Hansen... There's an old thread here 
about the choice of the name "UNICODE" to describe an encoding, which it 
doesn't. There's half a dozen unicode based encodings... UTF-8 is used by 
postgresql, that would have been a better name... Similarly for most other 
encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL 
can't provide it, so a mapping table is required.

I use this patch in production on one FreeBSD 4.10 server at the moment. 
With the latest version, I've had no problems. Logging is swithed on for 
now, and it shows no signs of ICU complaining. I'd like more reports on 
Linux, though.

/Palle
-
--
Palle Girgensohn wrote:
Hi!
I've put together a patch for using IBM's ICU package for collation.
If your OS does not have full support for collation ur
uppercase/lowercase  in multibyte locales, this might be useful. If you
are using a multibyte  character encoding in your database and want
collation, i.e. order by, and  also lower(), upper() and initcap() to
work properly, this patch will do  just that.
This patch is needed for FreeBSD, since this OS has no support for
collation of for example unicode locales (that is, wcscoll(3) does not
do  what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK
the  patch is *not* necessary for Linux, although IBM claims ICU
collation to be  about twice as fast as glibc for simple western locales.
It adds a configure switch, `--with-icu', which will set up the code to
use  ICU instead of wchar_t and wcscoll.
This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
seems to run well. I've not had the time to do any comparative
performance  tests yet, but it seems it is at least not slower than
using LATIN1 with  sv_SE.ISO8859-1 locale, perhaps even faster.
I'd be delighted if some more experienced postgresql hackers would
review  this stuff. The patch is pretty compact, so it's fast reading :)
I'm  planning to add this patch as an option (tagged "experimental") to
FreeBSD's postgresql port. Any ideas about whether this is a good idea
or  not?
Any thoughts or ideas are welcome!
Cheers,
Palle
Patch at:

ICU at sourceforge: 
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
--
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> 
> I use this patch in production on one FreeBSD 4.10 server at 
> the moment. 
> With the latest version, I've had no problems. Logging is 
> swithed on for 
> now, and it shows no signs of ICU complaining. I'd like more 
> reports on 
> Linux, though.

I currently use this on gentoo with ICU3.2 unmasked.

Works a dream, even with locale C and UNICODE database.

Small test:

createdb --encoding UNICODE --locale C test
psql test
set client_encoding=iso88591;
CREATE TABLE test (t text);
INSERT INTO test (t) VALUES ('æøå');
set client_encoding=unicode;
INSERT INTO test (t) SELECT upper(t) FROM test;
set client_encoding=iso88591;
SELECT * FROM test;
  t
-
 æøå
 ÆØÅ
(2 rows)

Just as I'd expect, as upper/lower/initcap are locale independent for these 
characters.


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS]

2005-05-07 Thread [EMAIL PROTECTED]
unsubscribe
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Palle Girgensohn wrote:
> >
> > Is this patch ready for application?
> 
> I don't think so, not quite. I have not had any positive reports from linux 
> users, this is only tested in a FreeBSD environment. I'd say it needs some 
> more testing.

OK.

> Also, apparently, ICU is installed by default in many linux distributions, 
> and usually it is version 2.8. Some linux users have asked me if there are 
> plans for a patch that works with ICU 2.8. That's probably a good idea. IBM 
> and the ICU folks seem to consider 3.2 to be the stable version, older 
> versions are hard to find on their sites, but most linux distributers seem 
> to consider it too bleeding edge, even gentoo. I don't know why they don't 
> agree.

Good point.  Why would linux folks need ICU?  Doesn't their OS support
encodings natively?  I am particularly excited about this for OSs that
don't have such encodings, like UTF8 support for Win32.

Because ICU will not be used unless enabled by configure, it seems we
are fine with only supporting the newest version.  Do Linux users need
to use ICU for any reason?

> > I do have a few questions:
> >
> > Why don't you use the lc_ctype_is_c() part of this test?
> >
> > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> 
> Um, well, I didn't think about that. :)  What would be the locale in this 
> case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide 
> encoding, indeed. Then the strings will be handled like byte-wide chars. 
> Yeah, it's a bug. I'll fix it! Thanks.

The additional test is more of an optmization, and it fixes a problem
with some OSs that have processing problems with UTF8 when the locale is
supposed to be turned off, like in "C".  I realize ICU might be fine
with it but the optimization still is an issue.

> > Why is so much code added, for example, in lower()?  The existing
> > multibyte code is much smaller, and lots of code is added in other
> > places too.
> 
> ICU uses UTF-16 internally, so all strings must be converted from the 
> database encoding to UTF-16. Since that means the strings need to be 
> copied, I took the same approach as in varlena.c:varstr_cmp(), where small 
> strings use the heap and only larger strings use a palloc. Comments in 
> varstr_cmp about performance made me use that approach.

Oh, interesting.   I think you need to create new functions that
factor out that common code so the patch is smaller and easier to
maintain.

> Also, in the latest patch, I also added checks and logging for *every* 
> status returned from ICU. I hope this will help debugging on debian, where 
> previous version didn't work. That excessive status checking is hardly be 
> necessary once the stuff is better tested.
> 
> I think the string copying and heap/palloc choices stands for most of the 
> code bloat, together with the excessive status checking and logging.

OK, move that into some common functions and I think it will be better.

> > Why do you need to add a mapping of encoding names from iana to our
> > names?
> 
> This was already answered by John Hansen... There's an old thread here 
> about the choice of the name "UNICODE" to describe an encoding, which it 
> doesn't. There's half a dozen unicode based encodings... UTF-8 is used by 
> postgresql, that would have been a better name... Similarly for most other 
> encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL 
> can't provide it, so a mapping table is required.

We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
help?

> I use this patch in production on one FreeBSD 4.10 server at the moment. 
> With the latest version, I've had no problems. Logging is swithed on for 
> now, and it shows no signs of ICU complaining. I'd like more reports on 
> Linux, though.

OK, I certainly would like this all done for 8.1 which should have
feature freeze on July 1.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Errm,... initdb --encoding UNICODE --locale C

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen
> Sent: Saturday, May 07, 2005 10:23 PM
> To: Palle Girgensohn; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > 
> > I use this patch in production on one FreeBSD 4.10 server at the 
> > moment.
> > With the latest version, I've had no problems. Logging is 
> swithed on 
> > for now, and it shows no signs of ICU complaining. I'd like more 
> > reports on Linux, though.
> 
> I currently use this on gentoo with ICU3.2 unmasked.
> 
> Works a dream, even with locale C and UNICODE database.
> 
> Small test:
> 
> createdb --encoding UNICODE --locale C test psql test set 
> client_encoding=iso88591; CREATE TABLE test (t text); INSERT 
> INTO test (t) VALUES ('æøå'); set client_encoding=unicode; 
> INSERT INTO test (t) SELECT upper(t) FROM test; set 
> client_encoding=iso88591; SELECT * FROM test;
>   t
> -
>  æøå
>  ÆØÅ
> (2 rows)
> 
> Just as I'd expect, as upper/lower/initcap are locale 
> independent for these characters.
> 
> 
> ---(end of 
> broadcast)---
> TIP 5: Have you checked our extensive FAQ?
> 
>http://www.postgresql.org/docs/faq
> 
> 

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn
--On lördag, maj 07, 2005 22.53.46 +1000 John Hansen <[EMAIL PROTECTED]> 
wrote:

Errm,... initdb --encoding UNICODE --locale C
You mean that ICU *shall* be used even for the C locale, and not as Bruce 
suggested here:

I do have a few questions:
Why don't you use the lc_ctype_is_c() part of this test?
 	if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
Um, well, I didn't think about that. :)  What would be the locale in this
case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide
encoding, indeed. Then the strings will be handled like byte-wide chars.
Yeah, it's a bug. I'll fix it! Thanks.
John disagrees here, and I'm obliged to agree. Using the C locale, one will 
expect C collation, but upper/lower is better off still using ICU. Hence, 
the above stuff is *not* a bug. Do we agree?

/Palle


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John Hansen
Sent: Saturday, May 07, 2005 10:23 PM
To: Palle Girgensohn; Bruce Momjian
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Patch for collation using ICU
>
> I use this patch in production on one FreeBSD 4.10 server at the
> moment.
> With the latest version, I've had no problems. Logging is
swithed on
> for now, and it shows no signs of ICU complaining. I'd like more
> reports on Linux, though.
I currently use this on gentoo with ICU3.2 unmasked.
Works a dream, even with locale C and UNICODE database.
Small test:
createdb --encoding UNICODE --locale C test psql test set
client_encoding=iso88591; CREATE TABLE test (t text); INSERT
INTO test (t) VALUES ('æøå'); set client_encoding=unicode;
INSERT INTO test (t) SELECT upper(t) FROM test; set
client_encoding=iso88591; SELECT * FROM test;
  t
-
 æøå
 ÆØÅ
(2 rows)
Just as I'd expect, as upper/lower/initcap are locale
independent for these characters.
---(end of
broadcast)---
TIP 5: Have you checked our extensive FAQ?
   http://www.postgresql.org/docs/faq



---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
> Palle Girgensohn wrote:
> > >
> > > Is this patch ready for application?
> > 
> > I don't think so, not quite. I have not had any positive 
> reports from 
> > linux users, this is only tested in a FreeBSD environment. 
> I'd say it 
> > needs some more testing.
> 
> OK.
> 
> > Also, apparently, ICU is installed by default in many linux 
> > distributions, and usually it is version 2.8. Some linux users have 
> > asked me if there are plans for a patch that works with ICU 2.8. 
> > That's probably a good idea. IBM and the ICU folks seem to consider 
> > 3.2 to be the stable version, older versions are hard to 
> find on their 
> > sites, but most linux distributers seem to consider it too bleeding 
> > edge, even gentoo. I don't know why they don't agree.
> 
> Good point.  Why would linux folks need ICU?  Doesn't their 
> OS support encodings natively?  I am particularly excited 
> about this for OSs that don't have such encodings, like UTF8 
> support for Win32.
> 
> Because ICU will not be used unless enabled by configure, it 
> seems we are fine with only supporting the newest version.  
> Do Linux users need to use ICU for any reason?

Yes, because on many linux platforms locale support is broken.
Also, ICU enables full unicode support, particularly in multi-language
situations where locale is C, and makes upper/lower/initcap work as
expected, except where it depends on locale information.

There are also many other useful things in ICU that could be
implemented. Transliteration, and break-iterators for example.
Break-iteration particularly interresting for converting a text to a
list of words. Another is it's builtin substring searches.

> 
> > > I do have a few questions:
> > >
> > > Why don't you use the lc_ctype_is_c() part of this test?
> > >
> > >   if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> > 
> > Um, well, I didn't think about that. :)  What would be the 
> locale in 
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have 
> CTYPE=C and use 
> > a wide encoding, indeed. Then the strings will be handled 
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
> 
> The additional test is more of an optmization, and it fixes a 
> problem with some OSs that have processing problems with UTF8 
> when the locale is supposed to be turned off, like in "C".  I 
> realize ICU might be fine with it but the optimization still 
> is an issue.

That the locale is supposed to be turned off, doesn't mean it shouldn't
use ICU.
ICU is more than just locales.

> > > Why is so much code added, for example, in lower()?  The existing 
> > > multibyte code is much smaller, and lots of code is added 
> in other 
> > > places too.
> > 
> > ICU uses UTF-16 internally, so all strings must be 
> converted from the 
> > database encoding to UTF-16. Since that means the strings 
> need to be 
> > copied, I took the same approach as in 
> varlena.c:varstr_cmp(), where 
> > small strings use the heap and only larger strings use a palloc. 
> > Comments in varstr_cmp about performance made me use that approach.
> 
> Oh, interesting.   I think you need to create new functions that
> factor out that common code so the patch is smaller and 
> easier to maintain.
> 
> > Also, in the latest patch, I also added checks and logging 
> for *every* 
> > status returned from ICU. I hope this will help debugging 
> on debian, 
> > where previous version didn't work. That excessive status 
> checking is 
> > hardly be necessary once the stuff is better tested.
> > 
> > I think the string copying and heap/palloc choices stands 
> for most of 
> > the code bloat, together with the excessive status checking 
> and logging.
> 
> OK, move that into some common functions and I think it will 
> be better.
> 
> > > Why do you need to add a mapping of encoding names from 
> iana to our 
> > > names?
> > 
> > This was already answered by John Hansen... There's an old 
> thread here 
> > about the choice of the name "UNICODE" to describe an 
> encoding, which 
> > it doesn't. There's half a dozen unicode based encodings... 
> UTF-8 is 
> > used by postgresql, that would have been a better name... Similarly 
> > for most other encodings, really. ICU expect a setlocale(3) string 
> > (i.e. IANA). PostgreSQL can't provide it, so a mapping 
> table is required.
> 
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). 
>  Does that help?
> 
> > I use this patch in production on one FreeBSD 4.10 server 
> at the moment. 
> > With the latest version, I've had no problems. Logging is 
> swithed on 
> > for now, and it shows no signs of ICU complaining. I'd like more 
> > reports on Linux, though.
> 
> OK, I certainly would like this all done for 8.1 which should 
> have feature freeze on July 1.
> 
> -- 
>   Bruce Momjian|  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us   |  (610) 359-1001
>   +  If your life is a hard drive, |  13 Roberts Road
>  

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen 
> <[EMAIL PROTECTED]>
> wrote:
> 
> > Errm,... initdb --encoding UNICODE --locale C
> 
> You mean that ICU *shall* be used even for the C locale, and 
> not as Bruce suggested here:

Yes, that's exactly what I mean.

> 
> >> I do have a few questions:
> >>
> >> Why don't you use the lc_ctype_is_c() part of this test?
> >>
> >>if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> >
> > Um, well, I didn't think about that. :)  What would be the 
> locale in 
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have 
> CTYPE=C and use 
> > a wide encoding, indeed. Then the strings will be handled 
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
> 
> John disagrees here, and I'm obliged to agree. Using the C 
> locale, one will expect C collation, but upper/lower is 
> better off still using ICU. Hence, the above stuff is *not* a 
> bug. Do we agree?
> 
> /Palle
> 
> 
> >
> >> -Original Message-
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of 
> John Hansen
> >> Sent: Saturday, May 07, 2005 10:23 PM
> >> To: Palle Girgensohn; Bruce Momjian
> >> Cc: pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Patch for collation using ICU
> >>
> >> >
> >> > I use this patch in production on one FreeBSD 4.10 server at the 
> >> > moment.
> >> > With the latest version, I've had no problems. Logging is
> >> swithed on
> >> > for now, and it shows no signs of ICU complaining. I'd like more 
> >> > reports on Linux, though.
> >>
> >> I currently use this on gentoo with ICU3.2 unmasked.
> >>
> >> Works a dream, even with locale C and UNICODE database.
> >>
> >> Small test:
> >>
> >> createdb --encoding UNICODE --locale C test psql test set 
> >> client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO 
> >> test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO 
> >> test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; 
> >> SELECT * FROM test;
> >>   t
> >> -
> >>  æøå
> >>  ÆØÅ
> >> (2 rows)
> >>
> >> Just as I'd expect, as upper/lower/initcap are locale 
> independent for 
> >> these characters.
> >>
> >>
> >> ---(end of
> >> broadcast)---
> >> TIP 5: Have you checked our extensive FAQ?
> >>
> >>http://www.postgresql.org/docs/faq
> >>
> >>
> 
> 
> 
> 
> 
> 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Btw, I had been planning to propose replacing every single one of the built in 
charset conversion functions with calls to ICU (thus making pg _depend_ on 
ICU), as this would seem like a cleaner solution than for us to maintain our 
own conversion tables.

ICU also has a fair few conversions that we do not have at present.

Any thoughts?

... John

> -Original Message-
> From: John Hansen 
> Sent: Saturday, May 07, 2005 11:09 PM
> To: 'Palle Girgensohn'; 'Bruce Momjian'
> Cc: 'pgsql-hackers@postgresql.org'
> Subject: RE: [HACKERS] Patch for collation using ICU
> 
> > --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen 
> > <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > Errm,... initdb --encoding UNICODE --locale C
> > 
> > You mean that ICU *shall* be used even for the C locale, and not as 
> > Bruce suggested here:
> 
> Yes, that's exactly what I mean.
> 
> > 
> > >> I do have a few questions:
> > >>
> > >> Why don't you use the lc_ctype_is_c() part of this test?
> > >>
> > >>  if (pg_database_encoding_max_length() > 1 && 
> !lc_ctype_is_c())
> > >
> > > Um, well, I didn't think about that. :)  What would be the
> > locale in
> > > this case? c_C.UTF-8? ;)  Hmm, it is possible to have
> > CTYPE=C and use
> > > a wide encoding, indeed. Then the strings will be handled
> > like byte-wide chars.
> > > Yeah, it's a bug. I'll fix it! Thanks.
> > 
> > John disagrees here, and I'm obliged to agree. Using the C 
> locale, one 
> > will expect C collation, but upper/lower is better off still using 
> > ICU. Hence, the above stuff is *not* a bug. Do we agree?
> > 
> > /Palle
> > 
> > 
> > >
> > >> -Original Message-
> > >> From: [EMAIL PROTECTED]
> > >> [mailto:[EMAIL PROTECTED] On Behalf Of
> > John Hansen
> > >> Sent: Saturday, May 07, 2005 10:23 PM
> > >> To: Palle Girgensohn; Bruce Momjian
> > >> Cc: pgsql-hackers@postgresql.org
> > >> Subject: Re: [HACKERS] Patch for collation using ICU
> > >>
> > >> >
> > >> > I use this patch in production on one FreeBSD 4.10 
> server at the 
> > >> > moment.
> > >> > With the latest version, I've had no problems. Logging is
> > >> swithed on
> > >> > for now, and it shows no signs of ICU complaining. I'd 
> like more 
> > >> > reports on Linux, though.
> > >>
> > >> I currently use this on gentoo with ICU3.2 unmasked.
> > >>
> > >> Works a dream, even with locale C and UNICODE database.
> > >>
> > >> Small test:
> > >>
> > >> createdb --encoding UNICODE --locale C test psql test set 
> > >> client_encoding=iso88591; CREATE TABLE test (t text); 
> INSERT INTO 
> > >> test (t) VALUES ('æøå'); set client_encoding=unicode; 
> INSERT INTO 
> > >> test (t) SELECT upper(t) FROM test; set 
> client_encoding=iso88591; 
> > >> SELECT * FROM test;
> > >>   t
> > >> -
> > >>  æøå
> > >>  ÆØÅ
> > >> (2 rows)
> > >>
> > >> Just as I'd expect, as upper/lower/initcap are locale
> > independent for
> > >> these characters.
> > >>
> > >>
> > >> ---(end of
> > >> broadcast)---
> > >> TIP 5: Have you checked our extensive FAQ?
> > >>
> > >>http://www.postgresql.org/docs/faq
> > >>
> > >>
> > 
> > 
> > 
> > 
> > 
> > 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian 
 wrote:

Palle Girgensohn wrote:
>
> Is this patch ready for application?
I don't think so, not quite. I have not had any positive reports from
linux  users, this is only tested in a FreeBSD environment. I'd say it
needs some  more testing.
OK.
John Hansen just reported that it does work on linux. fine!

Also, apparently, ICU is installed by default in many linux
distributions,  and usually it is version 2.8. Some linux users have
asked me if there are  plans for a patch that works with ICU 2.8. That's
probably a good idea. IBM  and the ICU folks seem to consider 3.2 to be
the stable version, older  versions are hard to find on their sites, but
most linux distributers seem  to consider it too bleeding edge, even
gentoo. I don't know why they don't  agree.
Good point.  Why would linux folks need ICU?  Doesn't their OS support
encodings natively?  I am particularly excited about this for OSs that
don't have such encodings, like UTF8 support for Win32.
Because ICU will not be used unless enabled by configure, it seems we
are fine with only supporting the newest version.  Do Linux users need
to use ICU for any reason?

There are corner cases where it is impossible to upper/lowercase one 
character at the time. for example:

-- without ICU
select upper('Eßer');
upper
---
EßER
(1 row)
-- with ICU
select upper('Eßer');
upper
---
ESSER
(1 rad)
This is because in the standard postgres implementation, upper/lower is 
done one character at the time. A proper upper/lower cannot do it that way. 
Other known example is in Turkish, where an Ì (?) should look different 
whether it is an initial letter or not. This fails in standard postgresql 
for all platforms.

> I do have a few questions:
>
> Why don't you use the lc_ctype_is_c() part of this test?
>
>if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
Um, well, I didn't think about that. :)  What would be the locale in
this  case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a
wide  encoding, indeed. Then the strings will be handled like byte-wide
chars.  Yeah, it's a bug. I'll fix it! Thanks.
The additional test is more of an optmization, and it fixes a problem
with some OSs that have processing problems with UTF8 when the locale is
supposed to be turned off, like in "C".  I realize ICU might be fine
with it but the optimization still is an issue.
Well, the results are quite different, depending on whether ICU is used or 
not. See separate mail.


> Why is so much code added, for example, in lower()?  The existing
> multibyte code is much smaller, and lots of code is added in other
> places too.
ICU uses UTF-16 internally, so all strings must be converted from the
database encoding to UTF-16. Since that means the strings need to be
copied, I took the same approach as in varlena.c:varstr_cmp(), where
small  strings use the heap and only larger strings use a palloc.
Comments in  varstr_cmp about performance made me use that approach.
Oh, interesting.   I think you need to create new functions that
factor out that common code so the patch is smaller and easier to
maintain.
Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me...

Also, in the latest patch, I also added checks and logging for *every*
status returned from ICU. I hope this will help debugging on debian,
where  previous version didn't work. That excessive status checking is
hardly be  necessary once the stuff is better tested.
I think the string copying and heap/palloc choices stands for most of
the  code bloat, together with the excessive status checking and logging.
OK, move that into some common functions and I think it will be better.
Best way for upper/lower/initcap is probably to use a function pointer... 
uhh...


> Why do you need to add a mapping of encoding names from iana to our
> names?
This was already answered by John Hansen... There's an old thread here
about the choice of the name "UNICODE" to describe an encoding, which it
doesn't. There's half a dozen unicode based encodings... UTF-8 is used
by  postgresql, that would have been a better name... Similarly for most
other  encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
PostgreSQL  can't provide it, so a mapping table is required.
We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
help?
I'm aware of that. It might help for unicode, but there are a bunch of 
other encodings. IANA has decided that utf-8 has *no* aliases, hence only 
utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is 
fogiving, I don't remember/know, but I think we need the mappings, 
unfortunately.


I use this patch in production on one FreeBSD 4.10 server at the moment.
With the latest version, I've had no problems. Logging is swithed on for
now, and it shows no signs of ICU complaining. I'd like more reports on
Linux, though.
OK, I certainly would like this all done for 8.1 which should have
feature freeze on July 1

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Palle Girgensohn wrote:
> I'm aware of that. It might help for unicode, but there are a 
> bunch of 
> other encodings. IANA has decided that utf-8 has *no* 
> aliases, hence only 
> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is 
> fogiving, I don't remember/know, but I think we need the mappings, 
> unfortunately.
> 

Here is the list of encoding names and aliases the ICU accepts as of
3.2:
(it's a bit long...)

UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
UTF-16BE x-utf-16be ibm-1200 ibm-1201 ibm-5297 ibm-13488 ibm-17584
windows-1201 cp1200 cp1201 UTF16_BigEndian
UTF-16LE x-utf-16le ibm-1202 ibm-13490 ibm-17586 UTF16_LittleEndian
windows-1200
UTF-32 ISO-10646-UCS-4 csUCS4 ucs-4
UTF-32BE UTF32_BigEndian ibm-1232 ibm-1233
UTF-32LE UTF32_LittleEndian ibm-1234
UTF16_PlatformEndian
UTF16_OppositeEndian
UTF32_PlatformEndian
UTF32_OppositeEndian
UTF-7 windows-65000
IMAP-mailbox-name
SCSU
BOCU-1 csBOCU-1
CESU-8
ISO-8859-1 ibm-819 IBM819 cp819 latin1 8859_1 csISOLatin1 iso-ir-100
ISO_8859-1:1987 l1 819
US-ASCII ASCII ANSI_X3.4-1968 ANSI_X3.4-1986 ISO_646.irv:1991
iso_646.irv:1983 ISO646-US us csASCII iso-ir-6 cp367 ascii7 646
windows-20127
gb18030 ibm-1392 windows-54936
ibm-367_P100-1995 ibm-367 IBM367
ibm-912_P100-1995 ibm-912 iso-8859-2 ISO_8859-2:1987 latin2 csISOLatin2
iso-ir-101 l2 8859_2 cp912 912 windows-28592
ibm-913_P100-2000 ibm-913 iso-8859-3 ISO_8859-3:1988 latin3 csISOLatin3
iso-ir-109 l3 8859_3 cp913 913 windows-28593
ibm-914_P100-1995 ibm-914 iso-8859-4 latin4 csISOLatin4 iso-ir-110
ISO_8859-4:1988 l4 8859_4 cp914 914 windows-28594
ibm-915_P100-1995 ibm-915 iso-8859-5 cyrillic csISOLatinCyrillic
iso-ir-144 ISO_8859-5:1988 8859_5 cp915 915 windows-28595
ibm-1089_P100-1995 ibm-1089 iso-8859-6 arabic csISOLatinArabic
iso-ir-127 ISO_8859-6:1987 ECMA-114 ASMO-708 8859_6 cp1089 1089
windows-28596 ISO-8859-6-I ISO-8859-6-E
ibm-813_P100-1995 ibm-813 iso-8859-7 greek greek8 ELOT_928 ECMA-118
csISOLatinGreek iso-ir-126 ISO_8859-7:1987 8859_7 cp813 813
windows-28597
ibm-916_P100-1995 ibm-916 iso-8859-8 hebrew csISOLatinHebrew iso-ir-138
ISO_8859-8:1988 ISO-8859-8-I ISO-8859-8-E 8859_8 cp916 916 windows-28598
ibm-920_P100-1995 ibm-920 iso-8859-9 latin5 csISOLatin5 iso-ir-148
ISO_8859-9:1989 l5 8859_9 cp920 920 windows-28599 ECMA-128
ibm-921_P100-1995 ibm-921 iso-8859-13 8859_13 cp921 921
ibm-923_P100-1998 ibm-923 iso-8859-15 Latin-9 l9 8859_15 latin0
csisolatin0 csisolatin9 iso8859_15_fdis cp923 923 windows-28605
ibm-942_P12A-1999 ibm-942 ibm-932 cp932 shift_jis78 sjis78
ibm-942_VSUB_VPUA ibm-932_VSUB_VPUA
ibm-943_P15A-2003 ibm-943 Shift_JIS MS_Kanji csShiftJIS windows-31j
csWindows31J x-sjis x-ms-cp932 cp932 windows-932 cp943c IBM-943C ms932
pck sjis ibm-943_VSUB_VPUA
ibm-943_P130-1999 ibm-943 Shift_JIS cp943 943 ibm-943_VASCII_VSUB_VPUA
ibm-33722_P12A-1999 ibm-33722 ibm-5050 EUC-JP
Extended_UNIX_Code_Packed_Format_for_Japanese csEUCPkdFmtJapanese
X-EUC-JP eucjis windows-51932 ibm-33722_VPUA IBM-eucJP
ibm-33722_P120-1999 ibm-33722 ibm-5050 cp33722 33722
ibm-33722_VASCII_VPUA
ibm-954_P101-2000 ibm-954 EUC-JP
ibm-1373_P100-2002 ibm-1373 windows-950
windows-950-2000 Big5 csBig5 windows-950 x-big5
ibm-950_P110-1999 ibm-950 cp950 950
macos-2566-10.2 Big5-HKSCS big5hk HKSCS-BIG5
ibm-1375_P100-2003 ibm-1375 Big5-HKSCS
ibm-1386_P100-2002 ibm-1386 cp1386 windows-936 ibm-1386_VSUB_VPUA
windows-936-2000 GBK CP936 MS936 windows-936
ibm-1383_P110-1999 ibm-1383 GB2312 csGB2312 EUC-CN ibm-eucCN hp15CN
cp1383 1383 ibm-1383_VPUA
ibm-5478_P100-1995 ibm-5478 GB_2312-80 chinese iso-ir-58 csISO58GB231280
gb2312-1980 GB2312.1980-0
ibm-964_P110-1999 ibm-964 EUC-TW ibm-eucTW cns11643 cp964 964
ibm-964_VPUA
ibm-949_P110-1999 ibm-949 cp949 949 ibm-949_VASCII_VSUB_VPUA
ibm-949_P11A-1999 ibm-949 cp949c ibm-949_VSUB_VPUA
ibm-970_P110-1995 ibm-970 EUC-KR KS_C_5601-1987 windows-51949 csEUCKR
ibm-eucKR KSC_5601 5601 ibm-970_VPUA
ibm-971_P100-1995 ibm-971 ibm-971_VPUA
ibm-1363_P11B-1998 ibm-1363 KS_C_5601-1987 KS_C_5601-1989 KSC_5601
csKSC56011987 korean iso-ir-149 5601 cp1363 ksc windows-949
ibm-1363_VSUB_VPUA
ibm-1363_P110-1997 ibm-1363 ibm-1363_VASCII_VSUB_VPUA
windows-949-2000 windows-949 KS_C_5601-1987 KS_C_5601-1989 KSC_5601
csKSC56011987 korean iso-ir-149 ms949
ibm-1162_P100-1999 ibm-1162
ibm-874_P100-1995 ibm-874 ibm-9066 cp874 TIS-620 tis620.2533 eucTH
cp9066
windows-874-2000 TIS-620 windows-874 MS874
ibm-437_P100-1995 ibm-437 IBM437 cp437 437 csPC8CodePage437 windows-437
ibm-850_P100-1995 ibm-850 IBM850 cp850 850 csPC850Multilingual
windows-850
ibm-851_P100-1995 ibm-851 IBM851 cp851 851 csPC851
ibm-852_P100-1995 ibm-852 IBM852 cp852 852 csPCp852 windows-852
ibm-855_P100-1995 ibm-855 IBM855 cp855 855 csIBM855 csPCp855
ibm-856_P100-1995 ibm-856 cp856 856
ibm-857_P100-1995 ibm-857 IBM857 cp857 857 csIBM857 windows-857
ibm-858_P100-1997 ibm-858 IBM00858 CCSID00858 CP00858
PC-Multilingual-850+euro cp858
ibm-860_P100-1995 ibm-860 IBM860 cp860 860 csIBM8

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 23.25.15 +1000 John Hansen <[EMAIL PROTECTED]> 
wrote:

Palle Girgensohn wrote:
I'm aware of that. It might help for unicode, but there are a
bunch of
other encodings. IANA has decided that utf-8 has *no*
aliases, hence only
utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
fogiving, I don't remember/know, but I think we need the mappings,
unfortunately.
Here is the list of encoding names and aliases the ICU accepts as of
3.2:
(it's a bit long...)
UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
No UTF8 in there. I think that's good, charset aliases are a hassle.
/Palle
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

> -Original Message-
> From: Palle Girgensohn [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, May 07, 2005 11:30 PM
> To: John Hansen; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] Patch for collation using ICU
> 
> 
> 
> --On lördag, maj 07, 2005 23.25.15 +1000 John Hansen 
> <[EMAIL PROTECTED]>
> wrote:
> 
> > Palle Girgensohn wrote:
> >> I'm aware of that. It might help for unicode, but there 
> are a bunch 
> >> of other encodings. IANA has decided that utf-8 has *no* aliases, 
> >> hence only
> >> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is 
> >> fogiving, I don't remember/know, but I think we need the mappings, 
> >> unfortunately.
> >>
> >
> > Here is the list of encoding names and aliases the ICU accepts as of
> > 3.2:
> > (it's a bit long...)
> >
> > UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
> 
> No UTF8 in there. I think that's good, charset aliases are a hassle.

Yup! :)

> 
> /Palle
> 
> 
> 

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 22.22.52 +1000 John Hansen <[EMAIL PROTECTED]> 
wrote:

I use this patch in production on one FreeBSD 4.10 server at
the moment.
With the latest version, I've had no problems. Logging is
swithed on for
now, and it shows no signs of ICU complaining. I'd like more
reports on
Linux, though.
I currently use this on gentoo with ICU3.2 unmasked.
Works a dream, even with locale C and UNICODE database.
Ah, good to hear, John. I beleive your report about linux is what's keeping 
this back. Did you also manage to get it running on Debian?

/Palle
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

> -Original Message-
> From: Palle Girgensohn [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, May 07, 2005 11:33 PM
> To: John Hansen; Bruce Momjian
> Cc: pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] Patch for collation using ICU
> 
> 
> 
> --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen 
> <[EMAIL PROTECTED]>
> wrote:
> 
> >>
> >> I use this patch in production on one FreeBSD 4.10 server at the 
> >> moment.
> >> With the latest version, I've had no problems. Logging is 
> swithed on 
> >> for now, and it shows no signs of ICU complaining. I'd like more 
> >> reports on Linux, though.
> >
> > I currently use this on gentoo with ICU3.2 unmasked.
> >
> > Works a dream, even with locale C and UNICODE database.
> 
> Ah, good to hear, John. I beleive your report about linux is 
> what's keeping this back. Did you also manage to get it 
> running on Debian?

Not without ICU3.2

> 
> /Palle
> 
> 

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn
--On lördag, maj 07, 2005 23.15.29 +1000 John Hansen <[EMAIL PROTECTED]> 
wrote:

Btw, I had been planning to propose replacing every single one of the
built in charset conversion functions with calls to ICU (thus making pg
_depend_ on ICU), as this would seem like a cleaner solution than for us
to maintain our own conversion tables.
ICU also has a fair few conversions that we do not have at present.
Any thoughts?
I just had a similar though. And why use ICU only for multibyte charsets? 
If I use LATIN1, I still expect upper('ß') => SS, and I don't get it... 
Same for the Turkish example.

It does eat more memory, and can perhaps cush some performance bits? With 
the current scheme, a strdup is often enough, or at least just one palloc. 
With ICU, using UTF-16, you must allocate memory twice, once for the ICU 
internal UTF-16 representation. That's not a very strong objection, though, 
as this would be an option... :)

John, I have a hard time finding docs about what differs in ICU 2.8 from 
3.2. Do you have any pointers?

It seems 3.2 has much more support and bug fixes, I'm not sure if we should 
really consider 2.8?

/Palle
... John
-Original Message-
From: John Hansen
Sent: Saturday, May 07, 2005 11:09 PM
To: 'Palle Girgensohn'; 'Bruce Momjian'
Cc: 'pgsql-hackers@postgresql.org'
Subject: RE: [HACKERS] Patch for collation using ICU
> --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen
> <[EMAIL PROTECTED]>
> wrote:
>
> > Errm,... initdb --encoding UNICODE --locale C
>
> You mean that ICU *shall* be used even for the C locale, and not as
> Bruce suggested here:
Yes, that's exactly what I mean.
>
> >> I do have a few questions:
> >>
> >> Why don't you use the lc_ctype_is_c() part of this test?
> >>
> >>   if (pg_database_encoding_max_length() > 1 &&
!lc_ctype_is_c())
> >
> > Um, well, I didn't think about that. :)  What would be the
> locale in
> > this case? c_C.UTF-8? ;)  Hmm, it is possible to have
> CTYPE=C and use
> > a wide encoding, indeed. Then the strings will be handled
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
>
> John disagrees here, and I'm obliged to agree. Using the C
locale, one
> will expect C collation, but upper/lower is better off still using
> ICU. Hence, the above stuff is *not* a bug. Do we agree?
>
> /Palle
>
>
> >
> >> -Original Message-
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of
> John Hansen
> >> Sent: Saturday, May 07, 2005 10:23 PM
> >> To: Palle Girgensohn; Bruce Momjian
> >> Cc: pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Patch for collation using ICU
> >>
> >> >
> >> > I use this patch in production on one FreeBSD 4.10
server at the
> >> > moment.
> >> > With the latest version, I've had no problems. Logging is
> >> swithed on
> >> > for now, and it shows no signs of ICU complaining. I'd
like more
> >> > reports on Linux, though.
> >>
> >> I currently use this on gentoo with ICU3.2 unmasked.
> >>
> >> Works a dream, even with locale C and UNICODE database.
> >>
> >> Small test:
> >>
> >> createdb --encoding UNICODE --locale C test psql test set
> >> client_encoding=iso88591; CREATE TABLE test (t text);
INSERT INTO
> >> test (t) VALUES ('æøå'); set client_encoding=unicode;
INSERT INTO
> >> test (t) SELECT upper(t) FROM test; set
client_encoding=iso88591;
> >> SELECT * FROM test;
> >>   t
> >> -
> >>  æøå
> >>  ÆØÅ
> >> (2 rows)
> >>
> >> Just as I'd expect, as upper/lower/initcap are locale
> independent for
> >> these characters.
> >>
> >>
> >> ---(end of
> >> broadcast)---
> >> TIP 5: Have you checked our extensive FAQ?
> >>
> >>http://www.postgresql.org/docs/faq
> >>
> >>
>
>
>
>
>
>


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 23.33.31 +1000 John Hansen <[EMAIL PROTECTED]> 
wrote:


-Original Message-
From: Palle Girgensohn [mailto:[EMAIL PROTECTED]
Sent: Saturday, May 07, 2005 11:33 PM
To: John Hansen; Bruce Momjian
Cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Patch for collation using ICU

--On lördag, maj 07, 2005 22.22.52 +1000 John Hansen
<[EMAIL PROTECTED]>
wrote:
>>
>> I use this patch in production on one FreeBSD 4.10 server at the
>> moment.
>> With the latest version, I've had no problems. Logging is
swithed on
>> for now, and it shows no signs of ICU complaining. I'd like more
>> reports on Linux, though.
>
> I currently use this on gentoo with ICU3.2 unmasked.
>
> Works a dream, even with locale C and UNICODE database.
Ah, good to hear, John. I beleive your report about linux is
what's keeping this back. Did you also manage to get it
running on Debian?
Not without ICU3.2
Did you try the latest patch? Maybe it will help, and if not, it will 
(hopefully) give a lot more informative error messages.

/Palle
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Andrew Dunstan

John Hansen wrote:
Here is the list of encoding names and aliases the ICU accepts as of
3.2:
(it's a bit long...)
UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
 

[snip]
Don't we use "unicode" as an alias for UTF-8 ?
cheers
andrew
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> Did you try the latest patch? Maybe it will help, and if not, it will
> (hopefully) give a lot more informative error messages.

No, and I got rid of my debian boxes @ home.
The patch required a certain amount of modifications too, to even
compile with 2.8.

So I guess it's a valid question to ask: it it worth supporting 2.8?
It is of course an option to bundle icu 3.2 with pg!


... John

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
John Hansen wrote:
> > --On l?rdag, maj 07, 2005 22.53.46 +1000 John Hansen 
> > <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > Errm,... initdb --encoding UNICODE --locale C
> > 
> > You mean that ICU *shall* be used even for the C locale, and 
> > not as Bruce suggested here:
> 
> Yes, that's exactly what I mean.

There are two reasons for that optimization --- first, some locale
support is broken and Unicode encoding with a C locale crashes (not an
issue for ICU), and second, it is an optimization for languages like
Japanese that want to use unicode, but don't need a locale because
upper/lower means nothing in those character sets.

So, the first issue doesn't apply for ICU, and the second might not
depending on what characters you are using in the Unicode character set.

I guess I am little confused how ICU can do upper() when the locale is
C.  What is it using to determine A is upper for a?  Am I confused?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

> -Original Message-
> From: Andrew Dunstan [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, May 07, 2005 11:39 PM
> To: John Hansen
> Cc: Palle Girgensohn; Bruce Momjian; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> 
> 
> John Hansen wrote:
> 
> >Here is the list of encoding names and aliases the ICU accepts as of
> >3.2:
> >(it's a bit long...)
> >
> >UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
> >UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
> >
> >  
> >
> [snip]
> 
> Don't we use "unicode" as an alias for UTF-8 ?

Yes, unfortunately!

> 
> cheers
> 
> andrew
> 
> 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Palle Girgensohn wrote:
> >> Also, apparently, ICU is installed by default in many linux
> >> distributions,  and usually it is version 2.8. Some linux users have
> >> asked me if there are  plans for a patch that works with ICU 2.8. That's
> >> probably a good idea. IBM  and the ICU folks seem to consider 3.2 to be
> >> the stable version, older  versions are hard to find on their sites, but
> >> most linux distributers seem  to consider it too bleeding edge, even
> >> gentoo. I don't know why they don't  agree.
> >
> > Good point.  Why would linux folks need ICU?  Doesn't their OS support
> > encodings natively?  I am particularly excited about this for OSs that
> > don't have such encodings, like UTF8 support for Win32.
> >
> > Because ICU will not be used unless enabled by configure, it seems we
> > are fine with only supporting the newest version.  Do Linux users need
> > to use ICU for any reason?
> 
> 
> There are corner cases where it is impossible to upper/lowercase one 
> character at the time. for example:
> 
> -- without ICU
>  select upper('E?er');
>  upper
> ---
>  E?ER
> (1 row)
> 
> -- with ICU
> select upper('E?er');
>  upper
> ---
>  ESSER
> (1 rad)
> 
> This is because in the standard postgres implementation, upper/lower is 
> done one character at the time. A proper upper/lower cannot do it that way. 
> Other known example is in Turkish, where an ? (?) should look different 
> whether it is an initial letter or not. This fails in standard postgresql 
> for all platforms.

Uh, where do you see that?  Our code has:

workspace = texttowcs(string);

for (i = 0; workspace[i] != 0; i++)
workspace[i] = towupper(workspace[i]);

result = wcstotext(workspace, i);


> >> Also, in the latest patch, I also added checks and logging for *every*
> >> status returned from ICU. I hope this will help debugging on debian,
> >> where  previous version didn't work. That excessive status checking is
> >> hardly be  necessary once the stuff is better tested.
> >>
> >> I think the string copying and heap/palloc choices stands for most of
> >> the  code bloat, together with the excessive status checking and logging.
> >
> > OK, move that into some common functions and I think it will be better.
> 
> Best way for upper/lower/initcap is probably to use a function pointer... 
> uhh...

Uh, I don't think so.  Just send pointers to the the function and let
the function allocate the memory, and another function to free them, or
something like that.  I can probably do it if you want.

> >> > Why do you need to add a mapping of encoding names from iana to our
> >> > names?
> >>
> >> This was already answered by John Hansen... There's an old thread here
> >> about the choice of the name "UNICODE" to describe an encoding, which it
> >> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
> >> by  postgresql, that would have been a better name... Similarly for most
> >> other  encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
> >> PostgreSQL  can't provide it, so a mapping table is required.
> >
> > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
> > help?
> 
> I'm aware of that. It might help for unicode, but there are a bunch of 
> other encodings. IANA has decided that utf-8 has *no* aliases, hence only 
> utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is 
> fogiving, I don't remember/know, but I think we need the mappings, 
> unfortunately.

OK.  I guess I am just confused why the native implementations are OK.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Andrew Dunstan wrote:
> 
> 
> John Hansen wrote:
> 
> >Here is the list of encoding names and aliases the ICU accepts as of
> >3.2:
> >(it's a bit long...)
> >
> >UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
> >UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
> >
> >  
> >
> [snip]
> 
> Don't we use "unicode" as an alias for UTF-8 ?

We do.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
> 
> There are two reasons for that optimization --- first, some 
> locale support is broken and Unicode encoding with a C locale 
> crashes (not an issue for ICU), and second, it is an 
> optimization for languages like Japanese that want to use 
> unicode, but don't need a locale because upper/lower means 
> nothing in those character sets.

No, upper/lower means nothing in those languages, so why would you need
to optimize upper/lower if they're not used??
And if they are, it's obviously because the text contains characters
from other languages (probably english) and as such they should behave
correctly.

Did I mention that for japanese and the like, ICU would also offer
transliteration...

> 
> So, the first issue doesn't apply for ICU, and the second 
> might not depending on what characters you are using in the 
> Unicode character set.
> 
> I guess I am little confused how ICU can do upper() when the 
> locale is C.  What is it using to determine A is upper for a? 
>  Am I confused?

Simple, UNICODE basically consist of a table of characters
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)

Excerpt:

0041;LATIN CAPITAL LETTER A;Lu;0;L;N0061;
...
0061;LATIN SMALL LETTER A;Ll;0;L;N;;;0041;;0041

From this you can see, that for 0041, which is capital letter A, there
is a mapping to it's lowercase counterpart, 0061
Likewise, there is a mapping for 0061 which says it's uppercase
counterpart is 0041.
There is also SpecialCasing.txt which covers those mappings that haven't
got a 1-1 mapping, such as the german SS.

These mappings are fixed, independent of locale, only a few cases from
specialcasing.txt depend on locale/context.



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Palle Girgensohn wrote:
> 
> --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen <[EMAIL PROTECTED]> 
> wrote:
> 
> > Btw, I had been planning to propose replacing every single one of the
> > built in charset conversion functions with calls to ICU (thus making pg
> > _depend_ on ICU), as this would seem like a cleaner solution than for us
> > to maintain our own conversion tables.
> >
> > ICU also has a fair few conversions that we do not have at present.

That is a much larger issue, similar to our shipping our own timezone
database.  What does it buy us?

o  Do we ship it in our tarball?
o  Is the license compatible?
o  Does it remove utils/mb conversions?
o  Does it allow us to index LIKE (next high char)?
o  Does it allow us to support multiple encodings in
   a single database easier?
o  performance?

> I just had a similar though. And why use ICU only for multibyte charsets? 
> If I use LATIN1, I still expect upper('?') => SS, and I don't get it... 
> Same for the Turkish example.

We assume the native toupper() can handle single-byte character
encodings.  We use towupper() only for wide character sets.


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> It seems 3.2 has much more support and bug fixes, I'm not 
> sure if we should really consider 2.8?

As I said, probably not worth the effort.

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 09.52.59 -0400 Bruce Momjian 
 wrote:

Palle Girgensohn wrote:
>> Also, apparently, ICU is installed by default in many linux
>> distributions,  and usually it is version 2.8. Some linux users have
>> asked me if there are  plans for a patch that works with ICU 2.8.
>> That's probably a good idea. IBM  and the ICU folks seem to consider
>> 3.2 to be the stable version, older  versions are hard to find on
>> their sites, but most linux distributers seem  to consider it too
>> bleeding edge, even gentoo. I don't know why they don't  agree.
>
> Good point.  Why would linux folks need ICU?  Doesn't their OS support
> encodings natively?  I am particularly excited about this for OSs that
> don't have such encodings, like UTF8 support for Win32.
>
> Because ICU will not be used unless enabled by configure, it seems we
> are fine with only supporting the newest version.  Do Linux users need
> to use ICU for any reason?
There are corner cases where it is impossible to upper/lowercase one
character at the time. for example:
-- without ICU
 select upper('E?er');
 upper
---
 E?ER
(1 row)
-- with ICU
select upper('E?er');
 upper
---
 ESSER
(1 rad)
This is because in the standard postgres implementation, upper/lower is
done one character at the time. A proper upper/lower cannot do it that
way.  Other known example is in Turkish, where an ? (?) should look
different  whether it is an initial letter or not. This fails in
standard postgresql  for all platforms.
Uh, where do you see that?  Our code has:
workspace = texttowcs(string);
for (i = 0; workspace[i] != 0; i++)
workspace[i] = towupper(workspace[i]);
as you see, the loop runs towupper for one character at the time. I cannot 
consider whether the letter is the initial, as required in Turkish, and it 
cannot really convert one character into two ('ß' -> 'SS')

result = wcstotext(workspace, i);

>> Also, in the latest patch, I also added checks and logging for *every*
>> status returned from ICU. I hope this will help debugging on debian,
>> where  previous version didn't work. That excessive status checking is
>> hardly be  necessary once the stuff is better tested.
>>
>> I think the string copying and heap/palloc choices stands for most of
>> the  code bloat, together with the excessive status checking and
>> logging.
>
> OK, move that into some common functions and I think it will be better.
Best way for upper/lower/initcap is probably to use a function
pointer...  uhh...
Uh, I don't think so.  Just send pointers to the the function and let
the function allocate the memory, and another function to free them, or
something like that.  I can probably do it if you want.
I'll check it out, it seems simple enough.
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does
> that help?
I'm aware of that. It might help for unicode, but there are a bunch of
other encodings. IANA has decided that utf-8 has *no* aliases, hence
only  utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
fogiving, I don't remember/know, but I think we need the mappings,
unfortunately.
OK.  I guess I am just confused why the native implementations are OK.
They're OK since they understand that UNICODE (or UTF8) is really utf-8. 
Problem is the strings used to describe them are not understood by ICU.

BTW, the pg_enc2iananame_tbl is only used *from* internal representation 
*to* IANA, not the other way around. Maybe that fact lowers the rate of 
confusion? ;-)

/Palle
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Palle Girgensohn wrote:
> >> This is because in the standard postgres implementation, upper/lower is
> >> done one character at the time. A proper upper/lower cannot do it that
> >> way.  Other known example is in Turkish, where an ? (?) should look
> >> different  whether it is an initial letter or not. This fails in
> >> standard postgresql  for all platforms.
> >
> > Uh, where do you see that?  Our code has:
> >
> > workspace = texttowcs(string);
> >
> > for (i = 0; workspace[i] != 0; i++)
> > workspace[i] = towupper(workspace[i]);
> 
> as you see, the loop runs towupper for one character at the time. I cannot 
> consider whether the letter is the initial, as required in Turkish, and it 
> cannot really convert one character into two ('?' -> 'SS')

Oh, OK. I thought texttowcs() would expand the string to allow such
conversions.

> >> > We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).  Does
> >> > that help?
> >>
> >> I'm aware of that. It might help for unicode, but there are a bunch of
> >> other encodings. IANA has decided that utf-8 has *no* aliases, hence
> >> only  utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
> >> fogiving, I don't remember/know, but I think we need the mappings,
> >> unfortunately.
> >
> > OK.  I guess I am just confused why the native implementations are OK.
> 
> They're OK since they understand that UNICODE (or UTF8) is really utf-8. 
> Problem is the strings used to describe them are not understood by ICU.
> 
> BTW, the pg_enc2iananame_tbl is only used *from* internal representation 
> *to* IANA, not the other way around. Maybe that fact lowers the rate of 
> confusion? ;-)

OK, got it.  I am still a little confused why every native
implementation understands our existing names but ICU does not.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
> Palle Girgensohn wrote:
> > 
> > --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen 
> > <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > Btw, I had been planning to propose replacing every single one of 
> > > the built in charset conversion functions with calls to ICU (thus 
> > > making pg _depend_ on ICU), as this would seem like a cleaner 
> > > solution than for us to maintain our own conversion tables.
> > >
> > > ICU also has a fair few conversions that we do not have 
> at present.
> 
> That is a much larger issue, similar to our shipping our own 
> timezone database.  What does it buy us?
>   
>   o  Do we ship it in our tarball?
>   o  Is the license compatible?
>   o  Does it remove utils/mb conversions?
>   o  Does it allow us to index LIKE (next high char)?
>   o  Does it allow us to support multiple encodings in
>  a single database easier?
>   o  performance?
> 
> > I just had a similar though. And why use ICU only for 
> multibyte charsets? 
> > If I use LATIN1, I still expect upper('?') => SS, and I 
> don't get it... 
> > Same for the Turkish example.
> 
> We assume the native toupper() can handle single-byte 
> character encodings.  We use towupper() only for wide character sets.

That assumption is wrong,...

Encoding latin1
Locale <> de*

Select Upper('ß'); (lowercase german SS)
Should return SS, but returns ß

... John



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 10.06.43 -0400 Bruce Momjian 
 wrote:

Palle Girgensohn wrote:
--On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen
<[EMAIL PROTECTED]>  wrote:
> Btw, I had been planning to propose replacing every single one of the
> built in charset conversion functions with calls to ICU (thus making pg
> _depend_ on ICU), as this would seem like a cleaner solution than for
> us to maintain our own conversion tables.
>
> ICU also has a fair few conversions that we do not have at present.
That is a much larger issue, similar to our shipping our own timezone
database.  What does it buy us?
o  Do we ship it in our tarball?
o  Is the license compatible?
It looks pretty similar to BSD, although I'm a novice on the subject.
	o  Does it remove utils/mb conversions?
Yes, it would probably be possible to remove pg's own conversions.
	o  Does it allow us to index LIKE (next high char)?
I beleive so, using ICU's substring stuff.
o  Does it allow us to support multiple encodings in
   a single database easier?
Heh, the ultimate dream. Perhaps?
	o  performance?
ICU in itself is said to be much faster than for example glibc. Problem is 
the need for conversion via UTF-16, which requires extra memory allocations 
and cpu cycles. I don't use glibc, but my very simple performance tests for 
FreeBSD show that it is similiar in speed.


I just had a similar though. And why use ICU only for multibyte
charsets?  If I use LATIN1, I still expect upper('?') => SS, and I don't
get it...  Same for the Turkish example.
We assume the native toupper() can handle single-byte character
encodings.  We use towupper() only for wide character sets.
True, problem is that native toupper/towupper run one char at the time. 
This is a bad design decision in POSIX, there is no way it can handle the 
examples above unless considering more than one character. ICU does just 
that.

/Palle
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] pgFoundry

2005-05-07 Thread Marc G. Fournier
On Sat, 7 May 2005, Greg Stark wrote:
But "tracking the status" of sub-projects is just not the kind of thing 
free software people do. They send emails when they have something to 
say.
in defence of Joshua's idea, there are some "large projects" within our 
development that would be nice to see some sort of 'time lines' for, but 
mainly those are the sorts of things that have set milestones to look for 
... I believe that Simon's work on the PITR stuff might fall under that, 
where he had various 'stages' he was working towards ...

But, I think that the # of "sub projects" that this would apply to is so 
few that setting anything up more formally then Simon sending out an 
updated patch would be more work then the derived benefit ...


Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
John Hansen wrote:
> Bruce Momjian wrote:
> > 
> > There are two reasons for that optimization --- first, some 
> > locale support is broken and Unicode encoding with a C locale 
> > crashes (not an issue for ICU), and second, it is an 
> > optimization for languages like Japanese that want to use 
> > unicode, but don't need a locale because upper/lower means 
> > nothing in those character sets.
> 
> No, upper/lower means nothing in those languages, so why would you need
> to optimize upper/lower if they're not used??

True.  I suppose it is for databases that use both Japanese and Latin
alphabets and run upper() on all values.

> And if they are, it's obviously because the text contains characters
> from other languages (probably english) and as such they should behave
> correctly.
> 
> Did I mention that for japanese and the like, ICU would also offer
> transliteration...

Interesting.

> > So, the first issue doesn't apply for ICU, and the second 
> > might not depending on what characters you are using in the 
> > Unicode character set.
> > 
> > I guess I am little confused how ICU can do upper() when the 
> > locale is C.  What is it using to determine A is upper for a? 
> >  Am I confused?
> 
> Simple, UNICODE basically consist of a table of characters
> (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)
> 
> Excerpt:
> 
> 0041;LATIN CAPITAL LETTER A;Lu;0;L;N0061;
> ...
> 0061;LATIN SMALL LETTER A;Ll;0;L;N;;;0041;;0041
> 
> >From this you can see, that for 0041, which is capital letter A, there
> is a mapping to it's lowercase counterpart, 0061
> Likewise, there is a mapping for 0061 which says it's uppercase
> counterpart is 0041.
> There is also SpecialCasing.txt which covers those mappings that haven't
> got a 1-1 mapping, such as the german SS.
> 
> These mappings are fixed, independent of locale, only a few cases from
> specialcasing.txt depend on locale/context.

As far as I know, the only way to use Unicode currently is to use a
locale that is unicode-aware.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tom Lane
"John Hansen" <[EMAIL PROTECTED]> writes:
> Where'd you get the licence from?

It was the first thing I came across in their docs:

http://icu.sourceforge.net/userguide/intro.html

Looking more closely, it may be that this license is only intended to
apply to the documentation and not the code ... though "free" code with
un-free documentation isn't real useful.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Andrew Sullivan
On Sat, May 07, 2005 at 02:52:57PM +1000, Neil Conway wrote:
> 
> So would you have us disable all the non-essential builtin functions? 
> (Many of which have has security problems in the past.) What about the 
> builtin encoding conversions, non-btree indexes, or a myriad of features 
> that not all users need or use?

This is not really analogous, because those are already on (and in
most cases, not easily disabled).  What you're arguing for is to add
yet another on-by-default feature.  Given that there's already a way
to turn it on, why make it automatic?  Moreover, if some repackager
wants to make this more convenient, s/he can do so by turning it on
by default.  I don't see what's wrong with conservatism here.

> What makes sense for the default configuration of an operating system 
> (which by nature must be hardened against attack) does not necessarily 
> make sense for a database system.

Indeed.  But that doesn't mean that the principle isn't sound for
both cases.  I haven't seen an argument against that yet.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The plural of anecdote is not data.
--Roger Brinner

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] FC3 broken with HEAD

2005-05-07 Thread Alvaro Herrera
On Sat, May 07, 2005 at 04:27:04PM +1000, Neil Conway wrote:
> Tom Lane wrote:
> >Yeah, we will.  Please file a bugzilla entry for this though --- I
> >concur that it is a linker bug.
> 
> Okay, patch reverted. The RH bug is here:
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=157126

Huh, RH's bugzilla is really slow.  Are they using the
PostgreSQL-powered bugzilla?

-- 
Alvaro Herrera (<[EMAIL PROTECTED]>)
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[HACKERS] Race conditions, race conditions!

2005-05-07 Thread Tom Lane
I promised an analysis of the problems Jan Wieck uncovered yesterday,
so here it is.

Jan had developed a testbed (which I hope he will post) that essentially
has a bunch of client threads all doing instances of the same
transaction in READ COMMITTED mode.  The intended database invariant is
that the number of rows in table t2 with a particular ID value is equal
to the "cnt" field of the single row in t1 with that ID value:

begin;
select cnt from t1 where id = K for update;
delete from t2 where id = K;
-- check returned rowcount to see that exactly CNT rows were deleted
insert into t2 values (K);  -- repeat this N times
update t1 set cnt = N where id = K;
commit;

K is randomly chosen for each execution from the known set of t1 keys,
and N is randomly chosen each time as a small positive integer.  This
should maintain the invariant, since at any time only one transaction
can be holding the SELECT FOR UPDATE lock on a particular t1 row.

The trick is to run this under astronomical load; 100 or so client
threads on a garden-variety PC is about right.

What Jan was seeing was that every so often, a client thread would
error out, reporting that its DELETE had deleted zero rows rather
than the expected number.  But subsequent examination showed the t2
rows as being there.

Investigation showed that the connected backend had properly acquired
FOR UPDATE lock on the t1 row, but the snapshot it was using for the
subsequent DELETE showed the inserter of the t1 row as still running.
This should be impossible, since the SELECT FOR UPDATE cannot lock an
uncommitted row, and in READ COMMITTED mode we certainly will take a
new snapshot for the DELETE.

However, there is a race condition here.  During transaction commit,
xact.c first marks the transaction committed in pg_clog, and then
clears its XID from PGPROC.  This means there is a narrow window
in which both TransactionIdDidCommit and TransactionIdIsInProgress
will return true.  (We cannot do it the other way around, because
if neither one is returning true, onlookers are entitled to assume
that the transaction has crashed.)  However, the tqual.c routines
will allow a row to be seen as committed as soon as
TransactionIdDidCommit(xmin) returns true.  So the scenario is:

1. Backend A does RecordTransactionCommit to mark itself committed
in pg_clog, but then loses the CPU in the narrow window between
doing that and clearing its PGPROC entry.  Because of the ridiculous
load, it doesn't get control back for awhile.

2. Backend B comes along to run the test transaction for the same K
value.  It inspects the t1 row, concludes it's committed, marks the
row as locked FOR UPDATE, and returns the results to the client.

3. The client now issues the DELETE command.  B takes a new snapshot,
but because A is still not cleared out of PGPROC, A's transaction
is shown as still running in the snapshot.

4. Now the DELETE will delete no rows, because it doesn't consider the
t2 rows it should delete to be committed.


AFAICS this race condition has always been there; certainly at
least since Vadim put in MVCC, and it looks likely that the
original Berkeley code had a form of the problem.

The correct fix is that the tqual.c routines should check commit status
in the following way:

if (TransactionIdIsInProgress(xid))
   // still in progress, don't touch tuple
else if (TransactionIdDidCommit(xid))
   // committed, mark accordingly
else
   // must be aborted or crashed, mark accordingly

rather than what they have traditionally done:

if (TransactionIdDidCommit(xid))
   // committed, mark accordingly
else if (TransactionIdDidAbort(xid))
   // aborted, mark accordingly
else
   // assume still in progress, don't touch tuple

Vadim recognized that the former logic was necessary for VACUUM to use
in deciding if it could clean up dead tuples, but apparently he never
made the extrapolation that it should be used *everywhere* transaction
status is tested.


The other interesting thing we saw was an Assertion failure.  The
postmaster log showed

WARNING:  relation "t1" page 196 is uninitialized --- fixing
TRAP: FailedAssertion("!PageHeader) ((PageHeader) pageHeader))->pd_upper == 
0))", File: "hio.c", Line: 263)
LOG:  server process (PID 11296) was terminated by signal 6

The WARNING could only have come from VACUUM.  (Jan's testbed does
launch a VACUUM every so often.)  The Assert failure is in
RelationGetBufferForTuple where it is adding a new page to a table.

I interpret this as the guy doing RelationGetBufferForTuple added a
page, but before he could initialize it, he lost control for long enough
for a VACUUM to scan through the entire table, see the zeroed page, and
"fix" it.  Then when the first guy got control again, his Assert saying
the page was zeroes failed.  The window for this exists because bufmgr/smgr
do physically extend th

Re: [HACKERS] FC3 broken with HEAD

2005-05-07 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Huh, RH's bugzilla is really slow.  Are they using the
> PostgreSQL-powered bugzilla?

They are, but I think it's a fairly old version (7.3 IIRC).
I'm still trying to get them to move some internal systems
off 7.1 :-(

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] to_char(interval) issues

2005-05-07 Thread Josh Berkus
Bruce,

> * Prevent to_char() on interval from returning meaningless values

> * Allow to_char() on interval values to accumulate the highest unit
>   requested

Sounds like it would cover my use cases.   Others?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Tom Lane
Neil Conway <[EMAIL PROTECTED]> writes:
> Users can already define SQL functions by default, which certainly 
> provides "programmable features". I'm not quite sure what you mean by 
> "user exits."

> I guess I'm missing how pl/pgsql is a fundamentally greater security risk.

plpgsql has control structures (loops, IF); SQL functions don't.
That makes for a fundamental difference in the power of the programming
language ... at least according to CS theory as I was taught it.  Now
admittedly the "primitive statements" of SQL are a lot more powerful
than the primitive statements usually considered in programming language
theory, but I think there is still a pretty significant difference in
capability.

An example of why this could be interesting from a security standpoint
is that, given access to pg_shadow, it'd be pretty trivial to write
a plpgsql function that tries to break user passwords by brute force
(just generate possible passwords, hash them, and look for a match).
I don't see any way to do that in plain SQL, at least not without a
pre-existing SRF to generate the trial passwords for you.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn
--On lördag, maj 07, 2005 10.58.09 -0400 Tom Lane <[EMAIL PROTECTED]> wrote:
"John Hansen" <[EMAIL PROTECTED]> writes:
Where'd you get the licence from?
It was the first thing I came across in their docs:
http://icu.sourceforge.net/userguide/intro.html
Looking more closely, it may be that this license is only intended to
apply to the documentation and not the code ... though "free" code with
un-free documentation isn't real useful.
			regards, tom lane
Someone should ask the ICU folks about that...
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] Patch for database locale settings

2005-05-07 Thread Bruce Momjian

Description added to TODO:

* Allow locale to be set at database creation

  Currently locale can only be set during initdb.  No global tables have
  locale-aware columns.  However, the database template used during
  database creation might have locale-aware indexes.  The indexes would
  need to be reindexed to match the new locale.

---

Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > Maybe it would work if we forced indexes on shared relations to be
> > scanned using a fixed collation.
> 
> The shared relations only have indexes on name, oid, and integer:
> select distinct atttypid::regtype
> from pg_class c join pg_attribute a on c.oid = a.attrelid
> where relisshared and relkind = 'i';
> and name has non-locale-sensitive ordering rules anyway.  So that's
> not the big problem; we could probably get away with decreeing that
> name will always be that way and that shared relations can't have
> locale-dependent indexes.
> 
> The big problem (and the reason why this idea has been shot down in
> the past) is that CREATE DATABASE can't change the locale from what it
> is in the template database unless it's prepared to reindex any locale-
> sensitive indexes in the non-shared part of the template database.
> Which would be a difficult undertaking seeing that we can't even connect
> to the copied database until after commit.
> 
> We could maybe say that we will never have any locale-dependent indexes
> at all on any system catalog, but what of user-defined tables in
> template databases?  It would simply not work to do something as simple
> as creating a table with an indexed text column in template1.
> 
> On the other hand you could argue that people already run the same kind
> of risk when changing database encoding at CREATE, which is a feature
> that's been there a long time and hasn't garnered many complaints.
> Not so much that their indexes will break as that their data will.
> So perhaps we should be willing to document "don't do that".  Certainly
> it would be a lot more useful if both locale and encoding could be set
> per-database.
> 
>   regards, tom lane
> 
> ---(end of broadcast)---
> TIP 5: Have you checked our extensive FAQ?
> 
>http://www.postgresql.org/docs/faq
> 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] Cleaning up unreferenced table files

2005-05-07 Thread Heikki Linnakangas
Maybe we should take a different approach to the problem:
1. Create new file with an extension to mark that it's not
   yet committed (eg. 1234.notcommitted)
2. ...
3. Take CheckpointStartLock
4. Write commit record to WAL, with list of created files.
5. rename created file (1234.notcommitted -> 1234).
6. Release CheckpointStartLock
This would guarantee that after successful WAL replay, all files in the 
data directory with .notcommitted extension can be safely deleted. No need 
to read pg_database or pg_class.

We would take a performance hit because of the additional rename and fsync
step. Also, we must somehow make sure that the new file or the directory 
it's in is fsynced on checkpoint to make sure that the rename is flushed 
to disk.

A variant of the scheme would be to create two files on step 1. One would 
be the actual relfile (1234) and the other would an empty marker file
(1234.notcommitted). That way the smgr code wouldn't have to care it the 
file is new or not when opening it.

- Heikki
On Thu, 5 May 2005, Tom Lane wrote:
Bruce Momjian  writes:
Applied.
Now that I've had a chance to look at it, this patch is thoroughly
broken.  Problems observed in a quick review:
1. It doesn't work at all for non-default tablespaces: it will
claim that every file in such a tablespace is stale.  The fact
that it does that rather than failing entirely is accidental.
It tries to read the database's pg_class in the target tablespace
whether it's there or not.  Because the system is still in recovery
mode, the low-level routines allow the access to the nonexistent
pg_class table to pass --- in fact they think they should create
the file, so after it runs there's a bogus empty "1259" file in each
such tablespace (which of course it complains about, too).  The code
then proceeds to think that pg_class is empty so of course everything
draws a warning.
2. It's not robust against stale subdirectories of a tablespace
(ie, subdirs corresponding to a nonexistent database) --- again,
it'll try to read a nonexistent pg_class.  Then it'll produce a
bunch of off-target complaint messages.
3. It's assuming that relfilenode is unique database-wide, when no
such assumption is safe.  We only have a guarantee that it's unique
tablespace-wide.
4. It fails to examine table segment files (such as "nnn.1").  These
should be complained of when the "nnn" doesn't match any hash entry.
5. It will load every relfilenode value in pg_class into the hashtable
whether it's meaningful or not.  There should be a check on relkind.
6. I don't think relying on strtol to decide if a filename is entirely
numeric is very safe.  Note all the extra defenses in pg_atoi against
various platform-specific misbehaviors of strtol.  Personally I'd use a
strspn test instead.
7. There are no checks for readdir failure (compare any other readdir
loop in the backend).
See also Simon Riggs' complaints that the circumstances under which it's
done are pretty randomly selected.  (One particular thing that I think
is a bad idea is to do this in a standalone backend.  Any sort of
corruption in any db's pg_class would render it impossible to start up.)
To fix the first three problems, and also avoid the performance problem
of multiply rescanning a database's pg_class for each of its
tablespaces, I would suggest that the hashtable entries be widened to
RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid).  Then
there should be one iteration over pg_database to learn the OIDs and
default tablespaces of each database; with that you can read pg_class
from its correct location for each database and load all the entries
into the hashtable.  Then you iterate through the tablespaces looking
for stuff not present in the hashtable.  You might also want to build a
list or hashtable of known database OIDs, so that you can recognize a
stale subdirectory immediately and issue a direct complaint about it
without even recursing into it.
regards, tom lane
- Heikki
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] FC3 broken with HEAD

2005-05-07 Thread Greg Stark

I think this has more to do with poor architectural decisions made in
bugzilla. Things like storing the entire ticket history in one big record
instead of normalizing it.

This means you can't do anything to a ticket without munging through 10s of k
of data. This also creates problems when multiple people are trying to update
a single ticket at a time.

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Andrew - Supernews
On 2005-05-07, Tom Lane <[EMAIL PROTECTED]> wrote:
> Neil Conway <[EMAIL PROTECTED]> writes:
>> Users can already define SQL functions by default, which certainly 
>> provides "programmable features". I'm not quite sure what you mean by 
>> "user exits."
>
>> I guess I'm missing how pl/pgsql is a fundamentally greater security risk.
>
> plpgsql has control structures (loops, IF); SQL functions don't.
> That makes for a fundamental difference in the power of the programming
> language ... at least according to CS theory as I was taught it.

SQL functions do have control structures: CASE WHEN and recursion. I have
even implemented generate_series() in pure SQL.

> Now
> admittedly the "primitive statements" of SQL are a lot more powerful
> than the primitive statements usually considered in programming language
> theory, but I think there is still a pretty significant difference in
> capability.

The only thing that makes pg's SQL functions not turing-complete is the
fact that recursion depth is not unlimited. In practice this isn't much
of a restriction, since you can do large-scale iterations by using SQL
sets and joins rather than recursion.

> An example of why this could be interesting from a security standpoint
> is that, given access to pg_shadow, it'd be pretty trivial to write
> a plpgsql function that tries to break user passwords by brute force
> (just generate possible passwords, hash them, and look for a match).
> I don't see any way to do that in plain SQL, at least not without a
> pre-existing SRF to generate the trial passwords for you.

Writing a pure SQL SRF that generates a range of trial passwords is in
fact not hard at all. The key point to note is that you can generate
sets of literal values of moderate size using UNION ALL, and you can
then cross-join those sets against themselves multiple times to generate
much larger and more complex sets.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Josh Berkus
People:

Before we get into more minutia regarding potential security risk of plpgsql, 
are there any reasons *other* than security to not enable it?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Race conditions, race conditions!

2005-05-07 Thread Greg Stark

Tom Lane <[EMAIL PROTECTED]> writes:

> The trick is to run this under astronomical load; 100 or so client
> threads on a garden-variety PC is about right.

I wonder if there's an argument for building assertion-enabled builds with
code that randomly yields the processor some percentage of time before and
after taking a lock. It wouldn't catch every case but it might help.

-- 
greg


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] pgFoundry

2005-05-07 Thread Josh Berkus
Greg,

> I'm rather surprised Postgres doesn't have a good bug tracking system.
> That's something most projects find pretty essential. Strangely enough the
> reason seems to be that Postgres really doesn't have many bugs... Unlike
> web browsers or GUIs or most of the other free software projects out there,
> databases don't tolerate bugs well. Any serious bug is cause for an
> immediate point release. The only use for a bug tracking system would
> really be for tracking all those pesky "IWBNI" bugs that never rise to
> urgent status.

Actually, a bug tracker would be useful for two purposes:

1) Tracking bugs that are actually feature requests, in an effort (possibly 
futile) to cut down on the number of "when will PostgreSQL be able to use an 
index on MAX()?" requests that we get.   A certain amount of this is in our 
FAQ, but the FAQ has the flaws of both not being easily searchable, and 
having a very manual update process so that it frequently gets out of date.

2) Tracking bugs that were fixed (and features that were added) in particular 
releases so that users know when they need to upgrade.   For example, if a 
user had an index corruption problem with 7.4.1, it would be useful for them 
to know that an upgrade to 7.4.5 (as I recall) would fix it.  Currently, 
they'd have to read all of the release notes from 7.4.2 through 7.4.5 and 
decipher insider terminonolgy to figure it out -- not always easy to do, and 
even harder to convince your boss.

The problem is that a bug tracker would not be useful to the Postgresql 
*developers*; our current system works pretty well for us now.  Except for 
one possibility:

IF we had a formal bugtracker, are there people who are not currently 
contributing to PostgreSQL who would be willing/able to read, test, and 
analyse bug reports?   With the addition of several companies to our 
community, it's a possibility, and would make the trouble of using a bug 
tracker worthwhile, I think.

Comments?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] pgFoundry

2005-05-07 Thread Joshua D. Drake

What does it mean to "track" the status of something? How would the status
change except by discussion? What would be the point of announcing the status
of something without allowing people to comment?
No one said anything about not letting people comment or discuss. What I 
am suggesting is a better public presentation of what the heck is going 
on with PostgreSQL development.

I think you have a severely flawed idea of how free software development
proceeds.
Then you obviously aren't paying attention. Look at other major OSS 
projects. They have these things in place. Even the Linux kernel has a 
bugzilla (although I am not advocating bugzilla). Not to mention KDE, 
Gnome, Debian..

These projects also have reasonably defined milestones for particular 
releases and show status of those milestones during the release.

 What you're describing sounds like something a manager of a
commercial project would want. Perhaps it's something the managers of the
people working on Postgres on behalf of some corporate sponsors might want but
in those cases I doubt they would want the information to be public anyways.
What I am describing is what other large OSS projects already do.
In the free software world there's no top-down management of the project with
managers issuing direction and expecting feedback reports.
No but there are people in charge of particular tasks. There are people 
only working on certain things. Like the work that the people did on PITR.

 People only want
tools that make their lives easier. Not tools that make other people's lives
easier at the expense of their own convenience. The programmers are not
beholden to any corporate interests (other than their own sponsors, who
presumably are getting all the feedback they're looking for privately).
I am not suggesting that anybody be beholden to anybody accept maybe the 
community itself.

Sincerely,
Joshua D. Drake

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [PATCHES] Cleaning up unreferenced table files

2005-05-07 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Maybe we should take a different approach to the problem:
> 1. Create new file with an extension to mark that it's not
> yet committed (eg. 1234.notcommitted)

This is pushing the problem into the wrong place, viz the lowest-level
file access routines, which will now all have to know about
.notcommitted status.  It also creates race conditions --- think about
backend A trying to commit file 1234 at about the same time that
backend B is trying to flush some dirty buffers belonging to that file.
But most importantly, it doesn't handle the file-deletion case.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] pgFoundry

2005-05-07 Thread Robert Treat
On Saturday 07 May 2005 15:31, Josh Berkus wrote:
> 2) Tracking bugs that were fixed (and features that were added) in
> particular releases so that users know when they need to upgrade.  

one idea that has been quasi floated before would be something equivalent to 
Simon Riggs developer summaries and/or Bruce's release history, that could be 
updated as appropriate for each development cycle. If someone were willing to 
maintain it, we could put it on the website along with the TODO list. 

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tom Lane wrote:
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > Where'd you get the licence from?
> 
> It was the first thing I came across in their docs:
> 
> http://icu.sourceforge.net/userguide/intro.html
> 
> Looking more closely, it may be that this license is only 
> intended to apply to the documentation and not the code ... 
> though "free" code with un-free documentation isn't real useful.
> 

Right, it seems to apply only to the resources found on sourceforge.

>   regards, tom lane
> 
> 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Tom Lane
Josh Berkus  writes:
> Before we get into more minutia regarding potential security risk of plpgsql,
> are there any reasons *other* than security to not enable it?

Several potential issues have already been mentioned in this thread,
eg, what about shared library dependency vs linking plpgsql statically
into the backend, how will we deal with existing dump files, how will
it be possible for a DBA to disable the feature if he wishes.

But security is the easiest to handwave about with no facts ;-) ...
so if it gets shot down on that basis then no one need do the work
to figure out solutions to the other questions.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Race conditions, race conditions!

2005-05-07 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> I wonder if there's an argument for building assertion-enabled builds with
> code that randomly yields the processor some percentage of time before and
> after taking a lock. It wouldn't catch every case but it might help.

Seems like that would mainly help you find cases where you'd put a lock
acquire or release a bit too late or too soon in a sequence of events;
not cases where you'd failed to acquire a needed lock at all.  It'd be
more useful I think to have a facility that randomly stops backends for
awhile regardless of exactly where they are in the code.

A high-load test case actually does this to some extent, but the problem
is you have little reproducibility and no assurance that execution
stopped for long enough to let critical events happen elsewhere.  The
ideal facility I think would slow one backend much more than others,
whereas high load still leaves them all making progress at about the
same rate ...

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Neil Conway
Andrew Sullivan wrote:
This is not really analogous, because those are already on
Which is my point: you're suggesting we retrofit a security policy onto 
PG that does not apply to the vast majority of the base system -- and 
that if applied would require fundamental changes.

Indeed.  But that doesn't mean that the principle isn't sound for
both cases.  I haven't seen an argument against that yet.
Security (in the limited sense of "disabling features by default") is 
not free; there is a tradeoff between security and convenience, security 
and administrative simplicity, and so on. Given that I have yet to see a 
single substantive argument for pl/pgsql being a security risk that has 
withstood any scrutiny, I don't see that the "security" side of the 
tradeoff has a lot of merit.

-Neil
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tom Lane
"John Hansen" <[EMAIL PROTECTED]> writes:
> Btw, I had been planning to propose replacing every single one of the
> built in charset conversion functions with calls to ICU (thus making
> pg _depend_ on ICU),

I find that fairly unacceptable ... especially given the licensing
questions, but in any case.

It might be OK to rip out the existing conversion support and say
that *if* you want encoding conversion, you have to use ICU.  But
I don't want to be told you cannot build PG without ICU period.

The 3.2 vs 2.8 business is disturbing also; specifically, I don't
think we get to require 3.2 on a platform where 2.8 is installed.
People just aren't going to hold still for that, even assuming
that ICU supports installing both versions at once, which isn't
clear to me at the moment ...

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[HACKERS] Fix PID file location?

2005-05-07 Thread Josh Berkus
Hey, folks,

I've noticed a problem with alternate PGDATA locations.  Here's how to 
reproduce:

On 8.0.2 on RHAS4:

1) Initdb a directory (on my system, /pgdata/pgdata)
2) Move the .conf files to an alternate location ( /etc/pgsql/)
3) Set $PGDATA to the alternate location ( /etc/pgsql )
4) Edit postgresql.conf to support this file arrangement
data_directory = '/pgdata/pgdata'   
5) pg_ctl start PostgreSQL
6) pg_ctl stop PostgreSQL
7) Get an error:  "No PID file found".

The problem seems to be that pg_ctl expects the PID file to be in $PGDATA, but 
the file actually gets written by the postmaster to the actual data 
directory.  You can work around this by setting "external_pid_file", but this 
then prevents you from using external_pid_file for another purpose.

Seems like it should be a relatively easy fix, although I'm not sure whether 
the postmaster should write the PID to $PGDATA, or whether pg_ctl should be 
made to look in the right place.  Probably the latter.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tatsuo Ishii
> Bruce Momjian wrote:
(B> > 
(B> > There are two reasons for that optimization --- first, some 
(B> > locale support is broken and Unicode encoding with a C locale 
(B> > crashes (not an issue for ICU), and second, it is an 
(B> > optimization for languages like Japanese that want to use 
(B> > unicode, but don't need a locale because upper/lower means 
(B> > nothing in those character sets.
(B> 
(B> No, upper/lower means nothing in those languages, so why would you need
(B> to optimize upper/lower if they're not used??
(B> And if they are, it's obviously because the text contains characters
(B> from other languages (probably english) and as such they should behave
(B> correctly.
(B
(BYes, Japanese (and probably Chinese and Korean) languages include
(BASCII character. More precisely ASCII is part of Japanese
(Bencodings(LATIN1 is not, however). And we have no problem at all with
(Bglibc/C locale. See below("unitest" is an UNICODE database).
(B
(Bunitest=# create table t1(t text);
(BCREATE TABLE
(Bunitest=# \encoding EUC_JP
(Bunitest=# insert into t1 values('abc$B$"$$$&(B');
(BINSERT 1842628 1
(Bunitest=# select upper(t) from t1;
(B   upper   
(B---
(B ABC$B$"$$$&(B
(B(1 row)
(B
(BSo Japanese(including ASCII)/UNICODE behavior is perfectly correct at
(Bthis moment. So I strongly object removing that optimization.
(B--
(BTatsuo Ishii
(B
(B---(end of broadcast)---
(BTIP 3: if posting/reading through Usenet, please send an appropriate
(B  subscribe-nomail command to [EMAIL PROTECTED] so that your
(B  message can get through to the mailing list cleanly

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tatsuo Ishii
> Palle Girgensohn wrote:
> > 
> > --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > Btw, I had been planning to propose replacing every single one of the
> > > built in charset conversion functions with calls to ICU (thus making pg
> > > _depend_ on ICU), as this would seem like a cleaner solution than for us
> > > to maintain our own conversion tables.

I don't buy it. If current conversion tables does the right thing, why
we need to replace. Or if conversion tables are not correct, why don't
you fix it? I think the rule of character conversion will not change
frequently, especially for LATIN languages. Thus maintaining cost is
not too high.
--
Tatsuo Ishii

> > > ICU also has a fair few conversions that we do not have at present.
> 
> That is a much larger issue, similar to our shipping our own timezone
> database.  What does it buy us?
>   
>   o  Do we ship it in our tarball?
>   o  Is the license compatible?
>   o  Does it remove utils/mb conversions?
>   o  Does it allow us to index LIKE (next high char)?
>   o  Does it allow us to support multiple encodings in
>  a single database easier?
>   o  performance?

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Mike Mascari
Neil Conway wrote:
Andrew Sullivan wrote:
This is not really analogous, because those are already on
Security (in the limited sense of "disabling features by default") is 
not free; there is a tradeoff between security and convenience, security 
and administrative simplicity, and so on. Given that I have yet to see a 
single substantive argument for pl/pgsql being a security risk that has 
withstood any scrutiny, I don't see that the "security" side of the 
tradeoff has a lot of merit.
People who use views to achieve row security, which is a rather common 
paradigm, cannot allow users to create functions with side effects.

Mike Mascari
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Neil Conway
Mike Mascari wrote:
People who use views to achieve row security, which is a rather common 
paradigm, cannot allow users to create functions with side effects.
Can you elaborate? I'm not sure I follow you.
(I'll note anyway that (1) SQL functions can have side effects: CREATE 
FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example (2) 
Administrators can always choose to drop pl/pgsql for a particular 
database, disable it at initdb time, or REVOKE usage of pl/pgsql for 
particular users.)

-Neil
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Mike Mascari
Neil Conway wrote:
Mike Mascari wrote:
People who use views to achieve row security, which is a rather common 
paradigm, cannot allow users to create functions with side effects.

Can you elaborate? I'm not sure I follow you.
(I'll note anyway that (1) SQL functions can have side effects: CREATE 
FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example 
Wow. That's a problem IMHO. Many people use views to allow userA to 
query his salary without seeing userB's. If userA can just:

1. Create a SQL function which inserts into another table its arguments
2. Issue a query like:
SELECT *
FROM view_of_salaries_based_on_current_user
WHERE my_side_effect_function_that_inserts_into_a_temp_table(salary, 
employee);

and that function is writable with just 'SQL', then many, many people 
have a serious security risk on their hands. Perhaps this is why 
Oracle's standard UDFs cannot perform inserts, updates, or deletes.

Mike Mascari
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Race conditions, race conditions!

2005-05-07 Thread Tatsuo Ishii
Are we going to put the fixes into 8.0.3 and so on? Or will it be
included in 8.0.4?
--
Tatsuo Ishii

> I promised an analysis of the problems Jan Wieck uncovered yesterday,
> so here it is.
> 
> Jan had developed a testbed (which I hope he will post) that essentially
> has a bunch of client threads all doing instances of the same
> transaction in READ COMMITTED mode.  The intended database invariant is
> that the number of rows in table t2 with a particular ID value is equal
> to the "cnt" field of the single row in t1 with that ID value:
> 
>   begin;
>   select cnt from t1 where id = K for update;
>   delete from t2 where id = K;
>   -- check returned rowcount to see that exactly CNT rows were deleted
>   insert into t2 values (K);  -- repeat this N times
>   update t1 set cnt = N where id = K;
>   commit;
> 
> K is randomly chosen for each execution from the known set of t1 keys,
> and N is randomly chosen each time as a small positive integer.  This
> should maintain the invariant, since at any time only one transaction
> can be holding the SELECT FOR UPDATE lock on a particular t1 row.
> 
> The trick is to run this under astronomical load; 100 or so client
> threads on a garden-variety PC is about right.
> 
> What Jan was seeing was that every so often, a client thread would
> error out, reporting that its DELETE had deleted zero rows rather
> than the expected number.  But subsequent examination showed the t2
> rows as being there.
> 
> Investigation showed that the connected backend had properly acquired
> FOR UPDATE lock on the t1 row, but the snapshot it was using for the
> subsequent DELETE showed the inserter of the t1 row as still running.
> This should be impossible, since the SELECT FOR UPDATE cannot lock an
> uncommitted row, and in READ COMMITTED mode we certainly will take a
> new snapshot for the DELETE.
> 
> However, there is a race condition here.  During transaction commit,
> xact.c first marks the transaction committed in pg_clog, and then
> clears its XID from PGPROC.  This means there is a narrow window
> in which both TransactionIdDidCommit and TransactionIdIsInProgress
> will return true.  (We cannot do it the other way around, because
> if neither one is returning true, onlookers are entitled to assume
> that the transaction has crashed.)  However, the tqual.c routines
> will allow a row to be seen as committed as soon as
> TransactionIdDidCommit(xmin) returns true.  So the scenario is:
> 
> 1. Backend A does RecordTransactionCommit to mark itself committed
> in pg_clog, but then loses the CPU in the narrow window between
> doing that and clearing its PGPROC entry.  Because of the ridiculous
> load, it doesn't get control back for awhile.
> 
> 2. Backend B comes along to run the test transaction for the same K
> value.  It inspects the t1 row, concludes it's committed, marks the
> row as locked FOR UPDATE, and returns the results to the client.
> 
> 3. The client now issues the DELETE command.  B takes a new snapshot,
> but because A is still not cleared out of PGPROC, A's transaction
> is shown as still running in the snapshot.
> 
> 4. Now the DELETE will delete no rows, because it doesn't consider the
> t2 rows it should delete to be committed.
> 
> 
> AFAICS this race condition has always been there; certainly at
> least since Vadim put in MVCC, and it looks likely that the
> original Berkeley code had a form of the problem.
> 
> The correct fix is that the tqual.c routines should check commit status
> in the following way:
> 
>   if (TransactionIdIsInProgress(xid))
>  // still in progress, don't touch tuple
>   else if (TransactionIdDidCommit(xid))
>  // committed, mark accordingly
>   else
>  // must be aborted or crashed, mark accordingly
> 
> rather than what they have traditionally done:
> 
>   if (TransactionIdDidCommit(xid))
>  // committed, mark accordingly
>   else if (TransactionIdDidAbort(xid))
>  // aborted, mark accordingly
>   else
>  // assume still in progress, don't touch tuple
> 
> Vadim recognized that the former logic was necessary for VACUUM to use
> in deciding if it could clean up dead tuples, but apparently he never
> made the extrapolation that it should be used *everywhere* transaction
> status is tested.
> 
> 
> The other interesting thing we saw was an Assertion failure.  The
> postmaster log showed
> 
> WARNING:  relation "t1" page 196 is uninitialized --- fixing
> TRAP: FailedAssertion("!PageHeader) ((PageHeader) pageHeader))->pd_upper 
> == 0))", File: "hio.c", Line: 263)
> LOG:  server process (PID 11296) was terminated by signal 6
> 
> The WARNING could only have come from VACUUM.  (Jan's testbed does
> launch a VACUUM every so often.)  The Assert failure is in
> RelationGetBufferForTuple where it is adding a new page to a table.
> 
> I interpret this as the guy doing RelationGetBufferForTuple added a
> page, but before he could initialize i

Re: [HACKERS] Race conditions, race conditions!

2005-05-07 Thread Bruce Momjian
Tatsuo Ishii wrote:
> Are we going to put the fixes into 8.0.3 and so on? Or will it be
> included in 8.0.4?

We have removed 8.0.3 from the FTP servers and plan to re-release 8.0.3
and previous 7.X releases.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Andrew Dunstan

Mike Mascari wrote:
Neil Conway wrote:
Mike Mascari wrote:
People who use views to achieve row security, which is a rather 
common paradigm, cannot allow users to create functions with side 
effects.

Can you elaborate? I'm not sure I follow you.
(I'll note anyway that (1) SQL functions can have side effects: 
CREATE FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example 

Wow. That's a problem IMHO. Many people use views to allow userA to 
query his salary without seeing userB's. If userA can just:

1. Create a SQL function which inserts into another table its arguments
2. Issue a query like:
SELECT *
FROM view_of_salaries_based_on_current_user
WHERE my_side_effect_function_that_inserts_into_a_temp_table(salary, 
employee);

and that function is writable with just 'SQL', then many, many people 
have a serious security risk on their hands. Perhaps this is why 
Oracle's standard UDFs cannot perform inserts, updates, or deletes.


but the side effect function will only run (unless you set it with 
security definer) with the privileges of the caller - it won't grant 
visibility to things that user can't otherwise  see.

In any case, you should define your security setup  with the 
capabilities / limitations of the db engine in mind. If there is any 
security problem in your scenario, it is that you appear to have made 
unwarranted assumptions about how postgres works, rather than that 
postgres has a problem.

Either way, this does not illustrate how enabling plpgsql by default is 
a security risk.

cheers
andrew
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Mike Mascari
Andrew Dunstan wrote:

Mike Mascari wrote:

but the side effect function will only run (unless you set it with 
security definer) with the privileges of the caller - it won't grant 
visibility to things that user can't otherwise  see.
If the visibility is determined by view definitions, such as using 
CURRENT_USER, which is an exceedingly common practice, then the caller 
will be able to record tuples before they are filtered by the executor.

In any case, you should define your security setup  with the 
capabilities / limitations of the db engine in mind. If there is any 
security problem in your scenario, it is that you appear to have made 
unwarranted assumptions about how postgres works, rather than that 
postgres has a problem.
I think most people coming from any other enterprise-class RDBMS 
environment will be surprised that they cannot use VIEWs to provide 
user-specific views on data. I could be wrong, but I'd put money on it...

Either way, this does not illustrate how enabling plpgsql by default is 
a security risk.
Correct, as the vulnerability exists within the 'SQL' language as well. 
The only difference is that enabling plpgsql by default changes it from 
a leak to a full blown flood.

Mike Mascari
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Neil Conway
Mike Mascari wrote:
Correct, as the vulnerability exists within the 'SQL' language as well. 
The only difference is that enabling plpgsql by default changes it from 
a leak to a full blown flood.
How does it make any difference at all?
-Neil
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tatsuo Ishii
We have developed patches which relaxes the character validation so
that PostgreSQL accepts invalid characters. It works like this:

1) new postgresql.conf item "mbstr_check" added.
2) if mbstr_check = 0 then invalid characters are not accepted
   (same as current PostgreSQL behavior). This is the default.
3) if mbstr_check = 1 then invalid characters are accepted with
   WARNING
4) if mbstr_check = 2 then invalid characters are accepted without any
   warnings
5) We have checked PostgreSQL source code if accepting invalid
   characters makes some troubles. We have found that we need to fix a
   place and the fix is included in the patches.

Madison,
If you are interested in the patches, I could send it to you.

Hackers,
Do you think the functionality something like above is worth to add to
PostgreSQL?
--
Tatsuo Ishii

> Hi all,
> 
>I've been chasing down a bug and from what I have learned it may be 
> because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles 
> invalid unicode. I've been given some ideas on how to try to catch 
> invalid unicode but it seems expensive so I am hoping there is a 
> postgresql way to deal with this problem.
> 
>I've run into a problem where a bulk postgres "COPY..." statement is 
> dieing because one of the lines contains a file name with an invalid 
> unicode character. In nautilus this file has '(invalid encoding)' and 
> the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column 
> file_name: "Femme Fatal\u.url"'.
> 
>To actually look at the file from the shell (bash) shows what appears 
> to be a whitespace but when I copy/paste the file name I get the 
> '\u' you see above.
> 
>I could, with the help of the TLUG people, use regex to match for an 
> invalid character and skip the file but that is not ideal. The reason is 
> that this is for my backup program and invalid unicode or not, the 
> contents of the file may still be important and I would prefer to have 
> it in the database so that it is later copied. I can copy and move the 
> file in the shell so the file isn't apparently in an of itself corrupt.
> 
>So then, is there a way I can tell postresql to accept the invalid 
> unicode name? Here is a copy of my schema:
> 
> tle-bu=> \d file_info_2
>Table "public.file_info_2"
>  Column| Type |Modifiers
> --+--+-
>   file_group_name  | text |
>   file_group_uid   | bigint   | not null
>   file_mod_time| bigint   | not null
>   file_name| text | not null
>   file_parent_dir  | text | not null
>   file_perm| text | not null
>   file_size| bigint   | not null
>   file_type| character varying(2) | not null default 
> 'f'::character varying
>   file_user_name   | text |
>   file_user_uid| bigint   | not null
>   file_backup  | boolean  | not null default true
>   file_display | boolean  | not null default false
>   file_restore_display | boolean  | not null default false
>   file_restore | boolean  | not null default false
> Indexes:
>  "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)
> 
>'file_name' and 'file_parent_dir' are the columns that could have 
> entries with the invalid unicode characters. Maybe I could/should use 
> something other than 'text'? These columns could contain anything that a 
> file or directory name could be.
> 
>Thanks!
> 
> Madison
> 
> -- 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> Madison Kelly (Digimer)
> TLE-BU, The Linux Experience; Back Up
> http://tle-bu.thelinuxexperience.com
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> 
> ---(end of broadcast)---
> TIP 7: don't forget to increase your free space map settings
> 

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tom Lane wrote:
> "John Hansen" <[EMAIL PROTECTED]> writes:
> > Btw, I had been planning to propose replacing every single 
> one of the 
> > built in charset conversion functions with calls to ICU 
> (thus making 
> > pg _depend_ on ICU),
> 
> I find that fairly unacceptable ... especially given the 
> licensing questions, but in any case.

The licencing seems pretty clear to me.
http://www-306.ibm.com/software/globalization/icu/license.jsp

> 
> It might be OK to rip out the existing conversion support and 
> say that *if* you want encoding conversion, you have to use 
> ICU.  But I don't want to be told you cannot build PG without 
> ICU period.

Right, that could be done, but I think the issue at heart is _are_ 
we going to use it at all, and if so, locale support would certainly 
benefit from going that way as well.

> 
> The 3.2 vs 2.8 business is disturbing also; specifically, I 
> don't think we get to require 3.2 on a platform where 2.8 is 
> installed.

There seems to be nothing in the ICU licence that would prevent us from
bundling it.
This would solve both the 3.2 vs 2.8 problems, and would remove the
'dependency'.

> People just aren't going to hold still for that, even 
> assuming that ICU supports installing both versions at once, 
> which isn't clear to me at the moment ...

There's no problems with having both installed.
I did that on debian to get the patch going.
Tho, bundling it seems cleaner to me.

> 
>   regards, tom lane
> 
> 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
> I don't buy it. If current conversion tables does the right 
> thing, why we need to replace. Or if conversion tables are 
> not correct, why don't you fix it? I think the rule of 
> character conversion will not change frequently, especially 
> for LATIN languages. Thus maintaining cost is not too high.

I never said we need to, but if we're going to implement ICU,
then we might as well go all the way.

... John


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 10:09 AM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > Bruce Momjian wrote:
> > > 
> > > There are two reasons for that optimization --- first, 
> some locale 
> > > support is broken and Unicode encoding with a C locale 
> crashes (not 
> > > an issue for ICU), and second, it is an optimization for 
> languages 
> > > like Japanese that want to use unicode, but don't need a locale 
> > > because upper/lower means nothing in those character sets.
> > 
> > No, upper/lower means nothing in those languages, so why would you 
> > need to optimize upper/lower if they're not used??
> > And if they are, it's obviously because the text contains 
> characters 
> > from other languages (probably english) and as such they 
> should behave 
> > correctly.
> 
> Yes, Japanese (and probably Chinese and Korean) languages 
> include ASCII character. More precisely ASCII is part of Japanese
> encodings(LATIN1 is not, however). And we have no problem at 
> all with glibc/C locale. See below("unitest" is an UNICODE database).
> 
> unitest=# create table t1(t text);
> CREATE TABLE
> unitest=# \encoding EUC_JP
> unitest=# insert into t1 values('abcããã');
> INSERT 1842628 1
> unitest=# select upper(t) from t1;
>upper   
> ---
>  ABCããã
> (1 row)
> 
> So Japanese(including ASCII)/UNICODE behavior is perfectly 
> correct at this moment. 

Right, so you _never_ use accented ascii characters in Japanese? 
(like à for example, whose uppercase is Ã)

> So I strongly object removing that optimization.

I'm guessing this would call for a vote then, since if implementing ICU, then
I'd have to object to leaving it in.

Changing the bahaviour of ICU doesn't seem right. Changing the behaviour of pg, 
so that it works as it should when using unicode, seems the right solution to 
me.

> --
> Tatsuo Ishii
> 
> 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 12:01 PM
> To: [EMAIL PROTECTED]
> Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> 
> We have developed patches which relaxes the character 
> validation so that PostgreSQL accepts invalid characters. It 
> works like this:

That is just plain 100% wrong!!
Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at 
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.

... John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] rendezvous

2005-05-07 Thread Alvaro Herrera
On Sat, May 07, 2005 at 10:43:46AM +0200, Jochem van Dieten wrote:
> On 5/7/05, Alvaro Herrera wrote:
> > On Fri, May 06, 2005 at 03:30:10PM -0700, Joshua D. Drake wrote:
> >> 
> >> Rendezvous is the Apple network discovery protocol yes? That was renamed 
> >> Bonjour by apple due to a Trademark problem.
> > 
> > Maybe we should name it Zeroconf.
> 
> Is the implemented protocol IETF ZeroConf or Apple Rendezvous? IIRC
> there are differences in the requirements for TTLs and I have had some
> experiences with Apple reserving local.arpa while the rest of the
> world doesn't.

Huh, not sure.  The only thing the code does is check for
DNSServiceDiscovery.h and call a function therein.  No implementation
details are taken care of in the Postgres source.

My bet is that it only works on Mac OS X.  If there was interest I'm
sure we could make it work using libhowl but I don't see anybody asking ...

-- 
Alvaro Herrera (<[EMAIL PROTECTED]>)
"No necesitamos banderas
 No reconocemos fronteras"  (Jorge González)

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Madison Kelly wrote:
>Under most circumstances I would agree with you completely. In my 
> case though I have to decide between risking a loss of a 
> user's data or 
> attempt to store the file name in some manner that would 
> return the same 
> name used by the file system.
> 
>The user (or one of his/her users in the case of an admin) may be 
> completely unaware of the file name being an invalid unicode 
> name. The 
> file itself though may still be quite valid and contain information 
> worthy of backing up. I could notify the user/admin that the 
> name is not 
> valid but there is no way I could rely on the name being 
> changed. Given 
> the choices, I would prefer to attempt to store/use the file 
> name with 
> the invalid unicode character than simply ignore the file.
> 
>Is there a way to store the name in raw binary? If so, 
> would this not 
> be safe because to postgresql it should no longer matter what 
> data is or 
> represents, right? Maybe there is a third option I am not yet 
> concidering?

Set the client_encoding to ascii when storing that name, and again when
retrieving it.
Or, use a bytea column.

> 
> Madison

... John

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Alvaro Herrera
On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> Tatsuo Ishii wrote:

> > So Japanese(including ASCII)/UNICODE behavior is perfectly 
> > correct at this moment. 
> 
> Right, so you _never_ use accented ascii characters in Japanese? 
> (like è for example, whose uppercase is È)

That isn't ASCII.  It's latin1 or some other ASCII extension.

-- 
Alvaro Herrera (<[EMAIL PROTECTED]>)
"La principal característica humana es la tontería"
(Augusto Monterroso)

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tom Lane
"John Hansen" <[EMAIL PROTECTED]> writes:
> Tatsuo Ishii wrote:
>> We have developed patches which relaxes the character 
>> validation so that PostgreSQL accepts invalid characters.

> That is just plain 100% wrong!!

That was my first reaction too.  Why would this be a good idea?
If someone does want an encoding-agnostic database, they can
set it as SQL_ASCII.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Alvaro Herrera wrote:
> Sent: Sunday, May 08, 2005 2:49 PM
> To: John Hansen
> Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; 
> [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > Tatsuo Ishii wrote:
> 
> > > So Japanese(including ASCII)/UNICODE behavior is 
> perfectly correct 
> > > at this moment.
> > 
> > Right, so you _never_ use accented ascii characters in Japanese? 
> > (like è for example, whose uppercase is È)
> 
> That isn't ASCII.  It's latin1 or some other ASCII extension.

Point taken...
But...

If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, 
not UTF-8 (unicode).
UTF-8 encoded databases are very useful for representing multiple languages in 
the same database,
but this usefulness vanishes if functions like upper/lower doesn't work 
correctly.

So optimizing for 3 languages breaks more than a hundred, that's doesn't seem 
fair!

> 
> --
> Alvaro Herrera (<[EMAIL PROTECTED]>) "La principal 
> característica humana es la tontería"
> (Augusto Monterroso)
> 
> 

... John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] Can we get patents?

2005-05-07 Thread Alvaro Herrera
Hackers,

I was reading LWN.net and noticed an article about Eben Moglen's keynote
at linux.conf.au.  Apparently he advises free software projects to get
patents on their best ideas.

Eben encouraged free software developers to record their novel
inventions and to obtain patents on the best of them. Free legal
help can be made available to obtain patents on the best ideas.
Until the rules of the game can be changed, we must play the
game, and having the right patents available may make all the
difference in defending against an attack.

http://lwn.net/Articles/133421/

Eben Moglen is the FSF's attorney.

I'm wondering, could the PostgreSQL Foundation (or some other entity)
get patents on some parts of Postgres?  Maybe ResourceOwners for
example; or the newer parts of the optimizer.

The patents would be freely licensed to everyone (including commercial
redistributors and developers/users of competing products), except to
patent litigators, or something like that.

-- 
Alvaro Herrera (<[EMAIL PROTECTED]>)
"La espina, desde que nace, ya pincha" (Proverbio africano)

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tatsuo Ishii
> Tatsuo Ishii wrote:
> > Sent: Sunday, May 08, 2005 12:01 PM
> > To: [EMAIL PROTECTED]
> > Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> > 
> > We have developed patches which relaxes the character 
> > validation so that PostgreSQL accepts invalid characters. It 
> > works like this:
> 
> That is just plain 100% wrong!!
>
> Under no circumstances should there be invalid data in a database.
> And if you're trying to make a database of invalid data, then at 
> least encode it using a valid encoding.
> 
> In fact, I've proposed strengthening the validation routines for UTF-8.

Actually I myself thought as you are before. Later I found that it was
not so good idea. People already have invalid encoded data in their
precious database and have very hard time to migrate to newer version
of PostgreSQL because of encoding validation.

Think about this kind of situation:

There is a table t1(member_id integer primary key, member_name text,
address text, phone text, email text). I have to reach each member by
either adress, phone or email. Unfortunately some of address field
have wrong encoded data. In this case I will use phone or email to
reach them. 

Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
to fix wrong encoded field but it will take more than 1 day. So I
would like to import the data first then fix wrong encoded field on
running database since I can reach members by phone or email even with
wrong encoded address field...

I saw this kind of situation in the real world and that's why we
developed the patches.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tatsuo Ishii
> Alvaro Herrera wrote:
> > Sent: Sunday, May 08, 2005 2:49 PM
> > To: John Hansen
> > Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; 
> > [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] Patch for collation using ICU
> > 
> > On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
> > > Tatsuo Ishii wrote:
> > 
> > > > So Japanese(including ASCII)/UNICODE behavior is 
> > perfectly correct 
> > > > at this moment.
> > > 
> > > Right, so you _never_ use accented ascii characters in Japanese? 
> > > (like è for example, whose uppercase is È)
> > 
> > That isn't ASCII.  It's latin1 or some other ASCII extension.
> 
> Point taken...
> But...
> 
> If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, 
> not UTF-8 (unicode).
> UTF-8 encoded databases are very useful for representing multiple languages 
> in the same database,
> but this usefulness vanishes if functions like upper/lower doesn't work 
> correctly.

I'm just curious if Germany/French/Spanish mixed text can be sorted
correctly. I think these languages need their own locales even with
UNICODE/ICU.

> So optimizing for 3 languages breaks more than a hundred, that's doesn't seem 
> fair!

Why don't you add a GUC variable or some such to control the
upper/lower behavior?
--
Tatsuo Ishii

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Can we get patents?

2005-05-07 Thread Marc G. Fournier
On Sun, 8 May 2005, Alvaro Herrera wrote:
Hackers,
I was reading LWN.net and noticed an article about Eben Moglen's keynote
at linux.conf.au.  Apparently he advises free software projects to get
patents on their best ideas.
Eben encouraged free software developers to record their novel
inventions and to obtain patents on the best of them. Free legal
help can be made available to obtain patents on the best ideas.
Until the rules of the game can be changed, we must play the
game, and having the right patents available may make all the
difference in defending against an attack.
http://lwn.net/Articles/133421/
Eben Moglen is the FSF's attorney.
I'm wondering, could the PostgreSQL Foundation (or some other entity)
get patents on some parts of Postgres?  Maybe ResourceOwners for
example; or the newer parts of the optimizer.
The patents would be freely licensed to everyone (including commercial
redistributors and developers/users of competing products), except to
patent litigators, or something like that.
Individual developers could get their work patent'd, I would imagine ...

Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster