Re: [HACKERS] FC3 broken with HEAD

2005-05-07 Thread Neil Conway
Tom Lane wrote:
Yeah, we will.  Please file a bugzilla entry for this though --- I
concur that it is a linker bug.
Okay, patch reverted. The RH bug is here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=157126
-Neil
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] CVS HEAD problem: psql: symbol lookup error:

2005-05-07 Thread Oleg Bartunov
On Sat, 7 May 2005, Markus Bertheau wrote:
See FC3 broken with HEAD.
I have Slackware 10.1

On 06/05/2005 23:54 +0400, Oleg Bartunov wrote:
Just got this problem.
Regards,
Oleg

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Where'd you get the licence from?
None of that is in the licence I'm reading!

(http://www-306.ibm.com/software/globalization/icu/index.jsp)
(http://www-306.ibm.com/software/globalization/icu/license.jsp) 

... John

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
 Sent: Saturday, May 07, 2005 3:17 PM
 To: Bruce Momjian
 Cc: Palle Girgensohn; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Patch for collation using ICU 
 
 Bruce Momjian pgman@candle.pha.pa.us writes:
  Tom Lane wrote:
  Not until ICU is released under a BSD license ...
 
  Well, readline isn't BSD either, but we use it.  Is it any different?
 
 Did you read the license?  Some of the more troubling bits:
 
 : It is the understanding of INTERNATIONAL BUSINESS MACHINES 
 CORPORATION
 : that the purpose for which its publications are being reproduced is
 : accurate and true as stated in your attached request.
 
 (er, which attached request would that be?)
 
 : Permission to quote from or reprint IBM publications is 
 limited to the
 : purpose and quantities originally requested and must not be 
 construed as
 : a blanket license to use the material for other purposes or 
 to reprint
 : other IBM copyrighted material.
 
 : IBM reserves the right to withdraw permission to reproduce 
 copyrighted
 : material whenever, in its discretion, it feels that the privilege of
 : reproducing its material is being used in a way detrimental to its
 : interest or the above instructions are not being followed 
 properly to
 : protect its copyright.
 
 : IBM may have patents or pending patent applications covering subject
 : matter in this document. The furnishing of this document 
 does not give
 : you any license to these patents. You can send license inquiries, in
 : writing, to:
 
 : For license inquiries regarding double-byte (DBCS) 
 information, contact
 : the IBM Intellectual Property Department in your country or send
 : inquiries, in writing, to:
 
   regards, tom lane
 



Re: [HACKERS] rendezvous

2005-05-07 Thread Jochem van Dieten
On 5/7/05, Alvaro Herrera wrote:
 On Fri, May 06, 2005 at 03:30:10PM -0700, Joshua D. Drake wrote:
 
 Rendezvous is the Apple network discovery protocol, yes? It was renamed 
 Bonjour by Apple due to a trademark problem.
 
 Maybe we should name it Zeroconf.

Is the implemented protocol IETF ZeroConf or Apple Rendezvous? IIRC
there are differences in the requirements for TTLs and I have had some
experiences with Apple reserving local.arpa while the rest of the
world doesn't.

Jochem



Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Simon Riggs
On Sat, 2005-05-07 at 14:52 +1000, Neil Conway wrote:
 Andrew Sullivan wrote:
  Sure it is.  Don't enable anything you don't need, is the first
  security rule.  Everything is turned off by default.  If you want it,
  enable it.
 
 So would you have us disable all the non-essential builtin functions? 
 (Many of which have had security problems in the past.) What about the 
 builtin encoding conversions, non-btree indexes, or a myriad of features 
 that not all users need or use?

I support Andrew's comment, though I might reword it to "Don't enable
anything that gives users programmable features or user exits by default."

You can't use the builtin encoding functions or non-btree indexes to
access things you are not supposed to.

Anything that is *always* there provides a platform for malware. 

I'm not really sure what is wrong with the CREATE LANGUAGE statement
anyway - it is dynamically accessible, so doesn't require changes that
affect other database instance users. I do understand the wish to make
the lives of admins easier, but this isn't a hard thing to do...

 What makes sense for the default configuration of an operating system 
 (which by nature must be hardened against attack) does not necessarily 
 make sense for a database system.

Security is everybody's job, not just the OS guys. Personally, I forget
that constantly, but the principle seems clear.

Best Regards, Simon Riggs





Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Neil Conway
Simon Riggs wrote:
I support Andrew's comment, though I might reword it to "Don't enable
anything that gives users programmable features or user exits by default."
Users can already define SQL functions by default, which certainly 
provides programmable features. I'm not quite sure what you mean by 
"user exits".

I guess I'm missing how pl/pgsql is a fundamentally greater security risk.
You can't use the builtin encoding functions or non-btree indexes to
access things you are not supposed to.
How can you use pl/pgsql to access things you are not supposed to?
-Neil


Re: [HACKERS] pgFoundry

2005-05-07 Thread Greg Stark

Joshua D. Drake [EMAIL PROTECTED] writes:

 No, that is public presentation of the project, not project development.
  I don't see that people are going to be able to participate in development
  if they don't use the mailing lists.
 
 I am not arguing that, but public mailing lists are no place to track the
 status of subprojects or tasks. They are for discussion.

What does it mean to track the status of something? How would the status
change except by discussion? What would be the point of announcing the status
of something without allowing people to comment?

I think you have a severely flawed idea of how free software development
proceeds. What you're describing sounds like something a manager of a
commercial project would want. Perhaps it's something the managers of the
people working on Postgres on behalf of some corporate sponsors might want but
in those cases I doubt they would want the information to be public anyways.

In the free software world there's no top-down management of the project with
managers issuing direction and expecting feedback reports. People only want
tools that make their lives easier. Not tools that make other people's lives
easier at the expense of their own convenience. The programmers are not
beholden to any corporate interests (other than their own sponsors, who
presumably are getting all the feedback they're looking for privately).

I'm rather surprised Postgres doesn't have a good bug tracking system. That's
something most projects find pretty essential. Strangely enough the reason
seems to be that Postgres really doesn't have many bugs... Unlike web browsers
or GUIs or most of the other free software projects out there, databases don't
tolerate bugs well. Any serious bug is cause for an immediate point release.
The only use for a bug tracking system would really be for tracking all those
pesky IWBNI bugs that never rise to urgent status.

But tracking the status of sub-projects is just not the kind of thing free
software people do. They send emails when they have something to say.

-- 
greg




Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On fredag, maj 06, 2005 23.31.20 -0400 Tom Lane [EMAIL PROTECTED] wrote:
Bruce Momjian pgman@candle.pha.pa.us writes:
Is this patch ready for application?
Not until ICU is released under a BSD license ...
It's not GPL anyway. Seems pretty much like the BSD license, at least more 
BSD-ish than GPL-ish.

http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html
/Palle


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On fredag, maj 06, 2005 22.57.59 -0400 Bruce Momjian 
pgman@candle.pha.pa.us wrote:

Is this patch ready for application?
http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.d
iff.gz
The web site is:
	http://people.freebsd.org/~girgen/postgresql-icu/readme.html
I don't think so, not quite. I have not had any positive reports from Linux 
users; this has only been tested in a FreeBSD environment. I'd say it needs 
some more testing.

Also, apparently, ICU is installed by default in many Linux distributions, 
and usually it is version 2.8. Some Linux users have asked me if there are 
plans for a patch that works with ICU 2.8. That's probably a good idea. IBM 
and the ICU folks seem to consider 3.2 to be the stable version; older 
versions are hard to find on their sites, but most Linux distributors seem 
to consider it too bleeding-edge, even Gentoo. I don't know why they don't 
agree.

I do have a few questions:
Why don't you use the lc_ctype_is_c() part of this test?
 	if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
Um, well, I didn't think about that. :)  What would be the locale in this 
case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide 
encoding, indeed. Then the strings will be handled like byte-wide chars. 
Yeah, it's a bug. I'll fix it! Thanks.

Why is so much code added, for example, in lower()?  The existing
multibyte code is much smaller, and lots of code is added in other
places too.
ICU uses UTF-16 internally, so all strings must be converted from the 
database encoding to UTF-16. Since that means the strings need to be 
copied, I took the same approach as in varlena.c:varstr_cmp(), where small 
strings use the heap and only larger strings use a palloc. Comments in 
varstr_cmp about performance made me use that approach.
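For readers who haven't looked at varstr_cmp(), the small-string trick being
borrowed can be sketched roughly like this (plain C, with malloc/free standing
in for palloc/pfree; the function name and buffer size are illustrative, not
taken from the actual patch):

```c
#include <stdlib.h>
#include <string.h>

#define LOCAL_BUF_SIZE 1024     /* short strings stay on the stack */

/* Copy a string through a working buffer, avoiding heap allocation in
 * the common short-string case.  In the backend the working buffer would
 * hold the UTF-16 conversion and the fallback would be palloc()/pfree(). */
static size_t process_string(const char *src, char *result, size_t result_size)
{
    char local_buf[LOCAL_BUF_SIZE];
    size_t len = strlen(src);
    char *work;

    if (len < LOCAL_BUF_SIZE)
        work = local_buf;           /* fast path: no allocation at all */
    else
    {
        work = malloc(len + 1);     /* rare path: string too big for stack */
        if (work == NULL)
            return 0;
    }

    memcpy(work, src, len + 1);     /* stand-in for the real conversion */

    strncpy(result, work, result_size - 1);
    result[result_size - 1] = '\0';

    if (work != local_buf)
        free(work);
    return len;
}
```

The point of the pattern is that upper()/lower() are usually called on short
values, so the allocator is skipped entirely on the hot path.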

Also, in the latest patch, I added checks and logging for *every* status 
returned from ICU. I hope this will help debugging on Debian, where the 
previous version didn't work. That excessive status checking will hardly be 
necessary once the stuff is better tested.

I think the string copying and heap/palloc choices account for most of the 
code bloat, together with the excessive status checking and logging.


Why do you need to add a mapping of encoding names from iana to our
names?
This was already answered by John Hansen... There's an old thread here 
about the choice of the name UNICODE to describe an encoding, which it 
doesn't. There's half a dozen unicode based encodings... UTF-8 is used by 
postgresql, that would have been a better name... Similarly for most other 
encodings, really. ICU expects a setlocale(3) string (i.e. IANA). PostgreSQL 
can't provide it, so a mapping table is required.
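The kind of mapping table being described might look roughly like this (a
hypothetical sketch; the entries, names, and lookup function are illustrative
and not the patch's actual table):

```c
#include <stddef.h>
#include <string.h>

/* Map PostgreSQL encoding names to the IANA-style names ICU understands.
 * These few entries are examples only, not the real patch's contents. */
struct encoding_map
{
    const char *pg_name;
    const char *iana_name;
};

static const struct encoding_map enc_map[] = {
    {"UNICODE", "UTF-8"},
    {"LATIN1",  "ISO-8859-1"},
    {"KOI8",    "KOI8-R"},
    {"WIN1252", "windows-1252"},
};

/* Return the IANA name for a PostgreSQL encoding name, or NULL if the
 * encoding is not in the table (caller must handle that case). */
static const char *pg_to_iana(const char *pg_name)
{
    for (size_t i = 0; i < sizeof(enc_map) / sizeof(enc_map[0]); i++)
        if (strcmp(enc_map[i].pg_name, pg_name) == 0)
            return enc_map[i].iana_name;
    return NULL;
}
```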

I use this patch in production on one FreeBSD 4.10 server at the moment. 
With the latest version, I've had no problems. Logging is switched on for 
now, and it shows no signs of ICU complaining. I'd like more reports on 
Linux, though.

/Palle
--
Palle Girgensohn wrote:
Hi!
I've put together a patch for using IBM's ICU package for collation.
If your OS does not have full support for collation or uppercase/lowercase
in multibyte locales, this might be useful. If you are using a multibyte
character encoding in your database and want collation, i.e. ORDER BY, and
also lower(), upper() and initcap() to work properly, this patch will do
just that.
This patch is needed for FreeBSD, since this OS has no support for
collation of, for example, Unicode locales (that is, wcscoll(3) does not
do what you expect if you set LC_ALL=sv_SE.UTF-8). AFAIK the patch is
*not* necessary for Linux, although IBM claims ICU collation to be about
twice as fast as glibc for simple western locales.
It adds a configure switch, `--with-icu', which will set up the code to
use ICU instead of wchar_t and wcscoll.
This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
seems to run well. I've not had the time to do any comparative
performance tests yet, but it seems it is at least not slower than
using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
I'd be delighted if some more experienced postgresql hackers would
review this stuff. The patch is pretty compact, so it's fast reading :)
I'm planning to add this patch as an option (tagged experimental) to
FreeBSD's postgresql port. Any ideas about whether this is a good idea
or not?
Any thoughts or ideas are welcome!
Cheers,
Palle
Patch at:
http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.
diff
ICU at sourceforge: http://icu.sf.net/
--
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 
 I use this patch in production on one FreeBSD 4.10 server at 
 the moment. 
 With the latest version, I've had no problems. Logging is 
 swithed on for 
 now, and it shows no signs of ICU complaining. I'd like more 
 reports on 
 Linux, though.

I currently use this on gentoo with ICU3.2 unmasked.

Works a dream, even with locale C and UNICODE database.

Small test:

createdb --encoding UNICODE --locale C test
psql test
set client_encoding=iso88591;
CREATE TABLE test (t text);
INSERT INTO test (t) VALUES ('æøå');
set client_encoding=unicode;
INSERT INTO test (t) SELECT upper(t) FROM test;
set client_encoding=iso88591;
SELECT * FROM test;
  t
-
 æøå
 ÆØÅ
(2 rows)

Just as I'd expect, as upper/lower/initcap are locale independent for these 
characters.





Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Palle Girgensohn wrote:
 
  Is this patch ready for application?
 
 I don't think so, not quite. I have not had any positive reports from linux 
 users, this is only tested in a FreeBSD environment. I'd say it needs some 
 more testing.

OK.

 Also, apparently, ICU is installed by default in many linux distributions, 
 and usually it is version 2.8. Some linux users have asked me if there are 
 plans for a patch that works with ICU 2.8. That's probably a good idea. IBM 
 and the ICU folks seem to consider 3.2 to be the stable version, older 
 versions are hard to find on their sites, but most linux distributers seem 
 to consider it too bleeding edge, even gentoo. I don't know why they don't 
 agree.

Good point.  Why would Linux folks need ICU?  Doesn't their OS support
encodings natively?  I am particularly excited about this for OSs that
don't have such encodings, like UTF8 support for Win32.

Because ICU will not be used unless enabled by configure, it seems we
are fine with only supporting the newest version.  Do Linux users need
to use ICU for any reason?

  I do have a few questions:
 
  Why don't you use the lc_ctype_is_c() part of this test?
 
  	if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
 
 Um, well, I didn't think about that. :)  What would be the locale in this 
 case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide 
 encoding, indeed. Then the strings will be handled like byte-wide chars. 
 Yeah, it's a bug. I'll fix it! Thanks.

The additional test is more of an optimization, and it fixes a problem
with some OSs that have processing problems with UTF8 when the locale is
supposed to be turned off, like in "C".  I realize ICU might be fine
with it but the optimization still is an issue.

  Why is so much code added, for example, in lower()?  The existing
  multibyte code is much smaller, and lots of code is added in other
  places too.
 
 ICU uses UTF-16 internally, so all strings must be converted from the 
 database encoding to UTF-16. Since that means the strings need to be 
 copied, I took the same approach as in varlena.c:varstr_cmp(), where small 
 strings use the heap and only larger strings use a palloc. Comments in 
 varstr_cmp about performance made me use that approach.

Oh, interesting.   I think you need to create new functions that
factor out that common code so the patch is smaller and easier to
maintain.

 Also, in the latest patch, I also added checks and logging for *every* 
 status returned from ICU. I hope this will help debugging on debian, where 
 previous version didn't work. That excessive status checking will hardly be 
 necessary once the stuff is better tested.
 
 I think the string copying and heap/palloc choices account for most of the 
 code bloat, together with the excessive status checking and logging.

OK, move that into some common functions and I think it will be better.

  Why do you need to add a mapping of encoding names from iana to our
  names?
 
 This was already answered by John Hansen... There's an old thread here 
 about the choice of the name UNICODE to describe an encoding, which it 
 doesn't. There's half a dozen unicode based encodings... UTF-8 is used by 
 postgresql, that would have been a better name... Similarly for most other 
 encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL 
 can't provide it, so a mapping table is required.

We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
help?

 I use this patch in production on one FreeBSD 4.10 server at the moment. 
 With the latest version, I've had no problems. Logging is switched on for 
 now, and it shows no signs of ICU complaining. I'd like more reports on 
 Linux, though.

OK, I certainly would like this all done for 8.1 which should have
feature freeze on July 1.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Errm,... initdb --encoding UNICODE --locale C




Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn
--On lördag, maj 07, 2005 22.53.46 +1000 John Hansen [EMAIL PROTECTED] 
wrote:

Errm,... initdb --encoding UNICODE --locale C
You mean that ICU *shall* be used even for the C locale, and not as Bruce 
suggested here:

I do have a few questions:
Why don't you use the lc_ctype_is_c() part of this test?
 	if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
Um, well, I didn't think about that. :)  What would be the locale in this
case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a wide
encoding, indeed. Then the strings will be handled like byte-wide chars.
Yeah, it's a bug. I'll fix it! Thanks.
John disagrees here, and I'm obliged to agree. Using the C locale, one will 
expect C collation, but upper/lower is better off still using ICU. Hence, 
the above stuff is *not* a bug. Do we agree?

/Palle







Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
 Palle Girgensohn wrote:
  
   Is this patch ready for application?
  
  I don't think so, not quite. I have not had any positive 
 reports from 
  linux users, this is only tested in a FreeBSD environment. 
 I'd say it 
  needs some more testing.
 
 OK.
 
  Also, apparently, ICU is installed by default in many linux 
  distributions, and usually it is version 2.8. Some linux users have 
  asked me if there are plans for a patch that works with ICU 2.8. 
  That's probably a good idea. IBM and the ICU folks seem to consider 
  3.2 to be the stable version, older versions are hard to 
 find on their 
  sites, but most linux distributers seem to consider it too bleeding 
  edge, even gentoo. I don't know why they don't agree.
 
 Good point.  Why would linux folks need ICU?  Doesn't their 
 OS support encodings natively?  I am particularly excited 
 about this for OSs that don't have such encodings, like UTF8 
 support for Win32.
 
 Because ICU will not be used unless enabled by configure, it 
 seems we are fine with only supporting the newest version.  
 Do Linux users need to use ICU for any reason?

Yes, because on many Linux platforms locale support is broken.
Also, ICU enables full Unicode support, particularly in multi-language
situations where locale is C, and makes upper/lower/initcap work as
expected, except where it depends on locale information.

There are also many other useful things in ICU that could be
implemented: transliteration and break iterators, for example.
Break iteration is particularly interesting for converting a text to a
list of words. Another is its built-in substring searching.

 
   I do have a few questions:
  
   Why don't you use the lc_ctype_is_c() part of this test?
  
 	if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
  
  Um, well, I didn't think about that. :)  What would be the 
 locale in 
  this case? c_C.UTF-8? ;)  Hmm, it is possible to have 
 CTYPE=C and use 
  a wide encoding, indeed. Then the strings will be handled 
 like byte-wide chars.
  Yeah, it's a bug. I'll fix it! Thanks.
 
  The additional test is more of an optimization, and it fixes a 
 problem with some OSs that have processing problems with UTF8 
 when the locale is supposed to be turned off, like in C.  I 
 realize ICU might be fine with it but the optimization still 
 is an issue.

That the locale is supposed to be turned off doesn't mean ICU shouldn't
be used. ICU is more than just locales.


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen 
 [EMAIL PROTECTED]
 wrote:
 
  Errm,... initdb --encoding UNICODE --locale C
 
 You mean that ICU *shall* be used even for the C locale, and 
 not as Bruce suggested here:

Yes, that's exactly what I mean.

 


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Btw, I had been planning to propose replacing every single one of the built-in 
charset conversion functions with calls to ICU (thus making pg _depend_ on 
ICU), as this would seem a cleaner solution than maintaining our own 
conversion tables.

ICU also has a fair few conversions that we do not have at present.

Any thoughts?

... John

 -Original Message-
 From: John Hansen 
 Sent: Saturday, May 07, 2005 11:09 PM
 To: 'Palle Girgensohn'; 'Bruce Momjian'
 Cc: 'pgsql-hackers@postgresql.org'
 Subject: RE: [HACKERS] Patch for collation using ICU
 
  --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen 
  [EMAIL PROTECTED]
  wrote:
  
   Errm,... initdb --encoding UNICODE --locale C
  
  You mean that ICU *shall* be used even for the C locale, and not as 
  Bruce suggested here:
 
 Yes, that's exactly what I mean.
 
  
   I do have a few questions:
  
   Why don't you use the lc_ctype_is_c() part of this test?
  
if (pg_database_encoding_max_length()  1  
 !lc_ctype_is_c())
  
   Um, well, I didn't think about that. :)  What would be the
  locale in
   this case? c_C.UTF-8? ;)  Hmm, it is possible to have
  CTYPE=C and use
   a wide encoding, indeed. Then the strings will be handled
  like byte-wide chars.
   Yeah, it's a bug. I'll fix it! Thanks.
  
  John disagrees here, and I'm obliged to agree. Using the C 
 locale, one 
  will expect C collation, but upper/lower is better off still using 
  ICU. Hence, the above stuff is *not* a bug. Do we agree?
  
  /Palle
  
  
  
   -Original Message-
   From: [EMAIL PROTECTED]
   [mailto:[EMAIL PROTECTED] On Behalf Of
  John Hansen
   Sent: Saturday, May 07, 2005 10:23 PM
   To: Palle Girgensohn; Bruce Momjian
   Cc: pgsql-hackers@postgresql.org
   Subject: Re: [HACKERS] Patch for collation using ICU
  
   
I use this patch in production on one FreeBSD 4.10 
 server at the 
moment.
With the latest version, I've had no problems. Logging is
  switched on
for now, and it shows no signs of ICU complaining. I'd 
 like more 
reports on Linux, though.
  
   I currently use this on gentoo with ICU3.2 unmasked.
  
   Works a dream, even with locale C and UNICODE database.
  
   Small test:
  
   createdb --encoding UNICODE --locale C test
   psql test
   set client_encoding=iso88591;
   CREATE TABLE test (t text);
   INSERT INTO test (t) VALUES ('æøå');
   set client_encoding=unicode;
   INSERT INTO test (t) SELECT upper(t) FROM test;
   set client_encoding=iso88591;
   SELECT * FROM test;
     t
   -----
    æøå
    ÆØÅ
   (2 rows)
  
   Just as I'd expect, as upper/lower/initcap are locale
  independent for
   these characters.
  
  


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian 
pgman@candle.pha.pa.us wrote:

Palle Girgensohn wrote:

 Is this patch ready for application?
I don't think so, not quite. I have not had any positive reports from
linux users; this is only tested in a FreeBSD environment. I'd say it
needs some more testing.
OK.
John Hansen just reported that it does work on linux. fine!

Also, apparently, ICU is installed by default in many linux
distributions, and usually it is version 2.8. Some linux users have
asked me if there are plans for a patch that works with ICU 2.8. That's
probably a good idea. IBM and the ICU folks seem to consider 3.2 to be
the stable version; older versions are hard to find on their sites, but
most linux distributors seem to consider it too bleeding edge, even
gentoo. I don't know why they don't agree.
Good point.  Why would linux folks need ICU?  Doesn't their OS support
encodings natively?  I am particularly excited about this for OSs that
don't have such encodings, like UTF8 support for Win32.
Because ICU will not be used unless enabled by configure, it seems we
are fine with only supporting the newest version.  Do Linux users need
to use ICU for any reason?

There are corner cases where it is impossible to upper/lowercase one
character at a time. For example:

-- without ICU
select upper('Eßer');
 upper
-------
 EßER
(1 row)

-- with ICU
select upper('Eßer');
 upper
-------
 ESSER
(1 rad)
This is because in the standard postgres implementation, upper/lower is 
done one character at a time. A proper upper/lower cannot do it that way. 
Another known example is in Turkish, where an İ (dotted capital I) should look different 
depending on whether it is an initial letter or not. This fails in standard postgresql 
for all platforms.

 I do have a few questions:

 Why don't you use the lc_ctype_is_c() part of this test?

if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
Um, well, I didn't think about that. :)  What would be the locale in
this  case? c_C.UTF-8? ;)  Hmm, it is possible to have CTYPE=C and use a
wide  encoding, indeed. Then the strings will be handled like byte-wide
chars.  Yeah, it's a bug. I'll fix it! Thanks.
The additional test is more of an optimization, and it fixes a problem
with some OSs that have processing problems with UTF8 when the locale is
supposed to be turned off, like in C.  I realize ICU might be fine
with it but the optimization still is an issue.
Well, the results are quite different, depending on whether ICU is used or 
not. See separate mail.


 Why is so much code added, for example, in lower()?  The existing
 multibyte code is much smaller, and lots of code is added in other
 places too.
ICU uses UTF-16 internally, so all strings must be converted from the
database encoding to UTF-16. Since that means the strings need to be
copied, I took the same approach as in varlena.c:varstr_cmp(), where
small strings use the stack and only larger strings use a palloc.
Comments in varstr_cmp about performance made me use that approach.
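The varstr_cmp()-style buffer strategy described here can be sketched roughly as follows. This is a hypothetical illustration: the threshold constant and function name are made up, plain malloc stands in for palloc, and ASCII toupper stands in for the real conversion work.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define LOCAL_BUF_LEN 16   /* hypothetical threshold; varlena.c uses a larger one */

/* Uppercase src into dst (ASCII only, for illustration), using a
 * stack buffer for short inputs and a heap allocation for long ones.
 * Returns 1 if the heap path was taken, 0 for the stack path,
 * -1 on allocation failure. */
static int upper_with_scratch(const char *src, char *dst)
{
    char    local_buf[LOCAL_BUF_LEN];
    size_t  len = strlen(src);
    char   *buf = (len < LOCAL_BUF_LEN) ? local_buf : malloc(len + 1);
    int     used_heap;

    if (buf == NULL)
        return -1;
    used_heap = (buf != local_buf);

    for (size_t i = 0; i <= len; i++)   /* includes the terminating NUL */
        buf[i] = (char) toupper((unsigned char) src[i]);
    memcpy(dst, buf, len + 1);          /* hand the result back */

    if (used_heap)
        free(buf);
    return used_heap;
}
```

The point of the pattern is that the common short-string case pays no allocator round-trip at all; only rare long inputs touch the heap.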
Oh, interesting.   I think you need to create new functions that
factor out that common code so the patch is smaller and easier to
maintain.
Hmm, yes, perhaps it can be refactored a bit. It has occurred to me...

Also, in the latest patch, I also added checks and logging for *every*
status returned from ICU. I hope this will help debugging on debian,
where the previous version didn't work. That excessive status checking is
hardly necessary once the stuff is better tested.
I think the string copying and heap/palloc choices stands for most of
the  code bloat, together with the excessive status checking and logging.
OK, move that into some common functions and I think it will be better.
Best way for upper/lower/initcap is probably to use a function pointer... 
uhh...


 Why do you need to add a mapping of encoding names from iana to our
 names?
This was already answered by John Hansen... There's an old thread here
about the choice of the name UNICODE to describe an encoding, which it
doesn't. There's half a dozen unicode based encodings... UTF-8 is used
by  postgresql, that would have been a better name... Similarly for most
other encodings, really. ICU expects a setlocale(3) string (i.e. IANA).
PostgreSQL  can't provide it, so a mapping table is required.
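The mapping table Palle describes could look roughly like this. A hypothetical sketch with a handful of illustrative rows only; the right-hand names are the IANA-style names a converter open call such as ICU's ucnv_open() would be given.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative PostgreSQL-name -> IANA-name pairs; a real table
 * would cover every encoding PostgreSQL knows about. */
static const struct {
    const char *pg_name;
    const char *iana_name;
} enc_map[] = {
    { "UNICODE", "UTF-8" },
    { "UTF8",    "UTF-8" },
    { "LATIN1",  "ISO-8859-1" },
    { "LATIN2",  "ISO-8859-2" },
};

/* Look up the IANA name for a PostgreSQL encoding name,
 * or NULL when no mapping is known. */
static const char *pg_to_iana(const char *pg_name)
{
    for (size_t i = 0; i < sizeof(enc_map) / sizeof(enc_map[0]); i++)
        if (strcmp(enc_map[i].pg_name, pg_name) == 0)
            return enc_map[i].iana_name;
    return NULL;
}
```

A linear scan is fine here since the table is tiny and lookups happen once per conversion setup, not per string.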
We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
help?
I'm aware of that. It might help for unicode, but there are a bunch of 
other encodings. IANA has decided that utf-8 has *no* aliases, hence only 
utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is 
forgiving, I don't remember/know, but I think we need the mappings, 
unfortunately.


I use this patch in production on one FreeBSD 4.10 server at the moment.
With the latest version, I've had no problems. Logging is switched on for
now, and it shows no signs of ICU complaining. I'd like more reports on
Linux, though.
OK, I certainly would like this all done for 8.1 which should have
feature freeze on 

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Palle Girgensohn wrote:
 I'm aware of that. It might help for unicode, but there are a 
 bunch of 
 other encodings. IANA has decided that utf-8 has *no* 
 aliases, hence only 
 utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is 
 forgiving, I don't remember/know, but I think we need the mappings, 
 unfortunately.
 

Here is the list of encoding names and aliases the ICU accepts as of
3.2:
(it's a bit long...)

UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
UTF-16BE x-utf-16be ibm-1200 ibm-1201 ibm-5297 ibm-13488 ibm-17584
windows-1201 cp1200 cp1201 UTF16_BigEndian
UTF-16LE x-utf-16le ibm-1202 ibm-13490 ibm-17586 UTF16_LittleEndian
windows-1200
UTF-32 ISO-10646-UCS-4 csUCS4 ucs-4
UTF-32BE UTF32_BigEndian ibm-1232 ibm-1233
UTF-32LE UTF32_LittleEndian ibm-1234
UTF16_PlatformEndian
UTF16_OppositeEndian
UTF32_PlatformEndian
UTF32_OppositeEndian
UTF-7 windows-65000
IMAP-mailbox-name
SCSU
BOCU-1 csBOCU-1
CESU-8
ISO-8859-1 ibm-819 IBM819 cp819 latin1 8859_1 csISOLatin1 iso-ir-100
ISO_8859-1:1987 l1 819
US-ASCII ASCII ANSI_X3.4-1968 ANSI_X3.4-1986 ISO_646.irv:1991
iso_646.irv:1983 ISO646-US us csASCII iso-ir-6 cp367 ascii7 646
windows-20127
gb18030 ibm-1392 windows-54936
ibm-367_P100-1995 ibm-367 IBM367
ibm-912_P100-1995 ibm-912 iso-8859-2 ISO_8859-2:1987 latin2 csISOLatin2
iso-ir-101 l2 8859_2 cp912 912 windows-28592
ibm-913_P100-2000 ibm-913 iso-8859-3 ISO_8859-3:1988 latin3 csISOLatin3
iso-ir-109 l3 8859_3 cp913 913 windows-28593
ibm-914_P100-1995 ibm-914 iso-8859-4 latin4 csISOLatin4 iso-ir-110
ISO_8859-4:1988 l4 8859_4 cp914 914 windows-28594
ibm-915_P100-1995 ibm-915 iso-8859-5 cyrillic csISOLatinCyrillic
iso-ir-144 ISO_8859-5:1988 8859_5 cp915 915 windows-28595
ibm-1089_P100-1995 ibm-1089 iso-8859-6 arabic csISOLatinArabic
iso-ir-127 ISO_8859-6:1987 ECMA-114 ASMO-708 8859_6 cp1089 1089
windows-28596 ISO-8859-6-I ISO-8859-6-E
ibm-813_P100-1995 ibm-813 iso-8859-7 greek greek8 ELOT_928 ECMA-118
csISOLatinGreek iso-ir-126 ISO_8859-7:1987 8859_7 cp813 813
windows-28597
ibm-916_P100-1995 ibm-916 iso-8859-8 hebrew csISOLatinHebrew iso-ir-138
ISO_8859-8:1988 ISO-8859-8-I ISO-8859-8-E 8859_8 cp916 916 windows-28598
ibm-920_P100-1995 ibm-920 iso-8859-9 latin5 csISOLatin5 iso-ir-148
ISO_8859-9:1989 l5 8859_9 cp920 920 windows-28599 ECMA-128
ibm-921_P100-1995 ibm-921 iso-8859-13 8859_13 cp921 921
ibm-923_P100-1998 ibm-923 iso-8859-15 Latin-9 l9 8859_15 latin0
csisolatin0 csisolatin9 iso8859_15_fdis cp923 923 windows-28605
ibm-942_P12A-1999 ibm-942 ibm-932 cp932 shift_jis78 sjis78
ibm-942_VSUB_VPUA ibm-932_VSUB_VPUA
ibm-943_P15A-2003 ibm-943 Shift_JIS MS_Kanji csShiftJIS windows-31j
csWindows31J x-sjis x-ms-cp932 cp932 windows-932 cp943c IBM-943C ms932
pck sjis ibm-943_VSUB_VPUA
ibm-943_P130-1999 ibm-943 Shift_JIS cp943 943 ibm-943_VASCII_VSUB_VPUA
ibm-33722_P12A-1999 ibm-33722 ibm-5050 EUC-JP
Extended_UNIX_Code_Packed_Format_for_Japanese csEUCPkdFmtJapanese
X-EUC-JP eucjis windows-51932 ibm-33722_VPUA IBM-eucJP
ibm-33722_P120-1999 ibm-33722 ibm-5050 cp33722 33722
ibm-33722_VASCII_VPUA
ibm-954_P101-2000 ibm-954 EUC-JP
ibm-1373_P100-2002 ibm-1373 windows-950
windows-950-2000 Big5 csBig5 windows-950 x-big5
ibm-950_P110-1999 ibm-950 cp950 950
macos-2566-10.2 Big5-HKSCS big5hk HKSCS-BIG5
ibm-1375_P100-2003 ibm-1375 Big5-HKSCS
ibm-1386_P100-2002 ibm-1386 cp1386 windows-936 ibm-1386_VSUB_VPUA
windows-936-2000 GBK CP936 MS936 windows-936
ibm-1383_P110-1999 ibm-1383 GB2312 csGB2312 EUC-CN ibm-eucCN hp15CN
cp1383 1383 ibm-1383_VPUA
ibm-5478_P100-1995 ibm-5478 GB_2312-80 chinese iso-ir-58 csISO58GB231280
gb2312-1980 GB2312.1980-0
ibm-964_P110-1999 ibm-964 EUC-TW ibm-eucTW cns11643 cp964 964
ibm-964_VPUA
ibm-949_P110-1999 ibm-949 cp949 949 ibm-949_VASCII_VSUB_VPUA
ibm-949_P11A-1999 ibm-949 cp949c ibm-949_VSUB_VPUA
ibm-970_P110-1995 ibm-970 EUC-KR KS_C_5601-1987 windows-51949 csEUCKR
ibm-eucKR KSC_5601 5601 ibm-970_VPUA
ibm-971_P100-1995 ibm-971 ibm-971_VPUA
ibm-1363_P11B-1998 ibm-1363 KS_C_5601-1987 KS_C_5601-1989 KSC_5601
csKSC56011987 korean iso-ir-149 5601 cp1363 ksc windows-949
ibm-1363_VSUB_VPUA
ibm-1363_P110-1997 ibm-1363 ibm-1363_VASCII_VSUB_VPUA
windows-949-2000 windows-949 KS_C_5601-1987 KS_C_5601-1989 KSC_5601
csKSC56011987 korean iso-ir-149 ms949
ibm-1162_P100-1999 ibm-1162
ibm-874_P100-1995 ibm-874 ibm-9066 cp874 TIS-620 tis620.2533 eucTH
cp9066
windows-874-2000 TIS-620 windows-874 MS874
ibm-437_P100-1995 ibm-437 IBM437 cp437 437 csPC8CodePage437 windows-437
ibm-850_P100-1995 ibm-850 IBM850 cp850 850 csPC850Multilingual
windows-850
ibm-851_P100-1995 ibm-851 IBM851 cp851 851 csPC851
ibm-852_P100-1995 ibm-852 IBM852 cp852 852 csPCp852 windows-852
ibm-855_P100-1995 ibm-855 IBM855 cp855 855 csIBM855 csPCp855
ibm-856_P100-1995 ibm-856 cp856 856
ibm-857_P100-1995 ibm-857 IBM857 cp857 857 csIBM857 windows-857
ibm-858_P100-1997 ibm-858 IBM00858 CCSID00858 CP00858
PC-Multilingual-850+euro cp858
ibm-860_P100-1995 ibm-860 IBM860 cp860 860 csIBM860

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 23.25.15 +1000 John Hansen [EMAIL PROTECTED] 
wrote:

Palle Girgensohn wrote:
I'm aware of that. It might help for unicode, but there are a
bunch of
other encodings. IANA has decided that utf-8 has *no*
aliases, hence only
utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
forgiving, I don't remember/know, but I think we need the mappings,
unfortunately.
Here is the list of encoding names and aliases the ICU accepts as of
3.2:
(it's a bit long...)
UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
No UTF8 in there. I think that's good, charset aliases are a hassle.
/Palle


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

 -Original Message-
 From: Palle Girgensohn [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, May 07, 2005 11:30 PM
 To: John Hansen; Bruce Momjian
 Cc: pgsql-hackers@postgresql.org
 Subject: RE: [HACKERS] Patch for collation using ICU
 
 
 
 --On lördag, maj 07, 2005 23.25.15 +1000 John Hansen 
 [EMAIL PROTECTED]
 wrote:
 
  Palle Girgensohn wrote:
  I'm aware of that. It might help for unicode, but there 
 are a bunch 
  of other encodings. IANA has decided that utf-8 has *no* aliases, 
  hence only
  utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is 
  forgiving, I don't remember/know, but I think we need the mappings, 
  unfortunately.
 
 
  Here is the list of encoding names and aliases the ICU accepts as of
  3.2:
  (it's a bit long...)
 
  UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
 
 No UTF8 in there. I think that's good, charset aliases are a hassle.

Yup! :)

 
 /Palle
 
 
 



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 22.22.52 +1000 John Hansen [EMAIL PROTECTED] 
wrote:

I use this patch in production on one FreeBSD 4.10 server at
the moment.
With the latest version, I've had no problems. Logging is
switched on for
now, and it shows no signs of ICU complaining. I'd like more
reports on
Linux, though.
I currently use this on gentoo with ICU3.2 unmasked.
Works a dream, even with locale C and UNICODE database.
Ah, good to hear, John. I believe your report about linux is what's keeping 
this back. Did you also manage to get it running on Debian?

/Palle


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

 -Original Message-
 From: Palle Girgensohn [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, May 07, 2005 11:33 PM
 To: John Hansen; Bruce Momjian
 Cc: pgsql-hackers@postgresql.org
 Subject: RE: [HACKERS] Patch for collation using ICU
 
 
 
 --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen 
 [EMAIL PROTECTED]
 wrote:
 
 
  I use this patch in production on one FreeBSD 4.10 server at the 
  moment.
  With the latest version, I've had no problems. Logging is 
 switched on 
  for now, and it shows no signs of ICU complaining. I'd like more 
  reports on Linux, though.
 
  I currently use this on gentoo with ICU3.2 unmasked.
 
  Works a dream, even with locale C and UNICODE database.
 
 Ah, good to hear, John. I believe your report about linux is 
 what's keeping this back. Did you also manage to get it 
 running on Debian?

Not without ICU3.2

 
 /Palle
 
 



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn
--On lördag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] 
wrote:

Btw, I had been planning to propose replacing every single one of the
built-in charset conversion functions with calls to ICU (thus making pg
_depend_ on ICU), as this would seem like a cleaner solution than for us
to maintain our own conversion tables.
ICU also has a fair few conversions that we do not have at present.
Any thoughts?
I just had a similar thought. And why use ICU only for multibyte charsets? 
If I use LATIN1, I still expect upper('ß') = SS, and I don't get it... 
Same for the Turkish example.

It does eat more memory, and can perhaps cost some performance? With 
the current scheme, a strdup is often enough, or at least just one palloc. 
With ICU, using UTF-16, you must allocate memory twice, once for the ICU 
internal UTF-16 representation. That's not a very strong objection, though, 
as this would be an option... :)

John, I have a hard time finding docs about what differs in ICU 2.8 from 
3.2. Do you have any pointers?

It seems 3.2 has much more support and bug fixes, I'm not sure if we should 
really consider 2.8?

/Palle
... John
-Original Message-
From: John Hansen
Sent: Saturday, May 07, 2005 11:09 PM
To: 'Palle Girgensohn'; 'Bruce Momjian'
Cc: 'pgsql-hackers@postgresql.org'
Subject: RE: [HACKERS] Patch for collation using ICU
 --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen
 [EMAIL PROTECTED]
 wrote:

  Errm,... initdb --encoding UNICODE --locale C

 You mean that ICU *shall* be used even for the C locale, and not as
 Bruce suggested here:
Yes, that's exactly what I mean.

  I do have a few questions:
 
  Why don't you use the lc_ctype_is_c() part of this test?
 
   if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
 
  Um, well, I didn't think about that. :)  What would be the
 locale in
  this case? c_C.UTF-8? ;)  Hmm, it is possible to have
 CTYPE=C and use
  a wide encoding, indeed. Then the strings will be handled
 like byte-wide chars.
  Yeah, it's a bug. I'll fix it! Thanks.

 John disagrees here, and I'm obliged to agree. Using the C
locale, one
 will expect C collation, but upper/lower is better off still using
 ICU. Hence, the above stuff is *not* a bug. Do we agree?

 /Palle


 
  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of
 John Hansen
  Sent: Saturday, May 07, 2005 10:23 PM
  To: Palle Girgensohn; Bruce Momjian
  Cc: pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] Patch for collation using ICU
 
  
   I use this patch in production on one FreeBSD 4.10
server at the
   moment.
   With the latest version, I've had no problems. Logging is
  swithed on
   for now, and it shows no signs of ICU complaining. I'd
like more
   reports on Linux, though.
 
  I currently use this on gentoo with ICU3.2 unmasked.
 
  Works a dream, even with locale C and UNICODE database.
 
  Small test:
 
  createdb --encoding UNICODE --locale C test
  psql test
  set client_encoding=iso88591;
  CREATE TABLE test (t text);
  INSERT INTO test (t) VALUES ('æøå');
  set client_encoding=unicode;
  INSERT INTO test (t) SELECT upper(t) FROM test;
  set client_encoding=iso88591;
  SELECT * FROM test;
    t
  -----
   æøå
   ÆØÅ
  (2 rows)
 
  Just as I'd expect, as upper/lower/initcap are locale
 independent for
  these characters.
 
 


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 23.33.31 +1000 John Hansen [EMAIL PROTECTED] 
wrote:


-Original Message-
From: Palle Girgensohn [mailto:[EMAIL PROTECTED]
Sent: Saturday, May 07, 2005 11:33 PM
To: John Hansen; Bruce Momjian
Cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Patch for collation using ICU

--On lördag, maj 07, 2005 22.22.52 +1000 John Hansen
[EMAIL PROTECTED]
wrote:

 I use this patch in production on one FreeBSD 4.10 server at the
 moment.
 With the latest version, I've had no problems. Logging is
switched on
 for now, and it shows no signs of ICU complaining. I'd like more
 reports on Linux, though.

 I currently use this on gentoo with ICU3.2 unmasked.

 Works a dream, even with locale C and UNICODE database.
Ah, good to hear, John. I believe your report about linux is
what's keeping this back. Did you also manage to get it
running on Debian?
Not without ICU3.2
Did you try the latest patch? Maybe it will help, and if not, it will 
(hopefully) give a lot more informative error messages.

/Palle


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Andrew Dunstan

John Hansen wrote:
Here is the list of encoding names and aliases the ICU accepts as of
3.2:
(it's a bit long...)
UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
 

[snip]
Don't we use unicode as an alias for UTF-8 ?
cheers
andrew


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 Did you try the latest patch? Maybe it will help, and if not, it will
 (hopefully) give a lot more informative error messages.

No, and I got rid of my debian boxes @ home.
The patch required a certain amount of modifications too, to even
compile with 2.8.

So I guess it's a valid question to ask: is it worth supporting 2.8?
It is of course an option to bundle icu 3.2 with pg!


... John



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
John Hansen wrote:
  --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen 
  [EMAIL PROTECTED]
  wrote:
  
   Errm,... initdb --encoding UNICODE --locale C
  
  You mean that ICU *shall* be used even for the C locale, and 
  not as Bruce suggested here:
 
 Yes, that's exactly what I mean.

There are two reasons for that optimization --- first, some locale
support is broken and Unicode encoding with a C locale crashes (not an
issue for ICU), and second, it is an optimization for languages like
Japanese that want to use unicode, but don't need a locale because
upper/lower means nothing in those character sets.

So, the first issue doesn't apply for ICU, and the second might not
depending on what characters you are using in the Unicode character set.

I guess I am little confused how ICU can do upper() when the locale is
C.  What is it using to determine A is upper for a?  Am I confused?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 

 -Original Message-
 From: Andrew Dunstan [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, May 07, 2005 11:39 PM
 To: John Hansen
 Cc: Palle Girgensohn; Bruce Momjian; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Patch for collation using ICU
 
 
 
 John Hansen wrote:
 
 Here is the list of encoding names and aliases the ICU accepts as of
 3.2:
 (it's a bit long...)
 
 UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
 
   
 
 [snip]
 
 Don't we use unicode as an alias for UTF-8 ?

Yes, unfortunately!

 
 cheers
 
 andrew
 
 



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Palle Girgensohn wrote:
  Also, apparently, ICU is installed by default in many linux
  distributions,  and usually it is version 2.8. Some linux users have
  asked me if there are  plans for a patch that works with ICU 2.8. That's
  probably a good idea. IBM  and the ICU folks seem to consider 3.2 to be
  the stable version, older  versions are hard to find on their sites, but
  most linux distributors seem to consider it too bleeding edge, even
  gentoo. I don't know why they don't  agree.
 
  Good point.  Why would linux folks need ICU?  Doesn't their OS support
  encodings natively?  I am particularly excited about this for OSs that
  don't have such encodings, like UTF8 support for Win32.
 
  Because ICU will not be used unless enabled by configure, it seems we
  are fine with only supporting the newest version.  Do Linux users need
  to use ICU for any reason?
 
 
 There are corner cases where it is impossible to upper/lowercase one 
 character at a time. For example:
 
 -- without ICU
 select upper('Eßer');
  upper
 -------
  EßER
 (1 row)
 
 -- with ICU
 select upper('Eßer');
  upper
 -------
  ESSER
 (1 rad)
 
 This is because in the standard postgres implementation, upper/lower is 
 done one character at a time. A proper upper/lower cannot do it that way. 
 Another known example is in Turkish, where an İ (dotted capital I) should look different 
 depending on whether it is an initial letter or not. This fails in standard postgresql 
 for all platforms.

Uh, where do you see that?  Our code has:

workspace = texttowcs(string);

for (i = 0; workspace[i] != 0; i++)
workspace[i] = towupper(workspace[i]);

result = wcstotext(workspace, i);


  Also, in the latest patch, I also added checks and logging for *every*
  status returned from ICU. I hope this will help debugging on debian,
  where  previous version didn't work. That excessive status checking is
  hardly necessary once the stuff is better tested.
 
  I think the string copying and heap/palloc choices stands for most of
  the  code bloat, together with the excessive status checking and logging.
 
  OK, move that into some common functions and I think it will be better.
 
 Best way for upper/lower/initcap is probably to use a function pointer... 
 uhh...

Uh, I don't think so.  Just send pointers to the function and let
the function allocate the memory, and another function to free them, or
something like that.  I can probably do it if you want.

   Why do you need to add a mapping of encoding names from iana to our
   names?
 
  This was already answered by John Hansen... There's an old thread here
  about the choice of the name UNICODE to describe an encoding, which it
  doesn't. There's half a dozen unicode based encodings... UTF-8 is used
  by  postgresql, that would have been a better name... Similarly for most
  other  encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
  PostgreSQL  can't provide it, so a mapping table is required.
 
  We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash).  Does that
  help?
 
 I'm aware of that. It might help for unicode, but there are a bunch of 
 other encodings. IANA has decided that utf-8 has *no* aliases, hence only 
 utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is 
 forgiving, I don't remember/know, but I think we need the mappings, 
 unfortunately.

OK.  I guess I am just confused why the native implementations are OK.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Andrew Dunstan wrote:
 
 
 John Hansen wrote:
 
 Here is the list of encoding names and aliases the ICU accepts as of
 3.2:
 (it's a bit long...)
 
 UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208
 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2
 
   
 
 [snip]
 
 Don't we use unicode as an alias for UTF-8 ?

We do.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
 
 There are two reasons for that optimization --- first, some 
 locale support is broken and Unicode encoding with a C locale 
 crashes (not an issue for ICU), and second, it is an 
 optimization for languages like Japanese that want to use 
 unicode, but don't need a locale because upper/lower means 
 nothing in those character sets.

No, upper/lower means nothing in those languages, so why would you need
to optimize upper/lower if they're not used??
And if they are, it's obviously because the text contains characters
from other languages (probably english) and as such they should behave
correctly.

Did I mention that for japanese and the like, ICU would also offer
transliteration...

 
 So, the first issue doesn't apply for ICU, and the second 
 might not depending on what characters you are using in the 
 Unicode character set.
 
 I guess I am little confused how ICU can do upper() when the 
 locale is C.  What is it using to determine A is upper for a? 
  Am I confused?

Simple, UNICODE basically consists of a table of characters
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)

Excerpt:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
...
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041

From this you can see that for 0041, which is capital letter A, there
is a mapping to its lowercase counterpart, 0061.
Likewise, there is a mapping for 0061 which says its uppercase
counterpart is 0041.
There is also SpecialCasing.txt, which covers those mappings that haven't
got a 1-1 mapping, such as the German ß to SS.

These mappings are fixed, independent of locale; only a few cases from
SpecialCasing.txt depend on locale/context.
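The lookup John describes is just field extraction from those semicolon-separated lines: field 12 (0-based) holds the simple uppercase mapping and field 13 the lowercase one. A small sketch of pulling one field out of a line; the helper name is made up here.

```c
#include <string.h>

/* Copy field number 'field' (0-based) of a semicolon-separated
 * UnicodeData.txt line into out. Empty fields are common; a field
 * past the end of the line yields an empty string. */
static void unicode_field(const char *line, int field, char *out, size_t outlen)
{
    const char *p = line;

    /* Skip 'field' separators to reach the start of the field. */
    for (int i = 0; i < field && p != NULL; i++) {
        p = strchr(p, ';');
        if (p != NULL)
            p++;
    }
    out[0] = '\0';
    if (p == NULL)
        return;

    const char *end = strchr(p, ';');           /* field runs to next ';' or EOL */
    size_t      len = end ? (size_t) (end - p) : strlen(p);

    if (len >= outlen)
        len = outlen - 1;                       /* truncate to fit */
    memcpy(out, p, len);
    out[len] = '\0';
}
```

Applied to the 0061 line quoted above, extracting field 12 returns "0041", which is exactly the uppercase-counterpart lookup John describes.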





Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Palle Girgensohn wrote:
 
 --On lördag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] 
 wrote:
 
  Btw, I had been planning to propose replacing every single one of the
  built-in charset conversion functions with calls to ICU (thus making pg
  _depend_ on ICU), as this would seem like a cleaner solution than for us
  to maintain our own conversion tables.
 
  ICU also has a fair few conversions that we do not have at present.

That is a much larger issue, similar to our shipping our own timezone
database.  What does it buy us?

o  Do we ship it in our tarball?
o  Is the license compatible?
o  Does it remove utils/mb conversions?
o  Does it allow us to index LIKE (next high char)?
o  Does it allow us to support multiple encodings in
   a single database easier?
o  performance?

 I just had a similar thought. And why use ICU only for multibyte charsets? 
 If I use LATIN1, I still expect upper('ß') = SS, and I don't get it... 
 Same for the Turkish example.

We assume the native toupper() can handle single-byte character
encodings.  We use towupper() only for wide character sets.


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 It seems 3.2 has much more support and bug fixes, I'm not 
 sure if we should really consider 2.8?

As I said, probably not worth the effort.



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 09.52.59 -0400 Bruce Momjian 
pgman@candle.pha.pa.us wrote:

Palle Girgensohn wrote:
 Also, apparently, ICU is installed by default in many linux
 distributions,  and usually it is version 2.8. Some linux users have
 asked me if there are  plans for a patch that works with ICU 2.8.
  That's probably a good idea. IBM and the ICU folks seem to consider
  3.2 to be the stable version, older versions are hard to find on
  their sites, but most linux distributors seem to consider it too
  bleeding edge, even gentoo. I don't know why they don't agree.

 Good point.  Why would linux folks need ICU?  Doesn't their OS support
 encodings natively?  I am particularly excited about this for OSs that
 don't have such encodings, like UTF8 support for Win32.

 Because ICU will not be used unless enabled by configure, it seems we
 are fine with only supporting the newest version.  Do Linux users need
 to use ICU for any reason?
There are corner cases where it is impossible to upper/lowercase one
character at a time. For example:
-- without ICU
 select upper('Eßer');
 upper
---
 EßER
(1 row)
-- with ICU
select upper('Eßer');
 upper
---
 ESSER
(1 rad)
This is because in the standard postgres implementation, upper/lower is
done one character at a time. A proper upper/lower cannot do it that
way.  Another known example is in Turkish, where an ? (?) should look
different whether it is an initial letter or not. This fails in
standard postgresql for all platforms.
Uh, where do you see that?  Our code has:
workspace = texttowcs(string);
for (i = 0; workspace[i] != 0; i++)
workspace[i] = towupper(workspace[i]);
as you see, the loop runs towupper for one character at a time. It cannot
consider whether the letter is the initial, as required in Turkish, and it
cannot really convert one character into two ('ß' -> 'SS')

result = wcstotext(workspace, i);
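A toy sketch (Python, purely illustrative, not the C above) makes the limitation concrete: a loop that maps one character at a time can never change the string length, so 'ß' -> 'SS' is unreachable that way. The helper name here is hypothetical:

```python
def upper_per_char(s: str) -> str:
    # Emulate the towupper() loop: a strictly 1-to-1 case mapping.
    # str.upper() on one character still applies full Unicode casing,
    # so characters whose uppercase form is longer are left unchanged.
    return "".join(c.upper() if len(c.upper()) == 1 else c for c in s)

print(upper_per_char("Eßer"))  # -> EßER  (ß has no 1-char uppercase)
print("Eßer".upper())          # -> ESSER (full-string case mapping)
```

The second call shows what a whole-string conversion, as ICU does it, produces.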

 Also, in the latest patch, I also added checks and logging for *every*
 status returned from ICU. I hope this will help debugging on debian,
 where the previous version didn't work. That excessive status checking
 will hardly be necessary once the stuff is better tested.

 I think the string copying and heap/palloc choices account for most of
 the code bloat, together with the excessive status checking and
 logging.

 OK, move that into some common functions and I think it will be better.
Best way for upper/lower/initcap is probably to use a function
pointer...  uhh...
Uh, I don't think so.  Just send pointers to the function and let
the function allocate the memory, and another function to free them, or
something like that.  I can probably do it if you want.
I'll check it out, it seems simple enough.
 We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash).  Does
 that help?
I'm aware of that. It might help for unicode, but there are a bunch of
other encodings. IANA has decided that utf-8 has *no* aliases, hence
only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
forgiving, I don't remember/know, but I think we need the mappings,
unfortunately.
OK.  I guess I am just confused why the native implementations are OK.
They're OK since they understand that UNICODE (or UTF8) is really utf-8. 
Problem is the strings used to describe them are not understood by ICU.

BTW, the pg_enc2iananame_tbl is only used *from* internal representation 
*to* IANA, not the other way around. Maybe that fact lowers the rate of 
confusion? ;-)

/Palle
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
Palle Girgensohn wrote:
  This is because in the standard postgres implementation, upper/lower is
  done one character at a time. A proper upper/lower cannot do it that
  way.  Another known example is in Turkish, where an ? (?) should look
  different whether it is an initial letter or not. This fails in
  standard postgresql for all platforms.
 
  Uh, where do you see that?  Our code has:
 
  workspace = texttowcs(string);
 
  for (i = 0; workspace[i] != 0; i++)
  workspace[i] = towupper(workspace[i]);
 
 as you see, the loop runs towupper for one character at a time. It cannot
 consider whether the letter is the initial, as required in Turkish, and it
 cannot really convert one character into two ('ß' -> 'SS')

Oh, OK. I thought texttowcs() would expand the string to allow such
conversions.

   We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash).  Does
   that help?
 
  I'm aware of that. It might help for unicode, but there are a bunch of
  other encodings. IANA has decided that utf-8 has *no* aliases, hence
  only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
  forgiving, I don't remember/know, but I think we need the mappings,
  unfortunately.
 
  OK.  I guess I am just confused why the native implementations are OK.
 
 They're OK since they understand that UNICODE (or UTF8) is really utf-8. 
 Problem is the strings used to describe them are not understood by ICU.
 
 BTW, the pg_enc2iananame_tbl is only used *from* internal representation 
 *to* IANA, not the other way around. Maybe that fact lowers the rate of 
 confusion? ;-)

OK, got it.  I am still a little confused why every native
implementation understands our existing names but ICU does not.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Bruce Momjian wrote:
 Palle Girgensohn wrote:
  
  --On lördag, maj 07, 2005 23.15.29 +1000 John Hansen 
  [EMAIL PROTECTED]
  wrote:
  
   Btw, I had been planning to propose replacing every single one of 
   the built in charset conversion functions with calls to ICU (thus 
   making pg _depend_ on ICU), as this would seem like a cleaner 
   solution than for us to maintain our own conversion tables.
  
   ICU also has a fair few conversions that we do not have 
 at present.
 
 That is a much larger issue, similar to our shipping our own 
 timezone database.  What does it buy us?
   
   o  Do we ship it in our tarball?
   o  Is the license compatible?
   o  Does it remove utils/mb conversions?
   o  Does it allow us to index LIKE (next high char)?
   o  Does it allow us to support multiple encodings in
  a single database easier?
   o  performance?
 
  I just had a similar thought. And why use ICU only for 
 multibyte charsets? 
  If I use LATIN1, I still expect upper('ß') = SS, and I 
 don't get it... 
  Same for the Turkish example.
 
 We assume the native toupper() can handle single-byte 
 character encodings.  We use towupper() only for wide character sets.

That assumption is wrong,...

Encoding latin1
Locale  de*

Select Upper('ß'); (ß, the lowercase German sharp s)
Should return SS, but returns ß

... John



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn

--On lördag, maj 07, 2005 10.06.43 -0400 Bruce Momjian 
pgman@candle.pha.pa.us wrote:

Palle Girgensohn wrote:
--On lördag, maj 07, 2005 23.15.29 +1000 John Hansen
[EMAIL PROTECTED]  wrote:
 Btw, I had been planning to propose replacing every single one of the
 built in charset conversion functions with calls to ICU (thus making pg
 _depend_ on ICU), as this would seem like a cleaner solution than for
 us to maintain our own conversion tables.

 ICU also has a fair few conversions that we do not have at present.
That is a much larger issue, similar to our shipping our own timezone
database.  What does it buy us?
o  Do we ship it in our tarball?
o  Is the license compatible?
It looks pretty similar to BSD, although I'm a novice on the subject.
	o  Does it remove utils/mb conversions?
Yes, it would probably be possible to remove pg's own conversions.
	o  Does it allow us to index LIKE (next high char)?
I believe so, using ICU's substring stuff.
o  Does it allow us to support multiple encodings in
   a single database easier?
Heh, the ultimate dream. Perhaps?
	o  performance?
ICU in itself is said to be much faster than for example glibc. Problem is 
the need for conversion via UTF-16, which requires extra memory allocations 
and cpu cycles. I don't use glibc, but my very simple performance tests for 
FreeBSD show that it is similar in speed.


I just had a similar thought. And why use ICU only for multibyte
charsets?  If I use LATIN1, I still expect upper('ß') = SS, and I don't
get it...  Same for the Turkish example.
We assume the native toupper() can handle single-byte character
encodings.  We use towupper() only for wide character sets.
True, problem is that native toupper/towupper run one char at a time. 
This is a bad design decision in POSIX; there is no way it can handle the 
examples above unless considering more than one character. ICU does just 
that.

/Palle
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] pgFoundry

2005-05-07 Thread Marc G. Fournier
On Sat, 7 May 2005, Greg Stark wrote:
But tracking the status of sub-projects is just not the kind of thing 
free software people do. They send emails when they have something to 
say.
in defence of Joshua's idea, there are some large projects within our 
development that would be nice to see some sort of 'time lines' for, but 
mainly those are the sorts of things that have set milestones to look for 
... I believe that Simon's work on the PITR stuff might fall under that, 
where he had various 'stages' he was working towards ...

But, I think that the # of sub projects that this would apply to is so 
few that setting anything up more formally than Simon sending out an 
updated patch would be more work than the derived benefit ...


Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Bruce Momjian
John Hansen wrote:
 Bruce Momjian wrote:
  
  There are two reasons for that optimization --- first, some 
  locale support is broken and Unicode encoding with a C locale 
  crashes (not an issue for ICU), and second, it is an 
  optimization for languages like Japanese that want to use 
  unicode, but don't need a locale because upper/lower means 
  nothing in those character sets.
 
 No, upper/lower means nothing in those languages, so why would you need
 to optimize upper/lower if they're not used??

True.  I suppose it is for databases that use both Japanese and Latin
alphabets and run upper() on all values.

 And if they are, it's obviously because the text contains characters
 from other languages (probably english) and as such they should behave
 correctly.
 
 Did I mention that for japanese and the like, ICU would also offer
 transliteration...

Interesting.

  So, the first issue doesn't apply for ICU, and the second 
  might not depending on what characters you are using in the 
  Unicode character set.
  
  I guess I am little confused how ICU can do upper() when the 
  locale is C.  What is it using to determine A is upper for a? 
   Am I confused?
 
 Simple, UNICODE basically consists of a table of characters
 (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)
 
 Excerpt:
 
 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
 ...
 0061;LATIN SMALL LETTER A;Ll;0;L;N;;;0041;;0041
 
 From this you can see that for 0041, which is capital letter A, there
 is a mapping to its lowercase counterpart, 0061.
 Likewise, there is a mapping for 0061 which says its uppercase
 counterpart is 0041.
 There is also SpecialCasing.txt which covers those mappings that haven't
 got a 1-1 mapping, such as the german SS.
 
 These mappings are fixed, independent of locale, only a few cases from
 specialcasing.txt depend on locale/context.
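The field layout John refers to can be sketched as follows (illustrative Python over the two sample records; in UnicodeData.txt, semicolon-separated field 12 is the simple uppercase mapping and field 13 the simple lowercase mapping):

```python
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]

def simple_case_maps(lines):
    """Build uppercase/lowercase tables from UnicodeData.txt records."""
    upper, lower = {}, {}
    for line in lines:
        fields = line.split(";")
        cp = int(fields[0], 16)
        if fields[12]:                  # simple uppercase mapping
            upper[cp] = int(fields[12], 16)
        if fields[13]:                  # simple lowercase mapping
            lower[cp] = int(fields[13], 16)
    return upper, lower

upper, lower = simple_case_maps(SAMPLE)
print(hex(upper[0x0061]))  # -> 0x41  ('a' uppercases to 'A')
print(hex(lower[0x0041]))  # -> 0x61  ('A' lowercases to 'a')
```

As noted above, the 1-to-many cases such as 'ß' live in SpecialCasing.txt instead, since this file can only express 1-to-1 mappings.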

As far as I know, the only way to use Unicode currently is to use a
locale that is unicode-aware.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tom Lane
John Hansen [EMAIL PROTECTED] writes:
 Where'd you get the licence from?

It was the first thing I came across in their docs:

http://icu.sourceforge.net/userguide/intro.html

Looking more closely, it may be that this license is only intended to
apply to the documentation and not the code ... though free code with
un-free documentation isn't real useful.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Andrew Sullivan
On Sat, May 07, 2005 at 02:52:57PM +1000, Neil Conway wrote:
 
 So would you have us disable all the non-essential builtin functions? 
 (Many of which have had security problems in the past.) What about the 
 builtin encoding conversions, non-btree indexes, or a myriad of features 
 that not all users need or use?

This is not really analogous, because those are already on (and in
most cases, not easily disabled).  What you're arguing for is to add
yet another on-by-default feature.  Given that there's already a way
to turn it on, why make it automatic?  Moreover, if some repackager
wants to make this more convenient, s/he can do so by turning it on
by default.  I don't see what's wrong with conservatism here.

 What makes sense for the default configuration of an operating system 
 (which by nature must be hardened against attack) does not necessarily 
 make sense for a database system.

Indeed.  But that doesn't mean that the principle isn't sound for
both cases.  I haven't seen an argument against that yet.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The plural of anecdote is not data.
--Roger Brinner

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] FC3 broken with HEAD

2005-05-07 Thread Alvaro Herrera
On Sat, May 07, 2005 at 04:27:04PM +1000, Neil Conway wrote:
 Tom Lane wrote:
 Yeah, we will.  Please file a bugzilla entry for this though --- I
 concur that it is a linker bug.
 
 Okay, patch reverted. The RH bug is here:
 
 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=157126

Huh, RH's bugzilla is really slow.  Are they using the
PostgreSQL-powered bugzilla?

-- 
Alvaro Herrera ([EMAIL PROTECTED])
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[HACKERS] Race conditions, race conditions!

2005-05-07 Thread Tom Lane
I promised an analysis of the problems Jan Wieck uncovered yesterday,
so here it is.

Jan had developed a testbed (which I hope he will post) that essentially
has a bunch of client threads all doing instances of the same
transaction in READ COMMITTED mode.  The intended database invariant is
that the number of rows in table t2 with a particular ID value is equal
to the cnt field of the single row in t1 with that ID value:

begin;
select cnt from t1 where id = K for update;
delete from t2 where id = K;
-- check returned rowcount to see that exactly CNT rows were deleted
insert into t2 values (K);  -- repeat this N times
update t1 set cnt = N where id = K;
commit;

K is randomly chosen for each execution from the known set of t1 keys,
and N is randomly chosen each time as a small positive integer.  This
should maintain the invariant, since at any time only one transaction
can be holding the SELECT FOR UPDATE lock on a particular t1 row.

The trick is to run this under astronomical load; 100 or so client
threads on a garden-variety PC is about right.

What Jan was seeing was that every so often, a client thread would
error out, reporting that its DELETE had deleted zero rows rather
than the expected number.  But subsequent examination showed the t2
rows as being there.

Investigation showed that the connected backend had properly acquired
FOR UPDATE lock on the t1 row, but the snapshot it was using for the
subsequent DELETE showed the inserter of the t1 row as still running.
This should be impossible, since the SELECT FOR UPDATE cannot lock an
uncommitted row, and in READ COMMITTED mode we certainly will take a
new snapshot for the DELETE.

However, there is a race condition here.  During transaction commit,
xact.c first marks the transaction committed in pg_clog, and then
clears its XID from PGPROC.  This means there is a narrow window
in which both TransactionIdDidCommit and TransactionIdIsInProgress
will return true.  (We cannot do it the other way around, because
if neither one is returning true, onlookers are entitled to assume
that the transaction has crashed.)  However, the tqual.c routines
will allow a row to be seen as committed as soon as
TransactionIdDidCommit(xmin) returns true.  So the scenario is:

1. Backend A does RecordTransactionCommit to mark itself committed
in pg_clog, but then loses the CPU in the narrow window between
doing that and clearing its PGPROC entry.  Because of the ridiculous
load, it doesn't get control back for awhile.

2. Backend B comes along to run the test transaction for the same K
value.  It inspects the t1 row, concludes it's committed, marks the
row as locked FOR UPDATE, and returns the results to the client.

3. The client now issues the DELETE command.  B takes a new snapshot,
but because A is still not cleared out of PGPROC, A's transaction
is shown as still running in the snapshot.

4. Now the DELETE will delete no rows, because it doesn't consider the
t2 rows it should delete to be committed.


AFAICS this race condition has always been there; certainly at
least since Vadim put in MVCC, and it looks likely that the
original Berkeley code had a form of the problem.

The correct fix is that the tqual.c routines should check commit status
in the following way:

if (TransactionIdIsInProgress(xid))
   // still in progress, don't touch tuple
else if (TransactionIdDidCommit(xid))
   // committed, mark accordingly
else
   // must be aborted or crashed, mark accordingly

rather than what they have traditionally done:

if (TransactionIdDidCommit(xid))
   // committed, mark accordingly
else if (TransactionIdDidAbort(xid))
   // aborted, mark accordingly
else
   // assume still in progress, don't touch tuple

Vadim recognized that the former logic was necessary for VACUUM to use
in deciding if it could clean up dead tuples, but apparently he never
made the extrapolation that it should be used *everywhere* transaction
status is tested.
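A toy model of the window (illustrative Python; the names are hypothetical stand-ins for pg_clog and PGPROC, not PostgreSQL code) shows why the order of the two checks matters:

```python
# Commit marks pg_clog first and clears PGPROC second, so there is a
# window in which a transaction looks both committed and in progress.
clog_committed = set()
proc_running = set()

def begin(xid):        proc_running.add(xid)
def commit_step1(xid): clog_committed.add(xid)     # RecordTransactionCommit
def commit_step2(xid): proc_running.discard(xid)   # clear PGPROC entry

def did_commit(xid):   return xid in clog_committed
def in_progress(xid):  return xid in proc_running

begin(42)
commit_step1(42)   # ...and the backend loses the CPU right here

# Traditional tqual.c order: "committed" wins, even though a snapshot
# taken now would still show xid 42 as running.
old_order = "committed" if did_commit(42) else "in progress"

# Corrected order: test in-progress *first*, then committed.
new_order = "in progress" if in_progress(42) else "committed"

print(old_order, "/", new_order)  # -> committed / in progress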


The other interesting thing we saw was an Assertion failure.  The
postmaster log showed

WARNING:  relation t1 page 196 is uninitialized --- fixing
TRAP: FailedAssertion("!(((PageHeader) pageHeader)->pd_upper == 0)", File: 
"hio.c", Line: 263)
LOG:  server process (PID 11296) was terminated by signal 6

The WARNING could only have come from VACUUM.  (Jan's testbed does
launch a VACUUM every so often.)  The Assert failure is in
RelationGetBufferForTuple where it is adding a new page to a table.

I interpret this as the guy doing RelationGetBufferForTuple added a
page, but before he could initialize it, he lost control for long enough
for a VACUUM to scan through the entire table, see the zeroed page, and
fix it.  Then when the first guy got control again, his Assert saying
the page was zeroes failed.  The window for this exists because bufmgr/smgr
do physically extend the file, 

Re: [HACKERS] FC3 broken with HEAD

2005-05-07 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Huh, RH's bugzilla is really slow.  Are they using the
 PostgreSQL-powered bugzilla?

They are, but I think it's a fairly old version (7.3 IIRC).
I'm still trying to get them to move some internal systems
off 7.1 :-(

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] to_char(interval) issues

2005-05-07 Thread Josh Berkus
Bruce,

 * Prevent to_char() on interval from returning meaningless values

 * Allow to_char() on interval values to accumulate the highest unit
   requested

Sounds like it would cover my use cases.   Others?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 Users can already define SQL functions by default, which certainly 
 provides programmable features. I'm not quite sure what you mean by 
 user exits.

 I guess I'm missing how pl/pgsql is a fundamentally greater security risk.

plpgsql has control structures (loops, IF); SQL functions don't.
That makes for a fundamental difference in the power of the programming
language ... at least according to CS theory as I was taught it.  Now
admittedly the primitive statements of SQL are a lot more powerful
than the primitive statements usually considered in programming language
theory, but I think there is still a pretty significant difference in
capability.

An example of why this could be interesting from a security standpoint
is that, given access to pg_shadow, it'd be pretty trivial to write
a plpgsql function that tries to break user passwords by brute force
(just generate possible passwords, hash them, and look for a match).
I don't see any way to do that in plain SQL, at least not without a
pre-existing SRF to generate the trial passwords for you.
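For illustration, the shape of the attack Tom describes looks roughly like this outside the database (assuming the usual 'md5' || md5(password || username) form of MD5 passwords in pg_shadow; all names and values here are hypothetical):

```python
import hashlib

def pg_md5(password: str, username: str) -> str:
    # PostgreSQL-style MD5 password entry: 'md5' + md5(password + username)
    return "md5" + hashlib.md5((password + username).encode()).hexdigest()

stored = pg_md5("secret", "alice")     # as if read from pg_shadow

def brute_force(stored: str, username: str, candidates):
    # The loop is the point: trivial in plpgsql, awkward in plain SQL.
    for guess in candidates:
        if pg_md5(guess, username) == stored:
            return guess
    return None

print(brute_force(stored, "alice", ["123456", "hunter2", "secret"]))
# -> secret
```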

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Palle Girgensohn
--On lördag, maj 07, 2005 10.58.09 -0400 Tom Lane [EMAIL PROTECTED] wrote:
John Hansen [EMAIL PROTECTED] writes:
Where'd you get the licence from?
It was the first thing I came across in their docs:
http://icu.sourceforge.net/userguide/intro.html
Looking more closely, it may be that this license is only intended to
apply to the documentation and not the code ... though free code with
un-free documentation isn't real useful.
			regards, tom lane
Someone should ask the ICU folks about that...
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] Patch for database locale settings

2005-05-07 Thread Bruce Momjian

Description added to TODO:

* Allow locale to be set at database creation

  Currently locale can only be set during initdb.  No global tables have
  locale-aware columns.  However, the database template used during
  database creation might have locale-aware indexes.  The indexes would
  need to be reindexed to match the new locale.

---

Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Maybe it would work if we forced indexes on shared relations to be
  scanned using a fixed collation.
 
 The shared relations only have indexes on name, oid, and integer:
 select distinct atttypid::regtype
 from pg_class c join pg_attribute a on c.oid = a.attrelid
 where relisshared and relkind = 'i';
 and name has non-locale-sensitive ordering rules anyway.  So that's
 not the big problem; we could probably get away with decreeing that
 name will always be that way and that shared relations can't have
 locale-dependent indexes.
 
 The big problem (and the reason why this idea has been shot down in
 the past) is that CREATE DATABASE can't change the locale from what it
 is in the template database unless it's prepared to reindex any locale-
 sensitive indexes in the non-shared part of the template database.
 Which would be a difficult undertaking seeing that we can't even connect
 to the copied database until after commit.
 
 We could maybe say that we will never have any locale-dependent indexes
 at all on any system catalog, but what of user-defined tables in
 template databases?  It would simply not work to do something as simple
 as creating a table with an indexed text column in template1.
 
 On the other hand you could argue that people already run the same kind
 of risk when changing database encoding at CREATE, which is a feature
 that's been there a long time and hasn't garnered many complaints.
 Not so much that their indexes will break as that their data will.
 So perhaps we should be willing to document don't do that.  Certainly
 it would be a lot more useful if both locale and encoding could be set
 per-database.
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 5: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faq
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] Cleaning up unreferenced table files

2005-05-07 Thread Heikki Linnakangas
Maybe we should take a different approach to the problem:
1. Create new file with an extension to mark that it's not
   yet committed (eg. 1234.notcommitted)
2. ...
3. Take CheckpointStartLock
4. Write commit record to WAL, with list of created files.
5. Rename created file (1234.notcommitted -> 1234).
6. Release CheckpointStartLock
This would guarantee that after successful WAL replay, all files in the 
data directory with .notcommitted extension can be safely deleted. No need 
to read pg_database or pg_class.

We would take a performance hit because of the additional rename and fsync
step. Also, we must somehow make sure that the new file or the directory 
it's in is fsynced on checkpoint to make sure that the rename is flushed 
to disk.

A variant of the scheme would be to create two files in step 1. One would 
be the actual relfile (1234) and the other would be an empty marker file
(1234.notcommitted). That way the smgr code wouldn't have to care if the 
file is new or not when opening it.
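A minimal sketch of the marker-file life cycle (illustrative Python with hypothetical helper names; real code would also fsync the directory and write the WAL commit record between the two steps):

```python
import os
import tempfile

datadir = tempfile.mkdtemp()

def create_rel(relfilenode: int) -> str:
    # Step 1: create the file already carrying the not-yet-committed mark.
    path = os.path.join(datadir, f"{relfilenode}.notcommitted")
    open(path, "wb").close()
    return path

def commit_rel(path: str) -> str:
    # Steps 4-5: after the commit record is safely in the WAL,
    # drop the marker by renaming.
    final = path[:-len(".notcommitted")]
    os.rename(path, final)
    return final

final = commit_rel(create_rel(1234))
print(os.path.basename(final))       # -> 1234

# Crash recovery: anything still carrying the marker after WAL replay
# was never committed and can be unlinked without consulting pg_class.
create_rel(5678)
stale = [f for f in os.listdir(datadir) if f.endswith(".notcommitted")]
print(stale)                         # -> ['5678.notcommitted']
```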

- Heikki
On Thu, 5 May 2005, Tom Lane wrote:
Bruce Momjian pgman@candle.pha.pa.us writes:
Applied.
Now that I've had a chance to look at it, this patch is thoroughly
broken.  Problems observed in a quick review:
1. It doesn't work at all for non-default tablespaces: it will
claim that every file in such a tablespace is stale.  The fact
that it does that rather than failing entirely is accidental.
It tries to read the database's pg_class in the target tablespace
whether it's there or not.  Because the system is still in recovery
mode, the low-level routines allow the access to the nonexistent
pg_class table to pass --- in fact they think they should create
the file, so after it runs there's a bogus empty 1259 file in each
such tablespace (which of course it complains about, too).  The code
then proceeds to think that pg_class is empty so of course everything
draws a warning.
2. It's not robust against stale subdirectories of a tablespace
(ie, subdirs corresponding to a nonexistent database) --- again,
it'll try to read a nonexistent pg_class.  Then it'll produce a
bunch of off-target complaint messages.
3. It's assuming that relfilenode is unique database-wide, when no
such assumption is safe.  We only have a guarantee that it's unique
tablespace-wide.
4. It fails to examine table segment files (such as nnn.1).  These
should be complained of when the nnn doesn't match any hash entry.
5. It will load every relfilenode value in pg_class into the hashtable
whether it's meaningful or not.  There should be a check on relkind.
6. I don't think relying on strtol to decide if a filename is entirely
numeric is very safe.  Note all the extra defenses in pg_atoi against
various platform-specific misbehaviors of strtol.  Personally I'd use a
strspn test instead.
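The difference between the two tests can be sketched in Python (illustrative; int() stands in for strtol(), which likewise tolerates leading whitespace and an optional sign):

```python
def is_numeric_strtol_style(name: str) -> bool:
    # Parse-and-see, like strtol(): accepts more than bare digits.
    try:
        int(name)
        return True
    except ValueError:
        return False

def is_numeric_strspn_style(name: str) -> bool:
    # strspn(name, "0123456789") == strlen(name): digits only.
    return name != "" and all(c in "0123456789" for c in name)

print(is_numeric_strtol_style(" 1259"))   # -> True  (wrongly accepted)
print(is_numeric_strtol_style("+1259"))   # -> True  (wrongly accepted)
print(is_numeric_strspn_style(" 1259"))   # -> False
print(is_numeric_strspn_style("1259"))    # -> True
```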
7. There are no checks for readdir failure (compare any other readdir
loop in the backend).
See also Simon Riggs' complaints that the circumstances under which it's
done are pretty randomly selected.  (One particular thing that I think
is a bad idea is to do this in a standalone backend.  Any sort of
corruption in any db's pg_class would render it impossible to start up.)
To fix the first three problems, and also avoid the performance problem
of multiply rescanning a database's pg_class for each of its
tablespaces, I would suggest that the hashtable entries be widened to
RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid).  Then
there should be one iteration over pg_database to learn the OIDs and
default tablespaces of each database; with that you can read pg_class
from its correct location for each database and load all the entries
into the hashtable.  Then you iterate through the tablespaces looking
for stuff not present in the hashtable.  You might also want to build a
list or hashtable of known database OIDs, so that you can recognize a
stale subdirectory immediately and issue a direct complaint about it
without even recursing into it.
regards, tom lane
- Heikki
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] FC3 broken with HEAD

2005-05-07 Thread Greg Stark

I think this has more to do with poor architectural decisions made in
bugzilla. Things like storing the entire ticket history in one big record
instead of normalizing it.

This means you can't do anything to a ticket without munging through 10s of k
of data. This also creates problems when multiple people are trying to update
a single ticket at a time.

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Andrew - Supernews
On 2005-05-07, Tom Lane [EMAIL PROTECTED] wrote:
 Neil Conway [EMAIL PROTECTED] writes:
 Users can already define SQL functions by default, which certainly 
 provides programmable features. I'm not quite sure what you mean by 
 user exits.

 I guess I'm missing how pl/pgsql is a fundamentally greater security risk.

 plpgsql has control structures (loops, IF); SQL functions don't.
 That makes for a fundamental difference in the power of the programming
 language ... at least according to CS theory as I was taught it.

SQL functions do have control structures: CASE WHEN and recursion. I have
even implemented generate_series() in pure SQL.

 Now
 admittedly the primitive statements of SQL are a lot more powerful
 than the primitive statements usually considered in programming language
 theory, but I think there is still a pretty significant difference in
 capability.

The only thing that makes pg's SQL functions not turing-complete is the
fact that recursion depth is not unlimited. In practice this isn't much
of a restriction, since you can do large-scale iterations by using SQL
sets and joins rather than recursion.

 An example of why this could be interesting from a security standpoint
 is that, given access to pg_shadow, it'd be pretty trivial to write
 a plpgsql function that tries to break user passwords by brute force
 (just generate possible passwords, hash them, and look for a match).
 I don't see any way to do that in plain SQL, at least not without a
 pre-existing SRF to generate the trial passwords for you.

Writing a pure SQL SRF that generates a range of trial passwords is in
fact not hard at all. The key point to note is that you can generate
sets of literal values of moderate size using UNION ALL, and you can
then cross-join those sets against themselves multiple times to generate
much larger and more complex sets.
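A minimal illustration of that technique (names invented here, alphabet truncated to three letters):

```sql
-- Build a small set of letters with UNION ALL, then cross-join it
-- against itself to enumerate all two-character candidate strings.
CREATE VIEW letters AS
  SELECT 'a' AS c UNION ALL SELECT 'b' UNION ALL SELECT 'c';  -- extend to 'z'

SELECT l1.c || l2.c AS candidate
FROM letters l1, letters l2;   -- 3 letters -> 9 rows; 26 letters -> 676
```

Each additional self-join multiplies the output size, so quite long candidate strings are reachable without any loop construct.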

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services



Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Josh Berkus
People:

Before we get into more minutia regarding potential security risk of plpgsql, 
are there any reasons *other* than security to not enable it?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Race conditions, race conditions!

2005-05-07 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 The trick is to run this under astronomical load; 100 or so client
 threads on a garden-variety PC is about right.

I wonder if there's an argument for building assertion-enabled builds with
code that randomly yields the processor some percentage of time before and
after taking a lock. It wouldn't catch every case but it might help.

-- 
greg




Re: [HACKERS] pgFoundry

2005-05-07 Thread Josh Berkus
Greg,

 I'm rather surprised Postgres doesn't have a good bug tracking system.
 That's something most projects find pretty essential. Strangely enough the
 reason seems to be that Postgres really doesn't have many bugs... Unlike
 web browsers or GUIs or most of the other free software projects out there,
 databases don't tolerate bugs well. Any serious bug is cause for an
 immediate point release. The only use for a bug tracking system would
 really be for tracking all those pesky IWBNI bugs that never rise to
 urgent status.

Actually, a bug tracker would be useful for two purposes:

1) Tracking bugs that are actually feature requests, in an effort (possibly 
futile) to cut down on the number of "when will PostgreSQL be able to use an 
index on MAX()?" requests that we get.   A certain amount of this is in our 
FAQ, but the FAQ has the flaws of both not being easily searchable, and 
having a very manual update process so that it frequently gets out of date.

2) Tracking bugs that were fixed (and features that were added) in particular 
releases so that users know when they need to upgrade.   For example, if a 
user had an index corruption problem with 7.4.1, it would be useful for them 
to know that an upgrade to 7.4.5 (as I recall) would fix it.  Currently, 
they'd have to read all of the release notes from 7.4.2 through 7.4.5 and 
decipher insider terminology to figure it out -- not always easy to do, and 
even harder to convince your boss.

The problem is that a bug tracker would not be useful to the Postgresql 
*developers*; our current system works pretty well for us now.  Except for 
one possibility:

IF we had a formal bugtracker, are there people who are not currently 
contributing to PostgreSQL who would be willing/able to read, test, and 
analyse bug reports?   With the addition of several companies to our 
community, it's a possibility, and would make the trouble of using a bug 
tracker worthwhile, I think.

Comments?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] pgFoundry

2005-05-07 Thread Joshua D. Drake

What does it mean to track the status of something? How would the status
change except by discussion? What would be the point of announcing the status
of something without allowing people to comment?
No one said anything about not letting people comment or discuss. What I 
am suggesting is a better public presentation of what the heck is going 
on with PostgreSQL development.

I think you have a severely flawed idea of how free software development
proceeds.
Then you obviously aren't paying attention. Look at other major OSS 
projects. They have these things in place. Even the Linux kernel has a 
bugzilla (although I am not advocating bugzilla). Not to mention KDE, 
Gnome, Debian..

These projects also have reasonably defined milestones for particular 
releases and show status of those milestones during the release.

 What you're describing sounds like something a manager of a
commercial project would want. Perhaps it's something the managers of the
people working on Postgres on behalf of some corporate sponsors might want but
in those cases I doubt they would want the information to be public anyways.
What I am describing is what other large OSS projects already do.
In the free software world there's no top-down management of the project with
managers issuing direction and expecting feedback reports.
No but there are people in charge of particular tasks. There are people 
only working on certain things. Like the work that the people did on PITR.

 People only want
tools that make their lives easier. Not tools that make other people's lives
easier at the expense of their own convenience. The programmers are not
beholden to any corporate interests (other than their own sponsors, who
presumably are getting all the feedback they're looking for privately).
I am not suggesting that anybody be beholden to anybody except maybe the 
community itself.

Sincerely,
Joshua D. Drake



Re: [HACKERS] [PATCHES] Cleaning up unreferenced table files

2005-05-07 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Maybe we should take a different approach to the problem:
 1. Create new file with an extension to mark that it's not
 yet committed (eg. 1234.notcommitted)

This is pushing the problem into the wrong place, viz the lowest-level
file access routines, which will now all have to know about
.notcommitted status.  It also creates race conditions --- think about
backend A trying to commit file 1234 at about the same time that
backend B is trying to flush some dirty buffers belonging to that file.
But most importantly, it doesn't handle the file-deletion case.

regards, tom lane



Re: [HACKERS] pgFoundry

2005-05-07 Thread Robert Treat
On Saturday 07 May 2005 15:31, Josh Berkus wrote:
 2) Tracking bugs that were fixed (and features that were added) in
 particular releases so that users know when they need to upgrade.  

one idea that has been quasi floated before would be something equivalent to 
Simon Riggs developer summaries and/or Bruce's release history, that could be 
updated as appropriate for each development cycle. If someone were willing to 
maintain it, we could put it on the website along with the TODO list. 

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tom Lane wrote:
 John Hansen [EMAIL PROTECTED] writes:
  Where'd you get the licence from?
 
 It was the first thing I came across in their docs:
 
 http://icu.sourceforge.net/userguide/intro.html
 
 Looking more closely, it may be that this license is only 
 intended to apply to the documentation and not the code ... 
 though free code with un-free documentation isn't real useful.
 

Right, it seems to apply only to the resources found on sourceforge.

   regards, tom lane
 
 



Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 Before we get into more minutia regarding potential security risk of plpgsql,
 are there any reasons *other* than security to not enable it?

Several potential issues have already been mentioned in this thread,
eg, what about shared library dependency vs linking plpgsql statically
into the backend, how will we deal with existing dump files, how will
it be possible for a DBA to disable the feature if he wishes.

But security is the easiest to handwave about with no facts ;-) ...
so if it gets shot down on that basis then no one need do the work
to figure out solutions to the other questions.

regards, tom lane



Re: [HACKERS] Race conditions, race conditions!

2005-05-07 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 I wonder if there's an argument for building assertion-enabled builds with
 code that randomly yields the processor some percentage of time before and
 after taking a lock. It wouldn't catch every case but it might help.

Seems like that would mainly help you find cases where you'd put a lock
acquire or release a bit too late or too soon in a sequence of events;
not cases where you'd failed to acquire a needed lock at all.  It'd be
more useful I think to have a facility that randomly stops backends for
awhile regardless of exactly where they are in the code.

A high-load test case actually does this to some extent, but the problem
is you have little reproducibility and no assurance that execution
stopped for long enough to let critical events happen elsewhere.  The
ideal facility I think would slow one backend much more than others,
whereas high load still leaves them all making progress at about the
same rate ...

regards, tom lane



Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Neil Conway
Andrew Sullivan wrote:
This is not really analogous, because those are already on
Which is my point: you're suggesting we retrofit a security policy onto 
PG that does not apply to the vast majority of the base system -- and 
that if applied would require fundamental changes.

Indeed.  But that doesn't mean that the principle isn't sound for
both cases.  I haven't seen an argument against that yet.
Security (in the limited sense of disabling features by default) is 
not free; there is a tradeoff between security and convenience, security 
and administrative simplicity, and so on. Given that I have yet to see a 
single substantive argument for pl/pgsql being a security risk that has 
withstood any scrutiny, I don't see that the security side of the 
tradeoff has a lot of merit.

-Neil


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tom Lane
John Hansen [EMAIL PROTECTED] writes:
 Btw, I had been planning to propose replacing every single one of the
 built in charset conversion functions with calls to ICU (thus making
 pg _depend_ on ICU),

I find that fairly unacceptable ... especially given the licensing
questions, but in any case.

It might be OK to rip out the existing conversion support and say
that *if* you want encoding conversion, you have to use ICU.  But
I don't want to be told you cannot build PG without ICU period.

The 3.2 vs 2.8 business is disturbing also; specifically, I don't
think we get to require 3.2 on a platform where 2.8 is installed.
People just aren't going to hold still for that, even assuming
that ICU supports installing both versions at once, which isn't
clear to me at the moment ...

regards, tom lane



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tatsuo Ishii
 Bruce Momjian wrote:
  
   There are two reasons for that optimization --- first, some 
   locale support is broken and Unicode encoding with a C locale 
   crashes (not an issue for ICU), and second, it is an 
   optimization for languages like Japanese that want to use 
   unicode, but don't need a locale because upper/lower means 
   nothing in those character sets.
  
  No, upper/lower means nothing in those languages, so why would you need
  to optimize upper/lower if they're not used??
  And if they are, it's obviously because the text contains characters
  from other languages (probably english) and as such they should behave
  correctly.

Yes, Japanese (and probably Chinese and Korean) languages include
ASCII characters. More precisely, ASCII is part of the Japanese
encodings (LATIN1 is not, however). And we have no problem at all with
glibc/C locale. See below ("unitest" is an UNICODE database).

unitest=# create table t1(t text);
CREATE TABLE
unitest=# \encoding EUC_JP
unitest=# insert into t1 values('abcあいう');
INSERT 1842628 1
unitest=# select upper(t) from t1;
    upper    
-------------
 ABCあいう
(1 row)

So Japanese (including ASCII)/UNICODE behavior is perfectly correct at
this moment. So I strongly object to removing that optimization.
--
Tatsuo Ishii

Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tatsuo Ishii
 Palle Girgensohn wrote:
  
  --On lördag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] 
  wrote:
  
   Btw, I had been planning to propose replacing every single one of the
   built in charset conversion functions with calls to ICU (thus making pg
   _depend_ on ICU), as this would seem like a cleaner solution than for us
   to maintain our own conversion tables.

I don't buy it. If the current conversion tables do the right thing, why
do we need to replace them? Or if the conversion tables are not correct,
why not just fix them? I think the rules of character conversion will not
change frequently, especially for the LATIN languages, so the maintenance
cost is not too high.
--
Tatsuo Ishii

   ICU also has a fair few conversions that we do not have at present.
 
 That is a much larger issue, similar to our shipping our own timezone
 database.  What does it buy us?
   
   o  Do we ship it in our tarball?
   o  Is the license compatible?
   o  Does it remove utils/mb conversions?
   o  Does it allow us to index LIKE (next high char)?
   o  Does it allow us to support multiple encodings in
  a single database easier?
   o  performance?



Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Mike Mascari
Neil Conway wrote:
Andrew Sullivan wrote:
This is not really analogous, because those are already on
Security (in the limited sense of disabling features by default) is 
not free; there is a tradeoff between security and convenience, security 
and administrative simplicity, and so on. Given that I have yet to see a 
single substantive argument for pl/pgsql being a security risk that has 
withstood any scrutiny, I don't see that the security side of the 
tradeoff has a lot of merit.
People who use views to achieve row security, which is a rather common 
paradigm, cannot allow users to create functions with side effects.

Mike Mascari


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Neil Conway
Mike Mascari wrote:
People who use views to achieve row security, which is a rather common 
paradigm, cannot allow users to create functions with side effects.
Can you elaborate? I'm not sure I follow you.
(I'll note anyway that (1) SQL functions can have side effects: CREATE 
FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example (2) 
Administrators can always choose to drop pl/pgsql for a particular 
database, disable it at initdb time, or REVOKE usage of pl/pgsql for 
particular users.)
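For reference, the administrative knobs Neil mentions would look roughly like this (the user name is hypothetical):

```sql
-- Per-database removal of the language:
DROP LANGUAGE plpgsql;

-- Or deny a specific role the right to create pl/pgsql functions:
REVOKE USAGE ON LANGUAGE plpgsql FROM some_user;
```

Either control is reversible (CREATE LANGUAGE / GRANT), so enabling pl/pgsql by default would not take anything away from a DBA who wants it off.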

-Neil


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Mike Mascari
Neil Conway wrote:
Mike Mascari wrote:
People who use views to achieve row security, which is a rather common 
paradigm, cannot allow users to create functions with side effects.

Can you elaborate? I'm not sure I follow you.
(I'll note anyway that (1) SQL functions can have side effects: CREATE 
FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example 
Wow. That's a problem IMHO. Many people use views to allow userA to 
query his salary without seeing userB's. If userA can just:

1. Create a SQL function which inserts into another table its arguments
2. Issue a query like:
SELECT *
FROM view_of_salaries_based_on_current_user
WHERE my_side_effect_function_that_inserts_into_a_temp_table(salary, 
employee);

and that function is writable with just 'SQL', then many, many people 
have a serious security risk on their hands. Perhaps this is why 
Oracle's standard UDFs cannot perform inserts, updates, or deletes.
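Spelled out, the attack Mike sketches might look like the following (all table and function names are hypothetical, and whether the function actually fires before the view's filter is applied is exactly what is in dispute):

```sql
-- A plain-SQL function with a side effect: it records its arguments
-- in a table the attacker can read, then returns true so it can be
-- used as a WHERE clause.
CREATE TABLE snooped (salary numeric, employee text);

CREATE FUNCTION leak(numeric, text) RETURNS boolean AS '
  INSERT INTO snooped VALUES ($1, $2);
  SELECT true;
' LANGUAGE sql;

SELECT *
FROM view_of_salaries_based_on_current_user
WHERE leak(salary, employee);
```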

Mike Mascari


Re: [HACKERS] Race conditions, race conditions!

2005-05-07 Thread Tatsuo Ishii
Are we going to put the fixes into 8.0.3 and so on? Or will it be
included in 8.0.4?
--
Tatsuo Ishii

 I promised an analysis of the problems Jan Wieck uncovered yesterday,
 so here it is.
 
 Jan had developed a testbed (which I hope he will post) that essentially
 has a bunch of client threads all doing instances of the same
 transaction in READ COMMITTED mode.  The intended database invariant is
 that the number of rows in table t2 with a particular ID value is equal
 to the cnt field of the single row in t1 with that ID value:
 
   begin;
   select cnt from t1 where id = K for update;
   delete from t2 where id = K;
   -- check returned rowcount to see that exactly CNT rows were deleted
   insert into t2 values (K);  -- repeat this N times
   update t1 set cnt = N where id = K;
   commit;
 
 K is randomly chosen for each execution from the known set of t1 keys,
 and N is randomly chosen each time as a small positive integer.  This
 should maintain the invariant, since at any time only one transaction
 can be holding the SELECT FOR UPDATE lock on a particular t1 row.
 
 The trick is to run this under astronomical load; 100 or so client
 threads on a garden-variety PC is about right.
 
 What Jan was seeing was that every so often, a client thread would
 error out, reporting that its DELETE had deleted zero rows rather
 than the expected number.  But subsequent examination showed the t2
 rows as being there.
 
 Investigation showed that the connected backend had properly acquired
 FOR UPDATE lock on the t1 row, but the snapshot it was using for the
 subsequent DELETE showed the inserter of the t1 row as still running.
 This should be impossible, since the SELECT FOR UPDATE cannot lock an
 uncommitted row, and in READ COMMITTED mode we certainly will take a
 new snapshot for the DELETE.
 
 However, there is a race condition here.  During transaction commit,
 xact.c first marks the transaction committed in pg_clog, and then
 clears its XID from PGPROC.  This means there is a narrow window
 in which both TransactionIdDidCommit and TransactionIdIsInProgress
 will return true.  (We cannot do it the other way around, because
 if neither one is returning true, onlookers are entitled to assume
 that the transaction has crashed.)  However, the tqual.c routines
 will allow a row to be seen as committed as soon as
 TransactionIdDidCommit(xmin) returns true.  So the scenario is:
 
 1. Backend A does RecordTransactionCommit to mark itself committed
 in pg_clog, but then loses the CPU in the narrow window between
 doing that and clearing its PGPROC entry.  Because of the ridiculous
 load, it doesn't get control back for awhile.
 
 2. Backend B comes along to run the test transaction for the same K
 value.  It inspects the t1 row, concludes it's committed, marks the
 row as locked FOR UPDATE, and returns the results to the client.
 
 3. The client now issues the DELETE command.  B takes a new snapshot,
 but because A is still not cleared out of PGPROC, A's transaction
 is shown as still running in the snapshot.
 
 4. Now the DELETE will delete no rows, because it doesn't consider the
 t2 rows it should delete to be committed.
 
 
 AFAICS this race condition has always been there; certainly at
 least since Vadim put in MVCC, and it looks likely that the
 original Berkeley code had a form of the problem.
 
 The correct fix is that the tqual.c routines should check commit status
 in the following way:
 
   if (TransactionIdIsInProgress(xid))
  // still in progress, don't touch tuple
   else if (TransactionIdDidCommit(xid))
  // committed, mark accordingly
   else
  // must be aborted or crashed, mark accordingly
 
 rather than what they have traditionally done:
 
   if (TransactionIdDidCommit(xid))
  // committed, mark accordingly
   else if (TransactionIdDidAbort(xid))
  // aborted, mark accordingly
   else
  // assume still in progress, don't touch tuple
 
 Vadim recognized that the former logic was necessary for VACUUM to use
 in deciding if it could clean up dead tuples, but apparently he never
 made the extrapolation that it should be used *everywhere* transaction
 status is tested.
 
 
 The other interesting thing we saw was an Assertion failure.  The
 postmaster log showed
 
 WARNING:  relation t1 page 196 is uninitialized --- fixing
 TRAP: FailedAssertion("!(((PageHeader) pageHeader)->pd_upper == 0)", File: "hio.c", Line: 263)
 LOG:  server process (PID 11296) was terminated by signal 6
 
 The WARNING could only have come from VACUUM.  (Jan's testbed does
 launch a VACUUM every so often.)  The Assert failure is in
 RelationGetBufferForTuple where it is adding a new page to a table.
 
 I interpret this as the guy doing RelationGetBufferForTuple added a
 page, but before he could initialize it, he lost control for long enough
 for a VACUUM to scan through the entire table, see the zeroed page, and
 fix 

Re: [HACKERS] Race conditions, race conditions!

2005-05-07 Thread Bruce Momjian
Tatsuo Ishii wrote:
 Are we going to put the fixes into 8.0.3 and so on? Or will it be
 included in 8.0.4?

We have removed 8.0.3 from the FTP servers and plan to re-release 8.0.3
and previous 7.X releases.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Andrew Dunstan

Mike Mascari wrote:
Neil Conway wrote:
Mike Mascari wrote:
People who use views to achieve row security, which is a rather 
common paradigm, cannot allow users to create functions with side 
effects.

Can you elaborate? I'm not sure I follow you.
(I'll note anyway that (1) SQL functions can have side effects: 
CREATE FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example 

Wow. That's a problem IMHO. Many people use views to allow userA to 
query his salary without seeing userB's. If userA can just:

1. Create a SQL function which inserts into another table its arguments
2. Issue a query like:
SELECT *
FROM view_of_salaries_based_on_current_user
WHERE my_side_effect_function_that_inserts_into_a_temp_table(salary, 
employee);

and that function is writable with just 'SQL', then many, many people 
have a serious security risk on their hands. Perhaps this is why 
Oracle's standard UDFs cannot perform inserts, updates, or deletes.


but the side effect function will only run (unless you set it with 
security definer) with the privileges of the caller - it won't grant 
visibility to things the user can't otherwise see.

In any case, you should define your security setup with the 
capabilities / limitations of the db engine in mind. If there is any 
security problem in your scenario, it is that you appear to have made 
unwarranted assumptions about how postgres works, rather than that 
postgres has a problem.

Either way, this does not illustrate how enabling plpgsql by default is 
a security risk.

cheers
andrew


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Mike Mascari
Andrew Dunstan wrote:

Mike Mascari wrote:

but the side effect function will only run (unless you set it with 
security definer) with the privileges of the caller - it won't grant 
visibility to things that user can't otherwise  see.
If the visibility is determined by view definitions, such as using 
CURRENT_USER, which is an exceedingly common practice, then the caller 
will be able to record tuples before they are filtered by the executor.
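The pattern Mike is describing is roughly this (table and view names invented):

```sql
-- Row filtering via a CURRENT_USER view: users query the view, never
-- the base table. This is the common setup whose interaction with
-- side-effecting functions is at issue in this thread.
CREATE TABLE salaries (employee text, salary numeric);

CREATE VIEW my_salary AS
  SELECT employee, salary
  FROM salaries
  WHERE employee = CURRENT_USER;

REVOKE ALL ON salaries FROM PUBLIC;
GRANT SELECT ON my_salary TO PUBLIC;
```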

In any case, you should define your security setup  with the 
capabilities / limitations of the db engine in mind. If there is any 
security problem in your scenario, it is that you appear to have made 
unwarranted assumptions about how postgres works, rather than that 
postgres has a problem.
I think most people coming from any other enterprise-class RDBMS 
environment will be surprised that they cannot use VIEWs to provide 
user-specific views on data. I could be wrong, but I'd put money on it...

Either way, this does not illustrate how enabling plpgsql by default is 
a security risk.
Correct, as the vulnerability exists within the 'SQL' language as well. 
The only difference is that enabling plpgsql by default changes it from 
a leak to a full blown flood.

Mike Mascari


Re: [HACKERS] pl/pgsql enabled by default

2005-05-07 Thread Neil Conway
Mike Mascari wrote:
Correct, as the vulnerability exists within the 'SQL' language as well. 
The only difference is that enabling plpgsql by default changes it from 
a leak to a full blown flood.
How does it make any difference at all?
-Neil


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tatsuo Ishii
We have developed patches which relaxes the character validation so
that PostgreSQL accepts invalid characters. It works like this:

1) new postgresql.conf item mbstr_check added.
2) if mbstr_check = 0 then invalid characters are not accepted
   (same as current PostgreSQL behavior). This is the default.
3) if mbstr_check = 1 then invalid characters are accepted with
   WARNING
4) if mbstr_check = 2 then invalid characters are accepted without any
   warnings
5) We have checked PostgreSQL source code if accepting invalid
   characters makes some troubles. We have found that we need to fix a
   place and the fix is included in the patches.
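As a configuration fragment, the proposal would presumably look something like this (the GUC name comes from the patch description above; it is not in any released PostgreSQL):

```
# hypothetical postgresql.conf fragment for the proposed patch
mbstr_check = 1    # 0 = reject invalid multibyte characters (default)
                   # 1 = accept them with a WARNING
                   # 2 = accept them silently
```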

Madison,
If you are interested in the patches, I could send it to you.

Hackers,
Do you think functionality like the above is worth adding to
PostgreSQL?
--
Tatsuo Ishii

 Hi all,
 
I've been chasing down a bug and from what I have learned it may be 
 because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles 
 invalid unicode. I've been given some ideas on how to try to catch 
 invalid unicode but it seems expensive so I am hoping there is a 
 postgresql way to deal with this problem.
 
I've run into a problem where a bulk postgres COPY... statement is 
 dieing because one of the lines contains a file name with an invalid 
 unicode character. In nautilus this file has '(invalid encoding)' and 
 the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column 
 file_name: Femme Fatal\u.url'.
 
To actually look at the file from the shell (bash) shows what appears 
 to be a whitespace but when I copy/paste the file name I get the 
 '\u' you see above.
 
I could, with the help of the TLUG people, use regex to match for an 
 invalid character and skip the file but that is not ideal. The reason is 
 that this is for my backup program and invalid unicode or not, the 
 contents of the file may still be important and I would prefer to have 
 it in the database so that it is later copied. I can copy and move the 
 file in the shell so the file isn't apparently in an of itself corrupt.
 
So then, is there a way I can tell postresql to accept the invalid 
 unicode name? Here is a copy of my schema:
 
 tle-bu=> \d file_info_2
              Table "public.file_info_2"
        Column        |         Type         |       Modifiers
 ---------------------+----------------------+------------------------
  file_group_name     | text                 |
  file_group_uid      | bigint               | not null
  file_mod_time       | bigint               | not null
  file_name           | text                 | not null
  file_parent_dir     | text                 | not null
  file_perm           | text                 | not null
  file_size           | bigint               | not null
  file_type           | character varying(2) | not null default 'f'::character varying
  file_user_name      | text                 |
  file_user_uid       | bigint               | not null
  file_backup         | boolean              | not null default true
  file_display        | boolean              | not null default false
  file_restore_display | boolean             | not null default false
  file_restore        | boolean              | not null default false
 Indexes:
     "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)
 
'file_name' and 'file_parent_dir' are the columns that could have 
 entries with the invalid unicode characters. Maybe I could/should use 
 something other than 'text'? These columns could contain anything that a 
 file or directory name could be.
 
Thanks!
 
 Madison
 
 -- 
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Madison Kelly (Digimer)
 TLE-BU, The Linux Experience; Back Up
 http://tle-bu.thelinuxexperience.com
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 
 



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tom Lane wrote:
 John Hansen [EMAIL PROTECTED] writes:
  Btw, I had been planning to propose replacing every single 
 one of the 
  built in charset conversion functions with calls to ICU 
 (thus making 
  pg _depend_ on ICU),
 
 I find that fairly unacceptable ... especially given the 
 licensing questions, but in any case.

The licencing seems pretty clear to me.
http://www-306.ibm.com/software/globalization/icu/license.jsp

 
 It might be OK to rip out the existing conversion support and 
 say that *if* you want encoding conversion, you have to use 
 ICU.  But I don't want to be told you cannot build PG without 
 ICU period.

Right, that could be done, but I think the issue at heart is _are_ 
we going to use it at all, and if so, locale support would certainly 
benefit from going that way as well.

 
 The 3.2 vs 2.8 business is disturbing also; specifically, I 
 don't think we get to require 3.2 on a platform where 2.8 is 
 installed.

There seems to be nothing in the ICU licence that would prevent us from
bundling it.
This would solve both the 3.2 vs 2.8 problems, and would remove the
'dependency'.

 People just aren't going to hold still for that, even 
 assuming that ICU supports installing both versions at once, 
 which isn't clear to me at the moment ...

There's no problems with having both installed.
I did that on debian to get the patch going.
Tho, bundling it seems cleaner to me.

 
   regards, tom lane
 
 



Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
 I don't buy it. If current conversion tables does the right 
 thing, why we need to replace. Or if conversion tables are 
 not correct, why don't you fix it? I think the rule of 
 character conversion will not change frequently, especially 
 for LATIN languages. Thus maintaining cost is not too high.

I never said we need to, but if we're going to implement ICU,
then we might as well go all the way.

... John


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Tatsuo Ishii wrote:
 Sent: Sunday, May 08, 2005 10:09 AM
 To: John Hansen
 Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; 
 pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Patch for collation using ICU
 
  Bruce Momjian wrote:
  
   There are two reasons for that optimization --- first, some locale
   support is broken and Unicode encoding with a C locale crashes (not
   an issue for ICU), and second, it is an optimization for languages
   like Japanese that want to use unicode, but don't need a locale
   because upper/lower means nothing in those character sets.
  
  No, upper/lower means nothing in those languages, so why would you
  need to optimize upper/lower if they're not used??
  And if they are, it's obviously because the text contains characters
  from other languages (probably english) and as such they should behave
  correctly.
 
 Yes, Japanese (and probably Chinese and Korean) languages 
 include ASCII characters. More precisely, ASCII is part of Japanese
 encodings (LATIN1 is not, however). And we have no problem at 
 all with glibc/C locale. See below (unitest is an UNICODE database).
 
 unitest=# create table t1(t text);
 CREATE TABLE
 unitest=# \encoding EUC_JP
 unitest=# insert into t1 values('abc');
 INSERT 1842628 1
 unitest=# select upper(t) from t1;
upper   
 ---
  ABC
 (1 row)
 
 So Japanese(including ASCII)/UNICODE behavior is perfectly 
 correct at this moment. 

Right, so you _never_ use accented ascii characters in Japanese? 
(like è for example, whose uppercase is È)

 So I strongly object removing that optimization.

I'm guessing this would call for a vote then, since if we implement ICU,
I'd have to object to leaving it in.

Changing the behaviour of ICU doesn't seem right. Changing the behaviour of pg, 
so that it works as it should when using unicode, seems the right solution to 
me.
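(For illustration: the behaviour being argued for here is just Unicode's own default case mapping, which is defined per code point and needs no locale at all. A minimal Python sketch — not Postgres code:)

```python
# Unicode defines a default case mapping per code point, with no
# locale involved: 'è' (U+00E8) maps to 'È' (U+00C8).
def unicode_upper(s: str) -> str:
    # str.upper() applies the Unicode default case mapping
    return s.upper()

assert unicode_upper("abc è") == "ABC È"
```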

 --
 Tatsuo Ishii
 
 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Tatsuo Ishii wrote:
 Sent: Sunday, May 08, 2005 12:01 PM
 To: [EMAIL PROTECTED]
 Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
 
 We have developed patches which relaxes the character 
 validation so that PostgreSQL accepts invalid characters. It 
 works like this:

That is just plain 100% wrong!!
Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at 
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.
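(The strengthened validation being proposed would reject exactly the malformed sequences a strict UTF-8 decoder rejects. A hedged sketch in Python — the function name is illustrative, not the proposed Postgres API:)

```python
# Strict UTF-8 validation: reject truncated sequences, overlong
# encodings, surrogate code points, and values above U+10FFFF.
# Python's strict UTF-8 decoder enforces exactly these rules.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

assert is_valid_utf8("è".encode("utf-8"))
assert not is_valid_utf8(b"\xc0\xaf")  # overlong encoding of '/'
assert not is_valid_utf8(b"\xc3")      # truncated sequence
```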

... John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] rendezvous

2005-05-07 Thread Alvaro Herrera
On Sat, May 07, 2005 at 10:43:46AM +0200, Jochem van Dieten wrote:
 On 5/7/05, Alvaro Herrera wrote:
  On Fri, May 06, 2005 at 03:30:10PM -0700, Joshua D. Drake wrote:
  
  Rendezvous is the Apple network discovery protocol yes? That was renamed 
  Bonjour by apple due to a Trademark problem.
  
  Maybe we should name it Zeroconf.
 
 Is the implemented protocol IETF ZeroConf or Apple Rendezvous? IIRC
 there are differences in the requirements for TTLs and I have had some
 experiences with Apple reserving local.arpa while the rest of the
 world doesn't.

Huh, not sure.  The only thing the code does is check for
DNSServiceDiscovery.h and call a function therein.  No implementation
details are taken care of in the Postgres source.

My bet is that it only works on Mac OS X.  If there was interest I'm
sure we could make it work using libhowl but I don't see anybody asking ...

-- 
Alvaro Herrera ([EMAIL PROTECTED])
We don't need flags
 We don't recognize borders  (Jorge González)

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Madison Kelly wrote:
 Under most circumstances I would agree with you completely. In my
 case though I have to decide between risking a loss of a user's data or
 attempt to store the file name in some manner that would return the same
 name used by the file system.

 The user (or one of his/her users in the case of an admin) may be
 completely unaware of the file name being an invalid unicode name. The
 file itself though may still be quite valid and contain information
 worthy of backing up. I could notify the user/admin that the name is not
 valid but there is no way I could rely on the name being changed. Given
 the choices, I would prefer to attempt to store/use the file name with
 the invalid unicode character than simply ignore the file.

 Is there a way to store the name in raw binary? If so, would this not
 be safe because to postgresql it should no longer matter what data is or
 represents, right? Maybe there is a third option I am not yet
 considering?

Set the client_encoding to ascii when storing that name, and again when
retrieving it.
Or, use a bytea column.
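(To make the two options concrete, an illustrative Python sketch — not client-library code — of why keeping the name as raw bytes is safe: no decode ever happens, so no validation can fail:)

```python
# A file name taken from the filesystem need not be valid UTF-8.
raw_name = b"report-\xe9.txt"  # latin1 'é'; invalid as UTF-8

# bytea-style: keep it as bytes end to end -- nothing is ever
# decoded, so nothing can be rejected or corrupted.
stored = raw_name

# If a str is unavoidable, surrogateescape round-trips raw bytes
# losslessly through a Python string.
as_text = stored.decode("utf-8", errors="surrogateescape")
assert as_text.encode("utf-8", errors="surrogateescape") == raw_name
```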

 
 Madison

... John

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Alvaro Herrera
On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
 Tatsuo Ishii wrote:

  So Japanese(including ASCII)/UNICODE behavior is perfectly 
  correct at this moment. 
 
 Right, so you _never_ use accented ascii characters in Japanese? 
 (like è for example, whose uppercase is È)

That isn't ASCII.  It's latin1 or some other ASCII extension.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
The principal human characteristic is foolishness
(Augusto Monterroso)

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tom Lane
John Hansen [EMAIL PROTECTED] writes:
 Tatsuo Ishii wrote:
 We have developed patches which relaxes the character 
 validation so that PostgreSQL accepts invalid characters.

 That is just plain 100% wrong!!

That was my first reaction too.  Why would this be a good idea?
If someone does want an encoding-agnostic database, they can
set it as SQL_ASCII.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread John Hansen
Alvaro Herrera wrote:
 Sent: Sunday, May 08, 2005 2:49 PM
 To: John Hansen
 Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; 
 [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Patch for collation using ICU
 
 On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
  Tatsuo Ishii wrote:
 
   So Japanese(including ASCII)/UNICODE behavior is 
 perfectly correct 
   at this moment.
  
  Right, so you _never_ use accented ascii characters in Japanese? 
  (like è for example, whose uppercase is È)
 
 That isn't ASCII.  It's latin1 or some other ASCII extension.

Point taken...
But...

If you want EUC_JP (Japanese + ASCII) then use that as your backend
encoding, not UTF-8 (unicode).
UTF-8 encoded databases are very useful for representing multiple
languages in the same database, but this usefulness vanishes if
functions like upper/lower don't work correctly.

So optimizing for 3 languages breaks more than a hundred; that doesn't
seem fair!
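(What the C-locale optimization actually does to UTF-8 text can be sketched in a few lines of Python — a model of a byte-wise ASCII toupper(), not the Postgres source:)

```python
# A C-locale toupper() works one byte at a time and only maps a-z.
# The UTF-8 bytes of 'è' (0xC3 0xA8) fall outside that range, so
# the character passes through unchanged -- upper() silently does
# nothing for non-ASCII text.
def c_locale_upper(data: bytes) -> bytes:
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)

text = "abc è".encode("utf-8")
assert c_locale_upper(text).decode("utf-8") == "ABC è"
```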

 
 --
 Alvaro Herrera ([EMAIL PROTECTED]) The principal 
 human characteristic is foolishness
 (Augusto Monterroso)
 
 

... John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] Can we get patents?

2005-05-07 Thread Alvaro Herrera
Hackers,

I was reading LWN.net and noticed an article about Eben Moglen's keynote
at linux.conf.au.  Apparently he advises free software projects to get
patents on their best ideas.

Eben encouraged free software developers to record their novel
inventions and to obtain patents on the best of them. Free legal
help can be made available to obtain patents on the best ideas.
Until the rules of the game can be changed, we must play the
game, and having the right patents available may make all the
difference in defending against an attack.

http://lwn.net/Articles/133421/

Eben Moglen is the FSF's attorney.

I'm wondering, could the PostgreSQL Foundation (or some other entity)
get patents on some parts of Postgres?  Maybe ResourceOwners for
example; or the newer parts of the optimizer.

The patents would be freely licensed to everyone (including commercial
redistributors and developers/users of competing products), except to
patent litigators, or something like that.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
The thorn pricks from the moment it sprouts (African proverb)

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tatsuo Ishii
 Tatsuo Ishii wrote:
  Sent: Sunday, May 08, 2005 12:01 PM
  To: [EMAIL PROTECTED]
  Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
  
  We have developed patches which relaxes the character 
  validation so that PostgreSQL accepts invalid characters. It 
  works like this:
 
 That is just plain 100% wrong!!

 Under no circumstances should there be invalid data in a database.
 And if you're trying to make a database of invalid data, then at 
 least encode it using a valid encoding.
 
 In fact, I've proposed strengthening the validation routines for UTF-8.

Actually I myself thought that way before. Later I found that it was
not such a good idea. People already have invalidly encoded data in their
precious databases and have a very hard time migrating to newer versions
of PostgreSQL because of encoding validation.

Think about this kind of situation:

There is a table t1(member_id integer primary key, member_name text,
address text, phone text, email text). I have to reach each member by
either address, phone or email. Unfortunately some of the address fields
have wrongly encoded data. In this case I will use phone or email to
reach them. 

Now I need to upgrade to a newer PostgreSQL within 1 day. I know I have
to fix the wrongly encoded fields but it will take more than 1 day. So I
would like to import the data first, then fix the wrongly encoded fields
on the running database, since I can reach members by phone or email even
with a wrongly encoded address field...

I saw this kind of situation in the real world and that's why we
developed the patches.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Patch for collation using ICU

2005-05-07 Thread Tatsuo Ishii
 Alvaro Herrera wrote:
  Sent: Sunday, May 08, 2005 2:49 PM
  To: John Hansen
  Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; 
  [EMAIL PROTECTED]; pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] Patch for collation using ICU
  
  On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote:
   Tatsuo Ishii wrote:
  
So Japanese(including ASCII)/UNICODE behavior is 
  perfectly correct 
at this moment.
   
   Right, so you _never_ use accented ascii characters in Japanese? 
   (like è for example, whose uppercase is È)
  
  That isn't ASCII.  It's latin1 or some other ASCII extension.
 
 Point taken...
 But...
 
 If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, 
 not UTF-8 (unicode).
 UTF-8 encoded databases are very useful for representing multiple languages 
 in the same database, but this usefulness vanishes if functions like 
 upper/lower don't work correctly.

I'm just curious if German/French/Spanish mixed text can be sorted
correctly. I think these languages need their own locales even with
UNICODE/ICU.
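(This point can be illustrated quickly in Python — a sketch only; a real locale-aware collator such as ICU's is what a backend would actually use:)

```python
# Raw code-point order puts 'ä' (U+00E4) after 'z' (U+007A), while
# German collation sorts it next to 'a'. A single code-point sort
# cannot satisfy German, French and Spanish expectations at once.
words = ["zoo", "ähre", "apfel"]
assert sorted(words) == ["apfel", "zoo", "ähre"]  # code-point order
# A locale-aware collator (e.g. ICU under de_DE rules) would give
# ["apfel", "ähre", "zoo"] instead.
```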

 So optimizing for 3 languages breaks more than a hundred; that doesn't seem 
 fair!

Why don't you add a GUC variable or some such to control the
upper/lower behavior?
--
Tatsuo Ishii

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org