Re: [HACKERS] FC3 broken with HEAD
Tom Lane wrote: Yeah, we will. Please file a bugzilla entry for this though --- I concur that it is a linker bug. Okay, patch reverted. The RH bug is here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=157126 -Neil ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] CVS HEAD problem: psql: symbol lookup error:
On Sat, 7 May 2005, Markus Bertheau wrote: See FC3 broken with HEAD. I have Slackware 10.1 , 06/05/2005 23:54 +0400, Oleg Bartunov : Just got this problem. Regards, Oleg _ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly Regards, Oleg _ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83 ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Patch for collation using ICU
Where'd you get the licence from? None of that is in the licence I'm reading! (http://www-306.ibm.com/software/globalization/icu/index.jsp) (http://www-306.ibm.com/software/globalization/icu/license.jsp) ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Saturday, May 07, 2005 3:17 PM To: Bruce Momjian Cc: Palle Girgensohn; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU Bruce Momjian pgman@candle.pha.pa.us writes: Tom Lane wrote: Not until ICU is released under a BSD license ... Well, readline isn't BSD either, but we use it. It is any different? Did you read the license? Some of the more troubling bits: : It is the understanding of INTERNATIONAL BUSINESS MACHINES CORPORATION : that the purpose for which its publications are being reproduced is : accurate and true as stated in your attached request. (er, which attached request would that be?) : Permission to quote from or reprint IBM publications is limited to the : purpose and quantities originally requested and must not be construed as : a blanket license to use the material for other purposes or to reprint : other IBM copyrighted material. : IBM reserves the right to withdraw permission to reproduce copyrighted : material whenever, in its discretion, it feels that the privilege of : reproducing its material is being used in a way detrimental to its : interest or the above instructions are not being followed properly to : protect its copyright. : IBM may have patents or pending patent applications covering subject : matter in this document. The furnishing of this document does not give : you any license to these patents. You can send license inquiries, in : writing, to: : For license inquiries regarding double-byte (DBCS) information, contact : the IBM Intellectual Property Department in your country or send : inquiries, in writing, to: regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED]) ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] rendezvous
On 5/7/05, Alvaro Herrera wrote: On Fri, May 06, 2005 at 03:30:10PM -0700, Joshua D. Drake wrote: Rendezvous is the Apple network discovery protocol yes? That was renamed Bonjour by apple due to a Trademark problem. Maybe we should name it Zeroconf. Is the implemented protocol IETF ZeroConf or Apple Rendezvous? IIRC there are differences in the requirements for TTLs and I have had some experiences with Apple reserving local.arpa while the rest of the world doesn't. Jochem ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] pl/pgsql enabled by default
On Sat, 2005-05-07 at 14:52 +1000, Neil Conway wrote: Andrew Sullivan wrote: Sure it is. Don't enable anything you don't need, is the first security rule. Everything is turned off by default. If you want it, enable it. So would you have us disable all the non-essential builtin functions? (Many of which have has security problems in the past.) What about the builtin encoding conversions, non-btree indexes, or a myriad of features that not all users need or use? I support Andrew's comment, though might reword it to Don't enable anything that gives users programmable features or user exits by default. You can't use the builtin encoding functions or non-btree indexes to access things you are not supposed to. Anything that is *always* there provides a platform for malware. I'm not really sure what is wrong with the CREATE LANGUAGE statement anyway - it is dynamically accessible, so doesn't require changes that effect other database instance users. I do understand the wish to make the lives of admins easier, but this isn't a hard thing to do... What makes sense for the default configuration of an operating system (which by nature must be hardened against attack) does not necessarily make sense for a database system. Security is everybody's job, not just the OS guys. Personally, I forget that constantly, but the principle seems clear. Best Regards, Simon Riggs ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] pl/pgsql enabled by default
Simon Riggs wrote: I support Andrew's comment, though might reword it to Don't enable anything that gives users programmable features or user exits by default. Users can already define SQL functions by default, which certainly provides programmable features. I'm not quite sure what you mean by user exits. I guess I'm missing how pl/pgsql is a fundamentally greater security risk. You can't use the builtin encoding functions or non-btree indexes to access things you are not supposed to. How can you use pl/pgsql to access things you are not supposed to? -Neil ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] pgFoundry
Joshua D. Drake [EMAIL PROTECTED] writes: No that is public presentation of the project not project development. I don't see that people are going to be able to participate in development if they don't use the mailing lists. I am not arguing that but public mailing lists are no place to track status of sub projects or tasks. They are for discussion. What does it mean to track the status of something? How would the status change except by discussion? What would be the point of announcing the status of something without allowing people to comment? I think you have a severely flawed idea of how free software development proceeds. What you're describing sounds like something a manager of a commercial project would want. Perhaps it's something the managers of the people working on Postgres on behalf of some corporate sponsors might want but in those cases I doubt they would want the information to be public anyways. In the free software world there's no top-down management of the project with managers issuing direction and expecting feedback reports. People only want tools that make their lives easier. Not tools that make other people's lives easier at the expense of their own convenience. The programmers are not beholden to any corporate interests (other than their own sponsors, who presumably are getting all the feedback they're looking for privately). I'm rather surprised Postgres doesn't have a good bug tracking system. That's something most projects find pretty essential. Strangely enough the reason seems to be that Postgres really doesn't have many bugs... Unlike web browsers or GUIs or most of the other free software projects out there, databases don't tolerate bugs well. Any serious bug is cause for an immediate point release. The only use for a bug tracking system would really be for tracking all those pesky IWBNI bugs that never rise to urgent status. But tracking the status of sub-projects is just not the kind of thing free software people do. They send emails when they have something to say. -- greg ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
--On fredag, maj 06, 2005 23.31.20 -0400 Tom Lane [EMAIL PROTECTED] wrote: Bruce Momjian pgman@candle.pha.pa.us writes: Is this patch ready for application? Not until ICU is released under a BSD license ... It's not GPL anyway. Seems pretty much like the BSD license, at least more BSD-ish than GPL-ish. http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html /Palle ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Patch for collation using ICU
--On fredag, maj 06, 2005 22.57.59 -0400 Bruce Momjian pgman@candle.pha.pa.us wrote: Is this patch ready for application? http://people.freebsd.org/~girgen/postgresql-icu/pg-802-icu-2005-05-06.d iff.gz The web site is: http://people.freebsd.org/~girgen/postgresql-icu/readme.html I don't think so, not quite. I have not had any positive reports from linux users, this is only tested in a FreeBSD environment. I'd say it needs some more testing. Also, apparently, ICU is installed by default in many linux distributions, and usually it is version 2.8. Some linux users have asked me if there are plans for a patch that works with ICU 2.8. That's probably a good idea. IBM and the ICU folks seem to consider 3.2 to be the stable version, older versions are hard to find on their sites, but most linux distributers seem to consider it too bleeding edge, even gentoo. I don't know why they don't agree. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. Why is so much code added, for example, in lower()? The existing multibyte code is much smaller, and lots of code is added in other places too. ICU uses UTF-16 internally, so all strings must be converted from the database encoding to UTF-16. Since that means the strings need to be copied, I took the same approach as in varlena.c:varstr_cmp(), where small strings use the heap and only larger strings use a palloc. Comments in varstr_cmp about performance made me use that approach. Also, in the latest patch, I also added checks and logging for *every* status returned from ICU. I hope this will help debugging on debian, where previous version didn't work. That excessive status checking is hardly be necessary once the stuff is better tested. I think the string copying and heap/palloc choices stands for most of the code bloat, together with the excessive status checking and logging. Why do you need to add a mapping of encoding names from iana to our names? This was already answered by John Hansen... There's an old thread here about the choice of the name UNICODE to describe an encoding, which it doesn't. There's half a dozen unicode based encodings... UTF-8 is used by postgresql, that would have been a better name... Similarly for most other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL can't provide it, so a mapping table is required. I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. /Palle - -- Palle Girgensohn wrote: Hi! I've put together a patch for using IBM's ICU package for collation. If your OS does not have full support for collation ur uppercase/lowercase in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation, i.e. order by, and also lower(), upper() and initcap() to work properly, this patch will do just that. This patch is needed for FreeBSD, since this OS has no support for collation of for example unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales. It adds a configure switch, `--with-icu', which will set up the code to use ICU instead of wchar_t and wcscoll. This has been tested only on FreeBSD-4.11 FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems it is at least not slower than using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. I'd be delighted if some more experienced postgresql hackers would review this stuff. The patch is pretty compact, so it's fast reading :) I'm planning to add this patch as an option (tagged experimental) to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not? Any thoughts or ideas are welcome! Cheers, Palle Patch at: http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14. diff ICU at sourceforge: http://icu.sf.net/ ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your
Re: [HACKERS] Patch for collation using ICU
I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS]
unsubscribe ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Patch for collation using ICU
Palle Girgensohn wrote: Is this patch ready for application? I don't think so, not quite. I have not had any positive reports from linux users, this is only tested in a FreeBSD environment. I'd say it needs some more testing. OK. Also, apparently, ICU is installed by default in many linux distributions, and usually it is version 2.8. Some linux users have asked me if there are plans for a patch that works with ICU 2.8. That's probably a good idea. IBM and the ICU folks seem to consider 3.2 to be the stable version, older versions are hard to find on their sites, but most linux distributers seem to consider it too bleeding edge, even gentoo. I don't know why they don't agree. Good point. Why would linux folks need ICU? Doesn't their OS support encodings natively? I am particularly excited about this for OSs that don't have such encodings, like UTF8 support for Win32. Because ICU will not be used unless enabled by configure, it seems we are fine with only supporting the newest version. Do Linux users need to use ICU for any reason? I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. The additional test is more of an optmization, and it fixes a problem with some OSs that have processing problems with UTF8 when the locale is supposed to be turned off, like in C. I realize ICU might be fine with it but the optimization still is an issue. Why is so much code added, for example, in lower()? The existing multibyte code is much smaller, and lots of code is added in other places too. ICU uses UTF-16 internally, so all strings must be converted from the database encoding to UTF-16. Since that means the strings need to be copied, I took the same approach as in varlena.c:varstr_cmp(), where small strings use the heap and only larger strings use a palloc. Comments in varstr_cmp about performance made me use that approach. Oh, interesting. I think you need to create new functions that factor out that common code so the patch is smaller and easier to maintain. Also, in the latest patch, I also added checks and logging for *every* status returned from ICU. I hope this will help debugging on debian, where previous version didn't work. That excessive status checking is hardly be necessary once the stuff is better tested. I think the string copying and heap/palloc choices stands for most of the code bloat, together with the excessive status checking and logging. OK, move that into some common functions and I think it will be better. Why do you need to add a mapping of encoding names from iana to our names? This was already answered by John Hansen... There's an old thread here about the choice of the name UNICODE to describe an encoding, which it doesn't. There's half a dozen unicode based encodings... UTF-8 is used by postgresql, that would have been a better name... Similarly for most other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL can't provide it, so a mapping table is required. We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that help? I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. OK, I certainly would like this all done for 8.1 which should have feature freeze on July 1. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
Errm,... initdb --encoding UNICODE --locale C -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Saturday, May 07, 2005 10:23 PM To: Palle Girgensohn; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 22.53.46 +1000 John Hansen [EMAIL PROTECTED] wrote: Errm,... initdb --encoding UNICODE --locale C You mean that ICU *shall* be used even for the C locale, and not as Bruce suggested here: I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. John disagrees here, and I'm obliged to agree. Using the C locale, one will expect C collation, but upper/lower is better off still using ICU. Hence, the above stuff is *not* a bug. Do we agree? /Palle -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Saturday, May 07, 2005 10:23 PM To: Palle Girgensohn; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
Bruce Momjian wrote: Palle Girgensohn wrote: Is this patch ready for application? I don't think so, not quite. I have not had any positive reports from linux users, this is only tested in a FreeBSD environment. I'd say it needs some more testing. OK. Also, apparently, ICU is installed by default in many linux distributions, and usually it is version 2.8. Some linux users have asked me if there are plans for a patch that works with ICU 2.8. That's probably a good idea. IBM and the ICU folks seem to consider 3.2 to be the stable version, older versions are hard to find on their sites, but most linux distributers seem to consider it too bleeding edge, even gentoo. I don't know why they don't agree. Good point. Why would linux folks need ICU? Doesn't their OS support encodings natively? I am particularly excited about this for OSs that don't have such encodings, like UTF8 support for Win32. Because ICU will not be used unless enabled by configure, it seems we are fine with only supporting the newest version. Do Linux users need to use ICU for any reason? Yes, because on many linux platforms locale support is broken. Also, ICU enables full unicode support, particularly in multi-language situations where locale is C, and makes upper/lower/initcap work as expected, except where it depends on locale information. There are also many other useful things in ICU that could be implemented. Transliteration, and break-iterators for example. Break-iteration particularly interresting for converting a text to a list of words. Another is it's builtin substring searches. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. The additional test is more of an optmization, and it fixes a problem with some OSs that have processing problems with UTF8 when the locale is supposed to be turned off, like in C. I realize ICU might be fine with it but the optimization still is an issue. That the locale is supposed to be turned off, doesn't mean it shouldn't use ICU. ICU is more than just locales. Why is so much code added, for example, in lower()? The existing multibyte code is much smaller, and lots of code is added in other places too. ICU uses UTF-16 internally, so all strings must be converted from the database encoding to UTF-16. Since that means the strings need to be copied, I took the same approach as in varlena.c:varstr_cmp(), where small strings use the heap and only larger strings use a palloc. Comments in varstr_cmp about performance made me use that approach. Oh, interesting. I think you need to create new functions that factor out that common code so the patch is smaller and easier to maintain. Also, in the latest patch, I also added checks and logging for *every* status returned from ICU. I hope this will help debugging on debian, where previous version didn't work. That excessive status checking is hardly be necessary once the stuff is better tested. I think the string copying and heap/palloc choices stands for most of the code bloat, together with the excessive status checking and logging. OK, move that into some common functions and I think it will be better. Why do you need to add a mapping of encoding names from iana to our names? This was already answered by John Hansen... There's an old thread here about the choice of the name UNICODE to describe an encoding, which it doesn't. There's half a dozen unicode based encodings... UTF-8 is used by postgresql, that would have been a better name... Similarly for most other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL can't provide it, so a mapping table is required. We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that help? I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. OK, I certainly would like this all done for 8.1 which should have feature freeze on July 1. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 22.53.46 +1000 John Hansen [EMAIL PROTECTED] wrote: Errm,... initdb --encoding UNICODE --locale C You mean that ICU *shall* be used even for the C locale, and not as Bruce suggested here: Yes, that's exactly what I mean. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. John disagrees here, and I'm obliged to agree. Using the C locale, one will expect C collation, but upper/lower is better off still using ICU. Hence, the above stuff is *not* a bug. Do we agree? /Palle -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Saturday, May 07, 2005 10:23 PM To: Palle Girgensohn; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversion tables. ICU also has a fair few conversions that we do not have at present. Any thoughts? ... John -Original Message- From: John Hansen Sent: Saturday, May 07, 2005 11:09 PM To: 'Palle Girgensohn'; 'Bruce Momjian' Cc: 'pgsql-hackers@postgresql.org' Subject: RE: [HACKERS] Patch for collation using ICU --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen [EMAIL PROTECTED] wrote: Errm,... initdb --encoding UNICODE --locale C You mean that ICU *shall* be used even for the C locale, and not as Bruce suggested here: Yes, that's exactly what I mean. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. John disagrees here, and I'm obliged to agree. Using the C locale, one will expect C collation, but upper/lower is better off still using ICU. Hence, the above stuff is *not* a bug. Do we agree? /Palle -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Saturday, May 07, 2005 10:23 PM To: Palle Girgensohn; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian pgman@candle.pha.pa.us wrote: Palle Girgensohn wrote: Is this patch ready for application? I don't think so, not quite. I have not had any positive reports from linux users, this is only tested in a FreeBSD environment. I'd say it needs some more testing. OK. John Hansen just reported that it does work on linux. fine! Also, apparently, ICU is installed by default in many linux distributions, and usually it is version 2.8. Some linux users have asked me if there are plans for a patch that works with ICU 2.8. That's probably a good idea. IBM and the ICU folks seem to consider 3.2 to be the stable version, older versions are hard to find on their sites, but most linux distributers seem to consider it too bleeding edge, even gentoo. I don't know why they don't agree. Good point. Why would linux folks need ICU? Doesn't their OS support encodings natively? I am particularly excited about this for OSs that don't have such encodings, like UTF8 support for Win32. Because ICU will not be used unless enabled by configure, it seems we are fine with only supporting the newest version. Do Linux users need to use ICU for any reason? There are corner cases where it is impossible to upper/lowercase one character at the time. for example: -- without ICU select upper('Eßer'); upper --- EßER (1 row) -- with ICU select upper('Eßer'); upper --- ESSER (1 rad) This is because in the standard postgres implementation, upper/lower is done one character at the time. A proper upper/lower cannot do it that way. Other known example is in Turkish, where an Ì (?) should look different whether it is an initial letter or not. This fails in standard postgresql for all platforms. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. The additional test is more of an optmization, and it fixes a problem with some OSs that have processing problems with UTF8 when the locale is supposed to be turned off, like in C. I realize ICU might be fine with it but the optimization still is an issue. Well, the results are quite different, depending on whether ICU is used or not. See separate mail. Why is so much code added, for example, in lower()? The existing multibyte code is much smaller, and lots of code is added in other places too. ICU uses UTF-16 internally, so all strings must be converted from the database encoding to UTF-16. Since that means the strings need to be copied, I took the same approach as in varlena.c:varstr_cmp(), where small strings use the heap and only larger strings use a palloc. Comments in varstr_cmp about performance made me use that approach. Oh, interesting. I think you need to create new functions that factor out that common code so the patch is smaller and easier to maintain. Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me... Also, in the latest patch, I also added checks and logging for *every* status returned from ICU. I hope this will help debugging on debian, where previous version didn't work. That excessive status checking is hardly be necessary once the stuff is better tested. I think the string copying and heap/palloc choices stands for most of the code bloat, together with the excessive status checking and logging. OK, move that into some common functions and I think it will be better. Best way for upper/lower/initcap is probably to use a function pointer... uhh... Why do you need to add a mapping of encoding names from iana to our names? This was already answered by John Hansen... There's an old thread here about the choice of the name UNICODE to describe an encoding, which it doesn't. There's half a dozen unicode based encodings... UTF-8 is used by postgresql, that would have been a better name... Similarly for most other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL can't provide it, so a mapping table is required. We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that help? I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. OK, I certainly would like this all done for 8.1 which should have feature freeze on
Re: [HACKERS] Patch for collation using ICU
Palle Girgensohn wrote: I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 UTF-16BE x-utf-16be ibm-1200 ibm-1201 ibm-5297 ibm-13488 ibm-17584 windows-1201 cp1200 cp1201 UTF16_BigEndian UTF-16LE x-utf-16le ibm-1202 ibm-13490 ibm-17586 UTF16_LittleEndian windows-1200 UTF-32 ISO-10646-UCS-4 csUCS4 ucs-4 UTF-32BE UTF32_BigEndian ibm-1232 ibm-1233 UTF-32LE UTF32_LittleEndian ibm-1234 UTF16_PlatformEndian UTF16_OppositeEndian UTF32_PlatformEndian UTF32_OppositeEndian UTF-7 windows-65000 IMAP-mailbox-name SCSU BOCU-1 csBOCU-1 CESU-8 ISO-8859-1 ibm-819 IBM819 cp819 latin1 8859_1 csISOLatin1 iso-ir-100 ISO_8859-1:1987 l1 819 US-ASCII ASCII ANSI_X3.4-1968 ANSI_X3.4-1986 ISO_646.irv:1991 iso_646.irv:1983 ISO646-US us csASCII iso-ir-6 cp367 ascii7 646 windows-20127 gb18030 ibm-1392 windows-54936 ibm-367_P100-1995 ibm-367 IBM367 ibm-912_P100-1995 ibm-912 iso-8859-2 ISO_8859-2:1987 latin2 csISOLatin2 iso-ir-101 l2 8859_2 cp912 912 windows-28592 ibm-913_P100-2000 ibm-913 iso-8859-3 ISO_8859-3:1988 latin3 csISOLatin3 iso-ir-109 l3 8859_3 cp913 913 windows-28593 ibm-914_P100-1995 ibm-914 iso-8859-4 latin4 csISOLatin4 iso-ir-110 ISO_8859-4:1988 l4 8859_4 cp914 914 windows-28594 ibm-915_P100-1995 ibm-915 iso-8859-5 cyrillic csISOLatinCyrillic iso-ir-144 ISO_8859-5:1988 8859_5 cp915 915 windows-28595 ibm-1089_P100-1995 ibm-1089 iso-8859-6 arabic csISOLatinArabic iso-ir-127 ISO_8859-6:1987 ECMA-114 ASMO-708 8859_6 cp1089 1089 windows-28596 ISO-8859-6-I ISO-8859-6-E ibm-813_P100-1995 ibm-813 iso-8859-7 greek greek8 ELOT_928 ECMA-118 csISOLatinGreek iso-ir-126 ISO_8859-7:1987 8859_7 cp813 813 windows-28597 ibm-916_P100-1995 ibm-916 iso-8859-8 hebrew csISOLatinHebrew iso-ir-138 ISO_8859-8:1988 ISO-8859-8-I ISO-8859-8-E 8859_8 cp916 916 windows-28598 ibm-920_P100-1995 ibm-920 iso-8859-9 latin5 csISOLatin5 iso-ir-148 ISO_8859-9:1989 l5 8859_9 cp920 920 windows-28599 ECMA-128 ibm-921_P100-1995 ibm-921 iso-8859-13 8859_13 cp921 921 ibm-923_P100-1998 ibm-923 iso-8859-15 Latin-9 l9 8859_15 latin0 csisolatin0 csisolatin9 iso8859_15_fdis cp923 923 windows-28605 ibm-942_P12A-1999 ibm-942 ibm-932 cp932 shift_jis78 sjis78 ibm-942_VSUB_VPUA ibm-932_VSUB_VPUA ibm-943_P15A-2003 ibm-943 Shift_JIS MS_Kanji csShiftJIS windows-31j csWindows31J x-sjis x-ms-cp932 cp932 windows-932 cp943c IBM-943C ms932 pck sjis ibm-943_VSUB_VPUA ibm-943_P130-1999 ibm-943 Shift_JIS cp943 943 ibm-943_VASCII_VSUB_VPUA ibm-33722_P12A-1999 ibm-33722 ibm-5050 EUC-JP Extended_UNIX_Code_Packed_Format_for_Japanese csEUCPkdFmtJapanese X-EUC-JP eucjis windows-51932 ibm-33722_VPUA IBM-eucJP ibm-33722_P120-1999 ibm-33722 ibm-5050 cp33722 33722 ibm-33722_VASCII_VPUA ibm-954_P101-2000 ibm-954 EUC-JP ibm-1373_P100-2002 ibm-1373 windows-950 windows-950-2000 Big5 csBig5 windows-950 x-big5 ibm-950_P110-1999 ibm-950 cp950 950 macos-2566-10.2 Big5-HKSCS big5hk HKSCS-BIG5 ibm-1375_P100-2003 ibm-1375 Big5-HKSCS ibm-1386_P100-2002 ibm-1386 cp1386 windows-936 ibm-1386_VSUB_VPUA windows-936-2000 GBK CP936 MS936 windows-936 ibm-1383_P110-1999 ibm-1383 GB2312 csGB2312 EUC-CN ibm-eucCN hp15CN cp1383 1383 ibm-1383_VPUA ibm-5478_P100-1995 ibm-5478 GB_2312-80 chinese iso-ir-58 csISO58GB231280 gb2312-1980 GB2312.1980-0 ibm-964_P110-1999 ibm-964 EUC-TW ibm-eucTW cns11643 cp964 964 ibm-964_VPUA ibm-949_P110-1999 ibm-949 cp949 949 ibm-949_VASCII_VSUB_VPUA ibm-949_P11A-1999 ibm-949 cp949c ibm-949_VSUB_VPUA ibm-970_P110-1995 ibm-970 EUC-KR KS_C_5601-1987 windows-51949 csEUCKR ibm-eucKR KSC_5601 5601 ibm-970_VPUA ibm-971_P100-1995 ibm-971 ibm-971_VPUA ibm-1363_P11B-1998 ibm-1363 KS_C_5601-1987 KS_C_5601-1989 KSC_5601 csKSC56011987 korean iso-ir-149 5601 cp1363 ksc windows-949 ibm-1363_VSUB_VPUA ibm-1363_P110-1997 ibm-1363 ibm-1363_VASCII_VSUB_VPUA windows-949-2000 windows-949 KS_C_5601-1987 KS_C_5601-1989 KSC_5601 csKSC56011987 korean iso-ir-149 ms949 ibm-1162_P100-1999 ibm-1162 ibm-874_P100-1995 ibm-874 ibm-9066 cp874 TIS-620 tis620.2533 eucTH cp9066 windows-874-2000 TIS-620 windows-874 MS874 ibm-437_P100-1995 ibm-437 IBM437 cp437 437 csPC8CodePage437 windows-437 ibm-850_P100-1995 ibm-850 IBM850 cp850 850 csPC850Multilingual windows-850 ibm-851_P100-1995 ibm-851 IBM851 cp851 851 csPC851 ibm-852_P100-1995 ibm-852 IBM852 cp852 852 csPCp852 windows-852 ibm-855_P100-1995 ibm-855 IBM855 cp855 855 csIBM855 csPCp855 ibm-856_P100-1995 ibm-856 cp856 856 ibm-857_P100-1995 ibm-857 IBM857 cp857 857 csIBM857 windows-857 ibm-858_P100-1997 ibm-858 IBM00858 CCSID00858 CP00858 PC-Multilingual-850+euro cp858 ibm-860_P100-1995 ibm-860 IBM860 cp860 860 csIBM860
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 23.25.15 +1000 John Hansen [EMAIL PROTECTED] wrote: Palle Girgensohn wrote: I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 No UTF8 in there. I think that's good, charset aliases are a hassle. /Palle ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
-Original Message- From: Palle Girgensohn [mailto:[EMAIL PROTECTED] Sent: Saturday, May 07, 2005 11:30 PM To: John Hansen; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: RE: [HACKERS] Patch for collation using ICU --On lördag, maj 07, 2005 23.25.15 +1000 John Hansen [EMAIL PROTECTED] wrote: Palle Girgensohn wrote: I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 No UTF8 in there. I think that's good, charset aliases are a hassle. Yup! :) /Palle ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 22.22.52 +1000 John Hansen [EMAIL PROTECTED] wrote: I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Ah, good to hear, John. I beleive your report about linux is what's keeping this back. Did you also manage to get it running on Debian? /Palle ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
-Original Message- From: Palle Girgensohn [mailto:[EMAIL PROTECTED] Sent: Saturday, May 07, 2005 11:33 PM To: John Hansen; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: RE: [HACKERS] Patch for collation using ICU --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen [EMAIL PROTECTED] wrote: I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Ah, good to hear, John. I beleive your report about linux is what's keeping this back. Did you also manage to get it running on Debian? Not without ICU3.2 /Palle ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] wrote: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversion tables. ICU also has a fair few conversions that we do not have at present. Any thoughts? I just had a similar though. And why use ICU only for multibyte charsets? If I use LATIN1, I still expect upper('ß') = SS, and I don't get it... Same for the Turkish example. It does eat more memory, and can perhaps cush some performance bits? With the current scheme, a strdup is often enough, or at least just one palloc. With ICU, using UTF-16, you must allocate memory twice, once for the ICU internal UTF-16 representation. That's not a very strong objection, though, as this would be an option... :) John, I have a hard time finding docs about what differs in ICU 2.8 from 3.2. Do you have any pointers? It seems 3.2 has much more support and bug fixes, I'm not sure if we should really consider 2.8? /Palle ... John -Original Message- From: John Hansen Sent: Saturday, May 07, 2005 11:09 PM To: 'Palle Girgensohn'; 'Bruce Momjian' Cc: 'pgsql-hackers@postgresql.org' Subject: RE: [HACKERS] Patch for collation using ICU --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen [EMAIL PROTECTED] wrote: Errm,... initdb --encoding UNICODE --locale C You mean that ICU *shall* be used even for the C locale, and not as Bruce suggested here: Yes, that's exactly what I mean. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. John disagrees here, and I'm obliged to agree. Using the C locale, one will expect C collation, but upper/lower is better off still using ICU. Hence, the above stuff is *not* a bug. Do we agree? /Palle -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Saturday, May 07, 2005 10:23 PM To: Palle Girgensohn; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 23.33.31 +1000 John Hansen [EMAIL PROTECTED] wrote: -Original Message- From: Palle Girgensohn [mailto:[EMAIL PROTECTED] Sent: Saturday, May 07, 2005 11:33 PM To: John Hansen; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: RE: [HACKERS] Patch for collation using ICU --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen [EMAIL PROTECTED] wrote: I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Ah, good to hear, John. I beleive your report about linux is what's keeping this back. Did you also manage to get it running on Debian? Not without ICU3.2 Did you try the latest patch? Maybe it will help, and if not, it will (hopefully) give a lot more informative error messages. /Palle ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
John Hansen wrote: Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 [snip] Don't we use unicode as an alias for UTF-8 ? cheers andrew ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
Did you try the latest patch? Maybe it will help, and if not, it will (hopefully) give a lot more informative error messages. No, and I got rid of my debian boxes @ home. The patch required a certain amount of modifications too, to even compile with 2.8. So I guess it's a valid question to ask: it it worth supporting 2.8? It is of course an option to bundle icu 3.2 with pg! ... John ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Patch for collation using ICU
John Hansen wrote: --On l?rdag, maj 07, 2005 22.53.46 +1000 John Hansen [EMAIL PROTECTED] wrote: Errm,... initdb --encoding UNICODE --locale C You mean that ICU *shall* be used even for the C locale, and not as Bruce suggested here: Yes, that's exactly what I mean. There are two reasons for that optimization --- first, some locale support is broken and Unicode encoding with a C locale crashes (not an issue for ICU), and second, it is an optimization for languages like Japanese that want to use unicode, but don't need a locale because upper/lower means nothing in those character sets. So, the first issue doesn't apply for ICU, and the second might not depending on what characters you are using in the Unicode character set. I guess I am little confused how ICU can do upper() when the locale is C. What is it using to determine A is upper for a? Am I confused? -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Patch for collation using ICU
-Original Message- From: Andrew Dunstan [mailto:[EMAIL PROTECTED] Sent: Saturday, May 07, 2005 11:39 PM To: John Hansen Cc: Palle Girgensohn; Bruce Momjian; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU John Hansen wrote: Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 [snip] Don't we use unicode as an alias for UTF-8 ? Yes, unfortunately! cheers andrew ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Palle Girgensohn wrote: Also, apparently, ICU is installed by default in many linux distributions, and usually it is version 2.8. Some linux users have asked me if there are plans for a patch that works with ICU 2.8. That's probably a good idea. IBM and the ICU folks seem to consider 3.2 to be the stable version, older versions are hard to find on their sites, but most linux distributers seem to consider it too bleeding edge, even gentoo. I don't know why they don't agree. Good point. Why would linux folks need ICU? Doesn't their OS support encodings natively? I am particularly excited about this for OSs that don't have such encodings, like UTF8 support for Win32. Because ICU will not be used unless enabled by configure, it seems we are fine with only supporting the newest version. Do Linux users need to use ICU for any reason? There are corner cases where it is impossible to upper/lowercase one character at the time. for example: -- without ICU select upper('E?er'); upper --- E?ER (1 row) -- with ICU select upper('E?er'); upper --- ESSER (1 rad) This is because in the standard postgres implementation, upper/lower is done one character at the time. A proper upper/lower cannot do it that way. Other known example is in Turkish, where an ? (?) should look different whether it is an initial letter or not. This fails in standard postgresql for all platforms. Uh, where do you see that? Our code has: workspace = texttowcs(string); for (i = 0; workspace[i] != 0; i++) workspace[i] = towupper(workspace[i]); result = wcstotext(workspace, i); Also, in the latest patch, I also added checks and logging for *every* status returned from ICU. I hope this will help debugging on debian, where previous version didn't work. That excessive status checking is hardly be necessary once the stuff is better tested. I think the string copying and heap/palloc choices stands for most of the code bloat, together with the excessive status checking and logging. OK, move that into some common functions and I think it will be better. Best way for upper/lower/initcap is probably to use a function pointer... uhh... Uh, I don't think so. Just send pointers to the the function and let the function allocate the memory, and another function to free them, or something like that. I can probably do it if you want. Why do you need to add a mapping of encoding names from iana to our names? This was already answered by John Hansen... There's an old thread here about the choice of the name UNICODE to describe an encoding, which it doesn't. There's half a dozen unicode based encodings... UTF-8 is used by postgresql, that would have been a better name... Similarly for most other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL can't provide it, so a mapping table is required. We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that help? I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. OK. I guess I am just confused why the native implementations are OK. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
Andrew Dunstan wrote: John Hansen wrote: Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 [snip] Don't we use unicode as an alias for UTF-8 ? We do. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Bruce Momjian wrote: There are two reasons for that optimization --- first, some locale support is broken and Unicode encoding with a C locale crashes (not an issue for ICU), and second, it is an optimization for languages like Japanese that want to use unicode, but don't need a locale because upper/lower means nothing in those character sets. No, upper/lower means nothing in those languages, so why would you need to optimize upper/lower if they're not used?? And if they are, it's obviously because the text contains characters from other languages (probably english) and as such they should behave correctly. Did I mention that for japanese and the like, ICU would also offer transliteration... So, the first issue doesn't apply for ICU, and the second might not depending on what characters you are using in the Unicode character set. I guess I am little confused how ICU can do upper() when the locale is C. What is it using to determine A is upper for a? Am I confused? Simple, UNICODE basically consist of a table of characters (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) Excerpt: 0041;LATIN CAPITAL LETTER A;Lu;0;L;N0061; ... 0061;LATIN SMALL LETTER A;Ll;0;L;N;;;0041;;0041 From this you can see, that for 0041, which is capital letter A, there is a mapping to it's lowercase counterpart, 0061 Likewise, there is a mapping for 0061 which says it's uppercase counterpart is 0041. There is also SpecialCasing.txt which covers those mappings that haven't got a 1-1 mapping, such as the german SS. These mappings are fixed, independent of locale, only a few cases from specialcasing.txt depend on locale/context. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
Palle Girgensohn wrote: --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] wrote: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversion tables. ICU also has a fair few conversions that we do not have at present. That is a much larger issue, similar to our shipping our own timezone database. What does it buy us? o Do we ship it in our tarball? o Is the license compatible? o Does it remove utils/mb conversions? o Does it allow us to index LIKE (next high char)? o Does it allow us to support multiple encodings in a single database easier? o performance? I just had a similar though. And why use ICU only for multibyte charsets? If I use LATIN1, I still expect upper('?') = SS, and I don't get it... Same for the Turkish example. We assume the native toupper() can handle single-byte character encodings. We use towupper() only for wide character sets. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Patch for collation using ICU
It seems 3.2 has much more support and bug fixes, I'm not sure if we should really consider 2.8? As I said, probably not worth the effort. ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 09.52.59 -0400 Bruce Momjian pgman@candle.pha.pa.us wrote: Palle Girgensohn wrote: Also, apparently, ICU is installed by default in many linux distributions, and usually it is version 2.8. Some linux users have asked me if there are plans for a patch that works with ICU 2.8. That's probably a good idea. IBM and the ICU folks seem to consider 3.2 to be the stable version, older versions are hard to find on their sites, but most linux distributers seem to consider it too bleeding edge, even gentoo. I don't know why they don't agree. Good point. Why would linux folks need ICU? Doesn't their OS support encodings natively? I am particularly excited about this for OSs that don't have such encodings, like UTF8 support for Win32. Because ICU will not be used unless enabled by configure, it seems we are fine with only supporting the newest version. Do Linux users need to use ICU for any reason? There are corner cases where it is impossible to upper/lowercase one character at the time. for example: -- without ICU select upper('E?er'); upper --- E?ER (1 row) -- with ICU select upper('E?er'); upper --- ESSER (1 rad) This is because in the standard postgres implementation, upper/lower is done one character at the time. A proper upper/lower cannot do it that way. Other known example is in Turkish, where an ? (?) should look different whether it is an initial letter or not. This fails in standard postgresql for all platforms. Uh, where do you see that? Our code has: workspace = texttowcs(string); for (i = 0; workspace[i] != 0; i++) workspace[i] = towupper(workspace[i]); as you see, the loop runs towupper for one character at the time. I cannot consider whether the letter is the initial, as required in Turkish, and it cannot really convert one character into two ('ß' - 'SS') result = wcstotext(workspace, i); Also, in the latest patch, I also added checks and logging for *every* status returned from ICU. I hope this will help debugging on debian, where previous version didn't work. That excessive status checking is hardly be necessary once the stuff is better tested. I think the string copying and heap/palloc choices stands for most of the code bloat, together with the excessive status checking and logging. OK, move that into some common functions and I think it will be better. Best way for upper/lower/initcap is probably to use a function pointer... uhh... Uh, I don't think so. Just send pointers to the the function and let the function allocate the memory, and another function to free them, or something like that. I can probably do it if you want. I'll check it out, it seems simple enough. We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that help? I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. OK. I guess I am just confused why the native implementations are OK. They're OK since they understand that UNICODE (or UTF8) is really utf-8. Problem is the strings used to describe them are not understood by ICU. BTW, the pg_enc2iananame_tbl is only used *from* internal representation *to* IANA, not the other way around. Maybe that fact lowers the rate of confusion? ;-) /Palle ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Patch for collation using ICU
Palle Girgensohn wrote: This is because in the standard postgres implementation, upper/lower is done one character at the time. A proper upper/lower cannot do it that way. Other known example is in Turkish, where an ? (?) should look different whether it is an initial letter or not. This fails in standard postgresql for all platforms. Uh, where do you see that? Our code has: workspace = texttowcs(string); for (i = 0; workspace[i] != 0; i++) workspace[i] = towupper(workspace[i]); as you see, the loop runs towupper for one character at the time. I cannot consider whether the letter is the initial, as required in Turkish, and it cannot really convert one character into two ('?' - 'SS') Oh, OK. I thought texttowcs() would expand the string to allow such conversions. We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that help? I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. OK. I guess I am just confused why the native implementations are OK. They're OK since they understand that UNICODE (or UTF8) is really utf-8. Problem is the strings used to describe them are not understood by ICU. BTW, the pg_enc2iananame_tbl is only used *from* internal representation *to* IANA, not the other way around. Maybe that fact lowers the rate of confusion? ;-) OK, got it. I am still a little confused why every native implementation understands our existing names but ICU does not. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Bruce Momjian wrote: Palle Girgensohn wrote: --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] wrote: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversion tables. ICU also has a fair few conversions that we do not have at present. That is a much larger issue, similar to our shipping our own timezone database. What does it buy us? o Do we ship it in our tarball? o Is the license compatible? o Does it remove utils/mb conversions? o Does it allow us to index LIKE (next high char)? o Does it allow us to support multiple encodings in a single database easier? o performance? I just had a similar though. And why use ICU only for multibyte charsets? If I use LATIN1, I still expect upper('?') = SS, and I don't get it... Same for the Turkish example. We assume the native toupper() can handle single-byte character encodings. We use towupper() only for wide character sets. That assumption is wrong,... Encoding latin1 Locale de* Select Upper('ß'); (lowercase german SS) Should return SS, but returns ß ... John ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 10.06.43 -0400 Bruce Momjian pgman@candle.pha.pa.us wrote: Palle Girgensohn wrote: --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] wrote: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversion tables. ICU also has a fair few conversions that we do not have at present. That is a much larger issue, similar to our shipping our own timezone database. What does it buy us? o Do we ship it in our tarball? o Is the license compatible? It looks pretty similar to BSD, although I'm a novice on the subject. o Does it remove utils/mb conversions? Yes, it would probably be possible to remove pg's own conversions. o Does it allow us to index LIKE (next high char)? I beleive so, using ICU's substring stuff. o Does it allow us to support multiple encodings in a single database easier? Heh, the ultimate dream. Perhaps? o performance? ICU in itself is said to be much faster than for example glibc. Problem is the need for conversion via UTF-16, which requires extra memory allocations and cpu cycles. I don't use glibc, but my very simple performance tests for FreeBSD show that it is similiar in speed. I just had a similar though. And why use ICU only for multibyte charsets? If I use LATIN1, I still expect upper('?') = SS, and I don't get it... Same for the Turkish example. We assume the native toupper() can handle single-byte character encodings. We use towupper() only for wide character sets. True, problem is that native toupper/towupper run one char at the time. This is a bad design decision in POSIX, there is no way it can handle the examples above unless considering more than one character. ICU does just that. /Palle ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] pgFoundry
On Sat, 7 May 2005, Greg Stark wrote: But tracking the status of sub-projects is just not the kind of thing free software people do. They send emails when they have something to say. in defence of Joshua's idea, there are some large projects within our development that would be nice to see some sort of 'time lines' for, but mainly those are the sorts of things that have set milestones to look for ... I believe that Simon's work on the PITR stuff might fall under that, where he had various 'stages' he was working towards ... But, I think that the # of sub projects that this would apply to is so few that setting anything up more formally then Simon sending out an updated patch would be more work then the derived benefit ... Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
John Hansen wrote: Bruce Momjian wrote: There are two reasons for that optimization --- first, some locale support is broken and Unicode encoding with a C locale crashes (not an issue for ICU), and second, it is an optimization for languages like Japanese that want to use unicode, but don't need a locale because upper/lower means nothing in those character sets. No, upper/lower means nothing in those languages, so why would you need to optimize upper/lower if they're not used?? True. I suppose it is for databases that use both Japanese and Latin alphabets and run upper() on all values. And if they are, it's obviously because the text contains characters from other languages (probably english) and as such they should behave correctly. Did I mention that for japanese and the like, ICU would also offer transliteration... Interesting. So, the first issue doesn't apply for ICU, and the second might not depending on what characters you are using in the Unicode character set. I guess I am little confused how ICU can do upper() when the locale is C. What is it using to determine A is upper for a? Am I confused? Simple, UNICODE basically consist of a table of characters (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) Excerpt: 0041;LATIN CAPITAL LETTER A;Lu;0;L;N0061; ... 0061;LATIN SMALL LETTER A;Ll;0;L;N;;;0041;;0041 From this you can see, that for 0041, which is capital letter A, there is a mapping to it's lowercase counterpart, 0061 Likewise, there is a mapping for 0061 which says it's uppercase counterpart is 0041. There is also SpecialCasing.txt which covers those mappings that haven't got a 1-1 mapping, such as the german SS. These mappings are fixed, independent of locale, only a few cases from specialcasing.txt depend on locale/context. As far as I know, the only way to use Unicode currently is to use a locale that is unicode-aware. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Patch for collation using ICU
John Hansen [EMAIL PROTECTED] writes: Where'd you get the licence from? It was the first thing I came across in their docs: http://icu.sourceforge.net/userguide/intro.html Looking more closely, it may be that this license is only intended to apply to the documentation and not the code ... though free code with un-free documentation isn't real useful. regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pl/pgsql enabled by default
On Sat, May 07, 2005 at 02:52:57PM +1000, Neil Conway wrote: So would you have us disable all the non-essential builtin functions? (Many of which have has security problems in the past.) What about the builtin encoding conversions, non-btree indexes, or a myriad of features that not all users need or use? This is not really analogous, because those are already on (and in most cases, not easily disabled). What you're arguing for is to add yet another on-by-default feature. Given that there's already a way to turn it on, why make it automatic? Moreover, if some repackager wants to make this more convenient, s/he can do so by turning it on by default. I don't see what's wrong with conservatism here. What makes sense for the default configuration of an operating system (which by nature must be hardened against attack) does not necessarily make sense for a database system. Indeed. But that doesn't mean that the principle isn't sound for both cases. I haven't seen an argument against that yet. A -- Andrew Sullivan | [EMAIL PROTECTED] The plural of anecdote is not data. --Roger Brinner ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] FC3 broken with HEAD
On Sat, May 07, 2005 at 04:27:04PM +1000, Neil Conway wrote: Tom Lane wrote: Yeah, we will. Please file a bugzilla entry for this though --- I concur that it is a linker bug. Okay, patch reverted. The RH bug is here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=157126 Huh, RH's bugzilla is really slow. Are they using the PostgreSQL-powered bugzilla? -- Alvaro Herrera ([EMAIL PROTECTED]) Syntax error: function hell() needs an argument. Please choose what hell you want to involve. ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] Race conditions, race conditions!
I promised an analysis of the problems Jan Wieck uncovered yesterday, so here it is. Jan had developed a testbed (which I hope he will post) that essentially has a bunch of client threads all doing instances of the same transaction in READ COMMITTED mode. The intended database invariant is that the number of rows in table t2 with a particular ID value is equal to the cnt field of the single row in t1 with that ID value: begin; select cnt from t1 where id = K for update; delete from t2 where id = K; -- check returned rowcount to see that exactly CNT rows were deleted insert into t2 values (K); -- repeat this N times update t1 set cnt = N where id = K; commit; K is randomly chosen for each execution from the known set of t1 keys, and N is randomly chosen each time as a small positive integer. This should maintain the invariant, since at any time only one transaction can be holding the SELECT FOR UPDATE lock on a particular t1 row. The trick is to run this under astronomical load; 100 or so client threads on a garden-variety PC is about right. What Jan was seeing was that every so often, a client thread would error out, reporting that its DELETE had deleted zero rows rather than the expected number. But subsequent examination showed the t2 rows as being there. Investigation showed that the connected backend had properly acquired FOR UPDATE lock on the t1 row, but the snapshot it was using for the subsequent DELETE showed the inserter of the t1 row as still running. This should be impossible, since the SELECT FOR UPDATE cannot lock an uncommitted row, and in READ COMMITTED mode we certainly will take a new snapshot for the DELETE. However, there is a race condition here. During transaction commit, xact.c first marks the transaction committed in pg_clog, and then clears its XID from PGPROC. This means there is a narrow window in which both TransactionIdDidCommit and TransactionIdIsInProgress will return true. (We cannot do it the other way around, because if neither one is returning true, onlookers are entitled to assume that the transaction has crashed.) However, the tqual.c routines will allow a row to be seen as committed as soon as TransactionIdDidCommit(xmin) returns true. So the scenario is: 1. Backend A does RecordTransactionCommit to mark itself committed in pg_clog, but then loses the CPU in the narrow window between doing that and clearing its PGPROC entry. Because of the ridiculous load, it doesn't get control back for awhile. 2. Backend B comes along to run the test transaction for the same K value. It inspects the t1 row, concludes it's committed, marks the row as locked FOR UPDATE, and returns the results to the client. 3. The client now issues the DELETE command. B takes a new snapshot, but because A is still not cleared out of PGPROC, A's transaction is shown as still running in the snapshot. 4. Now the DELETE will delete no rows, because it doesn't consider the t2 rows it should delete to be committed. AFAICS this race condition has always been there; certainly at least since Vadim put in MVCC, and it looks likely that the original Berkeley code had a form of the problem. The correct fix is that the tqual.c routines should check commit status in the following way: if (TransactionIdIsInProgress(xid)) // still in progress, don't touch tuple else if (TransactionIdDidCommit(xid)) // committed, mark accordingly else // must be aborted or crashed, mark accordingly rather than what they have traditionally done: if (TransactionIdDidCommit(xid)) // committed, mark accordingly else if (TransactionIdDidAbort(xid)) // aborted, mark accordingly else // assume still in progress, don't touch tuple Vadim recognized that the former logic was necessary for VACUUM to use in deciding if it could clean up dead tuples, but apparently he never made the extrapolation that it should be used *everywhere* transaction status is tested. The other interesting thing we saw was an Assertion failure. The postmaster log showed WARNING: relation t1 page 196 is uninitialized --- fixing TRAP: FailedAssertion(!PageHeader) ((PageHeader) pageHeader))-pd_upper == 0)), File: hio.c, Line: 263) LOG: server process (PID 11296) was terminated by signal 6 The WARNING could only have come from VACUUM. (Jan's testbed does launch a VACUUM every so often.) The Assert failure is in RelationGetBufferForTuple where it is adding a new page to a table. I interpret this as the guy doing RelationGetBufferForTuple added a page, but before he could initialize it, he lost control for long enough for a VACUUM to scan through the entire table, see the zeroed page, and fix it. Then when the first guy got control again, his Assert saying the page was zeroes failed. The window for this exists because bufmgr/smgr do physically extend the file,
Re: [HACKERS] FC3 broken with HEAD
Alvaro Herrera [EMAIL PROTECTED] writes: Huh, RH's bugzilla is really slow. Are they using the PostgreSQL-powered bugzilla? They are, but I think it's a fairly old version (7.3 IIRC). I'm still trying to get them to move some internal systems off 7.1 :-( regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] to_char(interval) issues
Bruce, * Prevent to_char() on interval from returning meaningless values * Allow to_char() on interval values to accumulate the highest unit requested Sounds like it would cover my use cases. Others? -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pl/pgsql enabled by default
Neil Conway [EMAIL PROTECTED] writes: Users can already define SQL functions by default, which certainly provides programmable features. I'm not quite sure what you mean by user exits. I guess I'm missing how pl/pgsql is a fundamentally greater security risk. plpgsql has control structures (loops, IF); SQL functions don't. That makes for a fundamental difference in the power of the programming language ... at least according to CS theory as I was taught it. Now admittedly the primitive statements of SQL are a lot more powerful than the primitive statements usually considered in programming language theory, but I think there is still a pretty significant difference in capability. An example of why this could be interesting from a security standpoint is that, given access to pg_shadow, it'd be pretty trivial to write a plpgsql function that tries to break user passwords by brute force (just generate possible passwords, hash them, and look for a match). I don't see any way to do that in plain SQL, at least not without a pre-existing SRF to generate the trial passwords for you. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 10.58.09 -0400 Tom Lane [EMAIL PROTECTED] wrote: John Hansen [EMAIL PROTECTED] writes: Where'd you get the licence from? It was the first thing I came across in their docs: http://icu.sourceforge.net/userguide/intro.html Looking more closely, it may be that this license is only intended to apply to the documentation and not the code ... though free code with un-free documentation isn't real useful. regards, tom lane Someone should ask the ICU folks about that... ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] Patch for database locale settings
Description added to TODO: * Allow locale to be set at database creation Currently locale can only be set during initdb. No global tables have locale-aware columns. However, the database template used during database creation might have locale-aware indexes. The indexes would need to be reindexed to match the new locale. --- Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Maybe it would work if we forced indexes on shared relations to be scanned using a fixed collation. The shared relations only have indexes on name, oid, and integer: select distinct atttypid::regtype from pg_class c join pg_attribute a on c.oid = a.attrelid where relisshared and relkind = 'i'; and name has non-locale-sensitive ordering rules anyway. So that's not the big problem; we could probably get away with decreeing that name will always be that way and that shared relations can't have locale-dependent indexes. The big problem (and the reason why this idea has been shot down in the past) is that CREATE DATABASE can't change the locale from what it is in the template database unless it's prepared to reindex any locale- sensitive indexes in the non-shared part of the template database. Which would be a difficult undertaking seeing that we can't even connect to the copied database until after commit. We could maybe say that we will never have any locale-dependent indexes at all on any system catalog, but what of user-defined tables in template databases? It would simply not work to do something as simple as creating a table with an indexed text column in template1. On the other hand you could argue that people already run the same kind of risk when changing database encoding at CREATE, which is a feature that's been there a long time and hasn't garnered many complaints. Not so much that their indexes will break as that their data will. So perhaps we should be willing to document don't do that. Certainly it would be a lot more useful if both locale and encoding could be set per-database. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] Cleaning up unreferenced table files
Maybe we should take a different approach to the problem: 1. Create new file with an extension to mark that it's not yet committed (eg. 1234.notcommitted) 2. ... 3. Take CheckpointStartLock 4. Write commit record to WAL, with list of created files. 5. rename created file (1234.notcommitted - 1234). 6. Release CheckpointStartLock This would guarantee that after successful WAL replay, all files in the data directory with .notcommitted extension can be safely deleted. No need to read pg_database or pg_class. We would take a performance hit because of the additional rename and fsync step. Also, we must somehow make sure that the new file or the directory it's in is fsynced on checkpoint to make sure that the rename is flushed to disk. A variant of the scheme would be to create two files on step 1. One would be the actual relfile (1234) and the other would an empty marker file (1234.notcommitted). That way the smgr code wouldn't have to care it the file is new or not when opening it. - Heikki On Thu, 5 May 2005, Tom Lane wrote: Bruce Momjian pgman@candle.pha.pa.us writes: Applied. Now that I've had a chance to look at it, this patch is thoroughly broken. Problems observed in a quick review: 1. It doesn't work at all for non-default tablespaces: it will claim that every file in such a tablespace is stale. The fact that it does that rather than failing entirely is accidental. It tries to read the database's pg_class in the target tablespace whether it's there or not. Because the system is still in recovery mode, the low-level routines allow the access to the nonexistent pg_class table to pass --- in fact they think they should create the file, so after it runs there's a bogus empty 1259 file in each such tablespace (which of course it complains about, too). The code then proceeds to think that pg_class is empty so of course everything draws a warning. 2. It's not robust against stale subdirectories of a tablespace (ie, subdirs corresponding to a nonexistent database) --- again, it'll try to read a nonexistent pg_class. Then it'll produce a bunch of off-target complaint messages. 3. It's assuming that relfilenode is unique database-wide, when no such assumption is safe. We only have a guarantee that it's unique tablespace-wide. 4. It fails to examine table segment files (such as nnn.1). These should be complained of when the nnn doesn't match any hash entry. 5. It will load every relfilenode value in pg_class into the hashtable whether it's meaningful or not. There should be a check on relkind. 6. I don't think relying on strtol to decide if a filename is entirely numeric is very safe. Note all the extra defenses in pg_atoi against various platform-specific misbehaviors of strtol. Personally I'd use a strspn test instead. 7. There are no checks for readdir failure (compare any other readdir loop in the backend). See also Simon Riggs' complaints that the circumstances under which it's done are pretty randomly selected. (One particular thing that I think is a bad idea is to do this in a standalone backend. Any sort of corruption in any db's pg_class would render it impossible to start up.) To fix the first three problems, and also avoid the performance problem of multiply rescanning a database's pg_class for each of its tablespaces, I would suggest that the hashtable entries be widened to RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then there should be one iteration over pg_database to learn the OIDs and default tablespaces of each database; with that you can read pg_class from its correct location for each database and load all the entries into the hashtable. Then you iterate through the tablespaces looking for stuff not present in the hashtable. You might also want to build a list or hashtable of known database OIDs, so that you can recognize a stale subdirectory immediately and issue a direct complaint about it without even recursing into it. regards, tom lane - Heikki ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] FC3 broken with HEAD
I think this has more to do with poor architectural decisions made in bugzilla. Things like storing the entire ticket history in one big record instead of normalizing it. This means you can't do anything to a ticket without munging through 10s of k of data. This also creates problems when multiple people are trying to update a single ticket at a time. -- greg ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pl/pgsql enabled by default
On 2005-05-07, Tom Lane [EMAIL PROTECTED] wrote: Neil Conway [EMAIL PROTECTED] writes: Users can already define SQL functions by default, which certainly provides programmable features. I'm not quite sure what you mean by user exits. I guess I'm missing how pl/pgsql is a fundamentally greater security risk. plpgsql has control structures (loops, IF); SQL functions don't. That makes for a fundamental difference in the power of the programming language ... at least according to CS theory as I was taught it. SQL functions do have control structures: CASE WHEN and recursion. I have even implemented generate_series() in pure SQL. Now admittedly the primitive statements of SQL are a lot more powerful than the primitive statements usually considered in programming language theory, but I think there is still a pretty significant difference in capability. The only thing that makes pg's SQL functions not turing-complete is the fact that recursion depth is not unlimited. In practice this isn't much of a restriction, since you can do large-scale iterations by using SQL sets and joins rather than recursion. An example of why this could be interesting from a security standpoint is that, given access to pg_shadow, it'd be pretty trivial to write a plpgsql function that tries to break user passwords by brute force (just generate possible passwords, hash them, and look for a match). I don't see any way to do that in plain SQL, at least not without a pre-existing SRF to generate the trial passwords for you. Writing a pure SQL SRF that generates a range of trial passwords is in fact not hard at all. The key point to note is that you can generate sets of literal values of moderate size using UNION ALL, and you can then cross-join those sets against themselves multiple times to generate much larger and more complex sets. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] pl/pgsql enabled by default
People: Before we get into more minutia regarding potential security risk of plpgsql, are there any reasons *other* than security to not enable it? -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Race conditions, race conditions!
Tom Lane [EMAIL PROTECTED] writes: The trick is to run this under astronomical load; 100 or so client threads on a garden-variety PC is about right. I wonder if there's an argument for building assertion-enabled builds with code that randomly yields the processor some percentage of time before and after taking a lock. It wouldn't catch every case but it might help. -- greg ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] pgFoundry
Greg, I'm rather surprised Postgres doesn't have a good bug tracking system. That's something most projects find pretty essential. Strangely enough the reason seems to be that Postgres really doesn't have many bugs... Unlike web browsers or GUIs or most of the other free software projects out there, databases don't tolerate bugs well. Any serious bug is cause for an immediate point release. The only use for a bug tracking system would really be for tracking all those pesky IWBNI bugs that never rise to urgent status. Actually, a bug tracker would be useful for two purposes: 1) Tracking bugs that are actually feature requests, in an effort (possibly futile) to cut down on the number of when will PostgreSQL be able to use an index on MAX()? requests that we get. A certain amount of this is in our FAQ, but the FAQ has the flaws of both not being easily searchable, and having a very manual update process so that it frequently gets out of date. 2) Tracking bugs that were fixed (and features that were added) in particular releases so that users know when they need to upgrade. For example, if a user had an index corruption problem with 7.4.1, it would be useful for them to know that an upgrade to 7.4.5 (as I recall) would fix it. Currently, they'd have to read all of the release notes from 7.4.2 through 7.4.5 and decipher insider terminonolgy to figure it out -- not always easy to do, and even harder to convince your boss. The problem is that a bug tracker would not be useful to the Postgresql *developers*; our current system works pretty well for us now. Except for one possibility: IF we had a formal bugtracker, are there people who are not currently contributing to PostgreSQL who would be willing/able to read, test, and analyse bug reports? With the addition of several companies to our community, it's a possibility, and would make the trouble of using a bug tracker worthwhile, I think. Comments? -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] pgFoundry
What does it mean to track the status of something? How would the status change except by discussion? What would be the point of announcing the status of something without allowing people to comment? No one said anything about not letting people comment or discuss. What I am suggesting is a better public presentation of what the heck is going on with PostgreSQL development. I think you have a severely flawed idea of how free software development proceeds. Then you obviously aren't paying attention. Look at other major OSS projects. They have these things in place. Even the Linux kernel has a bugzilla (although I am not advocating bugzilla). Not to mention KDE, Gnome, Debian.. These projects also have reasonably defined milestones for particular releases and show status of those milestones during the release. What you're describing sounds like something a manager of a commercial project would want. Perhaps it's something the managers of the people working on Postgres on behalf of some corporate sponsors might want but in those cases I doubt they would want the information to be public anyways. What I am describing is what other large OSS projects already do. In the free software world there's no top-down management of the project with managers issuing direction and expecting feedback reports. No but there are people in charge of particular tasks. There are people only working on certain things. Like the work that the people did on PITR. People only want tools that make their lives easier. Not tools that make other people's lives easier at the expense of their own convenience. The programmers are not beholden to any corporate interests (other than their own sponsors, who presumably are getting all the feedback they're looking for privately). I am not suggesting that anybody be beholden to anybody accept maybe the community itself. Sincerely, Joshua D. Drake ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] [PATCHES] Cleaning up unreferenced table files
Heikki Linnakangas [EMAIL PROTECTED] writes: Maybe we should take a different approach to the problem: 1. Create new file with an extension to mark that it's not yet committed (eg. 1234.notcommitted) This is pushing the problem into the wrong place, viz the lowest-level file access routines, which will now all have to know about .notcommitted status. It also creates race conditions --- think about backend A trying to commit file 1234 at about the same time that backend B is trying to flush some dirty buffers belonging to that file. But most importantly, it doesn't handle the file-deletion case. regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] pgFoundry
On Saturday 07 May 2005 15:31, Josh Berkus wrote: 2) Tracking bugs that were fixed (and features that were added) in particular releases so that users know when they need to upgrade. one idea that has been quasi floated before would be something equivalent to Simon Riggs developer summaries and/or Bruce's release history, that could be updated as appropriate for each development cycle. If someone were willing to maintain it, we could put it on the website along with the TODO list. -- Robert Treat Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Patch for collation using ICU
Tom Lane wrote: John Hansen [EMAIL PROTECTED] writes: Where'd you get the licence from? It was the first thing I came across in their docs: http://icu.sourceforge.net/userguide/intro.html Looking more closely, it may be that this license is only intended to apply to the documentation and not the code ... though free code with un-free documentation isn't real useful. Right, it seems to apply only to the resources found on sourceforge. regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] pl/pgsql enabled by default
Josh Berkus josh@agliodbs.com writes: Before we get into more minutia regarding potential security risk of plpgsql, are there any reasons *other* than security to not enable it? Several potential issues have already been mentioned in this thread, eg, what about shared library dependency vs linking plpgsql statically into the backend, how will we deal with existing dump files, how will it be possible for a DBA to disable the feature if he wishes. But security is the easiest to handwave about with no facts ;-) ... so if it gets shot down on that basis then no one need do the work to figure out solutions to the other questions. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Race conditions, race conditions!
Greg Stark [EMAIL PROTECTED] writes: I wonder if there's an argument for building assertion-enabled builds with code that randomly yields the processor some percentage of time before and after taking a lock. It wouldn't catch every case but it might help. Seems like that would mainly help you find cases where you'd put a lock acquire or release a bit too late or too soon in a sequence of events; not cases where you'd failed to acquire a needed lock at all. It'd be more useful I think to have a facility that randomly stops backends for awhile regardless of exactly where they are in the code. A high-load test case actually does this to some extent, but the problem is you have little reproducibility and no assurance that execution stopped for long enough to let critical events happen elsewhere. The ideal facility I think would slow one backend much more than others, whereas high load still leaves them all making progress at about the same rate ... regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pl/pgsql enabled by default
Andrew Sullivan wrote: This is not really analogous, because those are already on Which is my point: you're suggesting we retrofit a security policy onto PG that does not apply to the vast majority of the base system -- and that if applied would require fundamental changes. Indeed. But that doesn't mean that the principle isn't sound for both cases. I haven't seen an argument against that yet. Security (in the limited sense of disabling features by default) is not free; there is a tradeoff between security and convenience, security and administrative simplicity, and so on. Given that I have yet to see a single substantive argument for pl/pgsql being a security risk that has withstood any scrutiny, I don't see that the security side of the tradeoff has a lot of merit. -Neil ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Patch for collation using ICU
John Hansen [EMAIL PROTECTED] writes: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), I find that fairly unacceptable ... especially given the licensing questions, but in any case. It might be OK to rip out the existing conversion support and say that *if* you want encoding conversion, you have to use ICU. But I don't want to be told you cannot build PG without ICU period. The 3.2 vs 2.8 business is disturbing also; specifically, I don't think we get to require 3.2 on a platform where 2.8 is installed. People just aren't going to hold still for that, even assuming that ICU supports installing both versions at once, which isn't clear to me at the moment ... regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Patch for collation using ICU
Bruce Momjian wrote: (B (B There are two reasons for that optimization --- first, some (B locale support is broken and Unicode encoding with a C locale (B crashes (not an issue for ICU), and second, it is an (B optimization for languages like Japanese that want to use (B unicode, but don't need a locale because upper/lower means (B nothing in those character sets. (B (B No, upper/lower means nothing in those languages, so why would you need (B to optimize upper/lower if they're not used?? (B And if they are, it's obviously because the text contains characters (B from other languages (probably english) and as such they should behave (B correctly. (B (BYes, Japanese (and probably Chinese and Korean) languages include (BASCII character. More precisely ASCII is part of Japanese (Bencodings(LATIN1 is not, however). And we have no problem at all with (Bglibc/C locale. See below("unitest" is an UNICODE database). (B (Bunitest=# create table t1(t text); (BCREATE TABLE (Bunitest=# \encoding EUC_JP (Bunitest=# insert into t1 values('abc$B$"$$$&(B'); (BINSERT 1842628 1 (Bunitest=# select upper(t) from t1; (B upper (B--- (B ABC$B$"$$$&(B (B(1 row) (B (BSo Japanese(including ASCII)/UNICODE behavior is perfectly correct at (Bthis moment. So I strongly object removing that optimization. (B-- (BTatsuo Ishii (B (B---(end of broadcast)--- (BTIP 3: if posting/reading through Usenet, please send an appropriate (B subscribe-nomail command to [EMAIL PROTECTED] so that your (B message can get through to the mailing list cleanly
Re: [HACKERS] Patch for collation using ICU
Palle Girgensohn wrote: --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] wrote: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversion tables. I don't buy it. If current conversion tables does the right thing, why we need to replace. Or if conversion tables are not correct, why don't you fix it? I think the rule of character conversion will not change frequently, especially for LATIN languages. Thus maintaining cost is not too high. -- Tatsuo Ishii ICU also has a fair few conversions that we do not have at present. That is a much larger issue, similar to our shipping our own timezone database. What does it buy us? o Do we ship it in our tarball? o Is the license compatible? o Does it remove utils/mb conversions? o Does it allow us to index LIKE (next high char)? o Does it allow us to support multiple encodings in a single database easier? o performance? ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] pl/pgsql enabled by default
Neil Conway wrote: Andrew Sullivan wrote: This is not really analogous, because those are already on Security (in the limited sense of disabling features by default) is not free; there is a tradeoff between security and convenience, security and administrative simplicity, and so on. Given that I have yet to see a single substantive argument for pl/pgsql being a security risk that has withstood any scrutiny, I don't see that the security side of the tradeoff has a lot of merit. People who use views to achieve row security, which is a rather common paradigm, cannot allow users to create functions with side effects. Mike Mascari ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pl/pgsql enabled by default
Mike Mascari wrote: People who use views to achieve row security, which is a rather common paradigm, cannot allow users to create functions with side effects. Can you elaborate? I'm not sure I follow you. (I'll note anyway that (1) SQL functions can have side effects: CREATE FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example (2) Administrators can always choose to drop pl/pgsql for a particular database, disable it at initdb time, or REVOKE usage of pl/pgsql for particular users.) -Neil ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pl/pgsql enabled by default
Neil Conway wrote: Mike Mascari wrote: People who use views to achieve row security, which is a rather common paradigm, cannot allow users to create functions with side effects. Can you elaborate? I'm not sure I follow you. (I'll note anyway that (1) SQL functions can have side effects: CREATE FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example Wow. That's a problem IMHO. Many people use views to allow userA to query his salary without seeing userB's. If userA can just: 1. Create a SQL function which inserts into another table its arguments 2. Issue a query like: SELECT * FROM view_of_salaries_based_on_current_user WHERE my_side_effect_function_that_inserts_into_a_temp_table(salary, employee); and that function is writable with just 'SQL', then many, many people have a serious security risk on their hands. Perhaps this is why Oracle's standard UDFs cannot perform inserts, updates, or deletes. Mike Mascari ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Race conditions, race conditions!
Are we going to put the fixes into 8.0.3 and so on? Or will it be included in 8.0.4? -- Tatsuo Ishii I promised an analysis of the problems Jan Wieck uncovered yesterday, so here it is. Jan had developed a testbed (which I hope he will post) that essentially has a bunch of client threads all doing instances of the same transaction in READ COMMITTED mode. The intended database invariant is that the number of rows in table t2 with a particular ID value is equal to the cnt field of the single row in t1 with that ID value: begin; select cnt from t1 where id = K for update; delete from t2 where id = K; -- check returned rowcount to see that exactly CNT rows were deleted insert into t2 values (K); -- repeat this N times update t1 set cnt = N where id = K; commit; K is randomly chosen for each execution from the known set of t1 keys, and N is randomly chosen each time as a small positive integer. This should maintain the invariant, since at any time only one transaction can be holding the SELECT FOR UPDATE lock on a particular t1 row. The trick is to run this under astronomical load; 100 or so client threads on a garden-variety PC is about right. What Jan was seeing was that every so often, a client thread would error out, reporting that its DELETE had deleted zero rows rather than the expected number. But subsequent examination showed the t2 rows as being there. Investigation showed that the connected backend had properly acquired FOR UPDATE lock on the t1 row, but the snapshot it was using for the subsequent DELETE showed the inserter of the t1 row as still running. This should be impossible, since the SELECT FOR UPDATE cannot lock an uncommitted row, and in READ COMMITTED mode we certainly will take a new snapshot for the DELETE. However, there is a race condition here. During transaction commit, xact.c first marks the transaction committed in pg_clog, and then clears its XID from PGPROC. This means there is a narrow window in which both TransactionIdDidCommit and TransactionIdIsInProgress will return true. (We cannot do it the other way around, because if neither one is returning true, onlookers are entitled to assume that the transaction has crashed.) However, the tqual.c routines will allow a row to be seen as committed as soon as TransactionIdDidCommit(xmin) returns true. So the scenario is: 1. Backend A does RecordTransactionCommit to mark itself committed in pg_clog, but then loses the CPU in the narrow window between doing that and clearing its PGPROC entry. Because of the ridiculous load, it doesn't get control back for awhile. 2. Backend B comes along to run the test transaction for the same K value. It inspects the t1 row, concludes it's committed, marks the row as locked FOR UPDATE, and returns the results to the client. 3. The client now issues the DELETE command. B takes a new snapshot, but because A is still not cleared out of PGPROC, A's transaction is shown as still running in the snapshot. 4. Now the DELETE will delete no rows, because it doesn't consider the t2 rows it should delete to be committed. AFAICS this race condition has always been there; certainly at least since Vadim put in MVCC, and it looks likely that the original Berkeley code had a form of the problem. The correct fix is that the tqual.c routines should check commit status in the following way: if (TransactionIdIsInProgress(xid)) // still in progress, don't touch tuple else if (TransactionIdDidCommit(xid)) // committed, mark accordingly else // must be aborted or crashed, mark accordingly rather than what they have traditionally done: if (TransactionIdDidCommit(xid)) // committed, mark accordingly else if (TransactionIdDidAbort(xid)) // aborted, mark accordingly else // assume still in progress, don't touch tuple Vadim recognized that the former logic was necessary for VACUUM to use in deciding if it could clean up dead tuples, but apparently he never made the extrapolation that it should be used *everywhere* transaction status is tested. The other interesting thing we saw was an Assertion failure. The postmaster log showed WARNING: relation t1 page 196 is uninitialized --- fixing TRAP: FailedAssertion(!PageHeader) ((PageHeader) pageHeader))-pd_upper == 0)), File: hio.c, Line: 263) LOG: server process (PID 11296) was terminated by signal 6 The WARNING could only have come from VACUUM. (Jan's testbed does launch a VACUUM every so often.) The Assert failure is in RelationGetBufferForTuple where it is adding a new page to a table. I interpret this as the guy doing RelationGetBufferForTuple added a page, but before he could initialize it, he lost control for long enough for a VACUUM to scan through the entire table, see the zeroed page, and fix
Re: [HACKERS] Race conditions, race conditions!
Tatsuo Ishii wrote: Are we going to put the fixes into 8.0.3 and so on? Or will it be included in 8.0.4? We have removed 8.0.3 from the FTP servers and plan to re-release 8.0.3 and previous 7.X releases. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] pl/pgsql enabled by default
Mike Mascari wrote: Neil Conway wrote: Mike Mascari wrote: People who use views to achieve row security, which is a rather common paradigm, cannot allow users to create functions with side effects. Can you elaborate? I'm not sure I follow you. (I'll note anyway that (1) SQL functions can have side effects: CREATE FUNCTION foo() RETURNS VOID AS 'DELETE FROM ...', for example Wow. That's a problem IMHO. Many people use views to allow userA to query his salary without seeing userB's. If userA can just: 1. Create a SQL function which inserts into another table its arguments 2. Issue a query like: SELECT * FROM view_of_salaries_based_on_current_user WHERE my_side_effect_function_that_inserts_into_a_temp_table(salary, employee); and that function is writable with just 'SQL', then many, many people have a serious security risk on their hands. Perhaps this is why Oracle's standard UDFs cannot perform inserts, updates, or deletes. but the side effect function will only run (unless you set it with security definer) with the privileges of the caller - it won't grant visibility to things that user can't otherwise see. In any case, you should define your security setup with the capabilities / limitations of the db engine in mind. If there is any security problem in your scenario, it is that you appear to have made unwarranted assumptions about how postgres works, rather than that postgres has a problem. Either way, this does not illustrate how enabling plpgsql by default is a security risk. cheers andrew ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] pl/pgsql enabled by default
Andrew Dunstan wrote: Mike Mascari wrote: but the side effect function will only run (unless you set it with security definer) with the privileges of the caller - it won't grant visibility to things that user can't otherwise see. If the visibility is determined by view definitions, such as using CURRENT_USER, which is an exceedingly common practice, then the caller will be able to record tuples before they are filtered by the executor. In any case, you should define your security setup with the capabilities / limitations of the db engine in mind. If there is any security problem in your scenario, it is that you appear to have made unwarranted assumptions about how postgres works, rather than that postgres has a problem. I think most people coming from any other enterprise-class RDBMS environment will be surprised that they cannot use VIEWs to provide user-specific views on data. I could be wrong, but I'd put money on it... Either way, this does not illustrate how enabling plpgsql by default is a security risk. Correct, as the vulnerability exists within the 'SQL' language as well. The only difference is that enabling plpgsql by default changes it from a leak to a full blown flood. Mike Mascari ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pl/pgsql enabled by default
Mike Mascari wrote: Correct, as the vulnerability exists within the 'SQL' language as well. The only difference is that enabling plpgsql by default changes it from a leak to a full blown flood. How does it make any difference at all? -Neil ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: 1) new postgresql.conf item mbstr_check added. 2) if mbstr_check = 0 then invalid characters are not accepted (same as current PostgreSQL behavior). This is the default. 3) if mbstr_check = 1 then invalid characters are accepted with WARNING 4) if mbstr_check = 2 then invalid characters are accepted without any warnings 5) We have checked PostgreSQL source code if accepting invalid characters makes some troubles. We have found that we need to fix a place and the fix is included in the patches. Madison, If you are interested in the patches, I could send it to you. Hackers, Do you think the functionality something like above is worth to add to PostgreSQL? -- Tatsuo Ishii Hi all, I've been chasing down a bug and from what I have learned it may be because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles invalid unicode. I've been given some ideas on how to try to catch invalid unicode but it seems expensive so I am hoping there is a postgresql way to deal with this problem. I've run into a problem where a bulk postgres COPY... statement is dieing because one of the lines contains a file name with an invalid unicode character. In nautilus this file has '(invalid encoding)' and the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column file_name: Femme Fatal\u.url'. To actually look at the file from the shell (bash) shows what appears to be a whitespace but when I copy/paste the file name I get the '\u' you see above. I could, with the help of the TLUG people, use regex to match for an invalid character and skip the file but that is not ideal. The reason is that this is for my backup program and invalid unicode or not, the contents of the file may still be important and I would prefer to have it in the database so that it is later copied. I can copy and move the file in the shell so the file isn't apparently in an of itself corrupt. So then, is there a way I can tell postresql to accept the invalid unicode name? Here is a copy of my schema: tle-bu= \d file_info_2 Table public.file_info_2 Column| Type |Modifiers --+--+- file_group_name | text | file_group_uid | bigint | not null file_mod_time| bigint | not null file_name| text | not null file_parent_dir | text | not null file_perm| text | not null file_size| bigint | not null file_type| character varying(2) | not null default 'f'::character varying file_user_name | text | file_user_uid| bigint | not null file_backup | boolean | not null default true file_display | boolean | not null default false file_restore_display | boolean | not null default false file_restore | boolean | not null default false Indexes: file_info_2_display_idx btree (file_type, file_parent_dir, file_name) 'file_name' and 'file_parent_dir' are the columns that could have entries with the invalid unicode characters. Maybe I could/should use something other than 'text'? These columns could contain anything that a file or directory name could be. Thanks! Madison -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Madison Kelly (Digimer) TLE-BU, The Linux Experience; Back Up http://tle-bu.thelinuxexperience.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Patch for collation using ICU
Tom Lane wrote: John Hansen [EMAIL PROTECTED] writes: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), I find that fairly unacceptable ... especially given the licensing questions, but in any case. The licencing seems pretty clear to me. http://www-306.ibm.com/software/globalization/icu/license.jsp It might be OK to rip out the existing conversion support and say that *if* you want encoding conversion, you have to use ICU. But I don't want to be told you cannot build PG without ICU period. Right, that could be done, but I think the issue at heart is _are_ we going to use it at all, and if so, locale support would certainly benefit from going that way as well. The 3.2 vs 2.8 business is disturbing also; specifically, I don't think we get to require 3.2 on a platform where 2.8 is installed. There seems to be nothing in the ICU licence that would prevent us from bundling it. This would solve both the 3.2 vs 2.8 problems, and would remove the 'dependency'. People just aren't going to hold still for that, even assuming that ICU supports installing both versions at once, which isn't clear to me at the moment ... There's no problems with having both installed. I did that on debian to get the patch going. Tho, bundling it seems cleaner to me. regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
I don't buy it. If current conversion tables does the right thing, why we need to replace. Or if conversion tables are not correct, why don't you fix it? I think the rule of character conversion will not change frequently, especially for LATIN languages. Thus maintaining cost is not too high. I never said we need to, but if we're going to implement ICU, then we might as well go all the way. ... John ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Patch for collation using ICU
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 10:09 AM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU Bruce Momjian wrote: There are two reasons for that optimization --- first, some locale support is broken and Unicode encoding with a C locale crashes (not an issue for ICU), and second, it is an optimization for languages like Japanese that want to use unicode, but don't need a locale because upper/lower means nothing in those character sets. No, upper/lower means nothing in those languages, so why would you need to optimize upper/lower if they're not used?? And if they are, it's obviously because the text contains characters from other languages (probably english) and as such they should behave correctly. Yes, Japanese (and probably Chinese and Korean) languages include ASCII character. More precisely ASCII is part of Japanese encodings(LATIN1 is not, however). And we have no problem at all with glibc/C locale. See below(unitest is an UNICODE database). unitest=# create table t1(t text); CREATE TABLE unitest=# \encoding EUC_JP unitest=# insert into t1 values('abc'); INSERT 1842628 1 unitest=# select upper(t) from t1; upper --- ABC (1 row) So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like for example, whose uppercase is ) So I strongly object removing that optimization. I'm guessing this would call for a vote then, since if implementing ICU, then I'd have to object to leaving it in. Changing the bahaviour of ICU doesn't seem right. Changing the behaviour of pg, so that it works as it should when using unicode, seems the right solution to me. -- Tatsuo Ishii ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 12:01 PM To: [EMAIL PROTECTED] Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: That is just plain 100% wrong!! Under no circumstances should there be invalid data in a database. And if you're trying to make a database of invalid data, then at least encode it using a valid encoding. In fact, I've proposed strengthening the validation routines for UTF-8. ... John ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] rendezvous
On Sat, May 07, 2005 at 10:43:46AM +0200, Jochem van Dieten wrote: On 5/7/05, Alvaro Herrera wrote: On Fri, May 06, 2005 at 03:30:10PM -0700, Joshua D. Drake wrote: Rendezvous is the Apple network discovery protocol yes? That was renamed Bonjour by apple due to a Trademark problem. Maybe we should name it Zeroconf. Is the implemented protocol IETF ZeroConf or Apple Rendezvous? IIRC there are differences in the requirements for TTLs and I have had some experiences with Apple reserving local.arpa while the rest of the world doesn't. Huh, not sure. The only thing the code does is check for DNSServiceDiscovery.h and call a function therein. No implementation details are taken care of in the Postgres source. My bet is that it only works on Mac OS X. If there was interest I'm sure we could make it work using libhowl but I don't see anybody asking ... -- Alvaro Herrera ([EMAIL PROTECTED]) No necesitamos banderas No reconocemos fronteras (Jorge González) ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Madison Kelly wrote: Under most circumstances I would agree with you completely. In my case though I have to decide between risking a loss of a user's data or attempt to store the file name in some manner that would return the same name used by the file system. The user (or one of his/her users in the case of an admin) may be completely unaware of the file name being an invalid unicode name. The file itself though may still be quite valid and contain information worthy of backing up. I could notify the user/admin that the name is not valid but there is no way I could rely on the name being changed. Given the choices, I would prefer to attempt to store/use the file name with the invalid unicode character than simply ignore the file. Is there a way to store the name in raw binary? If so, would this not be safe because to postgresql it should no longer matter what data is or represents, right? Maybe there is a third option I am not yet concidering? Set the client_encoding to ascii when storing that name, and again when retrieving it. Or, use a bytea column. Madison ... John ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: Tatsuo Ishii wrote: So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like è for example, whose uppercase is È) That isn't ASCII. It's latin1 or some other ASCII extension. -- Alvaro Herrera ([EMAIL PROTECTED]) La principal característica humana es la tontería (Augusto Monterroso) ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
John Hansen [EMAIL PROTECTED] writes: Tatsuo Ishii wrote: We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. That is just plain 100% wrong!! That was my first reaction too. Why would this be a good idea? If someone does want an encoding-agnostic database, they can set it as SQL_ASCII. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Alvaro Herrera wrote: Sent: Sunday, May 08, 2005 2:49 PM To: John Hansen Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: Tatsuo Ishii wrote: So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like è for example, whose uppercase is È) That isn't ASCII. It's latin1 or some other ASCII extension. Point taken... But... If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode). UTF-8 encoded databases are very useful for representing multiple languages in the same database, but this usefulness vanishes if functions like upper/lower doesn't work correctly. So optimizing for 3 languages breaks more than a hundred, that's doesn't seem fair! -- Alvaro Herrera ([EMAIL PROTECTED]) La principal característica humana es la tontería (Augusto Monterroso) ... John ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] Can we get patents?
Hackers, I was reading LWN.net and noticed an article about Eben Moglen's keynote at linux.conf.au. Apparently he advises free software projects to get patents on their best ideas. Eben encouraged free software developers to record their novel inventions and to obtain patents on the best of them. Free legal help can be made available to obtain patents on the best ideas. Until the rules of the game can be changed, we must play the game, and having the right patents available may make all the difference in defending against an attack. http://lwn.net/Articles/133421/ Eben Moglen is the FSF's attorney. I'm wondering, could the PostgreSQL Foundation (or some other entity) get patents on some parts of Postgres? Maybe ResourceOwners for example; or the newer parts of the optimizer. The patents would be freely licensed to everyone (including commercial redistributors and developers/users of competing products), except to patent litigators, or something like that. -- Alvaro Herrera ([EMAIL PROTECTED]) La espina, desde que nace, ya pincha (Proverbio africano) ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 12:01 PM To: [EMAIL PROTECTED] Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: That is just plain 100% wrong!! Under no circumstances should there be invalid data in a database. And if you're trying to make a database of invalid data, then at least encode it using a valid encoding. In fact, I've proposed strengthening the validation routines for UTF-8. Actually I myself thought as you are before. Later I found that it was not so good idea. People already have invalid encoded data in their precious database and have very hard time to migrate to newer version of PostgreSQL because of encoding validation. Think about this kind of situation: There is a table t1(member_id integer primary key, member_name text, address text, phone text, email text). I have to reach each member by either adress, phone or email. Unfortunately some of address field have wrong encoded data. In this case I will use phone or email to reach them. Now I need to upgrade to newer PostgreSQL within 1 day. I know I have to fix wrong encoded field but it will take more than 1 day. So I would like to import the data first then fix wrong encoded field on running database since I can reach members by phone or email even with wrong encoded address field... I saw this kind of situation in the real world and that's why we developed the patches. -- Tatsuo Ishii ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Patch for collation using ICU
Alvaro Herrera wrote: Sent: Sunday, May 08, 2005 2:49 PM To: John Hansen Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: Tatsuo Ishii wrote: So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like è for example, whose uppercase is È) That isn't ASCII. It's latin1 or some other ASCII extension. Point taken... But... If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode). UTF-8 encoded databases are very useful for representing multiple languages in the same database, but this usefulness vanishes if functions like upper/lower doesn't work correctly. I'm just curious if Germany/French/Spanish mixed text can be sorted correctly. I think these languages need their own locales even with UNICODE/ICU. So optimizing for 3 languages breaks more than a hundred, that's doesn't seem fair! Why don't you add a GUC variable or some such to control the upper/lower behavior? -- Tatsuo Ishii ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org