[HACKERS] Optional postgres database not so optional in 8.1
On a fresh installation of postgrsql 8.1if you drop the 'postgres' database, psql, createdb, etc. no longer works. psql -l; ignores -dtemplate1, and createdb doesn't have such an option. Maybe it should fallback to template1 if the postgres database doesn't exist? ... John ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] someone working to add merge?
Jaime Casanova Wrote: But MERGE isn't REPLACE... REPLACE will delete old records to insert new ones; MERGE try to insert and if the record exists then can UPDATE just a few values, maybe incrementing them with a value (all the calculation are doing by the MERGE) That sounds like MySQL's 'INSERT INTO ... ON DUPLICATE KEY UPDATE', which they recommend over REPLACE anyways. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] someone working to add merge?
I Wrote: From the mysql manual: 'REPLACE works exactly like INSERT, except that if an old record in the table has the same value as a new record for a PRIMARY KEY or a UNIQUE index, the old record is deleted before the new record is inserted. See Section 13.2.4, INSERT Syntax.' It also says: Note that unless the table has a PRIMARY KEY or UNIQUE index, using a REPLACE statement makes no sense. It becomes equivalent to INSERT, because there is no index to be used to determine whether a new row duplicates another. ... John ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] someone working to add merge?
Tom Lane Wrote: Surely they require a unique constraint --- else the behavior isn't even well defined, is it? From the mysql manual: 'REPLACE works exactly like INSERT, except that if an old record in the table has the same value as a new record for a PRIMARY KEY or a UNIQUE index, the old record is deleted before the new record is inserted. See Section 13.2.4, INSERT Syntax.' ... John ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] PG Killed by OOM Condition
Good people, Just had a thought! Might it be worth while protecting the postmaster from an OOM Kill on Linux by setting /proc/{pid}/oom_adj to -17 ? (Described vaguely in mm/oom_kill.c) Kind Regards, John Hansen ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] PG Killed by OOM Condition
Martijn van Oosterhout Wrote: Has it actually happened to you? PostgreSQL is pretty good about its memory usage. Besides, seems to me it should be an system admisitrator descision. No, Just came across this by chance, and thought it might be a good idea. Perhaps as a postgresql.conf setting. ... John ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] PG Killed by OOM Condition
Tom Lane Wrote: (a) wouldn't that require root privilege? (b) how would we determine whether we are on a system to which this applies? (c) is it actually documented in a way that makes you think it'll be a permanently supported feature (ie, somewhere outside the source code)? (a) No, /proc/{pid}/* is owned by the process (b) /proc/{pid}/oom_adj exists ? (c) No, from the source: (not docbooked, we don't want this one cluttering up the manual) ... John ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Does anybody use ORDER BY x USING y?
Martijn van Oosterhout Wrote: All we lose is the ability to say USING [arbitrary op]. Does anybody use this. Would people object to requiring the operator after USING to be part of an operator class? Hmmm ... would this prevent the hackish workaround for case-insensitive sort? Err, which hackish workaround would that be? The right solution is citext which creates it's own operator class. This doesn't have anything to do with functional indexes either. I've been using Google to find any interesting use of the USING clause but havn't found any yet. I was actually of the impression that that was exacty what it was for: specifying what op(class) to use for the sort in case you wanted to use a non-default opclass for the type, and/or if the less-than operator wasn't called ''. ... John ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [ANNOUNCE] Welcome Core Team member Dave Page
Good on ya, Dave! ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Gavin M. Roy Sent: Friday, August 26, 2005 1:51 PM To: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [ANNOUNCE] Welcome Core Team member Dave Page Congrats Dave! On Aug 25, 2005, at 5:59 PM, Josh Berkus wrote: Project members: On behalf of the PostgreSQL Core Team, I welcome Dave Page. Dave has been the head of the pgODBC project for a couple of years, started the pgAdmin project in 1998, has been our lead webmaster for three years, and is now (split with Magnus) in charge of the Win32 packaging. In these latter two roles, Dave's help is indispensable to the PostgreSQL release process, so we decided it was time for him to have the title for the work he's already doing. We believe that anyone who's been around the project for a few years will agree. So, welcome Dave Page as the newest member of Core! -- --Josh Berkus Josh Berkus PostgreSQL Project Core Team www.postgresql.org ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster Gavin M. Roy 800 Pound Gorilla [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [ANNOUNCE] Welcome Core Team member Dave Page
Merlin Moncure Wrote: ... Be sure to mix in a request for better Unicode support at the same time, Dave loves that. As do I... :) ... John ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] FreeBSD ICU was Win32 unicode vs ICU
Kevin McArthur Wrote: Should the postgresql project also be looking at CLDR for cross-platform unicode support? Afaict, from the ICU website, ICU too uses CLDR. Why reinvent the wheel? ... John ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] contrib/rtree_gist into core system?
Tom Lane [mailto:[EMAIL PROTECTED] Wrote: There's no HOWTO for rtree either. Again, my point is not that one couldn't be written; it's that we would probably be better off spending the effort on a HOWTO for gist. No, but the _current_ implementation of the rtree operators are ver much self explaining and need no howto. Union(x,y) = x + y Intersect(x,y) = the values that are present in both x and y, or _overlapping_region_ Size(x) = the size of the area/length of the line, number of elements, etc... Now, how simple is that compared to gist? I for one, is yet to produce a working example of something as simple as indexing an array of 2 elements [x y] represented by a custom type as '[x y]' in string format (returned by type_out) internally stored as a char[2], so that I can fetch all rows where [x y] = ':y' (:y meaning 2nd element in array, x: meaning first element in array. I chose this as something simple to play with, having no practical application for me, but to get an understanding of gist, For now,. I have put it in the too hard basket. I did however in about half a day implement rtree support for inet/cidr (ipv4 only) as you might recall. Kind Regards, John ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] contrib/rtree_gist into core system?
Tom Lane Wrote: ... but rtree has always been marginal, and it's very hard to see where it can win over gist. Simplicity! Implementing rtree operators and support functions is FAR simpler than implementing the GiST equivalents. For example, suppose all you want to implement is the ~ operator for a custom type, then technically all you need is 4 functions (well, 5 including the stub operators) bool contains(type,type); type intersect(type,type); type union(type,type); void size(type,*float); And the 6 other operators simply defined as: bool false(type) { return false; } For GiST you still need 7 support functions + the operator function, some of which aren't exactly simple to implement, the picksplit for instance. So I'd not recommend getting rid of rtree just yet. At least not until someone has written an extensive howto on the subject of GiST implementation. ... John ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] Function's LEAST, GREATEST and DECODE (Oracle vararg polymorphic functions)
I'd vote that these functions should follow the semantics of the , and operators. (NULL x) is NULL; ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Friday, June 24, 2005 11:21 PM To: Pavel Stehule Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [PATCHES] Function's LEAST, GREATEST and DECODE (Oracle vararg polymorphic functions) [ moving to -hackers for a wider audience ] Today's issue: should the GREATEST/LEAST functions be strict (return null if any input is null) or not (return null only if all inputs are null, else return the largest/smallest of the non-null inputs)? Pavel Stehule [EMAIL PROTECTED] writes: On Thu, 23 Jun 2005, Tom Lane wrote: Pavel Stehule [EMAIL PROTECTED] writes: + /* If any argument is null, then result is null (for GREATEST + and LEAST)*/ Are you sure about that? The only reference I could find says that these functions are not strict in Oracle: http://download-east.oracle.com/otn_hosted_doc/rdb/pdf/sql_ref_v71_vo l1.pdf on page 2-185: The NULL keyword can appear in the list but is ignored. However, not all value expressions can be specified as NULL. That is, a non-NULL value expression must be in the list so that the data type for the expression can be determined. The GREATEST and LEAST functions can result in NULL only if at run time all value expressions result in NULL. The strict interpretation is mathematically cleaner, no doubt, but offhand it seems less useful. I know it, But when moustly PostgreSQL function is strict I desided so greatest and least will be strict. There is two analogy: one, normal comparing which implicate strinct aggregate function which ignore NULL. Tom I don't know, what is better. Maybe Oracle, because least(nullif(col2, +max), nullif(col2, +max)) isn't really readable, but it's precedens for PostgreSQL. I selected more conservative solution, but my patches are only start points for discussion (really) :). Please, if You think, so Oracle way is good, correct it. I'm still favoring non-strict but it deserves more than two votes. Anybody else have an opinion? regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] GiST rtree logic is not right
I'll look at problem after GiST concurrency. Fixing rtree_gist is bug a fix, not a new feature, so I'm not limited by 1 July. Wont fixing rtree(_gist) require initdb, since the behaviour of the operators will change? ... John ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] query plan ignoring check constraints
Bruno Wolff III [mailto:[EMAIL PROTECTED] Wrote I think the real problem is that check constraints on tables aren't used by the optimizer. Given that, what you have below is expected. There has been talk about that in the past, but I haven't heard anything recently about someone considering implenting that. For your problem consider not using a partial index. It isn't going to save anything if it has a constraint matching that of the table. Ahh, I get it now,... If a column has a CHECK (col IN (1,2,3)) and a query says .. WHERE col = 4; then the planner should know that the query will return 0 rows, right? ... John ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] query plan ignoring check constraints
Bruno Wolff III [mailto:[EMAIL PROTECTED] Wrote: You only want to use partial indexes when they don't cover the whole table. They make sense to enforce uniqueness of a column under some condition and when you can save significant space (becuase the condition is only satisfied for a small fraction of rows). Yes, I know that,. I misunderstood the original post as a request for queries NOT to use indexes where it doesn't match the table contents. .. John ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] query plan ignoring check constraints
Someone Wrote: Should not check constraint act as the first filter? The index should ideally be scanned only when the check constraint is passed by the search criteria but surprisingly it did not happen. The explain analyze showed cost for index scans of subtables that cannot contain rows matching the search criteria. Obviously, indexes on columns with a check constraint, should be qualified with the same check constraint. test=# CREATE TABLE test ( foo text check(foo IN ('YES','NO')) ); CREATE TABLE test=# CREATE INDEX text_foo_idx ON test (foo) WHERE foo IN('YES','NO'); CREATE INDEX test=# INSERT INTO test VALUES ('YES'); INSERT 280188 1 test=# INSERT INTO test VALUES ('NO'); INSERT 280189 1 test=# INSERT INTO test VALUES ('no'); ERROR: new row for relation test violates check constraint test_foo_check test=# EXPLAIN ANALYZE SELECT * FROM test WHERE foo = 'YES'; QUERY PLAN Index Scan using text_foo_idx on test (cost=0.00..5.82 rows=7 width=32) (actual time=0.369..0.376 rows=1 loops=1) Index Cond: (foo = 'YES'::text) Total runtime: 0.490 ms (3 rows) test=# EXPLAIN ANALYZE SELECT * FROM test WHERE foo = 'no'; QUERY PLAN Seq Scan on test (cost=0.00..25.38 rows=7 width=32) (actual time=0.358..0.358 rows=0 loops=1) Filter: (foo = 'no'::text) Total runtime: 0.421 ms (3 rows) test=# ... John ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Returning Composite Types from C functions
Hi all, CREATE TYPE my_type AS ( a int, b int, c int, d int, e int ); CREATE FUNCTION text_to_my_type(text) RETURNS my_type AS 'my_lib.so' LANGUAGE 'C' IMMUTABLE STRICT; CREATE CAST (text AS my_type) WITH FUNCTION text_to_my_type (text); SELECT ('1:2:3:4:5'::text::my_type).*; This results in the text_to_my_type(text) function being called no less than 5 times. Once for each element. Is this the desired behaviour, or a bug? ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Returning Composite Types from C functions
Michael Fuhr [mailto:[EMAIL PROTECTED] Wrote: Sent: Saturday, June 18, 2005 9:56 PM To: John Hansen Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Returning Composite Types from C functions On Sat, Jun 18, 2005 at 09:18:34PM +1000, John Hansen wrote: SELECT ('1:2:3:4:5'::text::my_type).*; This results in the text_to_my_type(text) function being called no less than 5 times. Once for each element. Is this the desired behaviour, or a bug? It's a known behavior with functions that return composite types. Apparently it's not easy to fix: http://archives.postgresql.org/pgsql-hackers/2005-04/msg00971.php There is a workaround tho, so should be fixable: SELECT (a.b).* FROM (SELECT ('1:2:3:4:5'::text::my_type) AS b) AS a; Or am I missing something? -- Michael Fuhr http://www.fuhr.org/~mfuhr/ ... John ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Returning Composite Types from C functions
Yes, it worked for me,... But my point is the workaround shouldn't be nescessary -Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Saturday, June 18, 2005 11:36 PM To: John Hansen Cc: Michael Fuhr; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Returning Composite Types from C functions John Hansen [EMAIL PROTECTED] writes: There is a workaround tho, so should be fixable: SELECT (a.b).* FROM (SELECT ('1:2:3:4:5'::text::my_type) AS b) AS a; Or am I missing something? Try it ;-) regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] LGPL
Josh, Both of these would be fine as add-ins to be distributed *separately* through pgFoundry or even the mirrors if they prove popular. Bundling them in unified distribution binaries with PostgreSQL would be a significant problem. You see this in other projects all the time: Requriements: __, which is GPL and can be downloaded from __ . We've managed so far to avoid needing external libraries which are not standard on most POSIX platforms, and it would be nice to keep it that way instead of doing the component easter egg hunt (which users of Linux multimedia apps are familiar with). This means that you're unlikely to be able to use glib unless it becomes standard on POSIX platforms, and someone makes a Windows port. Out of curiosity, what did you want to use it *for*? Ohh,. Just felt like a cleaner interface than ICU, As for a URI type, I don't see the problem with doing that as a PostgreSQL add-in downloadable from PGFoundry. Given the variety of URI implementations, I'm not sure we'd want a single URI type as standard anyway. That I don't know, Yet... However what I've come up with so far, has proven quite useful. Remember the all famous 'email' type? This has the same functionality, if not better create table email_addresses (email_address text,uri uri); insert into email_addresses (email_address) VALUES ('[EMAIL PROTECTED]'); update email_addresses set uri = 'mailto:'||email_address::text; -- could be a rule on insert! select (uri).username,(uri).host from email_addresses ; username | host --+ john | geeknet.com.au (1 row) And here's the really funky bit: select email_address = '[EMAIL PROTECTED]' from email_addresses ; ?column? -- f (1 row) select uri = 'mailto:[EMAIL PROTECTED]'::text::uri from email_addresses ; ?column? -- t (1 row) As it should, since email sent to the two email addresses would end up in the same mailbox The same applies to other URI formats of course. It appears this uri library is fully spec compliant. According to the FSF's junior licensing maven, building in a GPL data type or other plug-in would make *your instance* of PostgreSQL GPL, but so does PL/R and PostGIS, so that's nothing new. It just needs to be distributed separately. FYI, the reason the GPL linking issue is vague is that it depends on local copyright law, which varies from country to country and in the US from state to state. This is deliberate by the FSF because an agreement which depends on local copyright law is stronger in court than one which sets its own explicit terms. If anyone has nuts-and-bolts questions about GPL/LGPL issues, I have some friends at the FSF and can get answers from the horse's mouth. Thanks for the explanation -- Josh Berkus Aglio Database Solutions San Francisco ... John ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] max_fsm_pages 800k ... ?
INFO: analyzing pg_catalog.pg_depend INFO: pg_depend: 27 pages, 3866 rows sampled, 3866 estimated total rows INFO: free space map: 423 relations, 88475 pages stored; 431200 total pages needed DETAIL: Allocated FSM size: 4000 relations + 8 pages = 705 kB shared memory. This, on a database with a moderate amount of rows (10 Million) So,.. I'd say fairly normal. ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc G. Fournier Sent: Thursday, June 16, 2005 3:57 AM To: pgsql-hackers@postgresql.org Subject: [HACKERS] max_fsm_pages 800k ... ? Without more information to provide at this time, does the following seem unusual? INFO: free space map: 252 relations, 411494 pages stored; 738640 total pages needed DETAIL: Allocated FSM size: 2000 relations + 40 pages = 2463 kB shared memory. I know there are alot of factors involved in the above, which I'm looking into, but when I first saw the above, I just about had a heartattack, only because i've never seen such high #s for total pages needed ... ... Is this something that others are seeing as relatively normal? Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 8: explain analyze is your friend ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] PG_FREE_IF_COPY()
Given the following snippet: HeapTupleHeader tuple; Datum temp; bool isnull; tuple = PG_GETARG_HEAPTUPLEHEADER(0); temp = GetAttributeByName(tuple, data, isnull); When using this for a btree operator functions, you need to PG_FREE_IF_COPY(?,?); Which of the above parameters need I free? tuple, or temp, Or both? ... John ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] PG_FREE_IF_COPY()
Tom Lane [mailto:[EMAIL PROTECTED] wrote: John Hansen [EMAIL PROTECTED] writes: Given the following snippet: HeapTupleHeader tuple; Datum temp; bool isnull; tuple = PG_GETARG_HEAPTUPLEHEADER(0); temp = GetAttributeByName(tuple, data, isnull); When using this for a btree operator functions, you need to PG_FREE_IF_COPY(?,?); Which of the above parameters need I free? It'd probably be wise to do a FREE_IF_COPY on the tuple. GetAttributeByName, however, hasn't copied anything (if it's a pass-by-ref type you'll just get a pointer into the tuple). If you do a copy further down that might be a reason to free something, but this code alone isn't making a copy. Cool, thanks... regards, tom lane ... John ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] LGPL
Is there any reason why we would not be able to use LGPL code in PG? ... John ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] LGPL
What about GPL ? I assume that's out of the question! -Original Message- From: Marc G. Fournier [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 15, 2005 11:59 AM To: John Hansen Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] LGPL We already do ... libreadline ... On Wed, 15 Jun 2005, John Hansen wrote: Is there any reason why we would not be able to use LGPL code in PG? ... John ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] LGPL
Ooooh I got the impression that using GPL libraries was a Bad Thing(tm) ... John -Original Message- From: Andrew Dunstan [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 15, 2005 12:15 PM To: Marc G. Fournier Cc: John Hansen; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] LGPL Er, no. It's GPL, not LGPL software. My readline.h says: The GNU Readline Library is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version. see this ancient thread: http://archives.postgresql.org/pgsql-hackers/2000-12/msg01029.php cheers andrew Marc G. Fournier wrote: We already do ... libreadline ... On Wed, 15 Jun 2005, John Hansen wrote: Is there any reason why we would not be able to use LGPL code in PG? ... John ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] LGPL
So, what's the story with readline? -Original Message- From: Bruce Momjian [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 15, 2005 12:11 PM To: John Hansen Cc: Marc G. Fournier; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] LGPL John Hansen wrote: What about GPL ? I assume that's out of the question! If we add some GPL code, the entire binary becomes GPL, and that prevents closed-source commercial versions from being produced. -- - -Original Message- From: Marc G. Fournier [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 15, 2005 11:59 AM To: John Hansen Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] LGPL We already do ... libreadline ... On Wed, 15 Jun 2005, John Hansen wrote: Is there any reason why we would not be able to use LGPL code in PG? ... John ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED] -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] LGPL
Agreed. With libreadline, we are not taking their code or distributing it, but merely linking to it if it exists. Now, some say that is enough to make us GPL, but many don't agree with that interpretation. Right,. That's actually exactly what I meant: using GPL/LGPL libraries by linking to them. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] LGPL
Tom Lane [mailto:[EMAIL PROTECTED] Wrote: John Hansen [EMAIL PROTECTED] writes: Is there any reason why we would not be able to use LGPL code in PG? Another point of view on this: it's OK to use LGPL code if it's available on the local platform, so long as we don't *require* it to be present. It's even safer if the LGPL code is merely one implementation of an API that has other implementations under different licenses. For instance I have no fear at all of linking to glibc, and little of linking to libreadline (the latter because we can also use the BSD libedit). If we could not build without libreadline then we would have a very big problem. And we certainly aren't going to textually incorporate any new LGPL (or GPL) code into our distribution. Right,... Let me be more specific then, What are your thoughts on using the glib (http://developer.gnome.org/doc/API/2.2/glib/index.html) library for some functionality in pg? Additionally,. I came across this fine library (http://home.gna.org/uri/uri.en.html) which I'd like to use as a base for a new URI type, unfortunately it's GPL, so based on the above I'm guessing using it as is, is out of the question? regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] LGPL
Tom Lane [mailto:[EMAIL PROTECTED] Wrote: John Hansen [EMAIL PROTECTED] writes: Right,... Let me be more specific then, What are your thoughts on using the glib (http://developer.gnome.org/doc/API/2.2/glib/index.html) library for some functionality in pg? Right offhand that seems like a nonstarter. Exactly how would you use it in a way that didn't turn it into a required component? It looks to me like a collection of bits that are pretty useful but also very low-level, and hence not easily separable. K, that's what confused me as I got the impression it was ok to require LGPL libraries but not GPL. Additionally,. I came across this fine library (http://home.gna.org/uri/uri.en.html) which I'd like to use as a base for a new URI type, unfortunately it's GPL, so based on the above I'm guessing using it as is, is out of the question? Sure, you can do whatever you like with that ... as long as you're not expecting us to distribute the combined code as part of Postgres. It's worth reiterating here that GPL/LGPL code plus BSD code is no problem whatever for local development and use. It's only if you want to redistribute the result that you have to worry about what the licenses require. Since Postgres is a BSD-license project, *we* are not going to redistribute any GPL or LGPL code, nor any code that fundamentally depends on code that is so licensed. But you can pretty much do what you like in your own sandbox. In particular, you could develop a datatype that requires a GPL/LGPL library, and then distribute that code by itself as GPL/LGPL, and neither the GPL nor BSD camps would have any problem with that. Just don't expect us to put such code in a BSD distribution ... That's what I was afraid of regards, tom lane ... John ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] unicode upper/lower functions
Yes, Thank you! :) .. John -Original Message- From: Bruce Momjian [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 07, 2005 10:07 AM To: John Hansen Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] unicode upper/lower functions I think we have decided to use the ICU library to implement multiple locales. -- - John Hansen wrote: Hi list, Attached for your perusal, unicode versions of upper/lower, that work independent of locale except for the following languages: Turkish, Azeri, and Lithuanian. There are 15 locale specific cases in total not covered. -- John Hansen [EMAIL PROTECTED] GeekNET [ Attachment, skipping... ] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] unicode upper/lower functions
... Except,.. It was never decided if the 'C' locale optimisations was going to be removed if/when implementing ICU. Tho I think the conclusion was a postgresql.conf parameter to enable/disable the optimisations. Either way, this code is now obsolete. ... John -Original Message- From: Bruce Momjian [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 07, 2005 10:07 AM To: John Hansen Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] unicode upper/lower functions I think we have decided to use the ICU library to implement multiple locales. -- - John Hansen wrote: Hi list, Attached for your perusal, unicode versions of upper/lower, that work independent of locale except for the following languages: Turkish, Azeri, and Lithuanian. There are 15 locale specific cases in total not covered. -- John Hansen [EMAIL PROTECTED] GeekNET [ Attachment, skipping... ] ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] unicode upper/lower functions
Bruce Momjian wrote: John Hansen wrote: ... Except,.. It was never decided if the 'C' locale optimisations was going to be removed if/when implementing ICU. Uh, why would we remove it? Oh, meaning if the locale is C we bypass locale lookups? I think we will have to see what performance we have with things. Uhh, not quite: If locale is 'C' the current assumption is 7-bit ASCII for upper/lower/initcap. ICU is capable of properky doing upper/lower/initcap except for the cases described in this (obsolete) patch. Tho I think the conclusion was a postgresql.conf parameter to enable/disable the optimisations. Either way, this code is now obsolete. Thanks. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] Unicode characters above 0x10000 #2
Bruce, Attached patch replaces the original, applied today against CVS HEAD. Fixes the surrogates, and limits to 4 byte utf8 as per spec. Also extends UtfToLocal to 4 byte characters (tho, it does not add any, just enables the code to handle them. If my interpretation of this code is wrong, please let me know, and correct it). ... John -Original Message- From: Bruce Momjian [mailto:[EMAIL PROTECTED] Sent: Sunday, June 05, 2005 11:23 AM To: pgman@candle.pha.pa.us Cc: John Hansen; pgsql-hackers@postgresql.org; PostgreSQL-patches Subject: Re: [PATCHES] Unicode characters above 0x1 #2 Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. -- - pgman wrote: I have backed out this patch. It is unclear it is a bug fix. It will be saved for 8.1. -- - pgman wrote: Patch applied. Thanks. --- John Hansen wrote: 3 times lucky? Last one broke utf8 G This one works, Too tired, sorry for the inconvenience.. ... John Content-Description: cvs.diff [ Attachment, skipping... ] ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 === RCS file: /projects/cvsroot/pgsql/src/backend/utils/mb/wchar.c,v retrieving revision 1.38 diff -c -r1.38 wchar.c *** src/backend/utils/mb/wchar.c17 Sep 2004 21:59:57 - 1.38 --- src/backend/utils/mb/wchar.c21 Nov 2004 09:58:36 - *** *** 343,348 --- 343,373 return (pg_euc_dsplen(s)); } + bool isLegalUTF8(const UTF8 *source, int len) { + UTF8 a; + const UTF8 *srcptr = source+len; + if(!source || (pg_utf_mblen(source) != len)) return false; + switch (len) { + default: return false; + /* Everything else falls through when true... */ + case 6: if ((a = (*--srcptr)) 0x80 || a 0xBF) return false; + case 5: if ((a = (*--srcptr)) 0x80 || a 0xBF) return false; + case 4: if ((a = (*--srcptr)) 0x80 || a 0xBF) return false; + case 3: if ((a = (*--srcptr)) 0x80 || a 0xBF) return false; + case 2: if ((a = (*--srcptr)) 0xBF) return false; + switch (*source) { + /* no fall-through in this inner switch */ + case 0xE0: if (a 0xA0) return false; break; + case 0xF0: if (a 0x90) return false; break; + case 0xF4: if (a 0x8F) return false; break; + default: if (a 0x80) return false; + } + case 1: if (*source = 0x80 *source 0xC2) return false; + if (*source 0xFD) return false; + } + return true; + } + /* * convert UTF-8 string to pg_wchar (UCS-2) * caller should allocate enough space for to *** *** 398,404 * returns the byte length of a UTF-8 word pointed to by s */ int ! pg_utf_mblen(const unsigned char *s) { int len = 1; --- 423,429 * returns the byte length of a UTF-8 word pointed to by s */ int ! pg_utf_mblen(const UTF8 *s) { int len = 1; *** *** 406,418 len = 1; else if ((*s 0xe0) == 0xc0) len = 2; ! else if ((*s 0xe0) == 0xe0) ! len = 3; return (len); } static int ! pg_utf_dsplen(const unsigned char *s) { return 1; /* XXX fix me! */ } --- 431,449 len = 1; else if ((*s 0xe0) == 0xc0) len = 2; ! else if ((*s 0xf0) == 0xe0) ! len = 3
Re: [HACKERS] executing OS programs from pg
Look at peter eisentraut's procedural language PL/sh It's on pgfoundry. ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Saturday, June 04, 2005 5:16 AM To: Gevik babakhani Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] executing OS programs from pg Gevik babakhani [EMAIL PROTECTED] writes: Does anyone know how to execute an OS command from pgsql. I would like to create a trigger that op on firing would run/execute an external program. Use any of the untrusted PLs to execute system() or the like. Whether this is a good idea or not is a different question --- there are excellent reasons why it is a *bad* idea to execute outside-the-database actions from within a trigger. Mainly that the actions won't be undone if the transaction later rolls back, and now your database state is inconsistent with outside-the-database state. See the list archives for many past discussions of this point and safer ways to design your application. (BTW, this is hardly material for -hackers.) regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] character type value is not padded with spaces
Ahemm,... UNICODE DB: create table t (a char(10)); set client_encoding = iso88591; insert into t VALUES ('æøå'); select a, octet_length(a),length(a) from t; a | octet_length | length +--+ æøå| 13 | 3 (1 row) This is with 8.0.2. Just FYI. ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tatsuo Ishii Sent: Tuesday, May 24, 2005 8:52 AM To: [EMAIL PROTECTED] Cc: pgsql-patches@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [PATCHES] character type value is not padded with spaces Hackers, The problem he found is not only existing in Japanese characters but also in any multibyte encodings including UTF-8. For me the patch looks good and I will commit it to 7.3, 7.4, 8.0 stables and current if there's no objection. -- Tatsuo Ishii Character type value including multibyte characters is not padded with spaces. It reproduces at 7.3.x, 7.4.x and 8.0.x. create table t (a char(10)); insert into t values ('X'); -- X is 2byte character. I expect that 'X ' is inserted. But 'X' is inserted. select a, octed_length(a) from t; a | octet_length ---+-- X | 10 If padded with spaces, octet_length(a) is 15. This problem is caused that string length is calculated by byte length(VARSIZE) in exprTypmod(). I attache the patch for this problem. Regards, -- Yoshiyuki Asaba [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED]) ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] [PATCHES] character type value is not padded with spaces
Ahhh... -Original Message- From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 24, 2005 9:26 AM To: John Hansen Cc: [EMAIL PROTECTED]; pgsql-patches@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [PATCHES] character type value is not padded with spaces I think you need to test with 5 characters, not 3. -- Tatsuo Ishii Ahemm,... UNICODE DB: create table t (a char(10)); set client_encoding = iso88591; insert into t VALUES ('æøå'); select a, octet_length(a),length(a) from t; a | octet_length | length +--+ æøå| 13 | 3 (1 row) This is with 8.0.2. Just FYI. ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tatsuo Ishii Sent: Tuesday, May 24, 2005 8:52 AM To: [EMAIL PROTECTED] Cc: pgsql-patches@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [PATCHES] character type value is not padded with spaces Hackers, The problem he found is not only existing in Japanese characters but also in any multibyte encodings including UTF-8. For me the patch looks good and I will commit it to 7.3, 7.4, 8.0 stables and current if there's no objection. -- Tatsuo Ishii Character type value including multibyte characters is not padded with spaces. It reproduces at 7.3.x, 7.4.x and 8.0.x. create table t (a char(10)); insert into t values ('X'); -- X is 2byte character. I expect that 'X ' is inserted. But 'X' is inserted. select a, octed_length(a) from t; a | octet_length ---+-- X | 10 If padded with spaces, octet_length(a) is 15. This problem is caused that string length is calculated by byte length(VARSIZE) in exprTypmod(). I attache the patch for this problem. Regards, -- Yoshiyuki Asaba [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED]) ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Returning the name of a primary key
Tom, Juan, Wouldn't this simple SQL do the trick? CREATE OR REPLACE FUNCTION pk_column(text) RETURNS SETOF text AS ' SELECT attname::text FROM pg_class, pg_constraint, pg_attribute WHERE pg_class.oid = conrelid AND contype=''p'' AND attrelid = pg_class.oid AND attnum = ANY (conkey) AND relname=$1; ' LANGUAGE sql VOLATILE STRICT; -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Tuesday, May 17, 2005 4:49 AM To: Juan Pablo Espino Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Returning the name of a primary key Juan Pablo Espino [EMAIL PROTECTED] writes: I need to write a function that retrieve the name of at least one table primary key, if it exists. The only argument passed to the function is the table name. I have thought something like this: You need to be searching the list of indexes, not the attributes per se. ATExecDropNotNull() might be a useful example. regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] SQL_ASCII vs. 7-bit ASCII encodings
Personally, I'd like UTF8 to be the default encoding :) This is the 21st century :D I concur. ... John ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Patch for collation using ICU
Tatsuo Ishii wrote: Sent: Tuesday, May 10, 2005 5:45 PM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU Tatsuo Ishii wrote: Sent: Tuesday, May 10, 2005 12:32 AM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU -Original Message- From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] Sent: Sunday, May 08, 2005 11:08 PM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I don't buy it. If current conversion tables does the right thing, why we need to replace. Or if conversion tables are not correct, why don't you fix it? I think the rule of character conversion will not change frequently, especially for LATIN languages. Thus maintaining cost is not too high. I never said we need to, but if we're going to implement ICU, then we might as well go all the way. So you admit there's no benefit using ICU for replacing existing conversions? Besides ICU does not support all existing conversions, I think ICU has serious flaw for using conversion. If I understand correctly, ICU uses UNICODE internally to do the conversion. For example, to implement SJIS-EUC_JP conversion, ICU first converts SJIS to UNICODE then converts UNICODE to EUC_JP. Problem is these conversion is not roud trip(conversion between SJIS/EUC_JP and UNICODE will lose some information). Thus SJIS-EUC_JP-SJIS conversion using ICU does not preserve original text. Just for the record, I fetched a web page encoded in sjis, and converted it to euc-jp and back using uconv from ICU 3.2, and the result is the original is identical to the transformed file. uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc diff index.html index.html.sjis Not all SJIS/EUC_JP characters have the problem. You might want to try: Shift_JIS 0x81e6, 0x879a, 0xfa5b. BTW, I got this with ICU 3.2: $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found The contents of a.txt is 0xa1c1 which is a valid EUC_JP character. That actually makes perfect sense, since according to unicode.org's database: 301C ~ WAVE DASH This character was encoded to match JIS C 6226-1978 1-33 wave dash. The JIS standards and some industry practise disagree in mapping. - 3030 wavy dash - FF5E full width tilde In PG FF5E is the mapping currently used. That is obviously wrong (according to the standards), as that is only a 'similar character'. Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis doesn't define WAVE DASH. In all, I believe this behaviour to be correct according to the standards. There'd be nothing to stop us from defining alternative mappings for the cases where we deviate from the standard, but the question is, should we be non-standard? You missed the point. EUC_JP 0xa1c1 is a perfect valid data and uconv -f EUC_JP -t Shift_JIS should convert it to Shift_JIS 0x8160 regardless of the internal of uconv. Studying ICU forther, I found that it works fine, provided you use the _correct_ charset for the conversion.. a.txt contains 0x81 0x60 uconv -f ibm-943_P130-1999 -t EUC_JP a.txt -o b.txt b.txt now contains 0xa1 0xc1 uconv -t ibm-943_P130-1999 -f EUC_JP b.txt -o a.txt a.txt still contains 0x81 0x60 The mapping table you want is ibm-943_P130-1999 Similar, we'd need to find the right euc-jp (and plain jis) mapping, assuming we want the one that strictly defines JIS X 0208 right? I trust this to put your fears to rest... -- Tatsuo Ishii ... John ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
Tatsuo Ishii wrote: Sent: Tuesday, May 10, 2005 12:32 AM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU -Original Message- From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] Sent: Sunday, May 08, 2005 11:08 PM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I don't buy it. If current conversion tables does the right thing, why we need to replace. Or if conversion tables are not correct, why don't you fix it? I think the rule of character conversion will not change frequently, especially for LATIN languages. Thus maintaining cost is not too high. I never said we need to, but if we're going to implement ICU, then we might as well go all the way. So you admit there's no benefit using ICU for replacing existing conversions? Besides ICU does not support all existing conversions, I think ICU has serious flaw for using conversion. If I understand correctly, ICU uses UNICODE internally to do the conversion. For example, to implement SJIS-EUC_JP conversion, ICU first converts SJIS to UNICODE then converts UNICODE to EUC_JP. Problem is these conversion is not roud trip(conversion between SJIS/EUC_JP and UNICODE will lose some information). Thus SJIS-EUC_JP-SJIS conversion using ICU does not preserve original text. Just for the record, I fetched a web page encoded in sjis, and converted it to euc-jp and back using uconv from ICU 3.2, and the result is the original is identical to the transformed file. uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc diff index.html index.html.sjis Not all SJIS/EUC_JP characters have the problem. You might want to try: Shift_JIS 0x81e6, 0x879a, 0xfa5b. BTW, I got this with ICU 3.2: $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found The contents of a.txt is 0xa1c1 which is a valid EUC_JP character. That actually makes perfect sense, since according to unicode.org's database: 301C ~ WAVE DASH This character was encoded to match JIS C 6226-1978 1-33 wave dash. The JIS standards and some industry practise disagree in mapping. - 3030 wavy dash - FF5E full width tilde In PG FF5E is the mapping currently used. That is obviously wrong (according to the standards), as that is only a 'similar character'. Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis doesn't define WAVE DASH. In all, I believe this behaviour to be correct according to the standards. There'd be nothing to stop us from defining alternative mappings for the cases where we deviate from the standard, but the question is, should we be non-standard? This makes me nervous in using ICU... -- Tatsuo Ishii ... John ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Patch for collation using ICU
Tatsuo Ishii Sent: Sunday, May 08, 2005 3:41 PM To: John Hansen Cc: [EMAIL PROTECTED]; pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU Alvaro Herrera wrote: Sent: Sunday, May 08, 2005 2:49 PM To: John Hansen Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: Tatsuo Ishii wrote: So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like è for example, whose uppercase is È) That isn't ASCII. It's latin1 or some other ASCII extension. Point taken... But... If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode). UTF-8 encoded databases are very useful for representing multiple languages in the same database, but this usefulness vanishes if functions like upper/lower doesn't work correctly. I'm just curious if Germany/French/Spanish mixed text can be sorted correctly. I think these languages need their own locales even with UNICODE/ICU. No, they will not sort correctly, for that you still need the locale. So optimizing for 3 languages breaks more than a hundred, that's doesn't seem fair! That is a compromise I'd be willing to agree on. :) Why don't you add a GUC variable or some such to control the upper/lower behavior? -- Tatsuo Ishii ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 3:31 PM To: John Hansen Cc: [EMAIL PROTECTED]; pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 12:01 PM To: [EMAIL PROTECTED] Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: That is just plain 100% wrong!! Under no circumstances should there be invalid data in a database. And if you're trying to make a database of invalid data, then at least encode it using a valid encoding. In fact, I've proposed strengthening the validation routines for UTF-8. Actually I myself thought as you are before. Later I found that it was not so good idea. People already have invalid encoded data in their precious database and have very hard time to migrate to newer version of PostgreSQL because of encoding validation. Think about this kind of situation: There is a table t1(member_id integer primary key, member_name text, address text, phone text, email text). I have to reach each member by either adress, phone or email. Unfortunately some of address field have wrong encoded data. In this case I will use phone or email to reach them. Now I need to upgrade to newer PostgreSQL within 1 day. I know I have to fix wrong encoded field but it will take more than 1 day. So I would like to import the data first then fix wrong encoded field on running database since I can reach members by phone or email even with wrong encoded address field... Actually would be very simple, create function isvalidutf8(text) in your preferred language. C source is available from unicode.org. Create function converttoutf8(text) using whatever code is required to transform the _wrong_ encoding (SQL_ASCII - UTF8 for instance) to utf-8. Update table set field=converttoutf8(field) where !isvalidutf8(field); Now sit back and relax while your invalid data is converted to utf-8. When done, pg_dump the database, upgrade, and reload. This should take less than a day. I saw this kind of situation in the real world and that's why we developed the patches. -- Tatsuo Ishii ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
The source for ICU 3.2 is 9.8Mb in .tar.gz. PostgreSQL 8.0.2 is 13.2. That means the size of the distribution would almost *double* if we bundled ICU. Ermm,. Don't forget to remove the current charset conversions and locale support before making your size estimation. It's probably fine bundling it in the binary distributions (at least we'd probably do it on win32, since not many ppl will have it already there), but bundling the source seems a bit excessive to me. //Magnus ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 11:19 PM To: John Hansen Cc: [EMAIL PROTECTED]; pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: Tatsuo Ishii wrote: So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like è for example, whose uppercase is È) That isn't ASCII. It's latin1 or some other ASCII extension. Point taken... But... If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode). UTF-8 encoded databases are very useful for representing multiple languages in the same database, but this usefulness vanishes if functions like upper/lower doesn't work correctly. I'm just curious if Germany/French/Spanish mixed text can be sorted correctly. I think these languages need their own locales even with UNICODE/ICU. No, they will not sort correctly, for that you still need the locale. I'm confused. I thought the ICU patches is intended for using on broken locale platforms? Initially yes, but why duplicate code? What I meant was, that they will not sort correctly using the C locale. Locale _name_ needs to be known to ICU for it to sort correctly. -- Tatsuo Ishii ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Patch for collation using ICU
Tom Lane wrote: Sent: Monday, May 09, 2005 2:47 AM To: Palle Girgensohn Cc: Tatsuo Ishii; John Hansen; [EMAIL PROTECTED]; pgman@candle.pha.pa.us; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU Palle Girgensohn [EMAIL PROTECTED] writes: I'm confused. I thought the ICU patches is intended for using on broken locale platforms? It will sort correctly in *one* locale, using ICU. You still cannot mix different locales in the same database cluster, the collation locale is still fixed at initdb time, unfortunately. I thought the point of using ICU was to be able to dig out from under that restriction? It's a bit of a large pill to swallow if we will still have to throw it away someday to become SQL spec compliant. That is not a limitation of ICU but of postgresql. I don't know what the specs say, but imagine something like: SELECT foo FROM bar ORDER BY foo WITH LOCALE 'en_US', foobar WITH LOCALE 'jp_JP'; Which would be less difficult to implement using ICU. regards, tom lane ... John ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Patch for collation using ICU
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 11:08 PM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I don't buy it. If current conversion tables does the right thing, why we need to replace. Or if conversion tables are not correct, why don't you fix it? I think the rule of character conversion will not change frequently, especially for LATIN languages. Thus maintaining cost is not too high. I never said we need to, but if we're going to implement ICU, then we might as well go all the way. So you admit there's no benefit using ICU for replacing existing conversions? Besides ICU does not support all existing conversions, I think ICU has serious flaw for using conversion. If I understand correctly, ICU uses UNICODE internally to do the conversion. For example, to implement SJIS-EUC_JP conversion, ICU first converts SJIS to UNICODE then converts UNICODE to EUC_JP. Problem is these conversion is not roud trip(conversion between SJIS/EUC_JP and UNICODE will lose some information). Thus SJIS-EUC_JP-SJIS conversion using ICU does not preserve original text. Could you please send me a sample text as an attachment encoded in SJIS where this would happen? -- Tatsuo Ishii ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Patch for collation using ICU
-Original Message- From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] Sent: Sunday, May 08, 2005 11:08 PM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I don't buy it. If current conversion tables does the right thing, why we need to replace. Or if conversion tables are not correct, why don't you fix it? I think the rule of character conversion will not change frequently, especially for LATIN languages. Thus maintaining cost is not too high. I never said we need to, but if we're going to implement ICU, then we might as well go all the way. So you admit there's no benefit using ICU for replacing existing conversions? Besides ICU does not support all existing conversions, I think ICU has serious flaw for using conversion. If I understand correctly, ICU uses UNICODE internally to do the conversion. For example, to implement SJIS-EUC_JP conversion, ICU first converts SJIS to UNICODE then converts UNICODE to EUC_JP. Problem is these conversion is not roud trip(conversion between SJIS/EUC_JP and UNICODE will lose some information). Thus SJIS-EUC_JP-SJIS conversion using ICU does not preserve original text. Just for the record, I fetched a web page encoded in sjis, and converted it to euc-jp and back using uconv from ICU 3.2, and the result is the original is identical to the transformed file. uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc diff index.html index.html.sjis ... John ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Patch for collation using ICU
Where'd you get the licence from? None of that is in the licence I'm reading! (http://www-306.ibm.com/software/globalization/icu/index.jsp) (http://www-306.ibm.com/software/globalization/icu/license.jsp) ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Saturday, May 07, 2005 3:17 PM To: Bruce Momjian Cc: Palle Girgensohn; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU Bruce Momjian pgman@candle.pha.pa.us writes: Tom Lane wrote: Not until ICU is released under a BSD license ... Well, readline isn't BSD either, but we use it. It is any different? Did you read the license? Some of the more troubling bits: : It is the understanding of INTERNATIONAL BUSINESS MACHINES CORPORATION : that the purpose for which its publications are being reproduced is : accurate and true as stated in your attached request. (er, which attached request would that be?) : Permission to quote from or reprint IBM publications is limited to the : purpose and quantities originally requested and must not be construed as : a blanket license to use the material for other purposes or to reprint : other IBM copyrighted material. : IBM reserves the right to withdraw permission to reproduce copyrighted : material whenever, in its discretion, it feels that the privilege of : reproducing its material is being used in a way detrimental to its : interest or the above instructions are not being followed properly to : protect its copyright. : IBM may have patents or pending patent applications covering subject : matter in this document. The furnishing of this document does not give : you any license to these patents. You can send license inquiries, in : writing, to: : For license inquiries regarding double-byte (DBCS) information, contact : the IBM Intellectual Property Department in your country or send : inquiries, in writing, to: regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED]) ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Patch for collation using ICU
Errm,... initdb --encoding UNICODE --locale C -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Saturday, May 07, 2005 10:23 PM To: Palle Girgensohn; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Patch for collation using ICU
Bruce Momjian wrote: Palle Girgensohn wrote: Is this patch ready for application? I don't think so, not quite. I have not had any positive reports from linux users, this is only tested in a FreeBSD environment. I'd say it needs some more testing. OK. Also, apparently, ICU is installed by default in many linux distributions, and usually it is version 2.8. Some linux users have asked me if there are plans for a patch that works with ICU 2.8. That's probably a good idea. IBM and the ICU folks seem to consider 3.2 to be the stable version, older versions are hard to find on their sites, but most linux distributers seem to consider it too bleeding edge, even gentoo. I don't know why they don't agree. Good point. Why would linux folks need ICU? Doesn't their OS support encodings natively? I am particularly excited about this for OSs that don't have such encodings, like UTF8 support for Win32. Because ICU will not be used unless enabled by configure, it seems we are fine with only supporting the newest version. Do Linux users need to use ICU for any reason? Yes, because on many linux platforms locale support is broken. Also, ICU enables full unicode support, particularly in multi-language situations where locale is C, and makes upper/lower/initcap work as expected, except where it depends on locale information. There are also many other useful things in ICU that could be implemented. Transliteration, and break-iterators for example. Break-iteration particularly interresting for converting a text to a list of words. Another is it's builtin substring searches. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. The additional test is more of an optmization, and it fixes a problem with some OSs that have processing problems with UTF8 when the locale is supposed to be turned off, like in C. I realize ICU might be fine with it but the optimization still is an issue. That the locale is supposed to be turned off, doesn't mean it shouldn't use ICU. ICU is more than just locales. Why is so much code added, for example, in lower()? The existing multibyte code is much smaller, and lots of code is added in other places too. ICU uses UTF-16 internally, so all strings must be converted from the database encoding to UTF-16. Since that means the strings need to be copied, I took the same approach as in varlena.c:varstr_cmp(), where small strings use the heap and only larger strings use a palloc. Comments in varstr_cmp about performance made me use that approach. Oh, interesting. I think you need to create new functions that factor out that common code so the patch is smaller and easier to maintain. Also, in the latest patch, I also added checks and logging for *every* status returned from ICU. I hope this will help debugging on debian, where previous version didn't work. That excessive status checking is hardly be necessary once the stuff is better tested. I think the string copying and heap/palloc choices stands for most of the code bloat, together with the excessive status checking and logging. OK, move that into some common functions and I think it will be better. Why do you need to add a mapping of encoding names from iana to our names? This was already answered by John Hansen... There's an old thread here about the choice of the name UNICODE to describe an encoding, which it doesn't. There's half a dozen unicode based encodings... UTF-8 is used by postgresql, that would have been a better name... Similarly for most other encodings, really. ICU expect a setlocale(3) string (i.e. IANA). PostgreSQL can't provide it, so a mapping table is required. We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that help? I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. OK, I certainly would like this all done for 8.1 which should have feature freeze on July 1. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Don't 'kill -9
Re: [HACKERS] Patch for collation using ICU
--On lördag, maj 07, 2005 22.53.46 +1000 John Hansen [EMAIL PROTECTED] wrote: Errm,... initdb --encoding UNICODE --locale C You mean that ICU *shall* be used even for the C locale, and not as Bruce suggested here: Yes, that's exactly what I mean. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. John disagrees here, and I'm obliged to agree. Using the C locale, one will expect C collation, but upper/lower is better off still using ICU. Hence, the above stuff is *not* a bug. Do we agree? /Palle -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Saturday, May 07, 2005 10:23 PM To: Palle Girgensohn; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversion tables. ICU also has a fair few conversions that we do not have at present. Any thoughts? ... John -Original Message- From: John Hansen Sent: Saturday, May 07, 2005 11:09 PM To: 'Palle Girgensohn'; 'Bruce Momjian' Cc: 'pgsql-hackers@postgresql.org' Subject: RE: [HACKERS] Patch for collation using ICU --On lördag, maj 07, 2005 22.53.46 +1000 John Hansen [EMAIL PROTECTED] wrote: Errm,... initdb --encoding UNICODE --locale C You mean that ICU *shall* be used even for the C locale, and not as Bruce suggested here: Yes, that's exactly what I mean. I do have a few questions: Why don't you use the lc_ctype_is_c() part of this test? if (pg_database_encoding_max_length() 1 !lc_ctype_is_c()) Um, well, I didn't think about that. :) What would be the locale in this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a wide encoding, indeed. Then the strings will be handled like byte-wide chars. Yeah, it's a bug. I'll fix it! Thanks. John disagrees here, and I'm obliged to agree. Using the C locale, one will expect C collation, but upper/lower is better off still using ICU. Hence, the above stuff is *not* a bug. Do we agree? /Palle -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Saturday, May 07, 2005 10:23 PM To: Palle Girgensohn; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Small test: createdb --encoding UNICODE --locale C test psql test set client_encoding=iso88591; CREATE TABLE test (t text); INSERT INTO test (t) VALUES ('æøå'); set client_encoding=unicode; INSERT INTO test (t) SELECT upper(t) FROM test; set client_encoding=iso88591; SELECT * FROM test; t - æøå ÆØÅ (2 rows) Just as I'd expect, as upper/lower/initcap are locale independent for these characters. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Palle Girgensohn wrote: I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 UTF-16BE x-utf-16be ibm-1200 ibm-1201 ibm-5297 ibm-13488 ibm-17584 windows-1201 cp1200 cp1201 UTF16_BigEndian UTF-16LE x-utf-16le ibm-1202 ibm-13490 ibm-17586 UTF16_LittleEndian windows-1200 UTF-32 ISO-10646-UCS-4 csUCS4 ucs-4 UTF-32BE UTF32_BigEndian ibm-1232 ibm-1233 UTF-32LE UTF32_LittleEndian ibm-1234 UTF16_PlatformEndian UTF16_OppositeEndian UTF32_PlatformEndian UTF32_OppositeEndian UTF-7 windows-65000 IMAP-mailbox-name SCSU BOCU-1 csBOCU-1 CESU-8 ISO-8859-1 ibm-819 IBM819 cp819 latin1 8859_1 csISOLatin1 iso-ir-100 ISO_8859-1:1987 l1 819 US-ASCII ASCII ANSI_X3.4-1968 ANSI_X3.4-1986 ISO_646.irv:1991 iso_646.irv:1983 ISO646-US us csASCII iso-ir-6 cp367 ascii7 646 windows-20127 gb18030 ibm-1392 windows-54936 ibm-367_P100-1995 ibm-367 IBM367 ibm-912_P100-1995 ibm-912 iso-8859-2 ISO_8859-2:1987 latin2 csISOLatin2 iso-ir-101 l2 8859_2 cp912 912 windows-28592 ibm-913_P100-2000 ibm-913 iso-8859-3 ISO_8859-3:1988 latin3 csISOLatin3 iso-ir-109 l3 8859_3 cp913 913 windows-28593 ibm-914_P100-1995 ibm-914 iso-8859-4 latin4 csISOLatin4 iso-ir-110 ISO_8859-4:1988 l4 8859_4 cp914 914 windows-28594 ibm-915_P100-1995 ibm-915 iso-8859-5 cyrillic csISOLatinCyrillic iso-ir-144 ISO_8859-5:1988 8859_5 cp915 915 windows-28595 ibm-1089_P100-1995 ibm-1089 iso-8859-6 arabic csISOLatinArabic iso-ir-127 ISO_8859-6:1987 ECMA-114 ASMO-708 8859_6 cp1089 1089 windows-28596 ISO-8859-6-I ISO-8859-6-E ibm-813_P100-1995 ibm-813 iso-8859-7 greek greek8 ELOT_928 ECMA-118 csISOLatinGreek iso-ir-126 ISO_8859-7:1987 8859_7 cp813 813 windows-28597 ibm-916_P100-1995 ibm-916 iso-8859-8 hebrew csISOLatinHebrew iso-ir-138 ISO_8859-8:1988 ISO-8859-8-I ISO-8859-8-E 8859_8 cp916 916 windows-28598 ibm-920_P100-1995 ibm-920 iso-8859-9 latin5 csISOLatin5 iso-ir-148 ISO_8859-9:1989 l5 8859_9 cp920 920 windows-28599 ECMA-128 ibm-921_P100-1995 ibm-921 iso-8859-13 8859_13 cp921 921 ibm-923_P100-1998 ibm-923 iso-8859-15 Latin-9 l9 8859_15 latin0 csisolatin0 csisolatin9 iso8859_15_fdis cp923 923 windows-28605 ibm-942_P12A-1999 ibm-942 ibm-932 cp932 shift_jis78 sjis78 ibm-942_VSUB_VPUA ibm-932_VSUB_VPUA ibm-943_P15A-2003 ibm-943 Shift_JIS MS_Kanji csShiftJIS windows-31j csWindows31J x-sjis x-ms-cp932 cp932 windows-932 cp943c IBM-943C ms932 pck sjis ibm-943_VSUB_VPUA ibm-943_P130-1999 ibm-943 Shift_JIS cp943 943 ibm-943_VASCII_VSUB_VPUA ibm-33722_P12A-1999 ibm-33722 ibm-5050 EUC-JP Extended_UNIX_Code_Packed_Format_for_Japanese csEUCPkdFmtJapanese X-EUC-JP eucjis windows-51932 ibm-33722_VPUA IBM-eucJP ibm-33722_P120-1999 ibm-33722 ibm-5050 cp33722 33722 ibm-33722_VASCII_VPUA ibm-954_P101-2000 ibm-954 EUC-JP ibm-1373_P100-2002 ibm-1373 windows-950 windows-950-2000 Big5 csBig5 windows-950 x-big5 ibm-950_P110-1999 ibm-950 cp950 950 macos-2566-10.2 Big5-HKSCS big5hk HKSCS-BIG5 ibm-1375_P100-2003 ibm-1375 Big5-HKSCS ibm-1386_P100-2002 ibm-1386 cp1386 windows-936 ibm-1386_VSUB_VPUA windows-936-2000 GBK CP936 MS936 windows-936 ibm-1383_P110-1999 ibm-1383 GB2312 csGB2312 EUC-CN ibm-eucCN hp15CN cp1383 1383 ibm-1383_VPUA ibm-5478_P100-1995 ibm-5478 GB_2312-80 chinese iso-ir-58 csISO58GB231280 gb2312-1980 GB2312.1980-0 ibm-964_P110-1999 ibm-964 EUC-TW ibm-eucTW cns11643 cp964 964 ibm-964_VPUA ibm-949_P110-1999 ibm-949 cp949 949 ibm-949_VASCII_VSUB_VPUA ibm-949_P11A-1999 ibm-949 cp949c ibm-949_VSUB_VPUA ibm-970_P110-1995 ibm-970 EUC-KR KS_C_5601-1987 windows-51949 csEUCKR ibm-eucKR KSC_5601 5601 ibm-970_VPUA ibm-971_P100-1995 ibm-971 ibm-971_VPUA ibm-1363_P11B-1998 ibm-1363 KS_C_5601-1987 KS_C_5601-1989 KSC_5601 csKSC56011987 korean iso-ir-149 5601 cp1363 ksc windows-949 ibm-1363_VSUB_VPUA ibm-1363_P110-1997 ibm-1363 ibm-1363_VASCII_VSUB_VPUA windows-949-2000 windows-949 KS_C_5601-1987 KS_C_5601-1989 KSC_5601 csKSC56011987 korean iso-ir-149 ms949 ibm-1162_P100-1999 ibm-1162 ibm-874_P100-1995 ibm-874 ibm-9066 cp874 TIS-620 tis620.2533 eucTH cp9066 windows-874-2000 TIS-620 windows-874 MS874 ibm-437_P100-1995 ibm-437 IBM437 cp437 437 csPC8CodePage437 windows-437 ibm-850_P100-1995 ibm-850 IBM850 cp850 850 csPC850Multilingual windows-850 ibm-851_P100-1995 ibm-851 IBM851 cp851 851 csPC851 ibm-852_P100-1995 ibm-852 IBM852 cp852 852 csPCp852 windows-852 ibm-855_P100-1995 ibm-855 IBM855 cp855 855 csIBM855 csPCp855 ibm-856_P100-1995 ibm-856 cp856 856 ibm-857_P100-1995 ibm-857 IBM857 cp857 857 csIBM857 windows-857 ibm-858_P100-1997 ibm-858 IBM00858 CCSID00858 CP00858 PC-Multilingual-850+euro cp858 ibm-860_P100-1995 ibm-860 IBM860 cp860 860 csIBM860
Re: [HACKERS] Patch for collation using ICU
-Original Message- From: Palle Girgensohn [mailto:[EMAIL PROTECTED] Sent: Saturday, May 07, 2005 11:30 PM To: John Hansen; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: RE: [HACKERS] Patch for collation using ICU --On lördag, maj 07, 2005 23.25.15 +1000 John Hansen [EMAIL PROTECTED] wrote: Palle Girgensohn wrote: I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is fogiving, I don't remember/know, but I think we need the mappings, unfortunately. Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 No UTF8 in there. I think that's good, charset aliases are a hassle. Yup! :) /Palle ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Patch for collation using ICU
-Original Message- From: Palle Girgensohn [mailto:[EMAIL PROTECTED] Sent: Saturday, May 07, 2005 11:33 PM To: John Hansen; Bruce Momjian Cc: pgsql-hackers@postgresql.org Subject: RE: [HACKERS] Patch for collation using ICU --On lördag, maj 07, 2005 22.22.52 +1000 John Hansen [EMAIL PROTECTED] wrote: I use this patch in production on one FreeBSD 4.10 server at the moment. With the latest version, I've had no problems. Logging is swithed on for now, and it shows no signs of ICU complaining. I'd like more reports on Linux, though. I currently use this on gentoo with ICU3.2 unmasked. Works a dream, even with locale C and UNICODE database. Ah, good to hear, John. I beleive your report about linux is what's keeping this back. Did you also manage to get it running on Debian? Not without ICU3.2 /Palle ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
Did you try the latest patch? Maybe it will help, and if not, it will (hopefully) give a lot more informative error messages. No, and I got rid of my debian boxes @ home. The patch required a certain amount of modifications too, to even compile with 2.8. So I guess it's a valid question to ask: it it worth supporting 2.8? It is of course an option to bundle icu 3.2 with pg! ... John ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Patch for collation using ICU
-Original Message- From: Andrew Dunstan [mailto:[EMAIL PROTECTED] Sent: Saturday, May 07, 2005 11:39 PM To: John Hansen Cc: Palle Girgensohn; Bruce Momjian; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU John Hansen wrote: Here is the list of encoding names and aliases the ICU accepts as of 3.2: (it's a bit long...) UTF-8 ibm-1208 ibm-1209 ibm-5304 ibm-5305 windows-65001 cp1208 UTF-16 ISO-10646-UCS-2 unicode csUnicode ucs-2 [snip] Don't we use unicode as an alias for UTF-8 ? Yes, unfortunately! cheers andrew ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Bruce Momjian wrote: There are two reasons for that optimization --- first, some locale support is broken and Unicode encoding with a C locale crashes (not an issue for ICU), and second, it is an optimization for languages like Japanese that want to use unicode, but don't need a locale because upper/lower means nothing in those character sets. No, upper/lower means nothing in those languages, so why would you need to optimize upper/lower if they're not used?? And if they are, it's obviously because the text contains characters from other languages (probably english) and as such they should behave correctly. Did I mention that for japanese and the like, ICU would also offer transliteration... So, the first issue doesn't apply for ICU, and the second might not depending on what characters you are using in the Unicode character set. I guess I am little confused how ICU can do upper() when the locale is C. What is it using to determine A is upper for a? Am I confused? Simple, UNICODE basically consist of a table of characters (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) Excerpt: 0041;LATIN CAPITAL LETTER A;Lu;0;L;N0061; ... 0061;LATIN SMALL LETTER A;Ll;0;L;N;;;0041;;0041 From this you can see, that for 0041, which is capital letter A, there is a mapping to it's lowercase counterpart, 0061 Likewise, there is a mapping for 0061 which says it's uppercase counterpart is 0041. There is also SpecialCasing.txt which covers those mappings that haven't got a 1-1 mapping, such as the german SS. These mappings are fixed, independent of locale, only a few cases from specialcasing.txt depend on locale/context. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
It seems 3.2 has much more support and bug fixes, I'm not sure if we should really consider 2.8? As I said, probably not worth the effort. ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Patch for collation using ICU
Bruce Momjian wrote: Palle Girgensohn wrote: --On l?rdag, maj 07, 2005 23.15.29 +1000 John Hansen [EMAIL PROTECTED] wrote: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), as this would seem like a cleaner solution than for us to maintain our own conversion tables. ICU also has a fair few conversions that we do not have at present. That is a much larger issue, similar to our shipping our own timezone database. What does it buy us? o Do we ship it in our tarball? o Is the license compatible? o Does it remove utils/mb conversions? o Does it allow us to index LIKE (next high char)? o Does it allow us to support multiple encodings in a single database easier? o performance? I just had a similar though. And why use ICU only for multibyte charsets? If I use LATIN1, I still expect upper('?') = SS, and I don't get it... Same for the Turkish example. We assume the native toupper() can handle single-byte character encodings. We use towupper() only for wide character sets. That assumption is wrong,... Encoding latin1 Locale de* Select Upper('ß'); (lowercase german SS) Should return SS, but returns ß ... John ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
Tom Lane wrote: John Hansen [EMAIL PROTECTED] writes: Where'd you get the licence from? It was the first thing I came across in their docs: http://icu.sourceforge.net/userguide/intro.html Looking more closely, it may be that this license is only intended to apply to the documentation and not the code ... though free code with un-free documentation isn't real useful. Right, it seems to apply only to the resources found on sourceforge. regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
Tom Lane wrote: John Hansen [EMAIL PROTECTED] writes: Btw, I had been planning to propose replacing every single one of the built in charset conversion functions with calls to ICU (thus making pg _depend_ on ICU), I find that fairly unacceptable ... especially given the licensing questions, but in any case. The licencing seems pretty clear to me. http://www-306.ibm.com/software/globalization/icu/license.jsp It might be OK to rip out the existing conversion support and say that *if* you want encoding conversion, you have to use ICU. But I don't want to be told you cannot build PG without ICU period. Right, that could be done, but I think the issue at heart is _are_ we going to use it at all, and if so, locale support would certainly benefit from going that way as well. The 3.2 vs 2.8 business is disturbing also; specifically, I don't think we get to require 3.2 on a platform where 2.8 is installed. There seems to be nothing in the ICU licence that would prevent us from bundling it. This would solve both the 3.2 vs 2.8 problems, and would remove the 'dependency'. People just aren't going to hold still for that, even assuming that ICU supports installing both versions at once, which isn't clear to me at the moment ... There's no problems with having both installed. I did that on debian to get the patch going. Tho, bundling it seems cleaner to me. regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
I don't buy it. If current conversion tables does the right thing, why we need to replace. Or if conversion tables are not correct, why don't you fix it? I think the rule of character conversion will not change frequently, especially for LATIN languages. Thus maintaining cost is not too high. I never said we need to, but if we're going to implement ICU, then we might as well go all the way. ... John ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Patch for collation using ICU
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 10:09 AM To: John Hansen Cc: pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU Bruce Momjian wrote: There are two reasons for that optimization --- first, some locale support is broken and Unicode encoding with a C locale crashes (not an issue for ICU), and second, it is an optimization for languages like Japanese that want to use unicode, but don't need a locale because upper/lower means nothing in those character sets. No, upper/lower means nothing in those languages, so why would you need to optimize upper/lower if they're not used?? And if they are, it's obviously because the text contains characters from other languages (probably english) and as such they should behave correctly. Yes, Japanese (and probably Chinese and Korean) languages include ASCII character. More precisely ASCII is part of Japanese encodings(LATIN1 is not, however). And we have no problem at all with glibc/C locale. See below(unitest is an UNICODE database). unitest=# create table t1(t text); CREATE TABLE unitest=# \encoding EUC_JP unitest=# insert into t1 values('abc'); INSERT 1842628 1 unitest=# select upper(t) from t1; upper --- ABC (1 row) So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like for example, whose uppercase is ) So I strongly object removing that optimization. I'm guessing this would call for a vote then, since if implementing ICU, then I'd have to object to leaving it in. Changing the bahaviour of ICU doesn't seem right. Changing the behaviour of pg, so that it works as it should when using unicode, seems the right solution to me. -- Tatsuo Ishii ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 12:01 PM To: [EMAIL PROTECTED] Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: That is just plain 100% wrong!! Under no circumstances should there be invalid data in a database. And if you're trying to make a database of invalid data, then at least encode it using a valid encoding. In fact, I've proposed strengthening the validation routines for UTF-8. ... John ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Madison Kelly wrote: Under most circumstances I would agree with you completely. In my case though I have to decide between risking a loss of a user's data or attempt to store the file name in some manner that would return the same name used by the file system. The user (or one of his/her users in the case of an admin) may be completely unaware of the file name being an invalid unicode name. The file itself though may still be quite valid and contain information worthy of backing up. I could notify the user/admin that the name is not valid but there is no way I could rely on the name being changed. Given the choices, I would prefer to attempt to store/use the file name with the invalid unicode character than simply ignore the file. Is there a way to store the name in raw binary? If so, would this not be safe because to postgresql it should no longer matter what data is or represents, right? Maybe there is a third option I am not yet concidering? Set the client_encoding to ascii when storing that name, and again when retrieving it. Or, use a bytea column. Madison ... John ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
Alvaro Herrera wrote: Sent: Sunday, May 08, 2005 2:49 PM To: John Hansen Cc: Tatsuo Ishii; pgman@candle.pha.pa.us; [EMAIL PROTECTED]; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Patch for collation using ICU On Sun, May 08, 2005 at 02:07:29PM +1000, John Hansen wrote: Tatsuo Ishii wrote: So Japanese(including ASCII)/UNICODE behavior is perfectly correct at this moment. Right, so you _never_ use accented ascii characters in Japanese? (like è for example, whose uppercase is È) That isn't ASCII. It's latin1 or some other ASCII extension. Point taken... But... If you want EUC_JP (Japanese + ASCII) then use that as your backend encoding, not UTF-8 (unicode). UTF-8 encoded databases are very useful for representing multiple languages in the same database, but this usefulness vanishes if functions like upper/lower doesn't work correctly. So optimizing for 3 languages breaks more than a hundred, that's doesn't seem fair! -- Alvaro Herrera ([EMAIL PROTECTED]) La principal característica humana es la tontería (Augusto Monterroso) ... John ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Patch for collation using ICU
Why do you need to add a mapping of encoding names from iana to our names? The pg encoding names are not recognized by ICU, hence the mappings Install ICU 3.2 on your system, and run uconv -l, that will give you a list of valid ICU encoding names. ... John ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Patch for collation using ICU
Btw, Does it feel right to have pg depend on the bleeding edge version of ICU? On many distro's, even gentoo (known for being bleeding edge) 2.8 is still the default. 2.8 and 3.2 are however incompatible, and supporting both, would bloat the source somewhat. ... John ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] A proper fix for the conversion-function problem
Are there any encodings we care about that require embedded zero bytes? UTF-8 does! ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] A proper fix for the conversion-function problem
Errm.. UTF-16/32 -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen Sent: Wednesday, May 04, 2005 1:22 PM To: Tom Lane; Tatsuo Ishii Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] A proper fix for the conversion-function problem Are there any encodings we care about that require embedded zero bytes? UTF-8 does! ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Call to build-in operator from new operator
Use the DirecFunctionCall1, DirecFunctionCall2, etc. functions. ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Monday, April 18, 2005 10:40 PM To: pgsql-hackers@postgresql.org Subject: [HACKERS] Call to build-in operator from new operator Greetings, I don't really know if this is the correct place to ask this question, if not please direct me to the correct mailing list. I'm trying to develop a new operator for PostGreSQL (actually for TelegraphCQ, which is an extension of PSQL). Part of the operator's procedure is the @-operator. So now my question is: How do I call the on_pb function from inside my own function?. The on_pb of course takes the argument 'PG_FUNCTION_ARGS' which is defined in fmgr.h as 'FunctionCallInfo fcinfo' which is defined as pointer to 'struct FunctionCallInfoData', so my question boils down to: What do I put into this struct to call 'on_pb' with two arguments from the call to my function? Further, is there a way to access data in tables in the database other than those given as arguments to the function? And how? Sincerely Kim Bille Department of Computer Science Aalborg University Denmark -- Mind are like parachutes --- they only work when open ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] Recursive types?
Consider the following: create table foo ( id serial primary key, path text); alter table foo add ref foo; Table public.foo Column | Type | Modifiers +-+- id | integer | not null default nextval('public.foo_id_seq'::text) path | text| ref| foo | Indexes: foo_pkey PRIMARY KEY, btree (id) insert into foo (path) VALUES ('/'); insert into foo (path,ref) VALUES ('/foo',ROW(1,'/',NULL)); insert into foo (path,ref) VALUES ('/foo/bar',ROW(2,'/foo',ROW(1,'/',NULL))); select * from foo; id | path |ref +--+--- 1 | /| 2 | /foo | (1,/,) 3 | /foo/bar | (2,/foo,(1,/,)) (3 rows) However, it is not possible to create such a type using create type, as there is no alter type, even tho alter table does the same thing. Is this a bug or a random feature? It is also not possible to dump and restore this using pg_dump. ... John ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Unicode problems on IRC
On 2005-04-10, Tom Lane tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us wrote: Andrew - Supernews andrew+nonews ( at ) supernews ( dot ) com writes: I think you will find that this impression is actually false. Or that at the very least, _correct_ verification of UTF-8 sequences will still catch essentially all cases of non-utf-8 input mislabelled as utf-8 while allowing the full range of Unicode codepoints. Yeah? Cool. Does John's proposed patch do it correctly? http://candle.pha.pa.us/mhonarc/patches2/msg00076.html It looks correct to me. The only thing I think that code will let through incorrectly are encoded surrogates; those could be fixed by adding one line: switch (*source) { /* no fall-through in this inner switch */ case 0xE0: if (a 0xA0) return false; break; + case 0xED: if (a 0x9F) return false; break; case 0xF0: if (a 0x90) return false; break; case 0xF4: if (a 0x8F) return false; break; That's right, dono how I missed that one, but looks correct to me, and is in line with the code in ConvertUTF.c from unicode.org, on which I based the patch, extended to support 6 byte utf8 characters. (Accepting encoded surrogates in utf-8 was always forbidden by most specifications that used utf-8, though the Unicode specs originally were not absolute about it (but forbade generating them). Current Unicode specifications define those sequences as malformed. Surrogates are the code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode characters 0x1 - 0x10 as two 16-bit values; UTF-8 requires that such characters are encoded directly rather than via surrogate pairs.) -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ... John ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Unicode problems on IRC
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bruce Momjian Sent: Sunday, April 10, 2005 8:18 AM To: Christopher Kings-Lynne Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Unicode problems on IRC Christopher Kings-Lynne wrote: Hey guys, The 'Unicode characters above 0x1' issue keeps rearing its ugly head in the IRC channel. I propose that it be fixed, even backported... This is John Hansen's most recent patch to fix it: http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php And from what I can tell it was committed, then reverted because it wasn't a bug. It was going to go in for 8.1. We on the channel are starting to think that it is in fact a bug. There are are people with legitimately utf-8 encoded XML documents that they cannot store in PostgreSQL. Apparently in the distant past, Unicode was limited to 0x1, but then was extended. Perhaps we can reopen this case... Uh, I thought we fixed this another way, buy not using Unicode-aware functions for upper/lower/initcap when the locale is C or POSIX. That is backpatched to 8.0.X. Does that not fix the problem reported? No, as andrew said, what this patch does, is allow values 0x and at the same time validates the input to make sure it's valid utf8. ... John -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Patch for collation using ICU
--On fredag, mars 25, 2005 16.34.41 +1100 John Hansen [EMAIL PROTECTED] wrote: Useful if it's going to support earlier releases of ICU Not all os's come with ICU3.2, debian for example, currently has 2.1 in testing, and 2.6 in unstable. Oh, OK. FreeBSD has only the 3.2 as port. I can check the older version, I doubt it would too much difference. Some autoconf sorcery needed, perhaps. Naww, it's no biggie, we'll just need to include ICU with pg I think. I tried that, there are several functions from ICU that you use, that are not in ICU2.1 Dono about 2.6. However, ICU3.2 compiles on debian with a small change to the debian/rules file. debian/tmp/etc is missing, so add mkdir debian/tmp/etc ... John /Palle ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Palle Girgensohn Sent: Friday, March 25, 2005 10:40 AM To: pgsql-hackers@postgresql.org Subject: [HACKERS] Patch for collation using ICU Hi! I've put together a patch for using IBM's ICU package for collation. If your OS does not have full support for collation ur uppercase/lowercase in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation, i.e. order by, and also lower(), upper() and initcap() to work properly, this patch will do just that. This patch is needed for FreeBSD, since this OS has no support for collation of for example unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales. It adds a configure switch, `--with-icu', which will set up the code to use ICU instead of wchar_t and wcscoll. This has been tested only on FreeBSD-4.11 FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems it is at least not slower than using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. I'd be delighted if some more experienced postgresql hackers would review this stuff. The patch is pretty compact, so it's fast reading :) I'm planning to add this patch as an option (tagged experimental) to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not? Any thoughts or ideas are welcome! Cheers, Palle Patch at: http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 005-03-14.diff ICU at sourceforge: http://icu.sf.net/ ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Patch for collation using ICU
Ok,.. tested on debian sarge with ICU 3.2 UNICODE Database, C locale. upper() and lower() returns an empty string for any input, including 7bit ascii, regardless of client_encoding, so something is obviously broken. Have you tested this patch on a UNICODE DB with locale C/POSIX ? ... John -Original Message- From: John Hansen Sent: Friday, March 25, 2005 10:27 PM To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org' Subject: RE: [HACKERS] Patch for collation using ICU --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen [EMAIL PROTECTED] wrote: Useful if it's going to support earlier releases of ICU Not all os's come with ICU3.2, debian for example, currently has 2.1 in testing, and 2.6 in unstable. Oh, OK. FreeBSD has only the 3.2 as port. I can check the older version, I doubt it would too much difference. Some autoconf sorcery needed, perhaps. Naww, it's no biggie, we'll just need to include ICU with pg I think. I tried that, there are several functions from ICU that you use, that are not in ICU2.1 Dono about 2.6. However, ICU3.2 compiles on debian with a small change to the debian/rules file. debian/tmp/etc is missing, so add mkdir debian/tmp/etc ... John /Palle ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Palle Girgensohn Sent: Friday, March 25, 2005 10:40 AM To: pgsql-hackers@postgresql.org Subject: [HACKERS] Patch for collation using ICU Hi! I've put together a patch for using IBM's ICU package for collation. If your OS does not have full support for collation ur uppercase/lowercase in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation, i.e. order by, and also lower(), upper() and initcap() to work properly, this patch will do just that. This patch is needed for FreeBSD, since this OS has no support for collation of for example unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales. It adds a configure switch, `--with-icu', which will set up the code to use ICU instead of wchar_t and wcscoll. This has been tested only on FreeBSD-4.11 FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems it is at least not slower than using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. I'd be delighted if some more experienced postgresql hackers would review this stuff. The patch is pretty compact, so it's fast reading :) I'm planning to add this patch as an option (tagged experimental) to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not? Any thoughts or ideas are welcome! Cheers, Palle Patch at: http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 005-03-14.diff ICU at sourceforge: http://icu.sf.net/ ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Patch for collation using ICU
--On fredag, mars 25, 2005 23.39.33 +1100 John Hansen [EMAIL PROTECTED] wrote: Ok,.. tested on debian sarge with ICU 3.2 UNICODE Database, C locale. upper() and lower() returns an empty string for any input, including 7bit ascii, regardless of client_encoding, so something is obviously broken. Have you tested this patch on a UNICODE DB with locale C/POSIX ? FYI, I also found that initdb crashes with error 139 on any locale other than C/POSIX. No, honestly not. Mostly tested it with my needs, sv_SE.UTF-8 and UNICODE, and also de_DE.UTF-8. How will PostgreSQL react to this combo? A database cluster initdb:ed with locale=C/POSIX, and then a database in UNICODE (really utf-8) representation... hmm... I think I might have made a false assumption that the locale string would contain the character encoding. I do something like encoding = strchr(locale, '.') + 1... That code will be confused by a 'C' locale, indeed. I'll check it out! /Palle ... John -Original Message- From: John Hansen Sent: Friday, March 25, 2005 10:27 PM To: 'Palle Girgensohn'; 'pgsql-hackers@postgresql.org' Subject: RE: [HACKERS] Patch for collation using ICU --On fredag, mars 25, 2005 16.34.41 +1100 John Hansen [EMAIL PROTECTED] wrote: Useful if it's going to support earlier releases of ICU Not all os's come with ICU3.2, debian for example, currently has 2.1 in testing, and 2.6 in unstable. Oh, OK. FreeBSD has only the 3.2 as port. I can check the older version, I doubt it would too much difference. Some autoconf sorcery needed, perhaps. Naww, it's no biggie, we'll just need to include ICU with pg I think. I tried that, there are several functions from ICU that you use, that are not in ICU2.1 Dono about 2.6. However, ICU3.2 compiles on debian with a small change to the debian/rules file. debian/tmp/etc is missing, so add mkdir debian/tmp/etc ... John /Palle ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Palle Girgensohn Sent: Friday, March 25, 2005 10:40 AM To: pgsql-hackers@postgresql.org Subject: [HACKERS] Patch for collation using ICU Hi! I've put together a patch for using IBM's ICU package for collation. If your OS does not have full support for collation ur uppercase/lowercase in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation, i.e. order by, and also lower(), upper() and initcap() to work properly, this patch will do just that. This patch is needed for FreeBSD, since this OS has no support for collation of for example unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales. It adds a configure switch, `--with-icu', which will set up the code to use ICU instead of wchar_t and wcscoll. This has been tested only on FreeBSD-4.11 FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems it is at least not slower than using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. I'd be delighted if some more experienced postgresql hackers would review this stuff. The patch is pretty compact, so it's fast reading :) I'm planning to add this patch as an option (tagged experimental) to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not? Any thoughts or ideas are welcome! Cheers, Palle Patch at: http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 005-03-14.diff ICU at sourceforge: http://icu.sf.net/ ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Patch for collation using ICU
-Original Message- From: Palle Girgensohn [mailto:[EMAIL PROTECTED] Sent: Saturday, March 26, 2005 1:10 PM To: pgsql-hackers@postgresql.org Cc: John Hansen; Andrew Dunstan Subject: Re: [HACKERS] Patch for collation using ICU --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn [EMAIL PROTECTED] wrote: Hi! I've put together a patch for using IBM's ICU package for collation. If your OS does not have full support for collation ur uppercase/lowercase in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation, i.e. order by, and also lower(), upper() and initcap() to work properly, this patch will do just that. This patch is needed for FreeBSD, since this OS has no support for collation of for example unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales. It adds a configure switch, `--with-icu', which will set up the code to use ICU instead of wchar_t and wcscoll. This has been tested only on FreeBSD-4.11 FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems it is at least not slower than using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. I'd be delighted if some more experienced postgresql hackers would review this stuff. The patch is pretty compact, so it's fast reading :) I'm planning to add this patch as an option (tagged experimental) to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not? Any thoughts or ideas are welcome! Cheers, Palle Patch at: http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-1 4.d iff ICU at sourceforge: http://icu.sf.net/ Hi! There's a new patch to fix some reported problems. http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 005-03-26.diff This version uses the DatabaseEncoding and sets the ICU encoding at the same time. I had to create a conversion table from PostgreSQL's own, somewhat odd and non-standard, names of encodings, into the prefered IANA names. On or two of the more odd ones might be slightly incorrect, hopefully not too far off anyway? I've noticed a couple of things about using the ICU patch vs. pristine pg-8.0.1: - ORDER BY is case insensitive when using ICU. This might break the SQL standard (?), but sure is nice :) This would mean that indexes are also case insensitive right? Which makes it a Bad Thing(tm). - When the database is initialized using the C locale, upper() and lower() normally does not work at all for non-ASCII characters even if the database's encoding is say LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD, and this is probably correct since the locale is still `C', I believe?). The ICU patch changes nothing for the LATIN1 case, since it does not act on single byte encodings, but for the UNICODE representation, it works and does what I expect it to, namely upper() and lower() neatly upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') - 'åäö'. This is a good thing, although I'm surprised that upper/lower is dragged along with the LC_COLLATE fixation at initdb. I never run initdb in the C locale, but only now do I realize how broken that really is if you need to store anything else than English :-) That is what I would have expected. However, it probably won't work for the more exotic cases, like turkish I, which depends on the locale. I'd be delighted to get more feedback about this stuff. Thanks, Palle ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Patch for collation using ICU
Still doesn't work for me :( UNICODE DB C locale set client_encoding = iso88591; select upper('æ'); upper --- (1 row) Trying to initdb with en_IN.utf8 /usr/lib/postgresql/bin/initdb -D /var/lib/postgres/data/ -E UNICODE --locale=en_IN.utf8 The files belonging to this database system will be owned by user postgres. This user must also own the server process. The database cluster will be initialized with locale en_IN.utf8. fixing permissions on existing directory /var/lib/postgres/data ... ok creating directory /var/lib/postgres/data/global ... ok creating directory /var/lib/postgres/data/pg_xlog ... ok creating directory /var/lib/postgres/data/pg_xlog/archive_status ... ok creating directory /var/lib/postgres/data/pg_clog ... ok creating directory /var/lib/postgres/data/pg_subtrans ... ok creating directory /var/lib/postgres/data/base ... ok creating directory /var/lib/postgres/data/base/1 ... ok creating directory /var/lib/postgres/data/pg_tblspc ... ok selecting default max_connections ... 100 selecting default shared_buffers ... 1000 creating configuration files ... ok creating template1 database in /var/lib/postgres/data/base/1 ... ok initializing pg_shadow ... ok enabling unlimited row size for system tables ... ok initializing pg_depend ... ok creating system views ... ok loading pg_description ... ok creating conversions ... ok setting privileges on built-in objects ... ok creating information schema ... child process exited with exit code 139 initdb: removing contents of data directory /var/lib/postgres/data ... John -Original Message- From: Palle Girgensohn [mailto:[EMAIL PROTECTED] Sent: Saturday, March 26, 2005 1:10 PM To: pgsql-hackers@postgresql.org Cc: John Hansen; Andrew Dunstan Subject: Re: [HACKERS] Patch for collation using ICU --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn [EMAIL PROTECTED] wrote: Hi! I've put together a patch for using IBM's ICU package for collation. If your OS does not have full support for collation ur uppercase/lowercase in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation, i.e. order by, and also lower(), upper() and initcap() to work properly, this patch will do just that. This patch is needed for FreeBSD, since this OS has no support for collation of for example unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales. It adds a configure switch, `--with-icu', which will set up the code to use ICU instead of wchar_t and wcscoll. This has been tested only on FreeBSD-4.11 FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems it is at least not slower than using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. I'd be delighted if some more experienced postgresql hackers would review this stuff. The patch is pretty compact, so it's fast reading :) I'm planning to add this patch as an option (tagged experimental) to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not? Any thoughts or ideas are welcome! Cheers, Palle Patch at: http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-1 4.d iff ICU at sourceforge: http://icu.sf.net/ Hi! There's a new patch to fix some reported problems. http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 005-03-26.diff This version uses the DatabaseEncoding and sets the ICU encoding at the same time. I had to create a conversion table from PostgreSQL's own, somewhat odd and non-standard, names of encodings, into the prefered IANA names. On or two of the more odd ones might be slightly incorrect, hopefully not too far off anyway? I've noticed a couple of things about using the ICU patch vs. pristine pg-8.0.1: - ORDER BY is case insensitive when using ICU. This might break the SQL standard (?), but sure is nice :) - When the database is initialized using the C locale, upper() and lower() normally does not work at all for non-ASCII characters even if the database's encoding is say LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD, and this is probably correct since the locale is still `C', I believe?). The ICU patch changes nothing for the LATIN1 case, since it does not act on single byte encodings, but for the UNICODE representation, it works and does what I expect it to, namely upper() and lower() neatly upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') - 'åäö'. This is a good thing, although I'm surprised that upper/lower is dragged along with the LC_COLLATE fixation at initdb. I never run initdb
Re: [HACKERS] Patch for collation using ICU
Useful if it's going to support earlier releases of ICU Not all os's come with ICU3.2, debian for example, currently has 2.1 in testing, and 2.6 in unstable. ... John -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Palle Girgensohn Sent: Friday, March 25, 2005 10:40 AM To: pgsql-hackers@postgresql.org Subject: [HACKERS] Patch for collation using ICU Hi! I've put together a patch for using IBM's ICU package for collation. If your OS does not have full support for collation ur uppercase/lowercase in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation, i.e. order by, and also lower(), upper() and initcap() to work properly, this patch will do just that. This patch is needed for FreeBSD, since this OS has no support for collation of for example unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales. It adds a configure switch, `--with-icu', which will set up the code to use ICU instead of wchar_t and wcscoll. This has been tested only on FreeBSD-4.11 FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems it is at least not slower than using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster. I'd be delighted if some more experienced postgresql hackers would review this stuff. The patch is pretty compact, so it's fast reading :) I'm planning to add this patch as an option (tagged experimental) to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not? Any thoughts or ideas are welcome! Cheers, Palle Patch at: http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2 005-03-14.diff ICU at sourceforge: http://icu.sf.net/ ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32
John Hansen wrote: currently, upper/lower does not work with 2+ byte unicode characters, on any OS under the C locale. Sure it does. It's just that the defined behavior of the C locale is often useless in practice. select upper('æøå'); ERROR: invalid multibyte character for locale HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding. Consequently it seems that is does not work. ... John ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32
select upper('æøå'); ERROR: invalid multibyte character for locale HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding. Consequently it seems that is does not work. It fails on my machine should not be read as it doesn't work for anyone. It all depends on how your local mbstowcs() works. Ok,... Do you have an example of a system on which it works? ... John ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32
On HPUX 10.20, mbstowcs seems to treat all byte values as single-byte characters in C locale, so my sample-of-one says that it works everywhere ;-). Right, so for the sample SQL I sent earlier, the result would be the same as the input? That's hardly a working upper/lower If a character doesn't have case then fine, but one that does, should at least produce a warning if it cannot be converted. ... John ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32
To fix UTF8, the data needs to be converted to UTF16 and then the Win32 wcscoll() can be used, and perhaps other functions like towupper(). However, UTF8 already works with normal locales but provides no ordering. Right,. So if that's fixed, then UTF8 will work only on windows? (currently, upper/lower does not work with 2+ byte unicode characters, on any OS) ... John ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32
K, let me rephrase: currently, upper/lower does not work with 2+ byte unicode characters, on any OS under the C locale. ... John ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [pgsql-hackers-win32] UNICODE/UTF-8 on win32
currently, upper/lower does not work with 2+ byte unicode characters, on any OS under the C locale. Btw,... There are only 15 cases in the utf8 repertoire that depends on locale, these are the only cases where pg should report: ERROR: invalid multibyte character for locale HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding. When doing a select upper/lower (col) All others should work just fine. The error should probably also be changed to a warning, and just return the offending character unmodified. ... John ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Terminating a SETOF function call sequence
... c) would be very bad since it doesn't give me any chance to release the resources that where used in order to produce the rows. You are supposed to free resources used to produce the rows before srf_return_next(); The actual rows are pfree()'d by pg. (an dso are any other palloc()'d resources, but I'd recommend freeing them anyway, especially if youre going to use the function in an index or transactions, since resources a not freed till the end of the transaction) ... JOhn ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] Schema name of function
Just got reminded... Is there a way for a C function to determine the name of the schema in which is was created? ... John ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Schema name of function
Beautiful, just what I was looking for. Thnx, John -Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 15, 2005 6:31 AM To: Michael Fuhr Cc: John Hansen; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Schema name of function Michael Fuhr [EMAIL PROTECTED] writes: On Mon, Feb 14, 2005 at 11:11:53AM -0500, Tom Lane wrote: In C, it'd be a lot easier (and faster) to do a couple of SearchSysCache calls than to use SPI to get those rows. The following appears to work -- does it look right, aside from the missing error checking? tuple = SearchSysCache(PROCOID, ObjectIdGetDatum(fcinfo-flinfo-fn_oid), 0, 0, 0); nsoid = SysCacheGetAttr(PROCOID, tuple, Anum_pg_proc_pronamespace, isnull); schemaname = get_namespace_name(nsoid); ReleaseSysCache(tuple); That would work. Since pronamespace is one of the fixed non-nullable columns of pg_proc, you don't really need to use SysCacheGetAttr: you can just map the C struct onto the tuple and grab the field directly. nsoid = ((Form_pg_proc) GETSTRUCT(tuple))-pronamespace; utils/cache/lsyscache.c contains lots of examples of this sort of thing. (get_namespace_name is one, in fact.) regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
[HACKERS] unicode upper/lower functions
Hi list, Attached for your perusal, unicode versions of upper/lower, that work independent of locale except for the following languages: Turkish, Azeri, and Lithuanian. There are 15 locale specific cases in total not covered. -- John Hansen [EMAIL PROTECTED] GeekNET collate.tar.gz Description: application/compressed-tar ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [NOVICE] Last ID Problem
Attempts to return the id of the last value to be inserted into a table. You can either provide a sequence name (preferred) or provide a table name with optional schema. The $catalog and $field arguments are always ignored. The current value of the sequence is returned by a call to the 'currval' PostgreSQL function. This will fail if the sequence has not yet been used in the current database connection. This suffers from the same problems that currval does when using connection pools tho. I previously suggested a function similar to last_insert_id in behaviour, and have attached it to this email for reference. Even so, this also suffers from the same problems when using a connection pool. The solution I proposed, namely having the tuple returned by inserts/updates (perhaps even deletes?) would only mean changing the client library to handle this, and as an example, libpg could easily figure out the OID of said tuple and return that if it's present for PQExec() (for backwards compatibility just as it does today,) and add a separate PQExecSelect() that instead returns the tuple(s) as if they had been SELECTed. -- John Hansen [EMAIL PROTECTED] GeekNET #include postgres.h #include fmgr.h #include storage/relfilenode.h #include commands/sequence.h static int64 _lastval = 0; PG_FUNCTION_INFO_V1(nextval_new); Datum nextval_new(PG_FUNCTION_ARGS) { _lastval = DatumGetInt64(nextval(fcinfo)); PG_RETURN_INT64(_lastval); } PG_FUNCTION_INFO_V1(lastval); Datum lastval(PG_FUNCTION_ARGS) { PG_RETURN_INT64(_lastval); } SET search_path = pg_catalog; BEGIN; DELETE FROM pg_catalog.pg_proc WHERE proname = 'nextval'; CREATE FUNCTION nextval(text) RETURNS bigint AS 'lastval.so','nextval_new' LANGUAGE 'C'; COMMENT ON FUNCTION nextval(text) IS 'sequence next value'; CREATE FUNCTION lastval() RETURNS bigint AS 'lastval.so','lastval' LANGUAGE 'C'; COMMENT ON FUNCTION lastval() IS 'sequence last value'; COMMIT; # - lastval : gcc -I /usr/include/postgresql/server/ -I /usr/include/postgresql/ -shared -o lastval.so lastval.c strip lastval.so install : install -s -m 755 lastval.so $(DESTDIR)/usr/lib/postgresql/lib/; clean : rm -f *.o *~ core *.so; ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] subselects in the target list
On Wed, 2005-02-02 at 23:22 -0500, Tom Lane wrote: Neil Conway [EMAIL PROTECTED] writes: neilc=# select a, (select * from abc) from abc; ERROR: subquery must return only one column Is there a reason we can't treat a subselect in the target list as returning a composite type? Given the 8.0 infrastructure for unnamed record types it might be possible to do that; it was surely never possible before. Whether it's a good idea is another question. The syntax you are showing is designed to return a scalar. It will (and should) barf on multiple rows as well as multiple columns. Right, the point is, that is does not, if said srf-function is written in say, C. However, this is somewhat similar to the WITH LATERAL clause previously discussed in connection with UNNEST and multisets, so perhaps it's not such a bad idea after all? For that matter, is this behavior also intentional? neilc=# select a, foo_abc2() FROM abc; ERROR: set-valued function called in context that cannot accept a set CONTEXT: PL/pgSQL function foo_abc2 line 1 at return next It's an implementation restriction in plpgsql: we didn't make it support the old-style SRF API. I'm unconvinced that it's worth fixing considering that this whole behavior (SRFs in the targetlist) is deprecated. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings -- John Hansen [EMAIL PROTECTED] GeekNET ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] unicode upper/lower functions
uhmm,... Forgot to change the copyright. Please accept this under the same terms as postgresql itself. ... John ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] [NOVICE] Last ID Problem
Since OID's are now deprecated, and will eventually disappear, wouldn't it be a good idea, to have INSERT and UPDATE return a copy of the tuple that was inserted/updated? How about the TID? Yea, that'd work. As long as you can get an arbitrary column back out, 'as it was at the time it was committed'. Since not everything runs in a transaction,. And someone might have modified the row by the time you get to fetching it back out Or in terms of tuples,. No longer exist, if vacuum full have run... ... JOhn ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] [NOVICE] Last ID Problem
No one has stated that they will disappear. Ohh,... just the impression I've been getting when speaking with people. ... John ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]