Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Martijn van Oosterhout
On Mon, Dec 20, 2010 at 09:03:56AM +0900, Itagaki Takahiro wrote:
 On Mon, Dec 20, 2010 at 01:34, Tom Lane t...@sss.pgh.pa.us wrote:
  I agree that the default encoding is UTF-8, but it should be
  configurable by the 'encoding' parameter in control files.
 
  Why is it necessary to have such a parameter at all?
 
 UTF-8 is not a superset of all encodings.

I think you mean Unicode is not a superset of all character sets. I've
heard this before but never found what's missing. [citation needed]?

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle


signature.asc
Description: Digital signature


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread David Fetter
On Mon, Dec 20, 2010 at 08:01:42PM +0100, Martijn van Oosterhout wrote:
 On Mon, Dec 20, 2010 at 09:03:56AM +0900, Itagaki Takahiro wrote:
  On Mon, Dec 20, 2010 at 01:34, Tom Lane t...@sss.pgh.pa.us wrote:
   I agree that the default encoding is UTF-8, but it should be
   configurable by the 'encoding' parameter in control files.
  
   Why is it necessary to have such a parameter at all?
  
  UTF-8 is not a superset of all encodings.
 
 I think you mean Unicode is not a superset of all character sets. I've
 heard this before but never found what's missing. [citation needed]?

Windows-1252, ISO-2022-JP-2 and EUC-TW are such encodings.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Tom Lane
David Fetter da...@fetter.org writes:
 On Mon, Dec 20, 2010 at 08:01:42PM +0100, Martijn van Oosterhout wrote:
 I think you mean Unicode is not a superset of all character sets. I've
 heard this before but never found what's missing. [citation needed]?

 Windows-1252, ISO-2022-JP-2 and EUC-TW are such encodings.

[citation needed]?  Exactly what characters are missing, and why would
the Unicode people have chosen to leave them out?  It's not like they've
not heard of those encodings, I'm sure.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Kenneth Marshall
On Mon, Dec 20, 2010 at 02:10:39PM -0500, Tom Lane wrote:
 David Fetter da...@fetter.org writes:
  On Mon, Dec 20, 2010 at 08:01:42PM +0100, Martijn van Oosterhout wrote:
  I think you mean Unicode is not a superset of all character sets. I've
  heard this before but never found what's missing. [citation needed]?
 
  Windows-1252, ISO-2022-JP-2 and EUC-TW are such encodings.
 
 [citation needed]?  Exactly what characters are missing, and why would
 the Unicode people have chosen to leave them out?  It's not like they've
 not heard of those encodings, I'm sure.
 
   regards, tom lane
 

Here is an interesting description of some of the gotchas:

http://en.wikipedia.org/wiki/Windows-1252

Regards,
Ken

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread David E. Wheeler
On Dec 20, 2010, at 11:53 AM, Kenneth Marshall wrote:

 Here is an interesting description of some of the gotchas:
 
 http://en.wikipedia.org/wiki/Windows-1252

FWIW, those are gotchas translating between Windows 1252 and Latin-1. Windows 
1252's nerbles translate to UTF-8 just fine.

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Tom Lane
Kenneth Marshall k...@rice.edu writes:
 On Mon, Dec 20, 2010 at 02:10:39PM -0500, Tom Lane wrote:
 [citation needed]?  Exactly what characters are missing, and why would
 the Unicode people have chosen to leave them out?  It's not like they've
 not heard of those encodings, I'm sure.

 Here is an interesting description of some of the gotchas:
 http://en.wikipedia.org/wiki/Windows-1252

Well, it's interesting, but I see no glyphs on that page that lack
Unicode assignments.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Kenneth Marshall
On Mon, Dec 20, 2010 at 03:08:48PM -0500, Tom Lane wrote:
 Kenneth Marshall k...@rice.edu writes:
  On Mon, Dec 20, 2010 at 02:10:39PM -0500, Tom Lane wrote:
  [citation needed]?  Exactly what characters are missing, and why would
  the Unicode people have chosen to leave them out?  It's not like they've
  not heard of those encodings, I'm sure.
 
  Here is an interesting description of some of the gotchas:
  http://en.wikipedia.org/wiki/Windows-1252
 
 Well, it's interesting, but I see no glyphs on that page that lack
 Unicode assignments.
 
   regards, tom lane
 
You are correct. I mis-read the text.

Regards,
Ken

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Nicolas Barbier
2010/12/20 Martijn van Oosterhout klep...@svana.org:

 On Mon, Dec 20, 2010 at 09:03:56AM +0900, Itagaki Takahiro wrote:

 UTF-8 is not a superset of all encodings.

 I think you mean Unicode is not a superset of all character sets. I've
 heard this before but never found what's missing. [citation needed]?

From 
URL:http://en.wikipedia.org/wiki/Japanese_language_and_computers#Character_encodings:

Unicode is supposed to solve all encoding problems in all languages
of the world. [..] There are still controversies. For Japanese, the
kanji characters have been unified with Chinese, that is a character
considered to be the same in both Japanese and Chinese have been given
one and the same code number in Unicode, even if they look a little
different. This process, called Han unification, has caused
controversy.

For examples (my browser doesn't show any differences though, probably
because I don't have the corresponding fonts):

URL:http://en.wikipedia.org/wiki/Han_unification#Examples_of_language_dependent_characters

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Martijn van Oosterhout
On Mon, Dec 20, 2010 at 10:15:56PM +0100, Nicolas Barbier wrote:
 From 
 URL:http://en.wikipedia.org/wiki/Japanese_language_and_computers#Character_encodings:
 
 Unicode is supposed to solve all encoding problems in all languages
 of the world. [..] There are still controversies. For Japanese, the
 kanji characters have been unified with Chinese, that is a character
 considered to be the same in both Japanese and Chinese have been given
 one and the same code number in Unicode, even if they look a little
 different. This process, called Han unification, has caused
 controversy.

From http://en.wikipedia.org/wiki/CJK_Unified_Ideographs:

However, the source separation rule states that characters encoded
separately in an earlier character set would remain separate in the new
Unicode encoding.

From all the references I've seen this has been applied everywhere and
any failures to roundtrip conversions are considered bugs and I can't
believe that at this point they havn't all been fixed. This is kind of
underscored by the fact that references always point to theoretical
problems rather than actual lists of characters that can't be
converted.

ISTM that since all the mapping tables are public it should be a SMOP
to *prove* roundtrip conversions are safe, or identify the problems.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle


signature.asc
Description: Digital signature


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Itagaki Takahiro
On Tue, Dec 21, 2010 at 08:04, Martijn van Oosterhout klep...@svana.org wrote:
 On Mon, Dec 20, 2010 at 10:15:56PM +0100, Nicolas Barbier wrote:
 From 
 URL:http://en.wikipedia.org/wiki/Japanese_language_and_computers#Character_encodings:
 ISTM that since all the mapping tables are public it should be a SMOP
 to *prove* roundtrip conversions are safe, or identify the problems.

Another issue in Japanese users is EUDC (End User Defined Character).
Unfortunately for both postgres developers and application developers
in Japan, many machine dependence characters are still used in popular
mobile phones in Japan. Their native encoding is SHIFT_JIS, and we
have an EUDC mapping for SHIFT_JIS to/from EUC_JP. But we don't have
for UTF-8 to/from other encodings. That is one of the reasons why we
cannot move to the UTF-8 world completely.

Imagine that a module that manipulate EUDC text. It will be written
in EUC_JP because SHIFT_JIS is not supported in postgres. Also, it
cannot be rewritten in UTF-8 because there are no mapping for the
characters used in it.

-- 
Itagaki Takahiro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-19 Thread Dimitri Fontaine
Hi,

Thanks for your review and your time. Trying to answer some of your
points there:

Robert Haas robertmh...@gmail.com writes:
 I spent a little time looking at this tonight.  I'm going to give you
 the same general advice that I've given other people who have
 submitted very large patches of this type: it'll be a lot easier to
 get this committed if we can break it into smaller pieces.  A pretty
 easy thing to do would be to separate the changes that turn the
 existing contrib modules into extensions into its own patch.

Those are pretty mechanical, so applying them after would be quite
easy. That said I don't know exactly how much it would ease the review,
but if you say it does, I'll separate the patches.

  A
 possibly more controversial idea is to actually remove the notion of a
 relocatable extension from this patch, and all the supporting
 machinery, and make that a follow-on patch.  I think it was arranged
 like that before and somehow got collapsed, but maybe the notion is
 worth revisiting, because this is a lot of code:

  224 files changed, 4713 insertions(+), 293 deletions(-)

It was like that before indeed. The problem is how to reliably implement
the CREATE EXTENSION WITH SCHEMA, which is like the second best thing
about all this infrastructure work just behind the pg_dump support.  I'm
not clear that this option should come on its own in a later patch or
that doing so simplifies anything, but not opinionated about that: let's
organise ourselves as best as we are able to.

Again, will split the code in a third patch if that's what's necessary.

 That issue aside, I think there is still a fair amount of cleanup and
 code review that needs to happen here before this can be considered
 for commit.  Here are a few things I noticed on a first read-through:

 - pg_execute_sql_string() and pg_execute_sql_file() are documented,
 but are not actually part of the patch.

Ouch, git ain't that magic after all. Will clean-up next version(s).

 - The description of the NO USER DATA option is almost content-free.
 It basically says this option is here because it wouldn't work if we
 didn't have this option.

+ (see xref linkend=functions-extension), this option allow
+ extension authors to prepare installation scripts that will avoid
+ creating user data when restoring.

I'd need some specifics here to be able to do much better. My guess is
that an example is what needed, and it would fit in the main section of
the manual about it, 35.16. Putting it all together: create extension.

 - listDependentObjects() appears to be largely cut-and-pasted from
 some other function.  It calls AcquireDeletionLock() and the comments
 mention constructing a list of objects to delete.  It sure doesn't
 seem right to be acquiring AccessExclusiveLocks on lots of things in a
 function that's there to support the pg_extension_objects SRF.

Ah well, true. AccessShareLock seems the minimum we can take, as we
surely want to prevent the extension and its objects disappearing under
us, right?

 - This bit certainly violates our message style guidelines:

 +   elog(NOTICE,
 +Installing extension '%s' from '%s', in schema %s,
 %s user data,
 +stmt-extname,
 +get_extension_absolute_path(control-script),
 +schema ? schema : public,
 +create_extension_with_user_data ? with : without);

 User-facing messages need to be ereport, not elog, so they can be
 translated, and mustn't be assembled from pieces, as you've done with
 %s user data.

This message is pretty useful for debug and review of the patch, but I'm
not sure we want to keep after that...

 - Has the issue of changing custom_variable_classes from PGC_SIGHUP to
 PGC_SUSET been discussed?  I am not sure whether that's an OK thing to
 do.  If it is OK, then the documentation also needs updating.

I've been talking about it since the very firsts versions of the patch
on this list but didn't receive much answer. You have a detailed mail
about that later in the thread, will start a new thread about that from
there.

 - This comment looks like leftovers:

 +   /* FIXME: add PGC_EXTENSION so that we don't abuse PGC_SIGHUP here? */
 +   SetConfigOption(custom_variable_classes,
 +   newval, PGC_SIGHUP, PGC_S_EXTENSION);

 Apologies if I missed the previous discussion of this, but why are we
 adding a new GUC context?

Well we're using PGC_SIGHUP and PGC_S_EXTENSION, should we use
PGC_EXTENSION too is what the comment asks, so it's not very clear but
not a leftover. We're adding a new GUC source here, not a new context.

Let's include that in the new thread about cvc, though.

 - Did we decide to ditch the encoding parameter for extension scripts
 and mandate UTF-8?

No we didn't, we decided that the default encoding is UTF-8 and that any
contrib script defaults to UTF-8, so that it's not necessary to care
about the 'encoding' parameter 

Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-19 Thread Robert Haas
On Sun, Dec 19, 2010 at 5:30 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 Robert Haas robertmh...@gmail.com writes:
 I spent a little time looking at this tonight.  I'm going to give you
 the same general advice that I've given other people who have
 submitted very large patches of this type: it'll be a lot easier to
 get this committed if we can break it into smaller pieces.  A pretty
 easy thing to do would be to separate the changes that turn the
 existing contrib modules into extensions into its own patch.

 Those are pretty mechanical, so applying them after would be quite
 easy. That said I don't know exactly how much it would ease the review,
 but if you say it does, I'll separate the patches.

Well, I don't know who is ultimately going to end up applying this.  I
can only say what helps me.  If somebody else wants to jump in here
and say they're good to review and commit the whole thing in one go,
so be it...

  A
 possibly more controversial idea is to actually remove the notion of a
 relocatable extension from this patch, and all the supporting
 machinery, and make that a follow-on patch.  I think it was arranged
 like that before and somehow got collapsed, but maybe the notion is
 worth revisiting, because this is a lot of code:

  224 files changed, 4713 insertions(+), 293 deletions(-)

 It was like that before indeed. The problem is how to reliably implement
 the CREATE EXTENSION WITH SCHEMA, which is like the second best thing
 about all this infrastructure work just behind the pg_dump support.  I'm
 not clear that this option should come on its own in a later patch or
 that doing so simplifies anything, but not opinionated about that: let's
 organise ourselves as best as we are able to.

I'm not totally certain of it either, but I think it might.

 - The description of the NO USER DATA option is almost content-free.
 It basically says this option is here because it wouldn't work if we
 didn't have this option.

 +         (see xref linkend=functions-extension), this option allow
 +         extension authors to prepare installation scripts that will avoid
 +         creating user data when restoring.

 I'd need some specifics here to be able to do much better. My guess is
 that an example is what needed, and it would fit in the main section of
 the manual about it, 35.16. Putting it all together: create extension.

No, I think you need to describe what the option actually does, as
opposed to what problem it solves.

 - listDependentObjects() appears to be largely cut-and-pasted from
 some other function.  It calls AcquireDeletionLock() and the comments
 mention constructing a list of objects to delete.  It sure doesn't
 seem right to be acquiring AccessExclusiveLocks on lots of things in a
 function that's there to support the pg_extension_objects SRF.

 Ah well, true. AccessShareLock seems the minimum we can take, as we
 surely want to prevent the extension and its objects disappearing under
 us, right?

On two minutes thought, I'm not entirely sure that we need to take
locks at all here, but if we do, AccessShareLock is certainly the
right one.  One idea I have is that we might want to move
AcquireDeletionLock to objectaddress.c or maybe even somewhere in
storage/lmgr, rename it to something like LockObjectByAddress, and
give it a second argument for the lock strength.

 - This bit certainly violates our message style guidelines:

 +       elog(NOTICE,
 +                Installing extension '%s' from '%s', in schema %s,
 %s user data,
 +                stmt-extname,
 +                get_extension_absolute_path(control-script),
 +                schema ? schema : public,
 +                create_extension_with_user_data ? with : without);

 User-facing messages need to be ereport, not elog, so they can be
 translated, and mustn't be assembled from pieces, as you've done with
 %s user data.

 This message is pretty useful for debug and review of the patch, but I'm
 not sure we want to keep after that...

Either fix it or take it out.  If you're proposing this for commit, it
shouldn't contain anything you don't want committed.

 - Has the issue of changing custom_variable_classes from PGC_SIGHUP to
 PGC_SUSET been discussed?  I am not sure whether that's an OK thing to
 do.  If it is OK, then the documentation also needs updating.

 I've been talking about it since the very firsts versions of the patch
 on this list but didn't receive much answer. You have a detailed mail
 about that later in the thread, will start a new thread about that from
 there.

Sorry, I missed that discussion.  As you know I try to follow -hackers
pretty closely, but apparently not closely enough in this case.  Can
you provide links to the archives?

 - This comment looks like leftovers:

 +       /* FIXME: add PGC_EXTENSION so that we don't abuse PGC_SIGHUP here? 
 */
 +       SetConfigOption(custom_variable_classes,
 +                                       newval, PGC_SIGHUP, PGC_S_EXTENSION);

 Apologies if 

Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-19 Thread Itagaki Takahiro
 - Did we decide to ditch the encoding parameter for extension scripts
 and mandate UTF-8?

 No we didn't, we decided that the default encoding is UTF-8 and that any
 contrib script defaults to UTF-8, so that it's not necessary to care
 about the 'encoding' parameter in the control file there.

 Itagaki required that the extension code is encoding aware, and it
 wasn't clear that the control file 'encoding' parameter would be enough
 in his side of the world, so the encoding support is currently offered
 to authors in the control file and users still can override it in the
 CREATE EXTENSION command.

I think 'encoding' parameter is enough. Of course embedding encoding
specifiers in SQL files are better, but there would be technical problems.
(Just for reference: http://www.python.org/dev/peps/pep-0263/ )

 I'm about sure that we don't want to remove that support, and I think
 that the command part of it ain't required, but that's for people with
 complex encoding problems to tell us IMO.

 Oh, I wasn't aware that Itagaki-san had objected to Tom's proposal.

I agree that the default encoding is UTF-8, but it should be
configurable by the 'encoding' parameter in control files.

If we use UTF-8 as the the default encoding, we need to treat
3 encodings at once (server, client, and UTF-8) anyway.
So, I think no additional complexity will be added even if we
support a  configurable encoding as the third encoding.

-- 
Itagaki Takahiro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-19 Thread Tom Lane
Itagaki Takahiro itagaki.takah...@gmail.com writes:
 Oh, I wasn't aware that Itagaki-san had objected to Tom's proposal.

 I agree that the default encoding is UTF-8, but it should be
 configurable by the 'encoding' parameter in control files.

Why is it necessary to have such a parameter at all?  AFAICS it just
adds complexity for little if any gain.  Most extension files will
probably be pure ASCII anyway.  Dictionary files are *far* more likely
to contain non-ASCII characters.  If we've gotten along fine with
requiring dictionary files to be UTF8, I can't see any reason why we
can't or shouldn't take the same approach to extension files.

 So, I think no additional complexity will be added even if we
 support a  configurable encoding as the third encoding.

This is nonsense.  The mere existence of the parameter requires code
to support it and user documentation to explain it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-19 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Why is it necessary to have such a parameter at all?  AFAICS it just
 adds complexity for little if any gain.  Most extension files will
 probably be pure ASCII anyway.  Dictionary files are *far* more likely
 to contain non-ASCII characters.  If we've gotten along fine with
 requiring dictionary files to be UTF8, I can't see any reason why we
 can't or shouldn't take the same approach to extension files.

Don't forget that extensions are not just contrib or third party Open
Source software, but a lot of in-house code, mostly functions written in
SQL and PLpgSQL.  In non-English speaking countries, the parameter names
and comments are typically not written in English.

When we're talking Japan they have some quite specifics needs and I came
to understand that the database encoding and the script encoding are not
to be the same, usually.  So I still vote for handling this parameter.

 So, I think no additional complexity will be added even if we
 support a  configurable encoding as the third encoding.

 This is nonsense.  The mere existence of the parameter requires code
 to support it and user documentation to explain it.

The whole documentation needs to be:

 varlistentry
  termreplaceable class=parameterencoding/replaceable/term
  listitem
   para
The literalencoding/literal in which the script file is read.
   /para
  /listitem
 /varlistentry

The code to support that is on the order of 25 lines of code, once we
get rid of the SQL command level support for it.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-19 Thread Itagaki Takahiro
On Mon, Dec 20, 2010 at 01:34, Tom Lane t...@sss.pgh.pa.us wrote:
 I agree that the default encoding is UTF-8, but it should be
 configurable by the 'encoding' parameter in control files.

 Why is it necessary to have such a parameter at all?

UTF-8 is not a superset of all encodings.

-- 
Itagaki Takahiro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers