Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of MauMau
>
> Hello,
>
> I think it would be nice for PostgreSQL to support national character types,
> largely because it should ease migration from other DBMSs.
>
> [Reasons why we need NCHAR]
> --
> 1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL, etc.
> all have NCHAR support. PostgreSQL is probably the only one of the major
> databases that does not support NCHAR.
> Sadly, I've read a report from some Japanese government agency that the number
> of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I
> wouldn't say that is due to NCHAR support, but it might be one reason. I want
> PostgreSQL to be more popular and regain those users.
>
> 2. Enhance the "open" image of PostgreSQL by implementing more features of the
> SQL standard. NCHAR may be a wrong and unnecessary feature of the SQL standard
> now that we have Unicode support, but it is defined in the standard and widely
> implemented.
>
> 3. I have heard that some potential customers didn't adopt PostgreSQL due to
> lack of NCHAR support. However, I don't know the exact reason why they need
> NCHAR.

The use case we have is for customers who are modernizing their databases on
mainframes. These applications are typically written in COBOL, which has
extensive support for national characters. Supporting national characters as
built-in data types in PostgreSQL is, without exaggeration, an important
criterion in their decision to use PostgreSQL or not. (So is embedded COBOL,
but that is a separate issue.)

> 4.
> I guess some users really want to continue to use ShiftJIS or EUC_JP as the
> database encoding, and use NCHAR for a limited set of columns to store
> international text in Unicode:
> - to avoid code conversion between the server and the client, for performance
> - because ShiftJIS and EUC_JP require less storage (2 bytes for most
> Kanji) than UTF-8 (3 bytes)
> This use case is described in chapter 6 of "Oracle Database Globalization
> Support Guide".
> --
>
> I think we need to do the following:
>
> [Minimum requirements]
> --
> 1. Accept NCHAR/NVARCHAR as data type names and N'...' syntactically.
> This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for
> CHAR/VARCHAR, and ignores the N prefix. But this is not documented.
>
> 2. Declare support for national characters in the manual.
> 1 is not sufficient because users don't want to depend on undocumented
> behavior. This is exactly what the TODO item "national character support"
> in the PostgreSQL TODO wiki is about.
>
> 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that:
> - psql \d can display the user-specified data types.
> - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as
> CHAR/VARCHAR.
> - Additional features for NCHAR/NVARCHAR can be implemented in the future, as
> described below.
> --

Agreed. This is our minimum requirement too.

Rgds,
Arul Shaji

> [Optional requirements]
> --
> 1. Implement client driver support, such as:
> - NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as specified
> in the SQL standard.
> - national character methods (e.g. setNString, getNString,
> setNCharacterStream) as specified in JDBC 4.0.
> I think at first we can treat these national-character-specific features the
> same as CHAR/VARCHAR.
>
> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
> contain Unicode data.
> I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used
> in UTF-8 databases and that they store UTF-8 strings.
> This allows us to reuse the input/output/send/recv functions and other
> infrastructure of CHAR/VARCHAR. This is a reasonable compromise to avoid
> duplication and minimize the first implementation of NCHAR support.
>
> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
> Fixed-width encoding may allow faster string manipulation, as described in
> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
> fixed-width encoding due to supplementary characters.

This would definitely be a welcome addition.

> --
>
> I don't think it is good to implement the NCHAR/NVARCHAR types as extensions
> like contrib/citext, because NCHAR/NVARCHAR are basic types and need
> client-side support. That is, client drivers need to be aware of the fixed
> NCHAR/NVARCHAR OID values.
>
> How do you think we should implement NCHAR support?
>
> Regards
> MauMau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
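Two encoding claims in the message above are easy to check with a standalone Python sketch (illustrative only, not PostgreSQL code): common Kanji take 2 bytes in ShiftJIS/EUC_JP but 3 in UTF-8, and UTF-16 is not truly fixed-width because supplementary characters need a surrogate pair.

```python
kanji = "日本語"
print(len(kanji.encode("euc_jp")))     # 6 bytes: 2 per Kanji
print(len(kanji.encode("shift_jis")))  # 6 bytes: 2 per Kanji
print(len(kanji.encode("utf-8")))      # 9 bytes: 3 per Kanji

# UTF-16 is not a real fixed-width encoding: characters outside the
# Basic Multilingual Plane need a surrogate pair (4 bytes, not 2).
bmp_char = "語"        # U+8A9E, inside the BMP
supplementary = "𠮟"   # U+20B9F, outside the BMP
print(len(bmp_char.encode("utf-16-be")))       # 2
print(len(supplementary.encode("utf-16-be")))  # 4
```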
Re: [HACKERS] Proposal - Support for National Characters functionality
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
>
> Alvaro Herrera writes:
> > Also, as far as I understand what we want to control here is the
> > encoding that the strings are in (the mapping of bytes to characters),
> > not the collation (the way a set of strings are ordered). So it
> > doesn't make sense to set the NATIONAL CHARACTER option using the
> > COLLATE keyword.
>
> My thought is that we should simply ignore the NATIONAL CHARACTER syntax,
> which is not the first nor the last brain-damaged feature design in the SQL
> standard. It's basically useless for what we want because there's no place to
> specify which encoding you mean. Instead, let's consider that COLLATE can
> define not only the collation but also the encoding of a string datum.

Yes, I don't have a problem with this. If I understand you correctly, this will
be simpler syntax-wise, but will still get the nchar/nvarchar data types into a
table, in a different encoding from the rest of the table.

> There's still the problem of how you get a string of a nondefault encoding
> into the database in the first place.

Yes, that is the bulk of the work. It will need changes in a whole lot of
places. Is a step-by-step approach worth exploring? Something similar to:

Step 1: Support the nchar/nvarchar data types. Restrict them to UTF-8 databases
to begin with.
Step 2: Support multiple encodings in a database. Remove the restriction
imposed in step 1.

Rgds,
Arul Shaji
Re: [HACKERS] Proposal - Support for National Characters functionality
> From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
>
> Boguk, Maksym escribió:
>
> > I think I gave a wrong description there... it will not be a GUC but a
> > GUC-type value which will be initialized during CREATE DATABASE and
> > will be read-only afterwards, very similar to lc_collate.
> > So I think the name national_lc_collate will be better.
> > The function of this value is to provide information about the default
> > collation for NATIONAL CHARACTERS inside the database.
> > That does not limit the user's ability to use an alternative collation for
> > NATIONAL CHARACTERS during CREATE TABLE via the COLLATE keyword.
>
> This seems a bit odd. I mean, if I want the option for differing encodings,
> surely I need to be able to set them for each column, not at the database
> level.
>
> Also, as far as I understand what we want to control here is the encoding
> that the strings are in (the mapping of bytes to characters), not the
> collation

Yes, that is our idea too. For the SQL syntax

Create table tbl1 (col1 nchar);

what should be the encoding and collation for col1? Because the idea is to have
them in a separate encoding and collation (if needed) from that of the rest of
the table. We have the options of:

a) Having GUC variables that will determine the default encoding and collation
for nchar/nvarchar columns. Note that the collate variable is a default only;
users can still override it per column.

b) Having the encoding name and collation as part of the syntax. For ex.,
(col1 nchar encoding UTF-8 COLLATE "C"). Ugly, but.

c) Being rigid and saying nchar/nvarchar columns are by default UTF-8 (or
something else). One cannot change the default, but can override it when
declaring the column by having a syntax similar to (b).

Rgds,
Arul Shaji

> (the way a set of strings are ordered). So it doesn't make sense to set the
> NATIONAL CHARACTER option using the COLLATE keyword.
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
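Alvaro's distinction above — encoding is the mapping of bytes to characters, while collation is only the ordering of strings — can be made concrete with a standalone Python sketch (illustrative only, not PostgreSQL code): the very same stored bytes name entirely different characters under two different encodings.

```python
raw = b"\x93\xfa\x96\x7b"       # four bytes as they might sit in a column
print(raw.decode("shift_jis"))  # 日本 — two Kanji under Shift-JIS
print(raw.decode("latin-1"))    # four unrelated characters under Latin-1
```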
Re: [HACKERS] Proposal - Support for National Characters functionality
> -----Original Message-----
> From: Tatsuo Ishii [mailto:is...@postgresql.org]
>
> Also I don't understand why you need UTF-16 support as a database encoding,
> because UTF-8 and UTF-16 are logically equivalent; they are just different
> representations (encodings) of Unicode. That means if we already support
> UTF-8 (I'm sure we already do), there's no particular reason we need to add
> UTF-16 support.
>
> Maybe you just want to support UTF-16 as a client encoding?

Given below is a design draft for this functionality:

Core new functionality (new code):

1) Create and register independent NCHAR/NVARCHAR/NTEXT data types.

2) Provide support for the new GUC nchar_collation, which gives the database
information about the default collation to be used for the new data types.

3) Create encoding conversion subroutines to convert strings between the
database encoding and UTF8 (from national strings to regular strings and back).
PostgreSQL already has all the required support (used for conversion between
the database encoding and client_encoding), so the amount of new code will be
minimal there.

4) Because all symbols from non-UTF8 encodings can be represented as UTF8 (but
the reverse is not true), comparison between the N* types and the regular
string types inside the database will be performed in UTF8 form. To achieve
this, new IMPLICIT casts may need to be created:
NCHAR -> CHAR
NVARCHAR -> VARCHAR
NTEXT -> TEXT
Casting in the reverse direction will be available too, but only as EXPLICIT.
However, these casts could fail if national strings cannot be represented in
the database encoding in use. All these casts will use the subroutines created
in 3). Casting/conversion between the N* types will follow the same
rules/mechanics as used for casting/conversion between the usual
(CHAR(N)/VARCHAR(N)/TEXT) string types.

5) Comparison between NATIONAL string values will be performed via specialized
UTF8-optimized functions (with respect to the nchar_collation setting).
6) Client input/output of NATIONAL strings - NATIONAL strings will respect the
client_encoding setting, and their values will be transparently converted to
the requested client_encoding before sending to (or receiving from) the client
(the same mechanics as used for the usual string types). So no mixed encoding
in client input/output will be supported/available.

7) Create a set of regression tests for these new data types.

Additional changes:

1) ECPG support for these new types
2) Support in the database drivers for the new data types

Rgds,
Arul Shaji

> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
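The asymmetry behind item 4 — the implicit N*-to-regular casts always succeed, while the explicit reverse casts can fail — mirrors the behaviour of the underlying encoding conversions. A standalone Python sketch of that asymmetry (illustrative only, not the proposed C code):

```python
# A national (UTF-8) value containing a character with no EUC_JP mapping.
national = "日本語 🙂"

# Implicit-cast direction: a UTF-8 representation always exists.
utf8_bytes = national.encode("utf-8")
print(len(utf8_bytes) > 0)

# Explicit-cast direction: conversion to a non-UTF8 database encoding can fail.
try:
    national.encode("euc_jp")
except UnicodeEncodeError:
    print("explicit cast fails: value not representable in EUC_JP")
```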
Re: [HACKERS] Proposal - Support for National Characters functionality
> On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule wrote:
> > Yes, what I know almost all use utf8 without problems. Long time I
> > didn't see any request for multi encoding support.
>
> Well, not *everything* can be represented as UTF-8; I think this is
> particularly an issue with Asian languages.
>
> If we chose to do it, I think that per-column encoding support would end up
> looking a lot like per-column collation support: it would be yet another
> per-column property along with typoid, typmod, and typcollation. I'm not
> entirely sure it's worth it, although FWIW I do believe Oracle has something
> like this.

Yes, the idea is that users will be able to declare columns of type NCHAR or
NVARCHAR which will use the pre-determined encoding. If we say that NCHAR is
UTF-8, then an NCHAR column will be in UTF-8 encoding irrespective of the
database encoding. It will be up to us to restrict which Unicode encodings we
want to support for NCHAR/NVARCHAR columns. This is based on my interpretation
of the SQL standard. As you allude to above, Oracle has similar behaviour (they
support UTF-16 as well). Support for UTF-16 will be difficult without linking
with an external library such as ICU.

> At any rate, it seems like quite a lot of work.

Thanks for putting my mind at ease ;-)

Rgds,
Arul Shaji

> Another idea would be to do something like what we do for range types
> - i.e. allow a user to declare a type that is a differently-encoded version
> of some base type. But even that seems pretty hard.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal - Support for National Characters functionality
> -----Original Message-----
> From: Claudio Freire [mailto:klaussfre...@gmail.com]
> Sent: Friday, 5 July 2013 3:41 PM
> To: Tatsuo Ishii
> Cc: Arulappan, Arul Shaji; PostgreSQL-Dev
> Subject: Re: [HACKERS] Proposal - Support for National Characters
> functionality
>
> On Fri, Jul 5, 2013 at 2:02 AM, Tatsuo Ishii wrote:
> >> - Support for NATIONAL_CHARACTER_SET GUC variable that will determine
> >> the encoding that will be used in NCHAR/NVARCHAR columns.
> >
> > You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's
> > encoding is fixed to UTF-8?
>
> Not only that, but I don't think it can be a GUC. Maybe a compile-time
> switch, but if it were a GUC, how do you handle an existing database in UTF-8
> when the setting is switched to UTF-16? Re-encode everything?
> Store the encoding along with each value? It's a mess.
>
> Either fix it at UTF-8, or make it a compile-time thing, I'd say.

Agreed that, to begin with, we only support UTF-8 encoding for NCHAR columns.
If that is the case, do we still need a compile-time option to turn the NCHAR
functionality on/off?

Rgds,
Arul Shaji
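Claudio's "it's a mess" point can be made concrete with a standalone Python sketch (illustrative only, not PostgreSQL code): bytes written under one encoding setting decode to garbage if the setting is later flipped without re-encoding all stored data.

```python
stored = "データ".encode("utf-8")  # value written while the setting said UTF-8

# Flip the hypothetical GUC to UTF-16 and reread the very same bytes:
misread = stored.decode("utf-16-le", errors="replace")
print(misread == "データ")  # False — the stored value is now garbage
```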
Re: [HACKERS] Proposal - Support for National Characters functionality
Ishii san,

Thank you for your positive and early response.

> -----Original Message-----
> From: Tatsuo Ishii [mailto:is...@postgresql.org]
> Sent: Friday, 5 July 2013 3:02 PM
> To: Arulappan, Arul Shaji
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Proposal - Support for National Characters
> functionality
>
> Arul Shaji,
>
> NCHAR support has been on our TODO list for some time, and I would like to
> welcome efforts trying to implement it. However I have a few questions:
>
> > This is a proposal to implement functionalities for the handling of
> > National Characters.
> >
> > [Introduction]
> >
> > The aim of this proposal is to eventually have a way to represent
> > 'National Characters' in a uniform way, even in non-UTF8 encoded
> > databases. Many of our customers in the Asian region who are now, as
> > part of their platform modernization, are moving away from mainframes
> > where they have used National Characters representation in COBOL and
> > other databases. Having stronger support for national characters
> > representation will also make it easier for these customers to look at
> > PostgreSQL more favourably when migrating from other well known RDBMSs
> > who all have varying degrees of NCHAR/NVARCHAR support.
> >
> > [Specifications]
> >
> > Broadly speaking, the national characters implementation ideally will
> > include the following
> > - Support for NCHAR/NVARCHAR data types
> > - Representing NCHAR and NVARCHAR columns in UTF-8 encoding in
> > non-UTF8 databases
>
> I think this is not a trivial work because we do not have a framework to
> allow mixed encodings in a database. I'm interested in how you are going to
> solve the problem.

I would be lying if I said I have the design already specced out. I will be
working on this in the coming weeks and hope to design a working solution in
consultation with the community.

> > - Support for UTF16 column encoding and representing NCHAR and
> > NVARCHAR columns in UTF16 encoding in all databases.
> Why do you need UTF-16 as the database encoding? UTF-8 is already supported,
> and any UTF-16 character can be represented in UTF-8 as far as I know.

Yes, that's correct. However, there are advantages in using UTF-16 encoding for
those characters that are always going to take at least two bytes to represent.
Having said that, my intention is to use UTF-8 for NCHAR as well. Supporting
UTF-16 will be even more complicated, as it is not supported natively on some
Linux platforms. I only included it to give an option.

> > - Support for NATIONAL_CHARACTER_SET GUC variable that will determine
> > the encoding that will be used in NCHAR/NVARCHAR columns.
>
> You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's
> encoding is fixed to UTF-8?

If we are going to support only UTF-8 for NCHAR, then we obviously don't need
the GUC variable.

Rgds,
Arul Shaji

> > The above points are at the moment a 'wishlist' only. Our aim is to
> > tackle them one-by-one as we progress. I will send a detailed proposal
> > later with more technical details.
> >
> > The main aim at the moment is to get some feedback on the above to
> > know if this feature is something that would benefit PostgreSQL in
> > general, and if users maintaining DBs in non-English speaking regions
> > will find this beneficial.
> >
> > Rgds,
> > Arul Shaji
> >
> > P.S.: It has been quite some time since I sent a correspondence to
> > this list. Our mail server adds a standard legal disclaimer to all
> > outgoing mails, which I know this list is not a huge fan of. I
> > used to have an exemption for the mails I send to this list. If the
> > disclaimer appears, apologies in advance. I will rectify that on the
> > next one.
> --
> Tatsuo Ishii
> SRA OSS, Inc.
> Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
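Ishii-san's point above — anything UTF-16 can hold, UTF-8 can hold too — amounts to a lossless round trip, which a standalone Python sketch (illustrative only, not PostgreSQL code) can verify, including for a supplementary character:

```python
# Every UTF-16 string, including supplementary characters, survives a
# round trip through UTF-8, so UTF-8 NCHAR storage loses nothing.
samples = ["国際化", "𠮟", "mixed ASCII と漢字"]
for s in samples:
    via_utf16 = s.encode("utf-16-be").decode("utf-16-be")
    via_utf8 = s.encode("utf-8").decode("utf-8")
    assert via_utf16 == via_utf8 == s
print("all round trips lossless")
```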
[HACKERS] Proposal - Support for National Characters functionality
This is a proposal to implement functionalities for the handling of National
Characters.

[Introduction]

The aim of this proposal is to eventually have a way to represent 'National
Characters' in a uniform way, even in non-UTF8 encoded databases. Many of our
customers in the Asian region are now, as part of their platform modernization,
moving away from mainframes where they have used National Characters
representation in COBOL and other databases. Having stronger support for
national characters representation will also make it easier for these customers
to look at PostgreSQL more favourably when migrating from other well known
RDBMSs, which all have varying degrees of NCHAR/NVARCHAR support.

[Specifications]

Broadly speaking, the national characters implementation will ideally include
the following:
- Support for NCHAR/NVARCHAR data types
- Representing NCHAR and NVARCHAR columns in UTF-8 encoding in non-UTF8
databases
- Support for UTF16 column encoding, representing NCHAR and NVARCHAR columns
in UTF16 encoding in all databases
- Support for a NATIONAL_CHARACTER_SET GUC variable that will determine the
encoding used in NCHAR/NVARCHAR columns

The above points are at the moment a 'wishlist' only. Our aim is to tackle them
one by one as we progress. I will send a detailed proposal later with more
technical details.

The main aim at the moment is to get some feedback on the above, to know if
this feature is something that would benefit PostgreSQL in general, and if
users maintaining DBs in non-English-speaking regions will find it beneficial.

Rgds,
Arul Shaji

P.S.: It has been quite some time since I sent a correspondence to this list.
Our mail server adds a standard legal disclaimer to all outgoing mails, which I
know this list is not a huge fan of. I used to have an exemption for the mails
I send to this list. If the disclaimer appears, apologies in advance. I will
rectify that on the next one.