Re: [HACKERS] Proposal - Collation at database level
Radek Strnad napsal(a): Zdenek Kotala wrote: Radek Strnad napsal(a): snip I'm thinking of dividing the problem into two parts - in beginning pg_collation will contain two functions. One will have hard-coded rules for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT, UCS_BASIC). It will compare each string character bitwise and guarantee that the implementation will meet the SQL standard implemented in PostgreSQL. Second one will allow the user to use installed system locales. The set of these collations will obviously vary between systems. Catalogs will contain encoding and collation for calling the system locale function. This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592 etc. if they will be availible. We will also need to change the way how strings are compared. Regarding the set database collation the right function will be used. http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675 This design will make possible switch to ICU or any other implementation quite simple and will not cause any major rewriting of what I'm coding right now. Collation function is main point here. How you mentioned one will be only wrapper about strcmp and second one about strcoll. (maybe you need four - char/wchar) Which function will be used it is defined in pg_collation catalog by CREATE COLLATION command. But you need specify name of locale for system locales. It means you need attribute for storing locale name. You're right. I've extended pg_collation for system locale columns. In the first stage we actually don't need any other catalogs such as encoding, etc. and we can build this functionality only on following pg_collation catalog. Used collation function (system or built-in) will be decided on existing collation name. CATALOG(pg_collations, ###) { NameDatacolname;/* collation name */ Oidcolschema;/* collation schema */ NameData colcharset;/* character set specification */ Oid colexistingcollation; /* existing collation */ boolcolpadattribute;/* pad attribute */ boolcolcasesensitive;/* case sensitive */ boolcolaccent;/* accent sensitive */ NameDatacolsyslccollate;/* lc_collate */ NameDatacolsyslcctype; /* lc_ctype */ regproccolfunc;/* used collation function */ } FormData_pg_collations; FormData_pg_collations; It would be good to send list of new and modified SQL commands (like CREATE COLLATION) for wide discussion. CREATE COLLATION collation name FOR character set specification FROM existing collation name [ pad characteristic ] [ case sensitive ] [ accent sensitive ] [ LC_COLLATE lc_collate ] [ LC_CTYPE lc_ctype ] pad characteristic := NO PAD | PAD SPACE case sensitive := CASE SENSITIVE | CASE INSENSITIVE accent sensitive := ACCENT SENSITIVE | ACCENT INSENSITIVE Since you can specify order by in select clause there's no need for adding ascending and descending type of collation. They will allways be ascending. DROP COLLATION collation name CREATE DATABASE ... [ COLLATE collation name ] ... ALTER DATABASE ... [ COLLATE collation name ] ... I think catalog is good. Maybe attributes names colsyslccollate and colsyslcctype should be more generic because they could be shared with ICU. But collation function should be specified in CREATE COLLATION command. Maybe CREATE COLLATION [STRCOL fn name] Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal - Collation at database level
Zdenek Kotala wrote: Radek Strnad napsal(a): snip I'm thinking of dividing the problem into two parts - in beginning pg_collation will contain two functions. One will have hard-coded rules for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT, UCS_BASIC). It will compare each string character bitwise and guarantee that the implementation will meet the SQL standard implemented in PostgreSQL. Second one will allow the user to use installed system locales. The set of these collations will obviously vary between systems. Catalogs will contain encoding and collation for calling the system locale function. This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592 etc. if they will be availible. We will also need to change the way how strings are compared. Regarding the set database collation the right function will be used. http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675 This design will make possible switch to ICU or any other implementation quite simple and will not cause any major rewriting of what I'm coding right now. Collation function is main point here. How you mentioned one will be only wrapper about strcmp and second one about strcoll. (maybe you need four - char/wchar) Which function will be used it is defined in pg_collation catalog by CREATE COLLATION command. But you need specify name of locale for system locales. It means you need attribute for storing locale name. You're right. I've extended pg_collation for system locale columns. In the first stage we actually don't need any other catalogs such as encoding, etc. and we can build this functionality only on following pg_collation catalog. Used collation function (system or built-in) will be decided on existing collation name. CATALOG(pg_collations, ###) { NameDatacolname;/* collation name */ Oidcolschema;/* collation schema */ NameData colcharset;/* character set specification */ Oid colexistingcollation; /* existing collation */ boolcolpadattribute;/* pad attribute */ boolcolcasesensitive;/* case sensitive */ boolcolaccent;/* accent sensitive */ NameDatacolsyslccollate;/* lc_collate */ NameDatacolsyslcctype; /* lc_ctype */ regproccolfunc;/* used collation function */ } FormData_pg_collations; FormData_pg_collations; It would be good to send list of new and modified SQL commands (like CREATE COLLATION) for wide discussion. CREATE COLLATION collation name FOR character set specification FROM existing collation name [ pad characteristic ] [ case sensitive ] [ accent sensitive ] [ LC_COLLATE lc_collate ] [ LC_CTYPE lc_ctype ] pad characteristic := NO PAD | PAD SPACE case sensitive := CASE SENSITIVE | CASE INSENSITIVE accent sensitive := ACCENT SENSITIVE | ACCENT INSENSITIVE Since you can specify order by in select clause there's no need for adding ascending and descending type of collation. They will allways be ascending. DROP COLLATION collation name CREATE DATABASE ... [ COLLATE collation name ] ... ALTER DATABASE ... [ COLLATE collation name ] ... Any thoughts? Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal - Collation at database level
Radek Strnad napsal(a): snip I'm thinking of dividing the problem into two parts - in beginning pg_collation will contain two functions. One will have hard-coded rules for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT, UCS_BASIC). It will compare each string character bitwise and guarantee that the implementation will meet the SQL standard implemented in PostgreSQL. Second one will allow the user to use installed system locales. The set of these collations will obviously vary between systems. Catalogs will contain encoding and collation for calling the system locale function. This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592 etc. if they will be availible. We will also need to change the way how strings are compared. Regarding the set database collation the right function will be used. http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675 This design will make possible switch to ICU or any other implementation quite simple and will not cause any major rewriting of what I'm coding right now. Collation function is main point here. How you mentioned one will be only wrapper about strcmp and second one about strcoll. (maybe you need four - char/wchar) Which function will be used it is defined in pg_collation catalog by CREATE COLLATION command. But you need specify name of locale for system locales. It means you need attribute for storing locale name. snip CATALOG(pg_collations, ###) { NameDatacolname;/* collation name */ Oid colschema; /* collation schema */ boolcolpadattribute;/* pad attribute */ boolcolcasesensitive; /* case sensitive */ boolcolaccent; /* accent sensitive */ regproc colfunc;/* used collation function */ Oid colrepertoire; /* collation repertoire */ } FormData_pg_collations; It would be good to send list of new and modified SQL commands (like CREATE COLLATION) for wide discussion. Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Proposal - Collation at database level
Hello, I'm working on implementation of collation at database level using system locales as a Google Summer of Code 2008 project. You can read my proposal on the wiki page - http://wiki.postgresql.org/wiki/Gsoc08-collation . I'm building this over Alexey Slynko's patch sent two years ago (http://www.activebait.net/msg00019.html). Currently I'm in stage of creating catalogs for collations, repertoires, encodings and charsets. Because of every single system is containing different locales we have to guarantee at least those specified in SQL standard. SQL 2003 standard can be downloaded at http://www.wiscorp.com/sql_2003_standard.zip (~18 MB). File 5wd-02-foundation-2003-09.pdf is specifying the foundation of collations. One or more collations must be specified out of these: — SQL_CHARACTER is an implementation-defined collation. It is applicable to the SQL_CHARACTER character repertoire. — GRAPHIC_IRV is a collation in which the ordering is determined by treating the code points defined by ISO 646:1991 as unsigned integers. It is applicable to the GRAPHIC_IRV character repertoire. — LATIN1 is a collation in which the ordering is determined by treating the code points defined by ISO 8859- 1 as unsigned integers. It is applicable to the LATIN1 character repertoire. — ISO8BIT is a collation in which the ordering is determined by treating the code points defined by ISO 8859-1 as unsigned integers. When restricted to the LATIN1 characters, it produces the same collation as LATIN1. It is applicable to the ISO8BIT character repertoire. — UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11 — The Unicode scalar value of a character is its code point treated as an unsigned integer. — UNICODE is the collation in which the ordering is determined by applying the Unicode Collation Algorithm with the Default Unicode Collation Element Table, as specified in [Unicode10]. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UNICODE collation is potentially applicable to every character set. — SQL_TEXT is an implementation-defined collation. It is applicable to the SQL_TEXT character repertoire. — SQL_IDENTIFIER is an implementation-defined collation. It is applicable to the SQL_IDENTIFIER character repertoire. I'm thinking of dividing the problem into two parts - in beginning pg_collation will contain two functions. One will have hard-coded rules for these basic collations (SQL_CHARACTER, GRAPHIC_IRV, LATIN1, ISO8BIT, UCS_BASIC). It will compare each string character bitwise and guarantee that the implementation will meet the SQL standard implemented in PostgreSQL. Second one will allow the user to use installed system locales. The set of these collations will obviously vary between systems. Catalogs will contain encoding and collation for calling the system locale function. This will allow us to use collations such as en_US.utf8, cs_CZ.iso88592 etc. if they will be availible. We will also need to change the way how strings are compared. Regarding the set database collation the right function will be used. http://doxygen.postgresql.org/varlena_8c.html#4c7af81f110f9be0bd8eb2bd99525675 This design will make possible switch to ICU or any other implementation quite simple and will not cause any major rewriting of what I'm coding right now. Catalogs specification with SQL 2003 standard SQL commands for creating tables follows: = pg_repertoires = CREATE TABLE CHARACTER_REPERTOIRES ( CHARACTER_REPERTOIRE_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER CONSTRAINT CHARACTER_REPERTOIRE_NAME_NOT_NULL NOT NULL, DEFAULT_COLLATION_CATALOG INFORMATION_SCHEMA.SQL_IDENTIFIER CONSTRAINT CHARACTER_REPERTOIRES_DEFAULT_COLLATION_CATALOG_NOT_NULL NOT NULL, DEFAULT_COLLATION_SCHEMA INFORMATION_SCHEMA.SQL_IDENTIFIER CONSTRAINT CHARACTER_REPERTOIRES_DEFAULT_COLLATION_SCHEMA_NOT_NULL NOT NULL, DEFAULT_COLLATION_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER CONSTRAINT CHARACTER_REPERTOIRES_DEFAULT_COLLATION_NAME_NOT_NULL NOT NULL, CONSTRAINT CHARACTER_REPERTOIRES_PRIMARY_KEY PRIMARY KEY ( CHARACTER_REPERTOIRE_NAME ), CONSTRAINT CHARACTER_REPERTOIRES_FOREIGN_KEY_COLLATIONS FOREIGN KEY ( DEFAULT_COLLATION_CATALOG, DEFAULT_COLLATION_SCHEMA, DEFAULT_COLLATION_NAME ) REFERENCES COLLATIONS ) CATALOG(pg_repertoires, ###) { NameDatarepname;/* repertoire name */ Oid repdefcolloid; /* default collation catalog */