Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Peter Geoghegan
On Thu, Aug 11, 2016 at 4:22 AM, Palle Girgensohn wrote: > But in your strxfrm code in PostgreSQL, the keys are cached, and represented > as int64:s if I remember correctly, so perhaps there is still a benefit > using the abbreviated keys? More testing is required, I guess...

Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Palle Girgensohn
> On 11 Aug 2016 at 11:15, Palle Girgensohn wrote: > >> >> On 11 Aug 2016 at 03:05, Peter Geoghegan wrote: >> >> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn >> wrote: >>> They've been used for the FreeBSD ports since 2005, and

Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Palle Girgensohn
> On 11 Aug 2016 at 03:05, Peter Geoghegan wrote: > > On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn wrote: >> They've been used for the FreeBSD ports since 2005, and have served us well. >> I have of course updated them regularly. In this latest

Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-10 Thread Peter Geoghegan
On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn wrote: > They've been used for the FreeBSD ports since 2005, and have served us well. > I have of course updated them regularly. In this latest version, I've removed > support for other encodings beside UTF-8, mostly since I

Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-10 Thread Palle Girgensohn
> On 4 Aug 2016 at 02:40, Bruce Momjian wrote: > > On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote: >> Yep, it does. But we've made little to no progress on integration of ICU >> support and AFAIK nobody's working on it right now. > > Uh, this email from July says

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Bruce Momjian
On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote: > Yep, it does. But we've made little to no progress on integration of ICU > support and AFAIK nobody's working on it right now.  Uh, this email from July says Peter Eisentraut will submit it in September :-)

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Craig Ringer
On 4 August 2016 at 05:00, Thomas Munro wrote: > On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer > wrote: > > On 3 August 2016 at 22:54, Álvaro Hernández Tortosa > wrote: > >> What would it take to support it? Isn't the

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa
On 03/08/16 21:42, Geoff Winkless wrote: On 3 August 2016 at 20:36, Álvaro Hernández Tortosa wrote: Isn't the correct syntax something like: select E'\uc080', U&'\c080'; ? It is a single character, 16 bit unicode sequence (see

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Thomas Munro
On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer wrote: > On 3 August 2016 at 22:54, Álvaro Hernández Tortosa wrote: >> What would it take to support it? Isn't the varlena header propagated >> everywhere, which could help infer the real length of the

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Geoff Winkless
On 3 August 2016 at 20:36, Álvaro Hernández Tortosa wrote: > Isn't the correct syntax something like: > > select E'\uc080', U&'\c080'; > > ? > > It is a single character, 16 bit unicode sequence (see >

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Tom Lane
Álvaro Hernández Tortosa writes: > According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout > the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 code > point. In short, the \u00 nul is represented (overlong encoding) by the >
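For readers following the overlong-encoding point, here is a minimal Python sketch (not PostgreSQL code) of how Modified UTF-8 represents U+0000 as the two bytes 0xC0 0x80, and why a strict UTF-8 decoder rejects that sequence. The helper names `to_modified_utf8`, `from_modified_utf8` and `strict_utf8_accepts` are hypothetical. The sketch covers only the NUL rule and ignores Modified UTF-8's surrogate-pair handling of supplementary characters.

```python
def to_modified_utf8(text: str) -> bytes:
    """Encode text as UTF-8, then swap each NUL byte for the overlong pair C0 80.

    In valid UTF-8 the byte 0x00 only ever encodes U+0000, so a plain
    byte-level replace is safe here.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def from_modified_utf8(data: bytes) -> str:
    """Reverse the NUL rule, then decode as ordinary UTF-8.

    The byte 0xC0 never occurs in valid UTF-8, so C0 80 cannot be
    confused with part of another character.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


def strict_utf8_accepts(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    encoded = to_modified_utf8("a\x00b")
    print(encoded.hex(" "))                 # hex dump of the encoded bytes
    print(strict_utf8_accepts(b"\xc0\x80")) # overlong NUL under strict UTF-8
```

The encoded form contains no 0x00 byte, which is the property that makes it attractive for NUL-terminated storage, at the cost of no longer being valid UTF-8.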

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa
On 03/08/16 21:31, Geoff Winkless wrote: On 3 August 2016 at 20:13, Álvaro Hernández Tortosa wrote: Yet they are accepted by Postgres (as if Postgres supported Modified UTF-8 intentionally). The character in psql does not render as a nul but as this symbol: "삀". Not

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Geoff Winkless
On 3 August 2016 at 20:13, Álvaro Hernández Tortosa wrote: > Yet they are accepted by Postgres > (as if Postgres supported Modified UTF-8 intentionally). The character > in psql does not render as a nul but as this symbol: "삀". Not accepted as valid utf8: # select
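The distinction being drawn here is between the code point U+C080 and the byte sequence C0 80. A small Python sketch, offered only as an illustration outside PostgreSQL, shows that the escape `\uc080` names a single Hangul-block character whose UTF-8 encoding is three bytes, none of them 0xC0. The helper `utf8_bytes_of` is a hypothetical name.

```python
HANGUL_SYLLABLES = range(0xAC00, 0xD7A3 + 1)  # Unicode Hangul Syllables block


def utf8_bytes_of(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point."""
    return chr(code_point).encode("utf-8")


if __name__ == "__main__":
    cp = 0xC080
    print(f"U+{cp:04X} in Hangul block: {cp in HANGUL_SYLLABLES}")
    print("UTF-8 bytes:", utf8_bytes_of(cp).hex(" "))
    # The escape "\uc080" denotes the code point, not the raw bytes C0 80.
    print("\uc080".encode("utf-8") == b"\xc0\x80")
```

So a literal written with a `\uc080` escape does not exercise the Modified UTF-8 overlong NUL at all; it stores an ordinary three-byte character.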

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa
On 03/08/16 20:14, Álvaro Hernández Tortosa wrote: On 03/08/16 17:47, Kevin Grittner wrote: On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa wrote: What would it take to support it? Would it be of any value to support "Modified UTF-8"?

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa
On 03/08/16 18:35, Geoff Winkless wrote: On 3 August 2016 at 15:54, Álvaro Hernández Tortosa wrote: Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're strictly non-compliant. It's perhaps worth mentioning that 0x00 is valid ASCII too, and PostgreSQL

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa
On 03/08/16 17:47, Kevin Grittner wrote: On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa wrote: What would it take to support it? Would it be of any value to support "Modified UTF-8"? https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 That's nice,

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa
On 03/08/16 17:23, Tom Lane wrote: Álvaro Hernández Tortosa writes: As has been previously discussed (see https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl for instance) varlena fields cannot accept the literal 0x00

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Craig Ringer
On 3 August 2016 at 22:54, Álvaro Hernández Tortosa wrote: > > Hi list. > > As has been previously discussed (see > https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl > for instance) varlena fields cannot accept the literal 0x00 value.

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Geoff Winkless
On 3 August 2016 at 15:54, Álvaro Hernández Tortosa wrote: > Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're > strictly non-compliant. It's perhaps worth mentioning that 0x00 is valid ASCII too, and PostgreSQL has never stored that either. If you want

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Peter Eisentraut
On 8/3/16 11:47 AM, Kevin Grittner wrote: > On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa > wrote: > >> What would it take to support it? > > Would it be of any value to support "Modified UTF-8"? > > https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 Will this

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Kevin Grittner
On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa wrote: > What would it take to support it? Would it be of any value to support "Modified UTF-8"? https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Tom Lane
Álvaro Hernández Tortosa writes: > As has been previously discussed (see > https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl > > for instance) varlena fields cannot accept the literal 0x00 value. Yup. > What

[HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa
Hi list. As has been previously discussed (see https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl for instance) varlena fields cannot accept the literal 0x00 value. Sure, you can use bytea, but this is hardly a good solution. The problem seems to be