subject:"Unicode grapheme clusters"

Re: Unicode grapheme clusters

2023-01-24 Thread Bruce Momjian

On Tue, Jan 24, 2023 at 11:40:01AM -0500, Greg Stark wrote:
> On Sat, 21 Jan 2023 at 13:17, Tom Lane  wrote:
> >
> > Probably our long-term answer is to avoid depending on wcwidth
> > and use wcswidth instead.  But it's hard to get excited about
> > doing the legwork for that until popular libc implementations
> > get it right.
> 
> Here's an interesting blog post about trying to do this in Rust:
> 
> https://tomdebruijn.com/posts/rust-string-length-width-calculations/
> 
> TL;DR... Even counting the number of graphemes isn't enough because
> terminals typically (but not always) display emoji graphemes using two
> columns.
> 
> At the end of the day Unicode kind of assumes a variable-width display
> where the rendering is handled by something that has access to the
> actual font metrics. So anything trying to line things up in columns
> in a way that works with any rendering system down the line using any
> font is going to be making a best guess.

Yes, good article, though I am still surprised this is not discussed
more often.  Anyway, for psql, we assume a fixed width output device, so
we can just assume that for computation.  You are right that Unicode
just doesn't seem to consider fixed width output cases and doesn't
provide much guidance.

Beyond psql, should we update our docs to say that character_length()
for Unicode returns the number of Unicode code points, and not
necessarily the number of displayed characters if grapheme clusters are
present?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.

Re: Unicode grapheme clusters

2023-01-24 Thread Isaac Morland

On Tue, 24 Jan 2023 at 11:40, Greg Stark  wrote:

>
> At the end of the day Unicode kind of assumes a variable-width display
> where the rendering is handled by something that has access to the
> actual font metrics. So anything trying to line things up in columns
> in a way that works with any rendering system down the line using any
> font is going to be making a best guess.
>

Really what is needed is another Unicode attribute: how many columns of a
monospaced display each character (or grapheme cluster) should take up. The
standard should include a precisely defined function that can take any
sequence of characters and give back its width in monospaced display
character spaces. Typefaces should only qualify as monospaced if they
respect this standard-defined computation.

Note that this is not actually a new thing: this was included in ASCII
implicitly, with a value of 1 for every character, and a value of n for
every n-character string. It has always been possible to line up values
displayed on monospaced displays by adding spaces, and it is only the
omission of this feature from Unicode which currently makes it impossible.
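
A minimal sketch of the padding step described above, assuming the caller already knows how many columns the text occupies (for ASCII that is just strlen(); for anything else it is the number the proposed standard function would have to supply). The name pad_cell() is hypothetical and used only for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy `text` into `out` and append spaces so the cell fills `column`
 * display columns.  `width` is the display width of `text` as reported
 * by whatever width function the caller trusts.
 */
void
pad_cell(char *out, size_t outsz, const char *text, int width, int column)
{
	int pad = column > width ? column - width : 0;

	/* "%*s" with an empty string emits exactly `pad` spaces */
	snprintf(out, outsz, "%s%*s", text, pad, "");
}
```

Alignment is only as good as the width passed in: if the width function and the display disagree, the padding is off by the difference.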

Re: Unicode grapheme clusters

2023-01-24 Thread Greg Stark

On Sat, 21 Jan 2023 at 13:17, Tom Lane  wrote:
>
> Probably our long-term answer is to avoid depending on wcwidth
> and use wcswidth instead.  But it's hard to get excited about
> doing the legwork for that until popular libc implementations
> get it right.

Here's an interesting blog post about trying to do this in Rust:

https://tomdebruijn.com/posts/rust-string-length-width-calculations/

TL;DR... Even counting the number of graphemes isn't enough because
terminals typically (but not always) display emoji graphemes using two
columns.

At the end of the day Unicode kind of assumes a variable-width display
where the rendering is handled by something that has access to the
actual font metrics. So anything trying to line things up in columns
in a way that works with any rendering system down the line using any
font is going to be making a best guess.

-- 
greg

Re: Unicode grapheme clusters

2023-01-21 Thread Bruce Momjian

On Sat, Jan 21, 2023 at 01:17:27PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > I just checked if wcswidth() would honor grapheme clusters, though
> > wcwidth() does not, but it seems wcswidth() treats characters just like
> > wcwidth():
> 
> Well, that's at least potentially fixable within libc, while wcwidth
> clearly can never do this right.
> 
> Probably our long-term answer is to avoid depending on wcwidth
> and use wcswidth instead.  But it's hard to get excited about
> doing the legwork for that until popular libc implementations
> get it right.

Agreed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.

Re: Unicode grapheme clusters

2023-01-21 Thread Tom Lane

Bruce Momjian  writes:
> I just checked if wcswidth() would honor grapheme clusters, though
> wcwidth() does not, but it seems wcswidth() treats characters just like
> wcwidth():

Well, that's at least potentially fixable within libc, while wcwidth
clearly can never do this right.

Probably our long-term answer is to avoid depending on wcwidth
and use wcswidth instead.  But it's hard to get excited about
doing the legwork for that until popular libc implementations
get it right.

regards, tom lane

Re: Unicode grapheme clusters

2023-01-21 Thread Bruce Momjian

On Sat, Jan 21, 2023 at 12:37:30PM -0500, Bruce Momjian wrote:
> Well, as one of the URLs I quoted said:
> 
>   This is by design. wcwidth() is utterly broken. Any terminal or
>   terminal application that uses it is also utterly broken. Forget
>   about emoji wcwidth() doesn't even work with combining characters,
>   zero width joiners, flags, and a whole bunch of other things.
> 
> So, either we have to find a function in the library that will do the
> looping over the string for us, or we need to identify the special
> Unicode characters that create grapheme clusters and handle them in our
> code.

I just checked if wcswidth() would honor grapheme clusters, though
wcwidth() does not, but it seems wcswidth() treats characters just like
wcwidth():

$ LANG=en_US.UTF-8 grapheme_test
wcswidth len=7

bytes_consumed=4, wcwidth len=2
bytes_consumed=4, wcwidth len=2
bytes_consumed=3, wcwidth len=0
bytes_consumed=3, wcwidth len=1
bytes_consumed=3, wcwidth len=0
bytes_consumed=4, wcwidth len=2

C test program attached.  This is on Debian 11.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.
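#define _XOPEN_SOURCE
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int
main (int argc, char *argv[])
{
	/* U+1F469 U+1F3FC U+200D U+2695 U+FE0F U+1FA7A */
	const char *cp = "👩🏼‍⚕️🩺";
	wchar_t wch[100];
	int i = 0;		/* byte offset into cp */
	
	setlocale(LC_ALL, "en_US.UTF-8");

	/* width of the whole string as reported by libc */
	mbstowcs(wch, cp, 100);
	printf("wcswidth len=%d\n\n", wcswidth(wch, 100));

	/* width of each code point, one multibyte character at a time */
	while (cp[i])
	{
		int res = mbtowc(wch, cp + i, MB_CUR_MAX);

		if (res <= 0)
			break;		/* invalid sequence: stop instead of looping forever */

		printf("bytes_consumed=%d, ", res);
	
		int len = wcwidth(wch[0]);
		printf("wcwidth len=%d\n", len);
		i += res;
	}

	return 0;
}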

Re: Unicode grapheme clusters

2023-01-21 Thread Bruce Momjian

On Sat, Jan 21, 2023 at 11:20:39AM -0500, Greg Stark wrote:
> On Fri, 20 Jan 2023 at 00:07, Pavel Stehule  wrote:
> >
> > I partially follow the progress in VTE, one of the widely used terminal
> > libraries, and I am very sceptical that there will be support in the
> > next two years.
> >
> > Maybe the new Microsoft terminal will give this area a new dynamic, but
> > currently only a few people on the planet are working on fixing or
> > enhancing terminal technologies. Unfortunately there is too much
> > historical ballast.
> 
> Fwiw this isn't really about terminal emulators. psql is also used to
> generate text files for reports or for display in various ways.
> 
> I think it's worth using whatever APIs we have available to implement
> better alignment for grapheme clusters and just assume whatever will
> eventually be used to display the output will display it "properly".
> 
> I do not think it's worth trying to implement this ourselves if the
> libraries aren't there yet. And I don't think it's worth trying to
> adapt to the current state of the current terminal. We don't know that
> that's the only place the output will be viewed and it'll all be
> wasted effort when the terminals eventually implement full support.

Well, as one of the URLs I quoted said:

This is by design. wcwidth() is utterly broken. Any terminal or
terminal application that uses it is also utterly broken. Forget
about emoji wcwidth() doesn't even work with combining characters,
zero width joiners, flags, and a whole bunch of other things.

So, either we have to find a function in the library that will do the
looping over the string for us, or we need to identify the special
Unicode characters that create grapheme clusters and handle them in our
code.
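
A rough sketch of the second option, assuming the string has already been decoded into code points. The ranges below cover only the cases named in the quote (combining marks, zero width joiners, flags) plus variation selectors and skin-tone modifiers; the real rules in Unicode's text segmentation annex are much longer, and the function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points that extend the cluster begun by an earlier code point. */
int
is_extender(uint32_t cp)
{
	return (cp >= 0x0300 && cp <= 0x036F) ||	/* combining diacritical marks */
		cp == 0x200D ||							/* zero width joiner */
		(cp >= 0xFE00 && cp <= 0xFE0F) ||		/* variation selectors */
		(cp >= 0x1F3FB && cp <= 0x1F3FF);		/* emoji skin-tone modifiers */
}

/* Regional indicators pair up to form flags. */
int
is_regional_indicator(uint32_t cp)
{
	return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/* Approximate number of grapheme clusters in cps[0..n-1]. */
size_t
count_clusters(const uint32_t *cps, size_t n)
{
	size_t clusters = 0;
	int ri_open = 0;		/* saw the first half of a flag */

	for (size_t i = 0; i < n; i++)
	{
		if (i > 0 && is_extender(cps[i]))
		{
			ri_open = 0;
			continue;		/* attaches to the previous cluster */
		}
		if (i > 0 && cps[i - 1] == 0x200D)
		{
			ri_open = 0;
			continue;		/* glued on by the preceding ZWJ */
		}
		if (is_regional_indicator(cps[i]))
		{
			if (ri_open)
			{
				ri_open = 0;
				continue;	/* second half of a flag */
			}
			ri_open = 1;
		}
		else
			ri_open = 0;
		clusters++;
	}
	return clusters;
}
```

For the six code points U+1F469 U+1F3FC U+200D U+2695 U+FE0F U+1FA7A this reports two clusters, which matches what a browser shows.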

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.

Re: Unicode grapheme clusters

2023-01-21 Thread Tom Lane

Greg Stark  writes:
> (If we were really crazy about this we could use terminal escape codes
> to query the current cursor position after emitting multicharacter
> graphemes. But as I said, I don't even think that would be useful,
> even if there weren't other reasons it would be a bad idea)

Yeah, use of a pager would be enough to break that.

regards, tom lane

Re: Unicode grapheme clusters

2023-01-21 Thread Pavel Stehule

On Sat, 21 Jan 2023 at 17:21, Greg Stark wrote:

> On Fri, 20 Jan 2023 at 00:07, Pavel Stehule wrote:
> >
> > I partially follow the progress in VTE, one of the widely used terminal
> > libraries, and I am very sceptical that there will be support in the
> > next two years.
> >
> > Maybe the new Microsoft terminal will give this area a new dynamic, but
> > currently only a few people on the planet are working on fixing or
> > enhancing terminal technologies. Unfortunately there is too much
> > historical ballast.
>
> Fwiw this isn't really about terminal emulators. psql is also used to
> generate text files for reports or for display in various ways.
>
> I think it's worth using whatever APIs we have available to implement
> better alignment for grapheme clusters and just assume whatever will
> eventually be used to display the output will display it "properly".
>
> I do not think it's worth trying to implement this ourselves if the
> libraries aren't there yet. And I don't think it's worth trying to
> adapt to the current state of the current terminal. We don't know that
> that's the only place the output will be viewed and it'll all be
> wasted effort when the terminals eventually implement full support.
>
> (If we were really crazy about this we could use terminal escape codes
> to query the current cursor position after emitting multicharacter
> graphemes. But as I said, I don't even think that would be useful,
> even if there weren't other reasons it would be a bad idea)
>

+1

Pavel

>
>
> --
> greg
>

Re: Unicode grapheme clusters

2023-01-21 Thread Greg Stark

On Fri, 20 Jan 2023 at 00:07, Pavel Stehule  wrote:
>
> I partially follow the progress in VTE, one of the widely used terminal
> libraries, and I am very sceptical that there will be support in the
> next two years.
>
> Maybe the new Microsoft terminal will give this area a new dynamic, but
> currently only a few people on the planet are working on fixing or
> enhancing terminal technologies. Unfortunately there is too much
> historical ballast.

Fwiw this isn't really about terminal emulators. psql is also used to
generate text files for reports or for display in various ways.

I think it's worth using whatever APIs we have available to implement
better alignment for grapheme clusters and just assume whatever will
eventually be used to display the output will display it "properly".

I do not think it's worth trying to implement this ourselves if the
libraries aren't there yet. And I don't think it's worth trying to
adapt to the current state of the current terminal. We don't know that
that's the only place the output will be viewed and it'll all be
wasted effort when the terminals eventually implement full support.

(If we were really crazy about this we could use terminal escape codes
to query the current cursor position after emitting multicharacter
graphemes. But as I said, I don't even think that would be useful,
even if there weren't other reasons it would be a bad idea)
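
For reference, the escape-code query mentioned above is the Device Status Report: the program writes "ESC [ 6 n" and the terminal answers with a Cursor Position Report of the form "ESC [ row ; col R". A minimal sketch of parsing that reply follows; the function name is hypothetical, and the part that puts the terminal into raw mode and reads the reply is deliberately left out.

```c
#include <stdio.h>

/*
 * Parse a Cursor Position Report ("ESC [ row ; col R").
 * Returns 1 and fills *row / *col on success, 0 otherwise.
 */
int
parse_cursor_report(const char *reply, int *row, int *col)
{
	char end;

	return sscanf(reply, "\x1b[%d;%d%c", row, col, &end) == 3 && end == 'R';
}
```

The column difference before and after emitting a grapheme would give its width on that one terminal only, which is part of why this is not a general answer.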

-- 
greg

Re: Unicode grapheme clusters

2023-01-19 Thread Pavel Stehule

On Fri, 20 Jan 2023 at 2:55, Bruce Momjian wrote:

> On Thu, Jan 19, 2023 at 07:53:43PM -0500, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > I am not sure what you are referring to above?  character_length?  I was
> > > talking about display length, and psql uses that --- at some point, our
> > > lack of support for graphemes will cause psql to not align columns.
> >
> > That's going to happen regardless, as long as we can't be sure
> > what the display will do with the characters --- and that's a
> > problem that will persist for a very long time.
> >
> > Ideally, yeah, it'd be great if all this stuff rendered perfectly;
> > but IMO it's so far outside mainstream usage of psql that it's
> > not something that could possibly repay the investment of time
> > to get even a partial solution.
>
> We have a few options:
>
> *  TODO item
> *  document psql works that way
> *  do nothing
>
> I think the big question is how common such cases will be in the future.
> The report from 2022, and one from 2019 didn't seem to clearly outline
> the issue so it would be good to have something documented somewhere.
>

There can be a note in psql documentation like "Unicode grapheme clusters
are not supported yet. It is not well supported by other necessary software
like terminal emulators and curses libraries".

I partially follow the progress in VTE, one of the widely used terminal
libraries, and I am very sceptical that there will be support in the next
two years.

Maybe the new Microsoft terminal will give this area a new dynamic, but
currently only a few people on the planet are working on fixing or
enhancing terminal technologies. Unfortunately there is too much
historical ballast.

Regards

Pavel


> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
> Embrace your flaws.  They make you human, rather than perfect,
> which you will never be.
>
>
>

Re: Unicode grapheme clusters

2023-01-19 Thread Bruce Momjian

On Thu, Jan 19, 2023 at 07:53:43PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > I am not sure what you are referring to above?  character_length?  I was
> > talking about display length, and psql uses that --- at some point, our
> > lack of support for graphemes will cause psql to not align columns.
> 
> That's going to happen regardless, as long as we can't be sure
> what the display will do with the characters --- and that's a
> problem that will persist for a very long time.
> 
> Ideally, yeah, it'd be great if all this stuff rendered perfectly;
> but IMO it's so far outside mainstream usage of psql that it's
> not something that could possibly repay the investment of time
> to get even a partial solution.

We have a few options:

*  TODO item
*  document psql works that way
*  do nothing

I think the big question is how common such cases will be in the future.
The report from 2022, and one from 2019 didn't seem to clearly outline
the issue so it would be good to have something documented somewhere.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.

Re: Unicode grapheme clusters

2023-01-19 Thread Tom Lane

Bruce Momjian  writes:
> I am not sure what you are referring to above?  character_length?  I was
> talking about display length, and psql uses that --- at some point, our
> lack of support for graphemes will cause psql to not align columns.

That's going to happen regardless, as long as we can't be sure
what the display will do with the characters --- and that's a
problem that will persist for a very long time.

Ideally, yeah, it'd be great if all this stuff rendered perfectly;
but IMO it's so far outside mainstream usage of psql that it's
not something that could possibly repay the investment of time
to get even a partial solution.

regards, tom lane

Re: Unicode grapheme clusters

2023-01-19 Thread Bruce Momjian

On Thu, Jan 19, 2023 at 07:37:48PM -0500, Greg Stark wrote:
> This is how we've always documented it. Postgres treats code points as
> "characters" not graphemes.
> 
> You don't need to go to anything as esoteric as emojis to see this either.
> Accented characters like é also have equivalent forms that are multiple
> code points and in some character sets some accented characters can only
> be represented that way.
> 
> But I don't think there's any reason to consider changing the existing
> functions. They have to be consistent with substr and the other string
> manipulation functions.
> 
> We could add new functions to work with graphemes but it might bring more
> pain keeping it up to date.

I am not sure what you are referring to above?  character_length?  I was
talking about display length, and psql uses that --- at some point, our
lack of support for graphemes will cause psql to not align columns.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.

Re: Unicode grapheme clusters

2023-01-19 Thread Greg Stark

This is how we've always documented it. Postgres treats code points as
"characters" not graphemes.

You don't need to go to anything as esoteric as emojis to see this either.
Accented characters like é also have equivalent forms that are multiple
code points and in some character sets some accented characters can only be
represented that way.
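
To make the é case concrete, here is a small sketch that decodes UTF-8 into code points. It assumes valid UTF-8 input and does no error checking; utf8_decode() is a hypothetical helper, not a Postgres function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode valid UTF-8 from `s` into `out`, writing at most `max` code
 * points.  Returns the number of code points written.
 */
size_t
utf8_decode(const char *s, uint32_t *out, size_t max)
{
	size_t n = 0;

	while (*s && n < max)
	{
		unsigned char c = (unsigned char) *s++;
		uint32_t cp;
		int extra;			/* continuation bytes still to read */

		if (c < 0x80)      { cp = c;        extra = 0; }
		else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }
		else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }
		else               { cp = c & 0x07; extra = 3; }

		while (extra-- > 0 && *s)
			cp = (cp << 6) | ((unsigned char) *s++ & 0x3F);

		out[n++] = cp;
	}
	return n;
}
```

The precomposed é (bytes c3 a9) decodes to the single code point U+00E9, while "e" followed by the combining acute accent (bytes 65 cc 81) decodes to U+0065 U+0301: one displayed character, two code points.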

But I don't think there's any reason to consider changing the existing
functions. They have to be consistent with substr and the other string
manipulation functions.

We could add new functions to work with graphemes but it might bring more
pain keeping it up to date.

Re: Unicode grapheme clusters

2023-01-19 Thread Bruce Momjian

On Thu, Jan 19, 2023 at 02:44:57PM +0100, Pavel Stehule wrote:
> Surely it should be fixed. Unfortunately - all the terminals that I can use
> don't support it. So at this moment it may be premature to fix it, because the
> visual form will still be broken.

Yes, none of my terminal emulators handle grapheme clusters either.  In
fact, viewing this email messed up my screen and I had to use control-L
to fix it.

I think one big problem is that our Unicode library doesn't have any way
I know of to query the display device to determine how it
supports/renders Unicode characters, so any display width we report
could be wrong.

Oddly, it seems grapheme clusters were added in Unicode 3.2, which came
out in 2002:

https://www.unicode.org/reports/tr28/tr28-3.html
https://www.quora.com/What-is-graphemeCluster

but somehow I am only studying them now.

Anyway, I added a psql item for this so we don't forget about it:

https://wiki.postgresql.org/wiki/Todo#psql

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.

Re: Unicode grapheme clusters

2023-01-19 Thread Pavel Stehule

On Thu, 19 Jan 2023 at 1:20, Bruce Momjian wrote:

> Just my luck, I had to dig into a two-"character" emoji that came to me
> as part of a Google Calendar entry --- here it is:
>
> 👩🏼‍⚕️🩺
>
>                         libc
> Unicode  UTF8           len
> U+1F469  f0 9f 91 a9    2   woman
> U+1F3FC  f0 9f 8f bc    2   emoji modifier fitzpatrick type-3 (skin tone)
> U+200D   e2 80 8d       0   zero width joiner (ZWJ)
> U+2695   e2 9a 95       1   staff with snake
> U+FE0F   ef b8 8f       0   variation selector-16 (VS16) (previous character as emoji)
> U+1FA7A  f0 9f a9 ba    2   stethoscope
>
> Now, in Debian 11 character apps like vi, I see:
>
>   a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)
>
> Display widths are in parentheses.  I also see '<200d>' in blue.
>
> In current Firefox, I see a woman with a stethoscope around her neck,
> and then a stethoscope.  Copying the Unicode string above into a browser
> URL bar should show you the same thing, though it might be too small to
> see.
>
> For those looking for details on how these should be handled, see this
> for an explanation of grapheme clusters that use things like skin tone
> modifiers and zero-width joiners:
>
> https://tonsky.me/blog/emoji/
>
> These comments explain the confusion of the term character:
>
>
> https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
>
> and I think this comment summarizes it well:
>
>
> https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237
>
> This is by design. wcwidth() is utterly broken. Any terminal or
> terminal application that uses it is also utterly broken. Forget
> about emoji wcwidth() doesn't even work with combining characters,
> zero width joiners, flags, and a whole bunch of other things.
>
> I decided to see how Postgres, without ICU, handles it:
>
> show lc_ctype;
>   lc_ctype
> -------------
>  en_US.UTF-8
>
> select octet_length('👩🏼‍⚕️🩺');
>  octet_length
> --------------
>            21
>
> select character_length('👩🏼‍⚕️🩺');
>  character_length
> ------------------
>                 6
>
> The octet_length() is verified as correct by counting the UTF8 bytes
> above.  I think character_length() is correct if we consider the number
> of Unicode characters, display and non-display.
>
> I then started looking at how Postgres computes and uses _display_
> width.  The display width, when properly processed like by Firefox, is 4
> (two double-wide displayed characters.)  Based on the libc display
> lengths above and incorrect displayed character lengths in Debian 11, it
> would be 7.
>
> libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
> the per-encoding width function stored in pg_wchar_table.dsplen --- for
> UTF8, the function is pg_utf_dsplen().
>
> There is no SQL API for display length, but PQdsplen() can be exercised
> on a whole string by calling pg_wcswidth() from the gdb debugger:
>
> pg_wcswidth(const char *pwcs, size_t len, int encoding)
> UTF8 encoding == 6
>
> (gdb) print (int)pg_wcswidth("abcd", 4, 6)
> $8 = 4
> (gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
> $9 = 7
>
> Here is the psql output:
>
> SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺', character_length('👩🏼‍⚕️🩺');
>  octet_length | ?column? | character_length
> --------------+----------+------------------
>            21 | 👩🏼‍⚕️🩺  |                6
>
> More often called from psql are pg_wcssize() and pg_wcsformat(), which
> also call PQdsplen().
>
> I think the question is whether we want to report a string width that
> assumes the display doesn't understand the more complex UTF8
> controls/"characters" listed above.
>
> tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
> pg_wchar_table.dsplen.  p_isspecial() also has a small table of what it
> calls "strange_letter".
>
> Here is a report about Unicode variation selector and combining
> characters from May, 2022:
>
>
> https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp
>
> Is this something people want improved?
>

Surely it should be fixed. Unfortunately - all the terminals that I can use
don't support it. So at this moment it may be premature to fix it, because
the visual form will still be broken.

Regards

Pavel


> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
> Embrace your flaws.  They make you human, rather than perfect,
> which you will never be.
>
>
>

Unicode grapheme clusters

2023-01-18 Thread Bruce Momjian

Just my luck, I had to dig into a two-"character" emoji that came to me
as part of a Google Calendar entry --- here it is:

👩🏼‍⚕️🩺

                        libc
Unicode  UTF8           len
U+1F469  f0 9f 91 a9    2   woman
U+1F3FC  f0 9f 8f bc    2   emoji modifier fitzpatrick type-3 (skin tone)
U+200D   e2 80 8d       0   zero width joiner (ZWJ)
U+2695   e2 9a 95       1   staff with snake
U+FE0F   ef b8 8f       0   variation selector-16 (VS16) (previous character as emoji)
U+1FA7A  f0 9f a9 ba    2   stethoscope

Now, in Debian 11 character apps like vi, I see:

  a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)

Display widths are in parentheses.  I also see '<200d>' in blue.

In current Firefox, I see a woman with a stethoscope around her neck,
and then a stethoscope.  Copying the Unicode string above into a browser
URL bar should show you the same thing, though it might be too small to
see.

For those looking for details on how these should be handled, see this
for an explanation of grapheme clusters that use things like skin tone
modifiers and zero-width joiners:

https://tonsky.me/blog/emoji/

These comments explain the confusion of the term character:


https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

and I think this comment summarizes it well:

https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237

This is by design. wcwidth() is utterly broken. Any terminal or terminal
application that uses it is also utterly broken. Forget about emoji
wcwidth() doesn't even work with combining characters, zero width
joiners, flags, and a whole bunch of other things.

I decided to see how Postgres, without ICU, handles it:

show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8

select octet_length('👩🏼‍⚕️🩺');
 octet_length
--------------
           21

select character_length('👩🏼‍⚕️🩺');
 character_length
------------------
                6

The octet_length() is verified as correct by counting the UTF8 bytes
above.  I think character_length() is correct if we consider the number
of Unicode characters, display and non-display.
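
As a cross-check of those two numbers, here is a minimal sketch that counts bytes and code points directly from the UTF-8 bytes in the table above. It assumes valid UTF-8 and is not how Postgres implements either function; the names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count code points in a valid UTF-8 string: every byte that is not a
 * continuation byte (10xxxxxx) starts a new code point.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char) *s & 0xC0) != 0x80)
			n++;
	return n;
}

/* The six code points from the table, spelled out as UTF-8 bytes. */
const char sample[] =
	"\xf0\x9f\x91\xa9"		/* U+1F469 woman */
	"\xf0\x9f\x8f\xbc"		/* U+1F3FC skin tone modifier */
	"\xe2\x80\x8d"			/* U+200D zero width joiner */
	"\xe2\x9a\x95"			/* U+2695 staff with snake */
	"\xef\xb8\x8f"			/* U+FE0F variation selector-16 */
	"\xf0\x9f\xa9\xba";		/* U+1FA7A stethoscope */
```

strlen(sample) gives 21 and utf8_codepoints(sample) gives 6, matching octet_length() and character_length().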

I then started looking at how Postgres computes and uses _display_
width.  The display width, when properly processed like by Firefox, is 4
(two double-wide displayed characters.)  Based on the libc display
lengths above and incorrect displayed character lengths in Debian 11, it
would be 7.
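
The gap between 7 and 4 can be sketched as follows, taking the per-code-point libc widths from the table above as input. The joining rule here is a deliberate simplification (ZWJ, variation selectors and skin-tone modifiers only), the function names are hypothetical, and real renderers may differ, for example by widening a character that VS16 turns into an emoji.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points that attach to the preceding one (simplified list). */
int
attaches_to_previous(uint32_t cp)
{
	return cp == 0x200D ||						/* zero width joiner */
		(cp >= 0xFE00 && cp <= 0xFE0F) ||		/* variation selectors */
		(cp >= 0x1F3FB && cp <= 0x1F3FF);		/* skin-tone modifiers */
}

/* What a wcwidth()-per-code-point loop reports. */
int
naive_width(const int *widths, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
		total += widths[i] > 0 ? widths[i] : 0;
	return total;
}

/* Only the first code point of each cluster contributes its width. */
int
cluster_width(const uint32_t *cps, const int *widths, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
	{
		int joined = i > 0 && cps[i - 1] == 0x200D;

		if ((i > 0 && attaches_to_previous(cps[i])) || joined)
			continue;
		total += widths[i] > 0 ? widths[i] : 0;
	}
	return total;
}
```

With the code points and widths from the table (2, 2, 0, 1, 0, 2), naive_width() returns 7 and cluster_width() returns 4.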

libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
the per-encoding width function stored in pg_wchar_table.dsplen --- for
UTF8, the function is pg_utf_dsplen().

There is no SQL API for display length, but PQdsplen() can be exercised
on a whole string by calling pg_wcswidth() from the gdb debugger:

pg_wcswidth(const char *pwcs, size_t len, int encoding)
UTF8 encoding == 6

(gdb) print (int)pg_wcswidth("abcd", 4, 6)
$8 = 4
(gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
$9 = 7

Here is the psql output:

SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺', character_length('👩🏼‍⚕️🩺');
 octet_length | ?column? | character_length
--------------+----------+------------------
           21 | 👩🏼‍⚕️🩺  |                6

More often called from psql are pg_wcssize() and pg_wcsformat(), which
also call PQdsplen().

I think the question is whether we want to report a string width that
assumes the display doesn't understand the more complex UTF8
controls/"characters" listed above.
 
tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
pg_wchar_table.dsplen.  p_isspecial() also has a small table of what it
calls "strange_letter".

Here is a report about Unicode variation selector and combining
characters from May, 2022:


https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp

Is this something people want improved?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.
