Re: [Development] Why can't QString use UTF-8 internally?

2015-02-12 Thread Konstantin Ritt
2015-02-12 13:11 GMT+04:00 Rutledge Shawn shawn.rutle...@theqtcompany.com: Consequently we have to do conversion each time we need the renderable text, and/or cache the results to avoid converting repeatedly. Right? Pnrftm... what? Cache what? And where? I've missed the point... And we

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-12 Thread Rutledge Shawn
On 12 Feb 2015, at 08:55, Konstantin Ritt ritt...@gmail.com wrote: 2015-02-12 11:53 GMT+04:00 Konstantin Ritt ritt...@gmail.com: 2015-02-12 11:39 GMT+04:00 Rutledge Shawn shawn.rutle...@theqtcompany.com: On 11 Feb 2015, at 18:15, Konstantin Ritt ritt...@gmail.com wrote: FYI: Unicode

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Christoph Feck
On Wednesday 11 February 2015 17:20:04 Guido Seifert wrote: Minor OT, but I am too curious... do you have an example? Are there really cases where turning lower case into upper case or vice versa changes the length of a string? oﬃce (4 code points) → OFFICE (6 code points)

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Matthew Woehlke
On 2015-02-11 11:29, Thiago Macieira wrote: On Wednesday 11 February 2015 11:22:59 Julien Blanc wrote: On 11/02/2015 10:32, Bo Thorsen wrote: 2) length() returns the number of chars I see on the screen, not a random implementation detail of the chosen encoding. How’s that supposed to work

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Thiago Macieira
On Wednesday 11 February 2015 10:32:22 Mark Gaiser wrote: Have you tried to uppercase or lowercase a string using only the Standard Library? std::string s("hello"); std::transform(s.begin(), s.end(), s.begin(), ::toupper); and std::transform(s.begin(), s.end(), s.begin(), ::tolower);

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Guido Seifert
Yes, and he already said such example, ß becomes SS The other example that was given is 'i' (UTF-8 0x69) becoming 'İ' under a Turkish locale (UTF-8 0xc4 0xb0). Ah sorry. I was too focused on the visible length. 'i' = 'İ' = 1. But of course I have to look at the memory usage in the

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Tomaz Canabrava
On Wed, Feb 11, 2015 at 2:20 PM, Guido Seifert warg...@gmx.de wrote: Minor OT, but I am too curious... do you have an example? Are there really cases where turning lower case into upper case or vice versa changes the length of a string? Yes, and he already said such example, ß becomes SS

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Guido Seifert
Minor OT, but I am too curious... do you have an example? Are there really cases where turning lower case into upper case or vice versa changes the length of a string? Guido std::string s("hello"); std::transform(s.begin(), s.end(), s.begin(), ::toupper); and std::transform(s.begin(),

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Thiago Macieira
On Wednesday 11 February 2015 11:22:59 Julien Blanc wrote: On 11/02/2015 10:32, Bo Thorsen wrote: 2) length() returns the number of chars I see on the screen, not a random implementation detail of the chosen encoding. How’s that supposed to work with combining characters, which are part of

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Konstantin Ritt
2015-02-11 20:35 GMT+04:00 Thiago Macieira thiago.macie...@intel.com: There are probably more examples. ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Konstantin Ritt
FYI: Unicode codepoint != character visual representation. Moreover, a single character could be represented with a sequence of glyphs or vice versa - a sequence of characters could be represented with a single glyph. QString (and every other Unicode string class in the world) represents a

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Christoph Feck
On Wednesday 11 February 2015 17:23:51 Christoph Feck wrote: On Wednesday 11 February 2015 17:20:04 Guido Seifert wrote: Minor OT, but I am too curious... do you have an example? Are there really cases where turning lower case into upper case or vice versa changes the length of a string?

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Daniel Teske
On Wednesday 11 Feb 2015 17:20:04 Guido Seifert wrote: Minor OT, but I am too curious... do you have an example? Are there really cases where turning lower case into upper case or vice versa changes the length of a string? What is uppercase ß? daniel

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Konstantin Ritt
So forget proposing QString to operate on visual or logical glyphs. There is QTextBoundaryFinder class that operates on logical items, and QFontMetrics that operates on visual glyphs. Regards, Konstantin 2015-02-11 21:59 GMT+04:00 Matthew Woehlke mw_tr...@users.sourceforge.net: On 2015-02-11

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Marc Mutz
On Wednesday 11 February 2015 17:28:43 Daniel Teske wrote: On Wednesday 11 Feb 2015 17:20:04 Guido Seifert wrote: Minor OT, but I am too curious... do you have an example? Are there really cases where turning lower case into upper case or vice versa changes the length of a string? What is

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Thiago Macieira
On Wednesday 11 February 2015 18:26:40 Guido Seifert wrote: Yes, and he already said such example, ß becomes SS The other example that was given is 'i' (UTF-8 0x69) becoming 'İ' under a Turkish locale (UTF-8 0xc4 0xb0). Ah sorry. I was too focused on the visible length. 'i' = 'İ' = 1.

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Rutledge Shawn
On 11 Feb 2015, at 18:15, Konstantin Ritt ritt...@gmail.com wrote: FYI: Unicode codepoint != character visual representation. Moreover, a single character could be represented with a sequence of glyphs or vice versa - a sequence of characters could be represented with a single glyph.

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Konstantin Ritt
2015-02-12 11:39 GMT+04:00 Rutledge Shawn shawn.rutle...@theqtcompany.com: On 11 Feb 2015, at 18:15, Konstantin Ritt ritt...@gmail.com wrote: FYI: Unicode codepoint != character visual representation. Moreover, a single character could be represented with a sequence of glyphs or vice versa

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Konstantin Ritt
2015-02-12 11:53 GMT+04:00 Konstantin Ritt ritt...@gmail.com: 2015-02-12 11:39 GMT+04:00 Rutledge Shawn shawn.rutle...@theqtcompany.com: On 11 Feb 2015, at 18:15, Konstantin Ritt ritt...@gmail.com wrote: FYI: Unicode codepoint != character visual representation. Moreover, a single

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Mathias Hasselmann
Am 11.02.2015 um 10:11 schrieb Marc Mutz: You overlooked where a corresponding character exists. Either uppercase ß exists (it does, it was found in an old printing, so there's a movement to adopt it, except Unicode doesn't have it), then it's not a problem, or it doesn't (as is the case in

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Olivier Goffart
On Wednesday 11 February 2015 10:32:31 Bo Thorsen wrote: This would make me very unhappy. I'm doing a customer project right now that uses std::string all over the place and there is real pain involved in this. It's an almost empty layer over char* and brings none of the features of QString.

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Julien Blanc
On 11/02/2015 10:32, Bo Thorsen wrote: 2) length() returns the number of chars I see on the screen, not a random implementation detail of the chosen encoding. How’s that supposed to work with combining characters, which are part of Unicode? 3) at(int) and [] gives the unicode char, not a

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Marc Mutz
On Wednesday 11 February 2015 02:22:45 Thiago Macieira wrote: charT do_toupper(charT c) const; const charT* do_toupper(charT* low, const charT* high) const; Effects: Converts a character or characters to upper case. The second form replaces each character *p in the range [low,high)

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Bo Thorsen
Den 10-02-2015 kl. 23:17 skrev Allan Sandfeld Jensen: On Tuesday 10 February 2015, Oswald Buddenhagen wrote: On Wed, Feb 11, 2015 at 12:37:41AM +0400, Konstantin Ritt wrote: Yes, that would be an ideal solution. Unfortunately, that would also break a LOT of existing code. i was thinking of

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Mark Gaiser
On Wed, Feb 11, 2015 at 12:33 AM, Thiago Macieira thiago.macie...@intel.com wrote: On Tuesday 10 February 2015 23:17:21 Allan Sandfeld Jensen wrote: Maybe with C++11 we don't need QString that much anymore. Use std::string with UTF8 and std::u32string for UCS4. For Qt6 it would be worth

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Olivier Goffart
On Tuesday 10 February 2015 17:22:45 Thiago Macieira wrote: Because unlike std::vector, std::basic_string is woefully inadequate compared to QString and QByteArray. I just mentioned the easy cases, but a quick check shows how much more is lacking. I rest my case. QString will be there at

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-11 Thread Marc Mutz
On Wednesday 11 February 2015 11:11:36 Olivier Goffart wrote: "UB could kick in" has no meaning. In practice there is no reason why casting a pointer to member function to remove the const would not work. Yet, you would not accept it[1]. Data races are undefined behavior according to the

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Wednesday 11 February 2015 01:38:12 Olivier Goffart wrote: Eh... have you tried to convert a UTF-8 or UTF-16 or UCS-4 string to the locale's narrow character set without using QString? with std::ctype::tonarrow? That's std::ctype::narrow, which I didn't realise existed until now. But I

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Wednesday 11 February 2015 01:59:40 Olivier Goffart wrote: Unless it is a buffer of std::atomic, it is an undefined behavior, so not only the contents of the buffer is unpredictable, but anything, really. (A sufficiently smart conforming compiler could see that you are writing at the same

[Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Rutledge Shawn
On Feb 10, 2015, at 17:08, Julien Blanc julien.bl...@nmc-company.com wrote: On 10/02/2015 16:33, Knoll Lars wrote: IMO there’s simply too many questions that this one example doesn’t answer to conclude that what we are doing is bad. Two arguments : - implicit sharing is convenient, and

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Konstantin Ritt
16 bits is completely enough for most spoken languages (see the Unicode's Blocks.txt and/or Scripts.txt for an approximated list), whereas 8 bits encoding only covers ASCII. Despite what http://utf8everywhere.org/#conclusions says, UTF-16 is not the worst choice; it is a trade-off between the

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Tuesday 10 February 2015 13:26:50 Thiago Macieira wrote: But given the choice, I would choose to do nothing. Instead, I have a patch pending for Qt 6 that caches the Latin1 version of the QString in an extra block past the UTF-16 data. Sorry, I remembered wrong. I have a patch that sets a

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Konstantin Ritt
2015-02-11 1:26 GMT+04:00 Thiago Macieira thiago.macie...@intel.com: On Wednesday 11 February 2015 00:37:41 Konstantin Ritt wrote: Yes, that would be an ideal solution. Unfortunately, that would also break a LOT of existing code. In Qt4 times, I was doing some experiments with the QString

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Oswald Buddenhagen
On Wed, Feb 11, 2015 at 12:37:41AM +0400, Konstantin Ritt wrote: Yes, that would be an ideal solution. Unfortunately, that would also break a LOT of existing code. i was thinking of making it explicit with a smooth migration path - add QUtf8String (basically QByteArray, but don't permit

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Wednesday 11 February 2015 00:37:41 Konstantin Ritt wrote: Yes, that would be an ideal solution. Unfortunately, that would also break a LOT of existing code. In Qt4 times, I was doing some experiments with the QString adaptive storage (similar to what NSString does behind the scenes). I've

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Konstantin Ritt
Err, s/utf8Data/utf16Data/ Regards, Konstantin 2015-02-11 1:52 GMT+04:00 Konstantin Ritt ritt...@gmail.com: 2015-02-11 1:26 GMT+04:00 Thiago Macieira thiago.macie...@intel.com: On Wednesday 11 February 2015 00:37:41 Konstantin Ritt wrote: Yes, that would be an ideal solution.

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Tuesday 10 February 2015 22:58:58 Konstantin Ritt wrote: 16 bits is completely enough for most spoken languages (see the s/most/all/ All *living* languages are encoded in the BMP. The SMP and other planes contain only dead languages (Egyptian hieroglyphs, Linear A, Linear B, etc.), plus

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Konstantin Ritt
Yes, that would be an ideal solution. Unfortunately, that would also break a LOT of existing code. In Qt4 times, I was doing some experiments with the QString adaptive storage (similar to what NSString does behind the scenes). Konstantin 2015-02-11 0:22 GMT+04:00 Oswald Buddenhagen

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Oswald Buddenhagen
On Tue, Feb 10, 2015 at 10:58:58PM +0400, Konstantin Ritt wrote: Despite what http://utf8everywhere.org/#conclusions says, UTF-16 is not the worst choice; it is a trade-off between the performance and the memory consumption in the most-common use case (spoken languages and mixed scripts).

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Wednesday 11 February 2015 01:52:34 Konstantin Ritt wrote: 2015-02-11 1:26 GMT+04:00 Thiago Macieira thiago.macie...@intel.com: On Wednesday 11 February 2015 00:37:41 Konstantin Ritt wrote: Yes, that would be an ideal solution. Unfortunately, that would also break a LOT of

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Olivier Goffart
On Tuesday 10 February 2015 23:17:21 Allan Sandfeld Jensen wrote: Maybe with C++11 we don't need QString that much anymore. Use std::string with UTF8 and std::u32string for UCS4. For Qt6 it would be worth considering how many of our classes still makes sense. Those we want CoW semantics on

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Konstantin Ritt
Can QChar represent a 32 bits codepoint, then? Regards, Konstantin 2015-02-11 2:11 GMT+04:00 Thiago Macieira thiago.macie...@intel.com: On Wednesday 11 February 2015 01:52:34 Konstantin Ritt wrote: 2015-02-11 1:26 GMT+04:00 Thiago Macieira thiago.macie...@intel.com: On Wednesday 11

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Allan Sandfeld Jensen
On Tuesday 10 February 2015, Oswald Buddenhagen wrote: On Wed, Feb 11, 2015 at 12:37:41AM +0400, Konstantin Ritt wrote: Yes, that would be an ideal solution. Unfortunately, that would also break a LOT of existing code. i was thinking of making it explicit with a smooth migration path - add

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Olivier Goffart
On Tuesday 10 February 2015 15:33:12 Thiago Macieira wrote: On Tuesday 10 February 2015 23:17:21 Allan Sandfeld Jensen wrote: Maybe with C++11 we don't need QString that much anymore. Use std::string with UTF8 and std::u32string for UCS4. For Qt6 it would be worth considering how many of

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Wednesday 11 February 2015 04:05:02 Konstantin Ritt wrote: Previously you said QString::data() must return QChar* (and not a generic uchar*), so that QString with an adaptive storage would have to silently convert the internal encoding into the one represented by QChar. If QString has a

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Tuesday 10 February 2015 23:17:21 Allan Sandfeld Jensen wrote: Maybe with C++11 we don't need QString that much anymore. Use std::string with UTF8 and std::u32string for UCS4. For Qt6 it would be worth considering how many of our classes still makes sense. Those we want CoW semantics on

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Konstantin Ritt
Previously you said QString::data() must return QChar* (and not a generic uchar*), so that QString with an adaptive storage would have to silently convert the internal encoding into the one represented by QChar. If QString has UCS-4 indexes and length() that counts the amount of UCS-4

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Wednesday 11 February 2015 02:19:59 Konstantin Ritt wrote: Can QChar represent a 32 bits codepoint, then? Yes, it could be widened. But what's the advantage in using UCS-4? -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Marc Mutz
On Tuesday 10 February 2015 22:26:50 Thiago Macieira wrote: It's not insurmountable. I can think of two solutions: 1) pre-allocate enough space for the UTF-16 data (strlen(utf8) * 2), so that the const functions can implicitly write to the UTF-16 block when needed. Since the original UTF-8

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Matthew Woehlke
On 2015-02-10 18:33, Thiago Macieira wrote: Eh... have you tried to convert a UTF-8 or UTF-16 or UCS-4 string to the locale's narrow character set without using QString? Yup... we would need to standardize libiconv (or an equivalent) for that :-). Have you tried to convert a number to

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Matthew Woehlke
On 2015-02-10 18:40, Marc Mutz wrote: On Tuesday 10 February 2015 22:26:50 Thiago Macieira wrote: It's not insurmountable. I can think of two solutions: 1) pre-allocate enough space for the UTF-16 data (strlen(utf8) * 2), so that the const functions can implicitly write to the UTF-16 block

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Wednesday 11 February 2015 00:40:28 Marc Mutz wrote: On Tuesday 10 February 2015 22:26:50 Thiago Macieira wrote: It's not insurmountable. I can think of two solutions: 1) pre-allocate enough space for the UTF-16 data (strlen(utf8) * 2), so that the const functions can implicitly

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Thiago Macieira
On Tuesday 10 February 2015 19:07:09 Matthew Woehlke wrote: Heh. That reminds me, when will Qt classes get emplace methods? I added those methods to my local refactor of QVector, but.. Or the ability to accept movable-but-not-copyable types? ... they aren't useful because we'll never accept

Re: [Development] Why can't QString use UTF-8 internally?

2015-02-10 Thread Olivier Goffart
On Tuesday 10 February 2015 19:10:29 Matthew Woehlke wrote: On 2015-02-10 18:40, Marc Mutz wrote: On Tuesday 10 February 2015 22:26:50 Thiago Macieira wrote: It's not insurmountable. I can think of two solutions: 1) pre-allocate enough space for the UTF-16 data (strlen(utf8) * 2), so