Re: D-ish way to work with strings?

2019-12-27 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Dec 27, 2019 at 01:23:57PM +0100, Robert M. Münch via 
Digitalmars-d-learn wrote:
> On 2019-12-23 15:05:20 +, H. S. Teoh said:
[...]
> > What are you planning to do with your strings?
> 
> Pretty simple: Have user editable content that is rendered using
> different fonts supporting unicode.
> 
> So, all editing functions: insert, replace, delete at all locations in
> the string supporting all unicode characters.
[...]

Ah, I see.  In that case you might want to consider using graphemes by
default, since that's what most closely corresponds to how the user will
perceive a "character".  For processing outside of editing, though, you
might want to consider converting to some other representation for
manipulation, since graphemes are slow (the decoding process is complex,
and we can't work around that because that's what Unicode requires).
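
To illustrate the grapheme-based editing suggested above, here is a minimal sketch that deletes one user-perceived character. It assumes the text is short enough that decoding the whole string on every edit is acceptable; `removeGrapheme` is a hypothetical helper name, not a Phobos function.

```d
import std.array : array;
import std.conv : text;
import std.range : chain;
import std.uni : byCodePoint, byGrapheme;

/// Hypothetical helper: removes the grapheme at index `idx`
/// (counted in user-perceived characters, not code units).
string removeGrapheme(string s, size_t idx)
{
    auto gs = s.byGrapheme.array;            // decode once into Grapheme[]
    assert(idx < gs.length, "grapheme index out of range");

    // Skip the grapheme at idx, then re-encode the rest as UTF-8.
    return chain(gs[0 .. idx], gs[idx + 1 .. $]).byCodePoint.text;
}

unittest
{
    // "noël" written as 'e' followed by a combining diaeresis (U+0308):
    // deleting grapheme 2 removes both code points together.
    assert(removeGrapheme("noe\u0308l", 2) == "nol");
}
```

Indexing by grapheme here means a delete never splits a base letter from its combining marks, which is what an editor wants.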


T

-- 
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder


Re: D-ish way to work with strings?

2019-12-27 Thread Robert M. Münch via Digitalmars-d-learn

On 2019-12-23 15:05:20 +, H. S. Teoh said:

> On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via
> Digitalmars-d-learn wrote:
> > Want to add I'm talking about unicode strings.
> >
> > Wouldn't it make sense to handle everything as UTF-32 so that
> > iteration is simple because code-point = code-unit?
> >
> > And later on, convert to UTF-16 or UTF-8 on demand?
> [...]
>
> Be careful that code point != "character" the way most people understand
> the word "character".


I know. My point was that in UTF-8, code points (which are not
characters) take different numbers of code units, which you need to take
into account if you want to iterate by code point.


> The word you're looking for is "grapheme". Which, unfortunately, is
> rather complex and very slow to handle in Unicode. See
> std.uni.byGrapheme.


Yes, that's when we come to "characters". And a "grapheme" can consist 
of several code-points. Is grapheme handling just slow in D or in 
general? If it's the latter, well, then that's just how it is.



> Usually you want to just stick with UTF-8 (usually) or UTF-16 (for
> Windows and Java interop). UTF-32 wastes a lot of space, and *still*
> doesn't give you what you think you want, and Grapheme[] is just dog
> slow because of the amount of decoding/recoding needed to manipulate it.


I need to handle graphemes when things are going to be rendered and edited.


> What are you planning to do with your strings?


Pretty simple: Have user editable content that is rendered using 
different fonts supporting unicode.


So, all editing functions: insert, replace, delete at all locations in 
the string supporting all unicode characters.


Best regards.

--
Robert M. Münch
http://www.saphirion.com
smarter | better | faster



Re: D-ish way to work with strings?

2019-12-27 Thread Robert M. Münch via Digitalmars-d-learn

On 2019-12-22 18:45:52 +, Steven Schveighoffer said:

> switch to using char[]. Unfortunately, there's a lot of code out there
> that accepts string instead of const(char)[], which is more usable. I
> think many people don't realize the purpose of the string type. It's
> meant to be something that is heap-allocated (or as a global), and
> NEVER goes out of scope.


Hi Steve, thanks for the feedback. Makes sense to me.

> It really depends on your use cases. strings are great precisely
> because they don't change. slicing makes huge sense there.


My "strings" change a lot, so not really a good fit to use string.

> Again, use char[] if you are going to be rearranging strings. And you
> have to take care not to cheat and cast to string. Always use idup if
> you need one.


Will do.

> If you find Phobos functions that unnecessarily take string instead of
> const(char)[] please post to bugzilla.


Ok, will keep an eye on it.

--
Robert M. Münch
http://www.saphirion.com
smarter | better | faster



Re: D-ish way to work with strings?

2019-12-23 Thread H. S. Teoh via Digitalmars-d-learn
On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via 
Digitalmars-d-learn wrote:
> Want to add I'm talking about unicode strings.
> 
> Wouldn't it make sense to handle everything as UTF-32 so that
> iteration is simple because code-point = code-unit?
> 
> And later on, convert to UTF-16 or UTF-8 on demand?
[...]

Be careful that code point != "character" the way most people understand
the word "character".  The word you're looking for is "grapheme".
Which, unfortunately, is rather complex and very slow to handle in
Unicode. See std.uni.byGrapheme.
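
The difference between code units, code points and graphemes can be seen in a short sketch. The sample string is an assumption chosen for illustration: "noël" spelled with a plain 'e' followed by a combining diaeresis.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

unittest
{
    string s = "noe\u0308l";                // 'e' + combining U+0308

    assert(s.length == 6);                  // UTF-8 code units (U+0308 needs 2)
    assert(s.byDchar.walkLength == 5);      // code points
    assert(s.byGrapheme.walkLength == 4);   // what a user would call characters
}
```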

Usually you want to just stick with UTF-8 (usually) or UTF-16 (for
Windows and Java interop). UTF-32 wastes a lot of space, and *still*
doesn't give you what you think you want, and Grapheme[] is just dog
slow because of the amount of decoding/recoding needed to manipulate it.

What are you planning to do with your strings?  IME, using ~
occasionally doesn't add *too* much GC pressure, and slicing is usually
the idiomatic way of working with strings in D (it can result in faster
code than C because you don't have to keep strcpy()'d stuff all over the
place).  If you're appending strings a LOT, you might want to consider
using std.array.appender in your inner loops to alleviate some of the
cost of using ~ too much.  Or use lazy evaluation and ranges to defer
actually constructing the string until the end when it's ready to be
stored.
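
As a rough sketch of the std.array.appender suggestion, the loop below builds its result in one growing buffer instead of using ~ repeatedly. `joinWithCommas` is a hypothetical helper made up for this example.

```d
import std.array : appender;

/// Hypothetical helper: joins words with ", " using a single Appender.
string joinWithCommas(string[] words)
{
    auto buf = appender!string();
    foreach (i, w; words)
    {
        if (i > 0)
            buf.put(", ");
        buf.put(w);          // appends into the same buffer each time
    }
    return buf.data;         // the accumulated string
}

unittest
{
    assert(joinWithCommas(["a", "b", "c"]) == "a, b, c");
    assert(joinWithCommas([]) == "");
}
```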

Still, this all depends on what you're trying to do with your strings.
Elaborate a bit more about your use case, and we might be able to give
better advice.


T

-- 
Nobody is perfect.  I am Nobody. -- pepoluan, GKC forum


Re: D-ish way to work with strings?

2019-12-22 Thread Steven Schveighoffer via Digitalmars-d-learn

On 12/22/19 9:15 AM, Robert M. Münch wrote:
> I want to do all the basic mutating things with strings: append,
> insert, replace
>
> What is the D-ish way to do that since string is aliased
> to immutable(char)[]?


switch to using char[].

Unfortunately, there's a lot of code out there that accepts string 
instead of const(char)[], which is more usable.
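
A small sketch of why const(char)[] is the more usable parameter type: both string and char[] convert to it implicitly, so the caller never has to copy. `countSpaces` is a hypothetical function used only for illustration.

```d
/// Hypothetical helper: counts ASCII spaces without modifying the input.
size_t countSpaces(const(char)[] text)
{
    size_t n;
    foreach (char c; text)   // iterate code units, no decoding needed
        if (c == ' ')
            ++n;
    return n;
}

unittest
{
    string fixed = "a b c";          // immutable(char)[]
    char[] editable = "d e".dup;     // mutable buffer

    assert(countSpaces(fixed) == 2);     // string is accepted
    assert(countSpaces(editable) == 1);  // char[] is accepted too
}
```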


I think many people don't realize the purpose of the string type. It's 
meant to be something that is heap-allocated (or as a global), and NEVER 
goes out of scope. Many things are shoehorned into string which 
shouldn't be.


> Using arrays, using ~ operator, always copying, changing, combining my
> strings into a new one? Does it make sense to think about reducing GC
> pressure?


It really depends on your use cases. strings are great precisely because 
they don't change. slicing makes huge sense there.


> I'm a bit lost in the possibilities and don't find any "that's the way
> to do it".


Again, use char[] if you are going to be rearranging strings. And you 
have to take care not to cheat and cast to string. Always use idup if 
you need one.
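
A minimal sketch of that advice, assuming ASCII-only text so that code-unit indices line up with characters: edit a char[] buffer with std.array's insertInPlace and replaceInPlace, and take an immutable copy with idup only when a string is really needed.

```d
import std.array : insertInPlace, replaceInPlace;

unittest
{
    char[] buf = "Hello world".dup;     // mutable copy of the literal

    buf ~= "!";                         // append
    buf.insertInPlace(5, ",");          // insert at code-unit index 5
    buf.replaceInPlace(7, 12, "D");     // replace "world" (units 7 .. 12)
    assert(buf == "Hello, D!");

    string frozen = buf.idup;           // independent immutable copy
    buf[0] = 'J';                       // later edits to the buffer ...
    assert(frozen == "Hello, D!");      // ... do not affect the string
}
```

Casting buf to string instead of calling idup would let the last assignment change data that the type system promises is immutable.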


If you find Phobos functions that unnecessarily take string instead of 
const(char)[] please post to bugzilla.


-Steve


Re: D-ish way to work with strings?

2019-12-22 Thread Robert M. Münch via Digitalmars-d-learn

Want to add I'm talking about unicode strings.

Wouldn't it make sense to handle everything as UTF-32 so that iteration 
is simple because code-point = code-unit?


And later on, convert to UTF-16 or UTF-8 on demand?
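
For reference, a sketch of what that would look like with dstring and the std.utf conversion functions. The sample text is an assumption picked to show that one array element per code point still does not mean one element per visible character.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : toUTF16, toUTF8;

unittest
{
    dstring wide = "noe\u0308l"d;             // UTF-32: code point == code unit
    assert(wide.length == 5);                 // five code points
    assert(wide.byGrapheme.walkLength == 4);  // but four visible characters
    assert(wide[2 .. 4] == "e\u0308"d);       // the accented e spans two elements

    string utf8 = wide.toUTF8;                // convert on demand
    wstring utf16 = wide.toUTF16;
    assert(utf8.length == 6);
    assert(utf16.length == 5);
}
```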

--
Robert M. Münch
http://www.saphirion.com
smarter | better | faster