Re: Char representation

2016-11-22 Thread RazvanN via Digitalmars-d-learn
On Tuesday, 22 November 2016 at 14:23:28 UTC, Jonathan M Davis 
wrote:
On Tuesday, November 22, 2016 13:29:47 RazvanN via 
Digitalmars-d-learn wrote:

[...]


You misunderstand. char[] is a dynamic array of char, wchar[] 
is a dynamic array of wchar, and dchar[] is a dynamic array 
of dchar. There is nothing funny going on with the internal 
representation. Rather, the problem is with the range API and 
the traits that go with it. And it's not a bug; it's a design 
mistake.


[...]


Thank you very much for this great explanation. Things are 
starting to make sense now.


Razvan Nitu


Re: Char representation

2016-11-22 Thread Kagamin via Digitalmars-d-learn

On Tuesday, 22 November 2016 at 13:29:47 UTC, RazvanN wrote:

Given the following code:

 char[5] a = ['a', 'b', 'c', 'd', 'e'];
 alias Range = char[];
 writeln(is(ElementType!Range == char));

One would expect that the program will print true. In fact, it 
prints false and I noticed that if Range is char[], wchar[], 
dchar[], string, wstring, dstring
Unqual!(ElementType!Range) is dchar. I find it odd that the 
internal representation for char and string is dchar. Is this a 
bug?


Here's the reading: 
https://forum.dlang.org/post/nh2o9i$hr0$1...@digitalmars.com


Re: Char representation

2016-11-22 Thread Jonathan M Davis via Digitalmars-d-learn
On Tuesday, November 22, 2016 13:29:47 RazvanN via Digitalmars-d-learn 
wrote:
> Given the following code:
>
>   char[5] a = ['a', 'b', 'c', 'd', 'e'];
>   alias Range = char[];
>   writeln(is(ElementType!Range == char));
>
> One would expect that the program will print true. In fact, it
> prints false and I noticed that if Range is char[], wchar[],
> dchar[], string, wstring, dstring
> Unqual!(ElementType!Range) is dchar. I find it odd that the
> internal representation for char and string is dchar. Is this a
> bug?

You misunderstand. char[] is a dynamic array of char, wchar[] is a dynamic
array of wchar, and dchar[] is a dynamic array of dchar. There is nothing
funny going on with the internal representation. Rather, the problem is with
the range API and the traits that go with it. And it's not a bug; it's a
design mistake.

I don't know how much you know about Unicode, but for a quick explanation,
you have code units, code points, and graphemes. A grapheme is made up of
one or more code points, and a code point is made up of one or more code
units. In the case of UTF-8, a code unit is 8 bits; in UTF-16, a code unit
is 16 bits; and in UTF-32, a code unit is 32 bits. Those are represented in
D by char, wchar, and dchar, respectively. There is no guarantee that a char,
wchar, or dchar is a representable character. A code unit is just a piece of
a character except in the cases where it happens to be a full character. :|
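
To make the code unit sizes concrete, here is a small sketch (my own illustration, not something from the original post) that stores the same two code points in each of the three string types and checks how many code units each one takes. The code points are written as escapes so that the encoding of the source file does not matter.

```d
import std.stdio : writeln;

void main()
{
    // U+00E9 (é) in each of the three encodings.
    string  s = "\u00E9";   // UTF-8:  array of immutable(char)
    wstring w = "\u00E9"w;  // UTF-16: array of immutable(wchar)
    dstring d = "\u00E9"d;  // UTF-32: array of immutable(dchar)

    // .length counts code units, not characters.
    writeln(s.length); // 2
    writeln(w.length); // 1
    writeln(d.length); // 1

    // Each char of s is only a piece of é (the UTF-8 bytes C3 A9).
    assert(s[0] == 0xC3 && s[1] == 0xA9);

    // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
    // needs two code units (a surrogate pair) for it as well.
    assert("\U0001F600".length  == 4);
    assert("\U0001F600"w.length == 2);
    assert("\U0001F600"d.length == 1);
}
```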

A code point, on the other hand, actually makes up something composable and
printable. It's something like the letter A, or é, or の, etc. It could
also be an accent, a superscript, subscript, etc. In the case of UTF-8 and
UTF-16, it can take several code units to form a single code point. In the
case of UTF-32, a single code unit is always a code point, because every
code point fits in one 32-bit code unit.

However, that's still not necessarily a full character. After all, an accent
or a superscript is not really a character. Rather, it's a modifier for a
character. So, one or more code points can be combined to form graphemes
which _are_ actual characters. Unfortunately, there are several
normalization schemes for the order of code points in a grapheme, and some
graphemes can be represented as a single code point or as several (most
notably, the characters which commonly have accents on them such as é come
both as single code points and as combined code points). So, this whole
thing gets stupidly complicated. It's even worse when you want to handle it
all _efficiently_.
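
As a rough illustration of the three levels (a sketch added for this write-up, assuming std.uni's byGrapheme and normalize, neither of which is named in the post), the two spellings of é can be compared like this:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize;

void main()
{
    string precomposed = "\u00E9";  // é as a single code point
    string combined    = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

    // Same character on screen, different code points underneath.
    assert(precomposed != combined);

    writeln(combined.length);                // 3 code units (UTF-8)
    writeln(combined.walkLength);            // 2 code points
    writeln(combined.byGrapheme.walkLength); // 1 grapheme

    // normalize defaults to NFC, which composes e + accent into U+00E9.
    assert(normalize(combined) == precomposed);
}
```

Counting graphemes and normalizing both cost more than walking code points, which is part of why doing all of this efficiently is hard.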

Well, when Andrei added ranges to D, he tried to simplify things so that the
default was correct and reasonably efficient while allowing for code to
specialize where appropriate to get the full efficiency. That's a noble
goal, but unfortunately, he didn't know about graphemes at the time. He
thought that code points were guaranteed to be full characters and that if
you operated at the code point level, you were guaranteed full correctness.
So, in order to avoid errors related to chopping up strings of char or wchar
in the middle of code points, he came up with the concept of "narrow"
strings - i.e. strings which are made up of char or wchar rather than dchar
(so strings where each code unit is not guaranteed to be a code point), and
he restricted what narrow strings could do by default per the range API and
its associated traits. So, we get fun like this.

assert(!hasLength!string);
assert(!hasLength!wstring);
assert(hasLength!dstring);

assert(!isRandomAccessRange!string);
assert(!isRandomAccessRange!wstring);
assert(isRandomAccessRange!dstring);

assert(is(ElementType!string == dchar));
assert(is(ElementType!wstring == dchar));
assert(is(ElementType!dstring == dchar));

And front, popFront, back, and popBack all automatically decode the code
units in a string to code points. So, front and back both return dchar even
if the string is a string of char or wchar. The arrays themselves do not
change. However, the way that the traits in std.range.primitives treat them
is then fundamentally different from how the language treats them. So, even
though

string str = "hello world";
for(auto r = str; !r.empty; r.popFront())
{
    auto e = r.front;
}

will iterate by dchar

string str = "hello world";
foreach(e; str)
{
}

will iterate by char. If you want it to iterate by dchar, then you make it
explicit.

string str = "hello world";
foreach(dchar e; str)
{
}
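
The three loops above can be put side by side in one runnable sketch. The test string and the counter names are made up for the illustration; the string contains non-ASCII characters so that the code unit and code point counts differ.

```d
import std.range.primitives : empty, front, popFront;
import std.stdio : writeln;
import std.traits : Unqual;

void main()
{
    // "été": 3 code points, 5 UTF-8 code units.
    string str = "\u00E9t\u00E9";

    size_t viaRange, viaForeach, viaForeachDchar;

    // Range primitives decode: each front is a dchar.
    for (auto r = str; !r.empty; r.popFront())
    {
        auto e = r.front;
        static assert(is(typeof(e) == dchar));
        ++viaRange;
    }

    // Plain foreach uses the array's own element type: char.
    foreach (e; str)
    {
        static assert(is(Unqual!(typeof(e)) == char));
        ++viaForeach;
    }

    // Asking for dchar makes foreach decode.
    foreach (dchar e; str)
        ++viaForeachDchar;

    writeln(viaRange, " ", viaForeach, " ", viaForeachDchar); // 3 5 3
}
```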

The result of all of this is that by default, when you treat strings as
ranges, you operate at the code point level. This avoids certain bugs where
code would otherwise chop up code points by operating on code units, but
since it doesn't actually go to the grapheme level, it still isn't actually
correct, and it's easier to miss the fact that it's wrong, since more cases
work. It's also inefficient, because the code units are always decoded to
code points regardless of whether the algorithm in question actually needs
to do that or not. It also creates confusion and questions like yours.
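
For code that does not need decoding, one way to opt out is std.utf.byCodeUnit. That function is not mentioned above, so take this as a sketch of one possible approach rather than the recommended one. It wraps the string in a range whose elements are the code units themselves, so the traits behave like they do for an ordinary array.

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;
import std.stdio : writeln;
import std.traits : Unqual;
import std.utf : byCodeUnit;

void main()
{
    string str = "hello world";
    auto units = str.byCodeUnit; // no decoding on front/popFront

    alias R = typeof(units);
    static assert(is(Unqual!(ElementType!R) == char));
    static assert(hasLength!R);
    static assert(isRandomAccessRange!R);

    // Length and indexing work in code units, same as on the array.
    assert(units.length == str.length);
    assert(units[4] == 'o');

    writeln(units.length); // 11
}
```

The trade-off is that an algorithm running on the wrapped range is back to seeing pieces of characters whenever the text is not pure ASCII.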


Re: Char representation

2016-11-22 Thread Adam D. Ruppe via Digitalmars-d-learn

On Tuesday, 22 November 2016 at 13:29:47 UTC, RazvanN wrote:

Is this a bug?


The language is sane. The standard library, alas, is not; it is 
insane by design, so not a bug.


Re: Char representation

2016-11-22 Thread Daniel Kozak via Digitalmars-d-learn

On 22.11.2016 at 14:29, RazvanN via Digitalmars-d-learn wrote:


Given the following code:

 char[5] a = ['a', 'b', 'c', 'd', 'e'];
 alias Range = char[];
 writeln(is(ElementType!Range == char));

One would expect that the program will print true. In fact, it prints 
false and I noticed that if Range is char[], wchar[], dchar[], string, 
wstring, dstring
Unqual!(ElementType!Range) is dchar. I find it odd that the internal 
representation for char and string is dchar. Is this a bug?

https://dlang.org/library/std/range/primitives/element_encoding_type.html
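
A minimal sketch of what the linked trait gives you, checked at compile time. The particular types tested are just examples chosen here.

```d
import std.range.primitives : ElementEncodingType, ElementType;
import std.traits : Unqual;

// ElementType reports the decoded type for narrow strings...
static assert(is(ElementType!(char[]) == dchar));

// ...while ElementEncodingType reports the code unit type actually stored.
static assert(is(ElementEncodingType!(char[]) == char));
static assert(is(Unqual!(ElementEncodingType!string) == char));
static assert(is(Unqual!(ElementEncodingType!wstring) == wchar));
static assert(is(Unqual!(ElementEncodingType!dstring) == dchar));

// For non-string arrays the two agree.
static assert(is(ElementEncodingType!(int[]) == int));
static assert(is(ElementType!(int[]) == int));

void main() {}
```

So the check from the original question prints true if ElementEncodingType is used in place of ElementType.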


Re: Char representation

2016-11-22 Thread Daniel Kozak via Digitalmars-d-learn

On 22.11.2016 at 14:29, RazvanN via Digitalmars-d-learn wrote:


Given the following code:

 char[5] a = ['a', 'b', 'c', 'd', 'e'];
 alias Range = char[];
 writeln(is(ElementType!Range == char));

One would expect that the program will print true. In fact, it prints 
false and I noticed that if Range is char[], wchar[], dchar[], string, 
wstring, dstring
Unqual!(ElementType!Range) is dchar. I find it odd that the internal 
representation for char and string is dchar. Is this a bug?

RTFM: https://dlang.org/library/std/range/primitives/element_type.html


Re: Char representation

2016-11-22 Thread Stefan Koch via Digitalmars-d-learn

On Tuesday, 22 November 2016 at 13:29:47 UTC, RazvanN wrote:

Given the following code:

 char[5] a = ['a', 'b', 'c', 'd', 'e'];
 alias Range = char[];
 writeln(is(ElementType!Range == char));

One would expect that the program will print true. In fact, it 
prints false and I noticed that if Range is char[], wchar[], 
dchar[], string, wstring, dstring
Unqual!(ElementType!Range) is dchar. I find it odd that the 
internal representation for char and string is dchar. Is this a 
bug?


When seen as a range the element type of a char[] is indeed dchar.
This is autodecoding at work.


Re: Char representation

2016-11-22 Thread rikki cattermole via Digitalmars-d-learn

On 23/11/2016 2:29 AM, RazvanN wrote:

Given the following code:

 char[5] a = ['a', 'b', 'c', 'd', 'e'];
 alias Range = char[];
 writeln(is(ElementType!Range == char));

One would expect that the program will print true. In fact, it prints
false and I noticed that if Range is char[], wchar[], dchar[], string,
wstring, dstring
Unqual!(ElementType!Range) is dchar. I find it odd that the internal
representation for char and string is dchar. Is this a bug?


"For example, ElementType!(T[]) is T if T[] isn't a narrow string; if it 
is, the element type is dchar"[0].


[0] https://dlang.org/phobos/std_range_primitives.html#ElementType
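
The rule quoted from the documentation can be checked directly. This is a small sketch assuming std.traits.isNarrowString is the trait behind the phrase "narrow string"; the array types are arbitrary examples.

```d
import std.range.primitives : ElementType;
import std.traits : isNarrowString;

// Narrow strings (char and wchar arrays): the element type is dchar.
static assert( isNarrowString!(char[])  && is(ElementType!(char[])  == dchar));
static assert( isNarrowString!(wchar[]) && is(ElementType!(wchar[]) == dchar));

// Everything else: ElementType!(T[]) is simply T.
static assert(!isNarrowString!(dchar[]) && is(ElementType!(dchar[]) == dchar));
static assert(!isNarrowString!(int[])   && is(ElementType!(int[])   == int));

void main() {}
```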