On Tuesday, November 22, 2016 13:29:47 RazvanN via Digitalmars-d-learn
wrote:
> Given the following code:
>
> char[5] a = ['a', 'b', 'c', 'd', 'e'];
> alias Range = char[];
> writeln(is(ElementType!Range == char));
>
> One would expect that the program will print true. In fact, it
> prints false and I noticed that if Range is char[], wchar[],
> dchar[], string, wstring, dstring
> Unqual!(ElementType!Range) is dchar. I find it odd that the
> internal representation for char and string is dchar. Is this a
> bug?
You misunderstand. char[] is a dynamic array of char, wchar[] is a dynamic
array of wchar, and dchar[] is a dynamic array of dchar. There is nothing
funny going on with the internal representation. Rather, the problem is with
the range API and the traits that go with it. And it's not a bug; it's a
design mistake.
I don't know how much you know about Unicode, but for a quick explanation,
you have code units, code points, and graphemes. A grapheme is made up of
one or more code points, and a code point is made up of one or more code
units. In the case of UTF-8, a code unit is 8 bits; in UTF-16, a code unit
is 16 bits; and in UTF-32, a code unit is 32 bits. Those are represented in
D by char, wchar, and dchar respectively. There is no guarantee that a char,
wchar, or dchar is a representable character. A code unit is just a piece of
a character except in the cases where it happens to be a full character. :|
A code point, on the other hand, actually makes up something composable and
printable. It's something like the letter A, or é, or の, etc. It could
also be an accent, a superscript, subscript, etc. In the case of UTF-8 and
UTF-16, it can take several code units to form a single code point. In the
case of UTF-32, a single code unit is always a code point, because every
code point fits in one 32-bit code unit.
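
To make the code unit vs. code point distinction concrete, here is a small sketch. It relies only on the fact that .length on a D array counts elements (i.e. code units), and the two sample characters are just ones I picked for illustration.

```d
void main()
{
    // U+00E9 (é) is a single code point.
    string  s = "\u00E9";   // UTF-8
    wstring w = "\u00E9"w;  // UTF-16
    dstring d = "\u00E9"d;  // UTF-32

    // .length counts code units, not code points.
    assert(s.length == 2); // two 8-bit code units
    assert(w.length == 1); // one 16-bit code unit
    assert(d.length == 1); // one 32-bit code unit

    // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
    // needs a surrogate pair for it.
    string  s2 = "\U0001F600";
    wstring w2 = "\U0001F600"w;
    dstring d2 = "\U0001F600"d;

    assert(s2.length == 4);
    assert(w2.length == 2);
    assert(d2.length == 1);
}
```

So the same single code point is anywhere from one to four array elements depending on the encoding.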
However, that's still not necessarily a full character. After all, an accent
or a superscript is not really a character. Rather, it's a modifier for a
character. So, one or more code points can be combined to form graphemes
which _are_ actual characters. Unfortunately, there are several
normalization schemes for the order of code points in a grapheme, and some
graphemes can be represented as a single code point or as several (most
notably, the characters which commonly have accents on them such as é come
both as single code points and as combined code points). So, this whole
thing gets stupidly complicated. It's even worse when you want to handle it
all _efficiently_.
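
As a rough illustration of the é case, the sketch below compares the precomposed and combined forms. It assumes std.uni's byGrapheme and normalize, which are my choice for the demonstration, and it counts code points with walkLength, which decodes strings by default.

```d
import std.range.primitives : walkLength;
import std.uni;

void main()
{
    string precomposed = "\u00E9";  // é as one code point
    string combined    = "e\u0301"; // e + U+0301 COMBINING ACUTE ACCENT

    // The two strings hold different code points, so they compare unequal.
    assert(precomposed != combined);

    // Counting code points gives different answers...
    assert(precomposed.walkLength == 1);
    assert(combined.walkLength == 2);

    // ...but each is a single grapheme.
    assert(precomposed.byGrapheme.walkLength == 1);
    assert(combined.byGrapheme.walkLength == 1);

    // Normalizing to NFC composes the pair into the single code point.
    assert(normalize!NFC(combined) == precomposed);
}
```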
Well, when Andrei added ranges to D, he tried to simplify things so that the
default was correct and reasonably efficient while allowing for code to
specialize where appropriate to get the full efficiency. That's a noble
goal, but unfortunately, he didn't know about graphemes at the time. He
thought that code points were guaranteed to be full characters and that if
you operated at the code point level, you were guaranteed full correctness.
So, in order to avoid errors related to chopping up strings of char or wchar
in the middle of code points, he came up with the concept of "narrow"
strings - i.e. strings which are made up of char or wchar rather than dchar
(so strings where each code unit is not guaranteed to be a code point), and
he restricted what narrow strings could do by default per the range API and
its associated traits. So, we get fun like this:
assert(!hasLength!string);
assert(!hasLength!wstring);
assert(hasLength!dstring);
assert(!isRandomAccessRange!string);
assert(!isRandomAccessRange!wstring);
assert(isRandomAccessRange!dstring);
assert(is(ElementType!string == dchar));
assert(is(ElementType!wstring == dchar));
assert(is(ElementType!dstring == dchar));
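
For the code in the original question, a sketch of how to ask for the code unit type instead might look like the following. ElementEncodingType (std.range.primitives) and isNarrowString (std.traits) are not named above; I am assuming they are what you want here, and they are how generic code usually detects and specializes on narrow strings.

```d
import std.range.primitives : ElementType, ElementEncodingType;
import std.traits : isNarrowString, Unqual;

void main()
{
    alias Range = char[];

    // The range traits report the decoded element type...
    static assert(is(ElementType!Range == dchar));

    // ...while ElementEncodingType reports the code unit type.
    static assert(is(ElementEncodingType!Range == char));
    static assert(is(Unqual!(ElementEncodingType!string) == char));
    static assert(is(Unqual!(ElementEncodingType!wstring) == wchar));
    static assert(is(Unqual!(ElementEncodingType!dstring) == dchar));

    // Narrow strings are the ones whose code units are char or wchar.
    static assert(isNarrowString!string);
    static assert(isNarrowString!wstring);
    static assert(!isNarrowString!dstring);
}
```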
And front, popFront, back, and popBack all automatically decode the code
units in a string to code points. So, front and back both return dchar even
if the string is a string of char or wchar. The arrays themselves do not
change. However, the way that the traits in std.range.primitives treat them
is then fundamentally different from how the language treats them. So, even
though
import std.range.primitives; // empty, front, popFront for arrays
string str = "hello world";
for(auto r = str; !r.empty; r.popFront())
{
    auto e = r.front;
}
will iterate by dchar,
string str = "hello world";
foreach(e; str)
{
}
will iterate by char. If you want it to iterate by dchar, then you make it
explicit.
string str = "hello world";
foreach(dchar e; str)
{
}
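
To see the difference between the three loops, here is a sketch that counts iterations over a string containing one non-ASCII character. The sample string and the counter names are made up for the example.

```d
import std.range.primitives : empty, front, popFront;
import std.traits : Unqual;

void main()
{
    // 5 code points, but 6 UTF-8 code units, since é takes two.
    string str = "h\u00E9llo";

    size_t viaRange, viaForeach, viaForeachDchar;

    // Range primitives decode: one iteration per code point.
    for (auto r = str; !r.empty; r.popFront())
    {
        auto e = r.front;
        static assert(is(typeof(e) == dchar));
        ++viaRange;
    }

    // Plain foreach does not decode: one iteration per code unit.
    foreach (e; str)
    {
        static assert(is(Unqual!(typeof(e)) == char));
        ++viaForeach;
    }

    // Asking for dchar makes foreach decode.
    foreach (dchar e; str)
        ++viaForeachDchar;

    assert(viaRange == 5);
    assert(viaForeach == 6);
    assert(viaForeachDchar == 5);
}
```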
The result of all of this is that by default, when you treat strings as
ranges, you operate at the code point level. This avoids certain bugs where
code would otherwise chop up code points by operating on code units, but
since it doesn't actually go to the grapheme level, it still isn't actually
correct, and it's easier to miss the fact that it's wrong, since more cases
work. It's also inefficient, because the code units are always decoded to
code points regardless of whether the algorithm in question actually needs
to do that or not. It also creates confusion and questions like yours.
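
If an algorithm only needs code units, one way to skip the decoding is to wrap the string explicitly. The sketch below assumes std.utf.byCodeUnit, which is not mentioned above, and shows that the wrapped range gets back the capabilities that the traits deny to a bare string.

```d
import std.algorithm.searching : canFind;
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;
import std.traits : Unqual;
import std.utf : byCodeUnit;

void main()
{
    string str = "h\u00E9llo world";

    // Wrap the string so that range code sees code units, with no decoding.
    auto units = str.byCodeUnit;
    alias R = typeof(units);

    static assert(is(Unqual!(ElementType!R) == char));
    static assert(hasLength!R);
    static assert(isRandomAccessRange!R);

    // Length is the number of code units, same as the array's length.
    assert(units.length == str.length);

    // Searching for an ASCII character is safe at the code unit level.
    assert(units.canFind('w'));
}
```

This is only appropriate when the algorithm is correct on code units, e.g. when searching for ASCII characters.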