Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Dec 02, 2022 at 11:47:30PM +0000, thebluepandabear via 
Digitalmars-d-learn wrote:
> On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear wrote:
> > > :-D
> > > 
> > > (Exercise for the reader: what's the Hausdorff dimension of the
> > > set of strings over Unicode space? :-P)
> > > 
> > > 
> > > T
> > 
> > Your explanation was great and cleared things up... not sure about
> > the linear algebra one though ;)
> 
> Actually now when I think about it, it is quite a creative way of
> explaining things. I take back what I said.

It was a math joke. :-P  It was half-serious, though, and I think the
analogy surprisingly holds up well enough in many cases.  In any case,
silly analogies are often a good mnemonic for remembering things like
Unicode terminology. :-D


T

-- 
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.


Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread thebluepandabear via Digitalmars-d-learn
On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear 
wrote:

:-D

(Exercise for the reader: what's the Hausdorff dimension of 
the set of strings over Unicode space? :-P)



T


Your explanation was great and cleared things up... not sure 
about the linear algebra one though ;)


Actually now when I think about it, it is quite a creative way of 
explaining things. I take back what I said.


Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread thebluepandabear via Digitalmars-d-learn

:-D

(Exercise for the reader: what's the Hausdorff dimension of the 
set of strings over Unicode space? :-P)



T


Your explanation was great and cleared things up... not sure 
about the linear algebra one though ;)


Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Dec 02, 2022 at 02:32:47PM -0800, Ali Çehreli via Digitalmars-d-learn 
wrote:
> On 12/2/22 13:44, rikki cattermole wrote:
> 
> > Yeah you're right, it's a code unit, not a code point.
> 
> This proves yet again how badly chosen those names are. I must look it
> up every time before using one or the other.
> 
> So they are both "code"? One is a "unit" and the other is a "point"?
> Sheesh!
[...]

Think of Unicode as a vector space.  A code point is a point in this
space, and a code unit is one of the unit vectors; although some points
can be reached with a single unit vector, to get to a general point you
need to combine one or more unit vectors.

Furthermore, the set of unit vectors you have depends on which
coordinate system (i.e., encoding) you're using.  Reencoding a Unicode
string is essentially changing your coordinate system. ;-) (Exercise for
the reader: compute the transformation matrix for reencoding. :-P)
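
(And to make the re-encoding bit concrete, here's a minimal sketch using
only std.conv.to, which transcodes between D's string types:)

```D
import std.conv : to;

void main() {
    string  s8  = "Ğ";            // UTF-8: two 1-byte code units
    wstring s16 = s8.to!wstring;  // "change coordinates" to UTF-16
    dstring s32 = s8.to!dstring;  // ... and to UTF-32
    assert(s8.length  == 2);
    assert(s16.length == 1);
    assert(s32.length == 1);
    assert(s16.to!string == s8);  // round trip back to UTF-8
}
```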

Also, a grapheme is a curve through this space (you *graph* the curve,
you see), and as we all know, a curve may consist of more than one
point.

:-D

(Exercise for the reader: what's the Hausdorff dimension of the set of
strings over Unicode space? :-P)


T

-- 
First Rule of History: History doesn't repeat itself -- historians merely 
repeat each other.


Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread Ali Çehreli via Digitalmars-d-learn

On 12/2/22 13:18, thebluepandabear wrote:

> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'?

The integral value of Ğ in unicode is 286.

  https://unicodeplus.com/U+011E

Since 'char' is 8 bits, it cannot store 286.

At first, that sounds like a hopeless situation, making one think that Ğ 
cannot be represented in a string. The concept of encoding to the 
rescue: Ğ can be encoded by 2 chars:


import std.stdio;

void main() {
    foreach (c; "Ğ") {
        writefln!"%b"(c);
    }
}

That program prints

11000100
10011110

Articles like the following explain well how that second byte is a 
continuation byte:


  https://en.wikipedia.org/wiki/UTF-8#Encoding

(It's a continuation byte because it starts with the bits 10).
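
A small sketch of that bit-pattern check, using the same two bytes 
printed above:

```D
void main() {
    auto bytes = cast(immutable(ubyte)[]) "Ğ";
    assert(bytes.length == 2);
    assert(bytes[0] == 0xC4 && bytes[1] == 0x9E);
    // The lead byte of a 2-byte sequence starts with the bits 110.
    assert((bytes[0] & 0b1110_0000) == 0b1100_0000);
    // The continuation byte starts with the bits 10.
    assert((bytes[1] & 0b1100_0000) == 0b1000_0000);
}
```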

> I don't think it was explained well in
> the book.

Coincidentally, according to other recent feedback I received, unicode 
and UTF are introduced way too early for such a book. I agree. I hadn't 
understood a single thing the first time smart people tried to explain 
unicode and UTF encodings at the company where I worked. I had years of 
programming experience back then. (Although, I now think the 
instructors were not really good; and the company was pretty bad as 
well. :) )


> Any help would be appreciated.

I recommend the Wikipedia page I linked above. It is enlightening to 
understand how about 150K unicode characters can be encoded with units 
of 8 bits.


You can safely ignore wchar, dchar, wstring, and dstring for daily 
coding. Only special programs may need to deal with those types. 'char' 
and string are what we need and do use predominantly in D.


Ali



Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread rikki cattermole via Digitalmars-d-learn

On 03/12/2022 11:32 AM, Ali Çehreli wrote:

On 12/2/22 13:44, rikki cattermole wrote:

 > Yeah you're right, it's a code unit, not a code point.

This proves yet again how badly chosen those names are. I must look it 
up every time before using one or the other.


So they are both "code"? One is a "unit" and the other is a "point"? 
Sheesh!


Ali


Yeah, and I even have a physical copy beside me!

P.s.

Oh btw Unicode 15 should be coming soon to Phobos :)
Once that is in, expect Turkic support for case insensitive matching!


Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread Ali Çehreli via Digitalmars-d-learn

On 12/2/22 13:44, rikki cattermole wrote:

> Yeah you're right, it's a code unit, not a code point.

This proves yet again how badly chosen those names are. I must look it 
up every time before using one or the other.


So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!

Ali



Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread Steven Schveighoffer via Digitalmars-d-learn

On 12/2/22 4:18 PM, thebluepandabear wrote:

Hello (noob question),

I am reading a book about D by Ali, and he talks about the different 
char types: char, wchar, and dchar. He says that char stores a UTF-8 
code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32 
code unit, this makes sense.


He then goes on to say that:

"Contrary to some other programming languages, characters in D may 
consist of
different numbers of bytes. For example, because 'Ğ' must be represented 
by at
least 2 bytes in Unicode, it doesn't fit in a variable of type char. On 
the other
hand, because dchar consists of 4 bytes, it can hold any Unicode 
character."


It's his explanation as to why this code doesn't compile even though Ğ 
is a UTF-8 code unit:


```D
char utf8 = 'Ğ';
```

But I don't really understand this? What does it mean that it 'must be 
represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so 
I am confused why it doesn't fit, I don't think it was explained well in 
the book.


Any help would be appreciated.




A *code point* is a value out of the unicode standard. [Code 
points](https://en.wikipedia.org/wiki/Code_point) represent glyphs, 
combining marks, or other things (not sure of the full list) that reside 
in the standard. When you want to figure out "hmm... what value does 
the emoji 👍 have?", that's a *code point*. This is a number from 0 to 
0x10FFFF for Unicode. (BTW, it's 0x1F44D)
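
As a quick check of those two numbers (plain D, nothing else assumed):

```D
void main() {
    dchar thumbsUp = '👍';
    assert(thumbsUp == 0x1F44D); // the code point value
    assert("👍".length == 4);    // but four UTF-8 code units
}
```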


UTF-X are various *encodings* of unicode. UTF8 is an encoding of unicode 
where 1 to 4 bytes (called *code units*) encode a single unicode *code 
point*.


There are various encodings, and all can be decoded to the same list of 
*code points*. The most direct form is UTF-32, where each *code point* 
is also a *code unit*.


`char` is a UTF-8 code unit. `wchar` is a UTF-16 code unit, and `dchar` 
is a UTF-32 code unit.


The reason you can't encode a Ğ into a single `char` is that its code 
point is 0x11E, which does not fit into a single `char`. Therefore, an 
encoding scheme is used to put it into 2 `char`s.
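
As a small sketch of that: `foreach` with an explicit `dchar` loop 
variable decodes the code units back into the code point:

```D
void main() {
    string s = "Ğ";
    assert(s.length == 2);    // stored as 2 UTF-8 code units
    foreach (dchar c; s)      // decoding the code units...
        assert(c == 0x011E);  // ...yields the single code point
}
```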


Hope this helps.

-Steve


Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread H. S. Teoh via Digitalmars-d-learn
On Fri, Dec 02, 2022 at 09:18:44PM +0000, thebluepandabear via 
Digitalmars-d-learn wrote:
> Hello (noob question),
> 
> I am reading a book about D by Ali, and he talks about the different
> char types: char, wchar, and dchar. He says that char stores a UTF-8
> code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32
> code unit, this makes sense.
> 
> He then goes on to say that:
> 
> "Contrary to some other programming languages, characters in D may
> consist of different numbers of bytes. For example, because 'Ğ' must
> be represented by at least 2 bytes in Unicode, it doesn't fit in a
> variable of type char. On the other hand, because dchar consists of 4
> bytes, it can hold any Unicode character."
> 
> It's his explanation as to why this code doesn't compile even though Ğ
> is a UTF-8 code unit:
> 
> ```D
> char utf8 = 'Ğ';
> ```
> 
> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes
> so I am confused why it doesn't fit, I don't think it was explained
> well in the book.

That's wrong, char.sizeof should be exactly 1 byte, no more, no less.

First, before we talk about Unicode, we need to get the terminology
straight:

Code unit = unit of storage in a particular representation (encoding) of
Unicode. E.g., a UTF-8 string consists of a stream of 1-byte code units,
a UTF-16 string consists of a stream of 2-byte code units, etc. Do NOT
confuse this with "code point", or worse, "character".

Code point = the abstract Unicode entity that occupies a single slot in
the Unicode tables.  Usually written as U+xxxx where xxxx is some
hexadecimal number.

IMPORTANT NOTE: do NOT confuse a code point with what a normal
human being thinks of as a "character".  Even though in many
cases a code point happens to represent a single "character",
this isn't always true.  It's safer to understand a code point
as a single slot in one of the Unicode tables.

NOTE: a code point may be represented by multiple code units,
depending on the encoding. For example, in UTF-8, some code
points require multiple code units (multiple bytes) to
represent. This varies depending on the character; the code
point `A` needs only a single code unit, but the code point `Ш`
needs 2 bytes, and the code point `😀` requires 4 bytes. In
UTF-16, `A` and `Ш` occupy only 1 code unit (2 bytes, because in
UTF-16, one code unit == 2 bytes), but `😀` needs 2 code units
(4 bytes).

Note that neither code unit nor code point correspond directly with what
we normally think of as a "character".  The Unicode terminology for that
is:

Grapheme = one or more code points that combine together to produce a
single visual representation.  For example, the 2-code-point sequence
U+006D U+030A produces the *single* grapheme `m̊`, and the 3-code-point
sequence U+03C0 U+0306 U+032f produces the grapheme `π̯̆`.  Note that each
code point in these sequences may require multiple code units, depending
on which encoding you're using.  This email is encoded in UTF-8, so the
first sequence occupies 3 bytes (1 byte for the 1st code point, 2 bytes
for the second), and the second sequence occupies 6 bytes (2 bytes per
code point).
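
A minimal sketch of counting all three things for the first sequence
(std.uni.byGrapheme and std.utf.byDchar do the work):

```D
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

void main() {
    string s = "m\u030A";                  // U+006D U+030A, i.e. m̊
    assert(s.length == 3);                 // 3 UTF-8 code units (bytes)
    assert(s.byDchar.walkLength == 2);     // 2 code points
    assert(s.byGrapheme.walkLength == 1);  // 1 grapheme
}
```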

//

OK, now let's talk about D.  In D, we have 3 "character" types (I'm
putting "character" in quotes because they are actually code units, do
NOT confuse them with visual characters): char, wchar, dchar, which are
1, 2, and 4 bytes, respectively.

To find out whether something fits into a char, first you have to find
out how many code points it occupies, and second, how many code units
are required to represent those code points.  For example, the character
`À` can be represented by the single code point U+00C0. However, it
requires *two* UTF-8 code units to represent (this is a consequence of
how UTF-8 represents code points), in spite of being a value that's less
than 256.  So U+00C0 would not fit into a single char; you need (at
least) 2 chars to hold it.

If we were to use UTF-16 instead, U+00C0 would easily fit into a single
code unit.  Each code unit in UTF-16, however, is 2 bytes, so for some
code points (such as 'a', U+0061), the UTF-8 encoding would be smaller.
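
In D terms, a quick sketch of those counts (string is UTF-8, wstring is
UTF-16):

```D
void main() {
    assert("\u00C0".length  == 2);  // À: two UTF-8 code units (2 bytes)
    assert("\u00C0"w.length == 1);  // À: one UTF-16 code unit (2 bytes)
    assert("a".length  == 1);       // a: one UTF-8 code unit (1 byte)
    assert("a"w.length == 1);       // a: one UTF-16 code unit (2 bytes)
}
```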

A dchar always fits any Unicode code point, because code points can only
go up to 0x10FFFF (max 3 bytes).  HOWEVER, using dchar does NOT
guarantee that it will hold a complete visual character, because Unicode
graphemes can be arbitrarily long.  For example, the `π̯̆` grapheme above
requires at least 3 code points to represent, which means it requires at
least 3 dchars (== 12 bytes) to represent. In UTF-8 encoding, however,
it occupies only 6 bytes (still the same 3 code points, just encoded
differently).
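
Again as a quick check (same 3 code points, two different encodings):

```D
void main() {
    string  s = "\u03C0\u0306\u032F";   // π + two combining marks
    dstring d = "\u03C0\u0306\u032F"d;  // same thing as UTF-32
    assert(s.length == 6);              // 6 bytes in UTF-8
    assert(d.length == 3);              // 3 dchars == 12 bytes
}
```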

//

I hope this is clear (as mud :P -- Unicode is a complex beast). Or at
least clear*er*, anyway.


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, 

Re: Is it just me, or does vibe.d's api doc look strange?

2022-12-02 Thread Steven Schveighoffer via Digitalmars-d-learn

On 12/2/22 3:46 PM, Christian Köstlin wrote:
Please see this screenshot: https://imgur.com/Ez9TcqD of my browser 
(firefox or chrome) of https://vibed.org/api/vibe.web.auth/


Not just you. And Sonke is aware (there's a conversation on the dlang 
slack).


-Steve



Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread rikki cattermole via Digitalmars-d-learn

On 03/12/2022 10:35 AM, Adam D Ruppe wrote:

On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole wrote:

char is always UTF-8 codepoint and therefore exactly 1 byte.
wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
dchar is always UTF-32 codepoint and therefore exactly 4 bytes;


You mean "code unit". There's no such thing as a utf-8/16/32 codepoint. 
A codepoint is a more abstract concept that is encoded in one of the utf 
formats.


Yeah you're right, it's a code unit, not a code point.



Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread thebluepandabear via Digitalmars-d-learn


That's not a utf-8 code unit.


Hm, that specifically might not be. The thing is, I thought a 
UTF-8 code unit can store 1-4 bytes for each character, so how 
is it right to say that `char` is a UTF-8 code unit? It seems 
like it's just an ASCII code unit.


Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread Adam D Ruppe via Digitalmars-d-learn
On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole 
wrote:

char is always UTF-8 codepoint and therefore exactly 1 byte.
wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
dchar is always UTF-32 codepoint and therefore exactly 4 bytes;


You mean "code unit". There's no such thing as a utf-8/16/32 
codepoint. A codepoint is a more abstract concept that is encoded 
in one of the utf formats.




Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread rikki cattermole via Digitalmars-d-learn

char is always UTF-8 codepoint and therefore exactly 1 byte.

wchar is always UTF-16 codepoint and therefore exactly 2 bytes.

dchar is always UTF-32 codepoint and therefore exactly 4 bytes;

'Ğ' has the value U+011E which is a lot larger than what 1 byte can 
hold. You need 2 chars or 1 wchar/dchar.


https://unicode-table.com/en/011E/
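
A quick way to see that from the array lengths:

```D
void main() {
    assert("Ğ".length  == 2);  // string:  two UTF-8 code units
    assert("Ğ"w.length == 1);  // wstring: one UTF-16 code unit
    assert("Ğ"d.length == 1);  // dstring: one UTF-32 code unit
    wchar w = 'Ğ';             // fits in a single wchar...
    dchar d = 'Ğ';             // ...and in a single dchar
    assert(w == 0x011E && d == 0x011E);
}
```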


Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread Adam D Ruppe via Digitalmars-d-learn
On Friday, 2 December 2022 at 21:18:44 UTC, thebluepandabear 
wrote:
It's his explanation as to why this code doesn't compile even 
though Ğ is a UTF-8 code unit:


That's not a utf-8 code unit.

A utf-8 code unit is just a single byte with a particular 
interpretation.



If I do `char.sizeof` it's 2 bytes


Are you sure about that? `char.sizeof` is 1. A char is just a 
single byte.


The Ğ code point doesn't fit in a single code unit, so it is 
spread across two of them (note code units and code points are 
two different things: a code point is an abstract idea, like a 
number, and a code unit is one byte that, when combined with 
others, can encode that number).
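
For example, std.utf.decode combines the code units back into the 
code point (a minimal sketch):

```D
import std.utf : decode;

void main() {
    string s = "Ğ";           // two code units...
    size_t i = 0;
    dchar cp = decode(s, i);  // ...combined into one code point
    assert(cp == 0x011E);
    assert(i == 2);           // both code units were consumed
}
```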


Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

2022-12-02 Thread thebluepandabear via Digitalmars-d-learn

Hello (noob question),

I am reading a book about D by Ali, and he talks about the 
different char types: char, wchar, and dchar. He says that char 
stores a UTF-8 code unit, wchar stores a UTF-16 code unit, and 
dchar stores a UTF-32 code unit, this makes sense.


He then goes on to say that:

"Contrary to some other programming languages, characters in D 
may consist of
different numbers of bytes. For example, because 'Ğ' must be 
represented by at
least 2 bytes in Unicode, it doesn't fit in a variable of type 
char. On the other
hand, because dchar consists of 4 bytes, it can hold any Unicode 
character."


It's his explanation as to why this code doesn't compile even 
though Ğ is a UTF-8 code unit:


```D
char utf8 = 'Ğ';
```

But I don't really understand this? What does it mean that it 
'must be represented by at least 2 bytes'? If I do `char.sizeof` 
it's 2 bytes so I am confused why it doesn't fit, I don't think 
it was explained well in the book.


Any help would be appreciated.



Re: Is it just me, or does vibe.d's api doc look strange?

2022-12-02 Thread Adam D Ruppe via Digitalmars-d-learn
On Friday, 2 December 2022 at 20:46:35 UTC, Christian Köstlin 
wrote:
Please see this screenshot: https://imgur.com/Ez9TcqD of my 
browser (firefox or chrome) of 
https://vibed.org/api/vibe.web.auth/




Not just you, there's something broken in their html.

You can use my website for vibe docs too:

http://dpldocs.info/vibe.web.auth

Though this specific example doesn't show in my docs because 
there are other declarations in the middle of it. But you can at 
least view it under the "see source" link.


http://vibe-d.dpldocs.info/source/vibe.web.auth.d.html#L15

Then most everything else works on the regular site anyway.


Is it just me, or does vibe.d's api doc look strange?

2022-12-02 Thread Christian Köstlin via Digitalmars-d-learn
Please see this screenshot: https://imgur.com/Ez9TcqD of my browser 
(firefox or chrome) of https://vibed.org/api/vibe.web.auth/


Kind regards,
Christian


Re: Getting the default value of a class member field

2022-12-02 Thread WebFreak001 via Digitalmars-d-learn

On Friday, 2 December 2022 at 04:14:37 UTC, kinke wrote:

On Friday, 2 December 2022 at 00:24:44 UTC, WebFreak001 wrote:
I want to use the static initializers (when used with an UDA) 
as default values inside my SQL database.


See 
https://github.com/rorm-orm/dorm/blob/a86c7856e71bbc18cd50a7a6f701c325a4746518/source/dorm/declarative/conversion.d#L959


With my current design it's not really possible to move it out 
of compile time to runtime because the type description I 
create there gets serialized and output for use in another 
program (the migrator). Right now it's simply taking the 
compile time struct I generate and just dumping it without 
modification into a JSON serializer.


[...]


Okay, so what's blocking CTFE construction of these models? 
AFAICT, you have a templated base constructor in `Model`, which 
runs an optional `@constructValue!(() => Clock.currTime + 
4.hours)` lambda UDA for all fields of the derived type. Can't 
you replace all of that with a default ctor in the derived type?


```
class MyModel : Model {
    int x = 123;        // statically initialized
    SysTime validUntil; // dynamically initialized in ctor

    this() {
        validUntil = Clock.currTime + 4.hours;
    }
}
```

Such an instance should be CTFE-constructible, and the valid 
instance would feature the expected value for the `validUntil` 
field. If you need to know about such dynamically generated 
fields (as e.g. here in this time-critical example), an option 
would be a `@dynamicallyInitialized` UDA. Then if you 
additionally need to be able to re-run these current 
`@constructValue` lambdas for an already constructed instance, 
you could probably go with creating a fresh new instance and 
copying over the fresh new field values.


constructValue is entirely different from this default value. 
It's not being put into the database; it's just for the library 
to send when the value is missing (so other apps accessing the 
database can't use the same info). It's also still an open 
question whether it even adds any value, because it isn't part 
of the DB.

To support constructValue I iterate over all DB fields and run 
their constructors. I implemented listing the fields with a 
ListFields!T template. However, when I want to generate the DB 
field information, I also use this same template to list all 
columns and generate attributes, such as what default value to 
put into SQL. The problem here is that this tries to call the 
constructor, which wants to iterate over the fields while the 
fields are still being iterated (or something similar to this).


Basically, in the end the compiler complained about a forward 
reference / the size of the fields not being known when I added 
a field of a template type that would try to use the same 
ListFields template on the class that field was placed in.


Right now I hack around this by adding an `int cacheHack` 
template parameter to ListFields, which simply does nothing. 
However, this works around the compiler deciding the template 
isn't usable, and everything seems to work with it.


Anyway this is all completely different from the default value 
thing, because I already found workarounds and changed some 
internals a bit to support things like cyclic data structures.


I would still like a way to access the initializer of class 
fields, and it would be especially cool to know whether they are 
explicitly set. Right now I have this weird and heavy 
`@defaultValue(...)` annotation that's basically the same as `= 
...;`, which I only needed to add to make it possible to use 
T.init as the default value in the DB as well, without forcing 
it. My code uses `@defaultFromInit` to make it use the 
initializer, but it would be great if I didn't need this at all. 
(Although, because of my cyclic template issues, it might break 
again and be unusable for me.)
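
For what it's worth, here is a minimal stand-alone sketch 
(hypothetical `SimpleModel`, not dorm's actual API) of reading a 
statically initialized field's value by constructing the class 
during CTFE, along the lines of kinke's suggestion above:

```D
class SimpleModel {
    int x = 123;           // statically initialized
    string name = "todo";  // statically initialized
}

// CTFE-construct an instance and read the initializers off it.
enum int defaultX = () {
    auto m = new SimpleModel();
    return m.x;
}();

enum string defaultName = () {
    auto m = new SimpleModel();
    return m.name;
}();

static assert(defaultX == 123);
static assert(defaultName == "todo");

void main() {}
```

This only covers statically initialized fields; anything set in a 
ctor that isn't CTFE-able (like `Clock.currTime`) still can't be 
read this way, which is where something like the 
`@dynamicallyInitialized` idea would come in.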