Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On Fri, Dec 02, 2022 at 11:47:30PM +, thebluepandabear via Digitalmars-d-learn wrote:
> On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear wrote:
> > > :-D
> > >
> > > (Exercise for the reader: what's the Hausdorff dimension of the set
> > > of strings over Unicode space? :-P)
> > >
> > > T
> >
> > Your explanation was great and cleared things up... not sure about
> > the linear algebra one though ;)
>
> Actually now when I think about it, it is quite a creative way of
> explaining things. I take back what I said.

It was a math joke. :-P It was half-serious, though, and I think the analogy surprisingly holds up well enough in many cases. In any case, silly analogies are often a good mnemonic for remembering things like Unicode terminology. :-D

T

--
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear wrote:
> > :-D
> >
> > (Exercise for the reader: what's the Hausdorff dimension of the set
> > of strings over Unicode space? :-P)
> >
> > T
>
> Your explanation was great and cleared things up... not sure about the
> linear algebra one though ;)

Actually now when I think about it, it is quite a creative way of explaining things. I take back what I said.
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
> :-D
>
> (Exercise for the reader: what's the Hausdorff dimension of the set of
> strings over Unicode space? :-P)
>
> T

Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On Fri, Dec 02, 2022 at 02:32:47PM -0800, Ali Çehreli via Digitalmars-d-learn wrote:
> On 12/2/22 13:44, rikki cattermole wrote:
>
> > Yeah you're right, its code unit not code point.
>
> This proves yet again how badly chosen those names are. I must look it
> up every time before using one or the other.
>
> So they are both "code"? One is a "unit" and the other is a "point"?
> Sheesh!
[...]

Think of Unicode as a vector space. A code point is a point in this space, and a code unit is one of the unit vectors; although some points can be reached with a single unit vector, to get to a general point you need to combine one or more unit vectors. Furthermore, the set of unit vectors you have depends on which coordinate system (i.e., encoding) you're using. Reencoding a Unicode string is essentially changing your coordinate system. ;-) (Exercise for the reader: compute the transformation matrix for reencoding. :-P)

Also, a grapheme is a curve through this space (you *graph* the curve, you see), and as we all know, a curve may consist of more than one point. :-D (Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)

T

--
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On 12/2/22 13:18, thebluepandabear wrote:

> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'?

The integral value of Ğ in Unicode is 286:

https://unicodeplus.com/U+011E

Since 'char' is 8 bits, it cannot store 286. At first, that sounds like a hopeless situation, making one think that Ğ cannot be represented in a string. The concept of encoding to the rescue: Ğ can be encoded by 2 chars:

```D
import std.stdio;

void main() {
    foreach (c; "Ğ") {
        writefln!"%b"(c);
    }
}
```

That program prints

11000100
10011110

Articles like the following explain well how that second byte is a continuation byte:

https://en.wikipedia.org/wiki/UTF-8#Encoding

(It's a continuation byte because it starts with the bits 10.)

> I don't think it was explained well in the book.

Coincidentally, according to other recent feedback I received, Unicode and UTF are introduced way too early for such a book. I agree. I hadn't understood a single thing the first time smart people tried to explain Unicode and UTF encodings at the company where I worked, even though I had years of programming experience back then. (Although, I now think the instructors were not really good; and the company was pretty bad as well. :) )

> Any help would be appreciated.

I recommend the Wikipedia page I linked above. It is enlightening to understand how about 150K Unicode characters can be encoded with units of 8 bits.

You can safely ignore wchar, dchar, wstring, and dstring for daily coding. Only special programs may need to deal with those types. 'char' and string are what we need and do use predominantly in D.

Ali
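To make the continuation-byte structure above concrete, here is a minimal sketch of my own (not from the book) that checks the lead and continuation bit patterns of the two code units of "Ğ" directly:

```d
import std.stdio;

void main()
{
    string s = "Ğ";         // U+011E, encoded as 2 UTF-8 code units
    assert(s.length == 2);  // .length counts code units, not characters

    // The lead byte of a 2-byte sequence starts with the bits 110.
    assert((s[0] & 0b1110_0000) == 0b1100_0000);

    // A continuation byte starts with the bits 10.
    assert((s[1] & 0b1100_0000) == 0b1000_0000);

    writeln("both bytes match the expected UTF-8 patterns");
}
```

Masking with `0b1110_0000` and `0b1100_0000` isolates exactly the marker bits the Wikipedia article describes; the remaining bits carry the code point value 286.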
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On 03/12/2022 11:32 AM, Ali Çehreli wrote:
> On 12/2/22 13:44, rikki cattermole wrote:
>
> > Yeah you're right, its code unit not code point.
>
> This proves yet again how badly chosen those names are. I must look it
> up every time before using one or the other.
>
> So they are both "code"? One is a "unit" and the other is a "point"?
> Sheesh!
>
> Ali

Yeah, and I even have a physical copy beside me!

P.S. Oh btw, Unicode 15 should be coming soon to Phobos :)

Once that is in, expect Turkic support for case-insensitive matching!
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On 12/2/22 13:44, rikki cattermole wrote:

> Yeah you're right, its code unit not code point.

This proves yet again how badly chosen those names are. I must look it up every time before using one or the other.

So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!

Ali
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On 12/2/22 4:18 PM, thebluepandabear wrote:
> Hello (noob question),
>
> I am reading a book about D by Ali, and he talks about the different
> char types: char, wchar, and dchar. He says that char stores a UTF-8
> code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32
> code unit, this makes sense.
>
> He then goes on to say that:
>
> "Contrary to some other programming languages, characters in D may
> consist of different numbers of bytes. For example, because 'Ğ' must be
> represented by at least 2 bytes in Unicode, it doesn't fit in a
> variable of type char. On the other hand, because dchar consists of 4
> bytes, it can hold any Unicode character."
>
> It's his explanation as to why this code doesn't compile even though Ğ
> is a UTF-8 code unit:
>
> ```D
> char utf8 = 'Ğ';
> ```
>
> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so
> I am confused why it doesn't fit, I don't think it was explained well
> in the book.
>
> Any help would be appreciated.

A *code point* is a value out of the Unicode standard. [Code points](https://en.wikipedia.org/wiki/Code_point) represent glyphs, combining marks, or other things (not sure of the full list) that reside in the standard. When you want to figure out, "hmm... what value does the emoji 👍 have?", it's a *code point*. This is a number from 0 to 0x10FFFF for Unicode. (BTW, it's 0x1F44D.)

UTF-X are various *encodings* of Unicode. UTF-8 is an encoding of Unicode where 1 to 4 bytes (called *code units*) encode a single Unicode *code point*. There are various encodings, and all can be decoded to the same list of *code points*. The most direct form is UTF-32, where each *code point* is also a *code unit*.

`char` is a UTF-8 code unit. `wchar` is a UTF-16 code unit, and `dchar` is a UTF-32 code unit.

The reason why you can't encode a Ğ into a single `char` is because its code point is 0x11E, which does not fit into a single `char`. Therefore, an encoding scheme is used to put it into 2 `char`s.

Hope this helps.

-Steve
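As a quick check of the above (my own sketch, not from the thread), D's three string types expose exactly these code-unit counts for the single code point Ğ, and all encodings decode to the same code points:

```d
import std.algorithm : equal;
import std.utf : byUTF;

void main()
{
    string  u8  = "Ğ";  // UTF-8:  2 code units for this one code point
    wstring u16 = "Ğ";  // UTF-16: 1 code unit
    dstring u32 = "Ğ";  // UTF-32: 1 code unit == 1 code point

    assert(u8.length  == 2);
    assert(u16.length == 1);
    assert(u32.length == 1);
    assert(u32[0] == 0x011E);

    // Decoding any encoding yields the same sequence of code points.
    assert(u8.byUTF!dchar.equal(u32));
}
```

Note that `.length` counts code units, which is why the same one-character string has three different lengths depending on the encoding.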
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On Fri, Dec 02, 2022 at 09:18:44PM +, thebluepandabear via Digitalmars-d-learn wrote:
> Hello (noob question),
>
> I am reading a book about D by Ali, and he talks about the different
> char types: char, wchar, and dchar. He says that char stores a UTF-8
> code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32
> code unit, this makes sense.
>
> He then goes on to say that:
>
> "Contrary to some other programming languages, characters in D may
> consist of different numbers of bytes. For example, because 'Ğ' must
> be represented by at least 2 bytes in Unicode, it doesn't fit in a
> variable of type char. On the other hand, because dchar consists of 4
> bytes, it can hold any Unicode character."
>
> It's his explanation as to why this code doesn't compile even though Ğ
> is a UTF-8 code unit:
>
> ```D
> char utf8 = 'Ğ';
> ```
>
> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes
> so I am confused why it doesn't fit, I don't think it was explained
> well in the book.

That's wrong, char.sizeof should be exactly 1 byte, no more, no less.

First, before we talk about Unicode, we need to get the terminology straight:

Code unit = unit of storage in a particular representation (encoding) of Unicode. E.g., a UTF-8 string consists of a stream of 1-byte code units, a UTF-16 string consists of a stream of 2-byte code units, etc. Do NOT confuse this with "code point", or worse, "character".

Code point = the abstract Unicode entity that occupies a single slot in the Unicode tables. Usually written as U+xxxx where xxxx is some hexadecimal number. IMPORTANT NOTE: do NOT confuse a code point with what a normal human being thinks of as a "character". Even though in many cases a code point happens to represent a single "character", this isn't always true. It's safer to understand a code point as a single slot in one of the Unicode tables.

NOTE: a code point may be represented by multiple code units, depending on the encoding. For example, in UTF-8, some code points require multiple code units (multiple bytes) to represent. This varies depending on the character; the code point `A` needs only a single code unit, but the code point `Ш` needs 2 bytes, and the code point `😀` requires 4 bytes. In UTF-16, `A` and `Ш` occupy only 1 code unit (2 bytes, because in UTF-16, one code unit == 2 bytes), but `😀` needs 2 code units (4 bytes).

Note that neither code unit nor code point corresponds directly with what we normally think of as a "character". The Unicode terminology for that is:

Grapheme = one or more code points that combine together to produce a single visual representation. For example, the 2-code-point sequence U+006D U+030A produces the *single* grapheme `m̊`, and the 3-code-point sequence U+03C0 U+0306 U+032F produces the grapheme `π̯̆`. Note that each code point in these sequences may require multiple code units, depending on which encoding you're using. This email is encoded in UTF-8, so the first sequence occupies 3 bytes (1 byte for the 1st code point, 2 bytes for the second), and the second sequence occupies 6 bytes (2 bytes per code point).

//

OK, now let's talk about D. In D, we have 3 "character" types (I'm putting "character" in quotes because they are actually code units; do NOT confuse them with visual characters): char, wchar, dchar, which are 1, 2, and 4 bytes, respectively.

To find out whether something fits into a char, first you have to find out how many code points it occupies, and second, how many code units are required to represent those code points. For example, the character `À` can be represented by the single code point U+00C0. However, it requires *two* UTF-8 code units to represent (this is a consequence of how UTF-8 represents code points), in spite of being a value that's less than 256. So U+00C0 would not fit into a single char; you need (at least) 2 chars to hold it.

If we were to use UTF-16 instead, U+00C0 would easily fit into a single code unit. Each code unit in UTF-16, however, is 2 bytes, so for some code points (such as 'a', U+0061), the UTF-8 encoding would be smaller.

A dchar always fits any Unicode code point, because code points can only go up to 0x10FFFF (max 3 bytes). HOWEVER, using dchar does NOT guarantee that it will hold a complete visual character, because Unicode graphemes can be arbitrarily long. For example, the `π̯̆` grapheme above requires at least 3 code points to represent, which means it requires at least 3 dchars (== 12 bytes) to represent. In UTF-8 encoding, however, it occupies only 6 bytes (still the same 3 code points, just encoded differently).

//

I hope this is clear (as mud :P -- Unicode is a complex beast). Or at least clear*er*, anyway.

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL
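The three levels described above (code units, code points, graphemes) can be counted separately in D; here is a minimal sketch of my own using Phobos, illustrating the m + combining ring example:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byUTF;

void main()
{
    // "m" followed by combining ring above (U+030A):
    // two code points, but one grapheme.
    string s = "m\u030A";

    assert(s.length == 3);                  // UTF-8 code units (1 + 2)
    assert(s.byUTF!dchar.walkLength == 2);  // code points
    assert(s.byGrapheme.walkLength == 1);   // graphemes
}
```

This is also why iterating a string by `dchar` is not the same as iterating it by visual characters: only `byGrapheme` groups combining marks with their base.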
Re: Is it just me, or does vibe.d's api doc look strange?
On 12/2/22 3:46 PM, Christian Köstlin wrote:
> Please see this screenshot: https://imgur.com/Ez9TcqD of my browser
> (firefox or chrome) of https://vibed.org/api/vibe.web.auth/

Not just you. And Sonke is aware (there's a conversation on the dlang slack).

-Steve
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On 03/12/2022 10:35 AM, Adam D Ruppe wrote:
> On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole wrote:
> > char is always UTF-8 codepoint and therefore exactly 1 byte.
> > wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
> > dchar is always UTF-32 codepoint and therefore exactly 4 bytes;
>
> You mean "code unit". There's no such thing as a utf-8/16/32 codepoint.
> A codepoint is a more abstract concept that is encoded in one of the
> utf formats.

Yeah you're right, it's code unit, not code point.
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
> That's not a utf-8 code unit.

Hm, that specifically might not be. The thing is, I thought a UTF-8 code unit can store 1-4 bytes for each character, so how is it right to say that `char` is a UTF-8 code unit? It seems like it's just an ASCII code unit.
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole wrote:
> char is always UTF-8 codepoint and therefore exactly 1 byte.
> wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
> dchar is always UTF-32 codepoint and therefore exactly 4 bytes;

You mean "code unit". There's no such thing as a utf-8/16/32 codepoint. A codepoint is a more abstract concept that is encoded in one of the utf formats.
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
char is always UTF-8 codepoint and therefore exactly 1 byte.
wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
dchar is always UTF-32 codepoint and therefore exactly 4 bytes;

'Ğ' has the value U+011E which is a lot larger than what 1 byte can hold. You need 2 chars or 1 wchar/dchar.

https://unicode-table.com/en/011E/
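This can be demonstrated directly; the following is a small sketch of my own showing that the assignment from the original question is rejected at compile time, while a single wchar or dchar holds U+011E fine:

```d
void main()
{
    // 'Ğ' (U+011E) doesn't fit in a single char...
    static assert(!__traits(compiles, { char c = 'Ğ'; }));

    // ...but fits in one wchar or dchar.
    wchar w = 'Ğ';
    dchar d = 'Ğ';
    assert(w == 0x011E);
    assert(d == 0x011E);

    // As a UTF-8 string, it occupies 2 chars (code units).
    assert("Ğ".length == 2);
}
```

The compiler rejects the `char` assignment because the literal's value (286) is outside the 0-255 range of `char`.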
Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
On Friday, 2 December 2022 at 21:18:44 UTC, thebluepandabear wrote:
> It's his explanation as to why this code doesn't compile even though Ğ
> is a UTF-8 code unit:

That's not a utf-8 code unit. A utf-8 code unit is just a single byte with a particular interpretation.

> If I do `char.sizeof` it's 2 bytes

Are you sure about that? `char.sizeof` is 1. A char is just a single byte.

The Ğ code point (note code units and code points are two different things: a code point is an abstract idea, like a number, and a code unit is one byte that, when combined with others, can encode that number) is too big to fit in one byte, so it takes more than one UTF-8 code unit.
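For the record, the fixed sizes of D's character types can be verified with a trivial check (a sketch, not from the thread):

```d
// The three D character types are fixed-size code units.
static assert(char.sizeof  == 1); // UTF-8 code unit: always 1 byte
static assert(wchar.sizeof == 2); // UTF-16 code unit: always 2 bytes
static assert(dchar.sizeof == 4); // UTF-32 code unit: always 4 bytes

void main() {}
```

So `char.sizeof` can never report 2; only the number of code units needed to *encode* a character varies.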
Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
Hello (noob question),

I am reading a book about D by Ali, and he talks about the different char types: char, wchar, and dchar. He says that char stores a UTF-8 code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32 code unit, this makes sense.

He then goes on to say that:

"Contrary to some other programming languages, characters in D may consist of different numbers of bytes. For example, because 'Ğ' must be represented by at least 2 bytes in Unicode, it doesn't fit in a variable of type char. On the other hand, because dchar consists of 4 bytes, it can hold any Unicode character."

It's his explanation as to why this code doesn't compile even though Ğ is a UTF-8 code unit:

```D
char utf8 = 'Ğ';
```

But I don't really understand this? What does it mean that it 'must be represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so I am confused why it doesn't fit, I don't think it was explained well in the book.

Any help would be appreciated.
Re: Is it just me, or does vibe.d's api doc look strange?
On Friday, 2 December 2022 at 20:46:35 UTC, Christian Köstlin wrote:
> Please see this screenshot: https://imgur.com/Ez9TcqD of my browser
> (firefox or chrome) of https://vibed.org/api/vibe.web.auth/

Not just you, there's something broken in their html. You can use my website for vibe docs too: http://dpldocs.info/vibe.web.auth

Though this specific example doesn't show in my docs, because there's other declarations in the middle of it. But you can at least view it under the see source link: http://vibe-d.dpldocs.info/source/vibe.web.auth.d.html#L15

Then most everything else works on the regular site anyway.
Is it just me, or does vibe.d's api doc look strange?
Please see this screenshot: https://imgur.com/Ez9TcqD of my browser (firefox or chrome) of https://vibed.org/api/vibe.web.auth/ Kind regards, Christian
Re: Getting the default value of a class member field
On Friday, 2 December 2022 at 04:14:37 UTC, kinke wrote:
> On Friday, 2 December 2022 at 00:24:44 UTC, WebFreak001 wrote:
> > I want to use the static initializers (when used with an UDA) as
> > default values inside my SQL database. See
> > https://github.com/rorm-orm/dorm/blob/a86c7856e71bbc18cd50a7a6f701c325a4746518/source/dorm/declarative/conversion.d#L959
> >
> > With my current design it's not really possible to move it out of
> > compile time to runtime because the type description I create there
> > gets serialized and output for use in another program (the migrator).
> > Right now it's simply taking the compile time struct I generate and
> > just dumping it without modification into a JSON serializer.
> > [...]
>
> Okay, so what's blocking CTFE construction of these models? AFAICT, you
> have a templated base constructor in `Model`, which runs an optional
> `@constructValue!(() => Clock.currTime + 4.hours)` lambda UDA for all
> fields of the derived type. Can't you replace all of that with a
> default ctor in the derived type?
>
> ```
> class MyModel : Model
> {
>     int x = 123;        // statically initialized
>     SysTime validUntil; // dynamically initialized in ctor
>
>     this()
>     {
>         validUntil = Clock.currTime + 4.hours;
>     }
> }
> ```
>
> Such an instance should be CTFE-constructible, and the valid instance
> would feature the expected value for the `validUntil` field. If you
> need to know about such dynamically generated fields (as e.g. here in
> this time-critical example), an option would be a
> `@dynamicallyInitialized` UDA. Then if you additionally need to be able
> to re-run these current `@constructValue` lambdas for an already
> constructed instance, you could probably go with creating a fresh new
> instance and copying over the fresh new field values.

constructValue is entirely different than this default value. It's not being put into the database, it's just for the library to send it when it's missing. (So other apps accessing the database can't use the same info.) It's also still an open question if it even gives any value, because it isn't part of the DB.

To support constructValue I iterate over all DB fields and run their constructors. I implemented listing the fields with a ListFields!T template. However, now when I want to generate the DB field information, I also use this same template to list all columns to generate attributes, such as what default value to put into SQL. The problem here is that that tries to call the constructor, which wants to iterate over the fields, while the fields are still being iterated (or something similar to this). Basically, in the end the compiler complained about a forward reference / the size of the fields not being known when I put in a field of a template type that would try to use the same ListFields template on the class I put that value in.

Right now I hack around this by adding an `int cacheHack` template parameter to ListFields, which simply does nothing. However, this fixes the compiler thinking the template isn't usable, and everything seems to work with this.

Anyway, this is all completely different from the default value thing, because I already found workarounds and changed some internals a bit to support things like cyclic data structures. I would still like a way to access the initializer of class fields, and it would be especially cool to know if they are explicitly set. Right now I have this weird and heavy `@defaultValue(...)` annotation that's basically the same as `= ...;`, which I just needed to add to make it possible to use T.init as the default value in the DB as well, but not force it. My code uses `@defaultFromInit` to make it use the initializer, but it would be great if I didn't need this at all. (Although because of my cyclic template issues it might break again and be unusable for me.)
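For what it's worth, a hedged sketch of one direction: for *structs* (not classes, where `T.init` is a null reference), the static field initializers are already reachable at compile time through `T.init`, without any `@defaultValue` UDA. All names below are hypothetical illustration, not dorm's actual API:

```d
// Hypothetical model type; not dorm's API.
struct Account
{
    int balance = 100;        // static initializer we want to read
    string currency = "EUR";
}

void main()
{
    // For structs, .init is CTFE-usable and carries the initializers.
    enum defaults = Account.init;
    assert(defaults.balance == 100);
    assert(defaults.currency == "EUR");

    // Per-field access by index also works at compile time.
    static assert(Account.init.tupleof[0] == 100);
    static assert(Account.init.tupleof[1] == "EUR");
}
```

Whether a field was *explicitly* set to its initializer value is still not distinguishable this way, which is the part the `@defaultFromInit` workaround covers.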