Re: Of possible interest: fast UTF8 validation

2018-05-18 Thread David Nadlinger via Digitalmars-d

On Wednesday, 16 May 2018 at 14:48:54 UTC, Ethan Watson wrote:
And even better - LDC doesn't support core.simd and has its own 
intrinsics that don't match the SSE/AVX intrinsics API 
published by Intel.


To provide some context here: LDC only supports the types from 
core.simd, but not the __simd "assembler macro" that DMD uses to 
more or less directly emit the corresponding x86 opcodes.


LDC does support most of the GCC-style SIMD builtins for the 
respective target (x86, ARM, …), but there are two problems with 
this:


 1) As Ethan pointed out, the GCC API does not match Intel's 
intrinsics; for example, it is 
`__builtin_ia32_vfnmsubpd256_mask3` instead of 
`_mm256_mask_fnmsub_pd`, and the argument orders differ as well.


 2) The functions that LDC exposes as intrinsics are those that 
are intrinsics on the LLVM IR level. However, some operations can 
be directly represented in normal, instruction-set-independent 
LLVM IR – no explicit intrinsics are provided for these.
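
To illustrate point 2, a minimal sketch (the wrapper name is 
made up; only core.simd types and plain vector arithmetic are 
assumed):

  import core.simd : float4;

  // No intrinsic needed: element-wise arithmetic on core.simd
  // vector types lowers straight to instruction-set-independent
  // LLVM IR vector operations.
  float4 addPs(float4 a, float4 b)
  {
      return a + b;
  }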


Unfortunately, LLVM doesn't seem to provide any particularly 
helpful tools for implementing Intel's intrinsics API. 
x86intrin.h is manually implemented for Clang as a collection of 
various macros and functions.


It would be seriously cool if someone could write a small tool to 
parse those headers, (semi-)automatically convert them to D, and 
generate tests for comparing the emitted IR against Clang. I'm 
happy to help with the LDC side of things.


 — David


Re: Of possible interest: fast UTF8 validation

2018-05-18 Thread Neia Neutuladh via Digitalmars-d

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
On Thu, May 17, 2018 at 07:13:23PM +, Patrick Schluter via 
Digitalmars-d wrote: [...]

- the auto-synchronization and the statelessness are big deals.


Yes.  Imagine if we standardized on a header-based string 
encoding, and we wanted to implement a substring function over 
a string that contains multiple segments of different 
languages. Instead of a cheap slicing over the string, you'd 
need to scan the string or otherwise keep track of which 
segment the start/end of the substring lies in, allocate memory 
to insert headers so that the segments are properly 
interpreted, etc.. It would be an implementational nightmare, 
and an unavoidable performance hit (you'd have to copy data 
every time you take a substring), and the @nogc guys would be 
up in arms.


You'd have three data structures: Strand, Rope, and Slice.

A Strand is a series of bytes with an encoding. A Rope is a 
series of Strands. A Slice is a pair of location references 
within a Rope. You probably want a special datastructure to name 
a location within a Rope: Strand offset, then byte offset. Total 
of five words instead of two to pass a Slice, but zero dynamic 
allocations.
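
Roughly, in D (the field layout is just a guess from the 
description above, not a worked-out design):

  struct Strand
  {
      ushort encoding;            // which encoding this run of bytes uses
      immutable(ubyte)[] bytes;
  }

  struct Rope
  {
      Strand[] strands;           // a series of Strands
  }

  // "Strand offset, then byte offset" names a location in a Rope
  struct Location
  {
      size_t strand;
      size_t offset;
  }

  // five words (one pointer plus two Locations), zero dynamic allocations
  struct Slice
  {
      Rope* rope;
      Location begin;
      Location end;
  }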


This would be a problem for data locality. However, rope-style 
datastructures are handy for some types of string manipulation.


As an alternative, you might have a separate document specifying 
what encodings apply to what byte ranges. Slices would then be 
three words long (pointer to the string struct, start offset, end 
offset). Iterating would cost O(log(S) + M), where S is the 
number of encoded segments and M is the number of bytes in the 
slice.
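
A sketch of that alternative, again with invented names and 
layout:

  struct EncodingRange
  {
      size_t start;       // first byte this encoding applies to
      ushort encoding;
  }

  struct Text
  {
      immutable(ubyte)[] bytes;
      // sorted by start; assumed non-empty with ranges[0].start == 0
      immutable(EncodingRange)[] ranges;
  }

  // three words: pointer to the string struct, start offset, end offset
  struct Slice
  {
      Text* text;
      size_t begin;
      size_t end;
  }

  // O(log S) lookup of the encoding in effect at a byte offset
  ushort encodingAt(const(Text)* t, size_t offset)
  {
      size_t lo = 0, hi = t.ranges.length;
      while (hi - lo > 1)
      {
          const mid = lo + (hi - lo) / 2;
          if (t.ranges[mid].start <= offset)
              lo = mid;
          else
              hi = mid;
      }
      return t.ranges[lo].encoding;
  }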


Anyway, you either get a more complex data structure, or you have 
terrible time complexity, but you don't have both.


And that's assuming we have a sane header-based encoding for 
strings that contain segments in multiple languages in the 
first place. Linguistic analysis articles, for example, would 
easily contain many such segments within a paragraph, or 
perhaps in the same sentence. How would a header-based encoding 
work for such documents?  Nevermind the recent trend of 
liberally sprinkling emojis all over regular text. If every 
emoticon embedded in a string requires splitting the string 
into 3 segments complete with their own headers, I dare not 
imagine what the code that manipulates such strings would look 
like.


"Header" implies that all encoding data appears at the start of 
the document, or in a separate metadata segment. (Call it a start 
index and two bytes to specify the encoding; reserve the first 
few bits of the encoding to specify the width.) It also brings to 
mind HTTP, and reminds me that most documents are either mostly 
ASCII or a heavy mix of ASCII and something else (HTML and XML 
being the forerunners).


If the encoding succeeded at making most scripts single-byte, 
then, testing with https://ar.wikipedia.org/wiki/Main_Page, you 
might get within 15% of UTF-8's efficiency. And then a simple 
sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as 
long in this encoding as UTF-8, since it has ten encoded 
segments, each with overhead. (Assuming the header supports 
strings up to 2^32 bytes long.)
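
For what it's worth, a back-of-the-envelope check of that 
figure, under the same assumptions (4-byte start index plus a 
2-byte encoding tag per segment); my own count lands close to 
the 2.64 quoted above:

  void main()
  {
      enum chars        = 29;  // letters, spaces and the '?' in the sentence
      enum twoByteChars = 5;   // Ĉ, ĝ, ŝ, ŭ, ŝ take 2 bytes each in UTF-8
      enum utf8Bytes    = chars + twoByteChars;    // 34
      enum segments     = 10;                      // one per script switch
      enum headerBytes  = segments * 6 + chars;    // 60 + 29 = 89

      import std.stdio : writefln;
      writefln("%.2f", cast(double) headerBytes / utf8Bytes);  // ~2.6
  }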


If it didn't succeed at making Latin and Arabic single-byte 
scripts (and Latin contains over 800 characters in Unicode, while 
Arabic has over three hundred), it would be worse than UTF-16.


Re: Of possible interest: fast UTF8 validation

2018-05-18 Thread Nemanja Boric via Digitalmars-d

On Friday, 18 May 2018 at 08:44:41 UTC, Joakim wrote:


I was surprised to see that adding an emoji to a text message I 
sent last year cut my message character quota in half. I googled 
why this was, and it turns out that when you add an emoji, the 
text messaging client actually changes your message encoding 
from UTF-8 to UTF-16! I don't know if this is a limitation of 
the default Android messaging client, my telco carrier, or SMS, 
but I strongly suspect this is widespread.




Welcome to my world (and probably the world of most Europeans), 
where I haven't typed ć, č, ž and other non-ASCII letters since 
the early 2000s, even though SMS is mostly flat rate these days 
and people chat via WhatsApp anyway.


Re: Of possible interest: fast UTF8 validation

2018-05-18 Thread Joakim via Digitalmars-d

On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky 
wrote:

TCP being reliable just plain doesn’t cut it. Corruption of a 
single bit is very real.


Quoting to highlight and agree.

TCP is reliable because it resends dropped packets and delivers 
them in order.


I don't write TCP packets to my long-term storage medium.

UTF as a transportation format for Unicode is *far* more useful 
than just sending across a network.


The point wasn't that TCP is handling all the errors; it was a 
throwaway example of one other layer of the system, the network 
transport layer, which actually has a checksum that will detect 
a single bitflip, something UTF-8 usually will not. I mentioned 
that the filesystem and several other layers have their own such 
error detection, yet you guys crazily latch onto the TCP example 
alone.
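
To make that concrete, a small D example (std.utf.validate is 
the Phobos checker): flipping a low-order payload bit inside a 
multi-byte sequence still yields well-formed UTF-8, so 
validation passes and you silently get a different character.

  import std.utf : validate;

  void main()
  {
      char[] s = "é".dup;   // U+00E9, encoded as 0xC3 0xA9
      validate(s);          // fine

      s[1] ^= 0x01;         // flip one payload bit: 0xA9 becomes 0xA8
      validate(s);          // still valid UTF-8, but now it is U+00E8 'è'
  }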


On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
On Thu, May 17, 2018 at 07:13:23PM +, Patrick Schluter via 
Digitalmars-d wrote: [...]

- the auto-synchronization and the statelessness are big deals.


Yes.  Imagine if we standardized on a header-based string 
encoding, and we wanted to implement a substring function over 
a string that contains multiple segments of different 
languages. Instead of a cheap slicing over the string, you'd 
need to scan the string or otherwise keep track of which 
segment the start/end of the substring lies in, allocate memory 
to insert headers so that the segments are properly 
interpreted, etc.. It would be an implementational nightmare, 
and an unavoidable performance hit (you'd have to copy data 
every time you take a substring), and the @nogc guys would be 
up in arms.


As we discussed when I first raised this header scheme years ago, 
you're right that slicing could be more expensive, depending on 
whether you choose to allocate a new header for the substring or 
not. The question is whether the optimizations available from 
such a header telling you where all the language substrings are 
in a multi-language string make up for having to expensively 
process the entire UTF-8 string to get that or other data. I 
think it's fairly obvious the design tradeoff of the header would 
beat out UTF-8 for all but a few degenerate cases, but maybe you 
don't see it.


And that's assuming we have a sane header-based encoding for 
strings that contain segments in multiple languages in the 
first place. Linguistic analysis articles, for example, would 
easily contain many such segments within a paragraph, or 
perhaps in the same sentence. How would a header-based encoding 
work for such documents?


It would bloat the header to some extent, but less so than a 
UTF-8 string. You may want to use special header encodings for 
such edge cases too, if you want to maintain the same large 
performance lead over UTF-8 that you'd have for the common case.



Nevermind the recent trend of
liberally sprinkling emojis all over regular text. If every 
emoticon embedded in a string requires splitting the string 
into 3 segments complete with their own headers, I dare not 
imagine what the code that manipulates such strings would look 
like.


Personally, I don't consider emojis worth implementing, :) they 
shouldn't be part of Unicode. But since they are, I'm fairly 
certain header-based text messages with emojis would be 
significantly smaller than using UTF-8/16.


I was surprised to see that adding an emoji to a text message I 
sent last year cut my message character quota in half. I googled 
why this was, and it turns out that when you add an emoji, the 
text messaging client actually changes your message encoding 
from UTF-8 to UTF-16! I don't know if this is a limitation of 
the default Android messaging client, my telco carrier, or SMS, 
but I strongly suspect this is widespread.


Anyway, I can see the arguments about UTF-8 this time around are 
as bad as the first time I raised it five years back, so I'll 
leave this thread here.


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Patrick Schluter via Digitalmars-d

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
On Thu, May 17, 2018 at 07:13:23PM +, Patrick Schluter via 
Digitalmars-d wrote: [...]

[...]


Yes.  Imagine if we standardized on a header-based string 
encoding, and we wanted to implement a substring function over 
a string that contains multiple segments of different 
languages. Instead of a cheap slicing over the string, you'd 
need to scan the string or otherwise keep track of which 
segment the start/end of the substring lies in, allocate memory 
to insert headers so that the segments are properly 
interpreted, etc.. It would be an implementational nightmare, 
and an unavoidable performance hit (you'd have to copy data 
every time you take a substring), and the @nogc guys would be 
up in arms.


[...]
That's essentially what RTF with code pages was. I'm happy that 
we got rid of it and that it was replaced by XML; even if 
Microsoft's document XML is a bloated, ridiculous mess, it's 
still an order of magnitude less problematic than RTF (I mean at 
the text-encoding level).


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread H. S. Teoh via Digitalmars-d
On Thu, May 17, 2018 at 07:13:23PM +, Patrick Schluter via Digitalmars-d 
wrote:
[...]
> - the auto-synchronization and the statelessness are big deals.

Yes.  Imagine if we standardized on a header-based string encoding, and
we wanted to implement a substring function over a string that contains
multiple segments of different languages. Instead of a cheap slicing
over the string, you'd need to scan the string or otherwise keep track
of which segment the start/end of the substring lies in, allocate memory
to insert headers so that the segments are properly interpreted, etc..
It would be an implementational nightmare, and an unavoidable
performance hit (you'd have to copy data every time you take a
substring), and the @nogc guys would be up in arms.

And that's assuming we have a sane header-based encoding for strings
that contain segments in multiple languages in the first place.
Linguistic analysis articles, for example, would easily contain many
such segments within a paragraph, or perhaps in the same sentence. How
would a header-based encoding work for such documents?  Nevermind the
recent trend of liberally sprinkling emojis all over regular text. If
every emoticon embedded in a string requires splitting the string into 3
segments complete with their own headers, I dare not imagine what the
code that manipulates such strings would look like.


T

-- 
Famous last words: I *think* this will work...


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Ethan via Digitalmars-d

And at the risk of getting this topic back on track:

On Wednesday, 16 May 2018 at 20:34:26 UTC, Walter Bright wrote:
Linkers already do that. Alignment is specified on all symbols 
emitted by the compiler, and the linker uses that info.


Mea culpa. Upon further thinking, two things strike me:

1) As suggested, there's no way to instruct the front-end to 
align functions to byte boundaries outside of "optimise for 
speed" command line flags


2) I would have heavily relied on incremental linking to iterate 
on these tests when trying to work out how the processor behaved. 
I expect MSVC's incremental linker would turn out to be just 
rubbish enough to not care about how those flags originally 
behaved.


On Wednesday, 16 May 2018 at 20:36:10 UTC, Walter Bright wrote:

It would be nice to get this technique put into std.algorithm!


The code I wrote originally was C++ code with intrinsics. But I 
can certainly look at adapting it to DMD/LDC. The DMD frontend 
providing natural mappings for Intel's published intrinsics would 
be massively beneficial here.


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Ethan via Digitalmars-d

On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:

TCP being reliable just plain doesn’t cut it. Corruption of a 
single bit is very real.


Quoting to highlight and agree.

TCP is reliable because it resends dropped packets and delivers 
them in order.


I don't write TCP packets to my long-term storage medium.

UTF as a transportation format for Unicode is *far* more useful 
than just sending across a network.


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Patrick Schluter via Digitalmars-d

On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter 
wrote:
This is not practical, sorry. What happens when your message 
loses the header? Exactly, the rest of the message is garbled.


Why would it lose the header? TCP guarantees delivery and 
checksums the data, that's effective enough at the transport 
layer.


What does TCP/IP have to do with anything in the discussion 
here? UTF-8 (or UTF-16 or UTF-32) has nothing to do with network 
protocols. That's completely unrelated. A file encoded on a disk 
may never leave the machine it was written on and may never see 
a wire in its lifetime, and its encoding is still of vital 
importance. That's why a header-based encoding is too 
restrictive.




I agree that UTF-8 is a more redundant format, as others have 
mentioned earlier, and is thus more robust to certain types of 
data loss than a header-based scheme. However, I don't consider 
that the job of the text format; it's better done by other 
layers, like transport protocols or filesystems, which will 
guard against such losses much more reliably and efficiently.


No. A text format cannot depend on a network protocol. It would 
be as if you could only listen to music or watch a video via 
streaming and never save it to an offline file, because there 
would be nowhere to store the information about what that blob 
of bytes represents. It doesn't make any sense.




For example, a random bitflip somewhere in the middle of a 
UTF-8 string will not be detectable most of the time. However, 
more robust error-correcting schemes at other layers of the 
system will easily catch that.


That's the job of the other layers. Any other file would have 
the same problem. At least with UTF-8, at most one codepoint 
will ever be lost or changed; any other encoding would fare 
worse. That said, if a checksum header for your document is 
important, you can add it externally anyway.





That's exactly what happened with code page based texts when 
you don't know in which code page it is encoded. It has the 
supplemental inconvenience that mixing languages becomes 
impossible or at least very cumbersome.
UTF-8 has several properties that are difficult to have with 
other schemes.
- It is stateless, meaning any byte in a stream always means 
the same thing. Its meaning does not depend on external state or 
a previous byte.


I realize this was considered important at one time, but I 
think it has proven to be a bad design decision, for HTTP too. 
There are some advantages when building rudimentary systems 
with crude hardware and lots of noise, as was the case back 
then, but that's not the tech world we live in today. That's 
why almost every HTTP request today is part of a stateful 
session that explicitly keeps track of the connection, whether 
through cookies, https encryption, or HTTP/2.


Again, orthogonal to UTF-8. When I speak above of streams, that 
is not limited to sockets; files are also read in streams. So 
stop equating UTF-8 with the Internet, these are two different 
domains. The Internet and its protocols were defined and 
invented long before Unicode, and Unicode is also very useful 
offline.


- It can mix any language in the same stream without 
acrobatics, and anyone who thinks that mixing languages doesn't 
happen often should get his head extracted from his rear, 
because it is very common (check Wikipedia's front page, for 
example).


I question that almost anybody needs to mix "streams." As for 
messages or files, headers handle multiple language mixing 
easily, as noted in that earlier thread.


Ok, show me how you transmit that, I'm curious:

E2010C0002

EFTA Surveillance Authority Decision


Beschluss der EFTA-Überwachungsbehörde


EFTA-Tilsynsmyndighedens beslutning


Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ


Decisión del Órgano de Vigilancia de la AELC


EFTAn valvontaviranomaisen päätös


Décision de l'Autorité de surveillance AELE


Decisione dell’Autorità di vigilanza EFTA


Besluit van de Toezichthoudende Autoriteit van de EVA


Decisão do Órgão de Fiscalização da EFTA


Beslut av Eftas övervakningsmyndighet


EBTA Uzraudzības iestādes Lēmums


Rozhodnutí Kontrolního úřadu ESVO


EFTA järelevalveameti otsus


Decyzja Urzędu Nadzoru EFTA


Odločba Nadzornega organa EFTE


ELPA priežiūros institucijos sprendimas


Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA


Rozhodnutie Dozorného orgánu EZVO


Решение на Надзорния орган на ЕАСТ







- The multi byte nature of other alphabets is not as bad as 
people think because texts in computer do not live on their 
own, meaning that they are generally embedded inside file 
formats, which more often than not are extremely bloated (xml, 
html, xliff, akoma ntoso, rtf etc.). The few bytes more in the 
text do not weigh that much.


Heh, the other parts of the tech stack are much more bloated, 
so this bloat is okay? A unique argument, but I'd argue that's 
why those bloated formats you mention are largely dying off too.


They don't, 

Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread H. S. Teoh via Digitalmars-d
On Thu, May 17, 2018 at 10:16:03AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
> > Unicode was a standardization of all the existing code pages and
> > then added these new transfer formats, but I have long thought that
> > they'd have been better off going with a header-based format that
> > kept most languages in a single-byte scheme, as they mostly were
> > except for obviously the Asian CJK languages. That way, you optimize
> > for the common string, ie one that contains a single language or at
> > least no CJK, rather than pessimizing every non-ASCII language by
> > doubling its character width, as UTF-8 does. This UTF-8 issue is one
> > of the first topics I raised in this forum, but as you noted at the
> > time nobody agreed and I don't want to dredge that all up again.
> 
> It sounds like the main issue is that a header based encoding would
> take less size?
> 
> If that's correct, then I hypothesize that adding an LZW compression
> layer would achieve the same or better result.

My bet is on the LZW being *far* better than a header-based encoding.
Natural language, which a large part of textual data consists of, tends
to have a lot of built-in redundancy, and therefore is highly
compressible.  A proper compression algorithm will beat any header-based
size reduction scheme, while still maintaining the context-free nature
of UTF-8.
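
A trivial illustration of that redundancy with Phobos' std.zlib 
(just a sketch; the sample text is arbitrary):

  import std.stdio : writeln;
  import std.zlib : compress;

  void main()
  {
      auto text = "Imagine if we standardized on a header-based string "
                ~ "encoding, and we wanted to implement a substring function "
                ~ "over a string that contains multiple segments of different "
                ~ "languages.";
      auto packed = compress(text);
      // even this short English snippet typically comes out smaller;
      // longer natural-language documents compress far better still
      writeln(text.length, " raw bytes vs ", packed.length, " compressed");
  }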


T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Joakim via Digitalmars-d

On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:

On 5/16/2018 10:01 PM, Joakim wrote:
Unicode was a standardization of all the existing code pages 
and then added these new transfer formats, but I have long 
thought that they'd have been better off going with a 
header-based format that kept most languages in a single-byte 
scheme, as they mostly were except for obviously the Asian CJK 
languages. That way, you optimize for the common string, ie 
one that contains a single language or at least no CJK, rather 
than pessimizing every non-ASCII language by doubling its 
character width, as UTF-8 does. This UTF-8 issue is one of the 
first topics I raised in this forum, but as you noted at the 
time nobody agreed and I don't want to dredge that all up 
again.


It sounds like the main issue is that a header based encoding 
would take less size?


Yes, and be easier to process.

If that's correct, then I hypothesize that adding an LZW 
compression layer would achieve the same or better result.


In general, you would be wrong: a carefully designed binary 
format will usually beat the pants off general-purpose 
compression:


https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results

Of course, that's because you can tailor your binary format for 
specific types of data, text in this case, and take advantage of 
patterns in that subset, as specialized image compression 
formats do. In this case, though, I haven't compared this scheme 
to general compression of UTF-8 strings, so I don't know which 
would compress better.


However, that would mostly matter for network transmission, 
another big gain of a header-based scheme that doesn't use 
compression is much faster string processing in memory. Yes, the 
average end user doesn't care for this, but giant consumers of 
text data, like search engines, would benefit greatly from this.


On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
Indeed, and some other compression/deduplication options that 
would allow limited random access / slicing (by decoding a 
single “block” to access an element for instance).


Possibly competitive for compression only for transmission over 
the network, but unlikely for processing, as noted for Walter's 
idea.


Anything that depends on external information and is not 
self-sync is awful for interchange.


You are describing the vast majority of all formats and 
protocols; amazing how we got by with them all this time.


Internally the application can do some smarts, though, but even 
then things like interning (partial interning) might be a more 
valuable approach. TCP being reliable just plain doesn’t cut 
it. Corruption of a single bit is very real.


You seem to have missed my point entirely: UTF-8 will not catch 
most bit flips either, only if it happens to corrupt certain key 
bits in a certain way, a minority of the possibilities. Nobody is 
arguing that data corruption doesn't happen or that some 
error-correction shouldn't be done somewhere.


The question is whether the extremely limited robustness of UTF-8 
added by its significant redundancy is a good tradeoff. I think 
it's obvious that it isn't, and I posit that anybody who knows 
anything about error-correcting codes would agree with that 
assessment. You would be much better off by having a more compact 
header-based transfer format and layering on the level of error 
correction you need at a different level, which, as I noted, is 
already done at the link and transport layers and in various 
other parts of the system.


If you need more error-correction than that, do it right, not in 
a broken way as UTF-8 does. Honestly, error detection/correction 
is the most laughably broken part of UTF-8, it is amazing that 
people even bring that up as a benefit.


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Patrick Schluter via Digitalmars-d
On Thursday, 17 May 2018 at 15:37:01 UTC, Andrei Alexandrescu 
wrote:

On 05/17/2018 09:14 AM, Patrick Schluter wrote:
I'm in charge at the European Commission of the biggest 
translation memory in the world.


Impressive! Is that the Europarl?


No, Euramis: the central translation memory developed by the 
Commission and also used by the other institutions. The database 
contains more than a billion segments from parallel texts and 
is, afaik, the biggest of its kind. One of the big strengths of 
the Euramis TM is its multi-target-language store; this allows 
fuzzy searches in all combinations, including indirect 
translations (i.e. if a document written in English was 
translated into Romanian and into Maltese, it is then possible 
to search for alignments between ro and mt). It's not the only 
system to do that, but at that volume it is quite unique.
We also publish every year an extract of it, covering the 
published legislation [1] from the Official Journal, so that it 
can be used by the research community. All the machine 
translation engines use it. It is one of the most accessed data 
collections on the European Open Data portal [2].


The very uncommon thing about the backend software of EURAMIS 
is that it is written in C. Pure, unadulterated C. I'm trying to 
introduce D, but with the strange (to say it politely) 
configurations our servers have, it is quite challenging.


[1]: 
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

[2]: http://data.europa.eu/euodp/fr/data


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Dmitry Olshansky via Digitalmars-d

On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:

On 5/16/2018 10:01 PM, Joakim wrote:
Unicode was a standardization of all the existing code pages 
and then added these new transfer formats, but I have long 
thought that they'd have been better off going with a 
header-based format that kept most languages in a single-byte 
scheme, as they mostly were except for obviously the Asian CJK 
languages. That way, you optimize for the common string, ie 
one that contains a single language or at least no CJK, rather 
than pessimizing every non-ASCII language by doubling its 
character width, as UTF-8 does. This UTF-8 issue is one of the 
first topics I raised in this forum, but as you noted at the 
time nobody agreed and I don't want to dredge that all up 
again.


It sounds like the main issue is that a header based encoding 
would take less size?


If that's correct, then I hypothesize that adding an LZW 
compression layer would achieve the same or better result.


Indeed, and some other compression/deduplication options that 
would allow limited random access / slicing (by decoding a single 
“block” to access an element for instance).


Anything that depends on external information and is not 
self-sync is awful for interchange. Internally the application 
can do some smarts, though, but even then things like interning 
(partial interning) might be a more valuable approach. TCP being 
reliable just plain doesn’t cut it. Corruption of a single bit 
is very real.




Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Walter Bright via Digitalmars-d

On 5/16/2018 10:01 PM, Joakim wrote:
Unicode was a standardization of all the existing code pages and then added 
these new transfer formats, but I have long thought that they'd have been better 
off going with a header-based format that kept most languages in a single-byte 
scheme, as they mostly were except for obviously the Asian CJK languages. That 
way, you optimize for the common string, ie one that contains a single language 
or at least no CJK, rather than pessimizing every non-ASCII language by doubling 
its character width, as UTF-8 does. This UTF-8 issue is one of the first topics 
I raised in this forum, but as you noted at the time nobody agreed and I don't 
want to dredge that all up again.


It sounds like the main issue is that a header based encoding would take less 
size?

If that's correct, then I hypothesize that adding an LZW compression layer would 
achieve the same or better result.


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Andrei Alexandrescu via Digitalmars-d

On 05/17/2018 09:14 AM, Patrick Schluter wrote:
I'm in charge at the European Commission of the biggest translation 
memory in the world.


Impressive! Is that the Europarl?


Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Joakim via Digitalmars-d

On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
This is not practical, sorry. What happens when your message 
loses the header? Exactly, the rest of the message is garbled.


Why would it lose the header? TCP guarantees delivery and 
checksums the data, that's effective enough at the transport 
layer.


I agree that UTF-8 is a more redundant format, as others have 
mentioned earlier, and is thus more robust to certain types of 
data loss than a header-based scheme. However, I don't consider 
that the job of the text format; it's better done by other 
layers, like transport protocols or filesystems, which will guard 
against such losses much more reliably and efficiently.


For example, a random bitflip somewhere in the middle of a UTF-8 
string will not be detectable most of the time. However, more 
robust error-correcting schemes at other layers of the system 
will easily catch that.


That's exactly what happened with code page based texts when 
you don't know in which code page it is encoded. It has the 
supplemental inconvenience that mixing languages becomes 
impossible or at least very cumbersome.
UTF-8 has several properties that are difficult to have with 
other schemes.
- It is stateless, meaning any byte in a stream always means 
the same thing. Its meaning does not depend on external state or 
a previous byte.


I realize this was considered important at one time, but I think 
it has proven to be a bad design decision, for HTTP too. There 
are some advantages when building rudimentary systems with crude 
hardware and lots of noise, as was the case back then, but that's 
not the tech world we live in today. That's why almost every HTTP 
request today is part of a stateful session that explicitly keeps 
track of the connection, whether through cookies, https 
encryption, or HTTP/2.


- It can mix any language in the same stream without 
acrobatics, and anyone who thinks that mixing languages doesn't 
happen often should get his head extracted from his rear, 
because it is very common (check Wikipedia's front page, for 
example).


I question that almost anybody needs to mix "streams." As for 
messages or files, headers handle multiple language mixing 
easily, as noted in that earlier thread.


- The multi byte nature of other alphabets is not as bad as 
people think because texts in computer do not live on their 
own, meaning that they are generally embedded inside file 
formats, which more often than not are extremely bloated (xml, 
html, xliff, akoma ntoso, rtf etc.). The few bytes more in the 
text do not weigh that much.


Heh, the other parts of the tech stack are much more bloated, so 
this bloat is okay? A unique argument, but I'd argue that's why 
those bloated formats you mention are largely dying off too.


I'm in charge at the European Commission of the biggest 
translation memory in the world. It currently handles 30 
languages, and without UTF-8 and UTF-16 it would be 
unmanageable. I still remember when I started there in 2002, 
when we handled only 11 languages, of which only one used 
another alphabet (Greek). Everything was based on RTF with 
codepages and it was a braindead mess. My first job in 2003 was 
to extend the system to handle the 8 newcomer languages, and 
with ASCII-based encodings it was completely unmanageable, 
because every document processed mixes languages and alphabets 
freely (addresses and names are often written in their original 
form, for instance).


I have no idea what a "translation memory" is. I don't doubt that 
dealing with non-standard codepages or layouts was difficult, and 
that a standard like Unicode made your life easier. But the 
question isn't whether standards would clean things up, of course 
they would, the question is whether a hypothetical header-based 
standard would be better than the current continuation byte 
standard, UTF-8. I think your life would've been even easier with 
the former, though depending on your usage, maybe the main gain 
for you would be just from standardization.


Two years ago we also implemented support for Chinese. The nice 
thing was that we didn't have to change much to do that, thanks 
to Unicode. The second surprise was the file sizes: Chinese 
documents were generally smaller than their European 
counterparts. Yes, CJK requires 3 bytes for each ideogram, but 
generally one ideogram replaces many letters. The ideogram 亿 
replaces "One hundred million" for example; which of them takes 
more bytes? So if CJK indeed requires more bytes to encode, it 
is firstly because they NEED many more bits in the first place 
(there are around 30,000 CJK codepoints in the BMP alone; add to 
that the roughly 60,000 that are in the SIP and we need 17 bits 
just to encode them).


That's not the relevant criterion: nobody cares if the CJK 
documents were smaller than their European counterparts. What 
they care about is that, given a different transfer format, the 
CJK document could have been significantly smaller still. Because 

Re: Of possible interest: fast UTF8 validation

2018-05-17 Thread Patrick Schluter via Digitalmars-d

On Thursday, 17 May 2018 at 05:01:54 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu 
wrote:

On 5/16/18 1:18 PM, Joakim wrote:
On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky 
wrote:

On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei 
Alexandrescu wrote:

https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/



Sigh, this reminds me of the old quote about people 
spending a bunch of time making more efficient what 
shouldn't be done at all.


Validating UTF-8 is super common, most text protocols and 
files these days would use it, other would have an option to 
do so.


I’d like our validateUtf to be fast, since right now we do 
validation every time we decode string. And THAT is slow. 
Trying to not validate on decode means most things should be 
validated on input...


I think you know what I'm referring to, which is that UTF-8 
is a badly designed format, not that input validation 
shouldn't be done.


I find this an interesting minority opinion, at least from the 
perspective of the circles I frequent, where UTF8 is 
unanimously heralded as a great design. Only a couple of weeks 
ago I saw Dylan Beattie give a very entertaining talk on 
exactly this topic: 
https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/


Thanks for the link, skipped to the part about text encodings, 
should be fun to read the rest later.


If you could share some details on why you think UTF8 is badly 
designed and how you believe it could be/have been better, I'd 
be in your debt!


Unicode was a standardization of all the existing code pages 
and then added these new transfer formats, but I have long 
thought that they'd have been better off going with a 
header-based format that kept most languages in a single-byte 
scheme,


This is not practical, sorry. What happens when your message 
loses the header? Exactly, the rest of the message is garbled. 
That's exactly what happened with code page based texts when you 
don't know in which code page it is encoded. It has the 
supplemental inconvenience that mixing languages becomes 
impossible or at least very cumbersome.
UTF-8 has several properties that are difficult to have with 
other schemes.
- It is stateless, meaning any byte in a stream always means 
the same thing. Its meaning does not depend on external state or 
a previous byte (see the small sketch after this list).
- It can mix any language in the same stream without 
acrobatics, and anyone who thinks that mixing languages doesn't 
happen often should get his head extracted from his rear, 
because it is very common (check Wikipedia's front page, for 
example).
- The multi-byte nature of other alphabets is not as bad as 
people think, because texts in computers do not live on their 
own, meaning that they are generally embedded inside file 
formats, which more often than not are extremely bloated (xml, 
html, xliff, akoma ntoso, rtf etc.). The few extra bytes in the 
text do not weigh that much.
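
A minimal sketch of why that self-synchronization matters in 
practice (the helper name is mine): from an arbitrary byte 
offset you can find the next code-point boundary with no 
external state, because continuation bytes always have the form 
0b10xxxxxx.

  // find the next code-point boundary at or after byte index i
  size_t nextBoundary(const(ubyte)[] s, size_t i)
  {
      while (i < s.length && (s[i] & 0xC0) == 0x80)
          ++i;               // skip continuation bytes
      return i;
  }

  unittest
  {
      auto bytes = cast(const(ubyte)[]) "a€b";  // '€' is 3 bytes in UTF-8
      assert(nextBoundary(bytes, 2) == 4);      // resynchronizes onto 'b'
  }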


I'm in charge at the European Commission of the biggest 
translation memory in the world. It currently handles 30 
languages, and without UTF-8 and UTF-16 it would be 
unmanageable. I still remember when I started there in 2002, 
when we handled only 11 languages, of which only one used 
another alphabet (Greek). Everything was based on RTF with 
codepages and it was a braindead mess. My first job in 2003 was 
to extend the system to handle the 8 newcomer languages, and 
with ASCII-based encodings it was completely unmanageable, 
because every document processed mixes languages and alphabets 
freely (addresses and names are often written in their original 
form, for instance).
Two years ago we also implemented support for Chinese. The nice 
thing was that we didn't have to change much to do that, thanks 
to Unicode. The second surprise was the file sizes: Chinese 
documents were generally smaller than their European 
counterparts. Yes, CJK requires 3 bytes for each ideogram, but 
generally one ideogram replaces many letters. The ideogram 亿 
replaces "One hundred million" for example; which of them takes 
more bytes? So if CJK indeed requires more bytes to encode, it 
is firstly because they NEED many more bits in the first place 
(there are around 30,000 CJK codepoints in the BMP alone; add to 
that the roughly 60,000 that are in the SIP and we need 17 bits 
just to encode them).



as they mostly were except for obviously the Asian CJK 
languages. That way, you optimize for the common string, ie one 
that contains a single language or at least no CJK, rather than 
pessimizing every non-ASCII language by doubling its character 
width, as UTF-8 does. This UTF-8 issue is one of the first 
topics I raised in this forum, but as you noted at the time 
nobody agreed and I don't want to dredge that all up again.


I have been researching this a bit since then, and the stated 
goals for UTF-8 at inception were that it _could not overlap 

Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Joakim via Digitalmars-d
On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu 
wrote:

On 5/16/18 1:18 PM, Joakim wrote:
On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky 
wrote:

On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei 
Alexandrescu wrote:

https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/



Sigh, this reminds me of the old quote about people spending 
a bunch of time making more efficient what shouldn't be done 
at all.


Validating UTF-8 is super common, most text protocols and 
files these days would use it, other would have an option to 
do so.


I’d like our validateUtf to be fast, since right now we do 
validation every time we decode string. And THAT is slow. 
Trying to not validate on decode means most things should be 
validated on input...


I think you know what I'm referring to, which is that UTF-8 is 
a badly designed format, not that input validation shouldn't 
be done.


I find this an interesting minority opinion, at least from the 
perspective of the circles I frequent, where UTF8 is 
unanimously heralded as a great design. Only a couple of weeks 
ago I saw Dylan Beattie give a very entertaining talk on 
exactly this topic: 
https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/


Thanks for the link, skipped to the part about text encodings, 
should be fun to read the rest later.


If you could share some details on why you think UTF8 is badly 
designed and how you believe it could be/have been better, I'd 
be in your debt!


Unicode was a standardization of all the existing code pages and 
then added these new transfer formats, but I have long thought 
that they'd have been better off going with a header-based format 
that kept most languages in a single-byte scheme, as they mostly 
were except for obviously the Asian CJK languages. That way, you 
optimize for the common string, ie one that contains a single 
language or at least no CJK, rather than pessimizing every 
non-ASCII language by doubling its character width, as UTF-8 
does. This UTF-8 issue is one of the first topics I raised in 
this forum, but as you noted at the time nobody agreed and I 
don't want to dredge that all up again.


I have been researching this a bit since then, and the stated 
goals for UTF-8 at inception were that it _could not overlap with 
ASCII anywhere for other languages_, to avoid issues with legacy 
software wrongly processing other languages as ASCII, and to 
allow seeking from an arbitrary location within a byte stream:


https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

I have no dispute with these priorities at the time, as they were 
optimizing for the institutional and tech realities of 1992 as 
Dylan also notes, and UTF-8 is actually a nice hack given those 
constraints. What I question is that those priorities are at all 
relevant today, when billions of smartphone users are regularly 
not using ASCII, and these tech companies are the largest private 
organizations on the planet, ie they have the resources to design 
a new transfer format. I see basically no relevance for the 
streaming requirement today, as I noted in this forum years ago, 
but I can see why it might have been considered important in the 
early '90s, before packet-based networking protocols had won.


I think a header-based scheme would be _much_ better today and 
the reason I know Dmitry knows that is that I have discussed 
privately with him over email that I plan to prototype a format 
like that in D. Even if UTF-8 is already fairly widespread, 
something like that could be useful as a better intermediate 
format for string processing, and maybe someday could replace 
UTF-8 too.


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Jonathan M Davis via Digitalmars-d
On Wednesday, May 16, 2018 13:42:11 Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
> > If you could share some details on why you think UTF8 is badly designed
> > and how you believe it could be/have been better, I'd be in your debt!
>
> Me too. I think UTF-8 is brilliant (and I suffered for years under the
> lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder!
>
> Perhaps you're referring to the redundancy in UTF-8 - though illegal
> encodings are made possible by such redundancy.

I'm inclined to think that the redundancy is a serious flaw. I'd argue that
if it were truly well-designed, there would be exactly one way to represent
every character - including clear up to grapheme clusters where multiple
code points are involved (i.e. there would be no normalization issues in
valid Unicode, because there would be only one valid normalization). But
there may be some technical issues that I'm not aware of that would make
that problematic. Either way, the issues that I have with UTF-8 are issues
that UTF-16 and UTF-32 have as well, since they're really issues relating to
code points.
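
For instance (a quick Phobos sketch, nothing more): "é" can be 
either the precomposed code point or "e" plus a combining 
accent, and only normalization makes the two compare equal.

  import std.uni : normalize;   // defaults to NFC

  void main()
  {
      string precomposed = "\u00E9";    // é as a single code point
      string combining   = "e\u0301";   // e + combining acute accent

      assert(precomposed != combining);                         // raw bytes differ
      assert(normalize(precomposed) == normalize(combining));   // equal after NFC
  }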

Overall, I think that UTF-8 is by far the best encoding that we have, and I
don't think that we're going to get anything better, but I'm also definitely
inclined to think that it's still flawed - just far less flawed than the
alternatives.

And in general, I have to wonder if there would be a way to make Unicode
less complicated if we could do it from scratch without worrying about any
kind of compatibility, since what we have is complicated enough that most
programmers don't come close to understanding it, and it's just way too hard
to get right. But I suspect that if efficiency matters, there's enough
inherent complexity that we'd just be screwed on that front even if we could
do a better job than was done with Unicode as we know it.

- Jonathan M Davis



Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Walter Bright via Digitalmars-d

On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
If you could share some details on why you think UTF8 is badly designed and how 
you believe it could be/have been better, I'd be in your debt!


Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of 
other multibyte encodings prior to UTF-8). Shift-JIS: shudder!


Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings 
are made possible by such redundancy.


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Walter Bright via Digitalmars-d

On 5/16/2018 5:47 AM, Ethan Watson wrote:
I re-implemented some common string functionality at Remedy using SSE 4.2 
instructions. Pretty handy. Except we had to turn that code off for released 
products since nowhere near enough people are running SSE 4.2 capable hardware.


It would be nice to get this technique put into std.algorithm!


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread rikki cattermole via Digitalmars-d

On 17/05/2018 8:34 AM, Walter Bright wrote:

On 5/16/2018 10:28 AM, Ethan wrote:
(Related: one feature I'd really really really love for linkers to 
implement is the ability to mark up certain functions to only ever be 
linked at a certain byte boundary. And that's purely because Jaguar 
branch prediction often made my profiling tests non-deterministic 
between compiles. A NOP is a legit optimisation on those processors.)


Linkers already do that. Alignment is specified on all symbols emitted 
by the compiler, and the linker uses that info.


Would allowing an align attribute on functions make sense here for Ethan?


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Walter Bright via Digitalmars-d

On 5/16/2018 10:28 AM, Ethan wrote:
(Related: one feature I'd really really really love for linkers to implement is 
the ability to mark up certain functions to only ever be linked at a certain 
byte boundary. And that's purely because Jaguar branch prediction often made my 
profiling tests non-deterministic between compiles. A NOP is a legit 
optimisation on those processors.)


Linkers already do that. Alignment is specified on all symbols emitted by the 
compiler, and the linker uses that info.




Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Andrei Alexandrescu via Digitalmars-d

On 5/16/18 1:18 PM, Joakim wrote:

On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:

On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:

On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/ 



Sigh, this reminds me of the old quote about people spending a bunch 
of time making more efficient what shouldn't be done at all.


Validating UTF-8 is super common, most text protocols and files these 
days would use it, other would have an option to do so.


I’d like our validateUtf to be fast, since right now we do validation 
every time we decode string. And THAT is slow. Trying to not validate 
on decode means most things should be validated on input...


I think you know what I'm referring to, which is that UTF-8 is a badly 
designed format, not that input validation shouldn't be done.


I find this an interesting minority opinion, at least from the 
perspective of the circles I frequent, where UTF8 is unanimously 
heralded as a great design. Only a couple of weeks ago I saw Dylan 
Beattie give a very entertaining talk on exactly this topic: 
https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/


If you could share some details on why you think UTF8 is badly designed 
and how you believe it could be/have been better, I'd be in your debt!



Andrei


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Jack Stouffer via Digitalmars-d

On Wednesday, 16 May 2018 at 17:18:06 UTC, Joakim wrote:
I think you know what I'm referring to, which is that UTF-8 is 
a badly designed format, not that input validation shouldn't be 
done.


UTF-8 seems like the best option available given the problem 
space.


Junk data is going to be a problem with any possible string 
format given that encoding translations and programmer error will 
always be prevalent.


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread xenon325 via Digitalmars-d

On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
I used to do things like that a simpler way. 3 functions would 
be created:


  void FeatureInHardware();
  void EmulateFeature();
  void Select();
  void function() doIt = &Select;

I.e. the first time doIt is called, it calls the Select 
function which then resets doIt to either FeatureInHardware() 
or EmulateFeature().


It costs an indirect call [...]


Is this basically the same as Function MultiVersioning [1] ?

I never had a need to use it and always wondered how it works 
out in real life.

From the description, it seems this would incur indirection:

"To keep the cost of dispatching low, the IFUNC [2] mechanism is 
used for dispatching. This makes the call to the dispatcher a 
one-time thing during startup and a call to a function version is 
a single jump indirect instruction."


In the linked article [2], Ian Lance Taylor says glibc uses 
this for memcpy(), so this should be pretty efficient (but then 
again, one doesn't call memcpy() in hot loops too often).


[1] https://gcc.gnu.org/wiki/FunctionMultiVersioning
[2] https://www.airs.com/blog/archives/403

--
Alexander


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Ethan via Digitalmars-d

On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
I used to do things like that a simpler way. 3 functions would 
be created:


  void FeatureInHardware();
  void EmulateFeature();
  void Select();
  void function() doIt = &Select;

I.e. the first time doIt is called, it calls the Select 
function which then resets doIt to either FeatureInHardware() 
or EmulateFeature().


It costs an indirect call, but if you move it up the call 
hierarchy a bit so it isn't in the hot loops, the indirect 
function call cost is negligible.


The advantage is there was only one binary.


It certainly sounds reasonable enough for 99% of use cases. But 
I'm definitely the 1% here ;-)


Indirect calls invoke the wrath of the branch predictor on 
XB1/PS4 (ie an AMD Jaguar processor). But there's certainly some 
more interesting non-processor behaviour, at least on MSVC 
compilers. The provided auto-DLL loading in that environment 
performs a call to your DLL-boundary-crossing function, which 
actually winds up in a jump table that performs a jump 
instruction to actually get to your DLL code. I suspect this is 
more costly than the indirect jump at a "write a basic test" 
level. Doing an indirect call as the only action in a for-loop is 
guaranteed to bring out the costly branch predictor on the 
Jaguar. Without getting in and profiling a bunch of stuff, I'm 
not entirely sure which approach I'd prefer in general.


Certainly, as far as this particular thread goes, every 
general-purpose function of a few lines that I write that uses 
intrinsics is forced inline. No function calls, indirect or 
otherwise. And on top of that, the inlined code usually pushes 
the branches in the code across the byte boundary lines just far 
enough that only the simple branch predictor is ever invoked.
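
For reference, D can express the same per-function forcing with 
pragma(inline, true); a tiny illustrative wrapper of my own, 
with core.bitop.popcnt standing in for a real intrinsic:

  import core.bitop : popcnt;

  pragma(inline, true)
  int bitsSet(ushort v)
  {
      return popcnt(v);   // forced inline: no call overhead in hot loops
  }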


(Related: one feature I'd really really really love for linkers 
to implement is the ability to mark up certain functions to only 
ever be linked at a certain byte boundary. And that's purely 
because Jaguar branch prediction often made my profiling tests 
non-deterministic between compiles. A NOP is a legit optimisation 
on those processors.)


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Joakim via Digitalmars-d

On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:

On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
wrote:

https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/


Sigh, this reminds me of the old quote about people spending a 
bunch of time making more efficient what shouldn't be done at 
all.


Validating UTF-8 is super common, most text protocols and files 
these days would use it, other would have an option to do so.


I’d like our validateUtf to be fast, since right now we do 
validation every time we decode string. And THAT is slow. 
Trying to not validate on decode means most things should be 
validated on input...


I think you know what I'm referring to, which is that UTF-8 is a 
badly designed format, not that input validation shouldn't be 
done.


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Walter Bright via Digitalmars-d

On 5/16/2018 7:38 AM, Ethan Watson wrote:
My preferred method though is to just build multiple sets of binaries as 
DLLs/SOs/DYNLIBs, then load in the correct libraries dependant on the CPUID test 
at program initialisation.

I used to do things like that a simpler way. 3 functions would be created:

  void FeatureInHardware();
  void EmulateFeature();
  void Select();
  void function() doIt = &Select;

I.e. the first time doIt is called, it calls the Select function which then 
resets doIt to either FeatureInHardware() or EmulateFeature().


It costs an indirect call, but if you move it up the call hierarchy a bit so it 
isn't in the hot loops, the indirect function call cost is negligible.


The advantage is there was only one binary.
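
Spelled out as a D sketch (the core.cpuid check and the function 
bodies are my own choices here, not the original code):

  import core.cpuid : sse42;

  void featureInHardware() { /* the fast SSE 4.2 path */ }
  void emulateFeature()    { /* the portable fallback  */ }

  // The first call lands in select(), which rebinds doIt to the
  // right implementation and runs it; later calls dispatch directly.
  void select()
  {
      doIt = sse42 ? &featureInHardware : &emulateFeature;
      doIt();
  }

  void function() doIt = &select;

  void main()
  {
      doIt();   // first call: probes the CPU, picks a path, runs it
      doIt();   // subsequent calls go straight to the chosen path
  }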



The PDP-11 had an optional chipset to do floating point. The compiler generated 
function calls that emulated the floating point:


call FPADD
call FPSUB
...

Those functions would check to see if the FPU existed. If it did, it would 
in-place patch the binary to replace the calls with FPU instructions! Of course, 
that won't work these days because of protected code pages.




In the bad old DOS days, emulator calls were written out by the compiler. 
Special relocation fixup records were emitted for them. The emulator or the FPU 
library was then linked in, and included special relocation fixup values which 
tricked the linker fixup mechanism into patching those instructions with either 
emulator calls or FPU instructions. It was just brilliant!





Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Dmitry Olshansky via Digitalmars-d

On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
wrote:

https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/


Sigh, this reminds me of the old quote about people spending a 
bunch of time making more efficient what shouldn't be done at 
all.


Validating UTF-8 is super common; most text protocols and files 
these days would use it, and others would have an option to do 
so.


I’d like our validateUtf to be fast, since right now we do 
validation every time we decode a string. And THAT is slow. 
Trying not to validate on decode means most things should be 
validated on input...






Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Joakim via Digitalmars-d
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
wrote:

https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/


Sigh, this reminds me of the old quote about people spending a 
bunch of time making more efficient what shouldn't be done at all.


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Ethan Watson via Digitalmars-d

On Wednesday, 16 May 2018 at 14:25:07 UTC, Jack Stouffer wrote:
D doesn't seem to have C definitions for the x86 SIMD 
intrinsics, which is a bummer


Replying to highlight this.

There's core.simd which doesn't look anything like SSE/AVX 
intrinsics at all, and looks a lot more like a wrapper for 
writing assembly instructions directly.


And even better - LDC doesn't support core.simd and has its own 
intrinsics that don't match the SSE/AVX intrinsics API published 
by Intel.


And since I'm a multi-platform developer, the "What about NEON 
intrinsics?" question always sits in the back of my mind.


I ended up implementing my own SIMD primitives in Binderoo, but 
they're all versioned out for LDC at the moment until I look in 
to it and complete the implementation.


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Ethan Watson via Digitalmars-d
On Wednesday, 16 May 2018 at 13:54:05 UTC, Andrei Alexandrescu 
wrote:
Is it workable to have a runtime-initialized flag that controls 
using SSE vs. conservative?


Sure, it's workable with these kinds of speed gains, although 
the conservative code path ends up being slightly worse off: an 
extra fetch, compare and branch get introduced.
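
For reference, a minimal D sketch of that flag approach (the 
search functions are placeholders; only core.cpuid is assumed):

  import core.cpuid : sse42;

  immutable bool useSse42;

  shared static this()
  {
      useSse42 = sse42;   // probed once at program start
  }

  // stand-in scalar search; a real build would pair it with an SSE 4.2 version
  size_t searchScalar(const(char)[] haystack, char needle)
  {
      foreach (i, c; haystack)
          if (c == needle) return i;
      return haystack.length;
  }

  alias searchSse42 = searchScalar;   // placeholder for the fast path

  size_t search(const(char)[] haystack, char needle)
  {
      // the conservative path pays the extra load, compare and branch
      return useSse42 ? searchSse42(haystack, needle)
                      : searchScalar(haystack, needle);
  }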


My preferred method though is to just build multiple sets of 
binaries as DLLs/SOs/DYNLIBs, then load in the correct libraries 
dependant on the CPUID test at program initialisation. Current 
Xbox/Playstation hardware is pretty terrible when it comes to 
branching, so compiling with minimal branching and deploying the 
exact binaries for the hardware capabilities is the way I 
generally approach things.


We never got around to setting something like that up for the PC 
release of Quantum Break, although we definitely talked about it.


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Jack Stouffer via Digitalmars-d
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
wrote:

https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/


D doesn't seem to have C definitions for the x86 SIMD intrinsics, 
which is a bummer


https://issues.dlang.org/show_bug.cgi?id=18865

It's too bad that nothing came of std.simd.


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Andrei Alexandrescu via Digitalmars-d

On 05/16/2018 08:47 AM, Ethan Watson wrote:

On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/ 



I re-implemented some common string functionality at Remedy using SSE 
4.2 instructions. Pretty handy. Except we had to turn that code off for 
released products since nowhere near enough people are running SSE 4.2 
capable hardware.


Is it workable to have a runtime-initialized flag that controls using 
SSE vs. conservative?


The code linked doesn't seem to use any instructions newer than SSE2, so 
it's perfectly safe to run on any x64 processor. Could probably be sped 
up with newer SSE instructions if you're only ever running internally on 
hardware you control.


Even better!

Contributions would be very welcome.


Andrei


Re: Of possible interest: fast UTF8 validation

2018-05-16 Thread Ethan Watson via Digitalmars-d
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu 
wrote:

https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/


I re-implemented some common string functionality at Remedy using 
SSE 4.2 instructions. Pretty handy. Except we had to turn that 
code off for released products since nowhere near enough people 
are running SSE 4.2 capable hardware.


The code linked doesn't seem to use any instructions newer than 
SSE2, so it's perfectly safe to run on any x64 processor. Could 
probably be sped up with newer SSE instructions if you're only 
ever running internally on hardware you control.