Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread J Decker via Unicode
On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <
unicode@unicode.org> wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +, Costello, Roger L. via Unicode
> wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the name
> of
> > this binaryness/textness property?
>
> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary rather
> than text format.  Microsoft employees and some members of this list will
> disagree.
>
> Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable
> for a (sane) human.
>
> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's not
>   UTF-8 Unicode, you're doing it wrong)
> * no invalid characters
>

Just a minor note...
In the case of UTF-8, this means the bytes 0xF8-0xFF will never appear; every
byte of a valid UTF-8 sequence has at least one bit clear.
I wouldn't be so picky about 'no bytes 0-31', because \t, \n, and \x1b (ANSI
escape codes) are all quite usable...
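
To make that concrete, here is a rough sketch of the quoted definition as a
checker (in C; the function names and the exact whitelist of control bytes are
my own choices, and the UTF-8 check is simplified -- it does not reject
overlong or surrogate encodings):

#include <stddef.h>

/* Sketch of the "technically text" test quoted above: no control bytes
 * other than tab, newline, and CR, and well-formed UTF-8.  The whitelist
 * and names here are illustrative choices, not anything standard. */
static int is_allowed_control(unsigned char c) {
    return c == '\t' || c == '\n' || c == '\r';
}

int looks_like_text(const unsigned char *p, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        if (c < 0x20) {                     /* C0 controls */
            if (!is_allowed_control(c)) return 0;
            i++;
        } else if (c < 0x80) {              /* plain ASCII */
            i++;
        } else {                            /* multi-byte UTF-8 sequence */
            int extra;
            if      ((c & 0xE0) == 0xC0 && c >= 0xC2) extra = 1;
            else if ((c & 0xF0) == 0xE0)              extra = 2;
            else if ((c & 0xF8) == 0xF0 && c <= 0xF4) extra = 3;
            else return 0;                  /* 0x80-0xC1, 0xF5-0xFF: never valid lead bytes */
            if (i + extra >= len) return 0; /* truncated sequence */
            for (int k = 1; k <= extra; k++)
                if ((p[i + k] & 0xC0) != 0x80) return 0;
            i += 1 + extra;
        }
    }
    return 1;
}

By that definition a Word or Powerpoint file fails immediately on the control
bytes, while PostScript passes -- which is exactly the ambiguity being pointed
out above.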



>
> But besides this narrow technical meaning -- is a Word document "text"?
> And if it is, why not Powerpoint?  This all falls apart.
>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
> ⢿⡄⠘⠷⠚⠋⠀   --  on #linux-sunxi
> ⠈⠳⣄
>


Re: Unicode "no-op" Character?

2019-06-24 Thread J Decker via Unicode
On Mon, Jun 24, 2019 at 5:35 PM David Starner via Unicode <
unicode@unicode.org> wrote:

> On Sun, Jun 23, 2019 at 10:41 PM Shawn Steele via Unicode
>  wrote:
>
> IMO, since it's unlikely that anyone expects
> that they can transmit a NUL through an arbitrary channel, unlike a
> random private use character.

You would be wrong.
NUL is a valid code point like any other, except in the C standard
library and its descendants.
And I expect it to be preserved.  And, for the most part, it is (except for
emscripten).
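
A trivial sketch of that caveat: the code point is carried fine by anything
that keeps a length, and only gets cut short once a NUL-terminated C API looks
at the data.

#include <stdio.h>
#include <string.h>

/* U+0000 is a code point like any other; it only "disappears" once a
 * length-less C string API gets involved. */
int main(void) {
    char buf[] = { 'a', 0x00, 'b' };               /* 3 code units, one of them NUL */
    printf("carried length: %zu\n", sizeof buf);   /* 3 */
    printf("strlen sees:    %zu\n", strlen(buf));  /* 1 -- the NUL terminates it */
    return 0;
}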

>
> --
> Kie ekzistas vivo, ekzistas espero.
>


Re: Unicode "no-op" Character?

2019-06-22 Thread J Decker via Unicode
On Sat, Jun 22, 2019 at 2:04 PM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> I see there is no such character, which I pretty much expected after
> Google didn’t help.
>
>
>
> The original problem I had was solved long ago but the recent article
> about watermarking reminded me of it, and my question was mostly out of
> curiosity. The task wasn’t, strictly speaking, about “padding”, but about
> marking – injecting “flag” characters at arbitrary points in a string
> without affecting the resulting visible text. I think we ended up using
> ESC, which is a dumb choice in retrospect, though the whole approach was a
> bit of a hack anyway and the process it was for isn’t being used anymore.
>

The spec would suggest that there are escape codes like that which can be
used:
APC, Application Program Command, U+009F
ST, String Terminator, U+009C
An APC ... ST pair delimits a sequence of characters that should not be
displayed, but may be used to control the application displaying them
(assuming it understands them).

https://www.aivosto.com/articles/control-characters.html

ST  (String Terminator), 156 / 0x9C, ESC \ : closes a string opened by APC,
DCS, OSC, PM or SOS.

APC (Application Program Command), 159 / 0x9F, ESC _ : starts an application
program command string; ST will end the command. The interpretation of the
command is subject to the program in question.
But it doesn't appear anything actually 'supports' that.
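
For what it's worth, a minimal sketch of wrapping a marker in the 7-bit escape
forms of those controls (ESC _ ... ESC \); the function name and buffer
handling are mine, and, as noted, most software will not actually hide or
honor such a sequence:

#include <stdio.h>
#include <string.h>

/* Wrap a payload in APC ... ST using the escape forms:
 *   APC = ESC '_' (0x1B 0x5F),  ST = ESC '\' (0x1B 0x5C).
 * The payload itself must not contain ESC '\', or the string ends early.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t wrap_apc(char *out, size_t cap, const char *payload) {
    size_t plen = strlen(payload);
    if (plen + 4 + 1 > cap) return 0;
    size_t n = 0;
    out[n++] = 0x1B; out[n++] = '_';          /* APC introducer */
    memcpy(out + n, payload, plen);
    n += plen;
    out[n++] = 0x1B; out[n++] = '\\';         /* ST terminator */
    out[n] = '\0';
    return n;
}

int main(void) {
    char buf[64];
    wrap_apc(buf, sizeof buf, "flag:42");
    printf("visible%stext\n", buf);           /* marker hidden only if the terminal honors APC */
    return 0;
}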


Re: Unicode "no-op" Character?

2019-06-21 Thread J Decker via Unicode
Sounds like a great use for ZWNBSP (ZERO WIDTH NO-BREAK SPACE), U+FEFF
(also used as the BOM),
or, for one that doesn't prevent breaking, maybe ZERO WIDTH SPACE (U+200B).

On Fri, Jun 21, 2019 at 9:48 PM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> Does Unicode include a character that does nothing at all? I’m talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures. I.e. a character that could hypothetically be inserted between a
> latin E and a combining acute and still produce É. The historical
> description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what
> I want. It only has one slight disadvantage: it doesn’t work. All software
> I’ve tried displays it as an unknown character and it definitely breaks up
> combinations. And U+ NULL seems even worse.
>
>
>
> I can imagine the answer is that this thing I’m looking for isn’t a
> character at all and so should be the business of “a higher-level protocol”
> and not what Unicode was made for… but Unicode does include some odd things
> so I wonder if there is something like that regardless. Can anyone offer
> any suggestions?
>
>
>
> Sławomir Osipiuk
>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread J Decker via Unicode
On Fri, Oct 12, 2018 at 9:23 AM Doug Ewell via Unicode 
wrote:

> J Decker wrote:
>
> >> How about the opposite direction: If m is base64 encoded to yield t
> >> and then t is base64 decoded to yield n, will it always be the case
> >> that m equals n?
> >
> > False.
> > Canonical translation may occur which the different base64 may be the
> > same sort of string...
>
> Base64 is a binary-to-text encoding. Neither encoding nor decoding
> should presume any special knowledge of the meaning of the binary data,
> or do anything extra based on that presumption.
>
> Converting Unicode text to and from base64 should not perform any sort
> of Unicode normalization, convert between UTFs, insert or remove BOMs,
> etc. This is like saying that converting a JPEG image to and from base64
> should not resize or rescale the image, change its color depth, convert
> it to another graphic format, etc.
>
> So I'd say "true" to Roger's question.
>
On the first side (X to base64), definitely true.

But there is the potential that text produced by decoding some buffer is
later normalized, resulting in a canonically equivalent string that is not
byte-for-byte the same... and its base64 will be different.

Comparing one base64 string with another shows a binary difference, but they
may still represent the 'same' string.
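
A small self-contained illustration of that (the base64 encoder is hand-rolled
and the byte sequences are hard-coded just for the demo): U+00E9 precomposed
and U+0065 U+0301 decomposed are canonically equivalent, but their UTF-8
bytes, and therefore their base64, differ.

#include <stdio.h>

/* Minimal base64 encoder, just enough for the demonstration. */
static void b64(const unsigned char *in, size_t len, char *out) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t o = 0;
    for (size_t i = 0; i < len; i += 3) {
        unsigned v = in[i] << 16;
        if (i + 1 < len) v |= in[i + 1] << 8;
        if (i + 2 < len) v |= in[i + 2];
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = (i + 1 < len) ? tbl[(v >> 6) & 63] : '=';
        out[o++] = (i + 2 < len) ? tbl[v & 63] : '=';
    }
    out[o] = '\0';
}

int main(void) {
    /* "é" precomposed (U+00E9) vs decomposed (U+0065 U+0301), as UTF-8 bytes. */
    const unsigned char nfc[] = { 0xC3, 0xA9 };
    const unsigned char nfd[] = { 0x65, 0xCC, 0x81 };
    char a[16], b[16];
    b64(nfc, sizeof nfc, a);
    b64(nfd, sizeof nfd, b);
    printf("NFC: %s\nNFD: %s\n", a, b);   /* w6k= vs ZcyB -- different base64, "same" text */
    return 0;
}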


>
> I touched on this a little bit in UTN #14, from the standpoint of trying
> to improve compression by normalizing the Unicode text first.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread J Decker via Unicode
On Fri, Oct 12, 2018 at 3:57 AM Costello, Roger L. via Unicode <
unicode@unicode.org> wrote:

> Hi Unicode Experts,
>
> Suppose base64 encoding is applied to m to yield base64 text t.
>
> Next, suppose base64 encoding is applied to m' to yield base64 text t'.
>
> If m is not equal to m', then t will not equal t'.
>
> In other words, given different inputs, base64 encoding always yields
> different base64 texts.
>
> True or false?
>
True.  Base64 encoding is deterministic and reversible, so different inputs
always yield different base64 texts (and the same input always yields the same
text).

>
> How about the opposite direction: If m is base64 encoded to yield t and
> then t is base64 decoded to yield n, will it always be the case that m
> equals n?
>
False.
Canonical normalization may occur along the way, in which case a different
base64 text may still correspond to the same sort of string...

https://en.wikipedia.org/wiki/Unicode_equivalence
https://en.wikipedia.org/wiki/Canonical_form


> /Roger
>
>


Re: Unicode String Models

2018-09-11 Thread J Decker via Unicode
On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode 
wrote:

>
> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
> >
> > On Tue, 11 Sep 2018 21:10:03 +0200
> > Hans Åberg via Unicode  wrote:
> >
> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> >> LaTeX files with sections in different Cyrillic and Latin encodings,
> >> changing the editor encoding while typing.
> >
> > Rather like some of the old Unicode list archives, which are just
> > concatenations of a month's emails, with all sorts of 8-bit encodings
> > and stretches of base64.
>
> It might be useful to represent non-UTF-8 bytes as Unicode code points.
> One way might be to use a codepoint to indicate high bit set followed by
> the byte value with its high bit set to 0, that is, truncated into the
> ASCII range. For example, U+0080 looks like it is not in use, though I
> could not verify this.
>
>
The byte 0x80 is used, for instance, in character U+0400 (0xD0 0x80) or
U+8000 (0xE8 0x80 0x80) (I'm probably off a bit in the leading byte).
UTF-8 can represent every value from 0 through 0x10FFFF (which is all defined
code points); early variants can support up to U+7FFF...
and there are enough bits to carry the pattern forward to support 36 bits or
42 bits... (the last one breaking the standard a bit by allowing a byte
without one bit off... 0xFF would be the lead-in).

0xF8-0xFF are unused byte values; but those code points can all be encoded
into UTF-8.
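
A sketch of the standard 1- to 4-byte patterns being described (the function
name is mine, and the historical 5/6-byte forms aren't handled); it reproduces
the examples above, e.g. U+0400 -> D0 80 and U+8000 -> E8 80 80:

#include <stdio.h>

/* Encode one code point into UTF-8 (standard 1-4 byte forms only).
 * Returns the number of bytes written, 0 if the value is out of range. */
int utf8_encode(unsigned cp, unsigned char out[4]) {
    if (cp < 0x80)      { out[0] = cp; return 1; }
    if (cp < 0x800)     { out[0] = 0xC0 | (cp >> 6);
                          out[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000)   { out[0] = 0xE0 | (cp >> 12);
                          out[1] = 0x80 | ((cp >> 6) & 0x3F);
                          out[2] = 0x80 | (cp & 0x3F); return 3; }
    if (cp <= 0x10FFFF) { out[0] = 0xF0 | (cp >> 18);
                          out[1] = 0x80 | ((cp >> 12) & 0x3F);
                          out[2] = 0x80 | ((cp >> 6) & 0x3F);
                          out[3] = 0x80 | (cp & 0x3F); return 4; }
    return 0;
}

int main(void) {
    unsigned samples[] = { 0x80, 0x400, 0x8000, 0x10FFFF };
    for (int i = 0; i < 4; i++) {
        unsigned char b[4];
        int n = utf8_encode(samples[i], b);
        printf("U+%04X ->", samples[i]);
        for (int k = 0; k < n; k++) printf(" %02X", b[k]);
        printf("\n");
    }
    return 0;
}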


Re: UCD in XML or in CSV? (is: UCD in YAML)

2018-09-07 Thread J Decker via Unicode
On Fri, Sep 7, 2018 at 10:58 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

>
>
> Le jeu. 6 sept. 2018 à 19:11, Doug Ewell via Unicode 
> a écrit :
>
>> Marcel Schneider wrote:
>>
>> > BTW what I conjectured about the role of line breaks is true for CSV
>> > too, and any file downloaded from UCD on a semicolon separator basis
>> > becomes unusable when displayed straight in the built-in text editor
>> > of Windows, given Unicode uses Unix EOL.
>>
>> It's been well known for decades that Windows Notepad doesn't display
>> LF-terminated text files correctly. The solution is to use almost any
>> other editor. Notepad++ is free and a great alternative, but there are
>> plenty of others (no editor wars, please).
>>
>
> This has changed recently in Windows 10, where the builtin Notepad app now
> parses text files using LF only correctly (you can edit and save using the
> same convention for newlines, which is now autodetected; Notepad still
> creates new files using CRLF and saves them after edit using CRLF).
>
I would love to have a notepad that handled \n.
My system is up to date.
What update must I get to have Notepad handle newline-only files?
(And I dare say Notepad is the ONLY program that doesn't handle either
convention; the command-line `edit` and `wordpad` (write) even handled them.)
I'm sure there exist other programs that do it wrong, but none I've ever
used, found, or written.

> Notepad now displays the newline convention in the status bar as "Windows
> (CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column
> counters. There's still no preference interface to specify the default
> convention: CRLF is still the the default for new files.
>
> And no way to switch the convention before saving. In Notepad++ you do
> that with menu "Edit" > "Convert newlines" and select one of "Convert to
> Windows (CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)"
>
>
>


Re: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-02 Thread J Decker via Unicode
On Mon, Apr 2, 2018 at 5:42 PM, Mark E. Shoulson via Unicode <
unicode@unicode.org> wrote:

> For unique identifiers for every person, place, thing, etc, consider
> https://en.wikipedia.org/wiki/Universally_unique_identifier which are
> indeed 128 bits.
>
> What makes you think a single "glyph" that represents one of these 3.4⏨38
> items could possibly be sensibly distinguishable at any sort of glance
> (including long stares) from all the others?  I have an idea for that: we
> can show the actual *digits* of some encoding of the 128-bit number.  Then
> just inspecting for a different digit will do.
>

there's no restriction that it be one character cell in size... rendered
glyphs could be thousands of pixels wide...

sorry to drag this on ;)

>
> Now, what about a registry for "important" (and not-necessarily-important)
> UUIDs for key things and people, which associates them with an image of
> some kind?  Some sort of global icon?  And indeed, perhaps used for
> Internet-of-Things-like things?  Not necessarily a bad idea—but decidedly
> outside of the scope of Unicode.  (Maybe you could even assign your beloved
> sentences to some UUIDs and stick them in such a registry.  Again, who
> knows, maybe a decent idea.  But it ain't Unicode.)
>
> ~mark
>
>
> On 04/02/2018 02:15 PM, William_J_G Overington via Unicode wrote:
>
>> Doug Ewell wrote:
>>
>> Martin J. Dürst wrote:
>>>
>>
>>
>>> Please enjoy. Sorry for being late with forwarding, at least in some
 parts of the world.

>>>
>>
>>> Unfortunately, we know some folks will look past the humor and use this
>>>
>> as a springboard for the recurring theme "Yes, what *will* we do when
>> Unicode runs out of code points?"
>>
>> An interesting thing about the document is that it suggests a Unicode
>> code point for an individual item of a particular type, what the document
>> terms an imoji.
>>
>> This being beyond what Unicode encodes at present.
>>
>> I wondered if this could link in some ways to the Internet of Things.
>>
>
>


Re: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-02 Thread J Decker via Unicode
I was really hoping this was a joke... it didn't hit me it was April 1...

https://en.wikipedia.org/wiki/Plane_(Unicode)


From the plane table (allocated code points / assigned characters):

Totals: 280,016 allocated, 136,755 assigned
almost 50% used now.

Though that table omits 655,350 code points as 'unassigned', so it's really
only about 16% (1/6) used,

using only 4-byte UTF-8 or 2-code-unit UTF-16...

and of those, that's only 20 (plus or minus a fraction of 1) bits?

So this is a proposal for something roughly the 6th power of that, when even
just 1 more bit gives another million characters.

https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words

I guess if it encoded every word as a single code point... the current space
wouldn't be enough; it seems there are about 7,716,121 words... so, 24 bits,
plus 1 to double it for good measure?

*shrug*



On Mon, Apr 2, 2018 at 11:15 AM, William_J_G Overington via Unicode <
unicode@unicode.org> wrote:

> Doug Ewell wrote:
>
> > Martin J. Dürst wrote:
>
> >> Please enjoy. Sorry for being late with forwarding, at least in some
> >> parts of the world.
>
> > Unfortunately, we know some folks will look past the humor and use this
> as a springboard for the recurring theme "Yes, what *will* we do when
> Unicode runs out of code points?"
>
> An interesting thing about the document is that it suggests a Unicode code
> point for an individual item of a particular type, what the document terms
> an imoji.
>
> This being beyond what Unicode encodes at present.
>
> I wondered if this could link in some ways to the Internet of Things.
>
> I had never heard of IPv6. Indeed I checked on the Internet to find
> whether that was real. So I have started reading and learning.
>
> It would, in fact, be quite straightforward to encode what the document
> terms 128-bit Unicode characters.
>
> For example, U+FFF8 could be used as a base character and then followed by
> a sequence of 32 tag characters, each of those 32 tag characters being from
> the range
>
> U+E0030 TAG DIGIT ZERO .. U+E0039 TAG DIGIT NINE, U+E0041 TAG LATIN
> CAPITAL LETTER A .. U+E0046 TAG LATIN CAPITAL LETTER F
>
> That is, a newly-defined character from the Specials and then 32 tag
> characters encoding a hexadecimal code point.
>
> Now, if that were called 128-bit Unicode then there could be problems of
> policy, but if it were given another name so that it sits upon a Unicode
> structure so as to provide an application platform that can be manipulated
> using Unicode tools, including existing Unicode interchange formats, and
> display formats for character glyphs, then maybe something useful can be
> produced.
>
> Thus using 128-bit binary numbers in a local computer system and using
> existing Unicode characters for interchange of information between computer
> systems, converting from the one format to the other depending upon the
> needs for local processing and for interchange of information.
>
> Of particular significance is the concept of encoding individual items
> each with its own code point.
>
> Could this be used to relate glyphs to the Internet of Things?
>
> Could things like International Standard Book Numbers be included, with a
> code point for each book edition?
>
> What about individual copies of a rare book?
>
> What about museum items?
>
> What about paintings and sculptures?
>
> Could this tie up with serial numbers used in GS1-128 Barcodes?
>
> Please note that the 128 in GS1-128 refers to the 128 characters of ASCII,
> not to 128-bits.
>
> I am wondering whether U+FFF8 plus 32 tag characters could be handled
> directly by a GSUB glyph substitution within an OpenType font.
>
> However, with such a large code space, there would need to be a way to
> access glyph information over the internet, maybe use of a one-glyph web
> font for each glyph would be possible in some way.
>
> William Overington
>
> Monday 2 April 2018
>
>
>
>


Re: Interesting UTF-8 decoder

2017-10-09 Thread J Decker via Unicode
That's interesting; however, it will segfault if the string ends on a memory
allocation boundary.  You would have to make sure strings are always allocated
with 3 extra bytes.
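
Something like this padded copy is what I mean (names mine; the 3 extra bytes
assume the decoder reads at most 4 bytes per call):

#include <stdlib.h>
#include <string.h>

/* Duplicate a NUL-terminated string with 3 extra zero bytes of padding,
 * so a decoder that always reads 4 bytes at a time stays inside the
 * allocation even when the last code point starts at the final byte. */
char *strdup_padded(const char *s) {
    size_t len = strlen(s);
    char *copy = malloc(len + 1 + 3);   /* text + NUL + 3 guard bytes */
    if (!copy) return NULL;
    memcpy(copy, s, len + 1);
    memset(copy + len + 1, 0, 3);       /* guard bytes read as 0x00 */
    return copy;
}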

2017-10-09 1:37 GMT-07:00 Martin J. Dürst via Unicode :

> A friend of mine sent me a pointer to
> http://nullprogram.com/blog/2017/10/06/, a branchless UTF-8 decoder.
>
> Regards,   Martin.
>


Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread J Decker via Unicode
On Mon, Jul 24, 2017 at 1:50 PM, Philippe Verdy <verd...@wanadoo.fr> wrote:

> 2017-07-24 21:12 GMT+02:00 J Decker via Unicode <unicode@unicode.org>:
>
>>
>>
>> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests are failing, you know the input was not
> valid at all, and the way to handle this error will not be solved simply by
> using a very unsecure unbound loop like above but by exiting and returning
> an error immediately, or throwing an exception.
>
> The code should better be:
>
> if ((from[0]&0xC0) == 0x80) from--;
> else if ((from[-1]&0xC0) == 0x80) from -=2;
> else if ((from[-2]&0xC0) == 0x80) from -=3;
> if ((from[0]&0xC0) == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> (an ASCII byte or an UTF-8 leading byte)
>
>
I generally accepted any UTF-8 encoding up to 31 bits, though (since I was
going from the original spec, and not the effective limit based on the
Unicode code point space), and the while loop is more terse but less optimal
because of the pipeline flush from the backward jump; so yes, the if series
is much better :)  (The original code also has the start of the string, and
strings are effectively prefixed with a 0 byte anyway because of a long
little-endian size field.)

And you'd probably be tracking an output offset also, so it becomes a
little longer than the above.

And it should be secured using a guard byte at start of your buffer in
> which the "from" pointer was pointing, so that it will never read something
> else and can generate an error.
>
>


Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread J Decker via Unicode
On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
unicode@unicode.org> wrote:

> Hi Folks,
>
> 2. (Bug) The sending application performs the folding process - inserts
> CRLF plus white space characters - and the receiving application does the
> unfolding process but doesn't properly delete all of them.
>
> The RFC doesn't say 'characters' but either a space or a tab character
(singular)

Back-scanning is simple enough:

while( ( from[0] & 0xC0 ) == 0x80 )
    from--;

It should probably also check that from > (start+1), but since it will be
applied at around character 75, that is implicitly true.
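
A slightly fuller sketch of that back-scan with the bounds check written out
(function name and signature are mine):

#include <stddef.h>

/* Given a pointer somewhere inside a UTF-8 buffer, step back to the start
 * of the code point containing (or preceding) *p.  Continuation bytes are
 * 10xxxxxx, so back up while the current byte matches that pattern, but
 * never move before the start of the buffer, and never back up more than
 * 3 bytes (the longest run of continuation bytes in well-formed UTF-8). */
const char *utf8_lead(const char *start, const char *p) {
    int steps = 0;
    while (p > start && steps < 3 && (((unsigned char)*p & 0xC0) == 0x80)) {
        p--;
        steps++;
    }
    return p;
}

In the folding case above the pointer sits around column 75, so the p > start
guard never actually fires, as noted.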


Database missing/erroneous information

2017-07-12 Thread J Decker via Unicode
I started looking more deeply at the JavaScript specification.  Identifiers
are defined as starting with characters having the ID_Start attribute and
continuing with ID_Continue characters.
I grabbed the XML database (ucd.all.grouped.xml), in which I was able to
find the IDS and IDC flags (also OIDS, OIDC, XIDS, XIDC, whose meanings I'm
not entirely sure of),

but I started filtering to find characters that are NOT IDS|IDC.

Something simple like the digits 0x30-0x39 are marked with IDS='N' but have
no [OX]IDC flags specified.  Is a lack of a flag assumed to mean N or Y?
The www.unicode.org/reports/tr42/ documentation on the XML file format
doesn't specify.

In http://www.unicode.org/reports/tr31/ I see 'ID_Continue characters include
ID_Start characters, plus characters '

Most languages do support identifiers like a1, a2, etc., so certainly the
digits should have IDC even though they're not IDS.
Are there characters that are IDS without being IDC?  There are certainly
characters that are IDC without IDS.
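
For cross-checking against an implementation rather than the raw XML, ICU
exposes these as binary properties (a sketch; assumes ICU4C is available and
linked with -licuuc):

#include <stdio.h>
#include <unicode/uchar.h>   /* ICU4C */

/* Print the ID_Start / ID_Continue status of a few of the code points
 * discussed below, as ICU sees them. */
static void show(UChar32 c) {
    printf("U+%04X  ID_Start=%d  ID_Continue=%d  XID_Start=%d  XID_Continue=%d\n",
           (unsigned)c,
           (int)u_hasBinaryProperty(c, UCHAR_ID_START),
           (int)u_hasBinaryProperty(c, UCHAR_ID_CONTINUE),
           (int)u_hasBinaryProperty(c, UCHAR_XID_START),
           (int)u_hasBinaryProperty(c, UCHAR_XID_CONTINUE));
}

int main(void) {
    show(0x0034);   /* DIGIT FOUR: expect ID_Continue but not ID_Start */
    show(0x0F32);   /* TIBETAN DIGIT HALF NINE */
    show(0x203F);   /* UNDERTIE */
    show(0x309B);   /* KATAKANA-HIRAGANA VOICED SOUND MARK */
    return 0;
}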


some examples.

found  char { cp: '0034',  na: 'DIGIT FOUR',  gc: 'Nd',  nt: 'De',  nv:
'4',  bc: 'EN',  lb: 'NU',  sc: 'Zyyy',  scx: 'Zyyy',  Alpha: 'N',  Hex:
'Y',  AHex: 'Y',  IDS: 'N',  XIDS: 'N',  WB: 'NU',  SB: 'NU',  Cased: 'N',
 CWCM: 'N',  InSC: 'Number' }

(this has IDC notation but not IDS; since it says 'digit' I assume this is
a number type, and should not be IDS.)
found  char { cp: '0F32',  na: 'TIBETAN DIGIT HALF NINE',  gc: 'No',  nt:
'Nu',  nv: '17/2',  Alpha: 'N',  IDC: 'N',  XIDC: 'N',  SB: 'XX',  InSC:
'Number' }

This might be not IDS but is IDC?
found  char { cp: '203F',
  na: 'UNDERTIE',
  gc: 'Pc',
  IDC: 'Y',
  XIDC: 'Y',
  Pat_Syn: 'N',
  WB: 'EX' }


this is sort of IDS but not IDC?
found  char { cp: '309B',  na: 'KATAKANA-HIRAGANA VOICED SOUND MARK',  gc:
'Sk',  dt: 'com',  dm: '0020 3099',  bc: 'ON',  lb: 'NS',  sc: 'Zyyy',
 scx: 'Hira Kana',  Alpha: 'N',  Dia: 'Y',  OIDS: 'Y',  XIDS: 'N',  XIDC:
'N',  WB: 'KA',  SB: 'XX',  NFKC_QC: 'N',  NFKD_QC: 'N',  XO_NFKC: 'Y',
 XO_NFKD: 'Y',  CI: 'Y',  CWKCF: 'Y',  NFKC_CF: '0020 3099',  vo: 'Tu' }


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread J Decker via Unicode
On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
>  wrote:
> > I’m not sure how the discussion of “which is better” relates to the
> > discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the
> underlying issue. I'll clarify what I meant on that point and then
> move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, because UTF-8 as the internal memory
> representation is *such a good design* (when legacy constraits permit)
> that *despite it not being the current dominant design*, I think the
> Unicode Consortium should be fully supportive of UTF-8 as the internal
> memory representation and not treat UTF-16 as the internal
> representation as the one true way of doing things that gets
> considered when speccing stuff.
>
> I.e. I wasn't arguing against UTF-16 as the internal memory
> representation (for the purposes of this thread) but trying to
> motivate why the Consortium should consider "UTF-8 internally" equally
> despite it not being the dominant design.
>
> So: When a decision could go either way from the "UTF-16 internally"
> perspective, but one way clearly makes more sense from the "UTF-8
> internally" perspective, the "UTF-8 internally" perspective should be
> decisive in *such a case*. (I think the matter at hand is such a
> case.)
>
> At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.
>
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a
> standard that has many widely-deployed implementations. Even when
> observing only "UTF-16 internally" implementations, I think it would
> be appropriate for the proposal to include a review of what existing
> implementations, beyond ICU, do.
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome)


Something I've learned through working with Node (V8, the JavaScript engine
from Chrome): V8 stores strings either as UTF-16 OR UTF-8, interchangeably,
and is not one OR the other...

https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY

and I wouldn't really assume UTF-16 is a 'majority';  Go is utf-8 for
instance.



> shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>
>