Re: RE: Egyptian Hieroglyph Man with a Laptop

2020-02-14 Thread Marius Spix via Unicode
That glyph is encoded at position U+1F5B3 OLD PERSONAL COMPUTER; see 
http://users.teilar.gr/~g1951d/Aegyptus.pdf
 
 

Sent: Thursday, 13 February 2020, 07:58
From: "うみほたる via Unicode" 
To: unicode@unicode.org
Subject: RE: Egyptian Hieroglyph Man with a Laptop
The early versions of the font Aegyptus (http://users.teilar.gr/~g1951d/) have 
the glyph as one of the "Dingbats", distinguished from general characters.
The attached image is from the PDF file for Aegyptus.ttf version 3.17 (2012).



Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Marius Spix via Unicode
That is a pretty interesting finding. This glyph was not part of
http://www.unicode.org/L2/L2018/18165-n4944-hieroglyphs.pdf
but has been first seen in
http://www.unicode.org/L2/L2019/19220-n5063-hieroglyphs.pdf

The only "evidence" for this glyph I could find is a stock photo,
which was clearly made in the 21st century.
https://www.alamy.com/stock-photo-egyptian-hieroglyphics-with-notebook-digital-illustration-57472465.html

I know that some font creators include so-called trap characters,
similar to the trap streets that are often placed in maps to catch copyright
violations. But it is also possible that someone wanted to smuggle
an Easter egg into Unicode or just test whether the quality assurance works.

In my opinion, this is an invalid character, which should not be
included in Unicode.


On Thu, 12 Feb 2020 19:12:14 +0100
Frédéric Grosshans via Unicode  wrote:

> Dear Unicode list members (CC Michel Suignard),
> 
>    the Unicode proposal L2/20-068,
> “Revised draft for the encoding of an extended Egyptian Hieroglyphs 
> repertoire, Groups A to N” ( 
> https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ) by 
> Michel Suignard contains a very interesting hieroglyph at position 
> *U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man 
> with a laptop, as can be seen in the attached image.
> 
>    I am curious about the source of this hieroglyph: in the table 
> accompanying the document, its sources are said to be “Hieroglyphica 
> extension (various sources)” with number A58C and “Hornung & Schenkel 
> (2007, last modified in 2015)”, but with no number (A;), which seems 
> unique in the table. It leads me to think this glyph only exists in
> some modern font, either as a joke, or for some computer-related
> modern use. Can anyone confirm or refute this intuition?
> 
>     Frédéric
> 
> 




Stop words for CLDR

2020-01-23 Thread Marius Spix via Unicode
I wonder if there is any interest in adding stop words to CLDR? Stop
words are ignored by natural language processing algorithms, with use
cases like search engines, word clouds and text classification.

There are already existing collections with stop words like [1] or [2]
which could be used, but I believe that Unicode CLDR would be the best
place for such lists.

Regards,

Marius Spix

[1] https://pypi.org/project/stop-words/
[2]
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip
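
To illustrate the use case, here is a minimal sketch of stop-word filtering in Python. The per-locale word lists and the function name are made up for this example; CLDR ships no such data, and real lists such as those in [1] or [2] are much longer.

```python
import re

# Hypothetical, heavily shortened per-locale stop-word lists (illustration only).
STOP_WORDS = {
    "en": {"a", "an", "and", "is", "of", "the", "to"},
    "de": {"der", "die", "das", "ist", "und", "zu"},
}

def remove_stop_words(text, locale):
    """Return the lower-cased words of text that are not stop words for locale."""
    words = re.findall(r"\w+", text.lower())
    stop = STOP_WORDS.get(locale, set())  # unknown locale: filter nothing
    return [w for w in words if w not in stop]

print(remove_stop_words("The cat is on the mat", "en"))
```

Keying the lists by locale mirrors how CLDR organizes its other data, which is the main argument for hosting them there.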


Re: emojis for mouse buttons?

2020-01-01 Thread Marius Spix via Unicode
Unicode characters are named after their appearance, not their
semantics. For example, the diaeresis and the umlaut share the code point
U+0308. A printed booklet cannot know whether the user is right- or
left-handed. This is the same issue as with U+2BEA and U+2BEB, which
are designed for LTR and RTL writing.
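
The shared code point can be checked with Python's standard unicodedata module. This is only a small sketch; the helper name decompose is chosen for the example.

```python
import unicodedata

def decompose(ch):
    """Return the canonical decomposition of ch as a list of code point labels."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# U+00E4 (a with umlaut, as in German) and U+00EF (i with diaeresis, as in
# French) decompose to a base letter plus the same combining mark.
print(decompose("\u00e4"))
print(decompose("\u00ef"))
print(unicodedata.name("\u0308"))
```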


On Wed, 1 Jan 2020 10:08:42 -0500
John W Kennedy via Unicode  wrote:

> As I have already said, this will not do. Mouses do not have “left”
> and “right” buttons; they have “primary” buttons, which may be on the
> left or right, and “secondary” buttons, which may be on the right or
> left. If this goes through, users with left-handed mouse setups will
> curse you forever.
> 




Re: emojis for mouse buttons?

2020-01-01 Thread Marius Spix via Unicode
Because the middle button of many mice is a scroll button, I think we
need five different characters:

LEFT MOUSE BUTTON CLICK (mouse with left button black)
MIDDLE MOUSE BUTTON CLICK (mouse with middle button black)
RIGHT MOUSE BUTTON CLICK (mouse with right button black)
MOUSE SCROLL UP (mouse with middle button black and white triangle
pointing up inside)
MOUSE SCROLL DOWN (mouse with middle button black and white triangle
pointing down inside)

These characters are pretty useful in software manuals, training
materials and user interfaces.

Happy New Year,

Marius



On Tue, 31 Dec 2019 23:04:39 +0100
Philippe Verdy via Unicode  wrote:

> Playing with the filling of the middle cell to mean a double click
> is a bad idea; it would be better to add one or two rounded borders
> separated from the button, or simply display two icons in sequence
> for a double click.
> 
> Note that the glyphs do not necessarily have to show a mouse; it
> could as well be a square with its lower third part split into two or
> three squares, like a touchpad (see the notification icons displayed
> by Synaptics touchpad drivers). The same rounded borders could also
> mean the number of clicks. As well, if a mouse is represented, it may
> or may not have a wire.
> 
> Emoji-styles could use more realistic 3D-like rendering with extra
> shadows...
> 
> On Tue, 31 Dec 2019 at 22:16, wjgo_10...@btinternet.com via Unicode <
> unicode@unicode.org> wrote:
> 
> > How about the following.
> >
> > A filled upper cell to mean click,
> >
> > a filled upper cell and a filled middle cell to mean double click,
> >  
> Note that clicking and holding the button is just like the
> convention of using "+" after a key modifier before the actual key
> (both keys may be styled separately to decorate their glyphs into a
> keycap, but such styling should not be applied in the distinctive
> glyph; there may also be emoji sequences to combine an anonymous
> keycap base emoji with the following characters, using joiner
> controls, but this is more difficult for keys whose labels are texts
> made of multiple letters like "End" or words like "Print Screen",
> after a possible Unicode symbol for keys like Page Up, Home, End,
> NumLock; styling the text offers better options and accessibility even
> if symbols are used and a whole translatable string is surrounded by
> decorating styles to create a visual keycap).





Re: Not accepted by UTC but in ISO ballot?

2019-12-21 Thread Marius Spix via Unicode
So WG2 N5058 was literally a TROLL submission.


> Sent: Saturday, 21 December 2019, 03:29
> From: "Shriramana Sharma via Unicode" 
> To: "UnicoDe List" 
> Subject: Not accepted by UTC but in ISO ballot?
>
> I was looking at the pipeline for something else, and for the first
> time I see a character category: “not accepted by the UTC but in ISO
> ballot” and two characters in it.
> 
> So IIUC while technically people are free to submit a document to the
> ISO separately without submitting to UTC, it has always been the
> practice to my knowledge to get a character approved by the UTC first.
> 
> Anyone throw some light on these particular cases?
> 
> -- 
> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा ူ၆ိျိါအူိ၆ါး
> 
>



HEAVY EQUALS SIGN

2019-12-18 Thread Marius Spix via Unicode
Unicode has a HEAVY PLUS SIGN (U+2795) and a HEAVY MINUS SIGN (U+2796).
I wonder whether a HEAVY EQUALS SIGN could complete that character set.
This would allow emoji phrases like  ➕= ❤️ (man plus cat equals
love) to look typographically better once the equals sign is replaced
with a new HEAVY EQUALS SIGN character. Thoughts?
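
For reference, the existing heavy arithmetic signs can be listed with Python's standard unicodedata module. This is a small sketch; the helper name describe is chosen for the example.

```python
import unicodedata

# The heavy arithmetic signs in the Dingbats block mentioned above,
# plus their neighbour U+2797.
HEAVY_SIGNS = [0x2795, 0x2796, 0x2797]

def describe(cp):
    """Format a code point together with its Unicode character name."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"

for cp in HEAVY_SIGNS:
    print(describe(cp))
```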

Marius




Re: Re: On the lack of a SQUARE TB glyph

2019-09-30 Thread Marius Spix via Unicode
What about providing half-width forms for the SI prefixes and for common units? For example, you could encode petaohm as HALFWIDTH LATIN CAPITAL LETTER P + HALFWIDTH GREEK CAPITAL LETTER OMEGA, gigacalorie as HALFWIDTH LATIN CAPITAL LETTER G + HALFWIDTH LATIN SMALL CAL TRIGRAPH, and millimole as HALFWIDTH LATIN SMALL LETTER M + HALFWIDTH LATIN SMALL MOL TRIGRAPH.

 

Marius Spix

 


Sent: Monday, 30 September 2019, 10:32
From: "Asmus Freytag via Unicode" 
To: unicode@unicode.org
Subject: Re: On the lack of a SQUARE TB glyph



On 9/30/2019 1:01 AM, Andre Schappo via Unicode wrote:


 


On Sep 27, 1 Reiwa, at 08:17, Julian Bradfield via Unicode  wrote:

Or one could allow IDS to have leaf components that are any
characters, not just ideographic characters, and then one could have
all sorts of fun.



I do like this idea.

Note: This is a modified repost as I previously forgot to credit Julian as the originator

André Schappo





And to keep my previous reply in context: I think the "all sorts of fun" would be the wrong reason to do things. However, things like squared abbreviations and squared kana words all occur in the context of typesetting text containing ideographs. Therefore, extending the IDS slightly, so that it can cover those use cases, would make a certain amount of sense. While the result being composed (or "described") wouldn't be an actual Han ideograph, it would nevertheless function like one typographically.

That makes that suggestion a rather appropriate alternative for things like *SQUARE TB.

The kinds of fonts that might have a mapping from some IDS to a single glyph might also have glyphs that correspond to popular squared abbreviations.

And the way the components are stacked is at least broadly similar to (or better a subset of) the ways ideographic components can be stacked. One might start out by disallowing things like the surround operators and allowing only simple arrangements like "two up" and "side by side".

In other words, not "all sorts of fun" but something targeted precisely at what is needed to extend the frozen subset of abbreviations, so that they can occur in contexts that do not allow full markup languages without having to be precomposed.

A./

 

 







Re: On the lack of a SQUARE TB glyph

2019-09-26 Thread Marius Spix via Unicode
Unfortunately, the CJK Compatibility block is full, but U+321F in the Enclosed 
CJK Letters and Months block seems to be free. I definitely see a use for the 
proposed character.
 

Sent: Thursday, 26 September 2019, 13:21
From: "Fred Brennan via Unicode" 
To: unicode@unicode.org
Subject: On the lack of a SQUARE TB glyph
Greetings,

I can't help but notice that there is no "SQUARE TB" glyph.

We have SQUARE KB, GB and MB, starting at U+3385. But no SQUARE TB?

SQUARE GB is at U+3387, and U+3388 is...SQUARE CAL, ㎈, so no space was even
left for it—not very future-proof!
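
The crowded neighbourhood can be listed with Python's standard unicodedata module. This is a small sketch; the helper name describe is chosen for the example. The NFKC column shows the plain-text fallback that compatibility normalization gives for each squared form.

```python
import unicodedata

def describe(cp):
    """Return (label, character name, NFKC form) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.normalize("NFKC", ch))

# SQUARE KB, MB and GB are immediately followed by SQUARE CAL.
for cp in range(0x3385, 0x3389):
    print(*describe(cp))
```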

The purpose of these glyphs is, as you know, CJK typesetting. Perhaps terabytes were
not as common when these glyphs were approved, but they are common now.

There is a clear demand for a SQUARE TB. In the font SMotoya Sinkai W55 W3,
which is ©2008 株式会社 モトヤ, the glyph is unencoded and accessed via the
Discretionary Ligatures (`dlig`) OpenType feature. It has name `T_B.dlig`.
This same scheme is used in many other Motoya fonts, and presumably other CJK
fonts. In some other fonts, the `hwid` feature can be used to get a similar
effect.

SQUARE TB is likewise seen often on packaging as terabyte hard drives are now
common, as is the concept of a terabyte in operating systems.

Recently new glyphs were added for the new era name, so I don't think it's a
problem to add SQUARE TB. While we're at it, may as well add SQUARE PB. To be
future-proof (hopefully for the next hundred years!), perhaps we ought to also
add SQUARE EB, SQUARE ZB and SQUARE YB! But even if only SQUARE TB gets in,
it's worth it; I need it.

Best,
Fred Brennan



 



Re: Re: Unicode "no-op" Character?

2019-07-03 Thread Marius Spix via Unicode

A few suggestions

 

There is a reason why the C standard library function fgetc(FILE*) returns an int instead of a char: the constant EOF (end of file) must lie outside the range of valid character values.

 

Some encodings like Base64 or Quoted-printable reserve the character = for special use, but make sure that this character can still be encoded in another way.
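
For Quoted-printable this can be tried with Python's standard quopri module. This is a minimal sketch; the sample bytes are arbitrary. The literal equals sign in the input is itself written as an escape sequence, so the round trip is lossless.

```python
import quopri

raw = b"a=b"
encoded = quopri.encodestring(raw)   # the "=" in the data gets escaped
decoded = quopri.decodestring(encoded)

print(encoded)
print(decoded == raw)
```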

 

Another possible encoding would use a "continue" flag. For example, you could use the least significant bit of each byte to signal whether the stream ends or continues. This leaves 7 payload bits per byte and is used for arbitrary-length integers or other variable-length structures where terminator characters like 0x00 may be part of the data.
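
A minimal sketch of such a scheme in Python, following the description above with the flag in the least significant bit. The function names and the byte layout are assumptions made for this example; widely used formats such as LEB128 put the flag in the most significant bit instead.

```python
def encode(n):
    """Encode a non-negative int with 7 payload bits per byte.

    Bit 0 of each byte is the continue flag: 1 = more bytes follow, 0 = last byte.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        payload = n & 0x7F          # lowest 7 bits go out first
        n >>= 7
        more = 1 if n else 0
        out.append((payload << 1) | more)
        if not more:
            return bytes(out)

def decode(data):
    """Decode one integer; return (value, number of bytes consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(data):
        value |= (byte >> 1) << shift
        shift += 7
        if not byte & 1:            # continue flag cleared: this was the last byte
            return value, i + 1
    raise ValueError("input ended while the continue flag was still set")

print(encode(128))
print(decode(encode(300) + b"trailing data"))
```

Because the end of the value is signalled by the flag, any bytes that follow it can be ordinary data, including 0x00.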

 

 

 

Sent: Wednesday, 3 July 2019, 10:49
From: "Philippe Verdy via Unicode" 
To: "Sławomir Osipiuk" 
Cc: "unicode Unicode Discussion" 
Subject: Re: Unicode "no-op" Character?



On Wed, 3 Jul 2019 at 06:09, Sławomir Osipiuk wrote:





I don’t think you understood me at all. I can packetize a string with any character that is guaranteed not to appear in the text.




 

Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in text. Your goal, that it should be "guaranteed" not to be used in any text, means that your "character" cannot be encoded at all. Unicode and ISO **require** that any proposed character can be used in text without limitation. Logically, it would be rejected because your character would not be usable at all from the start.

 

So you have no choice: you must use some transport format for your "packeting", just like what is used in MIME for emails, in HTTP(S) for streaming, or in internationalized domain names.

 

For your escaping mechanism you have a very large choice already of characters considered special only for your chosen transport syntax.

 

Your goal poses a chicken-and-egg problem. It is not solvable without immediately creating self-contradictions (and if you attempt to add some restriction to avoid the contradiction, you'll run into cases where you can no longer transport your message, and your protocol will become unusable).








Re: Unicode "no-op" Character?

2019-06-22 Thread Marius Spix via Unicode
Combining Grapheme Joiner (U+034F) is probably what you want, as it is default 
ignorable and keeps the acute on top of the E. However, it may break languages 
with di- and trigraphs or complex diacritics.
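
One caveat can be checked with Python's standard unicodedata module, as a minimal sketch: U+034F has canonical combining class 0, so NFC normalization does not compose the E and the acute across it. Whether a given renderer still stacks the accent is a separate question that this sketch does not test.

```python
import unicodedata

CGJ = "\u034f"                      # COMBINING GRAPHEME JOINER
plain = "E\u0301"                   # E + COMBINING ACUTE ACCENT
padded = "E" + CGJ + "\u0301"       # same, with the joiner in between

nfc_plain = unicodedata.normalize("NFC", plain)
nfc_padded = unicodedata.normalize("NFC", padded)

print(unicodedata.combining(CGJ))   # canonical combining class of the joiner
print(nfc_plain == "\u00c9")        # composes to the precomposed letter
print(nfc_padded == padded)         # stays as three separate code points
```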

Best regards

Marius


> Sent: Saturday, 22 June 2019, 02:14
> From: "Sławomir Osipiuk via Unicode" 
> To: unicode@unicode.org
> Subject: Unicode "no-op" Character?
>
> Does Unicode include a character that does nothing at all? I'm talking about
> something that can be used for padding data without affecting interpretation
> of other characters, including combining chars and ligatures. I.e. a
> character that could hypothetically be inserted between a latin E and a
> combining acute and still produce É. The historical description of U+0016
> SYNCHRONOUS IDLE seems like pretty much exactly what I want. It only has one
> slight disadvantage: it doesn't work. All software I've tried displays it as
> an unknown character and it definitely breaks up combinations. And U+0000
> NULL seems even worse.
> 
>  
> 
> I can imagine the answer is that this thing I'm looking for isn't a
> character at all and so should be the business of "a higher-level protocol"
> and not what Unicode was made for, but Unicode does include some odd things
> so I wonder if there is something like that regardless. Can anyone offer any
> suggestions?
> 
>  
> 
> Sławomir Osipiuk
> 
>



Re: Re: Symbols of colors used in Portugal for transport

2019-05-01 Thread Marius Spix via Unicode

Unicode characters are already using a technique called hatching.

 

For example, LARGE RED CIRCLE (U+1F534) has thin vertical stripes, which are recognized as red.

 

See also: https://en.wikipedia.org/wiki/Hatching_(heraldry)

 

Sent: Tuesday, 30 April 2019, 21:17
From: "Hans Åberg via Unicode" 
To: "Mark E. Shoulson" 
Cc: unicode@unicode.org
Subject: Re: Symbols of colors used in Portugal for transport


> On 30 Apr 2019, at 04:32, Mark E. Shoulson via Unicode  wrote:
>
> On 4/29/19 3:34 PM, Doug Ewell via Unicode wrote:
>> Hans Åberg wrote:
>>
>>> The guy who made the artwork for Heroes is completely color-blind,
>>> seeing only in a grayscale, so they agreed he coded the colors in
>>> black and white, and then that was replaced with colors.
>> Did he use this particular scheme? That is something I would expect to
>> see on the scheme's web site, and would probably be good evidence for a
>> proposal.
>
> And what about existing schemes, such as have already been in use even by the esteemed company present on this very list, and in several fonts, for the same purpose? See https://en.wikipedia.org/wiki/Hatching_(heraldry)

It is notable that historically, one started with written abbreviations but later shifted to patterns, so possibly the latter is more effective.

>> I do see several awards related to the concept, but few examples where
>> this scheme is actually in use, especially in plain text.
>> I'm not opposed to this type of symbol, but I like to think the classic
>> rule about "established, not ephemeral" would still apply.
>
> Indeed.
>
> If there were encoded mere color patches (like, say, colored circles, possibly in the U+1F534 range or something; just musing here), would those already count as encoding these sorts of things, as black-and-white font designers would be likely to interpret them in some readable fashion, perhaps with hatching. Is it better to have the color be canonical and the hatched design a matter of design, or have a set of hatched circles with fixed hatching?

Also note the screentone and halftone articles [1-2]. In addition, there are reverse Ishihara tests that those with color deficiency can read correctly, but not those with normal color vision, relying on an enhanced capability to detect smaller nuances in intensity.

1. https://en.wikipedia.org/wiki/Screentone
2. https://en.wikipedia.org/wiki/Halftone


 





Re: Script_extension Property of U+0310 Combining Candrabindu

2019-04-18 Thread Marius Spix via Unicode
The Wikipedia page states that U+0310 is a general-purpose combining
diacritical mark. I would treat it like U+0308 (COMBINING
DIAERESIS) or U+030C (COMBINING CARON), which are both characters with
multiple names and different meanings depending on the script and the
language. The main benefit of these general-purpose combining
diacritical marks is that they can be applied to many characters if
needed. I don’t think it is a good idea to remove this versatility. At
least one example exists where someone used the combining candrabindu
for a constructed language as the upside-down counterpart to the
combining fermata. http://randomguy32.de/conlang/000/writing/

Best regards,

Marius


On Thu, 18 Apr 2019 20:59:53 +0100,
Richard Wordingham via Unicode wrote:

> Is there any reason why U+0310 COMBINING CANDRABINDU has scx=Inherited
> rather than scx=Latn?  The only language I've seen the character used
> in is Sanskrit, and the only script I've seen it in is the Latin
> script.
> 
> Richard. 





Re: Aleph-umlaut

2018-11-09 Thread Marius Spix via Unicode
Dear Mark,

I found another sample here:
https://www.marketscreener.com/BRILL-5240571/pdf/61308/Brill_Report.pdf

On page 86 it says that the aleph with diaeresis is a number with
the value 1000.

See also the attached clipping.

A second source is the Brown-Driver-Briggs Hebrew-English Lexicon of the
Old Testament, which also mentions that א̈ means 1000, but that there is
no evidence of this usage in Old Testament times. See here (the very
first lemma):
www.biblab.com/students/dizionari/Brown-Driver-Briggs%20Hb-En%20Dic.docx

Yet another usage in a mathematical context of an aleph with umlaut can
be found here, however they used U+2135 ALEF SYMBOL instead of U+05D0
HEBREW LETTER ALEF. This is not related to the value 1000, as the umlaut
is used to mark the second derivative.
https://de.slideshare.net/StephenAhiante/dynamics-modelling-design-of-a-quadcopter
(page 28-29 or slide 41-42)
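
The relationship between the two alephs can be inspected with Python's standard unicodedata module, as a small sketch. Compatibility normalization (NFKC) maps the mathematical symbol to the Hebrew letter, so after NFKC both spellings of aleph + diaeresis become the same string.

```python
import unicodedata

HEBREW_ALEF = "\u05d0"
ALEF_SYMBOL = "\u2135"
DIAERESIS = "\u0308"

for ch in (HEBREW_ALEF, ALEF_SYMBOL):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

letter_form = HEBREW_ALEF + DIAERESIS
symbol_form = ALEF_SYMBOL + DIAERESIS

# Different code point sequences, identical after compatibility normalization.
print(letter_form == symbol_form)
print(unicodedata.normalize("NFKC", symbol_form) == letter_form)
```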

However, there seems to be no real font support for these characters.
The only fonts on my computer that could render aleph
+ umlaut correctly were Unifont and, 
roughly, Linux Libertine. Other fonts, in particular Arial, DejaVu Sans,
Liberation Sans and Linux Biolinum, rendered the diaeresis much too far
to the left.

I even found a user who has a similar issue with U+0308, here: 
http://smontagu.org/writings/HebrewNumbers.html

Maybe adding an annotation to U+0308 could make font
designers aware that this combining character is also used
in the Hebrew alphabet.

My suggestion is to add the annotation “= Hebrew thousands multiplier”
to U+0308 COMBINING DIAERESIS and a reference from 05B5 ◌ֵ
HEBREW POINT TSERE to U+0308.

Best regards,

Marius



On Fri, 9 Nov 2018 07:42:54 -0500
"Mark E. Shoulson via Unicode"  wrote:

> Noticed something really fascinating in an old pamphlet I was
> reading.  It's from 1922, in Hebrew mostly but with some Yiddish at
> the end.  The Yiddish spelling is not according to more modern
> standardization, but seems to be significantly more faithful to the
> German spellings of the same words, replacing Latin letters with
> Hebrew ones more than respelling phonetically.  And there are even
> places where it appears they represented a German ä with a Hebrew
> aleph—with an umlaut!  Actually it looks a little more like a double
> acute accent but that's surely a style choice, since it obviously is
> mapping to an umlaut.
> 
> 
> 
> (Note also the spelling דיע, a calque for German "die", where modern
> Yiddish would spell it phonetically as די.)
> 
> 
> I do NOT think this needs any special encoding, btw.  I would
> probably encode this as simply U+05D0 U+0308 (א̈).  Combining symbols
> do not (necessarily) belong to a specific alphabet, and the fact that
> most fonts would render this badly is a different issue.  I just
> thought the people here might find it interesting.
> 
> 
> (Link is
> http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874
> look at the last few pages.)
> 
> 
> ~mark
> 





Re: Re: Dealing with Georgian capitalization in programming languages

2018-10-09 Thread Marius Spix via Unicode
The capital ẞ (U+1E9E) has been officially approved by the Council for the 
German Language since July 2018. However, there is no word starting with ß, 
so the character is only relevant for fully capitalized words. It may 
only stand alone in spaced type, when there is no italic font style available.

In the Ruby bug tracker there is also an issue with Dutch ij → IJ. The 
dedicated ligatures Ĳ (U+0132) and ĳ (U+0133) are not recommended and thus 
never used, but a leading ij must always be capitalized to IJ, as in IJSBERG → 
ijsberg → IJsberg. The actual problem is that the current capitalization 
algorithm is based on a regular grammar (type 3). It has to be adjusted for a 
context-sensitive (type 1) grammar. 
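
The Dutch case can be sketched in a few lines of Python. The function name and the locale parameter are made up for this illustration; this is not how Ruby or any particular library implements capitalization, and it ignores every other language-specific rule.

```python
def capitalize(word, locale="und"):
    """Lower-case word, then upper-case its first letter.

    With locale "nl", a leading "ij" is treated as one unit and becomes "IJ".
    """
    lower = word.lower()
    if locale == "nl" and lower.startswith("ij"):
        return "IJ" + lower[2:]
    return lower[:1].upper() + lower[1:]

print(capitalize("IJSBERG"))          # per-character rule, no language knowledge
print(capitalize("IJSBERG", "nl"))    # Dutch rule

# The German case from the first paragraph: lower-casing the capital sharp s
# gives ß, but upper-casing ß by default gives two letters.
print("\u1e9e".lower() == "\u00df")
print("\u00df".upper())
```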

Regards,

Marius

 

On 2018/10/09 09:47, Martin J. Dürst wrote:

> I have been thinking through this. It seems quite appealing.
> 
> But I'm concerned there may be some edge cases. I have been able to come
> up with two so far:
> 
> - Applying this to a string starting with upper-case SZ (U+1E9E).
> This may change SZ → ß → Ss.
> - Using the 'capitalize' method to (try to) get the titlecase
> property of a MTAVRULI character. (There's no other way
> currently in Ruby to get the titlecase property.)
> 
> There may be others. If you have some ideas, I'd appreciate to know
> about them.
> 
> This lets me wonder why the UTC didn't simply declare the titlecase
> property of MTAVRULI to be mkhedruli. Was this considered or not? The
> way things are currently set up, there seems to be no benefit of
> MTAVRULI being its own titlecase, because in actual use, that requires
> additional processing.
> 
> Regards, Martin.



Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

2018-09-01 Thread Marius Spix via Unicode
Hello Marcel,

YAML supports references, so you can refer to another character’s
properties.

Example:

repertoire:
  char:
    -
      name_alias:
        - [NUL, abbreviation]
        - ["NULL", control]
      cp: 0000
      na1: "NULL"
      props: &0000
        age: "1.1"
        na: ""
        JSN: ""
        gc: Cc
        ccc: 0
        dt: none
        dm: "#"
        nt: None
        nv: NaN
        bc: BN
        bpt: n
        bpb: "#"
        Bidi_M: N
        bmg: ""
        suc: "#"
        slc: "#"
        stc: "#"
        uc: "#"
        lc: "#"
        tc: "#"
        scf: "#"
        cf: "#"
        jt: U
        jg: No_Joining_Group
        ea: N
        lb: CM
        sc: Zyyy
        scx: Zyyy
        Dash: N
        WSpace: N
        Hyphen: N
        QMark: N
        Radical: N
        Ideo: N
        UIdeo: N
        IDSB: N
        IDST: N
        hst: NA
        DI: N
        ODI: N
        Alpha: N
        OAlpha: N
        Upper: N
        OUpper: N
        Lower: N
        OLower: N
        Math: N
        OMath: N
        Hex: N
        AHex: N
        NChar: N
        VS: N
        Bidi_C: N
        Join_C: N
        Gr_Base: N
        Gr_Ext: N
        OGr_Ext: N
        Gr_Link: N
        STerm: N
        Ext: N
        Term: N
        Dia: N
        Dep: N
        IDS: N
        OIDS: N
        XIDS: N
        IDC: N
        OIDC: N
        XIDC: N
        SD: N
        LOE: N
        Pat_WS: N
        Pat_Syn: N
        GCB: CN
        WB: XX
        SB: XX
        CE: N
        Comp_Ex: N
        NFC_QC: Y
        NFD_QC: Y
        NFKC_QC: Y
        NFKD_QC: Y
        XO_NFC: N
        XO_NFD: N
        XO_NFKC: N
        XO_NFKD: N
        FC_NFKC: "#"
        CI: N
        Cased: N
        CWCF: N
        CWCM: N
        CWKCF: N
        CWL: N
        CWT: N
        CWU: N
        NFKC_CF: "#"
        InSC: Other
        InPC: NA
        PCM: N
        blk: ASCII
        isc: ""

    -
      cp: 0001
      na1: "START OF HEADING"
      name_alias:
        - [SOH, abbreviation]
        - [START OF HEADING, control]
      props: *0000





Regards,

Marius Spix


On Sat, 1 Sep 2018 08:00:02 +0200 (CEST),
Marcel Schneider wrote:

> On 31/08/18 08:25 Marius Spix via Unicode wrote:
> > 
> > A good compromise between human readability, machine processability
> > and filesize would be using YAML.
> > 
> > Unlike JSON, YAML supports comments, anchors and references,
> > multiple documents in a file and several other features.
> 
> Thanks for the advice. I already use YAML syntax highlighting to
> display XCompose files, that use the colon as a separator, too.
> 
> Did you figure out how YAML would fit UCD data? It appears to heavily
> rely on line breaks, that may get lost as data turns around across
> environments. XML indentation is only a readability feature and
> irrelevant to content. The structure is independent of invisible
> characters and is stable if only graphics are not corrupted (while it
> may happen that they are). Linebreaks are odd in that they are
> inconsistent across OSes, because Unicode was denied the right to
> impose a unique standard in that matter. The result is mashed-up
> files, and I fear YAML might not hold out.
> 
> Like XML, YAML needs to repeat attribute names in every instance.
> That is precisely what CSV gets around of, at the expense of
> readability in plain text. Personally I could use YAML as I do use
> XML for lookup in the text editor, but I’m afraid that there is no
> advantage over CSV with respect to file size.
> 
> Regards,
> 
> Marcel
> > 
> > Regards,
> > 
> > Marius Spix
> > 
> > 
> > On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via
> > Unicode wrote:
> > 
> […]





Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

2018-08-31 Thread Marius Spix via Unicode
A good compromise between human readability, machine processability and
filesize would be using YAML.

Unlike JSON, YAML supports comments, anchors and references, multiple
documents in a file and several other features.

Regards,

Marius Spix


On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
wrote:

> On 30/08/18 23:34 Philippe Verdy via Unicode wrote:
> >
> > Well, an alternative to XML is JSON, which is more compact and
> > faster/simpler to process;
> 
> Thanks for pointing the problem and the solution alike. Indeed the
> main drawback of the XML format of UCD is that it results in an
> “insane” filesize. “Insane” was applied to the number of semicolons
> in UnicodeData.txt, but that is irrelevant. What is really insane is
> the filesize of the XML versions of the UCD. Even without Unihan, it
> may take up to a minute or so to load in a text editor.
> 
> > however JSON has no explicit schema, unless the schema is being
> > made part of the data itself, complicating its structure (with many
> > levels of arrays of arrays, in which case it becomes less easy to
> > read by humans, but more adapted to automated processes for fast
> > processing).
> >
> > I'd say that the XML alone is enough to generate any JSON-derived
> > dataset that will conform to the schema an application expects to
> > process fast (and with just the data it can process, excluding
> > various extensions still not implemented). But the fastest
> > implementations are also based on data tables encoded in code (such
> > as DLL or Java classes), or custom database formats (such as
> > Berkeley dB) generated also automatically from the XML, without the
> > processing cost of decompression schemes and parsers.
> >
> > Still today, even if XML is not the usual format used by
> > applications, it is still the most interoperable format that allows
> > building all sorts of applications in all sorts of languages: the
> > cost of parsing is left to an application builder/compiler.
> 
> I’ve tried an online tool to get ucd.nounihan.flat.xml converted to
> CSV. The tool is great and offers a lot of options, but given the
> “insane” file size, my browser was up for over two hours of trouble
> until I shut down the computer manually. From what I could see in the
> result field, there are many bogus values, meaning that their
> presence is useless in the tags of most characters. And while many
> attributes have cryptic names in order to keep the file size minimal,
> some attributes have overlong values, ie the design is inconsistent.
> Eg in every character we read: jg="No_Joining_Group" That is bogus.
> One would need to take them off the tags of most characters, and even
> in the characters where they are relevant, the value would be simply
> "No". What’s the use of abbreviating "Joining Group" to "jg" in the
> attribute name if in the value it is written out? And I’m quoting from
> U+0000. Further, many values are set to a crosshatch, instead of
> simply being removed from the characters where they are empty. Then
> the many instances of "undetermined script" resulting in *two*
> attributes with "Zyyy" value. Then in almost each character we’re told
> that it is not a whitespace, not a dash, not a hyphen, and not a
> quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N" One couldn’t
> tell that UCD does actually benefit from the flexibility of XML,
> given that many attributes are systematically present even where they
> are useless. Perhaps ucd-*.xml would be two thirds, half, or one
> third their actual size if they were properly designed.
> 
> > Some apps embed the compilers themselves and use a stored cache for
> > faster processing: this approach allows easy updates by detecting
> > changes in the XML source, and then downloading them.
> >
> > But in CLDR such updates are generally not automated : the general
> > scheme evolves over time and there are complex dependencies to
> > check so that some data becomes usable
> 
> Should probably read *un*usable.
> 
> > (frequently you need to implement some new algorithms to follow the
> > processing rules documented in CLDR, or to use data not completely
> > validated, or to allow applications to provide their overrides from
> > insufficiently complete datasets in CLDR, even if CLDR provides a
> > root locale and applications are supposed to follow the BCP47
> > fallback resolution rules; applications also have their own need
> > about which language codes they use or need, and CLDR provides many
> > locales that many applications are still not prepared to render
> > correctly, and many application users complain if an application is
> > partly translated and contains too many fallbacks to another
> > language, or worse to another script).
> 
> So the case is even worse than what I could see when looking into
> CLDR. Many countries, including France, don’t care about the data of
> their own locale in CLDR, but I’m not going to vent about that on
> Unicode Public, because that 

Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-19 Thread Marius Spix via Unicode
William Overington wrote:

> 
> I decided that trying to design emoji for 'I' and for 'You' seemed
> interesting so I decided to have a go at designing some.
> 
> However pictures of people with arrows seemed to be ambiguous in
> meaning and also they seemed to need to be too detailed for rendering
> in mobile telephone messages and in many situations in web pages and
> emails generally. So eventually I decided that abstract designs would
> be a good solution to the problem.
> 


I also played with a similar idea, which requires a new
GSUB LookupType, let’s call it 9: Reader-dependent substitution.

The idea is that the reader of the text sees a different glyph depending
on whether he/she is the author of the text. For example, if you use the
code point for ME, all other readers see the glyph for YOU and vice versa.
This is usable, for example, in instant messaging and social networking
services.

In the attachment you find some ideas for the
following emoji

IDEOGRAM FOR ME / IDEOGRAM FOR YOU
IDEOGRAM FOR TWO OF US / IDEOGRAM FOR YOU TWO 
IDEOGRAM FOR WE ALL / IDEOGRAM FOR YOU ALL
IDEOGRAM FOR ME AND ANOTHER PERSON / IDEOGRAM FOR YOU AND ANOTHER PERSON
IDEOGRAM FOR ME AND MULTIPLE OTHER PERSONS / IDEOGRAM FOR YOU AND
MULTIPLE OTHER PERSONS


IDEOGRAM FOR YOU AND ME (the counterpart has no own codepoint, but is
mirrored, as you may arrange other emoji to the left or right)

The following emoji may look equal independent of the reader:
IDEOGRAM FOR ANOTHER PERSON
IDEOGRAM FOR TWO OTHER PERSONS
IDEOGRAM FOR MULTIPLE OTHER PERSONS

The rendering engine requires a flag if the user is the author or not.
I think it would be possible to implement.
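
A minimal sketch of the selection logic in Python, under the assumption that the renderer is told whether the reader is the author. The table and function names are made up for this illustration; no such emoji or GSUB lookup type exists.

```python
# Hypothetical pairs from the list above: what the author encodes on the left,
# what every other reader should see on the right (and vice versa).
PAIRS = [
    ("ME", "YOU"),
    ("TWO OF US", "YOU TWO"),
    ("WE ALL", "YOU ALL"),
]
SWAP = {}
for first, second in PAIRS:
    SWAP[first] = second
    SWAP[second] = first

def glyph_for(ideogram, reader_is_author):
    """Pick the glyph to display for an encoded ideogram name."""
    if reader_is_author:
        return ideogram                 # the author sees the text as written
    return SWAP.get(ideogram, ideogram) # others see the counterpart, if any

print(glyph_for("ME", reader_is_author=True))
print(glyph_for("ME", reader_is_author=False))
print(glyph_for("ANOTHER PERSON", reader_is_author=False))
```

The substitution itself is a plain one-to-one lookup; the only new input is the author flag, which would have to come from the application rather than from the font.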

What about this idea?

Regards,

Marius Spix

