Meteorological symbols for cloud conditions (on maps or elsewhere)

2016-03-19 Thread Philippe Verdy
See
https://fr.wikipedia.org/wiki/Carte_m%C3%A9t%C3%A9orologique#/media/File:Station_model_fr.svg

I see these symbols used for noting cloud types (here cirrus and altocumulus:
one drawn diagonally for middle altitudes, the other drawn horizontally for
high altitudes).

Note that the symbols may vary: see Altocumulus, for example, as found in the
French Wikipedia (not sure if it's accurate), which differs from the symbol
found in the sample notation on the map:

https://fr.wikipedia.org/wiki/Altocumulus

Other symbols, on the corresponding page in the English Wikipedia, are used
to describe some cloud characteristics:

https://en.wikipedia.org/wiki/Altocumulus_cloud

Is there a well-defined collection of these symbols, and are they in the
encoding pipeline?


Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Philippe Verdy
2016-03-17 19:02 GMT+01:00 Pierpaolo Bernardi :

> On Thu, Mar 17, 2016 at 6:37 PM, Leonardo Boiko 
> wrote:
> > The PDF *displays* correctly.  But try copying the string 'ti' from
> > the text into another application outside of your PDF viewer, and you'll
> > see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don
> > Osborn said.
>
> Ah. OK.  Anyway this is not a Unicode problem. PDF knows nothing about
> Unicode. It uses the encoding of the fonts used.
>

That's correct; however, the PDF specs contain guidelines for naming glyphs
in fonts in such a way that the encoding can be deciphered. This is needed,
for example, in applications such as PDF forms where user input is expected.
When those PDFs are generated from rich text, the fonts used may be TrueType
(without any glyph names in them, only mappings from sequences of code
points), OpenType, or PostScript. When OpenType fonts contain PostScript
glyphs, their names may be completely arbitrary; it does not even matter
whether the font was mapped to Unicode or used a legacy or proprietary
encoding.

If you see a "Ɵ" when copy-pasting from the PDF, it's because the font used
to produce it did not follow these guidelines (or did not specify any glyph
name, in which case a sort of OCR algorithm attempts to decipher the glyph:
the "ti" ligature is visually extremely close to "Ɵ", and OCR has a lot of
difficulty distinguishing them, unless it also uses linguistic dictionary
searches and hints about the script used in surrounding characters to
improve the guess).

Note that PDFs (or DjVu files) are not required to contain any text at all:
they may just embed a scanned and compressed bitmap image. (If you want to
see how OCR can go wrong, look at how it fails with lots of errors, for
example in the decoding projects for Wikibooks working with scanned bitmaps
of old books: OCR is just a helper, but there's still a lot of work to
correct what has been guessed and re-encode the correct text. Even though
humans are smarter than OCR, this is a lot of work to perform manually:
encoding the text of a single scanned old book still takes one or two months
for an experienced editor, and there are still many errors left for someone
else to review later.)

Most PDFs were not created with the idea of later decoding their rendered
text. In fact they were intended to be read or printed "as is", including
their styles, colors, and font decorations everywhere, or text over photos.
They were even created to be non-modifiable and then used for archival.

Some PDF tools will also strip additional metadata from the PDF, such as the
original fonts used; instead, these PDFs locally embed pseudo-fonts
containing sets of glyphs from various fonts (in mixed styles), in random
order, sorted by frequency of use in the document, or in order of occurrence
in the original text. These embedded fonts are generated on the fly to
contain only the glyphs the document needs. When they are generated, a
compression step drops lots of things from the original font, including
metadata such as the original PostScript glyph names.


Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Leonardo Boiko
Yeah, I've stumbled upon this a lot in academic Japanese/Chinese
texts.  I try to copy some Chinese character, only to find out that
it's really a string of random ASCII characters.

Is there only one of those crap PDF pseudo-encodings? If so, I'll use
a converter next time...

2016-03-17 14:57 GMT-03:00 "Jörg Knappen" :
> I inspected the PDF file, and its font encoding is termed "Identity-H". I
> couldn't find out much about this encoding, but it seems to be a private
> encoding of Adobe used especially for Asian fonts.
>
> --Jörg Knappen
>
> Gesendet: Donnerstag, 17. März 2016 um 17:43 Uhr
> Von: "Don Osborn" 
> An: unicode@unicode.org
> Betreff: Joined "ti" coded as "Ɵ" in PDF
> Odd result when copy/pasting text from a PDF: For some reason "ti" in
> the (English) text of the document at
> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
> is coded as "Ɵ". Looking more closely at the original text, it does
> appear that the glyph is a "ti" ligature (which afaik is not coded as
> such in Unicode).
>
> Out of curiosity, did a web search on "internaƟonal" and got over 11k
> hits, apparently all PDFs.
>
> Anyone have any idea what's going on? Am assuming this is not a
> deliberate choice by diverse people creating PDFs and wanting "ti"
> ligatures for stylistic reasons. Note the document linked above is
> current, so this is not (just) an issue with older documents.
>
> Don Osborn



Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Philippe Verdy
2016-03-18 19:11 GMT+01:00 Garth Wallace :

> > The issues with line breaking (if you can use these combining characters
> > around all characters, including spaces) can be solved using unbreakable
> > characters.
>
> Line breaking isn't really a problem that I can see with the Quivira
> model. If they're given the usual line breaking properties for
> symbols, the Unicode line breaking algorithm would prevent a break
> between halves. East Asian vertical text is another story. In a font
> that just uses kerning to join halves (as Quivira does) you'd end up
> with the left half on top of the right in vertical text. I'm not sure
> how ligatures are handled in vertical text.
>

East Asian vertical presentation does not just stack the elements on top of
each other; very frequently it rotates them (including Latin/Greek/Cyrillic
letters). So this is not really a new complication.

The numbers, however, are used for annotating or commenting on a strategy,
or the placement order during a game.

However, for game notation purposes, rotation plays a significant role
(notably if those two-part symbols are joined into a circle or disc): it can
make the difference between several distinct sets of stones, or it could be
used in a 4-player go variant (where black vs. white is not sufficient to
distinguish the players). In reality the stones would have 4 colours (stones
are not really numbered; they are all the same for the same player, or
there's some specially marked type of stone for each player in addition to
their normal set), or sets would have some symbol or dot on top of them.

There are also go variants using stones that take a territory and block the
position but that cannot be captured (both players can use them, but the
territory taken is not counted for either player).
These stones can also be placed randomly over the board at the start of the
game to complicate it, or there's a limited set of blocking stones for each
player, who can choose when to play them instead of standard stones. Those
blocking stones are visually distinct, but identical for the two players
that have them at the start of the game.

Although the classic rules of go are extremely simple, this game has a lot
of variants. In fact many players who don't know the exact classic rules
invent their own variants.


Re: Swapcase for Titlecase characters

2016-03-19 Thread Marcel Schneider
On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst  wrote:

> I'm working on extending the case conversion methods for the programming 
> language Ruby from the current ASCII only to cover all of Unicode.
> 
> Ruby comes with four methods for case conversion. Three of them, upcase, 
> downcase, and capitalize, are quite clear. But we have hit a question 
> for the fourth method, swapcase.
> 
> What swapcase does is swap upper and lower case, so that e.g.
> 
> 'Unicode Standard'.swapcase => 'uNICODE sTANDARD'
> 
> I'm not sure myself where this method is actually used, but it also 
> exists in Python (and maybe Ruby got it from there).
> 
> 
> Now the question I have is: What to do for titlecase characters? Several 
> possibilities already have been floated:
> 
> a) Leave as is, because there are neither upper nor lower case.
> 
> b) Convert to upper (or lower), which may simplify implementation.
> 
> c) Decompose the character into upper and lower case components, and 
> apply swapcase to these.
> 
> 
> For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or 
> 'džinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would 
> become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c).
> 
> It looks like Python 3 (3.4.3 in my case) is doing a). My guess is that 
> from an user expectation point of view, c) is best, so I'm tending to go 
> for c). There is no existing data from the Unicode Standard for this, 
> but it seems pretty straightforward.
> 
> But before I just implement something, I'd appreciate additional input, 
> in particular from users closer to the affected language communities.


As far as I can tell from my limited experience, the swapcase method is used 
only to convert “inverted titlecase” to titlecase. I call “inverted titlecase” 
the state of text produced by keyboard input while the caps lock toggle is 
accidentally on, in which those words are “inversely capitalized” where the 
user pressed the shift modifier. Therefore such examples would be most useful.

Having said that, I know that this never occurs on many keyboards of 
English-speaking users who remapped that key to perform another action such as 
backspace, compose, or kana lock. Living myself in a country where the caps 
lock toggle is indispensable, I may be considered part of the targeted user 
communities, though unfortunately I donʼt speak either Croatian or Greek.

Looking at your examples, I would add a case that typically occurs for swapcase 
to be applied: ‘ᾠΔΉ’ (cited [erroneously] as a result of option b) that is to 
be converted to ‘ᾨδή’, and ‘džINSI’, that is to become ‘Džinsi’.

As for decomposing digraphs and ypogegrammeni to apply swapcase: that 
probably would do no good, as itʼs unnecessary and users wonʼt expect it.

I hope that helps.

Kind regards,

Marcel
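
(For illustration, a minimal Python sketch of option c) from the quoted
list: decompose a titlecase character with NFKD, swap the case of the
resulting parts, and recompose. The function name is made up here, and this
is not Ruby's actual implementation:

    import unicodedata

    def swapcase_c(text: str) -> str:
        out = []
        for ch in text:
            if unicodedata.category(ch) == "Lt":  # titlecase letter
                # e.g. U+01C5 'Dž' decomposes to 'D' + 'z' + combining caron
                out.append(unicodedata.normalize("NFKD", ch).swapcase())
            else:
                out.append(ch.swapcase())
        return unicodedata.normalize("NFC", "".join(out))

    print(swapcase_c("\u01C5insi"))  # 'Džinsi' -> 'dŽINSI', per option c)
)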



Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Don Osborn
Odd result when copy/pasting text from a PDF: For some reason "ti" in 
the (English) text of the document at 
http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf 
is coded as "Ɵ". Looking more closely at the original text, it does 
appear that the glyph is a "ti" ligature (which afaik is not coded as 
such in Unicode).


Out of curiosity, did a web search on "internaƟonal" and got over 11k 
hits, apparently all PDFs.


Anyone have any idea what's going on? Am assuming this is not a 
deliberate choice by diverse people creating PDFs and wanting "ti" 
ligatures for stylistic reasons. Note the document linked above is 
current, so this is not (just) an issue with older documents.


Don Osborn


Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Garth Wallace
On Thu, Mar 17, 2016 at 11:28 PM, J Decker  wrote:
> On Thu, Mar 17, 2016 at 9:18 PM, Garth Wallace  wrote:
>> There's another strategy for dealing with enclosed numbers, which is
>> taken by the font Quivira in its PUA: encoding separate
>> left-half-circle-enclosed and right-half-circle-enclosed digits. This
>> would require 20 characters to cover the double digit range 00–99.
>> Enclosed three digit numbers would require an additional 30 for left,
>> center, and right thirds, though it may be possible to reuse the left
>> and right half circle enclosed digits and assume that fonts will
>> provide left half-center third-right half ligatures (Quivira provides
>> "middle parts" though the result is a stadium instead of a true
>> circle). It should be possible to do the same for enclosed ideographic
>> numbers, I think.
>>
>> The problems I can see with this are confusability with the already
>> encoded atomic enclosed numbers, and breaking in vertical text.
>>
>
> I suppose that's why things like this happen in applications
>
> Joined "ti" coded as "Ɵ" in PDF
>
> http://www.unicode.org/mail-arch/unicode-ml/y2016-m03/0084.html
>
> you get an encoding of a series of code points that results in an array
> of font glyph indices to render 

What?

I don't see what an apparent ligature matching or OCR glitch in PDFs
has to do with this.



Proposal for *U+2427 NARROW SHOULDERED OPEN BOX (was: Re: Proposal for *U+23FF SHOULDERED NARROW OPEN BOX?)

2016-03-19 Thread Marcel Schneider
On Mon, 14 Mar 2016 09:19:35 -0700, Ken Whistler wrote:

> U+23FF is already assigned to OBSERVER EYE SYMBOL, which is
> already under ballot for 10646 (and approved by the UTC).
> 
> http://www.unicode.org/alloc/Pipeline.html
> 
> Please always first check that page before suggesting code points
> for prospective new characters. 
> 
> --Ken
> 
> On 3/12/2016 5:42 PM, Marcel Schneider wrote:
> > Now in the block of U+237D SHOULDERED OPEN BOX there is _one_ scalar value 
> > left. Would it then be a good idea to propose *U+23FF SHOULDERED NARROW 
> > OPEN BOX for v10.0.0?
> >

Thank you. I remember OBSERVER EYE but didnʼt notice its code point and forgot 
to do a search for ‘23[F[F]]’ on the Pipeline page. Sorry.

Now I see that *U+2427 would be even better, as it is both in the block of 
U+2423 OPEN BOX and in the originally intended block; except that I have now 
dropped the other symbols and am staying just with the NNBSP symbol, to be 
proposed for the next free contiguous scalar value.

I really hope that such a new, or more accurately third, proposal will be 
accepted, as the NARROW NO-BREAK SPACE is so important that it must have its 
symbol encoded at some point, similarly to SPACE and NO-BREAK SPACE.

About the proposed name: first I changed it to a glyph-descriptive one, as 
preferred in Unicode, rather than SYMBOL FOR NARROW NO-BREAK SPACE. And then 
I made it more analogous to the name of the symbolized character, by 
inverting “SHOULDERED” and “NARROW”.

The original proposer cannot simply resume on that “narrow” basis, being 
committed to consistency with ISO/IEC 9995-7; so might it be good for an 
individual like me to send the proposal? Generally, however, it would be 
better done by a national body, all the more as this belongs to the 
international keyboard standard. Other countries that have a multilingual 
standard layout, and/or a national layout including U+202F, might be 
interested.

Another scenario would be that the French NB re-proposes a reduced set of 
additional symbols, which IMHO should comprise at least the NARROW SHOULDERED 
OPEN BOX, but ideally only once it has completed the revision of most parts 
of ISO/IEC 9995, including part 7.

Best regards,

Marcel



Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Philippe Verdy
That's a smart idea... Note that you could encode the middle digits so that
their enclosure at top and bottom is by default only horizontal (no circular
arcs) when shown in isolation, and the left and right parts by default just
connect horizontally to the top and bottom of the middle digits, allowing an
arbitrary number of characters. In order to create a real circle, you could
use a joiner control to hint to the renderer that it can create a ligature
(possibly reducing the size of the digits, or changing the dimensions and
shape of the connected segments) so that they'll draw a circle instead of a
"cartouche" rounded at start and end.

You could even encode the enclosure as a combining character around existing
digits (even if those digits are not symbols by themselves, the combining
character has this property; an idea similar to the combining arrows above
or below used in mathematical notation), so that the content of the "circle"
or "cartouche" remains ordinary, searchable text.
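
(Unicode already has a few such enclosing combining marks, e.g. U+20DD
COMBINING ENCLOSING CIRCLE and U+20E3 COMBINING ENCLOSING KEYCAP, though
font and renderer support for them is notoriously uneven. A quick Python
illustration:

    # U+20DD COMBINING ENCLOSING CIRCLE applied to a base digit; whether
    # this actually renders as a circled "7" depends on the font/renderer.
    print("7\u20DD")
    # A precomposed circled digit, for comparison:
    print("\u2460")  # U+2460 CIRCLED DIGIT ONE
)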

The enclosure could also be something other than a circle (or circular
arcs): it could be a rectangle, hintable with joiners (like with circles) to
create an enclosing square, or a rounded rectangle (hintable to create a
rounded square).

The enclosure shapes could be white or black, or could be drawn with double
strokes. This is in fact similar to the combining low line or overline,
which join by default.

However, using a joiner between them does not really instruct the renderer
to join the top/bottom lines (which is already the expected behavior for
these low/top lines) but to create a ligature between the base characters in
the middle. Then, to create a double enclosure, just "stack" several
combining characters (in order from inside to outside; the combining
characters for enclosures should share the same high combining-class value
so that their relative order is kept, or could have combining class 0).

The issues with line breaking (if you can use these combining characters
around all characters, including spaces) can be solved using unbreakable
characters.
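
(For instance, U+2060 WORD JOINER is the standard way to forbid a break
between two otherwise breakable characters without adding visible width;
the half-enclosure glyphs below are hypothetical stand-ins:

    # Forbid a line break between the two halves of a symbol by inserting
    # U+2060 WORD JOINER between them (the halves are stand-ins here).
    left_half, right_half = "(", ")"
    unbreakable = left_half + "\u2060" + right_half
    print(unbreakable)
)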

Note that this addition would create a disunification with the existing
enclosed characters, which are already ligatured into a single symbol (they
won't be canonically equivalent using only the decomposition properties),
but this can be solved by adding another property ("ligature decomposition")
and mapping the existing enclosed characters to their "ligature
decomposition" using normal base characters, the new combining characters
for enclosure, and the joining control between them. Those mappings can be
in a new properties file (which could then be useful for collation, so that
the "enclosed 79" symbol would collate like "79").
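
(The existing enclosed characters already carry compatibility decompositions
that play a similar role for collation and searching; in Python, for
example:

    import unicodedata

    # U+2460 CIRCLED DIGIT ONE carries the compatibility decomposition
    # "<circle> 0031", so NFKD folds it back to a plain "1".
    print(unicodedata.decomposition("\u2460"))      # '<circle> 0031'
    print(unicodedata.normalize("NFKD", "\u2460"))  # '1'
)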

Advantage: with these, you can now enclose various numbers (not just natural
integers) or abbreviations (e.g. chemical symbols like "Au" for gold), or
astrological symbols, or arbitrary words (using them to enclose full
sentences would not be very practical, but their use to enclose a personal
name, such as the name of the Egyptian king "Ramses", is possible, even
outside the context of Egyptian hieroglyphs)... It could be used to enclose
a temperature such as "10°C", or a section heading number "1.1".

And this is much less limited than the (very quirky) use of CSS or styles
(in rich text or HTML) to add surrounding "borders", as the shapes here are
less restricted (in CSS you can only create rectangular or rounded borders).
Some new shapes are possible, such as diagonal left and right sides, or
mixing a rounded left side with a square right side. (In this last case it
would be hard to use joiners and expect a ligature to be created for the
enclosing shape: for example, expecting a triangular enclosure created by
the ligature of two diagonal sides and horizontal top/bottom segments for
the characters in the middle, because this would absolutely require resizing
all the characters in the middle to preserve a consistent line height; but
this is possible for pairs of base characters inside the enclosure.)

Note: the enclosing-ligature "joiner" control is not the same as the one for
joining base characters, as the intent is to join the enclosing shape
fragments (possibly by reducing the size of, and repositioning, all the
characters in the middle); the characters in the middle are not ligatured
themselves (if you enclose "AE" in such shapes created with combining
characters, it should not produce an "Æ" ligature letter in the final
enclosing shape).


2016-03-18 5:18 GMT+01:00 Garth Wallace :

> There's another strategy for dealing with enclosed numbers, which is
> taken by the font Quivira in its PUA: encoding separate
> left-half-circle-enclosed and right-half-circle-enclosed digits. This
> would require 20 characters to cover the double digit range 00–99.
> Enclosed three digit numbers would require an additional 30 for left,
> center, and right thirds, though it may be possible to reuse the left
> and right half circle enclosed digits and assume that fonts will
> provide left half-center third-right half ligatures 

Re: Variations and Unifications ?

2016-03-19 Thread Philippe Verdy
One problem caused by disunification is the increased complexity of
algorithms handling text.

I forgot an important case where disunification also occurred: combining
sequences are the "normal" encoding, but legacy charsets encoded precomposed
characters separately, and Unicode had to map them for round-trip
compatibility purposes. This had a consequence: the creation of additional
properties (i.e. for "canonical equivalence") in order to reconcile the two
sets of encodings and allow some form of equivalence.

In fact this is general: each time we disunify a character, we have to add
new properties, and possibly update the algorithms to take these properties
into account and find some form of equivalence.
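
(The precomposed/combining case is the textbook example. In Python:

    import unicodedata

    # U+00E9 (precomposed 'é') and 'e' + U+0301 (combining acute accent)
    # are canonically equivalent; normalization makes them compare equal.
    precomposed = "\u00E9"
    sequence = "e\u0301"
    print(precomposed == sequence)                                # False
    print(unicodedata.normalize("NFD", precomposed) == sequence)  # True
)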

So disunification solves one problem but creates others. We have to weigh
the benefits and costs of using the disunified characters against those of
using the "normal" characters (possibly in sequences).

But given the number of cases where we have to support sequences (even if
it's only combining sequences for canonical equivalence), we should really
disfavor disunifying characters: if it's possible with sequences, don't
disunify.

A famous example (based on a legacy decision which was bad in my opinion, as
the cost was not considered) was the disunification of Latin/Greek letters
for mathematical purposes, only to force a specific style. But the
alternative representation using sequences (using variation selectors for
example, as the addition of specific modifiers for "styles" like "bold",
"italic" or "monospace" was rejected with good reasons) was not really
analyzed in terms of benefits and costs, using the algorithms we already
have (and that could have been updated). But mathematical symbols are
(normally...) not used at all in the same context as plain alphabetic
letters (even if there's absolutely no guarantee that they will always be
distinguishable from them when they occur in some linguistic text rendered
with the same style...).

The naive thinking that disunification will make things simpler is
completely wrong (given that an application that ignored all character
properties and used only isolated characters would break legitimate rules in
many cases, even for rendering purposes). It is in fact simpler to keep the
possible sequences that are already encoded (or that could be extended to
cover more cases: e.g. add new variation sequences, introduce some new
modifiers, not just new combining characters, and so on).

We were strongly told: Unicode encodes characters, not glyphs. This should
be remembered (and the cost argument against disunifying distinct glyphs is
also a good one).


2016-03-17 8:20 GMT+01:00 Asmus Freytag (t) :

> On 3/16/2016 11:11 PM, Philippe Verdy wrote:
>
> "Disunification may be an answer?" We should avoid it as well.
>
> Disunification is only acceptable when
> - there's a complete disunification of concepts
>
>
> I think answering this question depends on the understanding of "concept",
> and on understanding what it is that Unicode encodes.
>
> When it comes to *symbols*, which is where the discussion originated,
> it's not immediately obvious what Unicode encodes. For example, I posit
> that Unicode does not encode the "concept" for specific mathematical
> operators, but the individual "symbols" that are used for them.
>
> For example PRIME and DOUBLE PRIME can be used for minutes and seconds
> (both of time and arc) as well as for other purposes. Unicode correctly
> does not encode "MINUTE OF ARC", but the symbol used for that -- leaving it
> up to the notational convention to relate the concept and the symbol.
>
> Thus we have a case where multiple concepts match a single symbol. For the
> converse, we take the well-known case of COMMA and FULL STOP which can both
> be used to separate a decimal fraction.
>
> Only in those cases where a single concept is associated so exclusively
> with a given symbol, do we find the situation that it makes sense to treat
> variations in shape of that symbol as the same symbol, but with different
> glyphs.
>
> For some astrological symbols that is the case, but for others it is not.
> Therefore, the encoding model for astrological text cannot be uniform.
> Where symbols have exclusive association with a concept, the natural
> encoding is to encode symbols with an understood set of variant glyphs.
> Where concepts are denoted with symbols that are also used otherwise, then
> the association of concept to symbol must become a matter of notational
> convention and cannot form the basis of encoding: the code elements have to
> be on a lower level, and by necessity represent specific symbol shapes.
>
> A./
>


Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Andrew Cunningham
Hi Don,

Latin is fine if you keep to simple, well-made fonts and avoid using the
more sophisticated typographic features available in some fonts.

Dumb it down typographically and it works fine. PDF, despite all the
current rhetoric coming from PDF software developers, is a preprint format.
Not an archival format.

The PDF format is less than ideal. But it is widely used, often in a way the
format was never really created for. There are alternatives that preserve
the text, but they have never really taken off (compared to PDF) for various
reasons.

Andrew







On Sunday, 20 March 2016, Don Osborn  wrote:
> Thanks Andrew, Looking at the issue of ToUnicode mapping you mention, why
in the 1-many mapping of ligatures (for fonts that have them) do the "many"
not simply consist of the characters ligated? Maybe that's too simple (my
understanding of the process is clearly inadequate).
>
> The "string of random ASCII characters" (per Leonardo) used in the
Identity H system for hanzi raise other questions: (1) How are the ASCII
characters interpreted as a 1-many sequence representing a hanzi rather
than just a series of 1-1 mappings of themselves? (2) Why not just use the
Unicode code point?
>
> The details may or may not be relevant to the list topic, but as a user
of documents in PDF format, I fail to see the benefit of such obscure
mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've
just encountered with these mappings, I'm wondering whether those concerned
knew how the font & mapping results turned out as they did. It is certain
that the creators of the documents didn't intend results that would not be
searchable by normal text, but it seems possible that a particular font
choice with these ligatures unwittingly produced these results. If the
latter, the software at the very least should show a caveat about such
mappings when generating PDFs.
>
> Maybe it's unrealistic to expect a simple implementation of Unicode in PDFs
(a topic we've discussed before but which I admit not fully grasping).
Recalling I once had some wild results copy/pasting from an N'Ko PDF, and
ended up having to obtain the .docx original to obtain text for insertion
in a blog posting. But while it's not surprising to encounter issues with
complex non-Latin scripts from PDFs, I'd gotten to expect predictability
when dealing with most Latin text.
>
> Don
>
>
>
> On 3/17/2016 7:34 PM, Andrew Cunningham wrote:
>
> There are a few things going on.
>
> In the first instance, it may be the font itself that is the source of
the problem.
>
> My understanding is that PDF files contain a sequence of glyphs. A PDF
file will contain a ToUnicode mapping between glyphs and codepoints. This
is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides
support for ligatures and variation sequences.
>
> I assume it uses the data in the font's cmap table. If the ligature
isn't  mapped then you will have problems. I guess the problem could be
either the font or the font subsetting and embedding performed when the PDF
is generated.
>
> Although, it is worth noting that in opentype fonts not all glyphs will
have mappings in the cmap file.
>
> The remedy, is to extensively tag the PDF and add ActualText attributes
to the tags.
>
> But the PDF specs leave it up to the developer to decide what happens when
there is both a visible text layer and ActualText. So even in an ideal PDF,
results will vary from software to software when copying text or searching
a PDF.
>
> At least that's my current understanding.
>
> Andrew
>
> On 18 Mar 2016 7:47 am, "Don Osborn"  wrote:
>>
>> Thanks all for the feedback.
>>
>> Doug, It may well be my clipboard (running Windows 7 on this particular
laptop). Get same results pasting into Word and EmEditor.
>>
>> So, when I did a web search on "internaƟonal," as previously mentioned,
and came up with a lot of results (mostly PDFs), were those also a
consequence of many not fully Unicode-compliant conversions by others?
>>
>> A web search on what you came up with - "InternaƟonal" - yielded many
more (82k+) results, again mostly PDFs, with terms like "interna onal"
(such as what Steve noted) and "interna>
>> Searching within the PDF document already mentioned, "international"
comes up with nothing (which is a major fail as far as usability).
Searching the PDF in a Firefox browser window, only "internaƟonal" finds
the occurrences of what displays as "international." However after
downloading the document and searching it in Acrobat, only a search for
"internaƟonal" will find what displays as "international."
>>
>> A separate web search on "Eīects" came up with 300+ results, including
some GoogleBooks which in the texts display "effects" (as far as I
checked). So this is not limited to Adobe?
>>
>> Jörg, With regard to "Identity H," a quick search gives the impression
that this 

Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Julian Bradfield
On 2016-03-19, Don Osborn  wrote:
> The details may or may not be relevant to the list topic, but as a user
> of documents in PDF format, I fail to see the benefit of such obscure
> mappings. And as a creator of PDFs ("save as") looking at others' PDFs

Aren't you just being bitten by history? PDF derives from PostScript,
which is not a language for representing plain text with typesetting
information, but a language for type(and-graphic-)setting tout court.
There's a lot of history of fonts using arbitrary codepoints; the idea
that the underlying strings giving rise to the displayed graphics
should also be a good plain text representation of the information is
relatively novel.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Meteorological symbols for cloud conditions (on maps or elsewhere)

2016-03-19 Thread Philippe Verdy
Some other resources (outside Wikipedia):
- Kean University:
  http://www.kean.edu/~fosborne/resources/ex10g.htm
- Documented by the NOAA in the US (but I can't find the complete reference)
- These symbols seem to be supported by an "international standard", but I
don't know which one exactly.
- Documented with other symbols (rain, ice, snow, thunder...) in Canada for
flight planning

https://flightplanning.navcanada.ca/cgi-bin/CreePage.pl?Langue=anglais=NS_Inconnu=wxsymbols=wxsymb
-
http://www.visualdictionaryonline.com/earth/meteorology/international-weather-symbols/clouds.php


2016-03-18 17:59 GMT+01:00 Philippe Verdy :

> See
> https://fr.wikipedia.org/wiki/Carte_m%C3%A9t%C3%A9orologique#/media/File:Station_model_fr.svg
>
> I see these symbols used for noting cloud types (here cirrus and altocumulus:
> one drawn diagonally for middle altitudes, the other drawn horizontally for
> high altitudes).
>
> Note that the symbols may vary: see Altocumulus, for example, as found in
> the French Wikipedia (not sure if it's accurate), which differs from the
> symbol found in the sample notation on the map:
>
> https://fr.wikipedia.org/wiki/Altocumulus
>
> Other symbols, on the corresponding page in the English Wikipedia, are used
> to describe some cloud characteristics:
>
> https://en.wikipedia.org/wiki/Altocumulus_cloud
>
> Is there a well-defined collection of these symbols, and are they in the
> encoding pipeline?
>
>


Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Asmus Freytag (t)
On 3/18/2016 11:48 AM, Philippe Verdy wrote:

> East Asian vertical presentation does not just stack the elements on top
> of each other; very frequently it rotates them (including
> Latin/Greek/Cyrillic letters). So this is not really a new complication.

It is, because now all these combinations have to be treated as units
(because they would be expected NOT to be rotated).

They would be akin to the square kana abbreviations.

Suddenly, you need dedicated support from rendering engines, whereas for
horizontal texts you could design your fonts to get the intended outcome
with a "dumb" engine.

A./

  



Re: Variations and Unifications ?

2016-03-19 Thread Asmus Freytag (t)
On 3/16/2016 11:11 PM, Philippe Verdy wrote:

> "Disunification may be an answer?" We should avoid it as well.
>
> Disunification is only acceptable when
> - there's a complete disunification of concepts

I think answering this question depends on the understanding of
"concept", and on understanding what it is that Unicode encodes.

When it comes to symbols, which is where the discussion
originated, it's not immediately obvious what Unicode encodes. For
example, I posit that Unicode does not encode the "concept" for
specific mathematical operators, but the individual "symbols" that
are used for them.

For example PRIME and DOUBLE PRIME can be used for minutes and
seconds (both of time and arc) as well as for other purposes.
Unicode correctly does not encode "MINUTE OF ARC", but the symbol
used for that -- leaving it up to the notational convention to
relate the concept and the symbol.

Thus we have a case where multiple concepts match a single symbol.
For the converse, we take the well-known case of COMMA and FULL STOP
which can both be used to separate a decimal fraction.

Only in those cases where a single concept is associated so
exclusively with a given symbol, do we find the situation that it
makes sense to treat variations in shape of that symbol as the same
symbol, but with different glyphs.

For some astrological symbols that is the case, but for others it is
not. Therefore, the encoding model for astrological text cannot be
uniform. Where symbols have exclusive association with a concept,
the natural encoding is to encode symbols with an understood set of
variant glyphs. Where concepts are denoted with symbols that are
also used otherwise, then the association of concept to symbol must
become a matter of notational convention and cannot form the basis
of encoding: the code elements have to be on a lower level, and by
necessity represent specific symbol shapes.

A./
  



Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Garth Wallace
On Fri, Mar 18, 2016 at 11:48 AM, Philippe Verdy  wrote:
> 2016-03-18 19:11 GMT+01:00 Garth Wallace :
>>
>> > The issues with line breaking (if you can use these combining characters
>> > around all characters, including spaces) can be solved using unbreakable
>> > characters.
>>
>> Line breaking isn't really a problem that I can see with the Quivira
>> model. If they're given the usual line breaking properties for
>> symbols, the Unicode line breaking algorithm would prevent a break
>> between halves. East Asian vertical text is another story. In a font
>> that just uses kerning to join halves (as Quivira does) you'd end up
>> with the left half on top of the right in vertical text. I'm not sure
>> how ligatures are handled in vertical text.
>
>
> East Asian vertical presentation does not just stack the elements on top of
> each other; very frequently it rotates them (including Latin/Greek/Cyrillic
> letters). So this is not really a new complication.

True. I suppose if the half-enclosed digits were defined as halfwidth,
it would work. It makes intuitive sense too, if a complete numbered
circle is assumed to fill an ideographic cell. I'm not sure if
rotation of the numbers would be desired, though.

> The numbers, however, are used for annotating or commenting on a strategy,
> or the placement order during a game.
>
> However, for game notation purposes, rotation plays a significant role
> (notably if those two-part symbols are joined into a circle or disc): it can
> make the difference between several distinct sets of stones, or it could be
> used in a 4-player go variant (where black vs. white is not sufficient to
> distinguish the players). In reality the stones would have 4 colours (stones
> are not really numbered; they are all the same for the same player, or
> there's some specially marked type of stone for each player in addition to
> their normal set), or sets would have some symbol or dot on top of them.

Rotation is definitely not salient in standard go kifu like it is in
fairy chess notation. Go variants for more than 2 players are uncommon
enough that I don't think any sort of standardized notation exists.

> There are also go variants using stones that take a territory and block the
> position but that cannot be captured (both players can use them, but the
> territory taken is not counted for either player).
> These stones can also be placed randomly over the board at the start of the
> game to complicate it, or there's a limited set of blocking stones for each
> player, who can choose when to play them instead of standard stones.
> Those blocking stones are visually distinct, but identical for the two
> players that have them at the start of the game.

Do you have any links? I'm interested in game design.

> Although the classic rules of go are extremely simple, this game has a lot
> of variants. In fact many players who don't know the exact classic rules
> invent their own variants.

These are generally one-off inventions (or commercial products) so I
don't think there's much need to consider their hypothetical
variations on notation.


Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Don Osborn
Thanks Andrew, Looking at the issue of ToUnicode mapping you mention, 
why in the 1-many mapping of ligatures (for fonts that have them) do the 
"many" not simply consist of the characters ligated? Maybe that's too 
simple (my understanding of the process is clearly inadequate).


The "string of random ASCII characters" (per Leonardo) used in the 
Identity H system for hanzi raise other questions: (1) How are the ASCII 
characters interpreted as a 1-many sequence representing a hanzi rather 
than just a series of 1-1 mappings of themselves? (2) Why not just use 
the Unicode code point?


The details may or may not be relevant to the list topic, but as a user 
of documents in PDF format, I fail to see the benefit of such obscure 
mappings. And as a creator of PDFs ("save as") looking at others' PDFs 
I've just encountered with these mappings, I'm wondering whether those 
concerned knew how the font & mapping results turned out as they did. 
It is certain that the creators of the documents didn't intend results 
that would not be searchable by normal text, but it seems possible that 
a particular font choice with these ligatures unwittingly produced these 
results. If the latter, the software at the very least should show a 
caveat about such mappings when generating PDFs.


Maybe it's unrealistic to expect a simple implementation of Unicode in PDFs 
(a topic we've discussed before but which I admit not fully grasping). 
Recalling I once had some wild results copy/pasting from an N'Ko PDF, 
and ended up having to obtain the .docx original to obtain text for 
insertion in a blog posting. But while it's not surprising to 
encounter issues with complex non-Latin scripts from PDFs, I'd gotten to 
expect predictability when dealing with most Latin text.


Don



On 3/17/2016 7:34 PM, Andrew Cunningham wrote:


There are a few things going on.

In the first instance, it may be the font itself that is the source of 
the problem.


My understanding is that PDF files contain a sequence of glyphs. A PDF 
file will contain a ToUnicode mapping between glyphs and codepoints. 
This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping 
provides support for ligatures and variation sequences.
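
(Conceptually, that table behaves like the following Python sketch; the
glyph IDs are made up for illustration, and real extraction is of course
done by the PDF viewer, not by code like this:

    # Conceptual sketch of a ToUnicode-style table (made-up glyph IDs).
    # A 1-many entry maps one ligature glyph to several code points.
    to_unicode = {
        1: "I",
        2: "n",
        7: "ti",  # ligature glyph -> "t" + "i"
    }

    def copy_text(glyph_ids):
        # A glyph missing from the table is exactly what yields garbage
        # (or nothing) when text is copied out of the PDF.
        return "".join(to_unicode.get(g, "\uFFFD") for g in glyph_ids)

    print(copy_text([1, 2, 7]))  # -> "Inti"
)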


I assume it uses the data in the font's cmap table. If the ligature 
isn't  mapped then you will have problems. I guess the problem could 
be either the font or the font subsetting and embedding performed when 
the PDF is generated.


Although, it is worth noting that in opentype fonts not all glyphs 
will have mappings in the cmap file.


The remedy, is to extensively tag the PDF and add ActualText 
attributes to the tags.


But the PDF specs leave it up to the developer to decide what happens 
when there is both a visible text layer and ActualText. So even in an 
ideal PDF, results will vary from software to software when copying 
text or searching a PDF.


At least that's my current understanding.

Andrew

On 18 Mar 2016 7:47 am, "Don Osborn" > wrote:


Thanks all for the feedback.

Doug, It may well be my clipboard (running Windows 7 on this
particular laptop). Get same results pasting into Word and EmEditor.

So, when I did a web search on "internaƟonal," as previously
mentioned, and came up with a lot of results (mostly PDFs), were
those also a consequence of many not fully Unicode-compliant
conversions by others?

A web search on what you came up with - "InternaƟonal" - yielded
many more (82k+) results, again mostly PDFs, with terms like
"interna onal" (such as what Steve noted) and "interna

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Philippe Verdy
Sequences were introduced long before. I know that they add their own
complications everywhere, but they are already part of existing algorithms.
If sequences (not just combining sequences) were not there, there would be
many more characters encoded in the database, and everything would be
encoded like sinograms (mostly one character per composite glyph).

2016-03-18 19:58 GMT+01:00 Asmus Freytag (t) :

> On 3/18/2016 11:11 AM, Garth Wallace wrote:
>
> > The enclosure could also be something other than a circle (or circular
> > arcs): it could be a rectangle, hintable with joiners (like with circles)
> > to create an enclosing square, or a rounded rectangle (hintable to create
> > a rounded square).
>
> I thought combining characters would not be suitable for things like
> white text on black.
>
>
> Philippe seems to have an appetite for combining sequences that's not
> shared by the UTC.
>
> A./
>


Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Don Osborn

Thanks all for the feedback.

Doug, It may well be my clipboard (running Windows 7 on this particular 
laptop). Get same results pasting into Word and EmEditor.


So, when I did a web search on "internaƟonal," as previously mentioned, 
and came up with a lot of results (mostly PDFs), were those also a 
consequence of many not fully Unicode-compliant conversions by others?


A web search on what you came up with - "InternaƟonal" - yielded many 
more (82k+) results, again mostly PDFs, with terms like "interna onal" 
(such as what Steve noted) and "interna

Doug Ewell wrote:

Don Osborn wrote:


Odd result when copy/pasting text from a PDF: For some reason "ti" in
the (English) text of the document at
http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
is coded as "Ɵ". Looking more closely at the original text, it does
appear that the glyph is a "ti" ligature (which afaik is not coded as
such in Unicode).

When I copy and paste the PDF text in question into BabelPad, I get:


InternaƟonal Order and the DistribuƟon of IdenƟty in 1950 (By
invitaƟon only)

The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use
character.

Truncating this character to 16 bits, which is a Bad Thing™, yields
U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either
Don's clipboard or the editor he pasted it into is not fully
Unicode-compliant.

Don's point about using alternative characters to implement ligatures,
thereby messing up web searches, remains valid.

--
Doug Ewell | http://ewellic.org | Thornton, CO 
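
(Doug's truncation diagnosis is easy to reproduce, e.g. in Python:

    # U+10019F is a Plane 16 private-use code point. Truncating it to
    # 16 bits yields U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE,
    # which is exactly the "Ɵ" seen in the copied text.
    pua = 0x10019F
    print(hex(pua & 0xFFFF))  # 0x19f
    print(chr(pua & 0xFFFF))  # Ɵ
)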








Re: Swapcase for Titlecase characters

2016-03-19 Thread Marcel Schneider
On Sat Mar 19, 2016 12:54:51, Martin J. Dürst  wrote:

> On 2016/03/19 04:33, Marcel Schneider wrote:
> > On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote:
> 
> >> b) Convert to upper (or lower), which may simplify implementation.
> 
> >> For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or
> >> 'džinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would
> >> become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c).
> 
> > Looking at your examples, I would add a case that typically occurs for 
> > swapcase to be applied:
> 
> > ‘ᾠΔΉ’ (cited [erroneously] as a result of option b) that is to be converted 
> > to ‘ᾨδή’, and ‘džINSI’, that is to become ‘Džinsi’.
> 
> First, what do you mean with "erroneously"?

The intent of that bracketed word was just to note the fact that when ‘ᾨδή’ 
is converted to lower case as assumed in option “b-lower”, it becomes ‘ᾠδή’, 
while ‘ᾠΔΉ’ is a typical candidate for swapcase; thus I could reuse it “as 
is” to illustrate the fourth case.

> 
> Second, did I get this right that your additional case (let's call it 
> d)) would cycle through the three options where available:
> lower -> title -> upper -> lower.

I’m afraid that swapcase as I saw it is not a round-trip method; therefore I 
had some awkward moments today when I thought about how to implement it. As 
far as I could see, there are two pairs:

I: lowercase → titlecase (needed to correct the initials where the user pressed 
the shift modifier)
II: uppercase → lowercase (needed to correct the body of the words input while 
caps lock was on)

That typically matches what happens when caps lock is accidentally on and the 
user writes normally―on a keyboard that includes digraphs and uses the SGCaps 
feature for them, like this:

Modifier:     None   Shift
CapsLock off: Lower  Title
CapsLock on:  Upper  Lower
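
(A hedged Python sketch of that correction: pair I applied to the
word-initial letter, pair II to the rest. The function name is made up, and
no real word processor is claimed to work exactly this way:

    # Correct a word typed with caps lock accidentally on: the shifted
    # initial came out lowercase (pair I: lower -> title) and the body
    # came out uppercase (pair II: upper -> lower).
    def fix_capslock_word(word: str) -> str:
        head, tail = word[0], word[1:]
        head = head.title() if head.islower() else head
        return head + tail.lower()

    print(fix_capslock_word("\u01C6INSI"))  # 'džINSI' -> 'Džinsi'
)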

Correcting keyboard input done with the wrong caps lock state is the only 
situation I can see where swapcase is needed and thus is likely to be used. 
This is why the swapcase method is implemented in word processors, as part of 
an optional autocorrect feature that neutralizes the effect of starting a 
sentence normally while caps lock is on: after completing the input of an 
uppercase word with an initial lowercase letter, the word is automatically 
swapcased and caps lock is turned off.

However, now that I tested it with the digraph of the examples (input through 
the composer of the keyboard layout), it doesnʼt work at all in one word 
processor, while in another one it works but uppercases the initial lowercase 
digraph instead of titlecasing it. [That may be considered an effect of 
“streamlined” implementations that drop the less frequent cases.]


I donʼt believe that it would be useful to make swapcase a round-trip method, 
and anyway it would be weird because of the letters with three case forms. The 
case conversion cycle you draw above usually applies to words (and doesnʼt work 
correctly in either of the two tested word processors when an initial DZ 
digraph is present), while most letters have identical values for 
Titlecase_Mapping and Uppercase_Mapping, and usually there is no means to flag 
them with “Titlecase_State”. This might be one more reason why current 
implementations of swapcase donʼt match the expected behavior for digraphs.


> 
> > As for decomposing digraphs and ypogegrammeni to apply swapcase: that 
> > probably would do no good, as itʼs unnecessary and users wonʼt expect it.
> 
> Why do you say "users won't expect it"? For those users not aware of the 
> encoding internals, I'd indeed guess that's what users would expect, at 
> least in the Croatian case.

That depends on what the expected result is. If the swapcase method is to 
correct inverted casing, users wouldnʼt like to see the digraphs decomposed, 
all the less as in the languages considered, the DZ digraph is part of the 
alphabet, between ‘D’ and ‘Đ’, so users are really aware of it.

> For Greek, it may be different; it depends 
> on the extent to which the iota is seen as a letter vs. seen as a mark.

Here again the user inputs a precomposed letter, with iota subscript, because 
he just wants a capitalized word, not an uppercase one. And here again the 
autocorrect doesnʼt work in one word processor, while the other one applies 
uppercasing with uppercase iota adscript―while the rest of the word is 
lowercase―instead of capitalization with lowercase iota adscript or iota 
subscript (which of the two depends on conventions and preferences).

Letʼs take that as proof of how hard it is to implement swapcase with digraph 
support.

I canʼt better conclude this reply than with Asmus Freytagʼs words on Fri, 1st 
Jan 2016 12:09:13 -0800: [1]

> Unicode aims to be expressive enough to model all plain text. That means, it 
> inherits the non-reducible complexity of text. Even the insight that the 
> complexity is non-reducible would be a big step forward.

Regards,

Marcel

[1] Re: 

Re: Swapcase for Titlecase characters

2016-03-19 Thread Doug Ewell

Martin J. Dürst wrote:


Now the question I have is: What to do for titlecase characters?
[ ... ]
For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or
'džinsi') with b), and 'dŽINSI' with c).


For the Latin letters at least, my 0.02 cents' worth (you read that 
right) is that they are probably so infrequently used that option (b) 
would be just fine.


As one anecdote (which is even less like "data" than two anecdotes), I 
could not find any of the characters IJ ij DŽ Dž dž LJ Lj lj NJ Nj nj or their hex 
equivalents in any of the CLDR keyboard definitions. I'd imagine that 
users just type the two characters separately, and that consequently 
most data in the real world is like that.


--
Doug Ewell | http://ewellic.org | Thornton, CO  



Re: Variations and Unifications ?

2016-03-19 Thread Philippe Verdy
"Disunification may be an answer?" We should avoid it as well.

We have other solutions in Unicode:
- variation selectors (often used for sinograms when their unified shapes
must be distinguished in some contexts, such as people's names, toponyms,
trademark names, or other specific contexts),
- or combining sequences (including in Arabic or Hebrew, where many
combining characters are not always represented visually, the same occurring
as well in Latin with accents not always presented over capitals),
- or sequences of multiple characters (like in emoji for skin color
variants, or sequences for encoding flags),
- or other sequences using joiners (e.g. in South Asian scripts).

Disunification is only acceptable when
- there's a complete disunification of concepts and the "similar" shapes
are also different even if one originates from the other (e.g. the Latin
slashed o disunified from the Latin o, even though there's also the sequence
o + combining slash, almost never used as its rendering is too approximate
in most cases),
- or there's a clear distinction of semantics and properties (e.g. the
Latin AE ligature, which is not appropriately represented by the two
separate letters, not even with a "hinting" joiner, and which has specific
properties as a plain letter, e.g. with mappings).

Before disunifying a character, we should first study the alternative of
their representation as sequences.

2016-03-16 18:34 GMT+01:00 Asmus Freytag (t) :

> On 3/15/2016 8:14 PM, David Faulks wrote:
>
> As part of my investigations into astrological symbols, I'm beginning to 
> wonder if glyph variations are justifications for separate encoding of 
> symbols I would have previously considered the same or unifiable with symbols 
> already in Unicode.
>
> For example, the semisquare aspect is usually shown with a glyph that is 
> identical to ∠ (U+2220 ANGLE). However, sometimes it looks like <, or like ∟ 
> (U+221F RIGHT ANGLE). Would this be better encoded as a separate codepoint?
>
> The parallel aspect, similarly, sometimes looks like ∥ (U+2225 PARALLEL TO), 
> but is often shown as // or ⫽ (U+2AFD DOUBLE SOLIDUS OPERATOR). This is not a 
> typographical kludge since astrological fonts often show it this way.
> There is also contra-parallel, which sometimes is shown like ∦ (U+2226 NOT 
> PARALLEL TO), but has variant glyphs with slanted lines (and the crossbar is 
> often horizontal).
>
> The ‘part of fortune’ is sometimes a circled ×, or sometimes a circled +.
>
> Would it be better to have dedicated characters than to assume unifications 
> in these cases?
>
>
>
> My take is that for symbols there's always that tension between encoding
> the "concept" or encoding the shape. In my view, it is often impossible to
> answer the question whether the different angles (for example) are merely
> different "shapes" of one and the same "symbol", or whether it isn't the
> case that there are different "conventions" (using different symbols for
> the same concept).
>
> Disunification is useful, whenever different concepts require distinct
> symbol shapes (even if there are some general similarities). If other
> concepts make use of the same shapes interchangeably, it is then up to the
> author to fix the convention by selecting one or the other shape.
> Conceptually, that is similar to the decimal point: it can be either a
> period, or a comma, depending on locale (read: depending on the convention
> the author follows).
>
> Sometimes, concepts use multiple symbol shapes, but all of these shapes
> map to the same concept (and other uses are not known). In that case,
> unifying the shapes might be acceptable. The selection of shape is then a
> matter of the font (and may not always be under the control of the author).
> Conceptually, that is similar to the integral sign, which can be slanted or
> upright. The choice is one of style. While authors or readers may prefer
> one look over the other, the identity of the symbol is not in question, and
> there's no impact on transmission of the contents of the text.
>
> Whenever we have the former case, that is, multiple conventional
> presentations that are symbols in their own right in other contexts, then
> encoding an additional "generic" shape should be avoided. Unicode
> explicitly did not encode a generic "decimal point". If the convention that
> is used matters, the author is better off being able to select a specific
> shape. The results will be more predictable. The downside is that a search
> will have to cover all the conventions. Conceptually, that is no different
> from having to search for both "color" and "colour".
>
> The final case is where a convention for depicting a concept uses a symbol
> that itself has some variability (for example when representing some other
> concepts), such that some of its forms make it less than ideal for the
> conventional use intended for the concept in question. Unicode has
> historically not always been able to 

Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Steve Swales
Yes, it seems like your mileage varies with the PDF 
viewer/interpreter/converter.  Text copied from Preview on the Mac replaces the 
ti ligature with a space.  Certainly not a Unicode problem, per se, but an 
interesting problem nevertheless.

-steve

> On Mar 17, 2016, at 11:11 AM, Doug Ewell  wrote:
> 
> Don Osborn wrote:
> 
>> Odd result when copy/pasting text from a PDF: For some reason "ti" in
>> the (English) text of the document at
>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>> is coded as "Ɵ". Looking more closely at the original text, it does
>> appear that the glyph is a "ti" ligature (which afaik is not coded as
>> such in Unicode).
> 
> When I copy and paste the PDF text in question into BabelPad, I get:
> 
>> InternaƟonal Order and the DistribuƟon of IdenƟty in 1950 (By
>> invitaƟon only)
> 
> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use
> character.
> 
> Truncating this character to 16 bits, which is a Bad Thing™, yields
> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either
> Don's clipboard or the editor he pasted it into is not fully
> Unicode-compliant.
> 
> Don's point about using alternative characters to implement ligatures,
> thereby messing up web searches, remains valid.
> 
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 
> 
> 
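A minimal Python sketch of the 16-bit truncation described above (the code
point is the one Doug reports; where exactly the truncation happens in a
given clipboard chain is an open question):

    # U+10019F: the Plane 16 private-use character standing in for "ti".
    cp = ord("\U0010019F")        # 0x10019F
    truncated = cp & 0xFFFF       # naive 16-bit truncation drops the plane
    print(hex(truncated), chr(truncated))
    # 0x19f Ɵ -- LATIN CAPITAL LETTER O WITH MIDDLE TILDE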




Re: Swapcase for Titlecase characters

2016-03-19 Thread Martin J. Dürst

Thanks everybody for the feedback.

On 2016/03/19 04:33, Marcel Schneider wrote:

On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst  wrote:



b) Convert to upper (or lower), which may simplify implementation.



For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or
'džinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would
become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c).
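A rough Python sketch of options a) and b) for the Croatian example; the
per-character handling is my own reading of the options, not any particular
library's swapcase:

    import unicodedata

    def swapcase(text, titlecase="keep"):
        out = []
        for ch in text:
            cat = unicodedata.category(ch)
            if cat == "Lt":                  # titlecase letter, e.g. U+01C5 'Dž'
                # option a) keeps it; option b) maps it to uppercase
                out.append(ch.upper() if titlecase == "upper" else ch)
            elif cat == "Lu":
                out.append(ch.lower())
            elif cat == "Ll":
                out.append(ch.upper())
            else:
                out.append(ch)
        return "".join(out)

    print(swapcase("\u01C5insi"))            # a): 'DžINSI'
    print(swapcase("\u01C5insi", "upper"))   # b): 'DŽINSI'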



Looking at your examples, I would add a case where swapcase is typically
applied:



‘ᾠΔΉ’ (cited [erroneously] as a result of option b), which is to be
converted to ‘ᾨδή’, and ‘džINSI’, which is to become ‘Džinsi’.


First, what do you mean with "erroneously"?

Second, did I get this right that your additional case (let's call it 
d)) would cycle through the three options where available:

lower -> title -> upper -> lower.


As for decomposing digraphs and ypogegrammeni to apply swapcase: that
probably would do no good, as itʼs unnecessary and users wonʼt expect it.


Why do you say "users won't expect it"? For those users not aware of the 
encoding internals, I'd indeed guess that's what users would expect, at 
least in the Croatian case. For Greek, it may be different; it depends 
on the extent to which the iota is seen as a letter vs. seen as a mark.


Regards,   Martin.


Re: Variations and Unifications ?

2016-03-19 Thread Asmus Freytag (t)

  
  
On 3/15/2016 8:14 PM, David Faulks wrote:


As part of my investigations into astrological symbols, I'm beginning to
wonder if glyph variations are justifications for separate encoding of
symbols I would have previously considered the same or unifiable with
symbols already in Unicode.

For example, the semisquare aspect is usually shown with a glyph that is
identical to ∠ (U+2220 ANGLE). However, sometimes it looks like <, or like
∟ (U+221F RIGHT ANGLE). Would this be better encoded as a separate
codepoint?

The parallel aspect, similarly, sometimes looks like ∥ (U+2225 PARALLEL
TO), but is often shown as // or ⫽ (U+2AFD DOUBLE SOLIDUS OPERATOR). This
is not a typographical kludge, since astrological fonts often show it this
way. There is also contra-parallel, which is sometimes shown like ∦ (U+2226
NOT PARALLEL TO), but has variant glyphs with slanted lines (and the
crossbar is often horizontal).

The ‘part of fortune’ is sometimes a circled ×, or sometimes a circled +.

Would it be better to have dedicated characters than to assume unifications
in these cases?




My take is that for symbols there's always that tension between encoding
the "concept" or encoding the shape. In my view, it is often impossible to
answer the question whether the different angles (for example) are merely
different "shapes" of one and the same "symbol", or whether it isn't the
case that there are different "conventions" (using different symbols for
the same concept).

Disunification is useful, whenever different concepts require distinct
symbol shapes (even if there are some general similarities). If other
concepts make use of the same shapes interchangeably, it is then up to the
author to fix the convention by selecting one or the other shape.
Conceptually, that is similar to the decimal point: it can be either a
period, or a comma, depending on locale (read: depending on the convention
the author follows).

Sometimes, concepts use multiple symbol shapes, but all of these shapes
map to the same concept (and other uses are not known). In that case,
unifying the shapes might be acceptable. The selection of shape is then a
matter of the font (and may not always be under the control of the author).
Conceptually, that is similar to the integral sign, which can be slanted
or upright. The choice is one of style. While authors or readers may prefer
one look over the other, the identity of the symbol is not in question, and
there's no impact on transmission of the contents of the text.

Whenever we have the former case, that is, multiple conventional
presentations that are symbols in their own right in other contexts, then
encoding an additional "generic" shape should be avoided. Unicode
explicitly did not encode a generic "decimal point". If the convention that
is used matters, the author is better off being able to select a specific
shape. The results will be more predictable. The downside is that a search
will have to cover all the conventions. Conceptually, that is no different
from having to search for both "color" and "colour".

The final case is where a convention for depicting a concept uses a symbol
that itself has some variability (for example when representing some other
concepts), such that some of its forms make it less than ideal for the
conventional use intended for the concept in question. Unicode has
historically not always been able to provide a solution. In some of these
cases, plain text (that is, without a fixed font association) may simply
not give the desired answer. If specialized fonts for the convention (e.g.
astrological fonts) do not usually exist or can't be expected, then
disunifying the symbol's shapes may be an answer.

A./

  



Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Andrew Cunningham
There are a few things going on.

In the first instance, it may be the font itself that is the source of the
problem.

My understanding is that PDF files contain a sequence of glyphs. A PDF file
will contain a ToUnicode mapping between glyphs and codepoints. This
is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides
support for ligatures and variation sequences.
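As a toy sketch of the failure mode (Python; the glyph IDs and the mapping
are invented for illustration, not taken from any real PDF):

    # Hypothetical ToUnicode-style mapping: glyph ID -> extracted text.
    to_unicode = {
        1: "I", 2: "n", 3: "t", 4: "e", 5: "r", 6: "a",
        7: "ti",            # 1-many entry for the "ti" ligature glyph
    }

    def extract(glyph_ids):
        # For unmapped glyphs an extractor can only guess;
        # U+FFFD is used here as a stand-in for that guess.
        return "".join(to_unicode.get(g, "\uFFFD") for g in glyph_ids)

    print(extract([1, 2, 3, 4, 5, 2, 6, 7]))   # 'Internati'
    del to_unicode[7]                           # simulate a missing entry
    print(extract([1, 2, 3, 4, 5, 2, 6, 7]))   # 'Interna\ufffd'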

I assume it uses the data in the font's cmap table. If the ligature isn't
mapped then you will have problems. I guess the problem could be either the
font or the font subsetting and embedding performed when the PDF is
generated.

Although it is worth noting that in OpenType fonts not all glyphs will
have mappings in the cmap table.

The remedy is to extensively tag the PDF and add ActualText attributes to
the tags.

But the PDF specs leave it up to the developer to decide what happens when
there is both a visible text layer and ActualText. So even in an ideal PDF,
results will vary from software to software when copying text or searching
a PDF.

At least that's my current understanding.

Andrew
On 18 Mar 2016 7:47 am, "Don Osborn"  wrote:

> Thanks all for the feedback.
>
> Doug, It may well be my clipboard (running Windows 7 on this particular
> laptop). Get same results pasting into Word and EmEditor.
>
> So, when I did a web search on "internaƟonal," as previously mentioned,
> and came up with a lot of results (mostly PDFs), were those also a
> consequence of many not fully Unicode compliant conversions by others?
>
> A web search on what you came up with - "InternaƟonal" - yielded many
> more (82k+) results, again mostly PDFs, with terms like "interna onal"
> (such as what Steve noted) and "internaƟonal" (a question of the nature
> of, or how Google interprets, the private use character?).
>
> Searching within the PDF document already mentioned, "international" comes
> up with nothing (which is a major fail as far as usability). Searching the
> PDF in a Firefox browser window, only "internaƟonal" finds the occurrences
> of what displays as "international." However after downloading the document
> and searching it in Acrobat, only a search for "internaƟonal" will find
> what displays as "international."
>
> A separate web search on "Eīects" came up with 300+ results, including
> some GoogleBooks which in the texts display "effects" (as far as I
> checked). So this is not limited to Adobe?
>
> Jörg, With regard to "Identity H," a quick search gives the impression
> that this encoding has had a fairly wide and not so happy impact, even if
> on the surface level it may have facilitated display in a particular style
> of font in ways that no one complains about.
>
> Altogether a mess, from my limited encounter with it. There must have been
> a good reason for or saving grace of this solution?
>
> Don
>
> On 3/17/2016 2:17 PM, Steve Swales wrote:
>
>> Yes, it seems like your mileage varies with the PDF
>> viewer/interpreter/converter.  Text copied from Preview on the Mac replaces
>> the ti ligature with a space.  Certainly not a Unicode problem, per se, but
>> an interesting problem nevertheless.
>>
>> -steve
>>
>> On Mar 17, 2016, at 11:11 AM, Doug Ewell  wrote:
>>>
>>> Don Osborn wrote:
>>>
>>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in
>>>> the (English) text of the document at
>>>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>>>> is coded as "Ɵ". Looking more closely at the original text, it does
>>>> appear that the glyph is a "ti" ligature (which afaik is not coded as
>>>> such in Unicode).
>>>>
>>> When I copy and paste the PDF text in question into BabelPad, I get:
>>>
>>>> InternaƟonal Order and the DistribuƟon of IdenƟty in 1950 (By
>>>> invitaƟon only)
>>>
>>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use
>>> character.
>>>
>>> Truncating this character to 16 bits, which is a Bad Thing™, yields
>>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either
>>> Don's clipboard or the editor he pasted it into is not fully
>>> Unicode-compliant.
>>>
>>> Don's point about using alternative characters to implement ligatures,
>>> thereby messing up web searches, remains valid.
>>>
>>> --
>>> Doug Ewell | http://ewellic.org | Thornton, CO 
>>>
>>>
>>>
>>
>


Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Andrew West
On 18 March 2016 at 23:49, Garth Wallace  wrote:
>
> Correction: the 2-digit pairs would require 19 characters. There would
> be no need for a left half circle enclosed digit one, since the
> enclosed numbers 10–19 are already encoded. This would only leave
> enclosed 20 as a potential confusable. There would also be no need for
> a left third digit zero, saving one code point if the thirds are not
> unified with the halves, so there would be 29 thirds.
>
> And just to clarify, there would have to be separate half circled and
> negative half circled digits. So that would be 96 characters
> altogether, or 58 if left and right third-circles are unified with
> their half-circle equivalents.  Not counting ideographic numbers.

Thanks for your suggestion, I have added two new options to my draft
proposal, one based on your suggestion (60 characters: 10 left, 10
middle and 10 right for normal and negative circles) and one more
verdyesque (four enclosing circle format characters).  To be honest, I
don't think the UTC will go for either of these options, but I doubt
they will be keen to accept any of the suggested options.

> This may not work very well for ideographic numbers though. In the
> examples, they appear to be written vertically within their circles
> (AFAICT none of the moves in those diagrams are numbered 100 or above,
> although some are hard to read).

I have now added an example with circled ideographic numbers greater
than 100.  See Fig. 13 in

http://www.babelstone.co.uk/Unicode/GoNotation.pdf

In this example, numbers greater than 100 are written in two columns
within the circle, with hundreds on the right.

Andrew



Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Marcel Schneider
On Thu, Mar 17, 2016 at 19:02:19, Pierpaolo Bernardi  wrote:

> unicode says nothing about font technologies

It does mention them briefly in the core specification, however:

http://www.unicode.org/versions/Unicode8.0.0/ch23.pdf#G23126

> unicode does not mandate how to encode ligatures

Probably because Unicode specifies that «it is the task of the rendering 
system» to select ligature glyphs on the basis of characteristic sequences of 
characters in the text stream.

Having found some of the mentioned oddities in an old PDF file (ffi
ligature ending up as Y, ffl ligature as Z), I’m now really puzzled about
actual practice.

Marcel



Re: Swapcase for Titlecase characters

2016-03-19 Thread Mark Davis ☕️
The 'swapcase' just sounds bizarre. What on earth is it for? My inclination
would be to just do the simplest possible implementation that has the
expected results for the 1:1 case pairs, and whatever falls out from the
algorithm for the others.
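For the 1:1 pairs, the expected round-trip behavior looks like this
(Python's built-in swapcase, shown only as one example of "whatever falls
out" of a given implementation):

    print("Foo BAR baz".swapcase())   # 'fOO bar BAZ': the 1:1 pairs simply flip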


Mark

On Sat, Mar 19, 2016 at 4:11 AM, Asmus Freytag (t) 
wrote:

> On 3/18/2016 12:33 PM, Marcel Schneider wrote:
>
> As for decomposing digraphs and ypogegrammeni to apply swapcase: that
> probably would do no good, as itʼs unnecessary and users wonʼt expect it.
>
>
> That was my intuition as well, but based on a different line of argument.
> If you add a feature to match behavior somewhere else, it rarely pays to
> make that perform "better", because it just means it's now different and no
> longer matches.
>
> The exception is a feature for which you can establish unambiguously that
> there is a metric of correctness or a widely (universally?) shared
> expectation by users as to the ideal behavior. In that case, being
> compatible with a broken feature (or a random implementation of one) may in
> fact be counter productive.
>
> The mere fact that you needed to ask here made me think that this would be
> unlikely to be one of those exceptions: because in that case, you would
> easily have been able to tap into a consensus that tells you what "better"
> means. (And the feature would probably have been more widely
> implemented.)
>
> This one is pretty bizarre on the face of it, but I like Marcel's
> suggestion as to its putative purpose.
>
> A./
>


Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Leonardo Boiko
The PDF *displays* correctly.  But try copying the string 'ti' from
the text to another application outside of your PDF viewer, and you'll
see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don
Osborn said.


2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi :
> That document displays correctly for me using both the pdf viewer
> built into chrome and the standalone Acrobat reader v.11.  The problem
> could be in your PDF viewer?  What are you viewing the document with?
>
> On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn  wrote:
>> Odd result when copy/pasting text from a PDF: For some reason "ti" in the
>> (English) text of the document at
>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>> is coded as "Ɵ". Looking more closely at the original text, it does appear
>> that the glyph is a "ti" ligature (which afaik is not coded as such in
>> Unicode).
>>
>> Out of curiosity, did a web search on "internaƟonal" and got over 11k hits,
>> apparently all PDFs.
>>
>> Anyone have any idea what's going on? Am assuming this is not a deliberate
>> choice by diverse people creating PDFs and wanting "ti" ligatures for
>> stylistic reasons. Note the document linked above is current, so this is not
>> (just) an issue with older documents.
>>
>> Don Osborn
>



Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Garth Wallace
On Thu, Mar 17, 2016 at 9:18 PM, Garth Wallace  wrote:
> There's another strategy for dealing with enclosed numbers, which is
> taken by the font Quivira in its PUA: encoding separate
> left-half-circle-enclosed and right-half-circle-enclosed digits. This
> would require 20 characters to cover the double digit range 00–99.
> Enclosed three digit numbers would require an additional 30 for left,
> center, and right thirds, though it may be possible to reuse the left
> and right half circle enclosed digits and assume that fonts will
> provide left half-center third-right half ligatures (Quivira provides
> "middle parts" though the result is a stadium instead of a true
> circle). It should be possible to do the same for enclosed ideographic
> numbers, I think.
>
> The problems I can see with this are confusability with the already
> encoded atomic enclosed numbers, and breaking in vertical text.

Correction: the 2-digit pairs would require 19 characters. There would
be no need for a left half circle enclosed digit one, since the
enclosed numbers 10–19 are already encoded. This would only leave
enclosed 20 as a potential confusable. There would also be no need for
a left third digit zero, saving one code point if the thirds are not
unified with the halves, so there would be 29 thirds.

And just to clarify, there would have to be separate half circled and
negative half circled digits. So that would be 96 characters
altogether, or 58 if left and right third-circles are unified with
their half-circle equivalents.  Not counting ideographic numbers.
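As a quick check of the arithmetic (Python; the breakdown into halves and
thirds is my reading of the counts above):

    left_halves  = 9          # digits 0 and 2-9 (no left 1: circled 10-19 exist)
    right_halves = 10         # digits 0-9
    halves = left_halves + right_halves     # 19
    thirds = 3 * 10 - 1       # left, middle, right digits, minus left zero: 29
    print(2 * (halves + thirds))   # normal + negative circles: 96
    print(2 * (halves + 10))       # thirds unified with halves (middles only): 58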

This may not work very well for ideographic numbers though. In the
examples, they appear to be written vertically within their circles
(AFAICT none of the moves in those diagrams are numbered 100 or above,
although some are hard to read).



Re: Swapcase for Titlecase characters

2016-03-19 Thread Asmus Freytag (t)

  
  
On 3/18/2016 12:33 PM, Marcel Schneider wrote:

As for decomposing digraphs and ypogegrammeni to apply swapcase: that
probably would do no good, as itʼs unnecessary and users wonʼt expect it.

That was my intuition as well, but based on a different line of argument.
If you add a feature to match behavior somewhere else, it rarely pays to
make that perform "better", because it just means it's now different and
no longer matches.

The exception is a feature for which you can establish unambiguously that
there is a metric of correctness or a widely (universally?) shared
expectation by users as to the ideal behavior. In that case, being
compatible with a broken feature (or a random implementation of one) may
in fact be counter productive.

The mere fact that you needed to ask here made me think that this would be
unlikely to be one of those exceptions: because in that case, you would
easily have been able to tap into a consensus that tells you what "better"
means. (And the feature would probably have been more widely implemented.)

This one is pretty bizarre on the face of it, but I like Marcel's
suggestion as to its putative purpose.

A./

  



Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Andrew West
Hi Frédéric,

The historic use of ideographic numbers for marking Go moves are
discussed in the latest draft of my document:

http://www.babelstone.co.uk/Unicode/GoNotation.pdf

Andrew


On 16 March 2016 at 13:35, Frédéric Grosshans
 wrote:
> Le 15/03/2016 22:21, Andrew West a écrit :
>>
>>
>> Possibly.  I certainly have very little expectation that a proposal to
>> complete both sets to 999 (or even 399) would have any chance of
>> success.
>
> And then, there are also the historical example of ideographic numbers used
> for the same purpose in historic texts (like here
> http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/
> or here
> http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile=6=wenzhangmod
> ).
>
> The above has been found with a quick google search, and I have no idea
> whether these symbols were used in the running text or not.
>
>   Frédéric
>



Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Martin J. Dürst

On 2016/03/19 04:55, Garth Wallace wrote:

On Fri, Mar 18, 2016 at 11:48 AM, Philippe Verdy  wrote:

2016-03-18 19:11 GMT+01:00 Garth Wallace :



Rotation is definitely not salient in standard go kifu like it is in
fairy chess notation. Go variants for more than 2 players are uncommon
enough that I don't think any sort of standardized notation exists.


The most frequent way to play Go with more than two players is to play 
in two teams, the players in each team taking turns when it's time for 
their team to play. But there's no need for any special notation for 
this case.


Regards,   Martin.