Meteorological symbols for cloud conditions (on maps or elsewhere)
See https://fr.wikipedia.org/wiki/Carte_m%C3%A9t%C3%A9orologique#/media/File:Station_model_fr.svg I see these symbols for noting cloud types (here cirrus and altocumulus; one drawn diagonally for middle altitudes, another drawn horizontally for high altitudes). Note that the symbols may vary: see Altocumulus, for example, as shown on French Wikipedia (not sure if it's accurate), which is different from the symbol found in the sample notation on the map: https://fr.wikipedia.org/wiki/Altocumulus Other symbols on the corresponding page of English Wikipedia are used to describe some cloud characteristics: https://en.wikipedia.org/wiki/Altocumulus_cloud Is there a well-defined collection of these symbols, and are they in the encoding pipeline?
Re: Joined "ti" coded as "Ɵ" in PDF
2016-03-17 19:02 GMT+01:00 Pierpaolo Bernardi: > On Thu, Mar 17, 2016 at 6:37 PM, Leonardo Boiko > wrote: > > The PDF *displays* correctly. But try copying the string 'ti' from > > the text into another application outside of your PDF viewer, and you'll > > see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don > > Osborn said. > > Ah. OK. Anyway this is not a Unicode problem. PDF knows nothing about > unicode. It uses the encoding of the fonts used. > That's correct; however, the PDF specs contain guidelines for naming glyphs in fonts in such a way that the encoding can be deciphered. This is needed, for example, in applications such as PDF forms where user input is expected. When those PDFs are generated from rich text, the fonts used may be built with TrueType (without any glyph names in them, only mappings of sequences of code points), OpenType, or PostScript. When OpenType fonts contain PostScript glyphs, their names may be completely arbitrary; it does not even matter whether the font used was mapped to Unicode or used a legacy or proprietary encoding. If you see a "Ɵ" when copy-pasting from the PDF, it's because the font used to produce it did not follow these guidelines (or did not specify any glyph names, in which case a sort of OCR algorithm attempts to decipher the glyph: the "ti" ligature is visually extremely close to "Ɵ", and an OCR has a lot of difficulty distinguishing them, unless it also uses some linguistic dictionary lookups and some hints from the script of the surrounding characters to improve the guess).
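The glyph-naming guidelines mentioned here are the Adobe Glyph List conventions, where a name like "t_i" denotes a ligature of "t" and "i", and "uniXXXX" denotes one or more 4-digit hex code points. A minimal decoder sketch (simplified; the real AGL specification has more rules, e.g. "uXXXXX" names, the predefined name list, and suffixes after a period):

```python
def glyph_name_to_text(name: str) -> str:
    """Decode a glyph name to text, roughly following AGL conventions."""
    # "uni0041" or "uni00740069": one or more 4-digit hex code points
    if name.startswith("uni") and len(name) > 3 and (len(name) - 3) % 4 == 0:
        try:
            return "".join(chr(int(name[i:i + 4], 16))
                           for i in range(3, len(name), 4))
        except ValueError:
            pass  # not hex digits: fall through
    # "t_i", "f_f_i": underscore-separated ligature components
    parts = name.split("_")
    if parts and all(len(p) == 1 and p.isalpha() for p in parts):
        return "".join(parts)
    return ""  # unknown name: a consumer must fall back to guessing

print(glyph_name_to_text("t_i"))      # -> "ti"
print(glyph_name_to_text("uni0041"))  # -> "A"
```

A font whose "ti" ligature glyph is named "t_i" (or "uni00740069") can thus be decoded correctly; an arbitrary name yields nothing, which is where the Ɵ-style guessing begins.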
Note that PDFs (or DjVu files) are not required to contain only text; they could just embed a scanned and compressed bitmap image. (If you want to see how an OCR can be wrong, look at how it fails with lots of errors, for example in the decoding projects for Wikibooks working with scanned bitmaps of old books: OCR is just a helper, but there's still a lot of work to correct what has been guessed and re-encode the correct text. Even though humans are smarter than OCR, this is a lot of work to perform manually: encoding the text of a single scanned old book still takes one or two months for an experienced editor, and there are still many errors left for someone else to review later.)

Most PDFs were not created with the idea of later decoding their rendered text. In fact they were intended to be read or printed "as is", including their styles, colors, and font decorations everywhere, or text over photos. They were even created to be non-modifiable and then used for archival. Some PDF tools will also strip additional metadata, such as the original fonts used, from the PDF; instead these PDFs will locally embed pseudo-fonts containing sets of glyphs from various fonts (in mixed styles), in random order, sorted by frequency of use in the document, or by order of occurrence in the original text. These embedded fonts are generated on the fly to contain only the glyphs necessary for the document. When those embedded fonts are generated, a compression step drops lots of things from the original font, including metadata such as the original PostScript glyph names.
Re: Joined "ti" coded as "Ɵ" in PDF
Yeah, I've stumbled upon this a lot in academic Japanese/Chinese texts. I try to copy some Chinese character, only to find out that it's really a string of random ASCII characters. Is there only one of those crap PDF pseudo-encodings? If so, I'll use a converter next time... 2016-03-17 14:57 GMT-03:00 "Jörg Knappen": > I inspected the pdf file, and its font encoding is termed "Identity-H". I > couldn't reveal much about this encoding, but it seems to be a private > encoding of Adobe used especially for Asian fonts. > > --Jörg Knappen > > Gesendet: Donnerstag, 17. März 2016 um 17:43 Uhr > Von: "Don Osborn" > An: unicode@unicode.org > Betreff: Joined "ti" coded as "Ɵ" in PDF > Odd result when copy/pasting text from a PDF: For some reason "ti" in > the (English) text of the document at > http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf > is coded as "Ɵ". Looking more closely at the original text, it does > appear that the glyph is a "ti" ligature (which afaik is not coded as > such in Unicode). > > Out of curiosity, did a web search on "internaƟonal" and got over 11k > hits, apparently all PDFs. > > Anyone have any idea what's going on? Am assuming this is not a > deliberate choice by diverse people creating PDFs and wanting "ti" > ligatures for stylistic reasons. Note the document linked above is > current, so this is not (just) an issue with older documents. > > Don Osborn
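Pending a proper converter, the specific mis-mappings discussed in this thread can be patched after the fact. A crude sketch (the table is hypothetical and only covers substitutions actually observed in the thread: Ɵ for "ti", and ī for "ff" as in "Eīects" mentioned later; blindly applying it to text that legitimately contains these letters would corrupt it):

```python
# Observed PDF copy-paste mis-mappings from this thread (hypothetical table;
# extend it as more are found). WARNING: a blind replacement corrupts text
# that legitimately uses these letters.
LIGATURE_FIXES = {
    "\u019F": "ti",  # Ɵ seen in place of the "ti" ligature
    "\u012B": "ff",  # ī seen in place of the "ff" ligature ("Eīects")
}

def repair_pdf_ligatures(text: str) -> str:
    for bad, good in LIGATURE_FIXES.items():
        text = text.replace(bad, good)
    return text

print(repair_pdf_ligatures("interna\u019Fonal"))  # -> "international"
```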
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
2016-03-18 19:11 GMT+01:00 Garth Wallace: > > The issues with line breaking (if you can use these combining marks around all > > characters, including spaces) can be solved using non-breaking characters. > > Line breaking isn't really a problem that I can see with the Quivira > model. If they're given the usual line breaking properties for > symbols, the Unicode line breaking algorithm would prevent a break > between halves. East Asian vertical text is another story. In a font > that just uses kerning to join halves (as Quivira does) you'd end up > with the left half on top of the right in vertical text. I'm not sure > how ligatures are handled in vertical text. > East Asian vertical presentation does not just stack the elements on top of each other; very frequently they rotate them (including Latin/Greek/Cyrillic letters), so this is not really a new complication. The numbers, however, are used for noting or commenting a strategy, or the placement order during a game. For game notation purposes, rotation plays a significant role (notably if those two-part symbols are joined in a circle or disc): it can make the difference between several distinct sets of stones, or it could be used in a 4-player go variant (where black vs. white is not sufficient to distinguish the players). In reality the stones would have 4 colours (stones are not really numbered; they are all the same for the same player, or there's some specially marked type of stone for each player in addition to their normal set), or sets would have some symbol or dot on top of them. There are also go variants using stones that take a territory and block the position but that cannot be captured (both players can use them, but the territory taken is not counted for either player). These stones can also be placed randomly over the board at the start of the game to complicate it, or there's a limited set of blocking stones for each player, who can choose when to play them instead of standard stones.
Those blocking stones are visually distinct, but identical for the two players that have them at the start of the game. Although the classic rules of go are extremely simple, this game has a lot of variants. In fact, many players who don't know the exact classic rules invent their own variants.
Re: Swapcase for Titlecase characters
On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote: > I'm working on extending the case conversion methods for the programming > language Ruby from the current ASCII only to cover all of Unicode. > > Ruby comes with four methods for case conversion. Three of them, upcase, > downcase, and capitalize, are quite clear. But we have hit a question > for the fourth method, swapcase. > > What swapcase does is swap upper and lower case, so that e.g. > > 'Unicode Standard'.swapcase => 'uNICODE sTANDARD' > > I'm not sure myself where this method is actually used, but it also > exists in Python (and maybe Ruby got it from there). > > > Now the question I have is: What to do for titlecase characters? Several > possibilities have already been floated: > > a) Leave as is, because these are neither upper nor lower case. > > b) Convert to upper (or lower), which may simplify implementation. > > c) Decompose the character into upper and lower case components, and > apply swapcase to these. > > > For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or > 'džinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would > become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c). > > It looks like Python 3 (3.4.3 in my case) is doing a). My guess is that > from a user expectation point of view, c) is best, so I'm tending to go > for c). There is no existing data from the Unicode Standard for this, > but it seems pretty straightforward. > > But before I just implement something, I'd appreciate additional input, > in particular from users closer to the affected language communities. As far as I can tell from my limited experience, the swapcase method is used only to convert “inverted titlecase” to titlecase. I call “inverted titlecase” the state of text produced by keyboard input while the caps lock toggle was accidentally on, where words end up “inversely capitalized” wherever the user pressed the shift modifier.
Therefore such examples would be most useful. Having said that, I know that this never occurs for the many English-speaking users who have remapped that key to perform another action such as backspace, compose, or kana lock. Living myself in a country where the caps lock toggle is indispensable, I may be considered part of the targeted user communities, though unfortunately I donʼt speak Croatian or Greek. Looking at your examples, I would add a case where swapcase would typically be applied: ‘ᾠΔΉ’ (cited [erroneously] as a result of option b) that is to be converted to ‘ᾨδή’, and ‘džINSI’, that is to become ‘Džinsi’. As for decomposing digraphs and ypogegrammeni to apply swapcase: that would probably do no good, as itʼs unnecessary and users wonʼt expect it. I hope that helps. Kind regards, Marcel
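Option c) can be sketched in Python (the thread is about Ruby, but the logic carries over). This is one assumption about what c) means operationally: titlecase characters (general category Lt) are compatibility-decomposed into their components before swapping, and the result is recomposed with NFC:

```python
import unicodedata

def swapcase_c(s: str) -> str:
    """swapcase with option c): decompose titlecase chars, then swap."""
    out = []
    for ch in s:
        if unicodedata.category(ch) == "Lt":
            # e.g. U+01C5 'Dž' -> "D" + "ž" (via NFKD), swapped -> "d" + "Ž"
            out.append(unicodedata.normalize("NFKD", ch).swapcase())
        else:
            out.append(ch.swapcase())
    return unicodedata.normalize("NFC", "".join(out))

print(swapcase_c("Unicode Standard"))  # -> "uNICODE sTANDARD"
print(swapcase_c("\u01C5insi"))        # 'Džinsi' -> "dŽINSI"
```

For the Greek examples the exact output depends on how the decomposed ypogegrammeni is case-mapped, which is the subtle part Marcel's reply alludes to.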
Joined "ti" coded as "Ɵ" in PDF
Odd result when copy/pasting text from a PDF: For some reason "ti" in the (English) text of the document at http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf is coded as "Ɵ". Looking more closely at the original text, it does appear that the glyph is a "ti" ligature (which afaik is not coded as such in Unicode). Out of curiosity, did a web search on "internaƟonal" and got over 11k hits, apparently all PDFs. Anyone have any idea what's going on? Am assuming this is not a deliberate choice by diverse people creating PDFs and wanting "ti" ligatures for stylistic reasons. Note the document linked above is current, so this is not (just) an issue with older documents. Don Osborn
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
On Thu, Mar 17, 2016 at 11:28 PM, J Decker wrote: > On Thu, Mar 17, 2016 at 9:18 PM, Garth Wallace wrote: >> There's another strategy for dealing with enclosed numbers, which is >> taken by the font Quivira in its PUA: encoding separate >> left-half-circle-enclosed and right-half-circle-enclosed digits. This >> would require 20 characters to cover the double digit range 00–99. >> Enclosed three digit numbers would require an additional 30 for left, >> center, and right thirds, though it may be possible to reuse the left >> and right half circle enclosed digits and assume that fonts will >> provide left half-center third-right half ligatures (Quivira provides >> "middle parts" though the result is a stadium instead of a true >> circle). It should be possible to do the same for enclosed ideographic >> numbers, I think. >> >> The problems I can see with this are confusability with the already >> encoded atomic enclosed numbers, and breaking in vertical text. >> > > I suppose that's why things like this happen in applications > > Joined "ti" coded as "Ɵ" in PDF > > http://www.unicode.org/mail-arch/unicode-ml/y2016-m03/0084.html > > you get an encoding of a series of code points that results in an array > of font glyph indices to render What? I don't see what an apparent ligature matching or OCR glitch in PDFs has to do with this.
Proposal for *U+2427 NARROW SHOULDERED OPEN BOX (was: Re: Proposal for *U+23FF SHOULDERED NARROW OPEN BOX?)
On Mon, 14 Mar 2016 09:19:35 -0700, Ken Whistler wrote: > U+23FF is already assigned to OBSERVER EYE SYMBOL, which is > already under ballot for 10646 (and approved by the UTC). > > http://www.unicode.org/alloc/Pipeline.html > > Please always first check that page before suggesting code points > for prospective new characters. > > --Ken > > On 3/12/2016 5:42 PM, Marcel Schneider wrote: > > Now in the block of U+237D SHOULDERED OPEN BOX there is _one_ scalar value > > left. Would it then be a good idea to propose *U+23FF SHOULDERED NARROW > > OPEN BOX for v10.0.0? > > Thank you. I remember OBSERVER EYE but didnʼt notice its code point, and forgot to do a search for ‘23[F[F]]’ on the Pipeline page. Sorry. Now I see that *U+2427 would be even better, as it is both in the block of U+2423 OPEN BOX and in the originally intended block; except that I have now dropped the other symbols and stay just with the NNBSP symbol, proposed for the next free contiguous scalar value. I really hope that such a new or, more accurately, third proposal would be accepted, as the NARROW NO-BREAK SPACE is so important that it must have its symbol encoded at some point, similarly to SPACE and NO-BREAK SPACE. About the proposed name: I first changed it to the glyph-descriptive form preferred in Unicode, rather than SYMBOL FOR NARROW NO-BREAK SPACE; and then I made it more analogous to the name of the symbolized character, by inverting “SHOULDERED” and “NARROW”. The original proposer cannot simply resume on that “narrow” basis, being committed to consistency with ISO/IEC 9995-7; so might an individual like me be the right one to send the proposal? Generally, however, it would be better done by an NB, all the more as this belongs to the international keyboard standard. Other countries that have a multilingual standard layout, and/or a national layout including U+202F, might also be interested.
Another scenario would be that the French NB re-proposes a reduced set of additional symbols, which IMHO should comprise at least the NARROW SHOULDERED OPEN BOX, but ideally only once it has completed the revision of most parts of ISO/IEC 9995, including part 7. Best regards, Marcel
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
That's a smart idea... Note that you could encode the middle digits so that their enclosure at top and bottom is by default only horizontal (no arcs of circle) when shown in isolation, with the left and right parts simply connecting horizontally by default to the top and bottom of the middle digits, allowing an arbitrary number of characters. To create a real circle, you could use a joiner control to hint to the renderer that it can create a ligature (possibly reducing the size of the digits, or changing the dimensions and shape of the connecting segments so that they draw a circle instead of a "cartouche" rounded at start and end). You could even make the enclosure a combining character applied around existing digits (even if those digits are not symbols by themselves, the combining character has this property; an idea similar to the combining arrows above or below used in mathematical notation), so that the content of the "circle" or "cartouche" can be arbitrary. The enclosure could also be something other than a circle (or arcs of circle): it could be a rectangle, hintable with joiners (as with circles) to create an enclosing square, or a rounded rectangle (hintable to create a rounded square). The enclosure shapes could be white or black, or could be drawn with double strokes. This is in fact similar to the combining low line or overline, which join by default. However, using a joiner between them instructs the renderer not really to join the top/bottom lines (which is already the expected behavior for these lines) but to create a ligature between the base characters in the middle. Then, to create a double enclosure, just "stack" several combining characters (in order from inside to outside: the combining characters for enclosures should have the same high value for their combining class so that their relative order is kept, or could have combining class 0).
The issues with line breaking (if you can use these combining marks around all characters, including spaces) can be solved using non-breaking characters. Note that this addition would create a disunification with the existing enclosed characters, which are already ligatured into single symbols (they won't be canonically equivalent using only the existing decomposition properties), but this can be solved by adding another property ("ligature decomposition") and mapping the existing enclosed characters to their "ligature decomposition" using normal base characters, the new combining characters for enclosure, and the joining control between them. Those mappings could live in a new properties file (which could then be useful for collation, so that the "enclosed 79" symbol would collate like "79"). Advantage: with these, you can now enclose various numbers (not just natural integers), abbreviations (e.g. chemical symbols like "Au" for gold), astrological symbols, or arbitrary words (using them to enclose full sentences would not be very practical, but using them to enclose a person's name, such as that of an Egyptian king like "Ramses", is possible, even outside the context of Egyptian hieroglyphs)... It could be used to enclose a temperature such as "10°C", or a section heading number such as "1.1". And this is much less limited than the (very quirky) use of CSS or styles (in rich text or HTML) to add surrounding "borders", as the shapes are less restricted (in CSS you can only create rectangular or rounded borders).
Some new shapes are possible, such as diagonal left and right sides, or mixing a rounded left side with a square right side (though in this case it would be hard to use joiners and expect a ligature to be created for the enclosing shape; for example, expecting a triangular enclosure created by ligating two diagonal sides with horizontal top/bottom lines for the characters in the middle would absolutely require resizing all the characters in the middle to preserve a consistent line height; but this is possible for pairs of base characters inside the enclosure). Note: the enclosing-ligature "joiner" control is not the same as the one for joining base characters, as the intent is to join the enclosing shape fragments (possibly by reducing the size of, and repositioning, all the characters in the middle); the characters in the middle are not ligatured themselves (if you enclose "AE" in such shapes created with combining characters, it should not produce an "Æ" ligature in the final enclosing shape). 2016-03-18 5:18 GMT+01:00 Garth Wallace: > There's another strategy for dealing with enclosed numbers, which is > taken by the font Quivira in its PUA: encoding separate > left-half-circle-enclosed and right-half-circle-enclosed digits. This > would require 20 characters to cover the double digit range 00–99. > Enclosed three digit numbers would require an additional 30 for left, > center, and right thirds, though it may be possible to reuse the left > and right half circle enclosed digits and assume that fonts will > provide left half-center third-right half ligatures
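For a single base character, Unicode already provides combining enclosing marks in the Combining Diacritical Marks for Symbols block (U+20DD COMBINING ENCLOSING CIRCLE, U+20DE COMBINING ENCLOSING SQUARE, and others); what is sketched above is essentially an extension of that mechanism to multi-character contents. The existing single-character case:

```python
import unicodedata

# Existing combining enclosing marks (apply to one base character only);
# rendering quality depends heavily on the font.
ENCLOSING_CIRCLE = "\u20DD"  # COMBINING ENCLOSING CIRCLE
ENCLOSING_SQUARE = "\u20DE"  # COMBINING ENCLOSING SQUARE

def enclose(base: str, mark: str = ENCLOSING_CIRCLE) -> str:
    return base + mark

print(enclose("7"))  # a circled 7, in fonts that support the mark

# Enclosing marks have combining class 0, so normalization does not
# reorder them relative to other marks (stacking order is preserved):
assert unicodedata.combining(ENCLOSING_CIRCLE) == 0
```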
Re: Variations and Unifications ?
One problem caused by disunification is the complexification of algorithms handling text. I forgot an important case where disunification also occurred: combining sequences are the "normal" encoding, but legacy charsets encoded the precomposed characters separately, and Unicode had to map them for round-trip compatibility purposes. This had a consequence: the creation of additional properties (i.e. for "canonical equivalences") in order to reconcile the two sets of encodings and allow some form of equivalence. In fact this is general: each time we disunify a character, we have to add new properties, and possibly update the algorithms to take these properties into account and find some form of equivalence. So disunification solves one problem but creates others. We have to weigh the benefits and costs of using the disunified characters against those of using the "normal" characters (possibly in sequences). But given the number of cases where we have to support sequences anyway (even if it's only combining sequences for canonical equivalences), we should really disfavor disunifying characters: if it's possible with sequences, don't disunify. A famous example (based on a legacy decision which was bad in my opinion, as the cost was not considered) was the disunification of Latin/Greek letters for mathematical purposes, only to force a specific style. The alternative representation using sequences (using variation selectors, for example, since the addition of specific modifiers for "styles" like "bold", "italic" or "monospace" was rejected with good reason) was never really analyzed in terms of benefits and costs, using the algorithms we already have (and that could have been updated). But mathematical symbols are (normally...) not used at all in the same contexts as plain alphabetic letters (even if there's absolutely no guarantee that they will always be distinguishable from them when they occur in some linguistic text rendered in the same style...).
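The round-trip mapping and canonical equivalence mentioned above can be observed directly with Python's standard unicodedata module: the precomposed legacy character and the combining sequence are distinct code point sequences that normalization makes interchangeable, and the mathematical-alphanumerics disunification only folds away under compatibility normalization:

```python
import unicodedata

precomposed = "\u00E9"  # é LATIN SMALL LETTER E WITH ACUTE (from legacy charsets)
sequence = "e\u0301"    # e + COMBINING ACUTE ACCENT (the "normal" encoding)

assert precomposed != sequence                                # distinct code points
assert unicodedata.normalize("NFC", sequence) == precomposed  # canonical equivalence
assert unicodedata.normalize("NFD", precomposed) == sequence

# The mathematical-style disunification: U+1D400 MATHEMATICAL BOLD CAPITAL A
# is only compatibility-equivalent to plain "A" (NFKC folds it away, NFC does not).
assert unicodedata.normalize("NFKC", "\U0001D400") == "A"
assert unicodedata.normalize("NFC", "\U0001D400") == "\U0001D400"
```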
The naive thinking that disunification will make things simpler is completely wrong (given that an application that ignored all character properties and used only isolated characters would break legitimate rules in many cases, even for rendering purposes). It is in fact simpler to keep the possible sequences that are already encoded (or that could be extended to cover more cases: e.g. add new variation sequences, introduce some new modifiers, not just new combining characters, and so on). We were strongly told: Unicode encodes characters, not glyphs. This should be remembered (and the argument of the costs caused by disunification of distinct glyphs is also a good one against it). 2016-03-17 8:20 GMT+01:00 Asmus Freytag (t): > On 3/16/2016 11:11 PM, Philippe Verdy wrote: > > "Disunification may be an answer?" We should avoid it as well. > > Disunification is only acceptable when > - there's a complete disunification of concepts > > > I think answering this question depends on the understanding of "concept", > and on understanding what it is that Unicode encodes. > > When it comes to *symbols*, which is where the discussion originated, > it's not immediately obvious what Unicode encodes. For example, I posit > that Unicode does not encode the "concept" for specific mathematical > operators, but the individual "symbols" that are used for them. > > For example PRIME and DOUBLE PRIME can be used for minutes and seconds > (both of time and arc) as well as for other purposes. Unicode correctly > does not encode "MINUTE OF ARC", but the symbol used for that -- leaving it > up to the notational convention to relate the concept and the symbol. > > Thus we have a case where multiple concepts match a single symbol. For the > converse, we take the well-known case of COMMA and FULL STOP which can both > be used to separate a decimal fraction.
> > Only in those cases where a single concept is associated so exclusively > with a given symbol, do we find the situation that it makes sense to treat > variations in shape of that symbol as the same symbol, but with different > glyphs. > > For some astrological symbols that is the case, but for others it is not. > Therefore, the encoding model for astrological text cannot be uniform. > Where symbols have exclusive association with a concept, the natural > encoding is to encode symbols with an understood set of variant glyphs. > Where concepts are denoted with symbols that are also used otherwise, then > the association of concept to symbol must become a matter of notational > convention and cannot form the basis of encoding: the code elements have to > be on a lower level, and by necessity represent specific symbol shapes. > > A./ >
Re: Joined "ti" coded as "Ɵ" in PDF
Hi Don, Latin is fine if you keep to simple, well-made fonts and avoid using the more sophisticated typographic features available in some fonts. Dumb it down typographically and it works fine. PDF, despite all the current rhetoric coming from PDF software developers, is a preprint format, not an archival format. The PDF format is less than ideal, but it is widely used, often in ways the format was never really created for. There are alternatives that preserve the text, but they have never really taken off (compared to PDF) for various reasons. Andrew On Sunday, 20 March 2016, Don Osborn wrote: > Thanks Andrew. Looking at the issue of the ToUnicode mapping you mention, why, in the 1-many mapping of ligatures (for fonts that have them), do the "many" not simply consist of the characters ligated? Maybe that's too simple (my understanding of the process is clearly inadequate). > > The "strings of random ASCII characters" (per Leonardo) used in the Identity-H system for hanzi raise other questions: (1) How are the ASCII characters interpreted as a 1-many sequence representing a hanzi, rather than just a series of 1-1 mappings of themselves? (2) Why not just use the Unicode code point? > > The details may or may not be relevant to the list topic, but as a user of documents in PDF format, I fail to see the benefit of such obscure mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've just encountered with these mappings, I'm wondering how those concerned feel about how the font & mapping results turned out as they did. It is certain that the creators of the documents didn't intend results that would not be searchable as normal text, but it seems possible that a particular font choice with these ligatures unwittingly produced these results. If so, the software should at the very least show a caveat about such mappings when generating PDFs.
> > Maybe it's unrealistic to expect a simple implementation of Unicode in PDFs (a topic we've discussed before but which I admit to not fully grasping). I recall I once had some wild results copy/pasting from an N'Ko PDF, and ended up having to obtain the .docx original to get text for insertion in a blog posting. But while it's not surprising to encounter issues with complex non-Latin scripts in PDFs, I'd come to expect predictability when dealing with most Latin text. > > Don > > > > On 3/17/2016 7:34 PM, Andrew Cunningham wrote: > > There are a few things going on. > > In the first instance, it may be the font itself that is the source of the problem. > > My understanding is that PDF files contain a sequence of glyphs. A PDF file will contain a ToUnicode mapping between glyphs and code points. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences. > > I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated. > > Although it is worth noting that in OpenType fonts not all glyphs will have mappings in the cmap table. > > The remedy is to extensively tag the PDF and add ActualText attributes to the tags. > > But the PDF specs leave it up to the developer to decide what happens if there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF. > > At least that's my current understanding. > > Andrew > > On 18 Mar 2016 7:47 am, "Don Osborn" wrote: >> >> Thanks all for the feedback. >> >> Doug, it may well be my clipboard (running Windows 7 on this particular laptop). I get the same results pasting into Word and EmEditor.
>> >> So, when I did a web search on "internaƟonal", as previously mentioned, and came up with a lot of results (mostly PDFs), were those also a consequence of many not-fully-Unicode-compliant conversions by others? >> >> A web search on what you came up with - "InternaƟonal" - yielded many more (82k+) results, again mostly PDFs, with terms like "interna onal" (such as what Steve noted) and "interna > >> Searching within the PDF document already mentioned, "international" comes up with nothing (which is a major fail as far as usability goes). Searching the PDF in a Firefox browser window, only "internaƟonal" finds the occurrences of what displays as "international." Likewise, after downloading the document and searching it in Acrobat, only a search for "internaƟonal" will find what displays as "international." >> >> A separate web search on "Eīects" came up with 300+ results, including some Google Books texts which display "effects" (as far as I checked). So this is not limited to Adobe? >> >> Jörg, with regard to "Identity-H", a quick search gives the impression that this
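As background for the ToUnicode discussion quoted above: a 1-many ligature entry in a PDF's embedded ToUnicode CMap looks roughly like this (a hand-written sketch; the glyph ID <01C7> is invented, and real CMaps carry additional required boilerplate around this fragment):

```
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<01C7> <00740069>   % glyph 01C7 maps to U+0074 U+0069, i.e. "ti"
endbfchar
```

When an entry like this is missing or wrong, viewers fall back on the font's cmap or glyph names, which is where substitutions like Ɵ can creep into copied text.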
Re: Joined "ti" coded as "Ɵ" in PDF
On 2016-03-19, Don Osborn wrote: > The details may or may not be relevant to the list topic, but as a user > of documents in PDF format, I fail to see the benefit of such obscure > mappings. And as a creator of PDFs ("save as") looking at others' PDFs Aren't you just being bitten by history? PDF derives from PostScript, which is not a language for representing plain text with typesetting information, but a language for type-(and-graphic-)setting tout court. There's a long history of fonts using arbitrary code points; the idea that the underlying strings giving rise to the displayed graphics should also be a good plain-text representation of the information is relatively novel. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Meteorological symbols for cloud conditions (on maps or elsewhere)
Some other resources (outside Wikipedia):
- Kean University: http://www.kean.edu/~fosborne/resources/ex10g.htm
- Documented by the NOAA in the US (but I can't find the complete reference)
- These symbols seem to be backed by an "international standard", but I don't know which one exactly.
- Documented with other symbols (rain, ice, snow, thunder...) in Canada for flight planning: https://flightplanning.navcanada.ca/cgi-bin/CreePage.pl?Langue=anglais=NS_Inconnu=wxsymbols=wxsymb
- http://www.visualdictionaryonline.com/earth/meteorology/international-weather-symbols/clouds.php
2016-03-18 17:59 GMT+01:00 Philippe Verdy: > See > https://fr.wikipedia.org/wiki/Carte_m%C3%A9t%C3%A9orologique#/media/File:Station_model_fr.svg > > I see these symbols for noting cloud types (here cirrus and altocumulus, > one drawn in diagonal for middle altitude, another drawn horizontally for > high altitudes). > > Note that the symbols may vary: see Altocumulus for example as found in > French Wikipedia (not sure if it's accurate) which is different from the > symbol found in the sampled notation on a map > > https://fr.wikipedia.org/wiki/Altocumulus > > Also other symbols on the similar page in English Wikipedia are used to > describe some cloud characteristics: > > https://en.wikipedia.org/wiki/Altocumulus_cloud > > Is there a well defined collection of these symbols, and are they in the > encoding pipeline? > >
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
On 3/18/2016 11:48 AM, Philippe Verdy wrote: East Asian vertical presentation does not just stack the elements on top of each other, very frequently they rotate them (including Latin/Greek/Cyrillic letters) So this is not really a new complication. It is, because now all these combinations have to be treated as units (because they would be expected to NOT be rotated). They would be akin to the square kana abbreviations. Suddenly, you need dedicated support from rendering engines, where for horizontal texts you could design your fonts to get the intended outcome with a "dumb" engine. A./
Re: Variations and Unifications ?
On 3/16/2016 11:11 PM, Philippe Verdy wrote: "Disunification may be an answer?" We should avoid it as well. Disunification is only acceptable when - there's a complete disunification of concepts I think answering this question depends on the understanding of "concept", and on understanding what it is that Unicode encodes. When it comes to symbols, which is where the discussion originated, it's not immediately obvious what Unicode encodes. For example, I posit that Unicode does not encode the "concept" for specific mathematical operators, but the individual "symbols" that are used for them. For example PRIME and DOUBLE PRIME can be used for minutes and seconds (both of time and arc) as well as for other purposes. Unicode correctly does not encode "MINUTE OF ARC", but the symbol used for that -- leaving it up to the notational convention to relate the concept and the symbol. Thus we have a case where multiple concepts match a single symbol. For the converse, we take the well-known case of COMMA and FULL STOP which can both be used to separate a decimal fraction. Only in those cases where a single concept is associated so exclusively with a given symbol, do we find the situation that it makes sense to treat variations in shape of that symbol as the same symbol, but with different glyphs. For some astrological symbols that is the case, but for others it is not. Therefore, the encoding model for astrological text cannot be uniform. Where symbols have exclusive association with a concept, the natural encoding is to encode symbols with an understood set of variant glyphs. Where concepts are denoted with symbols that are also used otherwise, then the association of concept to symbol must become a matter of notational convention and cannot form the basis of encoding: the code elements have to be on a lower level, and by necessity represent specific symbol shapes. A./
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
On Fri, Mar 18, 2016 at 11:48 AM, Philippe Verdy wrote: > 2016-03-18 19:11 GMT+01:00 Garth Wallace : >> >> > The issues with line breaking (if you can use these combining characters >> > around all characters, including spaces) can be solved using unbreakable >> > characters. >> >> Line breaking isn't really a problem that I can see with the Quivira >> model. If they're given the usual line breaking properties for >> symbols, the Unicode line breaking algorithm would prevent a break >> between halves. East Asian vertical text is another story. In a font >> that just uses kerning to join halves (as Quivira does) you'd end up >> with the left half on top of the right in vertical text. I'm not sure >> how ligatures are handled in vertical text. > > > East Asian vertical presentation does not just stack the elements on top of > each other; very frequently they rotate them (including Latin/Greek/Cyrillic > letters), so this is not really a new complication. True. I suppose if the half-enclosed digits were defined as halfwidth, it would work. It makes intuitive sense too, if a complete numbered circle is assumed to fill an ideographic cell. I'm not sure if rotation of the numbers would be desired, though. > The numbers however are used for noting or commenting a strategy, or the > placement order during a game. > > However for game notation purposes, rotation plays a significant role > (notably if those two-part symbols are joined in a circle or disc): it can > make the difference between several distinct sets of stones, or it could be > used in a 4-player go variant (where black vs. white is not sufficient to > distinguish the players). In reality the stones would have 4 colours (stones > are not really numbered; > they are all the same for the same player, or there's some special marked > type of stone for each player in addition to their normal set), or sets would > have some symbol or dot on top of them.
Rotation is definitely not salient in standard go kifu like it is in fairy chess notation. Go variants for more than 2 players are uncommon enough that I don't think any sort of standardized notation exists. > There are also go variants using stones that take a territory and block the > position but that cannot be taken (both players can use them, but the > territory taken is not counted for any player). > These stones can also be placed randomly at the start of the game over the > board to complicate the game, or there's a limited set of blocking stones > for each player, who can choose when to play them instead of standard stones. > Those blocking stones are visually distinct, but identical for the two > players that have them at the start of the game. Do you have any links? I'm interested in game design. > Although the classic rules of go are extremely simple, this game has a lot > of variants. In fact many players that don't know the exact classic rules > are inventing their own variants. These are generally one-off inventions (or commercial products), so I don't think there's much need to consider their hypothetical variations on notation.
Re: Joined "ti" coded as "Ɵ" in PDF
Thanks Andrew,

Looking at the issue of ToUnicode mapping you mention, why, in the 1-many mapping of ligatures (for fonts that have them), do the "many" not simply consist of the characters ligated? Maybe that's too simple (my understanding of the process is clearly inadequate). The "string of random ASCII characters" (per Leonardo) used in the Identity H system for hanzi raises other questions: (1) How are the ASCII characters interpreted as a 1-many sequence representing a hanzi rather than just a series of 1-1 mappings of themselves? (2) Why not just use the Unicode code point?

The details may or may not be relevant to the list topic, but as a user of documents in PDF format, I fail to see the benefit of such obscure mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've just encountered with these mappings, I'm wondering whether those who created them were concerned about how the font & mapping results turned out as they did. It is certain that the creators of the documents didn't intend results that would not be searchable by normal text, but it seems possible that a particular font choice with these ligatures unwittingly produced these results. If the latter, the software at the very least should show a caveat about such mappings when generating PDFs. Maybe it's unrealistic to expect a simple implementation of Unicode in PDFs (a topic we've discussed before but which I admit not fully grasping). I recall I once had some wild results copy/pasting from an N'Ko PDF, and ended up having to obtain the .docx original to get text for insertion in a blog posting. But while it's not surprising to encounter issues with complex non-Latin scripts from PDFs, I'd gotten to expect predictability when dealing with most Latin text.

Don

On 3/17/2016 7:34 PM, Andrew Cunningham wrote: There are a few things going on. In the first instance, it may be the font itself that is the source of the problem. My understanding is that PDF files contain a sequence of glyphs.
A PDF file will contain a ToUnicode mapping between glyphs and codepoints. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences. I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated. Although, it is worth noting that in OpenType fonts not all glyphs will have mappings in the cmap table. The remedy is to extensively tag the PDF and add ActualText attributes to the tags. But the PDF specs leave it up to the developer to decide what happens when there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF. At least that's my current understanding. Andrew On 18 Mar 2016 7:47 am, "Don Osborn" wrote: Thanks all for the feedback. Doug, It may well be my clipboard (running Windows 7 on this particular laptop). Get same results pasting into Word and EmEditor. So, when I did a web search on "internaƟonal," as previously mentioned, and came up with a lot of results (mostly PDFs), were those also a consequence of many not fully Unicode-compliant conversions by others? A web search on what you came up with - "InternaƟonal" - yielded many more (82k+) results, again mostly PDFs, with terms like "interna onal" (such as what Steve noted) and "interna
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
Sequences were introduced long before. I know that they add their own complications everywhere, but they are already part of existing algorithms. If sequences (not just combining sequences) were not there, there would be many more characters encoded in the database and everything would be encoded like sinograms (mostly one character per composite glyph). 2016-03-18 19:58 GMT+01:00 Asmus Freytag (t): > On 3/18/2016 11:11 AM, Garth Wallace wrote: > > The enclosure could also be something else than a circle (or arcs of > > circle): it could be a rectangle, hintable with joiners (like with circles) > > to create an enclosing square, or a rounded rectangle (hintable to create a > > rounded square). > > I thought combining characters would not be suitable for things like > white text on black. > > > Philippe seems to have an appetite for combining sequences that's not > shared by the UTC. > > A./ >
Re: Joined "ti" coded as "Ɵ" in PDF
Thanks all for the feedback.

Doug, It may well be my clipboard (running Windows 7 on this particular laptop). Get same results pasting into Word and EmEditor. So, when I did a web search on "internaƟonal," as previously mentioned, and came up with a lot of results (mostly PDFs), were those also a consequence of many not fully Unicode-compliant conversions by others? A web search on what you came up with - "InternaƟonal" - yielded many more (82k+) results, again mostly PDFs, with terms like "interna onal" (such as what Steve noted) and "interna

Doug Ewell wrote:

Don Osborn wrote: Odd result when copy/pasting text from a PDF: For some reason "ti" in the (English) text of the document at http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf is coded as "Ɵ". Looking more closely at the original text, it does appear that the glyph is a "ti" ligature (which afaik is not coded as such in Unicode).

When I copy and paste the PDF text in question into BabelPad, I get: InternaƟonal Order and the DistribuƟon of IdenƟty in 1950 (By invitaƟon only)

The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use character. Truncating this character to 16 bits, which is a Bad Thing™, yields U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either Don's clipboard or the editor he pasted it into is not fully Unicode-compliant. Don's point about using alternative characters to implement ligatures, thereby messing up web searches, remains valid.

-- Doug Ewell | http://ewellic.org | Thornton, CO
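[Doug's diagnosis above is easy to reproduce. A minimal sketch, using the Plane 16 PUA code point from his message: dropping the high bits of a supplementary-plane code point is exactly what turns the private-use ligature into Ɵ.]

```python
# Truncating a supplementary-plane code point to 16 bits, the "Bad Thing"
# described above: U+10019F (Plane 16 private use) collapses to U+019F.
ligature_pua = 0x10019F            # PUA code point used for the "ti" ligature
truncated = ligature_pua & 0xFFFF  # keep only the low 16 bits

print(hex(truncated))              # 0x19f
print(chr(truncated))              # Ɵ (LATIN CAPITAL LETTER O WITH MIDDLE TILDE)
```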
Re: Swapcase for Titlecase characters
On Sat Mar 19, 2016 12:54:51, Martin J. Dürst wrote: > On 2016/03/19 04:33, Marcel Schneider wrote: > > On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote: > > >> b) Convert to upper (or lower), which may simplify implementation. > > >> For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or > >> 'džinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would > >> become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c). > > > Looking at your examples, I would add a case that typically occurs for > > swapcase to be applied: > > > ‘ᾠΔΉ’ (cited [erroneously] as a result of option b) that is to be converted > > to ‘ᾨδή’, and ‘džINSI’, that is to become ‘Džinsi’. > > First, what do you mean with "erroneously"? The intent of that bracketed word was just to give account of the fact that when ‘ᾨδή’ is converted to lower case as assumed in option “b-lower”, it becomes ‘ᾠδή’, while ‘ᾠΔΉ’ is a typical candidate for swapcase, thus I could reutilize it “as is” to illustrate the fourth case. > > Second, did I get this right that your additional case (let's call it > d)) would cycle through the three options where available: > lower -> title -> upper -> lower. I’m afraid that swapcase as I saw it is not a roundtrip method, therefore I got some awkward moments today when I thought about how to implement it. 
As far as I could see, there are two pairs:

I: lowercase → titlecase (needed to correct the initials where the user pressed the shift modifier)
II: uppercase → lowercase (needed to correct the body of the words input while caps lock was on)

That typically matches what happens when caps lock is accidentally on and the user writes normally―on a keyboard that includes digraphs and uses the SGCaps feature for them, like this:

Modifier:      None   Shift
CapsLock off:  Lower  Title
CapsLock on:   Upper  Lower

Correcting keyboard input done with the wrong caps lock state is the only situation I can see where swapcase is needed and thus is likely to be used. This is why the swapcase method is implemented in word processors, as part of an optional autocorrect feature that neutralizes the effect of starting a sentence normally while caps lock is on: after completing the input of an uppercase word with an initial lowercase letter, the word is automatically swapcased and caps lock is turned off. However, now that I have tested it with the digraph of the examples (input through the composer of the keyboard layout), it doesnʼt work at all in one word processor, while in another one it works but uppercases the initial lowercase digraph instead of titlecasing it. [That may be considered an effect of “streamlined” implementations that drop the less frequent cases.]

I donʼt believe that it would be useful to make swapcase a roundtrip method, and anyway it would be weird because of the letters with three case forms. The case conversion cycle you draw above usually applies to words (and doesnʼt work correctly in either of the two tested word processors when an initial DZ digraph is present), while most letters have identical values for Titlecase_Mapping and Uppercase_Mapping, and usually there is no means to flag them with “Titlecase_State”. This might be one more reason why current implementations of swapcase donʼt match the expected behavior for digraphs.
> > > As about decomposing digraphs and ypogegrammeni to apply swapcase: That > > probably would be doing no good, > > as itʼs unnecessary and users wonʼt expect it. > > Why do you say "users won't expect it"? For those users not aware of the > encoding internals, I'd indeed guess that's what users would expect, at > least in the Croatian case. That depends on what the expected result is. If the swapcase method is to correct inverted casing, users wouldnʼt like to see the digraphs decomposed, especially since, in the languages concerned, the DZ digraph is a part of the alphabet between ‘D’ and ‘Đ’, so users are really aware of it. > For Greek, it may be different; it depends > on the extent to which the iota is seen as a letter vs. seen as a mark. Here again the user inputs a precomposed letter, with iota subscript, because he just wants a capitalized word, not an uppercase one. And here again the autocorrect doesnʼt work in one word processor, while in the other one it applies uppercasing with uppercase iota adscript―while the rest of the word is lowercase―instead of capitalization with lowercase iota adscript or iota subscript (which one depends on conventions and preferences). Letʼs take that as a proof of how hard it is to implement swapcase with digraph support. I canʼt conclude this reply better than with Asmus Freytagʼs words on Fri, 1st Jan 2016 12:09:13 -0800: [1] > Unicode aims to be expressive enough to model all plain text. That means, it > inherits the non-reducible complexity of text. Even the insight that the > complexity is non-reducible would be a big step forward. Regards, Marcel [1] Re:
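[The Greek side of the discussion above can be seen directly in the standard case mappings: the precomposed omega with ypogegrammeni has a genuine titlecase form distinct from its full uppercase mapping, which spells the iota out as a separate letter. A small illustration using Python's built-in Unicode data:]

```python
import unicodedata

omega_lower = '\u1FA0'  # ᾠ GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
omega_title = '\u1FA8'  # ᾨ capital form with prosgegrammeni (a titlecase letter)

print(unicodedata.category(omega_title))  # Lt
print(omega_lower.title())                # ᾨ  (titlecase keeps the subscript iota)
print(omega_lower.upper())                # ὨΙ (full uppercase writes iota as a separate letter)
```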
Re: Swapcase for Titlecase characters
Martin J. Dürst wrote: Now the question I have is: What to do for titlecase characters? [ ... ] For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or 'džinsi') with b), and 'dŽINSI' with c). For the Latin letters at least, my 0.02 cents' worth (you read that right) is that they are probably so infrequently used that option (b) would be just fine. As one anecdote (which is even less like "data" than two anecdotes), I could not find any of the characters IJ ij DŽ Dž dž LJ Lj lj NJ Nj nj or their hex equivalents in any of the CLDR keyboard definitions. I'd imagine that users just type the two characters separately, and that consequently most data in the real world is like that. -- Doug Ewell | http://ewellic.org | Thornton, CO
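[For reference, the Croatian digraph Doug mentions does come in three encoded case forms, and the built-in case mappings already move between them; a quick check in Python:]

```python
import unicodedata

DZ_upper = '\u01C4'  # Ǆ LATIN CAPITAL LETTER DZ WITH CARON (Lu)
Dz_title = '\u01C5'  # ǅ LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (Lt)
dz_lower = '\u01C6'  # ǆ LATIN SMALL LETTER DZ WITH CARON (Ll)

# Each form's category, and where upper()/lower() send it:
for ch in (DZ_upper, Dz_title, dz_lower):
    print(unicodedata.category(ch), ch.upper(), ch.lower())
```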
Re: Variations and Unifications ?
"Disunification may be an answer?" We should avoid it as well. We have other solutions in Unicode:

- variation selectors (often used for sinograms when their unified shapes must be distinguished in some contexts such as people names, toponyms, trademark names, or other specific contexts),
- or combining sequences (including in Arabic or Hebrew, where many combining characters are not always represented visually, the same occurring as well in Latin with accents not always presented over capitals),
- or sequences of multiple characters (like in emoji for skin color variants, or sequences for encoding flags),
- or other sequences using joiners (e.g. in South Asian scripts).

Disunification is only acceptable when

- there's a complete disunification of concepts and the "similar" shapes are also different even if one originates from the other (e.g. the Latin slashed o disunified from the Latin o, even though there's also the sequence o + combining slash, almost never used as its rendering is too approximate in most cases),
- or there's a clear distinction of semantics and properties (e.g. the Latin AE ligature, which is not appropriately represented by the two separate letters, not even with a "hinting" joiner, and which has specific properties as a plain letter, e.g. with mappings).

Before disunifying a character, we should first study the alternative of representing it as a sequence.

2016-03-16 18:34 GMT+01:00 Asmus Freytag (t): > On 3/15/2016 8:14 PM, David Faulks wrote: > > As part of my investigations into astrological symbols, I'm beginning to > wonder if glyph variations are justifications for separate encoding of > symbols I would have previously considered the same or unifiable with symbols > already in Unicode. > > For example, the semisquare aspect is usually shown with a glyph that is > identical to ∠ (U+2220 ANGLE). However, sometimes it looks like <, or like ∟ > (U+221F RIGHT ANGLE). Would this be better encoded as a separate codepoint?
> > The parallel aspect, similarly, sometimes looks like ∥ (U+2225 PARALLEL TO), > but is often shown as // or ⫽ (U+2AFD DOUBLE SOLIDUS OPERATOR). This is not a > typographical kludge, since astrological fonts often show it this way. > There is also contra-parallel, which is sometimes shown like ∦ (U+2226 NOT > PARALLEL TO), but has variant glyphs with slanted lines (and the crossbar is > often horizontal). > > The ‘part of fortune’ is sometimes a circled ×, or sometimes a circled +. > > Would it be better to have dedicated characters than to assume unifications > in these cases? > > > > My take is that for symbols there's always that tension between encoding > the "concept" or encoding the shape. In my view, it is often impossible to > answer the question whether the different angles (for example) are merely > different "shapes" of one and the same "symbol", or whether it isn't the > case that there are different "conventions" (using different symbols for > the same concept). > > Disunification is useful whenever different concepts require distinct > symbol shapes (even if there are some general similarities). If other > concepts make use of the same shapes interchangeably, it is then up to the > author to fix the convention by selecting one or the other shape. > Conceptually, that is similar to the decimal point: it can be either a > period or a comma, depending on locale (read: depending on the convention > the author follows). > > Sometimes, concepts use multiple symbol shapes, but all of these shapes > map to the same concept (and other uses are not known). In that case, > unifying the shapes might be acceptable. The selection of shape is then a > matter of the font (and may not always be under the control of the author). > Conceptually, that is similar to the integral sign, which can be slanted or > upright. The choice is one of style.
While authors or readers may prefer > one look over the other, the identity of the symbol is not in question, and > there's no impact on transmission of the contents of the text. > > Whenever we have the former case, that is, multiple conventional > presentations that are symbols in their own right in other contexts, then > encoding an additional "generic" shape should be avoided. Unicode > explicitly did not encode a generic "decimal point". If the convention that > is used matters, the author is better off being able to select a specific > shape. The results will be more predictable. The downside is that a search > will have to cover all the conventions. Conceptually, that is no different > from having to search for both "color" and "colour". > > The final case is where a convention for depicting a concept uses a symbol > that itself has some variability (for example when representing some other > concepts), such that some of its forms make it less than ideal for the > conventional use intended for the concept in question. Unicode has > historically not always been able to
Re: Joined "ti" coded as "Ɵ" in PDF
Yes, it seems like your mileage varies with the PDF viewer/interpreter/converter. Text copied from Preview on the Mac replaces the ti ligature with a space. Certainly not a Unicode problem, per se, but an interesting problem nevertheless. -steve > On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote: > > Don Osborn wrote: > >> Odd result when copy/pasting text from a PDF: For some reason "ti" in >> the (English) text of the document at >> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >> is coded as "Ɵ". Looking more closely at the original text, it does >> appear that the glyph is a "ti" ligature (which afaik is not coded as >> such in Unicode). > > When I copy and paste the PDF text in question into BabelPad, I get: > >> InternaƟonal Order and the DistribuƟon of IdenƟty in 1950 (By >> invitaƟon only) > > The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use > character. > > Truncating this character to 16 bits, which is a Bad Thing™, yields > U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either > Don's clipboard or the editor he pasted it into is not fully > Unicode-compliant. > > Don's point about using alternative characters to implement ligatures, > thereby messing up web searches, remains valid. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO > >
Re: Swapcase for Titlecase characters
Thanks everybody for the feedback. On 2016/03/19 04:33, Marcel Schneider wrote: On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote: b) Convert to upper (or lower), which may simplify implementation. For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or 'džinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c). Looking at your examples, I would add a case that typically occurs for swapcase to be applied: ‘ᾠΔΉ’ (cited [erroneously] as a result of option b) that is to be converted to ‘ᾨδή’, and ‘džINSI’, that is to become ‘Džinsi’. First, what do you mean with "erroneously"? Second, did I get this right that your additional case (let's call it d)) would cycle through the three options where available: lower -> title -> upper -> lower. As about decomposing digraphs and ypogegrammeni to apply swapcase: That probably would be doing no good, as itʼs unnecessary and users wonʼt expect it. Why do you say "users won't expect it"? For those users not aware of the encoding internals, I'd indeed guess that's what users would expect, at least in the Croatian case. For Greek, it may be different; it depends on the extent to which the iota is seen as a letter vs. seen as a mark. Regards, Martin.
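[Martin's option b) can be sketched in a few lines: treat titlecase letters (category Lt) as cased, and resolve them to full uppercase rather than trying to swap them. This is only an illustration of one proposed behavior, not any library's actual swapcase implementation; note it maps the digraph to its single-character uppercase form Ǆ (U+01C4), which merely displays as "DŽ".]

```python
import unicodedata

def swapcase_b(s: str) -> str:
    """Swap case, resolving titlecase letters to uppercase (option b)."""
    out = []
    for ch in s:
        cat = unicodedata.category(ch)
        if cat == 'Ll':
            out.append(ch.upper())
        elif cat == 'Lu':
            out.append(ch.lower())
        elif cat == 'Lt':
            out.append(ch.upper())  # option b: titlecase goes to uppercase
        else:
            out.append(ch)          # uncased characters pass through
    return ''.join(out)

print(swapcase_b('\u01C5insi'))     # 'Džinsi' -> 'ǄINSI' (displays as 'DŽINSI')
```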
Re: Variations and Unifications ?
On 3/15/2016 8:14 PM, David Faulks wrote: As part of my investigations into astrological symbols, I'm beginning to wonder if glyph variations are justifications for separate encoding of symbols I would have previously considered the same or unifiable with symbols already in Unicode. For example, the semisquare aspect is usually shown with a glyph that is identical to ∠ (U+2220 ANGLE). However, sometimes it looks like <, or like ∟ (U+221F RIGHT ANGLE). Would this be better encoded as a separate codepoint? The parallel aspect, similarly, sometimes looks like ∥ (U+2225 PARALLEL TO), but is often shown as // or ⫽ (U+2AFD DOUBLE SOLIDUS OPERATOR). This is not a typographical kludge, since astrological fonts often show it this way. There is also contra-parallel, which is sometimes shown like ∦ (U+2226 NOT PARALLEL TO), but has variant glyphs with slanted lines (and the crossbar is often horizontal). The ‘part of fortune’ is sometimes a circled ×, or sometimes a circled +. Would it be better to have dedicated characters than to assume unifications in these cases? My take is that for symbols there's always that tension between encoding the "concept" or encoding the shape. In my view, it is often impossible to answer the question whether the different angles (for example) are merely different "shapes" of one and the same "symbol", or whether it isn't the case that there are different "conventions" (using different symbols for the same concept). Disunification is useful whenever different concepts require distinct symbol shapes (even if there are some general similarities). If other concepts make use of the same shapes interchangeably, it is then up to the author to fix the convention by selecting one or the other shape. Conceptually, that is similar to the decimal point: it can be either a period or a comma, depending on locale (read: depending on the convention the author follows).
Sometimes, concepts use multiple symbol shapes, but all of these shapes map to the same concept (and other uses are not known). In that case, unifying the shapes might be acceptable. The selection of shape is then a matter of the font (and may not always be under the control of the author). Conceptually, that is similar to the integral sign, which can be slanted or upright. The choice is one of style. While authors or readers may prefer one look over the other, the identity of the symbol is not in question, and there's no impact on transmission of the contents of the text. Whenever we have the former case, that is, multiple conventional presentations that are symbols in their own right in other contexts, then encoding an additional "generic" shape should be avoided. Unicode explicitly did not encode a generic "decimal point". If the convention that is used matters, the author is better off being able to select a specific shape. The results will be more predictable. The downside is that a search will have to cover all the conventions. Conceptually, that is no different from having to search for both "color" and "colour". The final case is where a convention for depicting a concept uses a symbol that itself has some variability (for example when representing some other concepts), such that some of its forms make it less than ideal for the conventional use intended for the concept in question. Unicode has historically not always been able to provide a solution. In some of these cases, plain text (that is, without a fixed font association) may simply not give the desired answer. If specialized fonts for the convention (e.g. astrological fonts) do not usually exist or can't be expected, then disunifying the symbol's shapes may be an answer. A./
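[One of the alternatives to disunification mentioned in this thread, variation selectors, works by appending an invisible selector character to the base symbol. Which base-plus-selector pairs are actually valid is governed by the UCD's StandardizedVariants.txt, so the pairing below is purely illustrative, not a registered sequence:]

```python
import unicodedata

PARALLEL_TO = '\u2225'  # ∥ one of the shapes discussed for the parallel aspect
VS1 = '\uFE00'          # VARIATION SELECTOR-1

# A hypothetical (unregistered) variation sequence for illustration only:
seq = PARALLEL_TO + VS1
print(len(seq))                   # 2: the selector is a separate code point
print(unicodedata.category(VS1))  # Mn: an invisible combining mark
```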
Re: Joined "ti" coded as "Ɵ" in PDF
There are a few things going on. In the first instance, it may be the font itself that is the source of the problem. My understanding is that PDF files contain a sequence of glyphs. A PDF file will contain a ToUnicode mapping between glyphs and codepoints. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences. I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated. Although, it is worth noting that in OpenType fonts not all glyphs will have mappings in the cmap table. The remedy is to extensively tag the PDF and add ActualText attributes to the tags. But the PDF specs leave it up to the developer to decide what happens when there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF. At least that's my current understanding. Andrew On 18 Mar 2016 7:47 am, "Don Osborn" wrote: > Thanks all for the feedback. > > Doug, It may well be my clipboard (running Windows 7 on this particular > laptop). Get same results pasting into Word and EmEditor. > > So, when I did a web search on "internaƟonal," as previously mentioned, > and came up with a lot of results (mostly PDFs), were those also a > consequence of many not fully Unicode-compliant conversions by others? > > A web search on what you came up with - "InternaƟonal" - yielded many > more (82k+) results, again mostly PDFs, with terms like "interna onal" > (such as what Steve noted) and "interna nature of, or how Google interprets, the private use character?). > > Searching within the PDF document already mentioned, "international" comes > up with nothing (which is a major fail as far as usability).
Searching the > PDF in a Firefox browser window, only "internaƟonal" finds the occurrences > of what displays as "international." However after downloading the document > and searching it in Acrobat, only a search for "internaƟonal" will find > what displays as "international." > > A separate web search on "Eīects" came up with 300+ results, including > some GoogleBooks which in the texts display "effects" (as far as I > checked). So this is not limited to Adobe? > > Jörg, With regard to "Identity H," a quick search gives the impression > that this encoding has had a fairly wide and not so happy impact, even if > on the surface level it may have facilitated display in a particular style > of font in ways that no one complains about. > > Altogether a mess, from my limited encounter with it. There must have been > a good reason for or saving grace of this solution? > > Don > > On 3/17/2016 2:17 PM, Steve Swales wrote: > >> Yes, it seems like your mileage varies with the PDF >> viewer/interpreter/converter. Text copied from Preview on the Mac replaces >> the ti ligature with a space. Certainly not a Unicode problem, per se, but >> an interesting problem nevertheless. >> >> -steve >> >> On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote: >>> >>> Don Osborn wrote: >>> >>> Odd result when copy/pasting text from a PDF: For some reason "ti" in the (English) text of the document at http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf is coded as "Ɵ". Looking more closely at the original text, it does appear that the glyph is a "ti" ligature (which afaik is not coded as such in Unicode). >>> When I copy and paste the PDF text in question into BabelPad, I get: >>> >>> InternaƟonal Order and the DistribuƟon of IdenƟty in 1950 (By invitaƟon only) >>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use >>> character. 
>>> >>> Truncating this character to 16 bits, which is a Bad Thing™, yields >>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either >>> Don's clipboard or the editor he pasted it into is not fully >>> Unicode-compliant. >>> >>> Don's point about using alternative characters to implement ligatures, >>> thereby messing up web searches, remains valid. >>> >>> -- >>> Doug Ewell | http://ewellic.org | Thornton, CO >>> >>> >>> >> >
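Doug's diagnosis above is easy to verify mechanically: masking the Plane 16 private-use code point down to 16 bits produces exactly the "Ɵ" that Don saw. A minimal sketch of the bug:

```python
# The PDF's "ti" ligature glyph was mapped to U+10019F, a Plane 16
# private-use code point. A clipboard or editor that naively truncates
# code points to 16 bits turns it into U+019F, which displays as "Ɵ".
import unicodedata

pua = 0x10019F            # PUA code point used for the "ti" ligature glyph
truncated = pua & 0xFFFF  # the Bad Thing: keeping only the low 16 bits

assert truncated == 0x019F
assert unicodedata.name(chr(truncated)) == \
    "LATIN CAPITAL LETTER O WITH MIDDLE TILDE"
print(chr(truncated))  # Ɵ
```

A Unicode-compliant clipboard would instead carry the full code point (as a UTF-16 surrogate pair or as UTF-8), so no information would be lost.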
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
On 18 March 2016 at 23:49, Garth Wallace wrote: > > Correction: the 2-digit pairs would require 19 characters. There would > be no need for a left half circle enclosed digit one, since the > enclosed numbers 10–19 are already encoded. This would only leave > enclosed 20 as a potential confusable. There would also be no need for > a left third digit zero, saving one code point if the thirds are not > unified with the halves, so there would be 29 thirds. > > And just to clarify, there would have to be separate half circled and > negative half circled digits. So that would be 96 characters > altogether, or 58 if left and right third-circles are unified with > their half-circle equivalents. Not counting ideographic numbers. Thanks for your suggestion, I have added two new options to my draft proposal, one based on your suggestion (60 characters: 10 left, 10 middle and 10 right for normal and negative circles) and one more verdyesque (four enclosing circle format characters). To be honest, I don't think the UTC will go for either of these options, but I doubt they will be keen to accept any of the suggested options. > This may not work very well for ideographic numbers though. In the > examples, they appear to be written vertically within their circles > (AFAICT none of the moves in those diagrams are numbered 100 or above, > although some are hard to read). I have now added an example with circled ideographic numbers greater than 100. See Fig. 13 in http://www.babelstone.co.uk/Unicode/GoNotation.pdf In this example, numbers greater than 100 are written in two columns within the circle, with hundreds on the right. Andrew
Re: Joined "ti" coded as "Ɵ" in PDF
On Thu, Mar 17, 2016 at 19:02:19, Pierpaolo Bernardi wrote: > unicode says nothing about font technologies It mentions them a little bit however in the core specifications: http://www.unicode.org/versions/Unicode8.0.0/ch23.pdf#G23126 > unicode does not mandate how to encode ligatures Probably because Unicode specifies that «it is the task of the rendering system» to select ligature glyphs on the basis of characteristic sequences of characters in the text stream. While having found some of the mentioned oddities in an old PDF file (ffi ligature ending up as Y, ffl ligature as Z), I’m now really puzzled about actual practice. Marcel
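The point about ligature selection being the rendering system's job can be seen in the standard itself: the handful of legacy ligature code points that do exist (fi, ffi, etc.) are compatibility characters that decompose under NFKC, and there is no "ti" ligature code point at all, which is why PDF producers fall back on PUA code points or arbitrary glyph names. A small sketch:

```python
import unicodedata

# U+FB01 is a legacy compatibility character, kept only for round-tripping
# with older encodings; NFKC folds it back to the plain letter sequence.
fi = "\uFB01"
assert unicodedata.name(fi) == "LATIN SMALL LIGATURE FI"
assert unicodedata.normalize("NFKC", fi) == "fi"

# There is no comparable code point for "ti": the ligature in the PDF
# under discussion exists only as a glyph substitution inside the font,
# invisible to Unicode-level processing.
```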
Re: Swapcase for Titlecase characters
The 'swapcase' just sounds bizarre. What on earth is it for? My inclination would be to just do the simplest possible implementation that has the expected results for the 1:1 case pairs, and whatever falls out from the algorithm for the others. Mark On Sat, Mar 19, 2016 at 4:11 AM, Asmus Freytag (t) wrote: > On 3/18/2016 12:33 PM, Marcel Schneider wrote: > > As about decomposing digraphs and ypogegrammeni to apply swapcase: That > probably would be doing no good, as itʼs unnecessary and users wonʼt expect > it. > > > That was my intuition as well, but based on a different line of argument. > If you add a feature to match behavior somewhere else, it rarely pays to > make that perform "better", because it just means it's now different and no > longer matches. > > The exception is a feature for which you can establish unambiguously that > there is a metric of correctness or a widely (universally?) shared > expectation by users as to the ideal behavior. In that case, being > compatible with a broken feature (or a random implementation of one) may in > fact be counter productive. > > The mere fact that you needed to ask here made me think that this would be > unlikely to be one of those exceptions: because in that case, you would > have easily been able to tap into a consensus that tells you what "better" > means. (And the feature would probably have been more widely > implemented). > > This one is pretty bizarre on the face of it, but I like Marcel's > suggestion as to its putative purpose. > > A./ >
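Mark's "simplest possible implementation" is essentially what Python's built-in str.swapcase already does: it handles the 1:1 upper/lower pairs, while the rare titlecase (Lt) digraphs like U+01C8 fall outside those pairs, which is precisely the open question in this thread. A sketch of the situation (Python's behavior for the Lt case is the implementation-defined fallback, not something Unicode mandates):

```python
import unicodedata

# Ordinary 1:1 case pairs behave as everyone expects.
assert "Hello, World".swapcase() == "hELLO, wORLD"

# U+01C8 is one of the few titlecase (Lt) digraph characters; it has
# distinct lowercase and uppercase forms but is itself neither.
lj = "\u01C8"  # ǈ LATIN CAPITAL LETTER L WITH SMALL LETTER J
assert unicodedata.category(lj) == "Lt"
assert lj.lower() == "\u01C9"  # ǉ
assert lj.upper() == "\u01C7"  # Ǉ

# What swapcase should do with an Lt character is exactly what the
# thread is debating; whatever "falls out from the algorithm":
print(lj.swapcase())
```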
Re: Joined "ti" coded as "Ɵ" in PDF
The PDF *displays* correctly. But try copying the string 'ti' from the text to another application outside of your PDF viewer, and you'll see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don Osborn said. 2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi: > That document displays correctly for me using both the pdf viewer > built into chrome and the standalone Acrobat reader v.11. The problem > could be in your PDF viewer? What are you viewing the document with? > > On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn wrote: >> Odd result when copy/pasting text from a PDF: For some reason "ti" in the >> (English) text of the document at >> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >> is coded as "Ɵ". Looking more closely at the original text, it does appear >> that the glyph is a "ti" ligature (which afaik is not coded as such in >> Unicode). >> >> Out of curiosity, did a web search on "internaƟonal" and got over 11k hits, >> apparently all PDFs. >> >> Anyone have any idea what's going on? Am assuming this is not a deliberate >> choice by diverse people creating PDFs and wanting "ti" ligatures for >> stylistic reasons. Note the document linked above is current, so this is not >> (just) an issue with older documents. >> >> Don Osborn >
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
On Thu, Mar 17, 2016 at 9:18 PM, Garth Wallace wrote: > There's another strategy for dealing with enclosed numbers, which is > taken by the font Quivira in its PUA: encoding separate > left-half-circle-enclosed and right-half-circle-enclosed digits. This > would require 20 characters to cover the double digit range 00–99. > Enclosed three digit numbers would require an additional 30 for left, > center, and right thirds, though it may be possible to reuse the left > and right half circle enclosed digits and assume that fonts will > provide left half-center third-right half ligatures (Quivira provides > "middle parts" though the result is a stadium instead of a true > circle). It should be possible to do the same for enclosed ideographic > numbers, I think. > > The problems I can see with this are confusability with the already > encoded atomic enclosed numbers, and breaking in vertical text. Correction: the 2-digit pairs would require 19 characters. There would be no need for a left half circle enclosed digit one, since the enclosed numbers 10–19 are already encoded. This would only leave enclosed 20 as a potential confusable. There would also be no need for a left third digit zero, saving one code point if the thirds are not unified with the halves, so there would be 29 thirds. And just to clarify, there would have to be separate half circled and negative half circled digits. So that would be 96 characters altogether, or 58 if left and right third-circles are unified with their half-circle equivalents. Not counting ideographic numbers. This may not work very well for ideographic numbers though. In the examples, they appear to be written vertically within their circles (AFAICT none of the moves in those diagrams are numbered 100 or above, although some are hard to read).
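Garth's corrected counts check out arithmetically; the sketch below just reproduces the tallying in his message (the grouping into halves and thirds follows my reading of his correction):

```python
# Half-circle digits for the 2-digit range 00-99:
# 10 left halves + 10 right halves, minus the left-half "1"
# (circled 10-19 are already encoded as atomic characters).
halves = 10 + 10 - 1
assert halves == 19

# Third-circle digits for 3-digit numbers, if not unified with the
# halves: 10 each of left, center, right thirds, minus the left-third "0".
thirds = 3 * 10 - 1
assert thirds == 29

# Normal and negative (white-on-black) variants of everything:
assert 2 * (halves + thirds) == 96

# If the left/right thirds are unified with the half-circle forms,
# only the 10 center thirds remain extra:
assert 2 * (halves + 10) == 58
```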
Re: Swapcase for Titlecase characters
On 3/18/2016 12:33 PM, Marcel Schneider wrote: As about decomposing digraphs and ypogegrammeni to apply swapcase: That probably would be doing no good, as itʼs unnecessary and users wonʼt expect it. That was my intuition as well, but based on a different line of argument. If you add a feature to match behavior somewhere else, it rarely pays to make that perform "better", because it just means it's now different and no longer matches. The exception is a feature for which you can establish unambiguously that there is a metric of correctness or a widely (universally?) shared expectation by users as to the ideal behavior. In that case, being compatible with a broken feature (or a random implementation of one) may in fact be counter productive. The mere fact that you needed to ask here made me think that this would be unlikely to be one of those exceptions: because in that case, you would have easily been able to tap into a consensus that tells you what "better" means. (And the feature would probably have been more widely implemented). This one is pretty bizarre on the face of it, but I like Marcel's suggestion as to its putative purpose. A./
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
Hi Frédéric, The historic use of ideographic numbers for marking Go moves is discussed in the latest draft of my document: http://www.babelstone.co.uk/Unicode/GoNotation.pdf Andrew On 16 March 2016 at 13:35, Frédéric Grosshans wrote: > Le 15/03/2016 22:21, Andrew West a écrit : >> >> >> Possibly. I certainly have very little expectation that a proposal to >> complete both sets to 999 (or even 399) would have any chance of >> success. > > And then, there are also historical examples of ideographic numbers used > for the same purpose in historic texts (like here > http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/ > or here > http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile=6=wenzhangmod > ). > > The above has been found with a quick google search, and I have no idea > whether these symbols were used in the running text or not. > > Frédéric >
Re: Purpose of and rationale behind Go Markers U+2686 to U+2689
On 2016/03/19 04:55, Garth Wallace wrote: On Fri, Mar 18, 2016 at 11:48 AM, Philippe Verdy wrote: 2016-03-18 19:11 GMT+01:00 Garth Wallace : Rotation is definitely not salient in standard go kifu like it is in fairy chess notation. Go variants for more than 2 players are uncommon enough that I don't think any sort of standardized notation exists. The most frequent way to play Go with more than two players is to play in two teams, the players in each team taking turns when it's time for their team to play. But there's no need for any special notation for this case. Regards, Martin.