Re: Private Use areas
On 8/27/2018 2:20 PM, Rebecca Bettencourt via Unicode wrote:

> > > That sounds like a non-conformant use of characters in the U+24xx block.
> >
> > Well, you are an expert on these things and I do not understand as to
> > with what it would be non-conformant.
>
> A conformant process must interpret ⓅⓊⒶⒹⒶⓉⒶ as the characters ⓅⓊⒶⒹⒶⓉⒶ
> and not as a signal to process what follows as anything other than
> plain text.

Not correct. If that was literally true, then all HTML, XML, CSS, C, C#,
Java, Python source code files and their compilers would be
non-conformant.

It's more like, "if a process treats a sequence of bytes as Unicode
plain text, then the bytes corresponding to the codes assigned to
ⓅⓊⒶⒹⒶⓉⒶ just stand for ⓅⓊⒶⒹⒶⓉⒶ. Any meaning is imparted by the (human)
reader."

However, if the process treats the file as a source file in a markup
language, there's nothing that prevents it from assigning particular
interpretations to ⓅⓊⒶⒹⒶⓉⒶ, including, but not limited to, not
displaying these code points as characters. The interpretation of the
remainder of the file may well be conformant to the Unicode Standard,
just as the display of the contents of many HTML elements is usually
conformant to the Unicode Standard.

> What you are proposing is a higher-level protocol, whether you realize
> it or not.

Correct, the rub here is that all these schemes that treat characters as
both syntax and text depending on context amount to mark-up languages
and are therefore ipso facto no longer plain text (except if displayed
as source code, but already applying syntax coloring would no longer be
purely treating the data as plain text).

In-band markup thus has a dual nature as plain text and rich text,
depending on how it is processed.

> Unfortunately your higher-level protocol has a serious flaw in that it
> cannot represent the string "ⓅⓊⒶⒹⒶⓉⒶ".

That could probably be remedied by the usual techniques.

> Also, seeing a bunch of circled alphanumeric characters in a document
> ⓘⓢ◯ⓕⓐⓡ◯ⓕⓡⓞⓜ◯ⓤⓝⓞⓑⓣⓡⓤⓢⓘⓥⓔ. :)
>
> There are plenty of already-existing higher-level protocols (you
> mentioned one: XML) that could be used to provide information about
> PUA characters, and they are all much better suited to that purpose
> than what you are proposing.

There are situations where an ad-hoc markup language seems to fulfill a
need that is not well served by the existing full-fledged markup
languages. You find them in internet "bulletin boards" or services like
GitHub, where pure plain text is too restrictive but the required text
styles are purposefully limited, which makes the syntactic overhead of a
full-featured mark-up language burdensome.

Too bad that there's been no "winner" among these, and therefore no
universally accepted one. Had there been one, it might have presented an
obvious target for a PUA extension.

A./
Re: Line wrapping of mixed LTR/RTL text
> Date: Tue, 28 Aug 2018 13:44:58 +0300
> From: Cosmin Apreutesei via Unicode
>
> There is this sentence in UAX#9 which provides a clue: "[...] trailing
> whitespace will appear at the visual end of the line (in the paragraph
> direction).". I'm not sure what that means, but by doing some tests
> with fribidi and libunibreak I noticed that the whitespace always
> sticks to the logical end of the word (so visually to the right for
> LTR runs and to the left for RTL runs), regardless of the base
> paragraph direction.

That is not so if the line ends after the whitespace: in that case the
whitespace is trailing, and will appear at the visual end of the line.
Only if you add some character after the whitespace will the whitespace
"jump" to the other side of the word.

> Quick example showing the problem. The following text:
>
> لمفاتيح ABC DEF
>
> with RTL base direction would wrap (for a certain line width) as:
>
> ABC لمفاتيح
> DEF
>
> with two spaces between the Latin and Arabic text, one from the Latin
> text and one from the Arabic text.

No, it should show the space after ABC to the left of ABC,
i.e. immediately before the line end.

What UAX#9 tells you is that you need to decide that the line will
wrap after the space that follows "ABC", then reorder the line as if it
ended after that space, which will produce this:

  لمفاتيح ABC

(with the trailing space to the left of "ABC"). Then you should
display "DEF" on the next line.

IOW, the correct order is:

  . find levels
  . wrap in logical order
  . reorder wrapped lines
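The last of the three steps above (reordering a line that has already
been wrapped) can be sketched in Python. This is only a minimal
illustration under stated assumptions, not a full UAX#9 implementation:
the embedding levels are assumed to have been computed beforehand (for
instance by fribidi), reorder_line is a hypothetical helper name, only
the trailing-whitespace part of rule L1 and the reversal rule L2 are
modelled, and lowercase "xyz" stands in for the Arabic word so that the
result reads the same in any editor.

```python
def reorder_line(chars, levels, para_level):
    """Return the visual (left-to-right) order of one already-wrapped line.

    chars      -- the line's characters in logical order
    levels     -- resolved embedding level of each character (assumed given)
    para_level -- paragraph embedding level (0 = LTR, 1 = RTL)
    """
    levels = list(levels)

    # Rule L1 (trailing-whitespace part only): whitespace at the end of
    # the line is reset to the paragraph level.
    i = len(chars) - 1
    while i >= 0 and chars[i].isspace():
        levels[i] = para_level
        i -= 1

    order = list(range(len(chars)))
    if not order:
        return ""

    # Rule L2: from the highest level down to the lowest odd level,
    # reverse every maximal run of characters at that level or higher.
    highest = max(levels)
    lowest_odd = min((lv for lv in levels if lv % 2), default=highest + 1)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return "".join(chars[k] for k in order)


# "xyz" plays the RTL word (level 1); "ABC" and the space after it are
# at level 2 before L1 is applied; the paragraph is RTL.
visual = reorder_line("xyz ABC ", [1, 1, 1, 1, 2, 2, 2, 2], 1)
print(repr(visual))
```

With the line ending after the space, the space that logically follows
"ABC" ends up at the far left, i.e. at the visual end of an RTL line,
which is the behaviour described above.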
Line wrapping of mixed LTR/RTL text
Hello everyone, I'm having a bit of trouble implementing line wrapping with bidi and I would like to ask for some advice or hints on what is the proper way to do this. UAX#9 section 3.4 says that bidi reordering should be done after line wrapping. But in order to do line wrapping correctly I need to be able to visually ignore some whitespace, and I'm not sure exactly which whitespace must be ignored. There is this sentence in UAX#9 which provides a clue: "[...] trailing whitespace will appear at the visual end of the line (in the paragraph direction).". I'm not sure what that means, but by doing some tests with fribidi and libunibreak I noticed that the whitespace always sticks to the logical end of the word (so visually to the right for LTR runs and to the left for RTL runs), regardless of the base paragraph direction. Is it safe to use this assumption and always remove the whitespace at the logical end of the last word of the line? Or is it more complicated than that? Quick example showing the problem. The following text: لمفاتيح ABC DEF with RTL base direction would wrap (for a certain line width) as: ABC لمفاتيح DEF with two spaces between the Latin and Arabic text, one from the Latin text and one from the Arabic text. Since the line logically ends with the "C" and LTR direction, I should have to probably remove the space after the "C" (and, as a rule, just remove the whitespace at the logical end of the word, regardless of paragraph's direction or word's direction). Is this the right way to do it? Screenshots attached. Thanks!
Re: Private Use areas
Asmus Freytag wrote: > There are situations where an ad-hoc markup language seems to fulfill a need > that is not well served by the existing full-fledged markup languages. You > find them in internet "bulletin boards" or services like GitHub, where pure > plain text is too restrictive but the required text styles purposefully > limited - which makes the syntactic overhead of a full-featured mark-up > language burdensome. I am thinking of such an ad-hoc special purpose markup language. I am thinking of something like a special purpose version of the FORTH computer language being used but with no user definitions, no comparison operations and no loops and no compiler. Just a straight run through as if someone were typing commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces between commands. For example, circled R might mean use Right-to-left text display. I am thinking that there could be three stacks, one for code points and one for numbers and one for external reference strings such as for accessing a web page or a PDF (Portable Document Format) document or listing an International Standard Book Number and so on. Code points could be entered by circled H followed by circled hexadecimal characters followed by a circled character to indicate Push onto the code point stack. Numbers could be entered in base 10, followed by a circled character to mean Push onto the number stack. A later circled character could mean to take a certain number of code points (maybe just 1, or maybe 0) from the character stack and a certain number of numbers (maybe just 1, or maybe just 0) from the number stack and use them to set some property. 
It could all be very lightweight software-wise, just reading the characters of the sequence of circled characters and obeying them one by one just one time only on a single run through, with just a few, such as the circled digits, each having its meaning dependent upon a state variable such as, for a circled digit, whether data entry is currently hexadecimal or base 10. I am wondering how many PUA property variables there would need to be set for the system to be useful. The sequence could start with all of those PUA property values set at their default values so only those that needed changing need be explicitly set, though others could be explicitly set to the default values if a record were desired. William Overington Tuesday 28 August 2018
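A single-pass interpreter of the kind described above could look
roughly like the following Python sketch. Every command assignment in
it is an assumption made up for illustration: circled H to start
hexadecimal entry, circled P to push onto the code point stack and
circled N to push onto the number stack are not defined anywhere in
this thread, the function names are hypothetical, and the third stack
(external reference strings) is left out.

```python
def circled_to_ascii(ch):
    """Map a circled capital letter or digit to plain ASCII, else None."""
    cp = ord(ch)
    if 0x24B6 <= cp <= 0x24CF:          # circled capital letters A..Z
        return chr(ord("A") + cp - 0x24B6)
    if 0x2460 <= cp <= 0x2468:          # circled digits 1..9
        return chr(ord("1") + cp - 0x2460)
    if cp == 0x24EA:                    # circled digit 0
        return "0"
    return None


def run(sequence):
    """Obey a circled-character sequence once, left to right.

    Hypothetical commands: H = start hex entry, P = push the hex value
    onto the code point stack, N = push the decimal value onto the
    number stack.
    """
    code_points, numbers = [], []
    digits, hex_mode = "", False
    for ch in sequence:
        plain = circled_to_ascii(ch)
        if plain is None:
            break                       # first ordinary character ends the run
        if hex_mode and plain in "0123456789ABCDEF":
            digits += plain             # the state variable decides: hex digit
        elif plain.isdigit():
            digits += plain             # decimal digit
        elif plain == "H":
            hex_mode, digits = True, ""
        elif plain == "P" and hex_mode and digits:
            code_points.append(int(digits, 16))
            hex_mode, digits = False, ""
        elif plain == "N" and digits:
            numbers.append(int(digits))
            digits = ""
        else:
            raise ValueError("unexpected command: " + plain)
    return code_points, numbers


# Circled H E 0 0 1 P, then circled 1 2 N, written with escapes.
stacks = run("\u24BD\u24BA\u24EA\u24EA\u2460\u24C5\u2460\u2461\u24C3")
print(stacks)
```

The sketch shows the state-dependent reading of digits mentioned above:
the same circled characters are taken as hexadecimal or as base 10
depending on whether hex entry is currently active.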
Re: Private Use areas
James Kass wrote: > Non-conformant? Well, it's probably overkill anyway. A simpler method of > identifying which PUA convention is being used for a file would be to either have the first line of the file being something like [PUA1] or to have the file name be something like MYFILE.TXTPUA1. Where "PUA1" equals the CSUR. Other numbers (PUA2, PUA3, etc.) for other PUA conventions. The problem that then arises is that a registry is needed for what those numbers mean, such as PUA01728. So what if someone writes explaining his designs for glyphs for the language of the people who live in the northern part of the fifth planet from the sun in the science fiction novel he is writing? Is registration granted instantly upon request or is there a threshold of some sort? What if lots of people do that, including some people wanting a registry code number for the various emoji that they want? If there is a threshold of proving usage and so on, or of showing that the designs have been produced AT a business or AT a college or whatever, then the system will only work for some users of the Private Use Areas. My opinion is that the system needs to be free-standing, with each usage possibly self-contained or with an external reference to a document that is available. Care would need to be taken to send a copy of any such document to deposit libraries such as The British Library so as to ensure long-term conservation. William Overington Tuesday 28 August 2018
Re: Private Use areas
Hi Mark E. Shoulson wrote: > I'm not sure what the advantage is of using circled characters instead of > plain old ascii. My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters. My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format. William Overington Tuesday 28 August 2018
Re: Line wrapping of mixed LTR/RTL text
Hi Eli, thanks for answering!

I think I'm getting closer. Just a few more clarifications if you please.

> That is not so if the line ends after the whitespace: in that case the
> whitespace is trailing, and will appear at the visual end of the
> line.

So only if it's a soft break I should indeed remove the last logical
space, if it's before a hard break then leave it alone.

> Only if you add some character after the whitespace will the
> whitespace "jump" to the other side of the word.

... because the hard break just turned into a soft break and the newly
typed character will appear on the next line with a hard line break
after it, right?

> No, it should show the space after ABC to the left of ABC,
> i.e. immediately before the line end.

Just to make sure, this moving of the last space at the visual end of
the line can only be experienced with a moving cursor, right? I mean
as far as displaying goes (and as far as line width computation for
the purposes of line wrapping goes), that space is just removed,
right? I'm trying to infer the purpose of moving that space to the
end of the line instead of just removing it: is the idea to always
provide a cursor at the visual end of the line so that typing can
continue there or is there more to it?

> What UAX#9 tells you is that you need to decide that the line will
> wrap after the space that follows "ABC"

... but when computing the line width I should not include the width
of that space, right? since it will not take space in the box in the
end.

>, then reorder the line as if it
> ended after that space, which will produce this:
>
> لمفاتيح ABC
>
> (with the trailing space to the left of "ABC"). Then you should
> display "DEF" on the next line.

You mean it will produce this:

" ABC لمفاتيح"
Re: Line wrapping of mixed LTR/RTL text
Hi Philippe,

> The space encoded just before the logical end of line or linewrap (in the
> middle of the displayed line) has to be moved at end of the physical line (in
> the paragraph direction), it should not be kept in the middle.

Ok, that seems to confirm what Eli is saying and it clarifies that
sentence from UAX#9.

Thanks!
Re: Private Use areas
On August 23, 2011, Asmus Freytag wrote:

> On 8/23/2011 7:22 AM, Doug Ewell wrote:
>> Of all applications, a word processor or DTP application would want
>> to know more about the properties of characters than just whether
>> they are RTL. Line breaking, word breaking, and case mapping come to
>> mind.
>>
>> I would think the format used by standard UCD files, or the XML
>> equivalent, would be preferable to making one up:
>
> The right answer would follow the XML format of the UCD.
>
> That's the only format that allows all necessary information contained
> in one file, and it would leverage any effort that users of the
> main UCD have made in parsing the XML format.
>
> An XML format should also be flexible in that you can add/remove not
> just characters, but properties as needed.
>
> The worst thing to do, other than designing something from scratch,
> would be to replicate the UnicodeData.txt layout with its random, but
> fixed collection of properties and insanely many semi-colons. None of
> the existing UCD txt files carries all the needed data in a single
> file.

I don't know if or how I responded 7 years ago, but at least today, I
think this is an excellent suggestion.

If the goal is to encourage vendors to support PUA assignments, using an
exceedingly well-defined format (UAX #42) sitting atop one of the most
widely used base formats ever (XML), with all property information in a
single repository (per PUA scheme), would be great encouragement.

I've devised lots of novel file formats and I think this is one use
case where that would be a real hindrance.

Storing this information in a font, by hook or crook, would lock users
of those PUA characters into that font. At that rate, you might as well
use ASCII-hacked fonts, as we did 25 years ago.

--
Doug Ewell | Thornton, CO, US | ewellic.org
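To make the suggestion concrete, here is a rough Python sketch of how
an application might read PUA properties from a file in the UAX #42
XML layout. The namespace, the char element and the cp, na, gc, bc and
lb attributes are taken from UAX #42; the two characters, their names
and property values are invented for illustration, and
load_pua_properties is a hypothetical helper name. Ranges given with
first-cp/last-cp are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by the UCD in XML (UAX #42).
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Invented sample data: not taken from any real PUA scheme.
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <description>Hypothetical PUA scheme, for illustration only</description>
  <repertoire>
    <char cp="E000" na="EXAMPLE PRIVATE LETTER A" gc="Lo" bc="R" lb="AL"/>
    <char cp="E001" na="EXAMPLE PRIVATE COMBINING MARK" gc="Mn" bc="NSM" lb="CM"/>
  </repertoire>
</ucd>
"""


def load_pua_properties(xml_text):
    """Return {code point: {property alias: value}} for each <char> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for char in root.iter(UCD_NS + "char"):
        cp = char.get("cp")
        if cp is None:
            continue  # a first-cp/last-cp range; skipped in this sketch
        props[int(cp, 16)] = {k: v for k, v in char.attrib.items() if k != "cp"}
    return props


properties = load_pua_properties(SAMPLE)
print(properties[0xE000]["bc"])
```

Because properties are plain attributes, a scheme can list only the
ones it cares about, which is the flexibility mentioned in the quoted
message.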
Re: Line wrapping of mixed LTR/RTL text
> From: Cosmin Apreutesei
> Date: Tue, 28 Aug 2018 21:28:58 +0300
> Cc: unicode@unicode.org
>
> > That is not so if the line ends after the whitespace: in that case the
> > whitespace is trailing, and will appear at the visual end of the
> > line.
>
> So only if it's a soft break I should indeed remove the last logical
> space, if it's before a hard break then leave it alone.

Actually, you don't have to remove it, you could leave it. It's only
an aesthetic issue.

> > No, it should show the space after ABC to the left of ABC,
> > i.e. immediately before the line end.
>
> Just to make sure, this moving of the last space at the visual end of
> the line can only be experienced with a moving cursor, right? I mean
> as far as displaying goes (and as far as line width computation for
> the purposes of line wrapping goes), that space is just removed,
> right?

As I said, not necessarily. But it is definitely there when you
reorder characters for display.

> I'm trying to infer the purpose of moving that space to the
> end of the line instead of just removing it

If you remove trailing space, then you need to see it being trailing
before you remove it. That is the purpose of moving it.

> > What UAX#9 tells you is that you need to decide that the line will
> > wrap after the space that follows "ABC"
>
> ... but when computing the line width I should not include the width
> of that space, right? since it will not take space in the box in the
> end.

If you will remove the space, then yes.

> You mean it will produce this:
>
> " ABC لمفاتيح"

Yes.
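The width question can be illustrated with a small Python sketch of
wrapping in logical order, where the whitespace at the end of a line
does not count towards its width but is kept on the line so that it is
still there when the line is reordered. The assumptions are
deliberately crude: every character is one unit wide, the only break
opportunities are after whitespace (a real implementation would take
them from UAX#14, e.g. via libunibreak), leading whitespace is ignored,
wrap_logical is a hypothetical name, and "xyz" stands in for the Arabic
word.

```python
import re


def wrap_logical(text, max_width):
    """Greedy line wrapping on logical-order text.

    A line may overflow max_width only by its trailing whitespace,
    which is kept so that bidi reordering can move it to the visual
    end of the line afterwards.
    """
    # Each token is a word plus the whitespace that logically follows it.
    tokens = re.findall(r"\S+\s*", text)
    lines, line = [], ""
    for token in tokens:
        candidate = line + token
        # Measure without the trailing whitespace of the candidate line.
        if not line or len(candidate.rstrip()) <= max_width:
            line = candidate
        else:
            lines.append(line)
            line = token
    if line:
        lines.append(line)
    return lines


print(wrap_logical("xyz ABC DEF", 7))
```

Here "xyz ABC " fits a width of 7 because its final space is not
measured; the line is then handed to the reordering step with the
space still attached, and "DEF" goes on the next line.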
RE: Private Use areas - Vertical Text
> On 27 August 2018 at 15:22 Peter Constable via Unicode wrote:
>
> Layout engines that support CJK vertical layout do not rely on the 'vert'
> feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90°
> and switch to using vertical glyph metrics. The 'vert' feature is used to
> substitute vertical alternate glyphs as needed, such as for punctuation that
> isn't automatically rotated (and would probably need a differently-positioned
> alternate in any case).
>
> Cf. UAX 50.

There have been some pretty confused statements. I believe the observed
problem is that PUA characters for Zhuang CJK ideographs get rotated
when displayed vertically rather than left-to-right.

Unicode is doing what it can in this matter:

(a) Zhuang PUA characters are being made individually obsolete.

(b) By default, PUA characters have the value of
    Vertical_orientation=upright as do CJK ideographs.

For CJK ideographs, it is not clear to me when the vert feature (if
present) would be applied. Is it only for some codepoints (vo=tu), or is
it for all that the engine expects to be displayed ‘upright’ in vertical
text?

The vrtr feature (if present) would be applied when glyphs are to be
rotated. Is it for all such glyphs, or only those for which rotation is
expected to be inadequate (vo=tr)?

It seems that feature vrt2 is to be applied to all glyphs; perhaps
rotation is the default behaviour when there is no look-up value for a
glyph that the engine expects to be rotated. The truly difficult case
would be when there is no attempt to apply a look-up – possibly vrtr
would not apply to /p{vo=r}.

I would expect that defining the lookup vrt2 or vrtr to map Zhuang
glyphs to themselves (or something prerotated) would cure the problem.
This would not work for sequences of Zhuang ideographs treated as RTL
text - but that is unlikely to happen.

Richard.
Re: Line wrapping of mixed LTR/RTL text
The space encoded just before the logical end of line or linewrap (in
the middle of the displayed line) has to be moved to the end of the
physical line (in the paragraph direction); it should not be kept in
the middle.

If you need to force a linewrap on a non-breaking space (because there's
no other break opportunity to wrap the line elsewhere), then treat that
non-breaking space as a regular breaking space, which will also be moved
to the end of the row (after the margin on the ending side of the
paragraph), and choose the last non-breaking space on the row; usually,
all spaces present at linewraps (including non-breaking spaces) are
compacted.

But there are other style policies that will force the linewrap
preferably after a trailing punctuation or a separator punctuation, or
before a leading punctuation, or just after the last unbreakable cluster
that can fit the row (including in the middle of words at arbitrary
positions if there's no hyphenation process or the script does not
support hyphenation, such as sinograms and kanas).

Where to insert linewraps is very fuzzy and depends on the rendering
context and capabilities of the target device (you cannot scroll a piece
of printed paper, but you can scroll a display with a scrollbar or using
navigation cursors in a width-restricted input field).
RE: Private Use areas - Vertical Text
Dear Richard and Peter,

Apologies for the lack of clarity. Let me try to explain below.

On 2018-08-29 01:13, WORDINGHAM RICHARD via Unicode wrote:

> > On 27 August 2018 at 15:22 Peter Constable via Unicode wrote:
> >
> > Layout engines that support CJK vertical layout do not rely on the
> > 'vert' feature to rotate glyphs for CJK ideographs, but rather rotate
> > the glyph 90° and switch to using vertical glyph metrics. The 'vert'
> > feature is used to substitute vertical alternate glyphs as needed,
> > such as for punctuation that isn't automatically rotated (and would
> > probably need a differently-positioned alternate in any case).
> >
> > Cf. UAX 50.
>
> There have been some pretty confused statements. I believe the observed
> problem is that PUA characters for Zhuang CJK ideographs get rotated
> when displayed vertically rather than left-to-right.

Yes, as Richard says, when CJK Zhuang text is displayed vertically, the
Zhuang characters encoded in Unicode remain upright, but those with PUA
codepoints are rotated 90°. This is because the PUA characters are
treated like English text, which is correctly rotated 90°. The
orientation of the CJK characters in this case appears to depend on
which block they belong to. As Peter points out, this does not seem to
match UAX 50.

> Unicode is doing what it can in this matter:
>
> (a) Zhuang PUA characters are being made individually obsolete.

Yes and no. Whilst a thousand Zhuang characters have been encoded and
two thousand have been submitted via IRG, the number of PUA Zhuang
characters is about the same or increasing. In 2006, when this started,
just under 6k PUA points were used; presently there are over 8k, over 6k
of which have not been submitted, and the earliest any future
submissions can be encoded is 2026. That being said, the number of more
common Zhuang characters needing PUA support is coming down. So whilst
individual characters are being resolved, the need for PUA Zhuang
characters remains, and will do so for decades to come.

> (b) By default, PUA characters have the value of
> Vertical_orientation=upright as do CJK ideographs.

Noted above.

Regards
John
Re: Private Use areas
On Tue, Aug 28 2018 at 9:43 -0700, unicode@unicode.org writes:
> On August 23, 2011, Asmus Freytag wrote:
>
>> On 8/23/2011 7:22 AM, Doug Ewell wrote:
>>> Of all applications, a word processor or DTP application would want
>>> to know more about the properties of characters than just whether
>>> they are RTL. Line breaking, word breaking, and case mapping come to
>>> mind.
>>>
>>> I would think the format used by standard UCD files, or the XML
>>> equivalent, would be preferable to making one up:

Right. I was not so quick to state this so early, but 2 years ago I
wrote to the MUFI list:

--8<---cut here---start->8---
On Sat, Jan 02 2016 at 12:35 CET, odd.hau...@uib.no writes:

[...]

> Note the permanent URI at the University Library in Bergen. This will
> in all likelihood be the last recommendation of its kind (and
> certainly the last edited by the undersigned), so please look out for
> new solutions (databases or the like) on the MUFI web site!

I think that one of the forms, perhaps even the primary one, should
follow the original Unicode Character Database and the output of
Unibook (http://www.unicode.org/unibook/). The idea can be tested by
converting the present recommendation to this form. Unfortunately I'm
unable to contribute myself to this task.

One of the advantages would be that the various character browsers can
be adapted relatively easily to provide info about the MUFI characters.

A simpler variant of this idea is to use a Unibook-like format to
document fonts. Quick-and-dirty tools for this purpose have been
prepared by a student of mine:

https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/
https://bitbucket.org/jsbien/unicode-ucd-parser

A sample output of the tools is available at

https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf

(the font is also quick-and-dirty and unfinished work).
--8<---cut here---end--->8---

Unfortunately there was no reaction.

>>
>> The right answer would follow the XML format of the UCD.
>>
>> That's the only format that allows all necessary information contained
>> in one file,

For me, the comments and cross-references contained in NamesList.txt are
also necessary. Do I understand correctly that only "ISO Comment
properties" are included in the file?

>> and it would leverage any effort that users of the
>> main UCD have made in parsing the XML format.
>>
>> An XML format should also be flexible in that you can add/remove not
>> just characters, but properties as needed.
>>
>> The worst thing to do, other than designing something from scratch,
>> would be to replicate the UnicodeData.txt layout with its random, but
>> fixed collection of properties and insanely many semi-colons. None of
>> the existing UCD txt files carries all the needed data in a single
>> file.
>
> I don't know if or how I responded 7 years ago, but at least today, I
> think this is an excellent suggestion.
>
> If the goal is to encourage vendors to support PUA assignments, using an
> exceedingly well-defined format (UAX #42) sitting atop one of the most
> widely used base formats ever (XML), with all property information in a
> single repository (per PUA scheme), would be great encouragement.

I think we also need the data in a format accepted by Unibook.

> I've devised lots of novel file formats and I think this is one use
> case where that would be a real hindrance.

> Storing this information in a font, by hook or crook, would lock users
> of those PUA characters into that font. At that rate, you might as well
> use ASCII-hacked fonts, as we did 25 years ago.

Storing the information in a font is inappropriate not only for
technical reasons, as I wrote recently (on Thu, Aug 23 2018):

> Fonts are for *rendering*, new characters and variants are more and
> more often needed for *input* of real life old texts with sufficient
> precision.

Best regards

Janusz

--
             ,
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien