Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))
On Jun 3, 2015, at 1:26 AM, William_J_G Overington wjgo_10...@btinternet.com wrote: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I’d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use.

That's nice; I've found some persistent annoyances when I use PUA code points. A while back I learned Quikscript, an alternate English orthography. Since May 2013, my blog's been in Quikscript using PUA code points. I've also joined the Shavian mailing list, sent e-mails in Shavian, and wrote an "I'm switching my Quikscript blog to Shavian" blog post in Shavian for April Fool's Day. To do all this typing, I made both Quikscript and Shavian keyboard layouts for OS X, as well as a Quikscript font. All of my Quikscript stuff is linked to from https://www.frogorbits.com/qs/ if you're interested. I'm something of a Johnny-come-lately to Shavian, so I've only used it in the SMP with fonts others have made. So, how much nicer is dealing with Shavian?

- The Keyboard Viewer and input-source preview know what font to use for each key for Shavian; Quikscript keyboard layouts display boxes for the letters because there's no way for the system to guess which font to use for a particular code point.
- Double-tapping a Shavian word in my browser will select the word; double-tapping a Quikscript word will select just one letter.
- Internet Explorer will happily break Quikscript text in the middle of a word; Shavian gets broken at word boundaries just like English. While IE's behavior is unlike other browsers' and Not What I Want, I can't fault the IE team; I could be using PUA code points for a language that doesn't use spaces much, like Japanese.
- I can read and write Shavian posts on Twitter on the desktop in a reasonable font for both Shavian and other scripts; if I wanted to do the same in Quikscript, I'd have to have a custom user-supplied stylesheet to override Twitter's own font suggestions.
- Scripts already in Unicode attract the attention of talented completionist organizations that PUA communities generally can't attract beforehand. Everson Mono, Noto, and Segoe UI Historic (as of Windows 10) — all great typefaces — support Shavian and not Quikscript.

This tends to be because:

- I could have multiple fonts that map wildly differing meanings and glyphs to the same code point; the OS can't guess which I might mean.
- All the information that the OS needs to detect word breaks is in character properties data supplied by the Consortium and handled by the OS.

~ ~ ~

Specialists like us might be able to put up with these things, but we can't control everything about the reading and writing experience online unless we're all resigned to taking pictures of handwritten text.
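The character-properties point above is easy to see from a script. A minimal sketch in Python (the particular code points chosen here are my own illustration) shows what generic software can learn about a Shavian letter versus a PUA code point, using only the Unicode Character Database via the standard `unicodedata` module:

```python
# Compare what generic software can learn about a Shavian letter
# versus a Private Use Area code point, using only the Unicode
# Character Database (exposed by Python's unicodedata module).
import unicodedata

shavian_peep = "\U00010450"   # SHAVIAN LETTER PEEP, a regular SMP character
pua_letter = "\uE000"         # a typical PUA code point; its meaning is private

# Shavian letters carry General_Category Lo ("Letter, other"), so
# word-break and selection algorithms treat them as letters automatically.
print(unicodedata.category(shavian_peep))    # Lo
print(unicodedata.name(shavian_peep))        # SHAVIAN LETTER PEEP

# PUA code points are category Co ("Other, private use") and have no
# name; the OS cannot know they are letters of an alphabet like Quikscript.
print(unicodedata.category(pua_letter))      # Co
print(unicodedata.name(pua_letter, "<no name>"))  # <no name>
```

This is exactly why double-tap selection works for Shavian but not for Quikscript: the properties that drive word segmentation simply do not exist for private-use code points.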
Re: Tag characters and in-line graphics (from Tag characters)
Mark E. Shoulson mark at kli dot org wrote: Isn't this what webfonts are all about? You specify a font in the stylesheet, give it a URL, and your browser goes and downloads it and displays the text in it. That's great if you have a stylesheet, a URL, and a browser. HTML is fancy text, and pretty much implies some sort of online connection. I thought we were talking about plain text, and apologize if we weren't or if that important detail was not clear. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
2015-06-07 18:39 GMT+02:00 Doug Ewell d...@ewellic.org: Mark E. Shoulson mark at kli dot org wrote: Isn't this what webfonts are all about? You specify a font in the stylesheet, give it a URL, and your browser goes and downloads it and displays the text in it. That's great if you have a stylesheet, a URL, and a browser. HTML is fancy text, and pretty much implies some sort of online connection. Everything in HTML is embeddable in a standalone document, including graphics. HTML does not imply any online connection. HTML is independent of HTTP and other transports.
Re: Tag characters and in-line graphics (from Tag characters)
On 6/4/2015 17:03, Chris wrote: This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. The sticky issues are not the questions of how to make fonts or images available for use by the OS. Instead, they concern the fact that any such model violates some pretty basic guarantees of plain text that the entire net infrastructure relies on. There are very obvious security issues. They start with tracking: every time you access a custom code point, that fact potentially results in a trackable interaction. This problem affects even the sticker solution that people are hoping for with emoji. (On my system, no external resources are displayed when I first open any message, and there is a reason for that.) Beyond tracking, and beyond stickers (that is, pictures that look like pictures), a generalized custom character set would allow text that is no longer really stable. You would be able to deliver identical e-mails to people that display differently, because when you serve the custom fonts, you would be able to customize what you deliver under the same custom character set designator. While this would be a wonderful way to circumvent censorship (other than the man-in-the-middle version), you would likewise seriously undermine the ability to filter unwanted or undesirable texts, because the custom character set engine might recognize when a request comes from a filter and not the end user. (Just the other day, I came across a hacked website that responded differently to search engines than to live users, making the hack effective for one and invisible to the other. Custom character sets would seem to just add to the hackers' arsenal here.) Finally, custom character sets sound like a great idea when thinking of an extension of an existing character set. But that's not where the issues are. 
The issues come in when you use the same technology to provide aliases for existing code points or for other custom characters. Aliasing undermines the ability to do search (or any other content-focused processing, from sorting to spell-check). At that point, the circle closes. When Unicode was created, the alternative was ISO 2022, a standard that addressed the issue of how to switch among (albeit pre-defined) character sets to achieve, in principle, coverage equal to the union of those character sets. Unicode was created to address two main deficiencies of that situation. Unification addressed the aliasing issue; and code points were no longer opaque but could be interpreted by software (beyond mere display), which addressed the second big drawback of the patchwork of character sets. A processing model for opaque code points is possible to define, but it isn't very practical, and in the late eighties people had had enough and were glad to be quit of it. Seen from this perspective, the discussion about custom character sets presents itself as a giant step backward, undermining the very advances that underlie the rapid acceptance and spread of Unicode. A./
Re: Tag characters and in-line graphics (from Tag characters)
On 2015/06/04 17:03, Chris wrote: I wish Steve Jobs was here to give this lecture. Well, if Steve Jobs were still around, he could think about whether (and how many) users really want their private characters, and whether it was worth the time to have his engineers working on the solution. I'm not sure he would come to the same conclusion as you. This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. Now if the unicode consortium were to decide on standardising a technological process whereby rendering engines could seamlessly download representations of custom characters without user intervention, no doubt all the vendors would support it, and all the technical mumbo jumbo of installing privately agreed character sets would be something users could leave for the technology to sort out. You are right that it would be strictly technically possible. Not only that, it has been so for 10 or 20 years. As an example, in 1996 at the WWW Conference in Paris I was participating in a workshop on internationalization for the Web, and by chance I was sitting between the participant from Adobe and the participant from Microsoft. These were the main companies working on font technology at that time, and I asked them how small it would be possible to make a font for a single character using their technologies (the purpose of such a font, as people on this thread should be able to guess, would be as part of a solution to exchange single, user-defined characters). I don't even remember their answers. The important thing here is that the idea, and the technology, have been around for a long time. So why didn't it catch on? Maybe the demand is just not as big as some contributors on this list claim. 
Also, maybe while the technology itself isn't rocket science, the responsible people at the relevant companies have enough experience with technology deployment to hold back. To give an example of why the deployment aspect is important, there were various Web-like hypertext technologies around when the Web took off in the 1990s. One of them was called HyperG. It was technologically 'better' than the Web, in that it avoided broken links. But it was much more difficult to deploy, and so it is forgotten, whereas the Web took off. Regards, Martin.
Re: Tag characters and in-line graphics (from Tag characters)
Asmus Freytag wrote about security issues. This is interesting reading and I have learned a lot from the post about various security issues. Whilst the post is in this thread and follows from a post in this thread, the topic seems to have moved to the Custom characters thread. It seems to me that what you write about would not apply to my suggestion in my original post: is that correct? http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html Also the following two posts. http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0009.html http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0027.html Whilst the ideas raised by Chris are interesting, they do seem to be distinctly different from what I suggested. So, for clarity, do you regard my suggested format as having any security issues, and if so, what please? I know that some people have opined that my suggested format is out of scope for Unicode, yet the scope of Unicode is what the Unicode Technical Committee decides is the scope of Unicode, and my suggested format does provide a way to include custom glyphs within a Unicode plain text document by using the new base character followed by tag characters method. William Overington 5 June 2015
Re: Tag characters and in-line graphics (from Tag characters)
I wrote, crumpled up, and threw away about three different responses. I thought about ISO 2022 and about accessing the web for every PUA character, as Asmus mentioned, and about the size of the user base, as Martin mentioned. I thought about character properties and about ephemerality. I didn't think of the spoofing implications that Asmus described, which would affect both the automatic PUA font download and the inline drawing language. Either of these could be used to spell out, let's say, paypal.com rather convincingly and with minimal effort. I might have more experience with the PUA than many list members, having transcribed the 27,000-word Alice's Adventures in Wonderland into my constructed alphabet two years ago, in a PUA encoding, so that Michael Everson could publish it in book form. One of the many learning experiences of this project was finding out which software tools play nicely with the PUA and which don't. Some tools just worked, while others would not give acceptable results with any amount of effort. At no point, however, did I suppose that a font with my alphabet, or any of the jillions of others that have been invented during a boring day in class (see Omniglot for tons of examples), should be silently downloaded to a user's computer, consuming bandwidth and disk space, without her knowledge. That's practically malware. Maybe I'm just not enough of a Distinguished Visionary to understand how insanely great this would be (unfortunately, celebrity name-dropping doesn't work with me). Unicode has stated consistently for at least 23 years that it would not ever standardize PUA usage, and over the years some UTC members have used terms like "strongly discouraged" and "not interoperable" even in the presence of an agreement. Given this, and given that no system I'm aware of magically downloads fonts for *regularly encoded characters* (I still have no font for Arabic math symbols), I personally would not expect Unicode to perform a 180 on this. 
-- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
No, that's why you include a reference to the font in the private agreement, so that interested parties can install it and see the special character(s). People with their iphones and ipads and so forth don’t want to have “private agreements”, they don’t want to “install character sets”. They want it to “just work”. I wish Steve Jobs was here to give this lecture. I highly doubt actually that it is even possible to install a private character set font on an iphone such that it would be available to all applications. This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. Now if the unicode consortium were to decide on standardising a technological process whereby rendering engines could seamlessly download representations of custom characters without user intervention, no doubt all the vendors would support it, and all the technical mumbo jumbo of installing privately agreed character sets would be something users could leave for the technology to sort out.
Re: Tag characters and in-line graphics (from Tag characters)
On 4 Jun 2015, at 10:59 am, David Starner prosfil...@gmail.com wrote: On Wed, Jun 3, 2015 at 5:46 PM Chris idou...@gmail.com mailto:idou...@gmail.com wrote: I personally think emoji should have one, single definitive representation for this exact reason. Then you want an image. I don't see what's hard about that. I already explained why an image and/or HTML5 is not a character. I’ll repeat again. And the world of characters is not limited to emoji.

1. HTML5 doesn’t separate one particular representation (font, size, etc.) from the actual meaning of the character. So you can’t paste it somewhere and expect to increase its point size or change its font.
2. It’s highly inefficient in space to drop multi-kilobyte strings into a document to represent one character.
3. The entire design of HTML has nothing to do with characters. So there is no way to process a string of characters interspersed with HTML elements and know which of those elements are a “character”. This makes programmatic manipulation impossible, and means most computer applications simply will not allow HTML in scenarios where they expect a list of “characters”.
4. There is no way to compare 2 HTML elements and know they are talking about the same character. I could put some HTML representation of a character in my document, you could put a different one in, and there would be absolutely no way to know that they are the same character. Even if we are in the same community and agree on the existence of this character.
5. Similarly, there is no way to search or index HTML elements. If an HTML document contained an image of a particular custom character, there would be no way to ask Google or whatever to find all the documents with that character. Different documents would represent it differently.

HTML is a rendering technology. It makes things LOOK a particular way, without actually ENCODING anything about it. 
The only part of HTML that is actually searchable in a deterministic fashion is the part that is encoded - the unicode part. The community interested in tony the tiger can make decisions like that. That is a hell of a handwave. In practice, you've got a complex decision that's always going to be a bit controversial, and one that most communities won't bother trying to make. Apparently the world makes decisions all the time without meeting in committee. Strange but true. It’s called making a decision. Facebook have created a lot of emoji characters without consulting any committee and it seems to work fine, albeit restricted to the facebook universe because of the lack of a standard. You can’t know because they’re images. You can't know because the only obvious equivalence relation is exact image identity. Because… there is no standard!! If facebook wants to define 2 emoji images, maybe one bigger than the other and yet basically the same, to mean the same thing, then that would be their choice. Since I expect they have a lot of smart people working there, I expect it would work rather well. Just like Microsoft issues Courier fonts in different point sizes, and we all feel they have made that work fairly well. You seem to be arguing the nonsense position that if someone, for example, made a snowflake glyph slightly different to the unicode official one, that it is wrong. That of course is nonsense. People can make sensible decisions about this without the unicode committee. You can’t iterate over compressed bits. You can’t process them. Why not? In any language I know of that has iterators, there would be no problem writing one that iterates over compressed input. If you need to mutate them, that is hard in compressed formats, but a new CPU can store War and Peace in the on-CPU cache. You can’t do it because no standard library, programming language, or operating system is set up to iterate over characters of compressed data. 
So if you want to shift compressed bits around in your app, it will take an awful lot of work, and the bits won’t be recognised by anyone else. Now if someone wants to define the next version of unicode to be a compressed format, and every platform supports that with standard libraries, computer languages etc, then fine that could work. Yet again I point out, lots of things MIGHT be possible in the real world IF that is how a standard is formulated. But all the chatter about this or that technology is pie in the sky without that standard.
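For what it's worth, David's side of this exchange is easy to demonstrate with a standard library alone. A minimal sketch in Python (the sample text is my own): `gzip.open` in text mode yields a stream that ordinary code can read one character at a time, without ever holding the fully decompressed text in memory:

```python
# Iterate over the characters of gzip-compressed text: gzip.open in
# text mode returns a stream that can be read one character at a time,
# decompressing lazily as it goes.
import gzip
import io

original = "custom characters, compressed text " * 1000

# Compress into an in-memory buffer (a stand-in for a file on disk).
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    f.write(original)

# Stream the characters back out of the compressed buffer one by one.
buf.seek(0)
count = 0
with gzip.open(buf, "rt", encoding="utf-8") as f:
    for ch in iter(lambda: f.read(1), ""):
        count += 1

assert count == len(original)
```

Chris's point stands that no platform treats this as the default representation for strings in flight between applications; but the claim that iteration itself is impossible does not hold up.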
Re: Tag characters and in-line graphics (from Tag characters)
On 3 Jun 2015, at 11:24 pm, David Starner prosfil...@gmail.com wrote: Chris wrote: There is no way to compare 2 HTML elements and know they are talking about the same character That's because character identity is a hard problem. Is the emoji TIGER the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN? I personally think emoji should have one, single definitive representation for this exact reason. The subtle differences in emotion between one happy face and another can be miles apart. Emoji are a little different to other symbols in that respect. Symbols that are purely symbolic can be changed as much as you like as long as they are recognisable. Emoji have too many shades of meaning to allow change. Both of these scenarios are an argument that there should be custom characters with at least one official representation. Emoji, because you don’t really want variation. Symbols, because if you don’t have a local representation, then something is better than nothing. If you don’t have a local snowflake, for example, any old snowflake will be fine. This is not a hard problem at all. Is one tony the tiger the same as another? The community interested in tony the tiger can make decisions like that. But having made that decision, there needs to be a way for generic computer programs that don’t know about that community to do reasonable things with tony the tiger characters. You can index links to images. If two documents represent it differently, then I go back to the above; we can't know that they're the same thing. You can’t know because they’re images. That’s my exact point. Anybody talking about HTML5 and images as a solution to custom characters is not proposing a valid solution. On Tue, Jun 2, 2015 at 7:11 PM Chris idou...@gmail.com mailto:idou...@gmail.com wrote: You can’t ask the entire computing universe to compress everything all the time. Anytime we care about how much space text takes up, it should be compressed. It compresses very well. 
On the other hand, it's rare that anyone cares anymore; what's a few hundred kilobytes between friends? You compress things when they are on the move. Between computers and as you are writing it to a file. But you can’t compress generically while it is in memory. You can’t iterate over compressed bits. You can’t process them.
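The "it compresses very well" claim is easy to check against the standard library; a quick sketch in Python (the repetitive sample text is my own, and real prose compresses less dramatically, though still substantially):

```python
# A rough illustration of how well repetitive text compresses: zlib at
# its highest level shrinks this sample to a small fraction of its size.
import zlib

text = ("What's a few hundred kilobytes between friends? " * 2000).encode("utf-8")
packed = zlib.compress(text, 9)

ratio = len(packed) / len(text)
print(f"{len(text)} bytes -> {len(packed)} bytes ({ratio:.2%})")

# Round-trip check: decompression recovers the original bytes exactly.
assert zlib.decompress(packed) == text
```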
Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))
I don’t use old software, I use up-to-date versions of everything on a Mac. Very standard setup. There’s a lot of links there. Maybe they do work in PDFs, but they certainly don’t work in the browser, and they don’t work when I click the txt files. Basically what you’re saying is that PDFs have a way to make this work. So what? Unless we are proposing that everything in the universe be PDF, this doesn’t really help. There should be a standard way to put custom characters anywhere that characters belong and have things “just work”. Clearly right now things don’t just work. And without even bothering to try, I know that if I tried cutting and pasting from those PDFs into somewhere else, it won’t work. — Chris On Wed, Jun 3, 2015 at 11:20 PM, Philippe Verdy verd...@wanadoo.fr wrote: Note that copy-pasting from a PDF to another document is very tricky: the PDF format requires that embedded fonts use precise glyph naming conventions to map glyphs back to characters, otherwise the Unicode character sequences associated with a glyph (or multiple glyphs, if they are ligatured, in complex layouts, carry uncommon decorations, are rendered on a non-uniform background, or are filled with a pattern, such as labels over a photograph or cartographic map) will not be recognized. This remark about PDFs is also applicable to PostScript documents. Some PDF readers in that case attempt to perform some OCR (plus dictionary lookups to fix misreadings) for common glyph forms, but will almost always fail if the glyphs are too specific, such as when they include swashes, ligatures, or unknown scripts and scripts with complex layouts (such as the invented script created by William for noting sentences with specific characters with new glyphs, and a specific syntax and specific layout rules). In other cases the PDF reader will just put only a bitmap for the selection in the clipboard, and it will be another piece of software that attempts to interpret the bitmap with OCR. 
The glyph naming conventions are documented in the PDF specifications, but many PDF creators do not follow these rules, and copying text from these PDFs fails. 2015-06-03 15:03 GMT+02:00 Philippe Verdy verd...@wanadoo.fr: This possibly fails because William forgot to embed his font in the document itself (or Serif PagePlus forgets to do it when it creates the PDF document, and refuses to embed glyphs from the font that are bound to Unicode PUAs when it creates the embedded font). However, there is no such problem when creating PDFs with MS Office, or via the Adobe Acrobat printer driver or other printer drivers generating PDF files, including Google Cloud Print. So this could be a misuse of Serif PagePlus when creating the PDF (I don't know this software; maybe there are options set up that tell it not to embed fonts from a list of fonts that the recipient is supposed to have installed locally, to save storage space for the document by avoiding such embedding). Another reason may be that the font is marked as not embeddable in its exposed properties. Another reason may be that John tries to open the document with software that does not handle embedded fonts, or that ignores them and uses only the fonts preinstalled by John in his preferences. In that case the result depends only on the fonts preinstalled on his local system (which do not include the fonts created by William), or his software is set up to use exclusively a specific local Unicode font for all PUAs. 
(Software that behaved in this bad way included old versions of Internet Explorer, due to limitations of its text renderers; however, this should not happen with PDFs, provided you have used a correct plugin version for displaying PDFs in the browser. If this fails in the browser, download the document and view it with Adobe Reader instead of via the plugin: there are many PDF plugins on the market that do not support essential features, built just to display PDFs containing scanned bitmaps, with very poor support for text or vector graphics, or tuned specifically to reformat the document for another device or paper format.) Without citing which software is used (and which PDF in the list does not load correctly), it is difficult to tell, but for me I have no problems with the few docs I saw created by William. So: NO F = NO FAIL for me. 2015-06-03 13:38 GMT+02:00 John idou...@gmail.com: Yep, I clicked on your document and saw an empty square where your character should be. F = FAIL. — Chris On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington wjgo_10...@btinternet.com wrote: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). They are of limited usefulness precisely because
Re: Tag characters and in-line graphics (from Tag characters)
So what you’re saying is that the current situation where you see an empty square □ for unknown characters is better than seeing something useful? — Chris On Thu, Jun 4, 2015 at 12:59 AM, Doug Ewell d...@ewellic.org wrote: Chris idou747 at gmail dot com wrote: Right now, what happens if you have a domain or locale requirement for a special character? That's what the PUA is for. Assign a PUA code point to your special character, create a font which implements the PUA character, create a brief private agreement which states that this code point refers to that character and which mentions the font, put the private agreement on the web, and publish your document with a reference to the agreement. For most non-professionals, creating the font is the tricky part. Also see Section 23.5 of TUS. Note that I am disagreeing with Martin about the PUA being useful only as a scratch area for standardization. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
On Wed, Jun 3, 2015 at 5:46 PM Chris idou...@gmail.com wrote: I personally think emoji should have one, single definitive representation for this exact reason. Then you want an image. I don't see what's hard about that. The community interested in tony the tiger can make decisions like that. That is a hell of a handwave. In practice, you've got a complex decision that's always going to be a bit controversial, and one that most communities won't bother trying to make. You can’t know because they’re images. You can't know because the only obvious equivalence relation is exact image identity. You can’t iterate over compressed bits. You can’t process them. Why not? In any language I know of that has iterators, there would be no problem writing one that iterates over compressed input. If you need to mutate them, that is hard in compressed formats, but a new CPU can store War and Peace in the on-CPU cache.
Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))
Yep, I clicked on your document and saw an empty square where your character should be. F = FAIL. — Chris On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington wjgo_10...@btinternet.com wrote: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I’d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use. I have made fonts that include Private Use Area encodings using the High-Logic FontCreator program and then used those fonts in Serif PagePlus, both to produce PDF documents and PNG graphics, as needed for my particular project at the time. For example, http://forum.high-logic.com/viewtopic.php?f=10t=2957 http://forum.high-logic.com/viewtopic.php?f=10t=2672 William Overington 3 June 2015
Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))
Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I’d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use. I have made fonts that include Private Use Area encodings using the High-Logic FontCreator program and then used those fonts in Serif PagePlus, both to produce PDF documents and PNG graphics, as needed for my particular project at the time. For example, http://forum.high-logic.com/viewtopic.php?f=10t=2957 http://forum.high-logic.com/viewtopic.php?f=10t=2672 William Overington 3 June 2015
Re: Tag characters and in-line graphics (from Tag characters)
Compression is even more important today on mobile networks: mobile apps are very verbose over the net, and you can easily pay for the extra volume. In addition, mobile networks are frequently much slower than advertised; even if you pay the extra subscription to get 3G/4G, you depend on antennas and the number of people around you. In my home, 3G/4G in fact does not work at all, and this is the case in many places around my city, even though they are sold as having full coverage (for example, just downloading an application or updating it is simply impossible: I have to be at home, connected to my Wi-Fi router). When its internet link fails (this happens sometimes for several hours), I have extremely slow connections on 3G/4G (which is also overcrowded at the same time, and only delivers 2G speeds). Lots of people frequently have to put up with low bandwidth on mobile networks, independently of the price they paid for their subscription. So compressing data is still extremely important (even for texts or for the smallest web requests). Thankfully, compression is now part of the web transport, but this does not mean that apps need not learn to represent their interchanged data efficiently and develop less verbose protocols and APIs. 
There are now more people using mobile networks than fixed landline internet accesses (or home wifi routers connected to them), and even for the latter, fiber access is still just for a minority of people in dense areas; the others don't get more than a handful of megabit/s on their DSL access. If you look at worldwide internet connections, a large majority of people don't get more than 2 megabit/s: this is enough for reading/sending SMS, phone calls, or exchanging emails, but not if you need frequent updates to your apps, your apps are too verbose, and there are too many apps in the background. Many people cannot view videos on their mobile access, or only with very poor quality if they view them live (they also cannot download them slowly, due to lack of storage space on their mobile device, so videos have to remain short in total volume and duration). So I disagree: compression is absolutely needed, even more today than it was in the past, when mobile Internet accesses were still for a minority. Mobile networks are not really faster today (their bandwidth does not double every three years like the local performance of devices!). But with this extra local performance, you can support more complex compression schemes that require more CPU/GPU power, which is no longer a bottleneck; the real bottleneck is the effectively available bandwidth of the mobile network (smaller than the connection bandwidth because this bandwidth is shared... and expensive). 2015-06-03 15:24 GMT+02:00 David Starner prosfil...@gmail.com: Chris wrote: There is no way to compare 2 HTML elements and know they are talking about the same character That's because character identity is a hard problem. Is the emoji TIGER the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN?
http://www.engadget.com/2014/04/30/you-may-be-accidentally-sending-friends-a-hairy-heart-emoji/ Note that even in Unicode, the set ẛ ᷥ ſ ṡ s S Ŝ may be considered the same character or up to seven different characters, depending on case-folding, canonicalization and accent dropping. Similarly, there is no way to search or index html elements. If an HTML document contained an image of a particular custom character, there would be no way to ask google or whatever to find all the documents with that character. Different documents would represent it differently. You can index links to images. If two documents represent it differently, then I go back to the above; we can't know that they're the same thing. On Tue, Jun 2, 2015 at 7:11 PM Chris idou...@gmail.com wrote: You can’t ask the entire computing universe to compress everything all the time. Anytime we care about how much space text takes up, it should be compressed. It compresses very well. On the other hand, it's rare that anyone cares anymore; what's a few hundred kilobytes between friends?
Re: Tag characters and in-line graphics (from Tag characters)
Earlier in this thread, on 2 June 2015, I wrote as follows: A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character. I have now thought of a mechanism to use. Please imagine the base character followed by a sequence of tag characters, the tag characters here represented by ordinary letters and digits. Here is an example of the mechanism for defining the glyph for U+E702 in a particular document as 7 red pixels. HE702U7r The tag H character switches to hexadecimal input mode, then there are as many tag characters as necessary to express in hexadecimal notation the code point of the character for which the definition is being made, then there is a tag U character to action the definition and go out of hexadecimal input mode. The tag 7r is to express 7 red pixels. In practice the number of tag characters after the tag U character might be around 200; the above tag 7r is just a minimal example so as to explain the concept. While posting, may I mention please one other matter? Previously I mentioned using tag R, tag G and tag B in defining colours. I now add tag A into that colour definition so as to define opacity, that is, what is sometimes called transparency, where 0 means totally transparent and 255 means totally opaque. If no value is stated for A then it should be presumed to have a value of 255, so that the default situation is to define opaque colours. I feel that the information in this thread is now a good basis for the assessment of this suggested format as to whether it could be a useful open source system with good interoperability potential that could usefully be submitted to the Unicode Technical Committee. William Overington 3 June 2015
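A small sketch (in Python, with ordinary letters standing in for the tag characters, just as in William's own examples) of how a receiver might split such a definition into its code point and glyph-data parts. The function name is invented for illustration; this is a discussion aid, not an implementation of anything standardized:

```python
def parse_definition(tags):
    """Parse a tag sequence such as 'HE702U7r' under the proposed
    scheme: tag H starts hexadecimal input, tag U ends it and binds
    the definition to that code point, and the remaining tags are
    the glyph data (here, '7r' meaning 7 red pixels)."""
    if not tags.startswith("H"):
        raise ValueError("definition must start with tag H")
    u = tags.index("U")          # hex digits never include U, so this is safe
    codepoint = int(tags[1:u], 16)
    glyph_data = tags[u + 1:]
    return codepoint, glyph_data

cp, data = parse_definition("HE702U7r")
# cp is 0xE702 and data is "7r"
```

In a real document the glyph data after the tag U would run to a couple of hundred tag characters, as William notes, but the parsing step is the same.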
Re: Tag characters and in-line graphics (from Tag characters)
Chris wrote: There is no way to compare 2 HTML elements and know they are talking about the same character That's because character identity is a hard problem. Is the emoji TIGER the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN? http://www.engadget.com/2014/04/30/you-may-be-accidentally-sending-friends-a-hairy-heart-emoji/ Note that even in Unicode, the set ẛ ᷥ ſ ṡ s S Ŝ may be considered the same character or up to seven different characters, depending on case-folding, canonicalization and accent dropping. Similarly, there is no way to search or index html elements. If an HTML document contained an image of a particular custom character, there would be no way to ask google or whatever to find all the documents with that character. Different documents would represent it differently. You can index links to images. If two documents represent it differently, then I go back to the above; we can't know that they're the same thing. On Tue, Jun 2, 2015 at 7:11 PM Chris idou...@gmail.com wrote: You can’t ask the entire computing universe to compress everything all the time. Anytime we care about how much space text takes up, it should be compressed. It compresses very well. On the other hand, it's rare that anyone cares anymore; what's a few hundred kilobytes between friends?
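Starner's point, that whether those letters count as "the same character" depends entirely on which foldings you apply, can be demonstrated with Python's unicodedata module. (The combining letter ᷥ is omitted below, since it has no case folding of its own; the particular folding chosen here is just one aggressive possibility among many.)

```python
import unicodedata

variants = ["ẛ", "ſ", "ṡ", "s", "S", "Ŝ"]

def fold(ch):
    # One (deliberately aggressive) notion of sameness:
    # casefold, compatibility-decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", ch.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

folded = {fold(ch) for ch in variants}
# With this folding, all six collapse to plain "s";
# with no folding at all, they are six distinct characters.
```

Search engines, collation, and spell checkers each pick a different point on this spectrum, which is exactly why "the same character" is not a well-defined question.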
Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))
Note that copy-pasting from a PDF to another document is very tricky: the PDF format requires that embedded fonts use precise glyph naming conventions to map glyphs back to characters; otherwise the Unicode character sequences associated with a glyph (or multiple glyphs, if they are ligatured, in complex layouts, with uncommon decorations, rendered on a non-uniform background, or filled with a pattern, such as labels over a photograph or cartographic map) will not be recognized. This remark about PDFs is also applicable to PostScript documents. Some PDF readers in that case attempt to perform some OCR (plus dictionary lookups to fix misreadings) for common glyph forms, but will almost always fail if the glyphs are too specific, such as when they include swashes, ligatures, or unknown scripts and scripts with complex layouts (such as the invented script created by William for noting sentences, with specific characters with new glyphs, a specific syntax, and specific layout rules). In other cases the PDF reader will just put only a bitmap of the selection in the clipboard, and it will be other software that attempts to interpret the bitmap with OCR. The glyph naming conventions are documented in the PDF specifications, but many PDF creators do not follow these rules, and copying text from these PDFs fails. 2015-06-03 15:03 GMT+02:00 Philippe Verdy verd...@wanadoo.fr: This possibly fails because William possibly forgot to embed his font in the document itself (or Serif PagePlus forgets to do it when it creates the PDF document, or refuses to embed glyphs from the font that are bound to Unicode PUAs when it creates the embedded font). However there is no such problem when creating PDFs with MS Office, or via the Adobe Acrobat printer driver or other printer drivers generating PDF files, including Google Cloud Print.
So this could be a misuse of Serif PagePlus when creating the PDF (I don't know this software; maybe there are options set up that tell it not to embed fonts from a list of fonts that the recipient is supposed to have installed locally, to save storage space for the document by avoiding such embedding). Another reason may be that the font is marked as not embeddable within its exposed properties. Another reason may be that John tries to open the document with software that does not handle embedded fonts, or that ignores them and uses only the fonts preinstalled by John in his preferences. In such a case the result depends only on the fonts preinstalled on his local system (which does not include the fonts created by William), or his software is set up to use exclusively a specific local Unicode font for all PUAs. (Software that behaved in this bad way included old versions of Internet Explorer, due to limitations of its text renderers; however this should not happen with PDFs, provided you have used a correct plugin version for displaying PDFs in the browser. If this fails in the browser, download the document and view it with Adobe Reader instead of via the plugin: there are many PDF plugins on the market that do not support essential features, are just built to display PDFs containing scanned bitmaps with very poor support for text or vector graphics, or are tuned specifically to change the document for another device or paper format.) Without citing which software is used (and which PDF in the list does not load correctly), it is difficult to tell, but for me I have no problems with a few docs I saw created by William. So: NO F = NO FAIL for me. 2015-06-03 13:38 GMT+02:00 John idou...@gmail.com: Yep, I clicked on your document and saw an empty square where your character should be. F = FAIL.
— Chris On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington wjgo_10...@btinternet.com wrote:
Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))
This possibly fails because William possibly forgot to embed his font in the document itself (or Serif PagePlus forgets to do it when it creates the PDF document, or refuses to embed glyphs from the font that are bound to Unicode PUAs when it creates the embedded font). However there is no such problem when creating PDFs with MS Office, or via the Adobe Acrobat printer driver or other printer drivers generating PDF files, including Google Cloud Print. So this could be a misuse of Serif PagePlus when creating the PDF (I don't know this software; maybe there are options set up that tell it not to embed fonts from a list of fonts that the recipient is supposed to have installed locally, to save storage space for the document by avoiding such embedding). Another reason may be that the font is marked as not embeddable within its exposed properties. Another reason may be that John tries to open the document with software that does not handle embedded fonts, or that ignores them and uses only the fonts preinstalled by John in his preferences. In such a case the result depends only on the fonts preinstalled on his local system (which does not include the fonts created by William), or his software is set up to use exclusively a specific local Unicode font for all PUAs. (Software that behaved in this bad way included old versions of Internet Explorer, due to limitations of its text renderers; however this should not happen with PDFs, provided you have used a correct plugin version for displaying PDFs in the browser. If this fails in the browser, download the document and view it with Adobe Reader instead of via the plugin: there are many PDF plugins on the market that do not support essential features, are just built to display PDFs containing scanned bitmaps with very poor support for text or vector graphics, or are tuned specifically to change the document for another device or paper format.)
Without citing which software is used (and which PDF in the list does not load correctly), it is difficult to tell, but for me I have no problems with a few docs I saw created by William. So: NO F = NO FAIL for me. 2015-06-03 13:38 GMT+02:00 John idou...@gmail.com: Yep, I clicked on your document and saw an empty square where your character should be. F = FAIL. — Chris On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington wjgo_10...@btinternet.com wrote:
Re: Tag characters and in-line graphics (from Tag characters)
Chris idou747 at gmail dot com wrote: Right now, what happens if you have a domain or locale requirement for a special character? That's what the PUA is for. Assign a PUA code point to your special character, create a font which implements the PUA character, create a brief private agreement which states that this code point refers to that character and which mentions the font, put the private agreement on the web, and publish your document with a reference to the agreement. For most non-professionals, creating the font is the tricky part. Also see Section 23.5 of TUS. Note that I am disagreeing with Martin about the PUA being useful only as a scratch area for standardization. -- Doug Ewell | http://ewellic.org | Thornton, CO
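A small illustration of why the rest of Doug's recipe (the font and the published private agreement) is needed: the character properties data gives a PUA code point essentially nothing beyond its general category, so the system can render and word-break it only generically. A quick check in Python:

```python
import unicodedata

# U+E702 is the BMP Private Use Area code point used as an example
# elsewhere in this thread.
ch = "\uE702"

# The only property the standard guarantees here is the general
# category "Co" (Other, private use); the glyph and the meaning are
# entirely up to the private agreement between sender and receiver.
category = unicodedata.category(ch)
```

The same holds for the supplementary private-use planes (e.g. U+F0000), which is why PUA text cannot be searched, folded, or segmented meaningfully without out-of-band agreement.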
Tag characters and in-line graphics (from Tag characters)
Chris idou747 at gmail dot com wrote: Why shouldn’t there be a standard way to go out on the net and find the canonical glyph for a code? Because there isn't one. Glyphs are suggestions, meant to convey the identity of the character. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
2015-06-04 2:59 GMT+02:00 David Starner prosfil...@gmail.com: You can’t iterate over compressed bits. You can’t process them. Why not? In any language I know of that has iterators, there would be no problem writing one that iterates over compressed input. If you need to mutate them, that is hard in compressed formats, but a new CPU can store War and Peace in the on-CPU cache. You're right, today the CPU is no longer the bottleneck. The bottlenecks are now:
* the speed of long buses and communication links, with their limited (and costly) bandwidth, as these are shared media used by more and more people but requiring massive infrastructures, plus physical constraints even on the fastest serial buses, both implying transmission round-trip times (limiting random access, which is a severe problem now that we have to access extremely large volumes of data distributed over multiple devices or over a full network);
* the storage capacity of the fastest storage media (such as flash memory, which is the only option for mobile devices, but also the most expensive).
In both cases you need compression (the second bottleneck, on storage volumes, will fade out in a few years, but not the bandwidth constraints). It really pays now to use compression schemes, even the most complex ones such as those used to transmit live video: locally, a CPU or GPU will easily handle the compression scheme.
Research on compression schemes is really not finished; it has never been as active as it is today, including for text, because of the explosion of data volumes, even if the volume of text is now largely overwhelmed by the volume of images, video and audio. (But you can't compute a lot of things from audio/image/video data sources; we still need text for giving semantics to these media, from which you can derive data or perform searches. There is still a lot to do for handling images and audio speech and detecting semantics in them, but you won't get as much information from audio/video as can be represented by text: OCR, for example, is a very heuristic process that produces lots of false guesses, still far more than human brains do across the broad range of variations that we call cultures; computers are still very poor at recognizing cultures with as many variations as those we recognize through social interactions and years of education and *personal* experience.)
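Starner's point above, that any language with iterators can iterate over compressed input without materializing the whole text, can be sketched with Python's standard gzip module (the sample text and line count are invented for the demonstration):

```python
import gzip
import io

# Build some repetitive sample text and compress it in memory.
text = "\n".join(f"line {i}" for i in range(1000)) + "\n"
compressed = gzip.compress(text.encode("utf-8"))

def iter_lines(blob):
    """Yield lines straight out of a gzip blob; decompression happens
    incrementally, so the full text never needs to exist at once."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

count = sum(1 for _ in iter_lines(compressed))
```

The mutation case is the hard one, as Starner says: random writes into a compressed stream generally force re-encoding from the edit point onward.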
Re: Tag characters and in-line graphics (from Tag characters)
Chris idou747 at gmail dot com wrote: So what you’re saying is that the current situation where you see an empty square □ for unknown characters is better than seeing something useful? No, that's why you include a reference to the font in the private agreement, so that interested parties can install it and see the special character(s). -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
Once again, no! Unicode is a standard for encoding characters, not for encoding some syntactic element of a glyph definition! Your project is out of scope. You still want to reinvent the wheel. For creating syntax, define it within a language, which does not need new characters (you're not creating an APL grammar using specific symbols for some operators more or less based on Greek letters and geometric shapes: those are just like mathematical symbols). Programming languages and data languages (Javascript, XML, JSON, HTML...) and their syntax are themselves encoded in plain text documents using standard characters and don't need new characters, APL being an exception only because computers and keyboards were produced to facilitate the input (those that don't have such keyboards used specific editors or the APL runtime environment that offers an input method for entering programs in this APL input mode). And again you want the chicken before the egg: have you even read the encoding policy? The UCS will not encode characters without a demonstrated usage. Nothing in what you propose is really used, except being proposed only by you, and used only by you for your private use (or with a few of your unknown friends, but this is invisible and unverifiable). Nothing has been published. Even currency symbols are an exception to the demonstrated-use requirement only because, once created, they are extremely rapidly needed by a lot of people, in fact most people of a region as large as a country, and many other countries that will reference or use them. But even in this case, what is encoded is the character itself, not the glyph or new characters used to define the glyph! Can you stop proposing off-topic subjects like this on this list? You are not speaking about Unicode or characters. Another list would be more appropriate. You help no one here, because all you want is to change radically the goals of TUS.
2015-06-02 11:01 GMT+02:00 William_J_G Overington wjgo_10...@btinternet.com : Perhaps the solution to at least some of the various issues that have been discussed in this thread is to define a tag letter z as a code within the local glyph memory requests, as follows.
Re: Tag characters and in-line graphics (from Tag characters)
Perhaps the solution to at least some of the various issues that have been discussed in this thread is to define a tag letter z as a code within the local glyph memory requests, as follows. Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document:
3t7r means this is local glyph 3 being defined at its first use in the document as 7 red pixels
3h here local glyph 3 is being used
3z7r means this is local glyph 3 being defined, though not used, at the start of the document as 7 red pixels
More than one local glyph could be defined at the start of the document, as desired. This would mean that use of such a glyph within the document would be by just using the quite short sequence of the base character followed by tag characters using the h request. This would enable document editing to be easier to accomplish. A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character. May I mention something that I forgot to mention earlier please? When only one pixel of a particular colour is being specified, it can be specified using just the code for the colour. For example, for 1 red pixel please use r on its own; there is no need to use 1r, though 1r should be made to work just in case anyone does use that format. There was a time when I used to use the FORTH programming language, and this format of first inputting the number then the operator is based on the way that the FORTH programming language works. William Overington 2 June 2015 Original message From : wjgo_10...@btinternet.com Date : 27/05/2015 - 17:26 (GMTST) To : unicode@unicode.org Subject : Tag characters and in-line graphics (from Tag characters) Tag characters and in-line graphics (from Tag characters) This document suggests a way to use the method of a base character together with tag characters to produce a graphic.
The approach is theoretical and has not, at this time, been tried in practice. The application in mind is to enable the graphic for an emoji character to be included within a plain text stream, though there will hopefully be other applications. The base character could be either an existing character, such as U+1F5BC FRAME WITH PICTURE, or a new character as decided. Tests could be carried out using a Private Use Area character as the base character. The explanation here is intended to explain the suggested technique by examples, as a basis for discussion. In each example, please consider that the characters listed are each the tag version of the character used here and that they all as a group follow one base character. The examples are deliberately short so as to explain the idea. A real use example might have around two hundred or so tag characters following the base character, maybe more, sometimes fewer.
Examples of displays: Each example is left to right along the line, then lines down the page from upper to lower.
7r means 7 pixels red
7r5y means 7 pixels red then 5 pixels yellow
7r5y-3b means 7 pixels red then 5 pixels yellow then next line then 3 pixels blue
Examples of colours available:
k black
n brown
r red
o orange
y yellow
g green (0, 255, 0)
b blue
m magenta
e grey
w white
c cyan
p pink
d dark grey
i light grey (thus avoiding using lowercase l so as to avoid confusion with figure 1)
f deeper green (foliage colour) (0, 128, 0)
Next line request: - moves to the next line
Local palette requests:
192R224G64B2s means store as local palette colour 2 the colour (R=192, G=224, B=64)
7,2u means 7 pixels using local palette colour 2
Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document:
3t7r means this is local glyph 3 being defined at its first use in the document as 7 red pixels
3h here local glyph 3 is being used
The above is for bitmaps.
It would be possible to use a similar technique to specify a vector glyph as used in fontmaking using on-curve and off-curve points specified as X, Y coordinates together with N for on-curve and F for off-curve. There would need to be a few other commands so as to specify places in the tag character stream where definition of a contour starts and so as to separate the definitions of the glyphs for a colour font and so on. This could be made OpenType compatible so that a received glyph could be added into a font. Please feel free to suggest improvements. One improvement could be as to how to build a Unicode code point into a picture so that a font could be transmitted. William Overington 27 May 2015
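As a discussion aid only, here is a minimal Python sketch of a decoder for the bitmap part of the proposal: run-length colour codes and the next-line request. Local palettes and local glyph memory are left out, and most of the RGB values in the table are assumptions, since only g and f are given numerically in the proposal.

```python
import re

# Colour letters from the proposal; only g and f have values stated
# in the thread, the other RGB triples are assumed for illustration.
COLOURS = {"k": (0, 0, 0), "r": (255, 0, 0), "y": (255, 255, 0),
           "g": (0, 255, 0), "b": (0, 0, 255), "w": (255, 255, 255),
           "f": (0, 128, 0)}

TOKEN = re.compile(r"(\d*)([a-z-])")

def decode(tags):
    """Decode a run-length tag string such as '7r5y-3b' into rows of
    RGB pixels. '-' moves to the next line; a bare colour letter means
    one pixel, per the 'use r on its own' rule in the proposal."""
    rows, row = [], []
    for count, op in TOKEN.findall(tags):
        if op == "-":
            rows.append(row)
            row = []
        else:
            n = int(count) if count else 1
            row.extend([COLOURS[op]] * n)
    rows.append(row)
    return rows

pixels = decode("7r5y-3b")
# Two rows: 7 red + 5 yellow pixels, then 3 blue pixels.
```

A real decoder would also have to resolve the ambiguity between a run count and a local glyph number (7r vs 3h), which the proposal distinguishes only by the operator letter that follows the digits.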
Re: Tag characters and in-line graphics (from Tag characters)
On 2015/06/03 07:55, Chris wrote: As you point out, “The UCS will not encode characters without a demonstrated usage.” But there are use cases for characters that don’t meet UCS’s criteria for a world wide standard, but are necessary for more specific use cases, like specialised regional, business, or domain specific situations. Unicode contains *a lot* of characters for specialized regional, business, or domain specific situations. My question is, given that unicode can’t realistically (and doesn’t aim to) encode every possible symbol in the world, why shouldn’t there be an EXTENSIBLE method for encoding, so that people don’t have to totally rearchitect their computing universe because they want ONE non-standard character in their documents? As has been explained, there are technologies that allow you to do (more or less) that. Information technology, like many other technologies, works best when finding common cases used by many people. Let's look at some examples: Character encodings work best when they are used widely and uniformly. I don't know anybody who actually uses all the characters in Unicode (except the guys that work on the standard itself). So for each individual, a smaller set would be okay. And there were (and are) smaller sets, not for individuals, but for countries, regions, scripts, and so on. Originally (when memory was very limited), these legacy encodings were more efficient overall, but that's no longer the case. So everything is moving towards Unicode. Most Website creators don't use all the features in HTML5. So having different subsets for different use cases may seem to be convenient. But overall, it's much more efficient to have one Hypertext Markup Language, so that's where everybody is converging. From your viewpoint, it looks like having something in between character encodings and HTML is what you want. It would only contain the features you need, and nothing more, and would work in all the places you wanted it to work.
Asmus's inline text may be something similar. The problem is that such an intermediate technology only makes sense if it covers the needs of lots and lots of people. It would add a third technology level (between plain text and marked-up text), which would divert energy from the current two levels and make things more complicated. Up to now, such a third level hasn't emerged, among other reasons because both existing technologies have been good at absorbing the most important use cases from the middle. Unicode continues to encode whatever symbols gain reasonable popularity, so every time somebody has a really good use case for the middle layer with a symbol that isn't yet in Unicode, that use case gets taken away. HTML (or Web technology in general) has also worked to improve the situation, with technologies such as SVG and Web Fonts. No technology is perfect, and so there are still some gaps between character encoding and markup, some of which may in due time be filled, but I don't think a third layer in the middle will emerge soon. Regards, Martin.
Re: Tag characters and in-line graphics (from Tag characters)
I was asking why the glyphs for right arrow ➡ are inconsistent in many sources, through a couple of iterations of unicode. Perhaps I might observe that one of the reasons is there is no technical link between the code and the glyph. I can’t realistically write a display engine that goes to unicode.org or wherever, and dynamically finds the right standard glyph for unknown codes. This is also manifest in my seeing empty squares □ for characters my platform doesn’t know about. This isn’t the case with XML, where I can send someone a random XML document, and there is a standard way to go out there on the internet and check if that XML is conformant. Why shouldn’t there be a standard way to go out on the net and find the canonical glyph for a code? If there was, then non-standard glyphs would fall out of that technology naturally. So people are talking about all these technologies that are out there, html5, cmap, fonts and so forth, but there is no standard way to construct a list of “characters”, some of which might be non-standard, and be able to embed that ANYWHERE one might reasonably expect characters, have it processed in a normal way as characters, be sent anywhere and understood. As you point out, “The UCS will not encode characters without a demonstrated usage.” But there are use cases for characters that don’t meet UCS’s criteria for a world wide standard, but are necessary for more specific use cases, like specialised regional, business, or domain specific situations. My question is, given that unicode can’t realistically (and doesn’t aim to) encode every possible symbol in the world, why shouldn’t there be an EXTENSIBLE method for encoding, so that people don’t have to totally rearchitect their computing universe because they want ONE non-standard character in their documents? Right now, what happens if you have a domain or locale requirement for a special character?
Most likely you suffer without it, because even though you could get it to render in some situations (like hand coding some IMGs into your web site), you just know you won’t be able to realistically input it into emails, word documents, spreadsheets, and whatever other random applications on a daily basis. What I’m asking is: is it really beyond the unicode consortium’s scope, and/or would it really be a redundant technology, to, for example, define a UTF-64 coding format, where 32 bits allow 4 billion businesses and individuals to define their own character sets (each of up to 4 billion characters), and then have standard places on the internet (similar to DNS lookup servers) that can provide anyone with glyphs and fonts for it? Right now, yes there are cmaps, but no standard way to combine characters from different encodings. No standard way to find the cmap for an unknown encoding. There is HTML5, but that doesn’t produce something that is recognisable as a list of characters that can be processed as such. (If there is an IMG in text, is it a “character” or an illustration in the text? How can you refer to a particular set of characters without having your own web server? How do you render that text bigger, with the standard reference glyph, without manually searching the internet to find it? There is a host of problems here.) All these problems look unsolved to me, and they also look like encoding technology problems to me too. What other consortium out there is working on character encoding problems? On 2 Jun 2015, at 7:40 pm, Philippe Verdy verd...@wanadoo.fr wrote: Once again, no! Unicode is a standard for encoding characters, not for encoding some syntactic element of a glyph definition! Your project is out of scope. You still want to reinvent the wheel.
For creating syntax, define it within a language, which does not need new characters. (You're not creating an APL grammar using specific symbols for some operators more or less based on Greek letters and geometric shapes; those are just like mathematical symbols.) Programming languages and data languages (JavaScript, XML, JSON, HTML...) and their syntax are themselves encoded in plain-text documents using standard characters and don't need new characters, APL being an exception only because computers and keyboards were produced to facilitate the input (those that don't have such keyboards use specific editors, or the APL runtime environment that offers an input method for entering programs in this APL input mode).

And again you want the chicken before the egg: have you even read the encoding policy? The UCS will not encode characters without a demonstrated usage. Nothing in what you propose is really used, except being proposed only by you, and used only by you for your private use (or with a few of your unknown friends, but this is invisible and unverifiable). Nothing has been published. Even for currency symbols (which are an exception to the demonstrated use,
Re: Tag characters and in-line graphics (from Tag characters)
Martin, you seem to be labouring under the impression that HTML5 is a substitute for character encoding. If it is, why do we need Unicode? We could just have documents laden with IMG tags and restrict ourselves to ASCII. It seems I need to spell out one more time why HTML is not character encoding:

1. HTML5 doesn’t separate one particular representation (font, size, etc.) from the actual meaning of the character. So you can’t paste it somewhere and expect to increase its point size or change its font.
2. It’s highly inefficient in space to drop multi-kilobyte strings into a document to represent one character.
3. The entire design of HTML has nothing to do with characters. So there is no way to process a string of characters interspersed with HTML elements and know which of those elements are a “character”. This makes programmatic manipulation impossible, and means most computer applications simply will not allow HTML in scenarios where they expect a list of “characters”.
4. There is no way to compare 2 HTML elements and know they are talking about the same character. I could put some HTML representation of a character in my document, you could put a different one in, and there would be absolutely no way to know that they are the same character. Even if we are in the same community and agree on the existence of this character.
5. Similarly, there is no way to search or index HTML elements. If an HTML document contained an image of a particular custom character, there would be no way to ask Google or whatever to find all the documents with that character. Different documents would represent it differently.

HTML is a rendering technology. It makes things LOOK a particular way, without actually ENCODING anything about it. The only part of HTML that is actually searchable in a deterministic fashion is the part that is encoded: the Unicode part. Unicode encodes symbols that have “reasonable popularity”. (a) That is not all of them.
(b) How can a symbol attain reasonable popularity when it is not in Unicode? Of course some can, but others have their popularity hindered by the very fact that they are not encoded! Take the poop emoji that people recently have been talking about here. It gained popularity because the Japanese telecom companies decided to encode it. If they hadn’t encoded it, would it have become popular through normal culture, such that the Unicode Consortium would have adopted it? No it wouldn’t! The Japanese telcos were able to do this because they controlled their entire user base from hardware on up to encodings. That won’t be happening in the future, so new, interesting and potentially universal emojis won’t ever come into existence in the way that this one did, because of the control the Unicode Consortium exercises over this technology. But the problem isn’t restricted to emojis; many other potentially popular symbols can’t come into existence either. The internet *COULD* be the birthplace of lots of interesting new symbols in the same way that the Japanese telecom companies birthed the original emojis, but it won’t be, because the Unicode Consortium rules it from the top down.

Summary:
1. HTML renders stuff; it encodes nothing. It addresses a completely different problem domain. If rendering and encoding were the same problem, Unicode could disband now.
2. Unicode encodes stuff, but isn’t extensible in a way that is broadly useful, i.e. in a way that allows anybody (or any application) receiving a custom character to know what it is, or how to render it, or how to combine it with other custom character sets.
3. The problem under discussion is not a rendering problem. HTML5 lacks nothing in terms of ability to render. Yet the problem remains. Because it’s an encoding problem. Encoding problems are in the Unicode domain, not in the HTML5 domain.

You say that character encodings work best when they are used widely and uniformly.
But they can only be as wide or as uniform as reality itself. We could try to conform reality to technology and, for example, force all the world to use Latin characters and 128 ASCII representations. OR we can conform technology to reality. Not all encodings need to be, or ought to be, so universal as to require one worldwide committee to pass judgment on them.

On 3 Jun 2015, at 11:09 am, Martin J. Dürst due...@it.aoyama.ac.jp wrote: On 2015/06/03 07:55, Chris wrote: As you point out, “The UCS will not encode characters without a demonstrated usage.” But there are use cases for characters that don’t meet UCS’s criteria for a worldwide standard, but are necessary for more specific use cases, like specialised regional, business, or domain-specific situations. Unicode contains *a lot* of characters for specialized regional, business, or domain-specific situations. My question is, given that unicode can’t realistically (and doesn’t aim to) encode every possible symbol in the world, why shouldn’t
Re: Tag characters and in-line graphics (from Tag characters)
On 3 Jun 2015, at 11:22 am, Martin J. Dürst due...@it.aoyama.ac.jp wrote: On 2015/05/29 11:37, John wrote: If I had a large document that reused a particular character thousands of times, Then it would be either a very boring document (containing almost only that same character) or it would be a very large document.

If you have a daughter, look at her Facebook Messenger, and then get back to me.

would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way? If you want space efficiency, the best thing to do is to use generic compression. Many generic compression methods are available, many of them are widely supported, and all of them will deal with your case in a very efficient way.

You can’t ask the entire computing universe to compress everything all the time. And that is what your comment amounts to. Because the whole point under discussion is how we can encode stuff such that you can hope to universally move it around between different documents, formats, applications, input fields and platforms without any massage.

Given that it's been agreed that private use ranges are a good thing, That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts).

They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I’d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be?

and given that we can agree that exchanging data is a good thing, Yes, but there are many other ways to do that besides Unicode.
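Martin's "generic compression" point is easy to demonstrate. A rough sketch (the surrounding text and the PUA code point U+E000 are invented for illustration):

```python
import zlib

# A "large document" that reuses one PUA character thousands of times.
doc = ("some ordinary text " + "\uE000" * 5000).encode("utf-8")

packed = zlib.compress(doc)

# The run of repeated 3-byte UTF-8 sequences collapses to almost nothing;
# generic LZ77-style compression handles this case without any help from
# the character encoding itself.
assert len(packed) < len(doc) // 10
assert zlib.decompress(packed).decode("utf-8") == "some ordinary text " + "\uE000" * 5000
```

Chris's counterpoint stands independently of this: compression fixes the size, but not the interchange or identity problems he lists.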
And for many purposes, these other ways are better suited. The point is a universally recognised way. Of course you, me or anybody could design many good ways to solve any problem we might come up with. That doesn’t mean it will interoperate with anybody else though. maybe something should bring those two things together. Just a thought. Just a 'non sequitur'. Regards, Martin.
Re: Tag characters and in-line graphics (from Tag characters)
On 2015/05/29 11:37, John wrote: If I had a large document that reused a particular character thousands of times, Then it would be either a very boring document (containing almost only that same character) or it would be a very large document. would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way? If you want space efficiency, the best thing to do is to use generic compression. Many generic compression methods are available, many of them are widely supported, and all of them will deal with your case in a very efficient way. Given that it's been agreed that private use ranges are a good thing, That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). and given that we can agree that exchanging data is a good thing, Yes, but there are many other ways to do that besides Unicode. And for many purposes, these other ways are better suited. maybe something should bring those two things together. Just a thought. Just a 'non sequitur'. Regards, Martin.
Re: Tag characters and in-line graphics (from Tag characters)
No, nothing about what you propose, which is to encode graphics directly with a custom syntax using specific Unicode characters for this syntax itself. There's no such statement in the UTR, even for the longer term. What is proposed instead is a way to *reference* (not define) graphics. For the rest, you need a rich-text format to embed graphics (using the syntax of this rich-text format, such as HTML), but this syntax remains out of scope of Unicode, which will not standardize any graphic format, or any language by its syntax. Even for CLDR, you will use some JSON or XML rich-text format to create references, or embed some small graphics. But CLDR is NOT part of the Unicode Standard itself, and does not encode new characters (and I've not seen the CLDR requesting additions in the UCS for its own use; instead it uses its own assignments for PUAs where needed, as well as its own private locale tags for internal references within the CLDR data itself).

2015-06-02 12:37 GMT+02:00 William_J_G Overington wjgo_10...@btinternet.com : Responding to Philippe Verdy: Nothing has been published. It has been published. It is published in this thread for discussion prior to a possible submission to the Unicode Technical Committee that could take place if people on this mailing list feel that it is a good solution to the problem raised in section 8 of the following document. http://www.unicode.org/reports/tr51/tr51-2.html Direct link to 8 Longer Term Solutions http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term William Overington 2 June 2015
Re: Tag characters and in-line graphics (from Tag characters)
On 2015-06-02, William_J_G Overington wjgo_10...@btinternet.com wrote: take place if people on this mailing list feel that it is a good solution to the problem raised in section 8 of the following document. http://www.unicode.org/reports/tr51/tr51-2.html That section does not raise a problem. It says what the solution to the emoji problem is: namely that people who want to embed graphics in text should fix their protocols to allow it, instead of subverting Unicode to do it. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Tag characters and in-line graphics (from Tag characters)
Responding to Philippe Verdy: Nothing has been published. It has been published. It is published in this thread for discussion prior to a possible submission to the Unicode Technical Committee that could take place if people on this mailing list feel that it is a good solution to the problem raised in section 8 of the following document. http://www.unicode.org/reports/tr51/tr51-2.html Direct link to 8 Longer Term Solutions http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term William Overington 2 June 2015
Re: Tag characters and in-line graphics (from Tag characters)
On 6/2/2015 2:01 AM, William_J_G Overington wrote: Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document: Um, that technology already exists. It is called a font. A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character. And that mechanism has also already been defined. It is called a cmap: http://www.microsoft.com/typography/otspec/cmap.htm --Ken
Re: Tag characters and in-line graphics (from Tag characters)
2015-06-01 1:33 GMT+02:00 Chris idou...@gmail.com: Of course, anyone can invent a character set. The difficult bit is having a standard way of combining custom character sets. That’s why a standard would be useful. And while stuff like this can, to some extent, be recognised by magic numbers and unique strings in headers, such things are unreliable. Just because example.net/mycharset/ appears near the start of a document doesn’t necessarily mean it was meant to define a character set. Maybe it was a document discussing character sets.

That's not what I described. I spoke about using a MIME-compatible private charset identifier, and how such a private identifier can be made reasonably unique by binding it to a domain name or URI. If you had read more carefully, I also said that it was absolutely not necessary to dereference that URL: there are many XML schemas binding their namespaces to a URI which is itself not a webpage, nor any downloadable DTD, XML schema, or XML stylesheet. Google and Microsoft use this a lot in lots of schemas (which are not necessarily described or documented at that URI, if they are documented at all). The URI by itself is just an identifier; it becomes a webpage only when you use it in a web page with an href attribute to create a hyperlink, or to perform some query to a service returning some data. An identifier for a private charset does not need to perform any request to be usable by itself; we just have the identifier, which is sufficient by itself.
The URI can also be only a base URI for a collection of resources (whose URLs start with this base URI, with conventional extensions appended to get the character properties, or a font; but the best way is to embed this data in your document, in some header or footer, if the document using the private charset is not part of a collection of docs using the same private charset).

In that case, you don't need a new UTF: UTF-8 remains usable, and you can map your private charset to standard PUAs (and/or to hacked characters) according to the private charset's needs. The charset indicated in your document (by some meta header) should be sufficient to avoid collisions with other private conventions; it will define the scope of your private charset as the document itself, which will then be interchangeable (and possibly mixable with other documents, with some renumbering if there are collisions of assignments between two distinct private charsets: in the document header, add to the charset identifier the range of PUAs which is used; then, with two documents colliding on this range, you can re-encode one automatically by creating a compound charset with subranges of PUAs remapped to other ranges).
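Verdy's renumbering idea (mechanically re-encoding one document's PUA range when two private charsets collide) can be sketched in a few lines. The ranges and sample strings below are arbitrary, and a real scheme would also have to rewrite the document's header to record the new assignment:

```python
def remap_pua(text: str, old_start: int, old_end: int, new_start: int) -> str:
    """Move every code point in [old_start, old_end] to the
    equivalently offset position in a range beginning at new_start."""
    out = []
    for ch in text:
        cp = ord(ch)
        if old_start <= cp <= old_end:
            cp = new_start + (cp - old_start)
        out.append(chr(cp))
    return "".join(out)

# Documents A and B both claimed U+E000..U+E0FF for different private
# charsets; re-encode B into U+E100..U+E1FF before concatenating them.
doc_b = "\uE000\uE001"
doc_b_remapped = remap_pua(doc_b, 0xE000, 0xE0FF, 0xE100)
assert doc_b_remapped == "\uE100\uE101"
```

This is exactly the step that has no standard today: the remapping itself is trivial, but nothing in plain text records which private charset a PUA range belongs to, so no tool knows when to apply it.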
Re: Tag characters and in-line graphics (from Tag characters)
On 5/31/2015 5:33 AM, Chris-as-John wrote: Yes, Asmus, good post. But I don’t really think HTML, even a subset, is really the right solution.

The longer I think about this, what would be needed would be something like an abstract format: a specification of the capabilities to be supported and the types of properties needed to support them in an extensible way. HTML and CSS would possibly become an implementation of such a specification. There would still be a place for a character set, that is Unicode, as an efficient way to implement the most basic and most standard features of text content, but perhaps with some extension mechanism that can handle various extensions. The first level of extension is support for recent (or rare) code points in the character set (additional fonts, etc., as you mention). The next level of extension could be support for collections of custom entities that are not available as character sets (stickers and the like). And finally, there would have to be a way to deal with one-offs, such as actual images that do not form categorizable sets, but are used in an ad-hoc manner and behave like custom characters. And so on.

It should be possible to describe all of this in a way that allows it to be mapped to HTML and CSS or to any other rich-text format; the goal, after all, is to make such inline text as widely and effortlessly interchangeable as plain text is today (or at least nearly so). By keeping the specification abstract, you could accommodate both SGML-like formats where ASCII-string markup is intermixed with the text, as well as pure text buffers with placeholder code points and links to external data. But, however bored you are with plain Unicode emoji, as long as there isn't an agreed-upon common format for rich inline text, I see very little chance that those cute Facebook emoji will do anything other than firmly keep you in that particular ghetto.
A./

I’m reminded of the design of XML itself: it is supposed to start with a header that defines what that XML will conform to. Those definitions contain some unique identifier of that XML schema, which happens to be a URL. The URL is partly just a convenient unique identifier, but also, the XML engine, if it doesn’t know about that schema, could go to that URL, download the schema, and check that the XML conforms to it.

Similarly, imagine a text format that had a header with something like: \uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345 Now all the characters following in the text that start with 12345 will be interpreted with respect to that character set. What would you find at facebook.com/charsets/pusheen-the-cat-emoji/? You might find bitmaps, TrueType fonts, vector graphics, etc. You might find many, many representations of that character set that your rendering engine could cache for future use. The text format wouldn’t be reliant on today’s favorite rendering technology, whether bitmap, TrueType fonts, or whatever.

Right now, if you go to a website that references Unicode that your platform doesn’t know about, you see nothing. If a format like this existed, character sets would be infinitely extensible, everybody on earth could see the characters, even if their platform wasn’t previously aware of them, and the format would be independent of today’s rendering technologies. Let’s face it, HTML5 changes every few years, and I don’t think anybody wants the fundamental textual representation dependent on an entire layout engine. And also, the whole range of what HTML5 can do, even some subset, is too much information. You don’t necessarily want your text to embed the actual character set. Perhaps that might be a useful option, but I think most people would want to uniquely identify the character set, in a way that an engine can download it, but without defining the actual details itself.
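A toy parser for a header along the lines Chris sketches above. The exact "CHARSET:" syntax, the facebook.com URL, and the base id 12345 are the author's illustration, not any real format; the parser just shows that such a header reduces to an id-to-URL table:

```python
import re

# One declaration per header line: CHARSET:<url>,<base id>
HEADER = re.compile(r"^CHARSET:(?P<url>[^,]+),(?P<base>\d+)$")

def parse_charset_headers(lines):
    """Map each declared base id to the URL where that charset's
    glyphs, fonts, and properties could (hypothetically) be fetched
    and cached by a rendering engine."""
    charsets = {}
    for line in lines:
        m = HEADER.match(line.strip())
        if m:
            charsets[int(m.group("base"))] = m.group("url")
    return charsets

header_lines = ["CHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345"]
assert parse_charset_headers(header_lines) == {
    12345: "facebook.com/charsets/pusheen-the-cat-emoji/",
}
```

A renderer encountering an unknown base id would consult this table, fetch (or find cached) a representation it understands, and fall back to a placeholder glyph only if the URL is unreachable.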
Of course, certain charsets would probably become pervasive enough that platforms would just include them for convenience: emojis by major messaging platforms, maybe characters related to specialised domains like, I don’t know, mapping or specialised work domains or whatever, but without having to be subservient to the central Unicode committee. As someone who is a keen user of Facebook Messenger, and who sees them bring out a new set of emoji almost every week, I think the world will soon be totally bored with the plain basic emoji that Unicode has defined. — Chris

On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) asmus-...@ix.netcom.com wrote: reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML. But I think you are onto something with your hypothetical example of the subset that works in ALL textual situations. There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features
Re: Tag characters and in-line graphics (from Tag characters)
The abstract format already exists also for HTML (with the MIME charset extension of the media type text/plain; it can also be embedded in a meta tag, where the HTML source file is just stored in a filesystem, so that a webserver can parse it and provide the correct MIME header if the webserver has no repository for metadata and must infer the media type from the file content itself with some guesser). It also exists in various conventions for source code (recognized by editors such as vi(m) or Emacs, or for Unix shells using embedded magic identifiers near the top of the file). You can use it to send an identifier for a private charset without having to request a registration of the charset in the IANA database (which is not intended for private encodings). The private charset can be named in a unique way (consider using a private charset name based on a domain name you own, such as x-www.example.net-mycharset-1 if you own the domain name example.net). It will be enough for the initial experimentation for a few years (or more, provided that you renew this domain name).

Your charset can contain various definitions: a mapping of your codepoints (including PUAs, or standard codepoints, or hacked codepoints if you have no other solution to get the correct character properties working with existing algorithms such as case mappings, collation, or layout behavior in text renderers). Such a solution would allow a more predictable management of PUAs (by allowing control of their scope of use, by binding them, in some magic header of the document, to a private charset that remains reasonably unique).
For example, x-example.net-mycharset-1 would map to a URL like //www.example.net/mycharset/1/ containing some schema (it could be the base address of an XML or JSON file, of a web font containing the relevant glyphs, and of a character-properties database to override the default ones from the standard: if you already know this private charset in your application, you don't need to download any of these files; the URL is just an identifier, and your file can still be used in standalone mode, just as you can parse many standard XML schemas by simply recognizing the URLs assigned to the XML namespaces, without even having to find a DTD or XML schema definition from an external resource; if needed, your app can contain a local repository in some cache folder where you can extend the number of private charsets that can be recognized).

Full interoperability will still not be possible if you need to mix, in the same document, texts encoded with different private charsets (there's always a risk of collision) without a way to re-encode some of them to a joined charset without the collisions, by inferring a new private charset (it's not impossible to do; after all, this is done already with XML schemas that you can mix together: you just need to rename the XML namespaces, keeping the URLs to which they are bound, when there's a collision on the XML namespace names, a situation that occurs sometimes because of versioning, where some features of a schema are not fully upward compatible).
Yes, this complicates things a bit, but much less than when using documents in which PUA assignments are not negotiated at all (even minimally, to make sure they are compatible when mixing sources), and for which there exists for now absolutely no protocol defined for such negotiation (TUS says that PUAs are usable and interchangeable under private mutual agreement, but still provides no scheme for supporting such mutual agreement; for this reason, PUAs are almost always rejected, and people want true permanent assignments for characters that are very specific, badly documented, or insufficiently known to have reliable permanent properties). So let's think about securing the use of PUAs with some identification scheme (for plain-text formats, it should just be allowed to negotiate a single charset for the whole, using the magic-header tricks that have long been used by charset guessers, including for autodetecting UTF-8 encoded files).

This would also solve the chicken-and-egg problem where we need more sources to attest an effective usage before encoding new characters, but developing this usage is extremely difficult (and much slower) in our modern technologies, where most documents are now handled numerically. (In the past it was possible to create a metal font and use it immediately to start editing books, and there were many more people using handwriting and drawings, so it was much less difficult to invent new characters than it is today, unless you're a big company that has enough resources to develop this usage alone, such as the Japanese telcos or Google, Yahoo, Samsung or Microsoft introducing new sets of emojis for their instant-messaging platforms, with tons of developers working for them to develop a wide range of services around them...) However, I'm not saying that Unicode should specify how such private charset containing private
Re: Tag characters and in-line graphics (from Tag characters)
David Starner wrote: I would say that a system would conform with Unicode in having YELLOW HEART render red (in a non-monochrome font), as well as if it made it a cross. Either way it's violating character identity. I'd say that being monochromatic is now like being monospaced; it's suboptimal for a Unicode implementation, but hardly something Unicode can condemn as nonconformant.

This seems fair and sensible. My main point was that being monochromatic (i.e. black) is conformant, and was an attempt to challenge the statement about character color sometimes being a recorded property. I don't see any Unicode character properties that identify color, only character names, which don't carry property information. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
Of course, anyone can invent a character set. The difficult bit is having a standard way of combining custom character sets. That’s why a standard would be useful. And while stuff like this can, to some extent, be recognised by magic numbers and unique strings in headers, such things are unreliable. Just because example.net/mycharset/ appears near the start of a document doesn’t necessarily mean it was meant to define a character set. Maybe it was a document discussing character sets. And while it is tempting to allow the “container” to define the “header” information, whether the container be HTML defining something in its HEAD tag, or some proprietary format (MS-Word), or whatever, that doesn’t really solve anybody’s problem in a standard way. For a start, what if you want to copy text to the clipboard? You want the thing receiving it to be able to interpret it in a self-contained way.

The 2 obvious implementations for a standard seem to be:
1) A standard (optional) header. Perhaps if the string starts with a special character, then a header defining charsets follows first. These would allocate character ranges for custom characters, and point to where their renderings can be found. Standard programming libraries on all platforms would invisibly act appropriately on these headers. If you concatenated strings with conflicting namespaces, standard libraries would seamlessly reallocate one of the custom namespaces and merge the headers.
2) Make a new character set, let’s call it UTF-64. 32 bits would be allocated for custom character sets. Anybody could apply to a central authority to be allocated a custom id (32 bits = 4 billion ids). A central location, kind of like the domain name system, would map that id to the URL where the canonical definition for that character set is.

The 2nd option has the advantage that the file format is fixed-width like normal plain-text documents. Concatenating custom character set strings is no issue.
The canonical location for a character set isn’t forevermore mapped to a particular domain owner. Nothing about the meaning of the characters is defined in the actual bits other than the unique id. The disadvantage is that it needs a central authority to maintain the list of ids and map them to domains.

On 1 Jun 2015, at 7:26 am, Philippe Verdy verd...@wanadoo.fr wrote: The abstract format already exists also for HTML (with the MIME charset extension of the media type text/plain; it can also be embedded in a meta tag, where the HTML source file is just stored in a filesystem, so that a webserver can parse it and provide the correct MIME header if the webserver has no repository for metadata and must infer the media type from the file content itself with some guesser). It also exists in various conventions for source code (recognized by editors such as vi(m) or Emacs, or for Unix shells using embedded magic identifiers near the top of the file). You can use it to send an identifier for a private charset without having to request a registration of the charset in the IANA database (which is not intended for private encodings). The private charset can be named in a unique way (consider using a private charset name based on a domain name you own, such as x-www.example.net-mycharset-1 if you own the domain name example.net). It will be enough for the initial experimentation for a few years (or more, provided that you renew this domain name). Your charset can contain various definitions: a mapping of your codepoints (including PUAs, or standard codepoints, or hacked codepoints if you have no other solution to get the correct character properties working with existing algorithms such as case mappings, collation, or layout behavior in text renderers).
Such a solution would allow a more predictable management of PUAs (by allowing control of their scope of use, by binding them, in some magic header of the document, to a private charset that remains reasonably unique). For example, x-example.net-mycharset-1 would map to a URL like //www.example.net/mycharset/1/ containing some schema (it could be the base address of an XML or JSON file, of a web font containing the relevant glyphs, and of a character-properties database to override the default ones from the standard: if you already know this private charset in your application, you don't need to download any of these files; the URL is just an identifier, and your file can still be used in standalone mode, just as you can parse many standard XML schemas by simply recognizing the URLs assigned to the XML namespaces, without even having to find a DTD or XML schema definition from an external resource; if needed, your app can contain a local repository in some cache folder where you can extend the number of private charsets that can be recognized).
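The "central authority" in option 2 above would behave much like a cached DNS lookup: a 32-bit id resolves to the canonical URL of a charset definition. Everything below (the id, the URL, and the registry contents) is invented purely for illustration:

```python
# Hypothetical central registry mapping 32-bit charset ids to the
# canonical URL of each charset's definition. In a real deployment
# this table would live on lookup servers, not in the client.
REGISTRY = {
    0x00002A6B: "https://example.net/mycharset/",
}

_cache: dict[int, str] = {}

def charset_url(registry_id: int) -> str:
    """Resolve a charset id to its canonical URL, caching results
    locally the way a DNS resolver caches answers. Unknown ids
    resolve to an empty string (render a placeholder glyph)."""
    if registry_id not in _cache:
        _cache[registry_id] = REGISTRY.get(registry_id, "")
    return _cache[registry_id]

assert charset_url(0x2A6B) == "https://example.net/mycharset/"
assert charset_url(0xDEAD) == ""  # unregistered id: fall back gracefully
```

The design choice Chris notes holds here too: the bits in the text carry only the id, so ownership of the definition can move between domains by updating the registry, without re-encoding any documents.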
Re: Tag characters and in-line graphics (from Tag characters)
John, reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML. But I think you are onto something with your hypothetical example of the subset that works in ALL textual situations. There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it.

What people seem to have in mind is something like inline text. Something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by the general behavior of inline text: a string of it, laid out, must wrap and line-break; any objects included in it must behave like characters (albeit of custom width, height and appearance); and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available. With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take inline text. I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such an inline format.

Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions stick are doomed to failure. The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HTML or whatever else they use for interchange. Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit).
It would have to support the management of external resources - if the inline format allows images, custom fonts, etc., one would need a way to manage references to them in the local context. If your skeptical position proves correct in that this is something that turns out not to be tractable, then I think you've provided conclusive proof why stickers won't happen and why encoding emoji was the only sensible decision Unicode could have taken. A./ On 5/30/2015 7:14 AM, John wrote: Hmm, these "once" entities of which you speak, do they require JavaScript? Because I'm not sure what we are looking for here is static documents requiring a full programming language. But let's say for a moment that HTML5 can, or could, do the job here. Then to make the dream come true that you could just cut and paste text that happened to contain a custom character to somewhere else, and nothing untoward would happen, would mean that everything in the computing universe should allow full-blown HTML. So every Java Swing component, every Apple GUI component, every .NET component, every Windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action; now the universal text format is HTML. But in this new world where anywhere that previously you could input text, you can now input full-blown HTML, does that actually make sense? Does it make sense that you can, for example, put full-blown HTML inside an H1 tag in HTML itself? That's a lot of recursion going on there. Or in an MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document? I suppose someone could define a strict limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full-blown rendering engine. But who would define that subset? 
Not the HTML groups, because their mandate is to define full-blown rendering engines. It would be more likely to be something like the Unicode group. And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, if I have the string of Unicode characters <IMG> would that be an HTML5 image definition that should be rendered as such? Or would it be text that happens to contain a less-than symbol, I, M, G and a greater-than symbol? It would have to be the former, I guess, and thereby there would no longer be a Unicode symbol for the mathematical greater-than symbol. Rather there would be a Unicode symbol for opening an HTML tag, and the text code for greater than would be &gt;. Never again would a computer store > to mean greater than. Do we want HTML to be so pervasive? Not sure it deserves that. And from a programmer's point of view, he wants to be able to iterate over an array of characters and treat each one the same way,
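Asmus's closing point above, that a processing model for inline text must support "iteration by logical unit" rather than by raw code point, can be sketched minimally in Python. This is a crude stand-in for full grapheme-cluster segmentation (which needs the UAX #29 rules); it only groups a base character with its trailing combining marks:

```python
import unicodedata

def logical_units(text):
    """Group a string into minimal logical units: a base character
    followed by any combining marks. A crude stand-in for full
    UAX #29 grapheme-cluster segmentation."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch):
            units[-1] += ch      # attach combining mark to its base
        else:
            units.append(ch)     # start a new unit
    return units

# "e" + COMBINING ACUTE ACCENT is one logical unit, not two characters.
units = logical_units("cafe\u0301")
```

A program iterating `units` instead of the raw string sees "é" as one thing, which is the abstraction Asmus says any inline-text format would have to preserve.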
Re: Tag characters and in-line graphics (from Tag characters)
Yes, Asmus, good post. But I don’t really think HTML, even a subset, is really the right solution. I’m reminded of the design for XML itself: it is supposed to start with a header that defines what that XML will conform to. Those definitions contain some unique identifier of that XML schema, which happens to be a URL. The URL is partly just a convenient unique identifier, but also, the XML engine, if it doesn’t know about that schema, could go to that URL and download the schema, and check that the XML conforms to that schema. Similarly, imagine a text format that had a header with something like: \uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345 Now all the characters following in the text will interpret characters that start with 12345 with respect to that character set. What would you find at facebook.com/charsets/pusheen-the-cat-emoji/? You might find bitmaps, TrueType fonts, vector graphics, etc. You might find many, many representations of that character set that your rendering engine could cache for future use. The text format wouldn’t be reliant on today’s favorite rendering technology, whether bitmap, TrueType fonts, or whatever. Right now, if you go to a website that references Unicode that your platform doesn’t know about, you see nothing. If a format like this existed, character sets would be infinitely extensible, everybody on earth could see characters, even if their platform wasn’t previously aware of them, and the format would be independent of today’s rendering technologies. Let’s face it, HTML5 changes every few years, and I don’t think anybody wants the fundamental textual representation dependent on an entire layout engine. And also the whole range of what HTML5 can do, even some subset, is too much information. You don’t necessarily want your text to embed the actual character set. 
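The \uCHARSET header above is purely hypothetical, but parsing one would be trivial. A minimal Python sketch, where the "\uCHARSET:" prefix, the URL, and the code-point prefix "12345" are all taken from the post's invented example:

```python
import re

# Matches a hypothetical header of the form
#   \uCHARSET:<url>,<code-point prefix>
# Everything about this syntax is invented for illustration.
HEADER_RE = re.compile(r'\\uCHARSET:(?P<url>[^,]+),(?P<prefix>[0-9A-Fa-f]+)')

def parse_charset_header(line):
    """Return (url, prefix) from a hypothetical \\uCHARSET header,
    or None if the line is not such a header."""
    m = HEADER_RE.match(line)
    if not m:
        return None
    return m.group('url'), m.group('prefix')

hdr = parse_charset_header(
    r'\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345')
```

The hard part of the proposal is not this parsing but everything downstream: fetching, caching, and rendering the referenced character set securely.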
Perhaps that might be a useful option, but I think most people would want to uniquely identify the character set, in a way that an engine can download it, but without defining the actual details itself. Of course, certain charsets would probably become pervasive enough that platforms would just include them for convenience. Emoji by major messaging platforms. Maybe characters related to specialised domains like, I don’t know, mapping or specialised work domains or whatever, but without having to be subservient to the central Unicode committee. As someone who is a keen user of Facebook Messenger, and who sees them bring out a new set of emoji almost every week, I think the world will soon be totally bored with the plain basic emoji that Unicode has defined. — Chris On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) asmus-...@ix.netcom.com wrote: John, reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML. But I think you are onto something with your hypothetical example of the subset that works in ALL textual situations. There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it. What people seem to have in mind is something like inline text. Something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by general behavior of inline text: a string of it, laid out, must wrap and line break, any objects included in it must behave like characters (albeit of custom width, height and appearance), and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available. With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take inline text. 
I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such an inline format. Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions stick are doomed to failure. The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HTML or whatever else they use for interchange. Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit). It would have to support the management of external resources - if the inline format allows images, custom fonts, etc., one would need a way to manage references to them in the local context. If
Re: Tag characters and in-line graphics (from Tag characters)
2015-05-30 10:47 GMT+02:00 William_J_G Overington wjgo_10...@btinternet.com : Responding to Doug Ewell: I think this cuts to the heart of what people have been trying to say all along. Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction. History is interesting and can be a good guide, yet many things that are an accepted part of Unicode today started as new ideas that gained traction and became implemented. So history should not be allowed to be a reason to restrict progress. For example, there was the extension from 1 plane to 17 planes. Actually this was a restriction of the UCS to *only* 17 planes. Before that the UCS contained 31-bit code points, i.e. 32,768 planes! If you're speaking about the old Unicode 1.0, it was then still not the UCS and it was then incompatible with the UCS in many important parts, and the initial target of Unicode was only to have an industry standard immediately usable between a few software providers (Unicode 1.0 was then not an international standard, forget it!).
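Philippe's arithmetic checks out: the original ISO/IEC 10646 code space was 31 bits, and the later restriction to match UTF-16 left 17 planes of 65,536 code points each. A few lines of Python make the numbers concrete:

```python
import sys

# Today's UCS: 17 planes of 65,536 (0x10000) code points each.
CODE_SPACE = 0x110000
PLANES_NOW = CODE_SPACE // 0x10000          # 17

# The original 31-bit ISO/IEC 10646 space: 2**31 code points,
# i.e. 2**15 planes, before the restriction to 17.
PLANES_OLD = 2**31 // 0x10000               # 32768

# Python agrees on the current upper bound, U+10FFFF.
assert sys.maxunicode == 0x10FFFF
```

So the "extension from 1 plane to 17 planes" from Unicode's point of view was indeed, from the 10646 point of view, a contraction from 32,768 planes to 17.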
Re: Tag characters and in-line graphics (from Tag characters)
Note: Everything below is my personal opinion and does not represent any official Unicode Consortium or UTC position. William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote: Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction. History is interesting and can be a good guide, yet many things that are an accepted part of Unicode today started as new ideas that gained traction and became implemented. So history should not be allowed to be a reason to restrict progress. I used "historically" to distinguish between the pre- and post-Emoji Revolution eras. There have clearly been changes recently, but there is still at least a minimal expectation that proposed characters will fulfill a demonstrated need. I'm not seeing any truly novel, untested ideas in the list below that Unicode implemented purely on speculation. For example, there was the extension from 1 plane to 17 planes. That was an architectural extension, brought about by the realization that 64K code points weren't enough for even the original scope. There's no comparison. There was the introduction of emoji support. Emoji proponents would argue that emoji support began in 1.0 with the inclusion of various dingbats. But even emoji are arguably characters in some sense. They aren't a mini-language used to define images pixel by pixel. There was the introduction of the policy of colour sometimes being a recorded property rather than having just the original monochrome recording policy. There isn't any such policy. There is a variation selector to suggest that the rendering engine show certain characters in emoji style instead of text style, and there are characters with colors in their names, but there is no policy that specific colors are recorded as part of the encoding. YELLOW HEART could conformantly appear in any color. 
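The variation-selector mechanism Doug describes can be shown directly: VS15 (U+FE0E) and VS16 (U+FE0F) request text-style or emoji-style presentation without changing the underlying character's identity or name. A small Python illustration:

```python
import unicodedata

HEART = "\u2764"   # HEAVY BLACK HEART
VS15  = "\uFE0E"   # VARIATION SELECTOR-15: request text presentation
VS16  = "\uFE0F"   # VARIATION SELECTOR-16: request emoji presentation

text_style  = HEART + VS15   # typically rendered monochrome
emoji_style = HEART + VS16   # typically rendered as a colored emoji

# The selectors only *request* a presentation; the base character
# and its name are unchanged - no color is recorded in the encoding.
base_name = unicodedata.name(HEART)   # "HEAVY BLACK HEART"
```

This is exactly Doug's point: the encoding records a presentation preference, not a color, so a conformant renderer is free to pick the actual hue.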
There has been the change of encoding policy that facilitated the introduction of the Indian Rupee character into Unicode and ISO/IEC 10646 far more quickly than had been thought possible, so that the encoding was ready for use when needed. That's not a change to what types of things get encoded. It's a procedural change, one which I would agree has been applied with increasing creativity. There has been the recent encoding policy change regarding encoding of pure electronic use items taking place without extensive prior use (using a Private Use Area encoding), such as the encoding of the UNICORN FACE. This is probably your best analogy. People like Asmus have addressed it, saying it's not reasonable to expect users to adopt PUA solutions and wait for them to catch on. There is the recent change to the deprecation status of most of the tag characters and the acceptance of the base character followed by tag characters technique so as to allow the specifying of a larger collection of particular flags. There must have been a great wailing and gnashing of teeth over that decision. So many statements were made over the years about the basic evilness of tag characters. But the concept of representing flags was already agreed upon as a compatibility measure, and the Regional Indicator Symbols solution was a compromise that allowed expansion beyond the 10 flags that Japanese telcos chose to include. RIS were an architectural decision. The tag solution (to be fully outlined in a future PRI) was another architectural decision. Neither (I believe) is analogous to a scope decision to start encoding different types of non-character things as if they were characters, and as I have said before, assigning a glyph to a thing that isn't a character doesn't make it one. -- Doug Ewell | http://ewellic.org | Thornton, CO
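The Regional Indicator Symbols mechanism Doug mentions is easy to sketch: each letter of a two-letter ISO 3166 country code maps into U+1F1E6..U+1F1FF, and renderers pair adjacent symbols into a single flag glyph. A minimal Python sketch:

```python
def ris_flag(iso_code):
    """Build a flag from Regional Indicator Symbols: each ASCII
    letter A-Z maps to U+1F1E6..U+1F1FF. Renderers that know the
    pair display one flag; others fall back to two letter symbols."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in iso_code.upper())

us_flag = ris_flag("US")   # U+1F1FA U+1F1F8, rendered as one flag
```

The graceful fallback is the point of the compromise: a system with no flag glyph for a pair still shows something legible rather than nothing.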
Re: Tag characters and in-line graphics (from Tag characters)
I would say that a system rendering YELLOW HEART red (in a non-monochrome font) conforms with Unicode about as well as one rendering it as a cross: either way it's violating character identity. I'd say that being monochromatic is now like being monospaced; it's suboptimal for a Unicode implementation, but hardly something Unicode can condemn as nonconformant. On 4:25pm, Sat, May 30, 2015 Doug Ewell d...@ewellic.org wrote: Note: Everything below is my personal opinion and does not represent any official Unicode Consortium or UTC position. William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote: Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction. History is interesting and can be a good guide, yet many things that are an accepted part of Unicode today started as new ideas that gained traction and became implemented. So history should not be allowed to be a reason to restrict progress. I used "historically" to distinguish between the pre- and post-Emoji Revolution eras. There have clearly been changes recently, but there is still at least a minimal expectation that proposed characters will fulfill a demonstrated need. I'm not seeing any truly novel, untested ideas in the list below that Unicode implemented purely on speculation. For example, there was the extension from 1 plane to 17 planes. That was an architectural extension, brought about by the realization that 64K code points weren't enough for even the original scope. There's no comparison. There was the introduction of emoji support. Emoji proponents would argue that emoji support began in 1.0 with the inclusion of various dingbats. But even emoji are arguably characters in some sense. They aren't a mini-language used to define images pixel by pixel. There was the introduction of the policy of colour sometimes being a recorded property rather than having just the original monochrome recording policy. 
There isn't any such policy. There is a variation selector to suggest that the rendering engine show certain characters in emoji style instead of text style, and there are characters with colors in their names, but there is no policy that specific colors are recorded as part of the encoding. YELLOW HEART could conformantly appear in any color. There has been the change of encoding policy that facilitated the introduction of the Indian Rupee character into Unicode and ISO/IEC 10646 far more quickly than had been thought possible, so that the encoding was ready for use when needed. That's not a change to what types of things get encoded. It's a procedural change, one which I would agree has been applied with increasing creativity. There has been the recent encoding policy change regarding encoding of pure electronic use items taking place without extensive prior use (using a Private Use Area encoding), such as the encoding of the UNICORN FACE. This is probably your best analogy. People like Asmus have addressed it, saying it's not reasonable to expect users to adopt PUA solutions and wait for them to catch on. There is the recent change to the deprecation status of most of the tag characters and the acceptance of the base character followed by tag characters technique so as to allow the specifying of a larger collection of particular flags. There must have been a great wailing and gnashing of teeth over that decision. So many statements were made over the years about the basic evilness of tag characters. But the concept of representing flags was already agreed upon as a compatibility measure, and the Regional Indicator Symbols solution was a compromise that allowed expansion beyond the 10 flags that Japanese telcos chose to include. RIS were an architectural decision. The tag solution (to be fully outlined in a future PRI) was another architectural decision. 
Neither (I believe) is analogous to a scope decision to start encoding different types of non-character things as if they were characters, and as I have said before, assigning a glyph to a thing that isn't a character doesn't make it one. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
Responding to Leo Broukhis: A more common occurrence is the need to include a non-standard character in a text message, be it a ski piste symbol or an obscure CJK ideogram. Have you thought of embedding TrueType in Unicode? Not congruently so, yet, in effect, yes, as I have considered including individual OpenType-compatible glyphs in a base character followed by tag characters format. OpenType is a development from TrueType that can achieve more than can TrueType on its own. There is a little about this in the last two paragraphs of the following post. http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html There would need to be a few additions to make it work effectively: for example, a value for each of advance width, ascent maximum, descent maximum and font units per em. William Overington 30 May 2015
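To make the "base character followed by tag characters" mechanics concrete: tag characters U+E0020..U+E007E shadow printable ASCII 0x20..0x7E, so any ASCII payload can be carried after a base character and recovered later. The metrics record below (field names like advance and upem) is invented purely for illustration of William's idea, not any actual proposal. A Python sketch:

```python
TAG_BASE = 0xE0000
CANCEL_TAG = "\U000E007F"   # U+E007F CANCEL TAG terminates a run

def to_tags(payload):
    """Encode a printable-ASCII payload as Unicode tag characters
    (U+E0020..U+E007E shadow ASCII 0x20..0x7E)."""
    return "".join(chr(TAG_BASE + ord(c)) for c in payload)

def from_tags(s):
    """Recover the ASCII payload from a run of tag characters,
    ignoring anything outside the shadowed range (e.g. CANCEL TAG)."""
    return "".join(chr(ord(c) - TAG_BASE) for c in s
                   if 0xE0020 <= ord(c) <= 0xE007E)

# Hypothetical glyph-metrics record in the spirit of the proposal.
record = "advance=600;ascent=800;descent=200;upem=1000"
encoded = to_tags(record) + CANCEL_TAG
```

Whether a renderer should ever interpret such a payload is exactly the policy question the rest of the thread is arguing about; the encoding step itself is trivial.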
Re: Tag characters and in-line graphics (from Tag characters)
Responding to Doug Ewell: I think this cuts to the heart of what people have been trying to say all along. Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction. History is interesting and can be a good guide, yet many things that are an accepted part of Unicode today started as new ideas that gained traction and became implemented. So history should not be allowed to be a reason to restrict progress. For example, there was the extension from 1 plane to 17 planes. There was the introduction of emoji support. There was the introduction of the policy of colour sometimes being a recorded property rather than having just the original monochrome recording policy. There has been the change of encoding policy that facilitated the introduction of the Indian Rupee character into Unicode and ISO/IEC 10646 far more quickly than had been thought possible, so that the encoding was ready for use when needed. There has been the recent encoding policy change regarding encoding of pure electronic use items taking place without (extensive prior use using a Private Use Area encoding), such as the encoding of the UNICORN FACE. There is the recent change to the deprecation status of most of the tag characters and the acceptance of the base character followed by tag characters technique so as to allow the specifying of a larger collection of particular flags. The two questions that I asked in my response to a post by Mark E. Shoulson are relevant here. Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting? What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice. How would that be done otherwise than by the format that I am suggesting? William Overington 30 May 2015
Re: Tag characters and in-line graphics (from Tag characters)
Hmm, these "once" entities of which you speak, do they require JavaScript? Because I'm not sure what we are looking for here is static documents requiring a full programming language. But let's say for a moment that HTML5 can, or could, do the job here. Then to make the dream come true that you could just cut and paste text that happened to contain a custom character to somewhere else, and nothing untoward would happen, would mean that everything in the computing universe should allow full-blown HTML. So every Java Swing component, every Apple GUI component, every .NET component, every Windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action; now the universal text format is HTML. But in this new world where anywhere that previously you could input text, you can now input full-blown HTML, does that actually make sense? Does it make sense that you can, for example, put full-blown HTML inside an H1 tag in HTML itself? That's a lot of recursion going on there. Or in an MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document? I suppose someone could define a strict limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full-blown rendering engine. But who would define that subset? Not the HTML groups, because their mandate is to define full-blown rendering engines. It would be more likely to be something like the Unicode group. And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, if I have the string of Unicode characters <IMG> would that be an HTML5 image definition that should be rendered as such? Or would it be text that happens to contain a less-than symbol, I, M, G and a greater-than symbol? 
It would have to be the former, I guess, and thereby there would no longer be a Unicode symbol for the mathematical greater-than symbol. Rather there would be a Unicode symbol for opening an HTML tag, and the text code for greater than would be &gt;. Never again would a computer store > to mean greater than. Do we want HTML to be so pervasive? Not sure it deserves that. And from a programmer's point of view, he wants to be able to iterate over an array of characters and treat each one the same way, regardless of whether it is a custom character or not. Without that kind of programmatic abstraction, the whole thing can never gain traction. I don't think fully blown HTML embedded in your text can fulfill that. A very strictly defined subset possibly could. Sure, HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a correct rendering. But to be able to actually treat particular images embedded as characters, and have some programming library see that abstraction consistently, I'm not sure I'm convinced that is possible. Not without nailing down exactly what HTML elements in what particular circumstances constitute a character. I guess in summary, yes, we have the technology already to render anything. But I don't think the whole standards framework does anything to allow the computing universe to actually exchange custom characters as if they were just any other text. Someone would actually have to work on a standard to do that, not just point to HTML5. 
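John's greater-than worry is exactly the escaping problem HTML already has: once angle brackets are markup, a literal comparison must be written with entities, so plain text and text-as-markup can no longer share one representation. A short illustration with Python's standard html module:

```python
import html

# In plain text, ">" just means greater-than.
math_text = "3 > 2"

# The moment the same string is treated as markup, the literal ">"
# must become the entity "&gt;" to survive round-tripping.
as_markup = html.escape(math_text)      # "3 &gt; 2"
round_trip = html.unescape(as_markup)   # back to "3 > 2"
```

Every layer that might reinterpret the text has to agree on when to escape and unescape, which is the burden John is arguing against imposing on all text everywhere.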
On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy verd...@wanadoo.fr wrote: 2015-05-29 4:37 GMT+02:00 John idou...@gmail.com: Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request). If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way? HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an image, but the semantic is carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content. You may also use PUAs for the same purpose (however I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, but would use the SVG
Re: Tag characters and in-line graphics (from Tag characters)
Responding to Mark E. Shoulson: As was pointed out to me, essentially what you are saying is you reject my premise that one size does not fit all. Well, I do not know where that came from, but no, I do not reject that premise. There is plain text, there is HTML, there is XML. HTML is good for web pages. Plain text is, amongst other applications, good for text messages. The format that I am suggesting would allow the image for a non-standard emoji character to be included in a text message, with the image located at the correct place in the text. I have not purported that it become the only format for transmitting images. You would prefer *everything* be in plain text, so you wouldn't have to use other formats for it. You're essentially converting plain text into THE format for everything. No. Use the best format for the task that is being carried out. I am enthusiastic that as much as possible can be done in open source formats rather than an end user of computing equipment needing to rely on expensive proprietary software packages with proprietary file formats that cannot be accessed without expensive software. If you really believe one size should fit all in this way, ... But I don't. Just because I opine that plain text is best for some applications and I have suggested a format that would allow a graphic to be included directly in a plain text file does not mean that I opine that everything should be plain text. For example, I use HTML files, gif files, png files, pdf files, wav files, TTF files as appropriate. http://www.users.globalnet.co.uk/~ngo/library.htm http://www.users.globalnet.co.uk/~ngo/spec0001.htm http://www.users.globalnet.co.uk/~ngo/song1018.htm http://www.users.globalnet.co.uk/~ngo/song1021.htm I have embedded a wav file in a pdf and published the result on the web. http://www.users.globalnet.co.uk/~ngo/the_mobile_art_shop.pdf Suppose that a plain text file is to include just one non-standard emoji graphic. 
How would that be done otherwise than by the format that I am suggesting? What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice. How would that be done otherwise than by the format that I am suggesting? William Overington 29 May 2015
Re: Tag characters and in-line graphics (from Tag characters)
Responding to Philippe Verdy: There's no advantage because what you want to create is effectively another markup language with its own syntax (but requiring new obscure characters that most applications and users will not be able to interpret and render correctly in the way intended by you, ... Well, if the format became accepted as part of Unicode then appropriate applications could well be produced that would interpret the format and display an image in the desired place. ... and with still many things you have forgotten about the specific needs for images (e.g. colorimetry profiles, aspect ratio of pixels with bitmaps, undesired effects that must be controlled such as moiré artefacts). The format is just at present a basic suggestion. Rather than just state what you consider I have forgotten and dismiss the format, how about joining in progress and specifying what you consider needs adding to the format, and perhaps suggesting how to add in that functionality in the style that the format uses. You don't need new characters to create a markup language and its syntax. Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request, or embedding special effects on images, such as animation or dynamic layouts for adapting the document to the rendering device, with the help of CSS and JavaScript, which are also embeddable). The two questions that I asked in my response to a post by Mark E. Shoulson are relevant here. Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting? What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice. How would that be done otherwise than by the format that I am suggesting? 
At least with HTML5 they don't try to reinvent the image formats, and there's ample space for supporting multiple image formats tuned for specific needs (e.g. JPEG, PNG, GIF, SVG, TIFF...) including animation and video, and synchronization of images and audio in time for videos, or with user interactions. They are designed separately and benefit from patient research carried out over a long time (your desired format, still undocumented, is largely below the level needed for images, independently of the markup syntax you want to create to support them, and independently of the fact that you also want to encode these syntactic elements with new characters, something that is absolutely not needed for any markup language). Well, it is undocumented apart from posts in this thread because I have put forward the format for discussion. A pdf document for consideration by the Unicode Technical Committee could be produced and submitted if there is interest in the format, the content of the pdf document perhaps including suggestions from this thread if any such suggestions are forthcoming. In summary, you are reinventing the wheel. Well, this is progress, producing an additional format for expressing an image for application in various specific specialised circumstances. William Overington 29 May 2015
Re: Tag characters and in-line graphics (from Tag characters)
The format that I am suggesting would allow the image for a non-standard emoji character to be included in a text message, with the image located at the correct place in the text. A more common occurrence is the need to include a non-standard character in a text message, be it a ski piste symbol or an obscure CJK ideogram. Have you thought of embedding TrueType in Unicode? Leo On Fri, May 29, 2015 at 1:38 AM, William_J_G Overington wjgo_10...@btinternet.com wrote: Responding to Mark E. Shoulson: As was pointed out to me, essentially what you are saying is you reject my premise that one size does not fit all. Well, I do not know where that came from, but no, I do not reject that premise. There is plain text, there is HTML, there is XML. HTML is good for web pages. Plain text is, amongst other applications, good for text messages. The format that I am suggesting would allow the image for a non-standard emoji character to be included in a text message, with the image located at the correct place in the text. I have not purported that it become the only format for transmitting images. You would prefer *everything* be in plain text, so you wouldn't have to use other formats for it. You're essentially converting plain text into THE format for everything. No. Use the best format for the task that is being carried out. I am enthusiastic that as much as possible can be done in open source formats rather than an end user of computing equipment needing to rely on expensive proprietary software packages with proprietary file formats that cannot be accessed without expensive software. If you really believe one size should fit all in this way, ... But I don't. Just because I opine that plain text is best for some applications and I have suggested a format that would allow a graphic to be included directly in a plain text file does not mean that I opine that everything should be plain text. 
For example, I use HTML files, gif files, png files, pdf files, wav files and TTF files as appropriate.

http://www.users.globalnet.co.uk/~ngo/library.htm
http://www.users.globalnet.co.uk/~ngo/spec0001.htm
http://www.users.globalnet.co.uk/~ngo/song1018.htm
http://www.users.globalnet.co.uk/~ngo/song1021.htm

I have embedded a wav file in a pdf and published the result on the web.

http://www.users.globalnet.co.uk/~ngo/the_mobile_art_shop.pdf

Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting? What if three such non-standard emoji graphics were needed in the plain text file, with the second graphic used twice? How would that be done otherwise than by the format that I am suggesting?

William Overington

29 May 2015
Re: Tag characters and in-line graphics (from Tag characters)
William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote:

There's no advantage because what you want to create is effectively another markup language with its own syntax (but requiring new obscure characters that most applications and users will not be able to interpret and render correctly in the way intended by you, ...

Well, if the format became accepted as part of Unicode then appropriate applications could well be produced that would interpret the format and display an image in the desired place.

I think this cuts to the heart of what people have been trying to say all along. Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction.

-- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
2015-05-29 4:37 GMT+02:00 John idou...@gmail.com:

Today the world goes very well with HTML(5), which is now the best markup language for documents (including for inserting embedded images that don't require any external request).

If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way?

HTML(5) allows defining entities for images *once*; they can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an image, but the semantic is carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content.

You may also use PUA code points for the same purpose (however, I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements, and there is no defined selector for pseudo-elements matching a single character). PUA code points are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose and would instead use the SVG font format, which is valid in CSS, for defining a collection of glyphs).

If the document is not restricted to being standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet. With such an approach, you don't even need to use classes on elements: you use plain text with very compact PUA code points. It's up to you to decide whether the document must be standalone (embedding everything it needs) or must use external references for missing definitions; HTML allows both, and SVG as well when it contains plain-text elements.
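The define-once, reuse-many approach Philippe describes can be sketched concretely. The snippet below is a hypothetical illustration, not taken from the thread: the `glyph` class name is invented for the example, and the data URI encodes a tiny placeholder GIF. It assembles an HTML document in which the image bytes appear exactly once, however many times the character-like element is used:

```python
# Sketch: define an image once as a CSS rule, then reuse it many times.
# The data URI below encodes a 1x1 GIF purely for illustration.
IMAGE_DATA_URI = ("data:image/gif;base64,"
                  "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

# One CSS class carries the image; each use is a short, image-free span.
style = ('<style>.glyph { display: inline-block; width: 1em; height: 1em;'
         ' background: url("' + IMAGE_DATA_URI + '") no-repeat; }</style>')

uses = '<span class="glyph"></span>' * 1000  # 1000 occurrences, no image copies

document = ('<!DOCTYPE html><html><head>' + style + '</head>'
            '<body>' + uses + '</body></html>')

# The image data is embedded exactly once, regardless of how often it is used.
assert document.count(IMAGE_DATA_URI) == 1
```

Each additional occurrence costs only the length of the span tag, so a document using the "character" thousands of times repeats a few dozen bytes per use rather than the image data itself.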
Re: Tag characters and in-line graphics (from Tag characters)
As was pointed out to me, essentially what you are saying is you reject my premise that one size does not fit all. You would prefer *everything* be in plain text, so you wouldn't have to use other formats for it. You're essentially converting plain text into THE format for everything. But it isn't suited for that. If you really believe one size should fit all in this way, I think the problem is that pretty much all of the rest of the computer science community doesn't agree with you. Sorry.

~mark

On 05/28/2015 07:50 AM, William_J_G Overington wrote:

Responding to Mark E. Shoulson:

The big advantage of this new format is that the result is an unambiguous Unicode plain text file and could be placed within a file of plain text without having to make the whole document a markup file in some format. Plain text is the key advantage.

The following may be useful as a guide to the original problem that I am trying to solve.

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

I tried to apply the brilliant new base character followed by tag characters format to the problem. In the future, maybe Serif DrawPlus will have the ability to export a picture to this new format.

William Overington

28 May 2015
Re: Tag characters and in-line graphics (from Tag characters)
Today the world goes very well with HTML(5), which is now the best markup language for documents (including for inserting embedded images that don't require any external request).

If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way?

Part of the reason, at least, for having any code system rather than just pixels and images is to encode data efficiently and consistently. Unicode has private use ranges of codes. I can see an argument that it would be desirable to be able to send someone text with private use ranges and have the header define some default renderings. I'm not sure that replacing a document of 100,000 characters with 100,000 embedded HTML5 img tags is the same thing. It would be inefficient in space and impossible to process (e.g. to find all the instances of a particular character or sequence), and so forth.

Given that it's been agreed that private use ranges are a good thing, and given that we can agree that exchanging data is a good thing, maybe something should bring those two things together. Just a thought.

— Chris

On Fri, May 29, 2015 at 9:45 AM, Mark E. Shoulson m...@kli.org wrote:

As was pointed out to me, essentially what you are saying is you reject my premise that one size does not fit all. You would prefer *everything* be in plain text, so you wouldn't have to use other formats for it. You're essentially converting plain text into THE format for everything. But it isn't suited for that. If you really believe one size should fit all in this way, I think the problem is that pretty much all of the rest of the computer science community doesn't agree with you. Sorry.

~mark

On 05/28/2015 07:50 AM, William_J_G Overington wrote:

Responding to Mark E. Shoulson:

The big advantage of this new format is that the result is an unambiguous Unicode plain text file and could be placed within a file of plain text without having to make the whole document a markup file in some format. Plain text is the key advantage.

The following may be useful as a guide to the original problem that I am trying to solve.

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

I tried to apply the brilliant new base character followed by tag characters format to the problem. In the future, maybe Serif DrawPlus will have the ability to export a picture to this new format.

William Overington

28 May 2015
Re: Tag characters and in-line graphics (from Tag characters)
Responding to Mark E. Shoulson:

The big advantage of this new format is that the result is an unambiguous Unicode plain text file and could be placed within a file of plain text without having to make the whole document a markup file in some format. Plain text is the key advantage.

The following may be useful as a guide to the original problem that I am trying to solve.

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

I tried to apply the brilliant new base character followed by tag characters format to the problem. In the future, maybe Serif DrawPlus will have the ability to export a picture to this new format.

William Overington

28 May 2015
Tag characters and in-line graphics (from Tag characters)
Tag characters and in-line graphics (from Tag characters)

This document suggests a way to use the method of a base character together with tag characters to produce a graphic. The approach is theoretical and has not, at this time, been tried in practice. The application in mind is to enable the graphic for an emoji character to be included within a plain text stream, though there will hopefully be other applications.

The base character could be either an existing character, such as U+1F5BC FRAME WITH PICTURE, or a new character as decided. Tests could be carried out using a Private Use Area character as the base character.

The explanation here is intended to explain the suggested technique by examples, as a basis for discussion. In each example, please consider that the characters listed are each the tag version of the character used here, and that they all, as a group, follow one base character. The examples are deliberately short so as to explain the idea. A real use example might have around two hundred or so tag characters following the base character, maybe more, sometimes fewer.

Examples of displays (each example reads left to right along the line, then lines proceed down the page from upper to lower):
7r means 7 pixels red
7r5y means 7 pixels red, then 5 pixels yellow
7r5y-3b means 7 pixels red, then 5 pixels yellow, then next line, then 3 pixels blue

Examples of colours available:

k black
n brown
r red
o orange
y yellow
g green (0, 255, 0)
b blue
m magenta
e grey
w white
c cyan
p pink
d dark grey
i light grey (avoiding lowercase l so as to prevent confusion with figure 1)
f deeper green (foliage colour) (0, 128, 0)

Next line request:

- moves to the next line

Local palette requests:

192R224G64B2s means store as local palette colour 2 the colour (R=192, G=224, B=64)
7,2u means 7 pixels using local palette colour 2

Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document:

3t7r means local glyph 3 is being defined, at its first use in the document, as 7 red pixels
3h means local glyph 3 is being used here

The above is for bitmaps. A similar technique could specify a vector glyph, as used in fontmaking, with on-curve and off-curve points given as X, Y coordinates together with N for on-curve and F for off-curve. A few further commands would be needed to mark where the definition of a contour starts, to separate the definitions of the glyphs for a colour font, and so on. This could be made OpenType compatible so that a received glyph could be added into a font.

Please feel free to suggest improvements. One improvement could be a way to build a Unicode code point into a picture so that a font could be transmitted.

William Overington

27 May 2015
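As a reading aid, the run-length examples above can be decoded mechanically. The sketch below is a hypothetical illustration, not part of the proposal: it handles only pixel runs and the "-" next-line request, leaving the palette and local-glyph commands aside, and the RGB values for most colour letters are guesses, since only g and f are given explicit values in the post.

```python
import re

# Colour letters from the proposal mapped to RGB triples. Only g and f
# are specified in the post; the other values are illustrative guesses.
COLOURS = {
    "k": (0, 0, 0),        # black
    "r": (255, 0, 0),      # red
    "y": (255, 255, 0),    # yellow
    "g": (0, 255, 0),      # green (as specified)
    "b": (0, 0, 255),      # blue
    "w": (255, 255, 255),  # white
    "f": (0, 128, 0),      # deeper green (as specified)
}

def decode(text):
    """Decode a string like '7r5y-3b' into rows of RGB pixels."""
    rows = [[]]
    # Each token is either a run (digits + colour letter) or '-' (next line).
    for count, colour, newline in re.findall(r"(\d+)([a-z])|(-)", text):
        if newline:
            rows.append([])
        else:
            rows[-1].extend([COLOURS[colour]] * int(count))
    return rows

rows = decode("7r5y-3b")
# First row: 7 red pixels then 5 yellow; second row: 3 blue pixels.
```

Writing the decoder makes one property of the format visible: because runs carry no absolute coordinates, rows may come out with different lengths, so a renderer would need a rule for padding or rejecting ragged rows.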
RE: Tag characters and in-line graphics (from Tag characters)
William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote: Please feel free to suggest improvements. http://en.wikipedia.org/wiki/Scalable_Vector_Graphics -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
I think I've figured out the philosophy WJGO is trying to follow here. We should have a way to encode graphics in Unicode. We should have a way to encode programming instructions in Unicode. How about: we should have a way to encode sound-waves in Unicode? Or: we should have a way to encode *moving* graphics, maybe with sound, in Unicode? Now, he didn't say the last two, in fairness to him. But I think that's the thinking.

WJGO, not *everything* computers do has to be part of Unicode. Doing so essentially makes *everything* that wants to support Unicode have to be... well, pretty much *everything* all other computers are. We have graphics formats that encode graphics; they're *good* at it. They're made for it. We have sound formats for encoding sounds. We have various bytecodes for programming -- different ones, written by different people, that do things in different ways, because one size does not fit all. Unicode can't be the one size. It was never intended to be. Don't make Unicode into an operating system, or worse, THE operating system. It's a character encoding. For encoding characters.

~mark

On 05/27/2015 12:26 PM, William_J_G Overington wrote:

Tag characters and in-line graphics (from Tag characters)

This document suggests a way to use the method of a base character together with tag characters to produce a graphic. The approach is theoretical and has not, at this time, been tried in practice. The application in mind is to enable the graphic for an emoji character to be included within a plain text stream, though there will hopefully be other applications.