RE: Normalization rate on the Web
I have no idea what the stats are; however, some systems generate more NFC and others more NFD. A publisher may use NFC systems while an author uses an NFD system, so the pages served end up with a mixture. I generally recommend using comparisons and index keys that understand NFC/NFD and compare accurately regardless of the form.

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Denis Jacquerye
Sent: Monday, January 21, 2013 8:12 AM
To: Unicode Discussion
Subject: Normalization rate on the Web

Does anybody have any idea of how much of the Web is normalized in NFC or NFD? Or how much not normalized? How would one find out or try to make a smart guess?

I know a lot of library catalogue data is in NFD or somewhat decomposed. Is there any other field that heavily uses decomposition?

--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/
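The recommendation above (comparisons and index keys that ignore the normalization form) could look roughly like the following in Python. This is only a sketch: the helper names `norm_key` and `canonically_equal` are made up for illustration, and NFC is chosen arbitrarily as the internal form; NFD would serve equally well as long as it is applied consistently.

```python
import unicodedata

def norm_key(text: str) -> str:
    """Return a key that is identical for canonically equivalent strings."""
    # NFC is an arbitrary choice here; what matters is that every key
    # in the index is built with the same normalization form.
    return unicodedata.normalize("NFC", text)

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings regardless of whether they arrive as NFC or NFD."""
    return norm_key(a) == norm_key(b)

composed = "caf\u00e9"      # e with acute as a single code point (NFC)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT (NFD)

# An index keyed on the normalized form finds the entry for either spelling.
index = {norm_key(composed): "entry stored under the composed spelling"}

print(composed == decomposed)                    # plain code point comparison
print(canonically_equal(composed, decomposed))   # normalization-aware comparison
print(index[norm_key(decomposed)])
```

The plain `==` comparison treats the two spellings as different, while the normalized comparison and the index lookup treat them as the same text.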
Re: Normalization rate on the Web
* Denis Jacquerye wrote:
Does anybody have any idea of how much of the Web is normalized in NFC or NFD? Or how much not normalized? How would one find out or try to make a smart guess?

"How much" is not a good question here. Let's say there are only two web pages: one is very short and used by only one person once a year, the other page is very long and used by one billion people once per day. If one of the two pages is in NFC and the other in NFD, it would be misleading to say that 50% of the web is in NFD. More realistically, let's say Wikipedia articles are all neither NFC nor NFD, but Google search result pages are all NFD. How much would that be?

That problem aside, you could always get a list of web sites or pages, download them (or use a pre-packaged dataset), and analyse it.

My personal experience is that non-NFC content in German or English is fairly rare. I can tell fairly easily because my smartphone cannot render various characters like German umlauts properly when decomposed, so I encounter that problem sometimes, mainly on sites that quote heavily from PDFs and similar content.

--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
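The analysis step suggested above might be sketched as follows in Python (3.8 or later, for `unicodedata.is_normalized`). The function names, the sample texts, and the usage weights are all hypothetical; downloading and decoding the pages is left out. The sketch tallies both per page and per weight, since the two-page thought experiment shows the answers can differ widely.

```python
import unicodedata
from collections import Counter

def classify(text: str) -> str:
    """Label a decoded page as 'NFC', 'NFD', 'both' or 'neither'."""
    nfc = unicodedata.is_normalized("NFC", text)
    nfd = unicodedata.is_normalized("NFD", text)
    if nfc and nfd:
        return "both"      # e.g. pure ASCII text is in both forms at once
    if nfc:
        return "NFC"
    if nfd:
        return "NFD"
    return "neither"

def tally(pages):
    """pages: iterable of (text, weight) pairs; weight could be page views.

    Returns two Counters: one counting pages, one summing weights.
    """
    by_page, by_weight = Counter(), Counter()
    for text, weight in pages:
        label = classify(text)
        by_page[label] += 1
        by_weight[label] += weight
    return by_page, by_weight

# Made-up sample: texts and weights are illustrative, not measurements.
sample = [
    ("Gr\u00fc\u00dfe", 1_000_000),  # composed u-umlaut: NFC
    ("Gru\u0308\u00dfe", 1),          # u + combining diaeresis: NFD
    ("plain ASCII", 10),              # both forms at once
]

by_page, by_weight = tally(sample)
print(dict(by_page))
print(dict(by_weight))
```

Counting pages gives each form an equal share here, while weighting by usage makes the NFC page dominate, which is the distortion the two-page example warns about.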
Spiral symbol
Hello. I have wondered whether it would be a good idea to propose a spiral character, basically because I believe it is the only major symbol recurrently used to represent swearing in comics that is missing from Unicode. Most of the time it is replaced with the more common at sign (@), but an actual spiral would still be good to have. I am not sure yet whether there is enough documentation. Some emoji representations display the CYCLONE character (U+1F300) as a spiral, yet I don't think that makes it a suitable replacement.

Andrés Sanhueza
Re: Normalization rate on the Web
Hi Denis,

A few thoughts ...

Library data may be NFC or NFD, but is more likely to conform to the MARC character repertoire, so it isn't exactly NFD.

Vietnamese data is either 1) NFC or 2) neither NFC nor NFD. It would be rare to find Vietnamese data in NFD.

For a range of African languages, mainly ones using diacritics and diacritic stacking, it may be 1) NFC, 2) NFD or 3) neither NFC nor NFD, depending on the input framework used.

On Jan 22, 2013 3:26 AM, Denis Jacquerye moy...@gmail.com wrote:
Does anybody have any idea of how much of the Web is normalized in NFC or NFD? Or how much not normalized? How would one find out or try to make a smart guess?

I know a lot of library catalogue data is in NFD or somewhat decomposed. Is there any other field that heavily uses decomposition?

--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/
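To make the "neither NFC nor NFD" case concrete, here is a small Python sketch (3.8 or later) using the Vietnamese word "tiếng". The mixed spelling, a precomposed ê followed by a combining acute accent, is an assumed example of what an input method that types the tone mark separately could produce; it is not taken from any measured data.

```python
import unicodedata

nfc = "ti\u1ebfng"           # U+1EBF: e with circumflex and acute, one code point
nfd = "tie\u0302\u0301ng"    # e + COMBINING CIRCUMFLEX + COMBINING ACUTE
mixed = "ti\u00ea\u0301ng"   # precomposed e-circumflex + COMBINING ACUTE

for label, s in [("nfc", nfc), ("nfd", nfd), ("mixed", mixed)]:
    print(
        label,
        unicodedata.is_normalized("NFC", s),
        unicodedata.is_normalized("NFD", s),
    )

# All three spellings are canonically equivalent, so normalizing any of
# them to one form yields the same string.
print(unicodedata.normalize("NFC", mixed) == nfc)
print(unicodedata.normalize("NFD", mixed) == nfd)
```

The mixed spelling fails both checks even though it renders the same as the other two, which is why a survey needs a "neither" bucket in addition to NFC and NFD.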
Re: Spiral symbol
On 1/21/2013 4:11 PM, Andrés Sanhueza wrote:
Hello. I have wondered whether it would be a good idea to propose a spiral character, basically because I believe it is the only major symbol recurrently used to represent swearing in comics that is missing from Unicode.

If it should come to a proposal, I can help out with one or two citations of the use of this symbol for that purpose in contexts that are not that different from other lettering in the same sources. Not more than emoji are from regular words.

A./

Most of the time it is replaced with the more common at sign (@), but an actual spiral would still be good to have. I am not sure yet whether there is enough documentation. Some emoji representations display the CYCLONE character (U+1F300) as a spiral, yet I don't think that makes it a suitable replacement.

Andrés Sanhueza
Re: Normalization rate on the Web
On 2013/01/22 1:12, Denis Jacquerye wrote:
Does anybody have any idea of how much of the Web is normalized in NFC or NFD? Or how much not normalized?

I have never measured this. But at one time, there was only NFD (and NFKD). The Unicode Consortium, with input from W3C, then defined NFC (and NFKC) to be much closer to the actual encodings used on the Web. So in some sense, Web content is (mostly) NFC *by design*.

Regards, Martin.

How would one find out or try to make a smart guess?

I know a lot of library catalogue data is in NFD or somewhat decomposed. Is there any other field that heavily uses decomposition?

--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/
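One way to see the "NFC by design" point is to take a legacy single-byte encoding and check what its decoded text looks like. The Python sketch below (3.8 or later) uses ISO-8859-1 purely as an assumed example of such an encoding; it is not a claim about which encodings were considered when NFC was defined.

```python
import unicodedata

# Every byte value maps to exactly one code point in ISO-8859-1, so this
# string contains each character the encoding can express.
latin1_text = bytes(range(256)).decode("latin-1")

print(unicodedata.is_normalized("NFC", latin1_text))  # True
print(unicodedata.is_normalized("NFD", latin1_text))  # False

# A round trip through the legacy encoding keeps the text in NFC, because
# accented letters are stored as single precomposed characters.
raw = "Dageb\u00fcll".encode("latin-1")
decoded = raw.decode("latin-1")
print(unicodedata.is_normalized("NFC", decoded))      # True
```

Text converted from such an encoding to Unicode needs no further processing to be in NFC, whereas the same text would have to be rewritten to be in NFD.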