John, you seem to say normalization but mean decomposition.
Please note that there are several normalization forms; the most popular one is
NFC, which typically uses precomposed characters where they exist.
Your email suggests that MacOS is using NFD, which I find surprising.
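The NFC/NFD difference is easy to see with Python's unicodedata module (an illustrative sketch; Python is my choice here, any normalization library behaves the same):

```python
import unicodedata

s = "\u00e9"                             # é as a single precomposed code point
nfd = unicodedata.normalize("NFD", s)    # decomposes to e + combining acute
nfc = unicodedata.normalize("NFC", nfd)  # recomposes to the precomposed form

print([hex(ord(c)) for c in nfd])  # ['0x65', '0x301']
print(nfc == s)                    # True
```

A filesystem that stores names in NFD (as reported for Mac OS) would hand back the two-code-point form even if you created the file with the one-code-point form.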
On the issue of
The human-readable part of the email address (the friendly name) can contain any
character, while the internal or actual address is very limited.
A posting to the unicode list a while ago has the following header lines (among
others):
From:
For borders and arbitrary logos/symbols, it sounds like the best would be to do what
someone else suggested on this list a few days ago:
Define markup to specify a font and a glyph ID in that font to display something
without the need for a pseudo-character encoding for it.
Something for
There are a Java and a C++ reference implementation linked from the Bidi TR.
The Java one is straightforward (and slow), written so that you can read each rule in
the TR and see in the source that it works as specified.
The C++ code is verified to produce the same results as the Java code.
Databases use table definitions that usually define what encoding is used in which
parts of the database.
Encodings can be set per database, per table, or per column, and the definition syntax
seems to vary widely among vendors and products.
Generally, UTF-8 or 1208 or unicode or similar is
Doug Ewell wrote:
Not sure if this is relevant to your specific case, but I still use the
command prompt (MS-DOS Prompt) a lot ...
Interesting. I just tried the following:
Windows 2000. New text document with Notepad, arbitrary contents.
Save as AC06 0436.txt (Hangul letter + Cyrillic
a few years ago using the w versions of
main(), printf(), etc., and they worked just fine.
I think I switched the file mode of stdout to binary in those tools.
markus
Shlomi Tal wrote:
Hello Markus Scherer.
You wrote: chcp 1 to change the command prompt code page to
UTF-16. But as far as I
Joseph Boyle wrote:
Don't you need a fixed width font though? My W2K shows only Raster Fonts and
Lucida Console when I try to change the command window font.
Yes, I used Lucida Console, as I wrote originally.
The command prompt window does not appear to accept duospace fonts, which would
The SARA AM problem seems to be with the compatibility decomposition (NFKD and NFKC).
NFK* change a lot of characters and strings - not just Thai - in various visible and
functional ways and must be used with caution.
markus
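The SARA AM behavior described above can be reproduced directly (a sketch with Python's unicodedata; U+0E33 has a compatibility decomposition into U+0E4D + U+0E32):

```python
import unicodedata

sara_am = "\u0e33"  # THAI CHARACTER SARA AM

# Canonical normalization (NFC/NFD) leaves it alone...
print(unicodedata.normalize("NFC", sara_am) == sara_am)  # True

# ...but compatibility normalization splits it into NIKHAHIT + SARA AA.
print([hex(ord(c)) for c in unicodedata.normalize("NFKD", sara_am)])
# ['0xe4d', '0xe32']
```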
Samphan Raruenrom wrote:
Mark Davis wrote:
- decomposition of
So far, the Unicode Standard has defined code points to be from the contiguous range
of 0..0x10FFFF.
Some definitions are fuzzy in the standard, with hopes of clarification in Unicode 4.0.
It is true that UTF-16 cannot encode d800 dc00, but it can encode d800 0061 dc00.
There are at least
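Under the definitions as later clarified, a lone surrogate code point is not representable in well-formed UTF-16 at all, while a supplementary code point becomes a surrogate pair. Python's codecs reflect this (a sketch, not part of the original thread):

```python
def utf16_encodable(s):
    # Well-formed UTF-16 cannot carry a lone surrogate code point.
    try:
        s.encode("utf-16-be")
        return True
    except UnicodeEncodeError:
        return False

print(utf16_encodable("\ud800"))         # False: lone surrogate
print("\U00010000".encode("utf-16-be"))  # b'\xd8\x00\xdc\x00': surrogate pair
```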
Mark Davis wrote:
We do have that in ICU 2.2. It is not a public interface (meaning that we
will likely change the API before we make it public), but it is accessible
if you want to test with it for now.
See the ICU i18n library's caniter.h and caniter.cpp
Tex, the presentation forms are marked with Bidi AL just like the normal Arabic
characters.
A conformant Bidi implementation must treat them the same.
markus
Mark Davis wrote:
Note that we have a gazillion other dots already:
...
And these are just the obvious ones found with a quick search (and just
for the single dots). There are probably more hiding out in little
corners of scripts (it's a bit like Where's Waldo, looking for them).
To find
Boris Becker against Steffi Graf 6:4 4:6 6:7...
There are UTC/L2 documents for the agenda, topics, action items, minutes, etc.
whenever appropriate.
William Overington wrote:
This is not in the same news gathering league as having CNN and other
Oh -
markus
Stefan Persson wrote:
This links to a different page on the same server:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
That page contains a strange UTF-8 table:
...
The last two byte sequences are invalid.
Markus Kuhn's page shows the original ISO 10646 definition.
This necessarily
Doug Ewell wrote:
... They are not necessarily intended to replace
the established mechanisms, although I suspect the ICU team does intend
BOCU to replace SCSU. ...
Nope. They have different properties and are useful for different, if overlapping,
applications.
BOCU-1 was developed for
Not that I have anything against French or German(!), but beware of what you would do
with a translation.
Translated names are fine as an annotation.
Character names are treated as identifiers of abstract characters.
They do not necessarily describe the abstract character well, or even
[EMAIL PROTECTED] wrote:
A friend of a friend asked me if Unicode has a code for small s with a grave.
U+0073 U+0300
Has it been added since 3.0? Thanks in advance.
Afaik, there are not and will not be any new precomposed characters since Unicode 3.0.
I think the policy is to not add new
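This is easy to confirm: NFC composes wherever a precomposed character exists, but leaves <U+0073, U+0300> alone because no precomposed s-with-grave was ever encoded (Python sketch):

```python
import unicodedata

s_grave = "s\u0300"  # LATIN SMALL LETTER S + COMBINING GRAVE ACCENT

# NFC composes where a precomposed character exists (a + grave -> à)...
print(unicodedata.normalize("NFC", "a\u0300"))  # 'à'

# ...but there is no precomposed s-with-grave, so the sequence stays as-is.
print(unicodedata.normalize("NFC", s_grave) == s_grave)  # True
```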
Tex Texin wrote:
However, a Japanese user might have to choose a Japanese font, if the
Unicode font does not favor (and cannot be made to favor with language
tags) Japanese renderings.
So it's catch 22. They have native fonts because Unicode fonts are
inadequate, but we can be relieved that
BOCU-1 is now an IANA-registered charset:
http://www.iana.org/assignments/character-sets
I thought it might be useful and interesting to show the list of Unicode charsets that
are registered:
Charset name, MIBenum, aliases (if any *)
UTF-7 (MIBenum 1012)
UTF-8 (MIBenum 106)
UTF-16
Barry Caplan wrote:
There is a link with the story on the front page of www.i18n.com
Nice story, similar to the one with Gary Miller. It seems like we have three stories
of origin now (with mid-'80s DEC).
The i18n.com version does not date the MIT meeting, does it?
markus
Doug Ewell wrote:
What is the correct IBM GCGID value for U+03B8 GREEK SMALL LETTER THETA?
Is it GT61 or GT610002?
I have an internal document that shows
GT61 Theta Small - (see GT610001, GT610002) U3B8 GREEK SMALL LETTER THETA
GT610001 Theta Small (Open Form) - (resembles SA50)
Andrew C. West wrote:
On Tue, 15 Oct 2002, Stefan Persson wrote:
That font also includes some characters mapped to the PUA: A € sign, and
several 漢 characters, many of which look like radicals. Why? Is that
something that's also required by that law?
It's my experience that many fonts
David Starner wrote:
First, is it compliant with Unicode for an Antiqua font to use an s
glyph for ſ (U+017F)? It makes switching between Antiqua and Fraktur
fonts possible, and it is arguably the glyph given to the middle s in
modern Antiqua fonts.
Likewise, ä is printed as a with e above in
Doug Ewell wrote:
[92-C23] Consensus: Add a definition of XML Suitable and a
recommendation that SCSU encoders should be XML Suitable.
[L2/02-262]
[92-A46] Action Item for Markus Scherer, Editorial Committee:
Post a proposed update to Unicode Technical Standard #6, A
Standard Compression Scheme for Unicode
Dominikus Scherkl wrote:
My other suggestion (and the main reason to call the proposed
character source failure indicator symbol (SFIS)) was intended
especially for malformed UTF-8 input that has overlong encodings.
In this special case a converter exactly knows which char is
intended, but needs
David Starner wrote:
Chances are nearly 100% that overlong UTF-8 was a spoofing attempt, or the
result of something other than a UTF-8 encoder.
With the exception of overlong sequences for null (C0 80?), which Java
generates in an attempt to avoid true nulls.
I am aware of this one. This
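A strict UTF-8 decoder must reject overlong sequences, including the two-byte form of NUL that Java's serialization format emits; Python's codec illustrates the expected behavior (a sketch):

```python
def strict_utf8_decodes(b):
    # A conformant decoder rejects overlong (non-shortest-form) sequences.
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(strict_utf8_decodes(b"\xc0\x80"))  # False: overlong encoding of U+0000
print(strict_utf8_decodes(b"\x00"))      # True: the shortest form is fine
```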
Mark Davis wrote:
Little probability that right double quote would appear at the start of a
document either. Doesn't mean that you are free to delete it (*and* say that
you are not modifying the contents).
This points to a pragmatic way to deal with this issue:
If software claims that it does
Lars Kristan wrote:
Markus Scherer wrote:
If software claims that it does not modify the contents of a
document *except* for initial U+FEFF
then it can do with initial U+FEFF what it wants. If the
whole discussion hinges on what is allowed
"if software claims to not modify text" then one
Quick question:
IMAP specifies a modified UTF-7 encoding for mailbox names. I imagine that this might be implemented
in some applications as a converter. If so, what charset name is used for it? Is there a common one?
If there is no commonly used charset name, then how about imap-mailbox-name?
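For reference, the encoding itself (RFC 3501, section 5.1.3) is mechanical: printable ASCII passes through, '&' becomes '&-', and other runs are base64'd UTF-16BE with ',' substituted for '/'. A rough Python sketch (the function name is mine, not any standard API):

```python
import base64

def encode_imap_utf7(name):
    # Sketch of the IMAP "modified UTF-7" mailbox-name encoding.
    out = []
    buf = []
    def flush():
        if buf:
            b64 = base64.b64encode("".join(buf).encode("utf-16-be")).decode("ascii")
            out.append("&" + b64.rstrip("=").replace("/", ",") + "-")
            buf.clear()
    for ch in name:
        if 0x20 <= ord(ch) <= 0x7e:   # printable ASCII passes through
            flush()
            out.append("&-" if ch == "&" else ch)
        else:
            buf.append(ch)
    flush()
    return "".join(out)

print(encode_imap_utf7("Entw\u00fcrfe"))  # Entw&APw-rfe
```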
Dominikus Scherkl wrote:
I don't believe that English readers encountering an fb
ligature in the middle of the compound word 'goofball'
are confused about where the syllables, and hence the subwords,
end and begin.
That may be because English doesn't use word concatenation the
way German does:
Michael (michka) Kaplan wrote:
Michael, in answer to your request for a UTF-8 converter, that will have to
be another day (it's a bit more complicated, and I spend most of my time in
UTF-16 and UTF-32 so I can't really pretend it's work related). If you wanted
to provide the code in VBScript or
Christoph Päper wrote:
Moin,
Selber moin :-)
I've checked the existing chars,
http://www.unicode.org/alloc/Pipeline.html and this year's thread titles
of this mailing list, but didn't find characters to represent UI controls of
media devices (or a proposal for including them for that matter)
sourav mazumder wrote:
Need an urgent help regarding UTF-8 data conversion in
IBM Mainframe 390.
I have a data file in Windows system which contains
Japanese characters encoded using UTF-8. I need to
send this file to IBM Mainframe 390, where an
application will read this data.
In this context
-Original Message-
We are now looking to expand the market for this product into
countries such as China. To achieve this I have been informed
we need to enable our application for Double Byte Character
Set (DBCS).
DBCS is an old, pre-Unicode term for character sets with
xjliu_ca wrote:
I have searched all the web on IBM about the support of GB18030 in OS
AIX 4.3 and 5, but didn't find anything. I only can see they support
GB2312 and GBK.
Google found something for me:
http://www-3.ibm.com/software/ts/mqseries/support/readme/aix530_read.html
Search for 18030
Carl W. Brown wrote:
Some Unix systems adapted faster because the later Unicode adopters used 32
bit Unicode characters making the job 100 times easier. Other companies
like Microsoft took a very big gamble and implemented the code for surrogate
support into Windows 2000 based on early drafts of
Jane Liu wrote:
That may mean IBM AIX 5 supports conversion between GB18030 and
Unicode, but I don't see this as a system level of support because
there are no locale names for GB18030 in the doc of AIX 5:
The GB 18030 standard requires software to be able to _read and write_ text in the GB18030
Michael Yau wrote:
Markus,
The standard does _not_ require software to _process_ text internally in GB18030. It
is sufficient to have a converter and to process in Unicode, which does
contain all of the characters.
Just curious, do you have this in writing from the China standards body?
I don't
Jane, you are right, I over-simplified. I tried to make the point that you need not _process_ text
in GB18030 but that Unicode processing and conversion to/from GB18030 fulfills the requirement to be
able to read and write GB18030 text.
Yes, you need to have font support for all the characters
David J. Perry wrote:
The convention of using a horizontal line to mark an abbreviation, often
the omission of m or n, goes back to the middle ages (if not earlier)
and was often used in early printed books; apparently it has lived on in
some handwriting, to judge from your post. ...
I can
GB 18030 is defined with a 1:1 mapping table to Unicode. It has large code spaces for user-defined
characters, but the standard repertoire is the same as Unicode's.
In practice, all modern browsers work internally with Unicode no matter what page charset is
received. They all convert from the
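The 1:1 mapping means conversion is lossless in both directions. Python happens to ship a gb18030 codec, which makes the round trip easy to demonstrate (a sketch; the byte values shown are from the standard mapping table):

```python
# Every Unicode string round-trips through GB 18030 losslessly.
s = "\u4e00\u20ac caf\u00e9"   # 一, €, and accented Latin in one string
encoded = s.encode("gb18030")

print(encoded[:2])                     # b'\xd2\xbb': 一 keeps its GBK bytes
print(encoded.decode("gb18030") == s)  # True
```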
I am pleasantly surprised to see Esperanto on this list, even just in a quote :-) No, I don't claim
to be proficient any more.
Anto'nio Martins-Tuva'lkin wrote:
I got the following reaction from a specialist on aragonese issues,
Ferran Marin i Ramos [EMAIL PROTECTED]:
Certe temas pri eraro, ["It is certainly an error,"]
Raymond Mercier wrote:
The problem is rather: when are Unicode going to include the great many
symbols covered in Betacode...
Characters make it into Unicode by someone - you? - writing complete, reasonable, convincing
proposals that then make it through the committees and get approved in
ICU has a function u_shapeArabic():
http://oss.software.ibm.com/icu/apiref/ushape_8h.html#a24
markus
Mete Kural wrote:
I need to figure out a method to convert Arabic
Unicode text encoded in its normal form to Arabic
Unicode text encoded in Arabic presentation forms. ...
William Overington wrote:
Kenneth Whistler now states an opinion as to what the review is about and
mentions a file PropList.txt of which I was previously unaware.
Kenneth Whistler referred to a file that is part of the publicly and freely provided Unicode
Character Database, showing various
Michael (michka) Kaplan wrote:
GB18030 does not define a specific standard for sorting (as far as I know, neither does GB13000). It
is an encoding standard.
GB 18030 certainly does not define sorting. It defines a CCS/CES based on a mapping table to/from
Unicode/ISO 10646.
GB 13000 is, as far
Doug Ewell wrote:
SRIDHARAN Aravind ASridharan at covansys dot com wrote:
How to convert EBCDIC data into Unicode?
There are informative mapping tables available at:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/
There are also various places where IBM publishes
Marco Cimarosti wrote:
It has been repeated a lot of times that no more precomposed characters will
ever be added. ...
Stability requires that no more precomposed characters will be added that are equivalent to
sequences of already-existing other characters. This is because it
Tom Gewecke wrote:
Aside from Tex Texin's experimental pages, does any one have url's of web
sites done in UTF-16 rather than UTF-8? I don't, but would not mind having
some examples for testing.
In some of the ICU online demos you can choose the output charset. This should be interesting with
SRIDHARAN Aravind wrote:
My database is Oracle and its character set is WE8ISO8859P1.
In database, I have stored special Polish characters.
First of all, the database character set is ISO 8859-1 which cannot represent special Polish
characters. In all likelihood, you have taken a byte stream
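What typically happened in such cases can be reproduced: UTF-8 bytes were stored into, and read back from, a column declared as Latin-1 (a sketch with an illustrative Polish word):

```python
s = "\u0142\u00f3d\u017a"  # "łódź", which ISO 8859-1 cannot represent

# Store UTF-8 bytes in a Latin-1 column and read them back as Latin-1:
mangled = s.encode("utf-8").decode("iso-8859-1")
print(mangled != s)  # True: the reader sees mojibake, not Polish

# Recoverable only as long as nothing transcoded the bytes in between:
print(mangled.encode("iso-8859-1").decode("utf-8") == s)  # True
```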
I would like to add some information here without getting myself into the core of the discussion:
HTML recognizes a lot fewer whitespace characters than Java or Unicode. Different people have
different sets of whitespace characters.
Unicode's White_Space property (PropList.txt) contains 24 code
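The discrepancy is easy to observe; Python's str.isspace(), used here as a stand-in for a Unicode-aware whitespace test, accepts many characters that HTML 4's small whitespace set does not (a sketch):

```python
# HTML 4 treats only a handful of characters as whitespace;
# Unicode-aware APIs recognize many more.
html_ws = set("\t\n\x0c\r ")

for ch in ("\u00a0", "\u2003", "\u2028"):  # NBSP, EM SPACE, LINE SEPARATOR
    print(hex(ord(ch)), ch.isspace(), ch in html_ws)
# each line: True for isspace(), False for the HTML set
```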
Michael (michka) Kaplan wrote:
Well, DBCS means double byte character set and thus it is always two
bytes. But it's a theoretical definition since there are no actual DBCS
code pages -- all of the ones that exist are MBCS (multibyte character
set) since they support both one-byte and two-byte
Jungshik Shin wrote:
On Mon, 17 Feb 2003, Markus Scherer wrote:
Other examples: There are EUC-JP (1/2/3 bytes per character) and
EUC-CN (1/2/4 BpC) which are quite old (much older than GB 18030).
Markus's fingers made a mistake here :-). It's EUC-TW (not EUC-CN)
that encodes CNS 11643
[EMAIL PROTECTED] wrote:
Does anyone know of a way to process GB 18030 data in COBOL on MVS?
You could try to call ICU4C from COBOL http://oss.software.ibm.com/icu/userguide/cobol.html
ICU has a GB 18030 converter.
markus
Frank, http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-03.txt addresses these, and
version -04 of this draft will be public shortly.
markus
Marco Cimarosti wrote:
BTW, would it be possible to encode XML in SCSU?
Yes. Any reasonable SCSU encoder will stay in the ASCII-compatible single-byte mode until it sees a
character from beyond Latin-1. Thus the encoding declaration will be ASCII-readable.
The next version of UTR #6 will say so
Martin Duerst wrote:
- Is it *probable* that an XML processor decodes XML in SCSU?
No, XML processors are only required to support UTF-8 and UTF-16.
Many of them support other encodings, such as iso-8859-1,..., but
support for SCSU is thin as far as I'm aware.
Well, Xerces is a reasonably
Werner LEMBERG wrote:
... Similarly, the year of marriage is
depicted as two intertwined circles. How will this be represented in
Unicode? Are there characters for it?
For the marriage symbol, U+221E INFINITY should work fine - and quite appropriately.
markus
A UTF-x converter must handle non-characters like U+FFFE, U+FDD0, etc.
Unicode 3.0 chapter 3.8 Transformations clause D29 defines this, and the text there and below spells
out that non-characters and the like must be converted as well. The change since 3.0 only affects
single-surrogate code
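That is, noncharacters must pass through a UTF converter at the encoding-form level; Python's UTF-8 codec behaves this way (a sketch):

```python
# Noncharacter code points still have well-formed UTF-8 encodings.
for cp in (0xFFFE, 0xFDD0):
    ch = chr(cp)
    encoded = ch.encode("utf-8")
    print(hex(cp), encoded, encoded.decode("utf-8") == ch)
```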
SRIDHARAN Aravind wrote:
I just want to know whether a particular string from the source has got special characters. How can I make a dynamic check for it?
Well, you usually use the methods on the String class to search for a matching character or
substring, or methods to iterate through the code
It sounds like you don't know in what encoding you get your input, and you are munging the input
bytes(?!) in a custom way.
You need to identify the input encoding/charset and, in Java, instantiate an InputStreamReader with
the correct encoding name. Then you get proper Unicode strings, and
Yung-Fong Tang wrote:
I see a hole here. How about UTF-8 representing a pair of surrogate
code points with two 3-octet sequences instead of one 4-octet UTF-8
sequence? It should be ill-formed since it is non-shortest form also,
right? But we really need to watch out for the language used there
Yung-Fong Tang wrote:
Same thing for JIS X 0208 (a TWO and only TWO byte character set, not a
variable-length character set). If I am processing an ISO-2022-JP message
and am in the JIS X 0208 mode and I get 0x24 0xa8, I know the boundary of
that problem is 16 bits, not 8 bits nor 32 bits.
Not
I am not sure yet how far I want to get into this discussion... but this seems worth mentioning:
Asmus Freytag wrote:
The ideal case is one where the converter stops in a restartable
configuration, allowing the client to implement (or ask for) a variety
of error-recovery options.
A nice
No takers for this question? Let me try...
askq1 askq1 wrote:
The CollationTest_NON_IGNORABLE.txt and NormalizationTest.txt files contain
test-cases for sorting and normalization. The strings that are mentioned
in these files follow a specific order:
...
I want to know if these files are organized
Kenneth Whistler wrote:
Unicode character (\uFFE2\uFF80\uFF93)
...
What you are actually looking for is the UTF-8 sequence:
0xE2 0x80 0x93
The 8-bit UTF-8 bytes E2 80 93 (all with the most significant bit set) get *sign-extended* to 16
bits, producing FFE2 FF80 FF93. It should suffice in a
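The arithmetic of that bug is worth spelling out: each byte >= 0x80, read through a signed 8-bit char and widened to 16 bits, picks up 0xFF in the high byte (a Python sketch modeling the C behavior):

```python
utf8 = "\u2013".encode("utf-8")  # EN DASH encodes as E2 80 93
print(utf8)  # b'\xe2\x80\x93'

def sign_extend_byte(b):
    # Model a byte read via a signed char and widened to a 16-bit unit.
    signed = b - 256 if b >= 0x80 else b
    return signed & 0xFFFF

print([hex(sign_extend_byte(b)) for b in utf8])
# ['0xffe2', '0xff80', '0xff93']
```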
Roozbeh Pournader wrote:
Well, anything that is completely ignored in collation creates problems
with deterministic sorting.
I don't think you mean deterministic. UCA is deterministic, it just sorts many strings as equal.
There are certain words in Persian, with
completely different meanings,
ICU4C 2.6 (June/July) will support Unicode 4 but also provide an option for Unicode 3.2
normalization (with NormalizationCorrections.txt applied though).
http://oss.software.ibm.com/icu/
http://oss.software.ibm.com/pipermail/icu/2003-March/005406.html
We do not have any plans so far to do this
Generally, try instantiating an InputStreamReader or similar from your input, with an explicit
encoding=UTF8. That will perform the conversion from UTF-8 to the internal 16-bit Unicode that
Java processes.
Always use XYZReader classes for text input and XYZWriter classes for text output.
(from Re: geometric shapes)
It has been suggested many times to build a database (list, document, XML, ...) where each
designated/assigned code point and each character gets its story: Comments on the glyphs, from
what codepage it was inherited, usage comments and examples, alternate names,
Nooo - Java's old UTF functions do not process UTF-8! They are there for String serialization, a
Java-internal format.
Use the Java Reader/Writer classes instead of these old ones!
See the Java tutorials on Internationalization:
http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html
Pim Blokland wrote:
Why is there no UTF-24?
Well, I once proposed UTF-20...
See, these MathText characters take up a lot of space. No matter how
you encode them; UTF-8, UTF-16 or UTF-32; they always are 4 bytes
long.
True for them alone, in those UTFs. Short of defining another Unicode encoding,
Paul Hastings wrote:
would it be correct to say that javascript natively supports unicode?
ECMAScript, of which JavaScript and JScript are implementations, is defined in
terms of 16-bit Unicode source text and 16-bit Unicode strings.
In other words, the basic encoding support is there, but there are
Ben Dougall wrote:
On Wednesday, May 28, 2003, at 06:59 pm, Otto Stolz wrote:
PS. In these two languages, the quote-marks are paired thusly:
en_US: U+201C ... U+201D, and U+2018 ... U+2019
de_DE: U+201E ... U+201C, and U+201A ... U+2018
are they the right way round? so in german it'd be:
Ben Dougall wrote:
So, there is not comprehensive list of openers vs. closers possible.
so that's a 99-shaped quote on the baseline to open, and a 99 high
up to close. seems very odd to use 99 high or low to open, not a 66. but
if that's how it is, that's how it is.
Well, wait - I was
FYI
I wrote a little program for other standards activities to check which Unicode characters have
simple lower-/uppercase mappings across UTF-8 length boundaries (0080, 0800, 10000).
This is with Unicode 4 data.
I thought some unicode subscribers might be interested in the result.
Best
could that mean?
From: Markus Scherer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 01, 2003 1:30 PM
To: unicode
Subject: simple case mappings across UTF-8 length boundaries
U+2126 simple-lowercases to U+03c9
U+2126 is OHM SIGN
U+212a simple-lowercases to U+006b
U+212a is KELVIN SIGN
U+212b simple
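The interesting property of these pairs is that the lowercase form needs a different number of UTF-8 bytes than the original (a sketch reproducing the two listed cases with Python's case mapping):

```python
pairs = []
for cp in (0x2126, 0x212A):      # OHM SIGN, KELVIN SIGN
    upper = chr(cp)
    lower = upper.lower()        # U+03C9 omega, U+006B k
    pairs.append((lower, len(upper.encode("utf-8")), len(lower.encode("utf-8"))))

print(pairs)
# 3 -> 2 UTF-8 bytes for the ohm sign, 3 -> 1 for the kelvin sign
```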
ICU4J 2.6 provides build options out of the box to select certain functionalities. Please see the
bullet Modularization on http://oss.software.ibm.com/icu4j/download/2.6/
markus
There are many codepages for Indic languages.
Modern systems support Unicode. It is what Windows and MacOS X and Java and modern web browsers etc.
use internally - everything else is supported via conversion, which can be problematic.
The ISCII standard is byte-based and stateful. (Complicated
Jon Hanna wrote:
Hi,
I'm currently experimenting with various trade-offs for Unicode normalisation code. Any comments
on these (particularly of the "that's insane, here's why, stop now!" variety) would be
welcome.
You might want to look at, if not even use, the ICU open-source implementation:
Peter Kirk wrote:
On 25/09/2003 12:27, [EMAIL PROTECTED] wrote:
It's not a reordering per se, as the first combining character is
given the first opportunity to combine.
Thanks for the clarification.
In other words, yes, Unicode's NFC does perform discontiguous composition. Some things might be easier if only contiguous composition were used, but the current definition does give you the shortest strings.
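A concrete case of discontiguous composition (a sketch; the combining classes are from UnicodeData, and the example is mine, not from the thread): a mark can "reach over" a lower-class mark to compose with the base.

```python
import unicodedata

# a, HEBREW ACCENT ZINOR (ccc 228), COMBINING ACUTE ACCENT (ccc 230):
# the acute is not blocked by the zinor, so NFC composes it with the
# base discontiguously, yielding á followed by the zinor.
s = "a\u05ae\u0301"
nfc = unicodedata.normalize("NFC", s)
print([hex(ord(c)) for c in nfc])  # ['0xe1', '0x5ae']
```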
Peter Kirk wrote:
On 25/09/2003 14:25, Markus Scherer wrote:
In other words, yes, Unicode's NFC does perform discontiguous
composition. Some things might be easier if only contiguous
composition were used, but the current definition does give you the
shortest strings.
And this current
You might want to look at East Asian Width http://unicode.org/reports/tr11/ for an approximation of
the green-screen width of a string.
To be absolutely precise, you need feedback from your green-screen layout engine and its font, of
course, like you do for a graphical display.
markus
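A rough terminal-width estimate along those lines, using Python's unicodedata.east_asian_width as the approximation (the helper name is mine; real layout engines and fonts may disagree, as noted above):

```python
import unicodedata

def cell_width(s):
    # Count Wide and Fullwidth characters as two cells, everything else as one.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)

print(cell_width("abc"))        # 3
print(cell_width("\u6f22abc"))  # 5: one Wide CJK character plus three narrow
```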
Edward
I think Addison is on the right track here.
I would like to point to ICU sample code for this kind of thing:
http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/samples/numfmt/main.cpp
See the code there from setNumberFormatCurrency_2_6 on down (the preceding code is for older ICU
Jill Ramonsky wrote:
I had to write an API for my employer last year to handle some aspects
of Unicode. We normalised everything to NFD, not NFC (but that's easier,
not harder). Nonetheless, all the string handling routines were not
allowed to assume that the input was in NFD, but they had to
Philippe Verdy wrote:
... In fact, to further optimize and reduce the
memory footprint of Java strings, I chose to store
the String in an array of bytes with UTF-8, instead of an
array of chars with UTF-16. The internal representation is
This does or does not save space and time depending
Stefan Persson wrote:
Stephane Bortzmeyer wrote:
I do not agree. It would mean *each* application has to normalize
because it cannot rely on the kernel. It has huge security
implications (two file names with the same name in NFC, so visually
impossible to distinguish, but two different string of
If this is in C/C++ and your text is in Unicode, and you convert to a legacy (non-Unicode) codepage,
then you could use the ICU conversion API. It has an option to turn non-mappable characters into
numeric character references for HTML/XML.
Please see
You should use Unicode internally - UTF-16 when you use ICU or most other libraries and software.
Externally, that is for protocols and files and other data exchange, you need to identify (input:
determine; output: label) the encoding of the data and convert between it and Unicode. If you can
[EMAIL PROTECTED] wrote:
is it possible to design a program that takes the value of the osmanya script
and compare it with the somali latin script. then afterwards, displaying the
equivalent.
Generally, yes - this is called script transliteration. You could try this online at
Like German heute (=today) where the eu sounds like the oy in Spanish hoy?
hui=hoy=heu(te)... Neat!
markus
Michael Everson wrote:
At 23:07 +0100 2003-10-27, Philippe Verdy wrote:
The historic French word hui is now completely obsolete, and commonly
found only in the single expression
Philippe Verdy wrote:
the input:determine strategy will work fine for UTF-8 or SCSU, provided that
the leading BOM is explicitly encoded. ...
With determine I do not mean to restrict to checking for a BOM. There are several ways to
determine the input charset, depending on the protocol and
I suggest you try it out -
http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=heEXPLORE_CollationElements=
ICU implements the UCA, including discontiguous contractions.
markus
Peter Kirk wrote:
On 03/11/2003 07:01, Kent Karlsson wrote:
However, the UCA does ignore differences between
Peter Kirk wrote:
On 03/11/2003 15:26, Markus Scherer wrote:
I suggest you try it out -
http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=heEXPLORE_CollationElements=
ICU implements the UCA, including discontiguous contractions.
Thank you, Markus. Unfortunately the results are barely
[EMAIL PROTECTED] wrote:
We are talking about charset value for the internet protocol here. It is
a special narrow field of charset names. The values used by Internet
protocols are defined by a well-defined process:
http://www.faqs.org/rfcs/rfc2278.html RFC 2278 - IANA Charset
Registration
I would like to comment on several statements that I have seen in this thread -
- Migrating from UCS-2 to UTF-16:
Doable, and has been done for many applications and libraries.
- Difficult to handle UTF-16?
Use ICU - it handles all of Unicode for collation,
regular expressions, string
John Cowan wrote:
Here's a little table of the combining classes, showing the value, the
number of characters in the class, and a handy name (typically the one
used in the Unicode Standard, or a CODE POINT NAME if there is only one;
sometimes of my own invention).
This is already published with
Try
a) &#x2510; etc.
b) Use an application to find those characters, copy them, and paste them into your HTML editor. For
this you need to use a Unicode charset for your HTML document, see
http://www.unicode.org/faq/unicode_web.html#9
Possible applications to use to find and copy the
Theodore H. Smith wrote:
Can someone give me some advice? If I was to write a dictionary class
for Unicode, would I be better off writing it using a b-tree, or
hash-bin system? Or maybe an array of pointers to arrays system?
See John's reply. Tries of some sort should be good. I think there was