Re: Unicode and Encoding Problems in Browsers

2003-02-07 Thread Shlomi Tal
I'd like to mention that the problem Muhammad Asif brings up is a live one in 
my line of work. I work as a PC technician, and one complaint I often get in 
tech support calls is that the user is unable to type Hebrew in the Search 
box of the MSN Israel website (msn.co.il) under Windows XP. At first I told 
the user to set the Language for Non-Unicode Programs (known as the System 
Locale in Windows 2000; it sets the emulated ANSI codepage), but it didn't 
help: the user still complained of seeing boxes instead of proper Hebrew 
letters. The encoding of MSN.co.il is Hebrew (Windows).

It doesn't happen on all machines. Mine at home runs XP too, but I don't have 
that problem. I suspect it's not related to Unicode/encoding matters at all. 
The fact that it appears only under XP (and not 2000 or 98, for instance) 
leads me to believe it may have something to do with the Java VM (which XP 
lacks by default, and which updates browser components when installed).

I hope that is of some enlightenment.

ST





OT: Haikus for Unicode-Haters

2003-02-02 Thread Shlomi Tal
Unicode is shit!
What a dreadful encoding.
Who thought up this crap?

UTF-16
Has those pesky surrogates
Very bad design.

Arabic shaping
Difficult to implement
It's a complex script.

One should circumvent
Endian related issues.
UTF-8 does.





Re: OT: Haikus for Unicode-Haters

2003-02-02 Thread Shlomi Tal
You're right, but neither Mongolian nor Indic fits the 5-7-5 syllable 
constraint of haiku. Ben-ga-li-Sha-ping maybe? :-)

But anyway, as I've been reading in Thomas Milo's (DecoType) paper on Arabic, 
recently referred to here, Arabic typography isn't so simple once you get out 
of the simplified printing-Arabic paradigm.

I have been using Arabic on computers since 1993, starting with Accent 
Software's Dagesh (a multiscript word processor for Windows 3.x). The shaping 
mechanism for Arabic hasn't changed since. And I have read that this 
implementation goes back to the Apple Mac Arabic word processor Al-Kaatib 
Ad-Dawli of the late 1980s.

ST





Re: Arabic Presentation Forms

2003-01-31 Thread Shlomi Tal
Do you have any suggestions on how I could convert a piece
of Unicode text in this manner? Are there any programs
that could do this?


Roman Czyborra's arabjoin (a Perl script):

http://czyborra.com/arabjoin/

It does the conversion to Arabic Presentation Forms. It also converts 
logically-ordered Arabic to visual order (which may not be what you need); 
this is for display on systems that support neither BiDi nor Arabic shaping.
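
arabjoin itself is Perl; as a rough illustration of the kind of contextual 
mapping involved, here is a toy shaper in Python for a handful of letters (my 
own sketch, not arabjoin's algorithm):

  # code point -> (isolated, final, initial, medial); None = no such form
  FORMS = {
      '\u0627': ('\uFE8D', '\uFE8E', None, None),          # ALEF (right-joining)
      '\u0628': ('\uFE8F', '\uFE90', '\uFE91', '\uFE92'),  # BEH
      '\u0644': ('\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0'),  # LAM
      '\u0645': ('\uFEE1', '\uFEE2', '\uFEE3', '\uFEE4'),  # MEEM
      '\u064A': ('\uFEF1', '\uFEF2', '\uFEF3', '\uFEF4'),  # YEH
  }

  def joins_left(ch):
      # Can this letter connect to the letter that follows it?
      f = FORMS.get(ch)
      return f is not None and f[2] is not None

  def shape(text):
      out = []
      for i, ch in enumerate(text):
          forms = FORMS.get(ch)
          if forms is None:
              out.append(ch)
              continue
          isolated, final, initial, medial = forms
          prev_joins = i > 0 and joins_left(text[i - 1])
          next_joins = i + 1 < len(text) and text[i + 1] in FORMS
          if prev_joins and next_joins and medial:
              out.append(medial)
          elif prev_joins:
              out.append(final)
          elif next_joins and initial:
              out.append(initial)
          else:
              out.append(isolated)
      return ''.join(out)

  print(shape('\u0628\u064A\u0645'))  # beh-yeh-meem -> initial, medial, final

A real shaper must of course also handle the lam-alef ligatures, the vowel 
marks and the rest of the repertoire.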

ST





Q: Any Unicode Qur'an extant?

2002-09-11 Thread Shlomi Tal

Hello Unicoders.

I'd like to know if there is any text (Unicode) version of the Arabic Qur'an. 
I don't expect an exact book-copy version with all the cantillation frills; 
what I'm asking is whether a Qur'an in Arabic letters and Arabic tashkeel 
(vowel-pointing) alone is available. Has this important project already been 
carried out?

Thanks in advance.

--

Shlomi Tal
שלומי טל (my name in UTF-8 encoded Hebrew)







Re: Any Unicode Qur'an extant?

2002-09-11 Thread Shlomi Tal

Thank you. I'd be glad to know when it's finished.

--

Shlomi Tal
שלומי טל (my name in UTF-8 encoded Hebrew)







Re: Teletext

2002-07-31 Thread Shlomi Tal

Teletext uses VERY old encoding technology in general. I don't know if this 
is true for other languages, but Hebrew teletext encodes the Hebrew letters 
using 7-bit SI-960, which maps the Hebrew letters in place of the lowercase 
Latin letters (positions 0x60 to 0x7A). In Hebrew teletext you get the 
following archaic practices:

1. A 7-bit encoding, which allows only uppercase Latin letters in the mixed 
Hebrew/English mode. Compare Russian KOI-7 and Greek ELOT 927, which are like 
Hebrew SI-960 in mapping the non-Latin alphabet on top of the lowercase 
letters.

2. Teletext offers no bidirectional algorithm. The display mechanism is 
limited to monodirectional LTR, necessitating the use of visually encoded 
Hebrew (that is, Hebrew written monodirectionally LTR; see my Hebrew FAQ for 
a longer explanation). This needs to be inverted to logical order when 
converting to Unicode.

--

Shlomi Tal
שלומי טל







Re: Teletext

2002-07-31 Thread Shlomi Tal

From: Lars Marius Garshol [EMAIL PROTECTED]

This reminds me: does anyone have any pointers to information on how
to convert visually encoded text (especially HTML, but also other
formats) to Unicode?

There are programs that do it on the fly for Hebrew. The best, which I have 
used myself, is HebTML, available as a free download from 
http://www.billy.co.il . The author has been working with me on testing a new 
version that supports Unicode. However, I use this app much less than before, 
because the Hebrew Internet is rapidly making the transition from visual to 
logical ordering. With IE 5.x and Mozilla supporting logical Hebrew, the 
years-old visual order is on the way out.

The conversion of visual to logical text in BiDi scripts is straightforward: 
check the BiDi property of each character, and if it is RTL, reverse. That 
means Hebrew letters reverse their order, while digits and Latin letters stay 
the same. Things get more complicated, however, when hyphens, paired 
punctuation and telephone numbers appear. You need a smart converter for that.
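
A minimal sketch of the straightforward part (my own Python illustration; the 
smart converter is left as an exercise):

  import unicodedata

  def visual_to_logical(line):
      # Reverse each maximal run of right-to-left characters in one
      # visually-ordered line. Digits, Latin letters and everything
      # else keep their positions.
      out, rtl_run = [], []
      for ch in line:
          if unicodedata.bidirectional(ch) in ('R', 'AL'):
              rtl_run.append(ch)
          else:
              out.extend(reversed(rtl_run))
              rtl_run.clear()
              out.append(ch)
      out.extend(reversed(rtl_run))
      return ''.join(out)

  # "Shalom" stored visually, last letter first:
  print(visual_to_logical('\u05DD\u05D5\u05DC\u05E9'))  # -> שלום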

In essence, visually ordered Hebrew is a kludge for supporting Hebrew on 
platforms that weren't designed for it; in other words, it is an adaptation 
of Hebrew text to monodirectional LTR platforms. In modern software the onus 
of directionality passes from the writer to the software.

--

Shlomi Tal
שלומי טל







Re: chcp 10000 (was: Filesystems)

2002-07-13 Thread Shlomi Tal

Markus Scherer wrote:

Hi Shlomi, [sending to the list]

The number 10000 in chcp 10000 on Windows is, I assume, a magic number.
It switches the command prompt into 16-bit-Unicode mode (=UTF-16 encoding
form).

All I can say is that this works, and works at least since NT 4.

Not in my case, it doesn't; neither in Windows 2000 in the past, nor now in 
XP. chcp 10000 definitely switches me to the Macintosh Roman charset, perhaps 
because I have all the codepage conversion tables installed. Look in Regional 
Options, Advanced: 10000 is explicitly Mac-Roman.

I do manage to work in UTF-16 through the command line, though: not by chcp, 
but by launching the command line in UTF-16 mode with cmd /u (without the /u 
it is in ANSI mode). Plus, UTF-8 is available by doing chcp 65001.

Strange.

╭──────────────────────────────╮
│ שלומי טל                     │
│ ♂♋ looking for ♀ of any sign │
╰──────────────────────────────╯









Q: Filesystem Encoding

2002-07-10 Thread Shlomi Tal

Hello Unicoders, I have a question about filesystems. I never use anything 
but ASCII characters in filenames, and I would like to know if that caution 
is still justified. Of the various filesystems in use, I know only that the 
Joliet CDFS uses UCS-2BE. What about FAT16, FAT32, NTFS and Linux Ext2?

In short: should I still stick to ASCII alone in filenames, or are there 
filesystems where I really don't have to anymore? Thanks in advance.






The irony of it (was Re: Can browsers show text? I don't think so!)

2002-07-03 Thread Shlomi Tal

The irony of it is that Linux users are much better organized, font-wise, 
than Windows users, thanks to Markus Kuhn's ISO 10646 X11 fonts, which come 
with the XFree86 4.0 distribution. I have yet to find Ethiopic or Cherokee 
anywhere on a default Win2000/XP install. So Mozilla on Linux displays all 
characters fine, except those which need complex rendering that X11 can't 
supply: Arabic and Indic.

/----------------------------------------------\
| Marvelst thou not how matter combineth       |
| And assembles itself in wonderful shapes?    |
| Protons, electrons move of their own accord: |
| The atoms are arranged at no-one's behest!   |
|                                              |
| http://www.geocities.com/stmetanat/          |
\----------------------------------------------/







Vim 6 - int'l support on any Windows platform!

2002-06-24 Thread Shlomi Tal

Hello Unicoders!

I've just done a test run of Vim (vi improved) version 6.1 on a localized 
Hebrew MS-Windows 98 Second Edition.

I use Vim on my own Win2K machine, but it was no surprise that it worked 
there, because Win2K supports Unicode throughout. It was on the Hebrew Win98 
machine that I got a real uplift: by setting the encoding to UTF-8 and 
changing the keymap, I could write not just English and Hebrew, but also 
languages unsupported on that platform, such as Greek and Russian! Saving to 
a file naturally wrote the international characters out in UTF-8.

What a good way to get more international support on a system when you need 
it. Kudos to Bram Moolenaar and all the other Vim programmers.






XTF-3 Description, Advantages/Drawbacks

2002-06-19 Thread Shlomi Tal

OK, the eXperimental Transformation Format goes like this (I didn't make it 
clear enough before):

C0, G0, C1 and NBSP (0xA0) stay the same: a single byte.
All Unicode characters from U+00A1 onwards are encoded in three bytes, the 
first of which is in the range C2..FE, the other two in A1..C1.

Thus U+00A1 = 0xC2 0xA1 0xA1
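
A sketch of the arithmetic in Python, assuming the obvious mixed-radix 
assignment of lead and trail bytes (the format is my experiment, so this is 
only how I picture an encoder for a single UTF-16 code unit):

  def xtf3_encode_unit(u):
      # Encode one UTF-16 code unit (0x0000..0xFFFF) as XTF-3 bytes.
      if u <= 0xA0:                       # C0, G0, C1 controls and NBSP
          return bytes([u])
      n = u - 0xA1                        # 0 .. 0xFF5E
      b1 = 0xC2 + n // (33 * 33)          # lead byte, C2..FE (61 values)
      b2 = 0xA1 + (n // 33) % 33          # trail bytes, A1..C1 (33 values)
      b3 = 0xA1 + n % 33
      return bytes([b1, b2, b3])

  assert xtf3_encode_unit(0x00A1) == b'\xC2\xA1\xA1'
  assert xtf3_encode_unit(0xFFFF) == b'\xFE\xA2\xA2'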

Advantages:

1. ASCII compatibility
2. C1 compatibility
3. Can be reduced to a 7-bit SI/SO scheme with no control-code overlap, thus 
being a UTF-7 without the real UTF-7's chief disadvantage, the lack of 
synchronization.

Disadvantages:

1. No simple way of filling in bits like UTF-8's 110xxxxx 10xxxxxx pattern. 
I suppose this brings us back to UTF-1's modulo complexities...

2. 3 bytes for all Unicode characters above U+00A0.

3. UTF-16 surrogate piggybacking: 6 bytes per outside-BMP codepoint. Really 
yucky, but those characters are rare.






Re: Lost in translation

2002-06-11 Thread Shlomi Tal

Surprisingly to some, Unicode won't do much to solve this problem.  It
will make it much easier to store, exchange, and query Arabic-script
text. But people who can't read the Arabic script will continue to need
Latin transcriptions.

However, Unicode does make transcription much easier, if you have an 
implementation that supports combining marks. Finally I can distinguish 
between front Teh and back (velarized) Tah by putting a dot under the latter, 
write pharyngeal h with a dot below, use glottal marks for the two glottal 
consonants, and so forth. Pity I have it only in Lucida Sans Unicode and 
Arial Unicode MS; Times New Roman lacks some of the combining marks.
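
For example, in Python (a small illustration using U+0323 COMBINING DOT 
BELOW, not a full transcription scheme):

  import unicodedata

  t_dot = 't' + '\u0323'   # velarized "Tah" as base letter + combining mark
  h_dot = 'h' + '\u0323'   # pharyngeal "Hah"

  # NFC folds them into the precomposed letters where those exist:
  print(unicodedata.normalize('NFC', t_dot))  # ṭ (U+1E6D)
  print(unicodedata.normalize('NFC', h_dot))  # ḥ (U+1E25)

And the modifier half rings U+02BE and U+02BF serve as the glottal marks for 
hamza and ayin.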

(btw RE: Uniconv - I recall Roman Czyborra mentioning it as the charset 
conversion module in Gaspar Sinai's Yudit editor).






Re: How is UTF8, UTF16 and UTF32 encoded?

2002-05-31 Thread Shlomi Tal

The best non-technical introduction I've seen for UTF-8 is "The Properties 
and Promizes (sic) of UTF-8" by Martin Dürst, here:

http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf

And another good easy introduction is Richard Gillam's "Unicode Demystified", 
here:

http://www.concentric.net/~rtgillam/pubs/unibook/

Look into chapter 6, "Encoding Forms", which has useful illustrations of the 
UTFs.






(informative) Explanation of Microsoft Windows Text-File Modes

2002-05-31 Thread Shlomi Tal

Another FAQ-like essay of mine. Request for corrections.

--------------------------------------------------

Explanation of Microsoft Windows Text-File Modes

by Shlomi Tal ([EMAIL PROTECTED])

Contents

1. Concepts
2. ANSI Mode
3. Unicode Mode
4. UTF-8 Mode
--------------------------------------------------

Preliminary note: Windows 9x is shorthand for Microsoft Windows 95, 98
and ME; Windows XP is shorthand for Microsoft Windows NT 4.0, 2000 and
XP.

1. Concepts
^^^^^^^^^^^

The more legacy-free line of Microsoft Windows operating systems is designed 
to use Unicode for all text internally, with other representation modes 
provided for interoperability with other environments. The modes are 
specifically those that appear in the Windows XP text editor (Notepad), but 
they apply as general concepts.

Text files can be classified by the bit-stream representation they use, and 
by the repertoire of characters they can potentially hold. Bit-stream 
representation is the number and order of bits and bytes encoding the text. 
Repertoire determines which characters may legally appear in a text file. 
Bit-stream and repertoire are closely linked, though the relations are not 
always straightforward.

Microsoft Windows can handle text in at least one of three modes:

1. 8-bit stream with 256-character repertoire
2. 16-bit stream with 65536-character repertoire
3. 8-bit stream with 65536-character repertoire

The first is the only option for Windows 9x, and the second is the native 
internal mode of Windows XP. The first involves switching repertoires by 
changing 8-bit codepages, whereas the second has a fixed 16-bit repertoire. 
The third mode is a hybrid, fitting the 65536-character repertoire into what 
amounts to a single extended 8-bit codepage.
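
A quick Python sketch, purely to illustrate the three bit-stream 
representations of one and the same character:

  s = '\u00F6'                  # ö, German o-umlaut
  print(s.encode('cp1252'))     # mode 1, 8-bit ANSI:   b'\xf6'
  print(s.encode('utf-16-le'))  # mode 2, 16-bit:       b'\xf6\x00'
  print(s.encode('utf-8'))      # mode 3, 8-bit hybrid: b'\xc3\xb6'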

2. ANSI Mode
^^^^^^^^^^^^

The oldest mode for text files in Microsoft Windows, and the only option for 
the Windows 9x family, is ANSI mode, in which the system recognizes 256 
characters. Half of these (the ASCII range, 0x00 to 0x7F) are constant, and 
the other half (0x80 to 0xFF) change according to the particular language 
version of the system. ANSI mode enables the use of only two scripts: Basic 
Latin plus one more codeset. Other codesets cannot be used in ANSI mode 
without changing the codepage (which, as regards Windows 9x, means installing 
a different version of the operating system).

In this area there is a notable difference between the "enabled" and the 
"localized" versions of Windows 9x. "Enabled" means supporting a codepage and 
input methods that make it possible to write in a particular language. For 
example, the US version of Windows 9x is also French-enabled, for it has the 
characters for French in the second half of its codepage (CP1252 in this 
case). "Localized" means that the whole interface has been translated into a 
different language. A localized version is inherently enabled, and there are 
more different localized versions than enabled versions.

The practical consequence of ANSI mode is that text files are not viewed 
uniformly across operating system versions when characters from the second 
half of the codepage are used. For example, German o-umlaut (U+00F6 LATIN 
SMALL LETTER O WITH DIAERESIS) will appear as Hebrew Tsadi (U+05E6 HEBREW 
LETTER TSADI) when the text file containing it is opened on an enabled or 
localized Hebrew Windows 9x system. This is because German o-umlaut occupies 
the same position in the codepage map of CP1252 as Hebrew Tsadi does in 
CP1255 (the MS-Windows Latin/Hebrew codepage). There is no way of entering 
o-umlaut in a Hebrew Windows 9x version except through special applications.
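
The confusion is easy to reproduce with any codepage-aware tool; a Python 
sketch:

  data = '\u00F6'.encode('cp1252')  # ö as saved on a CP1252 system
  print(data)                       # b'\xf6'
  print(data.decode('cp1255'))      # read on a Hebrew system: צ (Tsadi)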

Windows XP abandons ANSI mode and uses Unicode mode instead (see next 
section), but for compatibility with Windows 9x and other codepage-based 
environments it emulates ANSI mode for one codepage at a time. That is, a 
"system locale" or "system default language" option is chosen to determine 
which one of the 8-bit codepages Windows XP supports. The consequence is 
that, for example, German o-umlaut will appear as Hebrew Tsadi if it is in an 
ANSI-mode text file and the system default language is set to Hebrew (more 
exactly, to CP1255). All Windows 9x applications running on Windows XP will 
exhibit such behaviour. This applies mainly to the interface (menus, 
captions) of applications.

Windows XP does not use ANSI mode internally, but it can save an external 
representation in a text file by saving it "as ANSI". The file will be saved 
normally only on condition that it does not contain any character outside 
the system's default ANSI codepage. If it does, Notepad will show a warning 
to save as Unicode instead, and saving anyway will corrupt the original data 
(transcoding, or conversion to question marks).

3. Unicode Mode
^^^^^^^^^^^^^^^

Windows XP handles text internally as UTF-16 (16 bits per character, plus 
support for surrogates from Windows 2000 onwards), and can store text as 
UTF-16 in either little-endian or big-endian byte order. The native byte 
order for the Intel x86 is little-endian.

Re: Emoticons

2002-05-22 Thread Shlomi Tal

Doug Ewell wrote:

The smiling face and frowning face have fairly obvious value as
emoticons.  I use U+263A (in its UCN form, \u263a) sometimes when
posting to this list.  A winking face and a surprised or shocked
face could arguably be useful as well.  But once you get past those
four, there's not much left except glyph variants

In fact, of the three emoticons now extant, I use only the white smiling 
face. I don't see any special point in using the black smiling face (it's 
there because of CP437, I believe), and as for the white frowning face, it 
isn't in Times New Roman and the other common WGL4 font sets. Lucida Sans 
Unicode and Arial Unicode MS are not universal enough to trust. The only 
Unicode symbols I trust are those of the Times/WGL4 set, such as the male 
sign, female sign, card signs and so forth. Astrological symbols, for 
example, are out: I can print them on my laser printer from the Lucida Sans 
Unicode font, but I can't expect them to appear properly in everyone's 
browser.

Some people believe that encoding certain entities (Klingon comes to
mind) would bring great embarrassment to Unicode and cause people not to
respect it or take it seriously. That's how I feel about encoding
additional smileys.

True, and yet there are so many symbols included for compatibility purposes 
which are otherwise not useful. The shade blocks (dark, medium, light) make 
sense when you think of compatibility with CP437 and other terminal 
implementations, but I can't think of another situation where they might be 
useful. (The box-drawing lines, on the other hand, are still useful even 
outside the terminal-graphics context.)

I think of the benefits of Unicode in terms of what more characters are 
available. I remember how hard it was, back in 1993 or so, to transliterate 
phonetic writing with all the macrons and combining dots, and now it's much 
easier. And :-) is just like writing "ae" for ä in that respect: just as the 
latter used to be a hack for when you weren't sure you could get the 
diaeresis through, the former is the hack for those systems where you 
couldn't rely on universal CP437 display. Hacks such as those are where 
humans begin serving the machines instead of the other way round. I find it 
detestable.






Emoticons

2002-05-20 Thread Shlomi Tal

Branching off from the subject of symbol encodings, I wondered about the 
application of emoticons in the Miscellaneous Symbols block. Even though I 
know that characters such as the white smiling face were included for 
compatibility with DOS CP437 and its offshoots, the white frowning face 
wasn't in CP437.

Pike and Thompson's paper on the Plan 9 Unicode conversion ("Hello World or 
Kalimera Kosme or Konichiwa Sekai") says this:

--- QUOTE ---

Although we converted Plan 9 in the altruistic interests of serving foreign 
languages, we have found the large character set attractive for other 
reasons. The Unicode Standard includes many characters — mathematical 
symbols, scientific notation, more general punctuation, and more — that we 
now use daily in our work. We no longer test our imaginations to find ways 
to include non-ASCII symbols in our text; why type :-) when you can use the 
character ☺?

--- UNQUOTE ---

So, that emoticon, far from its original use as a compatibility character 
for CP437 (much like the box-drawing symbols), is actually useful for 
regular, running application. And since emoticons are very useful, and are 
not compatibility hacks, then why not add a few more to the Misc Symbols 
set? White winking face, for example? I already use the white smiling face 
on discussion boards, as an HTML NCR, and it's smashing. Wouldn't a few more 
be useful?

Just my thoughts...

Shlomi Tal
Author of The Guide To Hebrew Computing
http://www.pcphobia.co.il/hebcomp/






Re: Welcome to list 'unicode'

2002-04-27 Thread Shlomi Tal

MS-Word 2000 and upwards use Unicode (to be more specific, UTF-16 
little-endian). Earlier versions (97 and downwards) are still based on 
codepages, according to their language version; for example, Hebrew MS-Word 
97 stores strings in the Windows-1255 codepage. Other languages can be typed 
and saved if you have the input method, as I did when I had Word 97 on my 
Win2K system, but they are stored in an easily-corrupted extended encoding. 
I upgraded to Word 2000 after Word 97 had corrupted all my Arabic text.

i have a question..
 i have a word editor(say MSWord)..does MSword have unicode
compatibility...if not then how do i make it compatible to unicode
standard.??
regards,
deepak







Re: Variations of UTF-16

2002-04-24 Thread Shlomi Tal

{{ But a BOM in every UTF-16 plain text file would make this completely 
hopeless. If we ever think we might want to do UNIX-style text processing on 
UTF-16, we have to resist that! }}

If you're going to take the trouble of making text tools 16-bit aware, then 
you can afford to make them BOM-aware too.

type a.txt b.txt c.txt > d.txt

on Windows 2000, assuming that they are all UTF-16 (with an FF FE at the 
beginning of each, as is usual in MS-Windows Unicode files), strips every 
BOM except the first, so that d.txt has only the usual single initial FF FE. 
So it's not an immovable obstacle.
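
BOM-aware concatenation along those lines is trivial to write; a Python 
sketch (my approximation, not cmd.exe's actual logic):

  BOM_LE = b'\xff\xfe'

  def concat_utf16le(paths, dest):
      with open(dest, 'wb') as out:
          out.write(BOM_LE)                 # one BOM at the start
          for p in paths:
              with open(p, 'rb') as f:
                  data = f.read()
              if data.startswith(BOM_LE):   # drop each file's own BOM
                  data = data[len(BOM_LE):]
              out.write(data)

  concat_utf16le(['a.txt', 'b.txt', 'c.txt'], 'd.txt')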

Concerning text files: nearly all of the plain-text Unicode I've ever seen 
is in UTF-8. However, the ubiquitous MS-Office documents, from Office 2000 
onwards, are all in UTF-16 (little-endian, without BOM).






Re: Please help: Unicode sig in Hotmail

2002-04-13 Thread Shlomi Tal

The sig is one of the situations where UTF-8 transfer hasn't worked for me. 
Normally I use UTF-8 in mails to transfer text in Hebrew and Arabic, and it 
passes with no problem: I just switch IE to UTF-8 and it passes the bytes as 
they should be.

Maybe those symbols are confusing the browser? As you know, the symbols 
(male sign, female sign, black heart) are mapped to control-character 
positions in CP437. Could it be that the browser is interpreting them as 
such for compatibility? Particularly intriguing is that the black heart 
causes problems every time, because it corresponds to Control-C (interrupt).

Pity about Hotmail's lack of support for specifying UTF-8 as the transfer 
encoding. This quite diminishes the advantage of web-based mail (I can't set 
up Outlook Express with my account everywhere...).






Please help: Unicode sig in Hotmail

2002-04-12 Thread Shlomi Tal

I've built a UTF-8 sig for my outgoing messages:

|------------------|
| a BOY ♂ ...      |
| a GIRL ♀ ...     |
| they MEET ♂♀ ... |
| HERE WE GO!      |
| ♂♥♀              |
|------------------|

with Unicode symbols from the U+26xx block. However, it doesn't show up at 
all: not in Compose, not when I send a message to myself, and not when I 
send a message to someone else.

Please tell me how I can put it right.

Thanks in advance.






Hebrew Computing FAQ

2002-04-09 Thread Shlomi Tal

I have a website at http://www.pcphobia.co.il/hebcomp/ called The Guide to 
Hebrew Computing, which is meant for native users of Hebrew and is therefore 
entirely in Hebrew (in two versions: UTF-8 encoded logical Hebrew and 
ISO-8859-8 encoded visual Hebrew). For the basic questions about Hebrew, 
especially about the difference between visual and logical order, which 
people have asked me after seeing those options in Mozilla and Internet 
Explorer, I have this FAQ in English. Criticism and pointing out of errors 
gladly accepted.

--- BEGIN ---

Hebrew Computing FAQ

by Shlomi Tal ([EMAIL PROTECTED])

Contents:

1. What is the difference between ISO-Visual and ISO-Logical?
2. How was Hebrew used on MS-DOS?
3. What is special about MS-Windows Hebrew (windows-1255) encoding?
4. Review of Standards
--------------------------------------------------

1. What is the difference between ISO-Visual and ISO-Logical?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This question needs a long explanation, going down to the very rudiments of 
human handwriting. ISO-8859-8 is just an encoding; the difference between 
visual order and logical order has nothing to do with the encoding itself 
(ie the numbers assigned to each letter), but with the storage order of the 
numbers.

Let us review the writing of English text by hand. The hand holds the
pen near the top-left corner of the paper and then moves rightwards
constantly. When there is no more room on the paper to the right, the
hand moves back to the left edge and slides one row lower than before,
and then begins the rightwards movement again.

Writing Hebrew (and Arabic and other Semitic languages) by hand is a
different matter. The hand holds the pen near the top-right corner of
the paper and then moves leftwards. However, it moves leftwards as
long as the text is in Hebrew. If numbers (or English text) are to be
written, the hand will move rightwards for them and then resume the
leftwards movement for Hebrew text again. In other words, writing
Hebrew involves bidirectional (left-to-right and right-to-left)
movement of the hand, in contrast to monodirectional English writing.
Finally, upon running out of room to move leftwards, the hand moves
back to the right edge and slides one row lower.

So much for human handwriting. Computers, however, know nothing about
directions. The numbers representing human letters are stored
sequentially on the media. Making them flow from left to right and
move on to the beginning of the next line is the job of software.

Since computer systems were designed around English, the
screen-handling routines have a uniform, clear rule for mimicking the
handwriting process: if a byte follows another byte, it will be
presented on the screen as a letter to the right of the letter that
the previous byte represents:

Sequential bytes:
0x48
0x65
0x6C
0x6C
0x6F

Letters displayed on screen:
Hello

In addition, for word-wrapping applications (such as text editors)
there is a routine for going to the beginning of the next line when
the row is full.

When it comes to displaying Hebrew on the screen, there is great
difficulty. The display mechanisms of computers were originally
designed for English, and can easily be accommodated to other
left-to-right scripts, or even to a monodirectional right-to-left
script by employing a simple display inversion, but Hebrew is
bidirectional and more complicated to display (Arabic is even more
complicated than Hebrew, but that's another story).

There are two options for dealing with Hebrew text display:

1) Forcing Hebrew to conform to the constraints of English text
display (ie treating Hebrew like a monodirectional LTR script).

2) Updating the display software to handle bidirectional display of
Hebrew text in a way akin to its flow in handwriting.

The first option is simple, easy to implement and does not require
large computing resources by the standards of early computing (which
for Hebrew means from the 1960s to the early 1980s). It requires only
an encoding and a font mapping: numbers assigned to Hebrew letters,
and Hebrew fonts for their display. However, it requires an effort on
the part of the writer, since all text, including Hebrew letters, is
written from left to right. Hebrew text must be written with the last
letter typed first, so that the left-to-right display of the text can
form the illusion of natural Hebrew flow. There were a few mechanisms
to aid writers, such as pushing input methods for typing the Hebrew
letters the natural way (from right to left), but editing, sorting,
copying and any kind of manipulation stayed a painful task.

The second option, implemented for Arabic first and then for Hebrew, 
consists in more intelligent software, and therefore requires more 
resources. The method assigns an implicit directionality to each character: 
LTR for English letters and numbers, RTL for Hebrew letters, and neutral for 
punctuation marks. The Hebrew text is stored in the same sequential order in 
which it is typed, and the display software reorders it on the screen.
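
The per-character directionality classes are recorded in the Unicode 
character database; a quick Python illustration (a lookup only, not the full 
bidirectional algorithm):

  import unicodedata

  for ch in ('A', '\u05D0', '7', '!'):
      print(repr(ch), unicodedata.bidirectional(ch))
  # 'A'  L   (left-to-right: Latin letter)
  # 'א'  R   (right-to-left: Hebrew letter)
  # '7'  EN  (European number)
  # '!'  ON  (other neutral: punctuation)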

MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Shlomi Tal

A small fix for the FAQ; specifically, a fix for the typo/braino of 
construing 0x071F as little-endian 1F 70 instead of (the now fixed) 1F 07. 
Thanks to Wladislaw Vaintroub for pointing it out to me.

--- BEGIN ---

Microsoft Unicode Text File Byte Order Mark (BOM) FAQ

by Shlomi Tal ([EMAIL PROTECTED])

Contents

1. What is a BOM?
2. Why does it matter?
3. Is the BOM mandatory or optional?
--------------------------------------------------

1. What is a BOM?
^^^^^^^^^^^^^^^^^

BOM, or Byte-Order Mark, is a signature at the beginning of a Unicode text 
file. Since different processors serialize multi-byte values in different 
orders, the BOM is used to mark which byte order the text file was written 
in.

Processors are either big-endian or little-endian. The former put the most 
significant byte first, and the latter put the least significant byte first. 
Thus the 16-bit number 0x071F is serialized as:

Big-endian 07 1F
Little-endian 1F 07

Obviously a code with the value 0x071F will be interpreted as 0x1F07 if it 
passes to a processor of a different byte order without information about 
its original state. This is what the Unicode BOM seeks to avoid.
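
In Python terms (just an illustration of the serialization):

  import struct

  print(struct.pack('>H', 0x071F))  # big-endian:    b'\x07\x1f'
  print(struct.pack('<H', 0x071F))  # little-endian: b'\x1f\x07'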

The Unicode standard permits the character U+FEFF (Zero-Width
Non-Breaking Space) at the beginning of the file as a mark for the
byte order of the file. A Unicode text file beginning with FEFF is
big-endian, and a file beginning with FFFE (not a legal Unicode
character for any other purpose) is little-endian.

All this is relevant to the 16-bit and 32-bit encodings of Unicode
characters - UTF-16 and UTF-32 respectively. Thus:

FE FF is UTF-16 Big-Endian
FF FE is UTF-16 Little-Endian
00 00 FE FF is UTF-32 Big-Endian
FF FE 00 00 is UTF-32 Little-Endian

There is another, very common Unicode encoding scheme called UTF-8, which 
maps the Unicode repertoire onto sequences of single bytes. Since the order 
of single bytes (as opposed to words of more than one byte) is the same on 
all processors, UTF-8 does not require a BOM. It can have one, though.

In addition, a Unicode encoding scheme named UTF-7, which was meant as
a mail-safe encoding but is now nearly obsolete, can have a BOM as
well. Here too the BOM is not mandatory.
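
Putting all this together, BOM sniffing takes only a few lines in any 
language; a Python sketch (longer signatures are checked first, since FF FE 
is also a prefix of the UTF-32 little-endian BOM):

  BOMS = [
      (b'\x00\x00\xfe\xff', 'UTF-32 Big-Endian'),
      (b'\xff\xfe\x00\x00', 'UTF-32 Little-Endian'),
      (b'\xfe\xff',         'UTF-16 Big-Endian'),
      (b'\xff\xfe',         'UTF-16 Little-Endian'),
      (b'\xef\xbb\xbf',     'UTF-8 (BOM optional)'),
  ]

  def sniff_bom(path):
      with open(path, 'rb') as f:
          head = f.read(4)
      for signature, name in BOMS:
          if head.startswith(signature):
              return name
      return 'no BOM (ANSI, or BOM-less UTF-8)'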

2. Why does it matter?
^^^^^^^^^^^^^^^^^^^^^^

It matters because Microsoft tools (most prominently Windows Notepad) 
regularly prefix the BOM to Unicode text files, whereas other systems and 
environments (Unix, Linux, web pages) are better off without the BOM, 
especially in the case of UTF-8 text files.

Unix systems, for example, look for an initial #! in a shell script file in 
order to determine the interpreter for it. An initial BOM coming instead of 
the #! could easily disrupt this convention. Also (and this applies 
particularly to databases, and not only on Unix) the BOM can cause disorder 
when files are merged. Web pages usually use UTF-8, and although they can 
handle the BOM, it may appear as a strange character (a blank square or a 
question mark) in a browser that doesn't recognize it, and may also cause 
the above troubles when the file is saved to the local disk.

Most of the Unicode text meant for open transfer between various
systems (and the Web) is encoded in UTF-8. Unix systems regularly form
UTF-8 text files without the BOM, but Windows systems prefix the BOM
as usual. Here follows an explanation of when the Unicode BOM can or
cannot be removed from text files on Microsoft Windows systems.

3. Is the BOM mandatory or optional?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Microsoft Windows, beginning with the Unicode-supporting operating systems 
Windows 2000 and Windows XP, can handle UTF-16 Little-Endian, UTF-16 
Big-Endian, UTF-8 and the old 8-bit "ANSI" (Microsoft's non-standard name 
for its 8-bit Windows codepages, consisting of the ASCII repertoire for the 
first 128 characters and varying characters for the other 128). The native 
encoding for these systems is UTF-16 Little-Endian, which Notepad calls 
"Unicode" when saving. UTF-16 Big-Endian is called "Unicode Big-Endian", and 
UTF-8 keeps its name.

Upon saving a Unicode text file in Notepad, the BOM is always prefixed. 
Thus, opening such a file with a text editor that is not Unicode-aware (such 
as edit.com), or doing a hexdump on it, you will see UTF-16 Little-Endian 
("Unicode") starting with FF FE, UTF-16 Big-Endian ("Unicode Big-Endian") 
starting with FE FF, and UTF-8 starting with the UTF-8 encoding of the BOM: 
EF BB BF.

For the first two encoding schemes (UTF-16), the user MUST NOT remove the 
BOM manually. Removing the BOM using an external tool (such as edit.com) and 
then opening the file with Notepad will reveal a pile of gibberish, and then 
saving the file will corrupt it beyond recovery. This is because the BOM is 
necessary for the system to read the 16-bit values as they are, rather than 
interpreting them as 8-bit sequences. Without the BOM, an 8-bit sequence 
forming part of a 16-bit Unicode character will be given its separate ASCII 
value, which may be a control character. Many