Re: [farsiweb]heh + hamzeh

Roozbeh Pournader Sun, 02 Jun 2002 13:24:39 -0700

On Sun, 2 Jun 2002, C Bobroff wrote:

> > Ok, it seems that we are seeing a lot of monologues here.
> I'm sure more people than just me are finding the monologues educational


I just wish to emphasize that I have seen the same concern repeated
several times. And I can't forget some of us being referred to as
dictators or something like that. (At least as people who try to impose
their ideas on others.)

Regarding the dictatorship accusation, I wish to emphasize that the matter
of Heh+Hamza was also discussed at the ISIRI meeting for approval of the
standard, and all of the experts agreed or were convinced. The list
includes Dr Mostafa Asi (a computational linguist also working with
Farhangestan), Mr Ebrahim Mashayekh (President of Informatics Society of
Iran), Dr Mohammad Ghodsi (Project Leader of FarsiTeX), Mr Mohammad
Azadnia (Technical manager of Persian project at Iran Communication
Research Center), Mr Arash Rezaiizadeh (one of the entrepreneurs of
Windows Farsification), Mr Arash Zeini (President of Chapar Shabdiz, the
first Iranian Free Software company, also of FarsiKDE fame), and Mr
Hashemi (Gam Electronic's Persian Expert).

All other known experts, if present in Iran, were invited, but some could
not attend: this includes people like Dr Mohammad San'ati of SinaSoft
fame, whom Behdad and I met personally after the meeting, to make sure he
does not have major objections.

I can't understand who Abi was referring to, when she or he wrote "Next I
expect we will be told how to combe out hair. [...] They have nothing to
offer to the Persain IT and language discussion." Was she or he referring
to me, or to Mr Khanban? (We are both members of the technical committee
of the standard you heard a lot about.) To say the least, neither Mr
Khanban nor I have anything to hide about what we have done for the
Persian IT world: Just search Google for "Khanban" or "Pournader". We both
use our real and full names, and have done everything publicly. But who is
"Abi Lover"?

Also, quoting Abi's exact words, she or he is against any standardization:
"There are some people [...] who think that they have a duty to lay down
rules for other people to follow." The Unicode Consortium is doing this.
ISO is doing this. W3C is doing this. Many software companies, from
Microsoft to SinaSoft, also do this, by creating things that will become
de facto standards. You are not obliged to follow standards, but you will
run into trouble if you don't. No one will be able to use your software
with other software.

> Roozbeh, can you please tell us about this "normalization" and why
> the mention of "Persian" is to be removed from this character?

Sure. I have explained the problem a number of times, and I will explain
it again:

There is a notion in Unicode called Normalization. You can read about it
at <http://www.unicode.org/unicode/reports/tr15/>. If you don't have the
time, here is a short summary: Since Unicode is not just for displaying
text, but also for processing it, and it sometimes has different
alternatives for encoding the same text, you need some mechanism to find
out that two strings of characters are actually the same.

One example is the equivalence of U+0624 ARABIC LETTER WAW WITH HAMZA
ABOVE with the string <U+0648, U+0654>, which is <ARABIC LETTER WAW,
ARABIC HAMZA ABOVE>. The algorithm is intelligent enough to detect the
equivalence even if you put a FATHA between the WAW and the HAMZA, so
<WAW WITH HAMZA ABOVE, FATHA> will be equal to both <WAW, FATHA, HAMZA>
and <WAW, HAMZA, FATHA>.
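
The WAW example can be checked with a short sketch. It assumes Python and
its standard unicodedata module (neither is mentioned in this thread), and
the variable names are only illustrative. It normalizes the three
spellings and collects the distinct results:

```python
import unicodedata

# Code points taken from the example above.
WAW = "\u0648"             # ARABIC LETTER WAW
WAW_WITH_HAMZA = "\u0624"  # ARABIC LETTER WAW WITH HAMZA ABOVE
FATHA = "\u064E"           # ARABIC FATHA
HAMZA_ABOVE = "\u0654"     # ARABIC HAMZA ABOVE

# Three spellings of the same text.
candidates = [
    WAW_WITH_HAMZA + FATHA,
    WAW + FATHA + HAMZA_ABOVE,
    WAW + HAMZA_ABOVE + FATHA,
]

def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Canonically equivalent strings share one NFC form and one NFD form.
nfc_forms = {unicodedata.normalize("NFC", s) for s in candidates}
nfd_forms = {unicodedata.normalize("NFD", s) for s in candidates}

for s in candidates:
    nfc = unicodedata.normalize("NFC", s)
    print(code_points(s), "->", code_points(nfc))
```

If the three spellings are canonically equivalent, each set ends up with a
single element, whichever order the FATHA and the HAMZA were typed in.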

This equivalence is very important for security and for the proper
functioning of software, but I won't get into the details. To say the
least, it is an important part of two eagerly awaited standards, both
still drafts: "Internationalized Domain Names",

        http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-09.tx

where applications MUST do normalization before doing name lookup for 
a non-ASCII domain name, and "Character Model for the World Wide Web",

        http://www.w3.org/TR/charmod/

where all web authoring or web content generation software is REQUIRED to
normalize the text of a web document before putting it on the wire.

Getting back to our U+06C0 ARABIC LETTER HEH WITH YEH ABOVE, this letter
is specified to be equal to <U+06D5, U+0654>, which is <ARABIC LETTER AE,
ARABIC HAMZA ABOVE>. This AE is a letter similar to HEH in shape, but only
used in Final and Isolated forms, something like U+0629 ARABIC LETTER TEH
MARBUTA but without the dots. (I think that everyone agrees that this AE
letter has no place in Persian.)

Now let's consider the real situation: someone wants to encode this
"ezaafe" thing. He may look at the charts, and he will choose either
U+06C0 or <U+0647, U+0654> (HEH, HAMZA ABOVE), based on his preference for
"precomposed" or "decomposed" forms. Let's say that you choose the first,
and I choose the second. The sad point is that no Unicode-compliant
application will be able to tell you that these strings are equivalent. In
other words, you will have two ways to encode the same text, without
having them considered equal.
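
The mismatch can be seen in a short sketch, again assuming Python's
standard unicodedata module; the helper name canonically_equivalent is
made up for this illustration. It compares the precomposed U+06C0 with the
decomposition Unicode specifies for it, <U+06D5, U+0654>, and with the
spelling a Persian writer would more likely pick, HEH followed by HAMZA
ABOVE, <U+0647, U+0654>:

```python
import unicodedata

HEH_WITH_YEH_ABOVE = "\u06C0"  # precomposed letter discussed above
AE = "\u06D5"                  # ARABIC LETTER AE
HEH = "\u0647"                 # ARABIC LETTER HEH
HAMZA_ABOVE = "\u0654"         # ARABIC HAMZA ABOVE

def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = HEH_WITH_YEH_ABOVE
decomposed_by_unicode = AE + HAMZA_ABOVE     # what U+06C0 decomposes to
decomposed_by_intuition = HEH + HAMZA_ABOVE  # what a Persian user may type

print(canonically_equivalent(precomposed, decomposed_by_unicode))
print(canonically_equivalent(precomposed, decomposed_by_intuition))
```

The first comparison is expected to succeed and the second to fail, since
nothing in the Unicode data ties HEH plus HAMZA ABOVE to U+06C0.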

When I first found this, I asked the Unicode people to change the
decomposition of U+06C0. I then found that there is a stability policy
about decompositions, and that they had their own reasons for selecting
this one. After that, I asked them to remove the mention of "Persian"
from the comments for this character. They asked me for a formal proposal,
which I guess will pass without any problems.

This is the whole story. If you have questions, please be brief and 
patient, so I can answer them.

roozbeh

_______________________________________________
FarsiWeb mailing list
[EMAIL PROTECTED]
http://lists.sharif.edu/mailman/listinfo/farsiweb
