Re: Misinformation!

2004-06-05 Thread Roozbeh Pournader
On Thu, 2004-06-03 at 20:04, Ordak D. Coward wrote:
> Is there a trustworthy, easy-to-read document somewhere on the Internet
> that covers all these issues, so that I can refer people to it?

I don't know what "easy to read" may mean. Perhaps Connie's pages are the
best for that. For the more technical type, there is always ISIRI 6219.

roozbeh


___
PersianComputing mailing list
[EMAIL PROTECTED]
http://lists.sharif.edu/mailman/listinfo/persiancomputing


Re: Misinformation!

2004-06-05 Thread Roozbeh Pournader
On Fri, 2004-06-04 at 18:03, Ordak D. Coward wrote:

> Behdad, does the Unicode Consortium provide a search collation table in
> addition to the collation table used for sorting? Or can the same
> table be used for search purposes as well?

Well, I'm not Behdad, but I guess I have some answers.

The first answer is: no, the Unicode Consortium doesn't provide a
separate collation table for searching. The second answer is: yes, you
can use the same table for searching. For example, you can use the data
to ignore secondary and tertiary differences in your string comparison
for loose matching. But please note that the table is only there for
cases where you don't know anything about the locale. For Persian, the
table needs to be tailored heavily.
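
As a rough illustration of what "ignoring secondary and tertiary
differences" means, the sketch below compares strings on primary weights
only. The weight table and the names TOY_WEIGHTS, sort_key and loose_equal
are made up for this example; the weights are not taken from the Unicode
collation data, and a real implementation would load weights tailored for
the locale.

```python
# Toy collation elements: (primary, secondary, tertiary) weights per character.
# These weights are invented for illustration; they are NOT real collation data.
TOY_WEIGHTS = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),       # differs from "a" only at the tertiary (case) level
    "\u00e1": (1, 1, 0),  # a with acute: differs from "a" at the secondary level
    "b": (2, 0, 0),
    "B": (2, 0, 1),
}


def sort_key(text, strength=3):
    """Build a comparison key from the first `strength` weight levels."""
    elements = [TOY_WEIGHTS[ch] for ch in text]
    return tuple(
        tuple(element[level] for element in elements)
        for level in range(strength)
    )


def loose_equal(left, right):
    """Compare at primary strength only, ignoring secondary and tertiary weights."""
    return sort_key(left, strength=1) == sort_key(right, strength=1)
```

With this toy table, "ab" and "\u00e1B" compare as loosely equal although
their full three-level keys differ, while "ab" and "ba" do not match at any
level.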

roozbeh




RE: Misinformation!

2004-06-04 Thread Ehsan Akhgari
> There's a difference in the case of C++ standard and web
> standards:  Writing non-standard C++ code only produces compile-time
> problems, but if you happen to compile the code, it works correctly
> (or is supposed to do so).

Well, that's not exactly so.  Some non-conformant behavior tends to generate
(maybe subtle) differences in runtime behavior.  But I see what your point
is.

> But it's quite a different case on the web.
> 30-40 percent is low enough to get ignored, considering that the other
> way you are sacrificing the other 60-70% for not being able to find
> the document by searching in Google.  And note that even with Win9x
> and a recent IE, and updated fonts, there's no problem.

I'd definitely do so if the Google search problem couldn't be solved.  But
I've been using a method I've mentioned in my other post to solve that
problem as well.  This was the best way of getting the best of both worlds
that I could think of, but I'm wide open to suggestions/improvements to
this idea.

> About using HTML entities, no matter what the encoding of the page is,
> HTML entities generate Unicode characters.

They do on most browsers, but browsers are not required to do so.  Consider
a browser which can't handle UTF-8 (well, or at all).

> It's quite common to see
> people exporting Persian documents in MS Word, and get an HTML page
> encoded in MS Arabic encoding, with Persian Yeh and Keh encoded in
> HTML entities.

Yes, and that will make their document even more difficult for search
engines to index.  And of course, I'd argue that CP1256/ISO-8859-6 is
not suitable for Persian documents, but that's another story perhaps.

> PS.  BTW, I just found that using Harakat (kasre, fathe, ...) also
> prevents a hit in Google search :(.  That's quite expected, but perhaps
> I should reconsider my habit of putting those tiny marks everywhere.

That's another sad fact.  I really think that Google must seriously consider
handling such details in their indexing process.  That's also one
of the things that AriaSearch.com handles.
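
For illustration only, here is a minimal sketch of how an indexer could
ignore Harakat: strip the marks from both the indexed text and the query
before comparing them. The names HARAKAT and strip_harakat are hypothetical,
the range U+064B to U+0652 is an assumption about which marks to drop, and
nothing here describes how Google or AriaSearch.com actually work.

```python
# Arabic short-vowel and related marks, U+064B (FATHATAN) through U+0652 (SUKUN).
# Treating exactly this range as "Harakat" is an assumption of this sketch.
HARAKAT = {chr(code) for code in range(0x064B, 0x0652 + 1)}


def strip_harakat(text):
    """Return `text` with the marks in HARAKAT removed."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def matches(query, document):
    """Substring search that ignores Harakat on both sides."""
    return strip_harakat(query) in strip_harakat(document)
```

A query typed without the marks would then find a document that uses them,
and the other way round.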

---

Hmmm, now that we're here, how about gathering some volunteers who can work
with Google to fix some of these problems?  In the past, I've contacted
Google on a number of occasions about small problems in their services, and
they seemed quite willing to fix them.  Hopefully we would have a more
Persian-friendly Google in the future this way.

If you feel that this is a good idea, I'd be pleased to take part in that
team.  Comments?

-
Ehsan Akhgari

Farda Technology (http://www.farda-tech.com/)

List Owner: [EMAIL PROTECTED]

[ Email: [EMAIL PROTECTED] ]
[ WWW: http://www.beginthread.com/Ehsan ]





RE: Misinformation!

2004-06-04 Thread Ehsan Akhgari
> Here is a solution (in fact a hack) that, if implemented correctly, can
> resolve some of the issues till people and Google start using correct
> software:
>
> With a little tweaking, web servers can translate the correct
> Unicode to the incorrect Unicode desired so much by the Win9X users.
> That is, the web server looks at the browser request, and if it can
> detect Win9X, translates all U+06CC's in the document to U+064A (and
> all other required translations). The same technique could be used to
> fool Google into generating correct search results. That is, the web
> server generates a Win9X-friendly version of the document and appends
> it to the original document. You can also allocate tags with which the
> user of the web server can disable or enable some of these features.
> This may even make one gain some advantage over other web hosting
> companies.

That solves half of the problem.  On Win9x, the key d on the keyboard
inserts an ARABIC YEH, and on Win2K+, it inserts FARSI YEH.  So, if you use
this method, when a user types a word containing yeh into Google's
search box on Win9x, they wouldn't find your site.

The best hack (or solution, as one might call it) I've found for this is
feeding Google a version of the page which contains both forms of words
(using YEH and FARSI YEH) so that the chances of Google finding your page
for a certain keyword are maximized.  Of course, certain measures must be
taken to prevent bad results; for example, the proximity of the words must
not be disturbed.  Nevertheless, this will cause other problems, such as
distorted keyword density, which cannot be solved reliably.  The problem
must be fixed in the search engine code, really, and such hacks have their
own downsides.  The search engine project I've been working on
handles this (and the ARABIC KAF versus KEHEH
problem) among other problems for searching in Persian text.
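
To make the search-engine-side fix concrete, here is a minimal sketch of
one possible approach: fold the Arabic and Persian forms onto a single code
point in both the indexed text and the query. The names FOLD, fold and
matches are hypothetical, folding towards the Persian forms is an arbitrary
choice, and this is not a description of how the project mentioned above
does it. Applying such a fold to every document would also affect Arabic
text, so a real engine would need to decide per language when to use it.

```python
# Fold ARABIC LETTER YEH (U+064A) onto ARABIC LETTER FARSI YEH (U+06CC) and
# ARABIC LETTER KAF (U+0643) onto ARABIC LETTER KEHEH (U+06A9).
# The direction of the fold is an assumption of this sketch.
FOLD = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def fold(text):
    """Map both spellings of Yeh and Kaf onto one canonical form."""
    return text.translate(FOLD)


def matches(query, document):
    """Substring search that treats the Arabic and Persian forms as equal."""
    return fold(query) in fold(document)
```

Because the fold is applied to the query as well as the document, it does
not matter which keyboard layout produced either of them.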

> Of course, the solution above is only a transient one, and it is up to
> people to upgrade their Win9X machines to something that is
> Unicode-compliant. It is also up to Google to program their systems
> so that they understand that U+06CC and U+064A have the same
> shape and hence should be regarded the same for searching unless the user
> requests otherwise. This is the same as case-insensitive search, which
> is usually implemented by mapping all upper and lower case characters
> -- in documents and queries alike -- to uppercase.

Yeah that's right.  Of course great attention must be paid so that it
doesn't break Arabic search results.


-
Ehsan Akhgari

Farda Technology (http://www.farda-tech.com/)

List Owner: [EMAIL PROTECTED]

[ Email: [EMAIL PROTECTED] ]
[ WWW: http://www.beginthread.com/Ehsan ]

He who sees the abyss, but with eagle's eyes - he who with eagle's talons
grasps the abyss: he has courage.
-Thus Spoke Zarathustra, F. W. Nietzsche





Re: Misinformation!

2004-06-04 Thread Behdad Esfahbod
Hi there,

Well, this approach has been investigated by some people already.
Another approach that is easier to implement is to use JavaScript
to translate the page on the browser side.  For people using PHP,
it's a couple of lines to open an output buffer that does the
translation, and I'm sure we've seen that before on this list.

But to the main question, unfortunately no, Unicode does not
define any kind of loose searching.  There is some loose
equivalency data in the Unicode database, but that apparently does
not include cases like Arabic and Persian Yeh.  We at FarsiWeb
are developing a standard for loose searching in Persian, but
you know that's nothing to be implemented by Google.  It's
generally a tough problem.  You can do much better in a
language-specific setting, but a global loose searching scheme, I
guess, typically gives worse precision/recall, so it will be
avoided by search engines.

behdad



Re: Misinformation!

2004-06-04 Thread Ordak D. Coward
Here is a solution (in fact a hack) that, if implemented correctly, can
resolve some of the issues till people and Google start using correct
software:

With a little tweaking, web servers can translate the correct
Unicode to the incorrect Unicode desired so much by the Win9X users.
That is, the web server looks at the browser request, and if it can
detect Win9X, translates all U+06CC's in the document to U+064A (and
all other required translations). The same technique could be used to
fool Google into generating correct search results. That is, the web
server generates a Win9X-friendly version of the document and appends
it to the original document. You can also allocate tags with which the
user of the web server can disable or enable some of these features.
This may even make one gain some advantage over other web hosting
companies.
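
A minimal sketch of the server-side translation described above, written
as two plain functions rather than for any particular web server. The names
WIN9X_TOKENS, is_win9x and render_for are hypothetical, and the list of
User-Agent substrings is an assumption that would need checking against
real server logs. Only the U+06CC to U+064A mapping named above is
included; the "other required translations" are left out.

```python
# Substrings assumed to appear in the User-Agent of Win9X browsers.
# This list is a guess for illustration, not a verified catalogue.
WIN9X_TOKENS = ("Windows 95", "Windows 98", "Win95", "Win98", "Win 9x")

# U+06CC (ARABIC LETTER FARSI YEH) -> U+064A (ARABIC LETTER YEH).
TO_WIN9X = str.maketrans({"\u06cc": "\u064a"})


def is_win9x(user_agent):
    """Guess from the User-Agent header whether the client runs Win9X."""
    return any(token in user_agent for token in WIN9X_TOKENS)


def render_for(user_agent, page):
    """Return the page unchanged, or translated for Win9X clients."""
    if is_win9x(user_agent):
        return page.translate(TO_WIN9X)
    return page
```

The stored document stays standards-compliant; only the copy sent to a
detected Win9X client is altered.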

Of course, the solution above is only a transient one, and it is up to
people to upgrade their Win9X machines to something that is
Unicode-compliant. It is also up to Google to program their systems
so that they understand that U+06CC and U+064A have the same
shape and hence should be regarded the same for searching unless the user
requests otherwise. This is the same as case-insensitive search, which
is usually implemented by mapping all upper and lower case characters
-- in documents and queries alike -- to uppercase.

Behdad, does the Unicode Consortium provide a search collation table in
addition to the collation table used for sorting? Or can the same
table be used for search purposes as well?


RE: Misinformation!

2004-06-04 Thread Behdad Esfahbod
Thanks for your note.

There's a difference in the case of C++ standard and web
standards:  Writing non-standard C++ code only produces
compile-time problems, but if you happen to compile the code, it
works correctly (or is supposed to do so).  But it's quite a
different case on the web.  30-40 percent is low enough to get
ignored, considering that the other way you are sacrificing the
other 60-70% for not being able to find the document by searching
in Google.  And note that even with Win9x and a recent IE, and
updated fonts, there's no problem.

About using HTML entities, no matter what the encoding of the
page is, HTML entities generate Unicode characters.  It's quite
common to see people exporting Persian documents in MS Word, and
get an HTML page encoded in MS Arabic encoding, with Persian Yeh
and Keh encoded in HTML entities.

behdad

PS.  BTW, I just found that using Harakat (kasre, fathe, ...)
also prevents a hit in Google search :(.  That's quite expected,
but perhaps I should reconsider my habit of putting those tiny
marks everywhere.


--behdad
  behdad.org


RE: Misinformation!

2004-06-04 Thread Ehsan Akhgari
> Unfortunately this kind of misinforming is quite popular in weblogs,
> where people only care about being visible to more people.

I confess that I'm one of those who use this technique on their web sites.
I don't believe it's correct, and I don't think of it even as a semi-elegant
solution.  It's a solution which just works on the largest number of
platforms.  By inspecting the web server logs, I notice that still an
average of 30-40 percent of the visitors are using Win9x.  Hopefully one can
start dropping support for Win9x users as their number is constantly
decreasing, but right now if I choose the standards-compliant route of using
FARSI YEH everywhere, those Win9x-ers will not be able to browse my sites.

I have high respect for, and a strong inclination toward, the standards.
I'm mostly a C++ programmer, and I'm one of those "preachers" of the C++
Standard.  However, today's C++ compilers are still not fully compliant
with the C++ Standard, so
whenever someone asks me for advice on how to accomplish a certain task on a
non-conformant compiler, I show them the non-standards way, and also mention
the standards way, so that they know what the *right* way is, and also what
the way to do their job right now is.  I see little difference in the web
standards land as well.

Of course this 'solution' (if it can be called so) poses other problems,
such as the inability of search engine spiders such as Google's to correctly
index words written with both forms of YEH, which must be addressed
separately.  Also, if you choose to use the FARSI YEH form everywhere, then
again such problems will occur (for example, a Win9x-er can neither correctly
see your pages nor find them in Google if they query for a word containing
YEH.)

> They even go on and use HTML entities (like ٚ) instead of UTF-8,
> just because if the user's browser is set to something other than auto
> and UTF-8, the page is still rendered correctly...

This one is silly, and I don't see how this can solve any problem.  The
browsers are required to be able to correctly resolve such numerical
entities only if the page's encoding is already UTF-8, and if it is so, why
not use UTF-8 encoded characters in the first place?  Also, some agents have
difficulties interpreting such numerical forms.  Furthermore, maintaining
them is impossible (not hard), and they can't even be treated as text by
most software packages (for example, they can't be searched for by many
programs.)  And last but not least, for a regular Persian document,
they're likely to increase the document size by more than two times.
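
The size claim can be checked with a short sketch. A Persian letter takes
two bytes in UTF-8, while a decimal numeric entity such as &#1740; takes
seven ASCII bytes. The function names as_entities and size_ratio are
hypothetical, and the sketch assumes decimal entities are used for every
non-ASCII character.

```python
def as_entities(text):
    """Replace every non-ASCII character with a decimal numeric entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def size_ratio(text):
    """Byte size of the entity form divided by the byte size of the UTF-8 form."""
    utf8_size = len(text.encode("utf-8"))
    entity_size = len(as_entities(text).encode("ascii"))
    return entity_size / utf8_size
```

For text made up only of Persian letters the ratio comes out at 3.5; ASCII
spaces, punctuation and markup in a real page pull it down, since they cost
one byte either way.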

They have their own usage, of course, but I don't see any sense in using
them instead of UTF-8 characters for regular web pages.

-
Ehsan Akhgari

Farda Technology (http://www.farda-tech.com/)

List Owner: [EMAIL PROTECTED]

[ Email: [EMAIL PROTECTED] ]
[ WWW: http://www.beginthread.com/Ehsan ]





Re: Misinformation!

2004-06-03 Thread Behdad Esfahbod

Unfortunately this kind of misinforming is quite popular in
weblogs, where people only care about being visible to more
people.  They even go on and use HTML entities (like ٚ)
instead of UTF-8, just because if the user's browser is set to
something other than auto and UTF-8, the page is still rendered
correctly...

Ehsan, you here?

b

On Thu, 3 Jun 2004, Ordak D. Coward wrote:

> I recently came across this article
> http://www.khabgard.com/?id=844986758 which is endorsed by some other
> weblog authors. The author encourages using different Yeh characters
> for middle and end placements. The author in fact uses U+064A (ARABIC
> LETTER YEH) for middle-of-word and beginning-of-word Yeh's and uses
> U+06CC (ARABIC LETTER FARSI YEH) for end-of-word Yeh's. I believe he
> is giving bad advice to people. His justification is that people with
> older MS-Windows systems will see texts correctly by his suggestion.
> It is a bad principle to bend standards to support non-compliant
> platforms. I am going to send an e-mail to him about the issue, but I
> want to confirm my understanding of the issue before contacting the
> author.
>
> Furthermore, while he correctly asks people to use U+06F4, U+06F5, and
> U+06F6 in place of U+0664, U+0665, and U+0666, he stops there and does
> not extend this advice to all digits.
>
> Is there a trustworthy, easy-to-read document somewhere on the Internet
> that covers all these issues, so that I can refer people to it?

--behdad
  behdad.org


Re: Misinformation!

2004-06-03 Thread C Bobroff

On Thu, 3 Jun 2004, Ordak D. Coward wrote:

> I recently came across this article
> http://www.khabgard.com/?id=844986758 which is endorsed by some other
> weblog authors. The author encourages using different Yeh characters
> for middle and end placements.

Oh my!
I think someone was listening to the discussion on this list back in Nov
2003 with subject, "What the hell is this "Yeh" and "Keheh" problem?" and
took all that as a nice "How to" and increased the problem!
-Connie