Re: farsi language auto-detection in web pages

2005-10-27 Thread era+gmane
On Tue, 9 Aug 2005 16:11:39 +0430, mohsen ali momeni
posted to gmane.comp.misc.persiancomputing:
 > How can I auto-detect language of a webpage without knowing its
 > charset? (suppose language and charset are not defined in header)
 > Is there a simple (not time-consuming) method to detect a page charset?

Sorry for the late reply -- I just stumbled over your question via Google.
I'm hoping I might be able to help, a little bit.

There are tools which attempt to match both the language and the
character set of a piece of text at the same time, which indeed kind
of solves the "chicken and egg" problem you stumble into if you
attempt to tackle the problems separately. A popular free tool is
TextCat, which contains language (and incidentally charset)
identification for almost 50 languages, although not for Farsi.

... However, it would seem that the Arabic language detection which is
included in the distribution has in fact been trained with Farsi data
to some extent. Also it would seem to be getting the charset wrong (as
per the garbage in / garbage out principle).

I know neither Farsi nor Arabic, so I would appreciate it if somebody
could confirm this information. (I'm hoping this message will go to
the list too, but I'm not subscribed, so it might get rejected.)

In any event, the language models which are included in the TextCat
distribution are "proof of concept" ones, not really production
quality. Another tool which comes with better models is mguesser. And
in fact, the mguesser models can be used more or less directly by
TextCat (and vice versa).
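
For anyone wondering how such a tool can guess language and charset in
one go: TextCat is built on ranked n-gram profiles (the Cavnar and
Trenkle method). The Python below is only a rough sketch of that idea,
not TextCat's or mguesser's actual code. The function names, the
profile size of 300 and the "fa-cp1256"-style labels are my own
assumptions. Since the profiles are built from raw bytes, one model
per (language, charset) pair answers both questions at once.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cap; also used as the penalty for unseen n-grams


def profile(data: bytes, max_n: int = 3) -> list:
    """Return the most frequent byte n-grams (n = 1..max_n), most common first."""
    counts = Counter(
        data[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(data) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(PROFILE_SIZE)]


def out_of_place(doc: list, model: list) -> int:
    """Sum of rank differences; n-grams missing from the model cost PROFILE_SIZE."""
    rank = {gram: r for r, gram in enumerate(model)}
    return sum(
        abs(r - rank[gram]) if gram in rank else PROFILE_SIZE
        for r, gram in enumerate(doc)
    )


def classify(data: bytes, models: dict) -> str:
    """Pick the label whose model profile is closest to the document profile."""
    doc = profile(data)
    return min(models, key=lambda label: out_of_place(doc, models[label]))


if __name__ == "__main__":
    # Toy training data; real models need far more text per label.
    persian = "من نامه را نوشتم و او آن را خواند"
    models = {
        "fa-cp1256": profile(persian.encode("cp1256")),
        "fa-utf-8": profile(persian.encode("utf-8")),
        "en-ascii": profile(b"the quick brown fox jumps over the lazy dog"),
    }
    print(classify("او نامه را خواند".encode("cp1256"), models))
```

With one training sentence per label this is a toy, which is the same
caveat as for the "proof of concept" models above: the quality of the
guess depends almost entirely on the training corpus.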

If you have a sizable corpus of good-quality Persian / Farsi text
(sorry, I'm vague about the possible difference) to train the tool
with, it would be most helpful if you could make it (the raw data, or
your training results) available.

Links:

 TextCat
 Perl source + (small) models + on-line demo + good documentation

 mguesser
 C source + models; site is somewhat poor quality but the software is OK

The TextCat site also has a page with a survey of the competition
(although mguesser is mysteriously missing from the list).

Hope this helps,

/* era */

-- 
If this were a real .signature, it would suck less.  Well, maybe not.

___
PersianComputing mailing list
PersianComputing@lists.sharif.edu
http://lists.sharif.edu/mailman/listinfo/persiancomputing


Re: farsi language auto-detection in web pages

2005-08-24 Thread Behdad Esfahbod
On Wed, 24 Aug 2005, mohsen ali momeni wrote:

> Hi,
> I need an Add_date function for the Jalali calendar. This will be
> used in an open source project.

What is the Add_date function?

> An alternative can be a perfect algorithm to detect whether a year is
> leap or not.
> Does anyone have a perfect implementation of this function? (I

What do you mean by perfect?

  function is_leap_year () {
    $y = $this->y;
    /* 33-year cycles, it better matches Iranian rules */
    /* the "+33)%33" keeps the remainder non-negative in PHP */
    return (($y+16)%33+33)%33*8%33<8;
  }

  function is_leap_year () {
    $y = $this->y;
    /* 2820-year cycles, idiots think it's more precise */
    return (((($y)-474)%2820+2820)%2820*31%128<31);
  }

(from behdad.org/cal/cal.phps behdad.org/cal/)
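
For comparing the two rules side by side, here is a small Python port
of the two expressions quoted above. It is a sketch, not part of
cal.phps: the names is_leap_33, is_leap_2820 and disagreements are
made up for this example. Python's % already returns a non-negative
result for a positive modulus, so the "+33)%33" and "+2820)%2820"
steps from the PHP version are not needed here.

```python
def is_leap_33(y: int) -> bool:
    """Leap-year test based on 33-year cycles (8 leap years per cycle)."""
    return (y + 16) % 33 * 8 % 33 < 8


def is_leap_2820(y: int) -> bool:
    """Leap-year test based on 2820-year cycles (31 leap years per 128)."""
    return (y - 474) % 2820 * 31 % 128 < 31


def disagreements(first: int, last: int) -> list:
    """Jalali years in [first, last] on which the two rules differ."""
    return [y for y in range(first, last + 1)
            if is_leap_33(y) != is_leap_2820(y)]


if __name__ == "__main__":
    print(disagreements(1300, 1500))
```

The two rules agree on 1399, for example, but not on 1403 and 1404:
the 33-year rule makes 1403 the leap year, the 2820-year rule 1404.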

> have checked the conversion codes (J2G,G2J) in farsiweb but it seems
> to have problems)

What problems?  We don't know of any.


> Regards,

--behdad
http://behdad.org/


Re: farsi language auto-detection in web pages

2005-08-24 Thread mohsen ali momeni
Hi,
I need an Add_date function for the Jalali calendar. This will be
used in an open source project.
An alternative can be a perfect algorithm to detect whether a year is
leap or not.
Does anyone have a perfect implementation of this function? (I
have checked the conversion codes (J2G,G2J) in farsiweb but it seems
to have problems)

Regards,

-- 
__ \ /_\\_-//_ Mohsen A. Momeni


Re: farsi language auto-detection in web pages

2005-08-10 Thread Behdad Esfahbod
On Wed, 10 Aug 2005, mohsen ali momeni wrote:

> Hi,
> Thanks for the reply,
> What I exactly need is CP1256 detection, and after that detecting
> whether the language is Persian or not.

As you can guess, all non-Unicode character sets share the same
8-bit space, so they overlap all the time.  Your only bet at
charset detection is to look at the areas that are left unencoded
in each character set and cross out the charsets whose forbidden
areas the text uses.  Language detection can help with charset
detection too: you can look for the string SPACE REH ALEPH SPACE
as a good indicator of Persian.
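
The two steps above could be sketched in Python roughly as follows.
This is an illustration under assumptions, not a tested detector: the
candidate list and the function names are made up, and the codec
names are Python's. A strict decode fails exactly when the text uses a
byte that the charset leaves unencoded. Note that Windows-1256 assigns
a character to every (or nearly every) byte value, so in practice this
step mostly crosses out the other candidates.

```python
# Assumed candidate list, in no particular order.
CANDIDATES = ("utf-8", "iso8859_6", "cp1256")

# SPACE REH ALEF SPACE: the word "ra" as a standalone token.
PERSIAN_MARKER = " \u0631\u0627 "


def surviving_charsets(data: bytes, candidates=CANDIDATES) -> list:
    """Cross out every charset in which the data hits an unencoded area."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on forbidden bytes
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


def looks_persian(text: str) -> bool:
    """True if the decoded text contains the SPACE REH ALEF SPACE marker."""
    return PERSIAN_MARKER in text


if __name__ == "__main__":
    page = "من نامه را نوشتم".encode("cp1256")
    for name in surviving_charsets(page):
        print(name, looks_persian(page.decode(name)))
```

More than one charset can survive, and the marker can show up under
more than one of them, so treat the result as a hint, not a verdict.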


> Regards,

--behdad
http://behdad.org/


Re: farsi language auto-detection in web pages

2005-08-10 Thread mohsen ali momeni
Hi,
Thanks for the reply,
What I exactly need is CP1256 detection, and after that detecting
whether the language is Persian or not.

Regards,

-- 
__ \ /_\\_-//_ Mohsen A. Momeni


Re: farsi language auto-detection in web pages

2005-08-09 Thread Behdad Esfahbod
On Tue, 9 Aug 2005, mohsen ali momeni wrote:

> Hi,
>
> How can I auto-detect language of a webpage without knowing its
> charset? (suppose language and charset are not defined in header)
> Is there a simple (not time-consuming) method to detect a page charset?

If it's UTF-8 or UTF-16, kinda easy, not really otherwise.
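
The "kinda easy" part could look something like this in Python. It is
a sketch under assumptions: sniff_unicode is a made-up name, UTF-16
is only recognized when a byte order mark is present, and UTF-32 is
ignored. UTF-8 is recognized by a strict decode, which 8-bit Arabic
or Persian text practically never passes by accident.

```python
import codecs


def sniff_unicode(data: bytes):
    """Return a Python codec name if data looks like UTF-8/UTF-16, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the byte order from the BOM
    try:
        data.decode("utf-8")  # strict: fails on malformed sequences
    except UnicodeDecodeError:
        return None
    return "utf-8"  # note: pure ASCII also ends up here


if __name__ == "__main__":
    for enc in ("utf-8", "utf-16", "cp1256"):
        print(enc, sniff_unicode("سلام".encode(enc)))
```

A None result only means "not obviously Unicode"; telling the legacy
8-bit charsets apart is the hard part.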


> Regards,

--behdad
http://behdad.org/


Re: farsi language auto-detection in web pages

2005-08-09 Thread Paul Hastings

mohsen ali momeni wrote:

> How can I auto-detect language of a webpage without knowing its
> charset? (suppose language and charset are not defined in header)

kind of a lousy way to do things but ...

> Is there a simple (not time-consuming) method to detect a page charset?

http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html