Re: Charset detection

Alex Ott Wed, 09 Dec 2009 07:56:30 -0800

Hello

In my experience, n-grams work pretty well for language/charset
detection with single-byte encodings.
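
To illustrate the n-gram idea, here is a minimal Python sketch; it is not
code from jchardet, Nutch, or Tika. It builds a byte-bigram frequency profile
for each candidate charset from a sample text, then picks the charset whose
profile gives the unknown bytes the highest log-likelihood. The sample
sentence, the candidate list (cp1251, koi8_r, cp866), the floor probability,
and all function names are assumptions made for this example. A real detector
would need much larger training corpora.

```python
import math
from collections import Counter

# Hypothetical training text; a real profile would use a large corpus.
SAMPLE = "съешь же ещё этих мягких французских булок, да выпей чаю"
CANDIDATES = ["cp1251", "koi8_r", "cp866"]  # assumed single-byte charsets


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_profile(text: str, encoding: str, n: int = 2) -> dict:
    """Relative frequency of each byte n-gram of text in the given encoding."""
    counts = byte_ngrams(text.encode(encoding), n)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def score(data: bytes, profile: dict, n: int = 2, floor: float = 1e-6) -> float:
    """Log-likelihood of data under a profile; unseen n-grams get a floor."""
    return sum(
        count * math.log(profile.get(gram, floor))
        for gram, count in byte_ngrams(data, n).items()
    )


def detect(data: bytes, profiles: dict, n: int = 2) -> str:
    """Return the name of the best-scoring charset profile."""
    return max(profiles, key=lambda name: score(data, profiles[name], n))


PROFILES = {enc: build_profile(SAMPLE, enc) for enc in CANDIDATES}

if __name__ == "__main__":
    unknown = "мягких булок".encode("koi8_r")
    guess = detect(unknown, PROFILES)
    print(guess, unknown.decode(guess))
```

This works for single-byte Cyrillic encodings because the same letters map
to different byte ranges in each charset, so bigrams of one encoding rarely
occur in the profile of another.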



2009/12/9 Jérôme Charron <jerome.char...@gmail.com>:
> Hi Antoni,
>
> I tried many charset detection libraries while working on Nutch, but none of
> them really worked.
> I also took a look at the Mozilla charset detector, but it was
> really too complicated to integrate into Nutch (or Tika).
>
> Best regards
>
> Jérôme
>
> 2009/12/9 Antoni Mylka <antoni.my...@gmail.com>
>
>> Aperturians, Tika
>>
>> I was wondering if anyone has any experience with the jchardet library
>> for charset detection. Does it work? What kinds of documents does it
>> actually support?
>>
>> Christiaan has posted an idea to the Aperture tracker on how we could use
>> jchardet to improve the plain text extractor, but it doesn't seem to
>> work.  Or maybe the Tika guys have figured it out already and I can just
>> use Tika for this? :)
>>
>> Antoni Mylka
>> antoni.my...@gmail.com
>>
>
>
>
> --
> Jérôme Charron
> Directeur Technique @ WebPulse
> Tel: +33675742890 <= ** NEW **
> eMail : jerome.char...@webpulse.fr
> http://www.webpulse.fr/
> http://www.shopreflex.com/
> http://www.staragora.com/
>



-- 
With best wishes,                    Alex Ott, MBA
http://alexott.blogspot.com/
http://alexott-ru.blogspot.com/
http://xtalk.msk.su/~ott/
