Yeah, charset detection involves a lot of uncertainty, and no method
identifies the charset with 100% accuracy. It's more art than science.
That said, I will hunt around for a decent library too.
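
To illustrate why it is guesswork, here is a minimal sketch using only the
standard java.nio.charset API (not jchardet, Tika, or the Mozilla detector).
The class and method names are my own, made up for this example. It shows
that the same bytes can decode without any error under more than one
charset, so "does it decode cleanly?" can rule charsets out but cannot
pick the right one by itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Aperture, Tika, or Nutch.
class CharsetAmbiguity {

    // Strictly decodes data as cs. Returns the decoded text, or null if the
    // bytes are malformed or unmappable in that charset.
    static String strictDecode(byte[] data, Charset cs) {
        try {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.REPORT)
                     .onUnmappableCharacter(CodingErrorAction.REPORT)
                     .decode(ByteBuffer.wrap(data))
                     .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "Jérôme" written with Unicode escapes so the source file encoding
        // does not matter.
        byte[] data = "J\u00e9r\u00f4me".getBytes(StandardCharsets.UTF_8);

        Charset[] candidates = {
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.US_ASCII
        };
        for (Charset cs : candidates) {
            String text = strictDecode(data, cs);
            System.out.println(cs.name() + ": "
                + (text == null ? "rejected" : "accepted, " + text.length() + " chars"));
        }
    }
}
```

Both UTF-8 and ISO-8859-1 accept these bytes (the latter as mojibake), and
only US-ASCII rejects them. A real detector has to add statistics or
language heuristics on top of this, which is where the errors creep in.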

> Hi Antoni,
>
> I tried many charset detection libraries while working on Nutch, but
> none of them really worked.
> I also took a look at the Mozilla charset detector, but it was
> really too complicated to integrate into Nutch (or Tika).
>
> Best regards
>
> Jérôme
>
> 2009/12/9 Antoni Mylka <antoni.my...@gmail.com>
>
>> Aperturians, Tika
>>
>> I was wondering if anyone has any experience with the jchardet library
>> for charset detection. Does it work? What kinds of documents does it
>> actually support?
>>
>> Christiaan has posted an idea on the Aperture tracker for how we could
>> use jchardet to improve the plain text extractor, but it doesn't seem to
>> work. Or maybe the Tika guys have figured it out already and I can just
>> use Tika for this? :)
>>
>> Antoni Mylka
>> antoni.my...@gmail.com
>>
>
>
>
> --
> Jérôme Charron
> Directeur Technique @ WebPulse
> Tel: +33675742890 <= ** NEW **
> eMail : jerome.char...@webpulse.fr
> http://www.webpulse.fr/
> http://www.shopreflex.com/
> http://www.staragora.com/
> _______________________________________________
> Aperture-devel mailing list
> aperture-de...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/aperture-devel
>
