Hello

From my experience, using n-grams for one-byte encodings works pretty well for language/charset detection.
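To make the idea concrete, here is a minimal sketch in Python. It is not taken from jchardet, Tika, Nutch or the Mozilla detector; the function names (byte_ngrams, build_profiles, detect), the choice of bigrams, and the floor probability for unseen n-grams are all my own assumptions for illustration. It encodes one training text in each candidate one-byte encoding, records byte-bigram frequencies, and then picks the encoding whose profile gives the unknown bytes the highest log-likelihood.

```python
from collections import Counter
import math


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in data (empty if data is shorter than n)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_profiles(sample_text: str, encodings, n: int = 2) -> dict:
    """Encode the same training text in every candidate encoding and
    store the relative frequency of each byte n-gram per encoding."""
    profiles = {}
    for enc in encodings:
        counts = byte_ngrams(sample_text.encode(enc), n)
        total = sum(counts.values())
        profiles[enc] = {gram: c / total for gram, c in counts.items()}
    return profiles


def detect(data: bytes, profiles: dict, n: int = 2, floor: float = 1e-6) -> str:
    """Return the encoding whose profile scores data highest.

    The score is a log-likelihood; n-grams never seen in training get the
    small probability `floor` (an arbitrary value chosen for this sketch).
    """
    grams = byte_ngrams(data, n)

    def score(profile: dict) -> float:
        return sum(c * math.log(profile.get(g, floor)) for g, c in grams.items())

    return max(profiles, key=lambda enc: score(profiles[enc]))
```

A caller would build the profiles once from a reasonably large sample of text in the target language, for example build_profiles(text, ["cp1251", "koi8_r", "iso8859_5"]), and then call detect(raw_bytes, profiles) on each document. The same scheme extends to language detection by training one profile per (language, encoding) pair; a real detector would need far more training data than a toy example.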

2009/12/9 Jérôme Charron <jerome.char...@gmail.com>:
> Hi Antoni,
>
> I tried many charset detection libraries while working on Nutch but none of
> them was really working.
> I also tried to take a look at the mozilla charset detector, but it was
> really too complicated to integrate into Nutch (or Tika).
>
> Best regards
>
> Jérôme
>
> 2009/12/9 Antoni Mylka <antoni.my...@gmail.com>
>
>> Aperturians, Tika
>>
>> I was wondering if anyone has any experience with the jchardet library
>> for charset detection. Does it work? What kinds of documents does it
>> actually support.
>>
>> Christiaan has posted an idea to the Aperture tracker how we could use
>> jchardet to improve the plain text extractor, but it doesn't seem to
>> work. Or maybe the Tika guys have figured it out already and I can just
>> use Tika for this? :)
>>
>> Antoni Mylka
>> antoni.my...@gmail.com
>>
>
> --
> Jérôme Charron
> Directeur Technique @ WebPulse
> Tel: +33675742890 <= ** NEW **
> eMail : jerome.char...@webpulse.fr
> http://www.webpulse.fr/
> http://www.shopreflex.com/
> http://www.staragora.com/
>

--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/
http://alexott-ru.blogspot.com/
http://xtalk.msk.su/~ott/