Ha.  I'm in the process of comparing mimetype detection results from DROID, 
Tika and 'file' on our TIKA-1302 corpus.

After that, I was going to compare our different encoding detectors on the 
corpus...I'll have a better answer in a few weeks.

Others on this list probably have more info, but our general encoding detector 
first tries to get the encoding from the HTML meta charset info, then tries the 
UniversalEncodingDetector and then the Icu4JDetector.  It stops as soon as one 
of those detectors returns a non-null answer.  That order was initially set in 
July 2012, and we haven't changed it since.
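
To illustrate the "stop at the first non-null answer" behavior described above, here is a minimal, self-contained sketch. It does not use Tika's own classes: the `Detector` interface, the `FirstNonNullChain` class and the toy detectors are hypothetical stand-ins, assumed only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FirstNonNullChain {

    /** Hypothetical stand-in for an encoding detector; returns null when it has no answer. */
    public interface Detector {
        Charset detect(byte[] data);
    }

    /** Tries each detector in order and returns the first non-null result. */
    public static Charset detectFirstNonNull(List<Detector> chain, byte[] data) {
        for (Detector detector : chain) {
            Charset charset = detector.detect(data);
            if (charset != null) {
                return charset; // later detectors are never consulted
            }
        }
        return null; // no detector had an answer
    }

    public static void main(String[] args) {
        // Toy detector: looks for a literal "charset=utf-8" in the bytes (assumption, not Tika's logic).
        Detector htmlMeta = data -> {
            String text = new String(data, StandardCharsets.ISO_8859_1).toLowerCase(Locale.ROOT);
            return text.contains("charset=utf-8") ? StandardCharsets.UTF_8 : null;
        };
        // Toy detector that never has an opinion.
        Detector undecided = data -> null;
        // Toy detector that always answers, acting as the last resort.
        Detector fallback = data -> StandardCharsets.ISO_8859_1;

        List<Detector> chain = Arrays.asList(htmlMeta, undecided, fallback);

        byte[] html = "<meta charset=utf-8>".getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "no hints here".getBytes(StandardCharsets.US_ASCII);

        System.out.println(detectFirstNonNull(chain, html));
        System.out.println(detectFirstNonNull(chain, plain));
    }
}
```

With a chain like this, the answer depends on which detector speaks first, so calling an individual detector directly can give a different result from the chain for the same file.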

In short, this is an area for further analysis.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Monday, April 18, 2016 9:59 AM
To: d...@tika.apache.org
Subject: Fwd: Need Help



Sent from my iPhone

Begin forwarded message:

From: harsh kumar <kumarhars...@gmail.com>
Date: April 18, 2016 at 2:02:23 AM PDT
To: <dev-ow...@tika.apache.org>
Subject: Fwd: Need Help

Hi,

I am using Tika to detect the encoding of a file, but I found that the 
results are not consistent when I use CharsetDetector and 
UniversalEncodingDetector on the same file.

Could you please briefly explain the major differences between them and their 
best-fit use cases?

Looking forward to your early reply.

--
Warm Regards,
Harsh Kumar

  • RE: Need Help Allison, Timothy B.
