Thanks, Alex - great input.
We'd run into similar problems at Krugle, with determining the correct
mime-type for source code. Sometimes you wind up needing to parse the
code to make the correct choice.
We had extended the Nutch mime-type detector to support both regex and
post-processing to handle this disambiguation.
But that was hard-coded for a handful of known edge cases.
One possible way for this to work with the current XML-based mime-type
definitions is to have a "here's the name of the class you'll have to
instantiate and run to make the final call" entry.
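A rough sketch of that idea, to make it concrete. Everything named here is hypothetical: the MimeTypeResolver interface, the CHeaderResolver class, and the notion of a "resolver" attribute in the XML are illustrations only, not anything that exists in Tika or Nutch. The definition would carry a class name, and the detector would instantiate that class by reflection and let it make the final call.

```java
import java.nio.charset.StandardCharsets;

class PluggableDetectorSketch {

    /** Hypothetical hook: gets the leading bytes plus the type picked by
     *  magic/glob matching, and returns the final mime type. */
    public interface MimeTypeResolver {
        String resolve(byte[] header, String candidateType);
    }

    /** Hypothetical resolver for one source-code edge case: a ".h" file
     *  that is really a C++ header. The heuristic is deliberately crude. */
    public static class CHeaderResolver implements MimeTypeResolver {
        @Override
        public String resolve(byte[] header, String candidateType) {
            String text = new String(header, StandardCharsets.ISO_8859_1);
            boolean looksLikeCpp = text.contains("class ") || text.contains("namespace ");
            return looksLikeCpp ? "text/x-c++hdr" : candidateType;
        }
    }

    /** resolverClass stands in for the class name read from the XML
     *  definition (assumed attribute); null means "no post-processing". */
    static String detect(byte[] header, String candidateType, String resolverClass) {
        if (resolverClass == null) {
            return candidateType;
        }
        try {
            MimeTypeResolver resolver = Class.forName(resolverClass)
                    .asSubclass(MimeTypeResolver.class)
                    .getDeclaredConstructor()
                    .newInstance();
            return resolver.resolve(header, candidateType);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load resolver " + resolverClass, e);
        }
    }

    public static void main(String[] args) {
        String resolverClass = CHeaderResolver.class.getName();
        byte[] header = "namespace foo { class Bar; }".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(header, "text/x-chdr", resolverClass));
    }
}
```

The point of loading by name is that new edge cases would only need a new class and a changed definition, rather than another hard-coded branch in the detector.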
-- Ken
On Mar 18, 2010, at 11:21am, Alex Ott wrote:
I'm not sure whether this is relevant for Tika, but I looked into its mime
database and see a problem in the definitions: both types use the common OLE
signature (MS CFBF, Microsoft Compound File Binary Format), which is also
used by dozens of other file formats. To detect the mime type of a CFBF file
correctly, you need to analyze it (with POI?) and determine which objects are
located in the top-level directory (directly under the Root Directory entry)
of the OLE file. For Word this is the WordDocument object, while for Excel it
is Workbook or Book. A simple search for the corresponding names will not
help, because all of these objects can be embedded into other documents via
OLE.
You can find further details in the official Microsoft documentation.
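The check described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the set of top-level entry names is taken as given (in practice it would have to come from a CFBF parser such as POI, which is not called here), the class and method names are made up, and application/octet-stream is just a placeholder fallback.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CfbfTypeSketch {

    // 8-byte signature shared by every CFBF (OLE2) file, Word and Excel alike.
    static final byte[] CFBF_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if the leading bytes carry the CFBF signature. */
    static boolean isCfbf(byte[] header) {
        return header.length >= CFBF_SIGNATURE.length
                && Arrays.equals(Arrays.copyOf(header, CFBF_SIGNATURE.length), CFBF_SIGNATURE);
    }

    /**
     * Decides the type from entries directly under the Root Directory entry.
     * Only top-level names are consulted, so a Workbook stream embedded
     * deeper inside a Word document does not change the answer.
     */
    static String detect(byte[] header, Set<String> topLevelEntries) {
        if (!isCfbf(header)) {
            return "application/octet-stream"; // placeholder: not a CFBF file
        }
        if (topLevelEntries.contains("WordDocument")) {
            return "application/msword";
        }
        if (topLevelEntries.contains("Workbook") || topLevelEntries.contains("Book")) {
            return "application/vnd.ms-excel";
        }
        return "application/octet-stream"; // placeholder: some other CFBF format
    }

    public static void main(String[] args) {
        // Entry names other than WordDocument/Workbook/Book are illustrative.
        Set<String> excelLike = new HashSet<>(Arrays.asList("Workbook", "SummaryInformation"));
        Set<String> wordLike = new HashSet<>(Arrays.asList("WordDocument", "SummaryInformation"));
        System.out.println(detect(CFBF_SIGNATURE, excelLike));
        System.out.println(detect(CFBF_SIGNATURE, wordLike));
    }
}
```

The signature test alone cannot separate the two types, which is consistent with both definitions matching the same file; the decision has to come from the top-level directory listing.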
Simon Tyler at "Thu, 18 Mar 2010 18:12:16 +0000" wrote:
ST> Hi,
ST> I haven't seen any responses to this. Does anyone know why I should be
ST> seeing such unpredictable behaviour?
ST> Simon
ST> On 15/03/2010 09:27, "Simon Tyler" <sty...@mimecast.net> wrote:
Hi,
I am doing some testing of Tika 0.6 and noticed some odd results for the
testEXCEL.xls file included in the test suite.
100 calls to the following code:
is = new BufferedInputStream(new FileInputStream(filename));
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
String type = tika.detect(is, metadata);
result in different matches, either application/msword or
application/vnd.ms-excel, seemingly at random.
Is this expected? Is there a way to mitigate it?
Simon
--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://alexott.net/
http://alexott-ru.blogspot.com/
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g