I'd been running a large web crawl in EC2, using a Hadoop job jar
from which I'd excluded all of the support jars used for Microsoft formats.
This dramatically reduced the size of the job jar that I needed to
constantly push to EC2 via a relatively slow DSL connection.
During the crawl, I ignored all responses that didn't have a mime-type
of text/plain or one of the three HTML mime-types.
But I ran into a problem: the Tika auto-detect code was correctly
identifying a file as being a Microsoft format, even though the server
said it was text/plain. The Tika Microsoft parser would then try to
dynamically figure out which support code to call, and in the end it
threw a NoSuchMethodError.
Note that this is an Error, not an Exception. As such, it flies on
past all of the Tika catch blocks, and my own code's catch blocks, and
kills the Hadoop job in weird and wonderful ways.
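To make the Error vs. Exception point concrete, here's a minimal sketch
in plain Java. It doesn't use any Tika code: parseWithMissingSupportJar()
is a hypothetical stand-in that just simulates the failure, and the class
and method names are made up for illustration. NoSuchMethodError is a
subclass of LinkageError (as is NoClassDefFoundError), so a catch of
Exception never sees it, while a catch of LinkageError does.

```java
public class ErrorVsExceptionDemo {

    // Hypothetical stand-in for a parser whose support jar is missing at
    // runtime; it only simulates the failure, it is not Tika code.
    public static void parseWithMissingSupportJar() {
        throw new NoSuchMethodError("simulated: support method not found");
    }

    // Typical defensive code: only Exception is caught, so the Error
    // propagates out of this method to the caller.
    public static String guardWithException() {
        try {
            parseWithMissingSupportJar();
            return "parsed";
        } catch (Exception e) {
            return "caught Exception";
        }
    }

    // Also catches LinkageError, the common superclass of
    // NoSuchMethodError and NoClassDefFoundError.
    public static String guardWithLinkageError() {
        try {
            parseWithMissingSupportJar();
            return "parsed";
        } catch (Exception e) {
            return "caught Exception";
        } catch (LinkageError e) {
            return "caught LinkageError: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(guardWithException());
        } catch (Error e) {
            // Reached because catch (Exception) did not match the Error.
            System.out.println("escaped catch (Exception): "
                    + e.getClass().getSimpleName());
        }
        System.out.println(guardWithLinkageError());
    }
}
```

Catching LinkageError rather than Throwable keeps things like
OutOfMemoryError out of the handler, though whether that's the right
trade-off for a crawler is a judgment call.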
It seems like Errors shouldn't be thrown for situations where dynamic
configuration could result in a class not existing, but before I
started writing up an issue I wanted to get input from the community
about this. It's a bit gray to me, since I essentially "did it to
myself" by excluding jars.
Thanks,
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g