I'd been running a large web crawl in EC2, using a Hadoop job jar where I'd excluded all of the support jars used for Microsoft formats. This dramatically reduced the size of the job jar that I needed to constantly push to EC2 via a relatively slow DSL connection.

During the crawl, I ignored all responses that didn't have a mime-type of text/plain or one of the three HTML mime-types.

But I ran into a problem: the Tika auto-detect code correctly identified a file as being a Microsoft format, even though the server said it was text/plain. The Tika Microsoft parser then tried to dynamically figure out which support code to call, and in the end it threw a NoSuchMethodError.

Note that this is an Error, not an Exception. As such, it flies on past all of the Tika catch blocks, and my own code's catch blocks, and kills the Hadoop job in weird and wonderful ways.
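To make the Error-versus-Exception point concrete, here is a minimal sketch of the kind of guard I could put around each parse call in my own code. It does not use any Tika API: SafeParse and parseSafely are hypothetical names, and the Callable just stands in for whatever does the parsing. It relies only on the fact that NoSuchMethodError and NoClassDefFoundError are both subclasses of java.lang.LinkageError, which is not an Exception.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Tika or Hadoop.
final class SafeParse {

    private SafeParse() {}

    /**
     * Runs the given parse step and turns failures into a status string
     * instead of letting them propagate up and kill the calling task.
     */
    static String parseSafely(Callable<String> parseStep) {
        try {
            return parseStep.call();
        } catch (Exception e) {
            // Ordinary parse failures land here.
            return "skipped (exception): " + e.getClass().getSimpleName();
        } catch (LinkageError e) {
            // NoSuchMethodError and NoClassDefFoundError land here, e.g. when
            // a support jar was left out of the job jar. A catch (Exception)
            // block alone would not see these.
            return "skipped (linkage error): " + e.getClass().getSimpleName();
        }
    }
}
```

Calling SafeParse.parseSafely(() -> { throw new NoSuchMethodError("missing"); }) returns a "skipped (linkage error)" string rather than throwing. I've deliberately caught LinkageError rather than Throwable, so that things like OutOfMemoryError still propagate.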

It seems like Errors shouldn't be thrown for situations where dynamic configuration could result in a class not existing, but before I started writing up an issue I wanted to get input from the community about this. It's a bit gray to me, since I essentially "did it to myself" by excluding jars.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



