Re: CHM Files and Tika
Hi Jan,

I opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454

Thanks!

Beyond the "can't retrieve parser" error: I've tried a couple of CHM files (among them the test files from Tika), but I wasn't able to get Tika to extract any content.

    % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
        tika-parsers/src/test/resources/test-documents/testChm2.chm

only extracts:

    <?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="Content-Length" content="10807437"/>
    <meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
    <meta name="resourceName" content="testChm2.chm"/>
    <title/>
    </head>
    <body/>
    </html>

A CHM viewer shows much more content. What's wrong?

Sebastian

On 08/10/2012 09:32 AM, Julien Nioche wrote:
> new JIRA?
>
> On 9 August 2012 23:30, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our build.xml?
>>
>> -Original message-
>>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>>> Sent: Thu 09-Aug-2012 23:18
>>> To: user@nutch.apache.org
>>> Subject: Re: CHM Files and Tika
>>>
>>> Hi Jan,
>>>
>>> confirmed: Nutch cannot parse CHM, while Tika (the same version used by Nutch) can.
>>> The CHM parsers are in tika-parser*.jar, which is contained in the Nutch package.
>>>
>>> Any ideas?
>>>
>>> Sebastian
>>>
>>> On 08/08/2012 12:03 PM, Jan Riewe wrote:
>>>> Hey there,
>>>>
>>>> I am trying to parse CHM (Microsoft Help) files with Nutch, but I get:
>>>>
>>>>     Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
>>>>
>>>> I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which should
>>>> be able to parse those files: https://issues.apache.org/jira/browse/TIKA-245
>>>>
>>>> In tika-mimetypes.xml I do find an entry related to application/vnd.ms-htmlhelp.
>>>>
>>>> Has anyone run into the same issue and know how to fix it?
>>>>
>>>> Bye,
>>>> Jan
Re: CHM Files and Tika
new JIRA?

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: CHM Files and Tika
Hi Jan,

confirmed: Nutch cannot parse CHM, while Tika (the same version used by Nutch) can. The CHM parsers are in tika-parser*.jar, which is contained in the Nutch package.

Any ideas?

Sebastian
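One way to double-check that a given jar really ships the CHM parser classes is to list its entries under the parser package. The following is only a minimal sketch using the JDK's own java.util.jar API, not anything taken from Nutch or Tika. The class name JarPackageCheck is made up for illustration, and the default prefix org/apache/tika/parser/chm/ is an assumption about where the CHM parser classes live; pass a different prefix as the second argument if your Tika version differs.

```java
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper: lists the entries of a jar that sit under a path prefix.
public class JarPackageCheck {

    // Return the entry names that start with the given path prefix, sorted.
    public static List<String> filterByPrefix(Collection<String> entryNames, String prefix) {
        return entryNames.stream()
                .filter(name -> name.startsWith(prefix))
                .sorted()
                .collect(Collectors.toList());
    }

    // Collect the names of all entries in the jar at the given path.
    public static List<String> entryNames(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream().map(JarEntry::getName).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarPackageCheck <jar> [path-prefix]");
            return;
        }
        // Assumed default package path for the CHM parser classes.
        String prefix = args.length > 1 ? args[1] : "org/apache/tika/parser/chm/";
        List<String> hits = filterByPrefix(entryNames(args[0]), prefix);
        hits.forEach(System.out::println);
        System.out.println(hits.size() + " entries under " + prefix);
    }
}
```

Run it against the tika-parser*.jar found in the Nutch package. A count of zero would suggest the classes are missing from that jar; a non-zero count only shows they are packaged, not that Nutch actually loads them.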
RE: CHM Files and Tika
Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our build.xml?
CHM Files and Tika
Hey there,

I am trying to parse CHM (Microsoft Help) files with Nutch, but I get:

    Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp

I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which should be able to parse those files: https://issues.apache.org/jira/browse/TIKA-245

In tika-mimetypes.xml I do find an entry related to application/vnd.ms-htmlhelp.

Has anyone run into the same issue and know how to fix it?

Bye,
Jan