Hey Sebastian,

as far as I can tell, the Tika parser is far from perfect, but I would expect the included test files to produce correct results.
There is an alternative library (http://sourceforge.net/projects/chm4j/), but I don't think this file type has enough potential users to justify switching to a different parser.

Jan

On Tuesday, 14.08.2012, at 22:28 +0200, Sebastian Nagel wrote:
> Hi Jan,
>
> opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
> Thanks!
>
> Beyond the "can't retrieve parser" error:
> I've tried a couple of chm files (among them the test files from Tika)
> but I wasn't able to get Tika to extract content.
>
> % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
>     tika-parsers/src/test/resources/test-documents/testChm2.chm
>
> only extracts:
>
> <?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="10807437"/>
> <meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
> <meta name="resourceName" content="testChm2.chm"/>
> <title/>
> </head>
> <body/></html>
>
> A CHM-viewer shows much more content. What's wrong?
>
> Sebastian
>
> On 08/10/2012 09:32 AM, Julien Nioche wrote:
> > new JIRA?
> >
> > On 9 August 2012 23:30, Markus Jelsma <[email protected]> wrote:
> >
> >> hmm, i'm not sure but maybe we don't include all Tika parser deps in our
> >> build.xml?
> >>
> >> -----Original message-----
> >>> From:Sebastian Nagel <[email protected]>
> >>> Sent: Thu 09-Aug-2012 23:18
> >>> To: [email protected]
> >>> Subject: Re: CHM Files and Tika
> >>>
> >>> Hi Jan,
> >>>
> >>> confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
> >>> can parse chm. The chm parsers are in tika-parser*.jar which is contained
> >>> in the Nutch package.
> >>>
> >>> Any ideas?
> >>>
> >>> Sebastian
> >>>
> >>> On 08/08/2012 12:03 PM, Jan Riewe wrote:
> >>>> Hey there,
> >>>>
> >>>> i try to parse CHM (Microsoft Help Files) with Nutch, but i get a:
> >>>>
> >>>> Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
> >>>>
> >>>> i've tried version 1.4 (tika 0.10) and 1.51 from nutch (tika 1.1) which
> >>>> should be able to parse those files
> >>>> https://issues.apache.org/jira/browse/TIKA-245
> >>>>
> >>>> In the tika-mimetypes.xml i do find a entry related to
> >>>> application/vnd.ms-htmlhelp
> >>>>
> >>>> Does anyone ever ran into the same issues and knows how to fix that?
> >>>>
> >>>> Bye
> >>>> Jan

