Hey Sebastian,

As far as I have found out, the Tika parser is far from perfect,
but I would expect the included test files to produce correct
results.
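
One way to make that expectation checkable is a small test over the
XHTML that tika-app prints. The following is only a minimal sketch,
assuming the output has been captured as a string; the class name
TikaBodyCheck and the method hasEmptyBody are hypothetical and not part
of Tika or Nutch. It uses only the JDK's XML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not a Tika or Nutch class): reports whether the
// XHTML produced by tika-app contains any text inside <body>.
class TikaBodyCheck {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static boolean hasEmptyBody(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Tika's output declares the XHTML namespace, so look elements up by namespace.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        Node body = doc.getElementsByTagNameNS(XHTML_NS, "body").item(0);
        // A missing body or one with only whitespace counts as "nothing extracted".
        return body == null || body.getTextContent().trim().isEmpty();
    }

    public static void main(String[] args) throws Exception {
        // Shortened form of the output quoted further down in this thread.
        String xhtml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title/></head><body/></html>";
        System.out.println("empty body: " + hasEmptyBody(xhtml));
    }
}
```

A check like this only says whether any text was extracted at all, not
whether the extracted text is correct.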

There is an alternative library (http://sourceforge.net/projects/chm4j/),
but I don't think there are enough potential users to justify switching
to a different parser for this file type.

Jan

On Tuesday, 14.08.2012, 22:28 +0200, Sebastian Nagel wrote:
> Hi Jan,
> 
> opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
> Thanks!
> 
> Beyond the "can't retrieve parser" error:
> I've tried a couple of chm files (among them the test files from Tika)
> but I wasn't able to get Tika to extract content.
> 
>  % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
>     tika-parsers/src/test/resources/test-documents/testChm2.chm
> 
> only extracts:
> 
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="10807437"/>
> <meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
> <meta name="resourceName" content="testChm2.chm"/>
> <title/>
> </head>
> <body/></html>
> 
> A CHM-viewer shows much more content. What's wrong?
> 
> Sebastian
> 
> On 08/10/2012 09:32 AM, Julien Nioche wrote:
> > new JIRA?
> > 
> > On 9 August 2012 23:30, Markus Jelsma <[email protected]> wrote:
> > 
> >> Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our
> >> build.xml?
> >>
> >>
> >>
> >> -----Original message-----
> >>> From:Sebastian Nagel <[email protected]>
> >>> Sent: Thu 09-Aug-2012 23:18
> >>> To: [email protected]
> >>> Subject: Re: CHM Files and Tika
> >>>
> >>> Hi Jan,
> >>>
> >>> confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
> >>> can parse chm. The chm parsers are in tika-parser*.jar which is contained
> >>> in the Nutch package.
> >>>
> >>> Any ideas?
> >>>
> >>> Sebastian
> >>>
> >>> On 08/08/2012 12:03 PM, Jan Riewe wrote:
> >>>> Hey there,
> >>>>
> >>>> I am trying to parse CHM (Microsoft Help) files with Nutch, but I get:
> >>>>
> >>>> Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
> >>>>
> >>>> I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which
> >>>> should be able to parse those files:
> >>>> https://issues.apache.org/jira/browse/TIKA-245
> >>>>
> >>>> In tika-mimetypes.xml I do find an entry for
> >>>> application/vnd.ms-htmlhelp.
> >>>>
> >>>> Has anyone run into the same issue and knows how to fix it?
> >>>>
> >>>> Bye
> >>>> Jan
> >>>>
> >>>
> >>>
> >>
> > 
> > 
> > 
> 
