Hi Jan,
I opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
Thanks!
Beyond the "can't retrieve parser" error, there is a second problem:
I've tried a couple of CHM files (among them the test files from Tika),
but I wasn't able to get Tika to extract any content.
% java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
tika-parsers/src/test/resources/test-documents/testChm2.chm
only extracts:
<?xml version="1.0" encoding="UTF-8"?><html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="10807437"/>
<meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
<meta name="resourceName" content="testChm2.chm"/>
<title/>
</head>
<body/></html>
A CHM-viewer shows much more content. What's wrong?
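
To check more than a couple of files, the "empty body" case can be detected mechanically. A minimal sketch in Python, assuming the XHTML printed by tika-app has been captured as a string; the function name body_is_empty is made up for illustration and is not part of Tika or Nutch:

```python
import xml.etree.ElementTree as ET

# Namespace of the XHTML document that tika-app prints (see the output above).
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def body_is_empty(xhtml: str) -> bool:
    """Return True if the <body> of the XHTML output contains no text."""
    # Encode first so the parser sees bytes matching the UTF-8 declaration.
    root = ET.fromstring(xhtml.encode("utf-8"))
    body = root.find(XHTML_NS + "body")
    if body is None:
        # No <body> element at all: treat it as "nothing extracted".
        return True
    # Join all text nodes below <body> and ignore pure whitespace.
    return "".join(body.itertext()).strip() == ""
```

For the output shown above this should report an empty body, since the document only contains metadata in <head> and a self-closing <body/>.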
Sebastian
On 08/10/2012 09:32 AM, Julien Nioche wrote:
> new JIRA?
>
> On 9 August 2012 23:30, Markus Jelsma <[email protected]> wrote:
>
>> Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our
>> build.xml?
>>
>>
>>
>> -----Original message-----
>>> From:Sebastian Nagel <[email protected]>
>>> Sent: Thu 09-Aug-2012 23:18
>>> To: [email protected]
>>> Subject: Re: CHM Files and Tika
>>>
>>> Hi Jan,
>>>
>>> confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
>>> can parse chm. The chm parsers are in tika-parser*.jar which is contained
>>> in the Nutch package.
>>>
>>> Any ideas?
>>>
>>> Sebastian
>>>
>>> On 08/08/2012 12:03 PM, Jan Riewe wrote:
>>>> Hey there,
>>>>
>>>> I'm trying to parse CHM (Microsoft Help) files with Nutch, but I get:
>>>>
>>>> Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
>>>>
>>>> I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), both of which
>>>> should be able to parse those files:
>>>> https://issues.apache.org/jira/browse/TIKA-245
>>>>
>>>> In tika-mimetypes.xml I do find an entry for
>>>> application/vnd.ms-htmlhelp
>>>>
>>>> Has anyone run into the same issue and knows how to fix it?
>>>>
>>>> Bye
>>>> Jan
>>>>
>>>
>>>
>>
>
>
>