Re: CHM Files and Tika

2012-08-14 Thread Sebastian Nagel
Hi Jan,

opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
Thanks!

Beyond the "Can't retrieve Tika parser" error:
I've tried a couple of CHM files (among them the test files from Tika),
but I wasn't able to get Tika to extract any content.

 % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
tika-parsers/src/test/resources/test-documents/testChm2.chm

only extracts:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="10807437"/>
<meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
<meta name="resourceName" content="testChm2.chm"/>
<title/>
</head>
<body/></html>

A CHM-viewer shows much more content. What's wrong?

Sebastian
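
For anyone who wants to check this kind of output automatically: a minimal sketch, assuming the XHTML printed by tika-app has been captured as bytes. It only reports whether the body element carries any text. The helper name body_text and the shortened sample are illustrative and not part of Tika.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML that Tika emits (taken from the output quoted above).
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Shortened sample modelled on the quoted output: metadata only, empty body.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>'
    b'<title/></head><body/></html>'
)


def body_text(xhtml: bytes) -> str:
    """Return the whitespace-stripped text inside <body>, or "" if there is none."""
    root = ET.fromstring(xhtml)
    body = root.find(XHTML_NS + "body")
    if body is None:
        return ""
    return "".join(body.itertext()).strip()


if __name__ == "__main__":
    text = body_text(SAMPLE)
    print("body has text" if text else "body is empty: only metadata was extracted")
```

An empty result here means the parser was found and ran but produced no document text, which is a different failure from the "Can't retrieve Tika parser" error.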


Re: CHM Files and Tika

2012-08-10 Thread Julien Nioche
new JIRA?


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: CHM Files and Tika

2012-08-09 Thread Sebastian Nagel
Hi Jan,

Confirmed: Nutch cannot parse CHM, while Tika (the same version used by Nutch)
can. The CHM parsers are in tika-parser*.jar, which is contained
in the Nutch package.

Any ideas?

Sebastian
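
One way to double-check that a given jar really ships the CHM parser classes: a minimal sketch that lists the class entries under the CHM parser package. The package path org/apache/tika/parser/chm/ is an assumption about Tika's layout, and chm_classes is a hypothetical helper name; pass the jar path(s) from your own Nutch installation.

```python
import sys
import zipfile

# Assumed package path of Tika's CHM parser classes inside the parsers jar.
CHM_PACKAGE = "org/apache/tika/parser/chm/"


def chm_classes(jar_path):
    """List the .class entries under the CHM parser package in the given jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.startswith(CHM_PACKAGE) and name.endswith(".class")
        )


if __name__ == "__main__":
    # Usage: python check_chm.py path/to/some.jar [more.jar ...]
    for path in sys.argv[1:]:
        found = chm_classes(path)
        print(f"{path}: {len(found)} CHM parser class(es)")
```

If the classes are present but Nutch still cannot retrieve a parser, the problem is more likely in how the parser is looked up or configured than in the jar contents.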


RE: CHM Files and Tika

2012-08-09 Thread Markus Jelsma
Hmm, I'm not sure, but maybe we don't include all Tika parser dependencies in
our build.xml?

 
 


CHM Files and Tika

2012-08-08 Thread Jan Riewe
Hey there,

I'm trying to parse CHM (Microsoft Help) files with Nutch, but I get:

Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp

I've tried Nutch 1.4 (Tika 0.10) and 1.5.1 (Tika 1.1), which
should be able to parse those files:
https://issues.apache.org/jira/browse/TIKA-245

In tika-mimetypes.xml I do find an entry for
application/vnd.ms-htmlhelp

Has anyone run into the same issue and knows how to fix it?

Bye
Jan
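
To rule out a misdetected file before looking at the parser setup, the file signature can be checked directly: CHM files start with the four ASCII bytes "ITSF". The following is a minimal sketch with a hypothetical helper name (looks_like_chm); it does not reproduce how Tika or Nutch perform mime-type detection.

```python
# CHM (Microsoft HTML Help) files begin with the ASCII signature "ITSF".
CHM_MAGIC = b"ITSF"


def looks_like_chm(path):
    """Return True if the file at `path` starts with the CHM signature."""
    with open(path, "rb") as f:
        return f.read(len(CHM_MAGIC)) == CHM_MAGIC


if __name__ == "__main__":
    import sys

    # Usage: python is_chm.py file1.chm [file2 ...]
    for name in sys.argv[1:]:
        print(f"{name}: {'CHM signature found' if looks_like_chm(name) else 'no CHM signature'}")
```

A file that passes this check but still triggers the error above is being detected correctly, so the missing piece would be on the parser side.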