CHM Files and Tika

2012-08-08 Thread Jan Riewe
Hey there,

I'm trying to parse CHM (Microsoft Help) files with Nutch, but I get:

Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp

I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which
should be able to parse those files:
https://issues.apache.org/jira/browse/TIKA-245

In tika-mimetypes.xml I do find an entry for
application/vnd.ms-htmlhelp.

Has anyone run into the same issue and knows how to fix it?

Bye
Jan


Recrawl a single page explicitly

2012-04-02 Thread Jan Riewe
Hi there,

So far I have not found a way to recrawl a specific page manually.
Is there a way to manually set the recrawl interval or the crawl
date, or any other explicit way to make Nutch invalidate a page?

We have 70k+ pages in the index, and a full recrawl would take too
long.

Thanks 
Jan


Pages that do not dedup

2012-03-26 Thread Jan Riewe
Hey there,

I'm currently trying to debug the dedup results from Nutch. There is a page
that is exactly the same (I compared the HTML with a diff tool) as one on a
different domain, but dedup does not delete this entry.

Is this caused by the different domain? If so, is there a way to
configure that?

Thanks in advance
Jan
--