CHM Files and Tika

2012-08-08 Thread Jan Riewe
Hey there, I am trying to parse CHM (Microsoft Help) files with Nutch, but I get the error: Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp. I have tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which should be able to parse those files.

Recrawl a single page explicitly

2012-04-02 Thread Jan Riewe
Hi there, so far I have not found a way to crawl a specific page manually. Is it possible to set the recrawl interval or the crawl date manually, or is there any other explicit way to make Nutch invalidate a page? We have 70k+ pages in the index, and a full recrawl would take too long. Thanks, Jan

Pages that do not dedup

2012-03-26 Thread Jan Riewe
Hey there, I am currently trying to debug the dedup results from Nutch. There is a page which is exactly the same (I compared the HTML with a diff tool) as one on a different domain, but dedup does not delete this entry. Is this caused by the different domain? If so, is it possible to configure that?
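
A byte-level digest comparison is a stricter check than a visual diff, since a single differing whitespace character or a change of encoding yields a different hash. The following is a minimal sketch, assuming the dedup step compares an MD5 digest of the fetched content (an assumption; the thread does not say which signature is configured). The function names and the sample page bodies are hypothetical.

```python
import hashlib


def content_digest(raw: bytes) -> str:
    """Return the hex MD5 digest of the raw page bytes."""
    return hashlib.md5(raw).hexdigest()


def same_digest(page_a: bytes, page_b: bytes) -> bool:
    """Return True when both pages hash to the same MD5 digest."""
    return content_digest(page_a) == content_digest(page_b)


# Hypothetical sample bodies standing in for the two fetched pages.
page_one = b"<html><body>Hello</body></html>"
page_two = b"<html><body>Hello</body></html>"
page_three = b"<html><body>Hello </body></html>"  # one extra space

print(same_digest(page_one, page_two))
print(same_digest(page_one, page_three))
```

If the two saved pages produce different digests here, the difference lies in the bytes themselves rather than in the domain.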