CHM Files and Tika
Hey there, I am trying to parse CHM (Microsoft Help) files with Nutch, but I get: Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp. I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), both of which should be able to parse those files: https://issues.apache.org/jira/browse/TIKA-245 In tika-mimetypes.xml I do find an entry for application/vnd.ms-htmlhelp. Has anyone run into the same issue and knows how to fix it? Bye Jan
Recrawl a single page explicitly
Hi there, so far I have not found a way to crawl a specific page manually. Is it possible to manually set the recrawl interval or the crawl date, or is there any other explicit way to make Nutch invalidate a page? We have 70k+ pages in the index, and a full recrawl would take too long. Thanks Jan
Pages that do not dedup
Hey there, I am currently trying to debug the dedup results from Nutch. There is a page that is exactly the same (I compared the HTML with a diff tool) as one on a different domain, but dedup does not delete this entry. Is this caused by the different domain? If so, is it possible to configure that? Thanks in advance Jan