Hey there,
I am trying to parse CHM (Microsoft Help) files with Nutch, but I get this error:
Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
I have tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which
should be able to parse those files.
Hi there,
So far I have not found a way to recrawl a specific page manually.
Is there a way to manually set the recrawl interval or the crawl
date, or any other explicit way to make Nutch invalidate a page?
We have 70k+ pages in the index, and a full recrawl would take too
long.
Thanks
Jan
Hey there,
I am currently trying to debug the dedup results from Nutch. There is a page
which is exactly the same (I compared the HTML with a diff tool) as one on a
different domain, but dedup does not delete this entry.
Is this caused by the different domain? If so, is there a way to
configure that?