In UpdateDatabaseTool, the function pageGone( ... ) sets pages that have
remained unreachable for a certain number of retries to never be fetched. Is
there a compelling reason to keep such pages around? It seems like the right
thing to do in this case would be to just remove the page from the WebDB.
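For illustration, here is a minimal sketch of the two options being weighed
(keep-and-mark versus delete); the SimplePage class, the in-memory map, and
the retry threshold are invented stand-ins, not the actual Nutch WebDB API:

import java.util.HashMap;
import java.util.Map;

public class GonePageSweep {
    static final int MAX_RETRIES = 3; // assumed retry threshold

    // Invented stand-in for a WebDB page entry.
    static class SimplePage {
        String url;
        int retries;
        boolean neverFetch;
        SimplePage(String url, int retries) { this.url = url; this.retries = retries; }
    }

    public static void main(String[] args) {
        Map<String, SimplePage> db = new HashMap<>();
        db.put("http://example.com/dead", new SimplePage("http://example.com/dead", 5));
        db.put("http://example.com/live", new SimplePage("http://example.com/live", 0));

        // Current behavior: keep the page but mark it never-to-be-fetched.
        for (SimplePage p : db.values()) {
            if (p.retries >= MAX_RETRIES) p.neverFetch = true;
        }

        // Proposed alternative: drop such pages from the db entirely.
        db.values().removeIf(p -> p.retries >= MAX_RETRIES);

        System.out.println("pages remaining: " + db.keySet());
    }
}
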
Hi,
In the process of a moderately sized crawl I was running, I hit a page
that sent Nutch into an infinite fetch cycle. The page that I hit
contained relative links to itself with the syntax /page.shtml. So
once the initial page was fetched, each newly generated fetchlist
contained the same URL.
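One common way to break this kind of cycle is to resolve each outlink
against the page's base URL and skip anything already seen. A minimal
sketch using only the standard library follows; the visited set and the
hard-coded outlinks are made up for the example:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class DedupLinks {
    public static void main(String[] args) throws MalformedURLException {
        Set<String> visited = new HashSet<>();
        URL base = new URL("http://example.com/page.shtml");
        visited.add(base.toExternalForm()); // the page we just fetched

        // A relative self-link like "/page.shtml" resolves back to the base.
        String[] outlinks = {"/page.shtml", "/other.html"};
        for (String link : outlinks) {
            URL resolved = new URL(base, link); // resolve relative to base
            String normalized = resolved.toExternalForm();
            // Only queue URLs we have not already seen; the self-link is dropped.
            if (visited.add(normalized)) {
                System.out.println("queue for fetch: " + normalized);
            } else {
                System.out.println("skip (already seen): " + normalized);
            }
        }
    }
}
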
Hello everyone,
I currently have Nutch set up doing a whole-web style crawl. When I
need to index a new page or reindex an existing page immediately, I
start a process that waits until the webdb is not being used by the
normal crawl process, then locks the webdb using the existence of a file as
a lock.
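A rough sketch of that kind of file-existence lock, using only the
standard library (the lock-file path is a hypothetical choice; any path
both processes agree on would do):

import java.io.File;
import java.io.IOException;

public class WebDbLock {
    // Hypothetical lock-file path shared by both processes.
    static final File LOCK = new File("/tmp/webdb.lock");

    public static void main(String[] args) throws IOException, InterruptedException {
        // createNewFile() is atomic: it returns true only for the process
        // that actually created the file, so at most one holder wins.
        while (!LOCK.createNewFile()) {
            Thread.sleep(1000); // webdb busy; wait and retry
        }
        try {
            System.out.println("webdb locked; safe to reindex");
            // ... run the immediate (re)index here ...
        } finally {
            LOCK.delete(); // release so the normal crawl can resume
        }
    }
}
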
Hello Kamil,
Do you want to generate a fetchlist with URLs that are present in the WebDB
but have not been fetched so far?
I am not sure what you are trying to achieve, but you can generate any
fetchlist you want using the latest tool by Andrzej Bialecki.
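As a rough illustration of that kind of selective fetchlist, the sketch
below filters an in-memory page list down to never-fetched entries; the
PageEntry class and its fetchCount field are invented for the example and
are not the actual tool's API:

import java.util.ArrayList;
import java.util.List;

public class UnfetchedFetchlist {
    // Invented stand-in for a WebDB page entry.
    static class PageEntry {
        String url;
        int fetchCount; // how many times the page has been fetched
        PageEntry(String url, int fetchCount) { this.url = url; this.fetchCount = fetchCount; }
    }

    public static void main(String[] args) {
        List<PageEntry> webdb = new ArrayList<>();
        webdb.add(new PageEntry("http://example.com/a", 2));
        webdb.add(new PageEntry("http://example.com/b", 0));

        // Keep only entries that have never been fetched.
        List<String> fetchlist = new ArrayList<>();
        for (PageEntry p : webdb) {
            if (p.fetchCount == 0) fetchlist.add(p.url);
        }
        System.out.println("fetchlist: " + fetchlist);
    }
}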