why are unfetchable sites kept in webdb?

2005-09-12 Thread Kamil Wnuk
In UpdateDatabaseTool, the function pageGone( ... ) sets pages that have remained unreachable for a certain number of retries to never be fetched. Is there a compelling reason to keep such pages around? It seems like the right thing to do in this case would be to just remove the page from the webdb.
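
A minimal sketch of the two options being weighed here, keeping the page but marking it never-to-fetch versus deleting it outright. The Page and WebDB types and the retry limit below are hypothetical stand-ins for illustration, not Nutch's actual UpdateDatabaseTool API:

    import java.io.IOException;

    public class PageGoneSketch {

        // Hypothetical stand-ins for the webdb's page record and writer.
        interface Page {
            int getRetries();
            void setRetries(int n);
            void setNextFetchTime(long millis);
            String getUrl();
        }

        interface WebDB {
            void deletePage(String url) throws IOException;
        }

        static final int MAX_RETRIES = 3; // assumed retry limit

        // Behaviour as described in the question: keep the page, but push its
        // next fetch time out so it is never scheduled again.
        static void markNeverFetch(Page page) {
            page.setNextFetchTime(Long.MAX_VALUE);
        }

        // Alternative the poster suggests: once the page has exhausted its
        // retries, drop it from the webdb so it no longer takes up space.
        static void removeIfGone(WebDB db, Page page) throws IOException {
            if (page.getRetries() >= MAX_RETRIES) {
                db.deletePage(page.getUrl());
            } else {
                page.setRetries(page.getRetries() + 1);
            }
        }
    }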

ran into a site that sends a crawl into an infinite loop

2005-08-30 Thread Kamil Wnuk
Hi, In the process of a moderately sized crawl I was running, I hit a page that sent Nutch into an infinite fetch cycle. The page that I hit contained relative links to itself with the syntax /page.shtml. So once the initial page was fetched, each newly generated fetchlist contained the same url.
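
A minimal sketch of the kind of guard that breaks this cycle: resolve the relative link (e.g. /page.shtml) against the page's own URL and skip anything already fetched. This is illustrative only, not Nutch's actual fetchlist-generation code:

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    public class SelfLinkGuard {

        private final Set<String> fetched = new HashSet<String>();

        public void markFetched(URL url) {
            fetched.add(url.toExternalForm());
        }

        // Returns true if the outlink should go into the next fetchlist.
        public boolean shouldFetch(URL base, String relativeLink)
                throws MalformedURLException {
            URL resolved = new URL(base, relativeLink);
            return !fetched.contains(resolved.toExternalForm());
        }

        public static void main(String[] args) throws MalformedURLException {
            SelfLinkGuard guard = new SelfLinkGuard();
            URL page = new URL("http://example.com/page.shtml");
            guard.markFetched(page);
            // "/page.shtml" resolves back to the page itself, so it is skipped.
            System.out.println(guard.shouldFetch(page, "/page.shtml"));  // false
            System.out.println(guard.shouldFetch(page, "/other.shtml")); // true
        }
    }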

crawler: priority domain reindexing and sitemaps

2005-08-09 Thread Kamil Wnuk
Hello Everyone, I currently have nutch set up doing a whole-web style crawl. When I need to index a new page or reindex an existing page immediately, I start a process that waits until the webdb is not being used by the normal crawl process, locks the webdb using the existence of a file as a lock.
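
A minimal sketch of the lock-file coordination described above: the priority process spins until it can create the lock file, does its work, then deletes the lock. The file name and polling interval are assumptions, not part of any Nutch tool:

    import java.io.File;
    import java.io.IOException;

    public class WebDbLock {

        private final File lockFile;

        public WebDbLock(File lockFile) {
            this.lockFile = lockFile;
        }

        // Blocks until the lock file can be created; createNewFile() is
        // atomic, so only one process holds the lock at a time.
        public void acquire() throws IOException, InterruptedException {
            while (!lockFile.createNewFile()) {
                Thread.sleep(1000); // webdb in use by the normal crawl; wait
            }
        }

        public void release() {
            lockFile.delete();
        }

        public static void main(String[] args) throws Exception {
            WebDbLock lock = new WebDbLock(new File("webdb.lock"));
            lock.acquire();
            try {
                // ... run the priority fetch/index of the new or changed page ...
            } finally {
                lock.release();
            }
        }
    }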

Re: prioritizing newly injected urls for fetching

2005-07-29 Thread Kamil Wnuk
Hello Kamil, Do you want to generate a fetchlist with urls that are present in the WebDB but were not fetched until now? I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki.
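
A minimal sketch of the idea behind such a custom fetchlist: walk the pages in the webdb and keep only those that have never been fetched. The PageRecord type and the "lastFetchTime == 0 means never fetched" convention are assumptions for illustration; this is not the API of the tool mentioned above:

    import java.util.ArrayList;
    import java.util.List;

    public class UnfetchedFetchlist {

        // Hypothetical stand-in for a page entry in the webdb.
        static class PageRecord {
            String url;
            long lastFetchTime; // 0 means never fetched (assumed convention)

            PageRecord(String url, long lastFetchTime) {
                this.url = url;
                this.lastFetchTime = lastFetchTime;
            }
        }

        // Select the urls that are present in the webdb but not yet fetched.
        static List<String> unfetchedUrls(List<PageRecord> pages) {
            List<String> out = new ArrayList<String>();
            for (PageRecord p : pages) {
                if (p.lastFetchTime == 0) {
                    out.add(p.url);
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<PageRecord> db = new ArrayList<PageRecord>();
            db.add(new PageRecord("http://example.com/a", 0L));
            db.add(new PageRecord("http://example.com/b", 1122508800000L));
            System.out.println(unfetchedUrls(db)); // [http://example.com/a]
        }
    }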