why are unfetchable sites kept in webdb?

2005-09-12 Thread Kamil Wnuk
In UpdateDatabaseTool, the function pageGone( ... ) sets pages that have remained unreachable for a certain number of retries to never be fetched. Is there a compelling reason to keep such pages around? It seems like the right thing to do in this case would be to just remove the page from the webdb.
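
A minimal sketch of the two options being weighed here, keeping the page but marking it never-to-fetch versus deleting it outright. The Page and WebDB types and the retry limit below are hypothetical stand-ins for illustration, not Nutch's actual UpdateDatabaseTool API:

    import java.io.IOException;

    public class PageGoneSketch {

        // Hypothetical stand-ins for the webdb's page record and writer.
        interface Page {
            int getRetries();
            void setRetries(int n);
            void setNextFetchTime(long millis);
            String getUrl();
        }

        interface WebDB {
            void deletePage(String url) throws IOException;
        }

        static final int MAX_RETRIES = 3; // assumed retry limit

        // Behaviour as described in the question: keep the page, but push its
        // next fetch time out so it is never scheduled again.
        static void markNeverFetch(Page page) {
            page.setNextFetchTime(Long.MAX_VALUE);
        }

        // Alternative the poster suggests: once the page has exhausted its
        // retries, drop it from the webdb so it no longer takes up space.
        static void removeIfGone(WebDB db, Page page) throws IOException {
            if (page.getRetries() >= MAX_RETRIES) {
                db.deletePage(page.getUrl());
            } else {
                page.setRetries(page.getRetries() + 1);
            }
        }
    }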

ran into a site that sends a crawl into an infinite loop

2005-08-30 Thread Kamil Wnuk
Hi, In the process of a moderately sized crawl I was running, I hit a page that sent Nutch into an infinite fetch cycle. The page that I hit contained relative links to itself with the syntax /page.shtml. So once the initial page was fetched, each newly generated fetchlist contained the same url.
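
A minimal sketch of the kind of guard that breaks this cycle: resolve the relative link (e.g. /page.shtml) against the page's own URL and skip anything already fetched. This is illustrative only, not Nutch's actual fetchlist-generation code:

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    public class SelfLinkGuard {

        private final Set<String> fetched = new HashSet<String>();

        public void markFetched(URL url) {
            fetched.add(url.toExternalForm());
        }

        // Returns true if the outlink should go into the next fetchlist.
        public boolean shouldFetch(URL base, String relativeLink)
                throws MalformedURLException {
            URL resolved = new URL(base, relativeLink);
            return !fetched.contains(resolved.toExternalForm());
        }

        public static void main(String[] args) throws MalformedURLException {
            SelfLinkGuard guard = new SelfLinkGuard();
            URL page = new URL("http://example.com/page.shtml");
            guard.markFetched(page);
            // "/page.shtml" resolves back to the page itself, so it is skipped.
            System.out.println(guard.shouldFetch(page, "/page.shtml"));  // false
            System.out.println(guard.shouldFetch(page, "/other.shtml")); // true
        }
    }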

crawler: priority domain reindexing and sitemaps

2005-08-09 Thread Kamil Wnuk
Hello Everyone, I currently have nutch set up doing a whole-web style crawl. When I need to index a new page or reindex an existing page immediately, I start a process that waits until the webdb is not being used by the normal crawl process, locks the webdb using the existence of a file as a lock.
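
A minimal sketch of the lock-file coordination described above: the priority process spins until it can create the lock file, does its work, then deletes the lock. The file name and polling interval are assumptions, not part of any Nutch tool:

    import java.io.File;
    import java.io.IOException;

    public class WebDbLock {

        private final File lockFile;

        public WebDbLock(File lockFile) {
            this.lockFile = lockFile;
        }

        // Blocks until the lock file can be created; createNewFile() is
        // atomic, so only one process holds the lock at a time.
        public void acquire() throws IOException, InterruptedException {
            while (!lockFile.createNewFile()) {
                Thread.sleep(1000); // webdb in use by the normal crawl; wait
            }
        }

        public void release() {
            lockFile.delete();
        }

        public static void main(String[] args) throws Exception {
            WebDbLock lock = new WebDbLock(new File("webdb.lock"));
            lock.acquire();
            try {
                // ... run the priority fetch/index of the new or changed page ...
            } finally {
                lock.release();
            }
        }
    }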

Re: prioritizing newly injected urls for fetching

2005-07-29 Thread Kamil Wnuk
Hello Kamil, Do you want to generate a fetchlist with urls that are present in the WebDB but were not fetched until now? I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki.
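
A minimal sketch of the idea behind such a custom fetchlist: walk the pages in the webdb and keep only those that have never been fetched. The PageRecord type and the "lastFetchTime == 0 means never fetched" convention are assumptions for illustration; this is not the API of the tool mentioned above:

    import java.util.ArrayList;
    import java.util.List;

    public class UnfetchedFetchlist {

        // Hypothetical stand-in for a page entry in the webdb.
        static class PageRecord {
            String url;
            long lastFetchTime; // 0 means never fetched (assumed convention)

            PageRecord(String url, long lastFetchTime) {
                this.url = url;
                this.lastFetchTime = lastFetchTime;
            }
        }

        // Select the urls that are present in the webdb but not yet fetched.
        static List<String> unfetchedUrls(List<PageRecord> pages) {
            List<String> out = new ArrayList<String>();
            for (PageRecord p : pages) {
                if (p.lastFetchTime == 0) {
                    out.add(p.url);
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<PageRecord> db = new ArrayList<PageRecord>();
            db.add(new PageRecord("http://example.com/a", 0L));
            db.add(new PageRecord("http://example.com/b", 1122508800000L));
            System.out.println(unfetchedUrls(db)); // [http://example.com/a]
        }
    }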