Hi Marek,

As we're talking about automating the task, we're immediately looking at implementing a bash script. In the situation you have described, we wish Nutch to adopt breadth-first search (BFS) behaviour when crawling. Between us, can we suggest any methods for best practice relating to BFS?
As you have highlighted, we can check the crawldb after every updatedb command to determine whether there are any URLs with status db_unfetched, and ideally we wish to continue until this number reaches zero, which we can check when we either dump the stats or read them via stdout. I would suggest that we discuss a method for obtaining the db_unfetched value and creating a loop that terminates when it is 0. Is this possible?

On Wed, Jul 20, 2011 at 2:05 PM, Marek Bachmann <[email protected]> wrote:
> Hi all,
>
> has anyone a suggestion how I could solve the following task:
>
> I want to crawl a sub-domain of our network completely. I always did it by
> multiple fetch / parse / update cycles manually. After a few cycles I
> checked if there were unfetched pages in the crawldb. If so, I started the
> cycle over again. I repeated that until no new pages were discovered.
> But that is annoying me, and that is why I am looking for a way to do these
> steps automatically until no unfetched pages are left.
>
> Any ideas?

--
*Lewis*
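PS: a minimal sketch of the loop we discussed, assuming a Nutch 1.x layout (crawldb at crawl/crawldb, segments under crawl/segments; both paths are assumptions you would adjust). It parses the `status 1 (db_unfetched)` line from `bin/nutch readdb -stats` and repeats the generate/fetch/parse/updatedb cycle until that count is 0:

```shell
#!/usr/bin/env bash
# Sketch: repeat fetch cycles until the crawldb reports zero unfetched URLs.
# CRAWLDB and SEGMENTS_DIR are assumed paths -- adjust for your installation.

CRAWLDB=crawl/crawldb
SEGMENTS_DIR=crawl/segments

# Extract the db_unfetched count from `readdb -stats` output on stdin.
# The stats listing contains a line such as:
#   status 1 (db_unfetched):        1234
# Prints 0 if no such line is present (i.e. nothing left to fetch).
unfetched_count_from() {
  awk '/db_unfetched/ { print $NF; found=1 } END { if (!found) print 0 }'
}

# Only run the crawl loop if the Nutch launcher is actually present,
# so the functions above can be sourced and tested on their own.
if [ -x bin/nutch ]; then
  while true; do
    bin/nutch generate "$CRAWLDB" "$SEGMENTS_DIR"
    # Newest segment is the one generate just created (lexicographic sort
    # works because segment names are timestamps).
    SEGMENT=$(ls -d "$SEGMENTS_DIR"/* | sort | tail -1)

    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb "$CRAWLDB" "$SEGMENT"

    COUNT=$(bin/nutch readdb "$CRAWLDB" -stats | unfetched_count_from)
    echo "db_unfetched: $COUNT"
    [ "$COUNT" -eq 0 ] && break
  done
fi
```

One design note: parsing the stats output via stdout keeps this to plain bash; the alternative of dumping the crawldb and grepping the dump would work too, but is far slower on a large db.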

