Thanks folks. Will try to do one of these... Could I also pause crawling for a while, then index the whole crawl up to the point where it was paused (moving the indexes out to different locations), and then continue crawling from where it was paused?
Just a simple pause - resume kind of thing

On Thu, Feb 10, 2011 at 10:11 PM, Alexander Aristov <[email protected]> wrote:

> Hi
>
> You may put the separate crawling phases into separate scripts, something like:
>
> inject.sh
> crawl.sh
> indexing.sh
>
> And configure these scripts to start at a certain time using any scheduling
> tool; for example, I find it very easy to use the Linux cron scheduler.
>
> You can configure the crawl to work between, say, 12.00-13.00. A crawl
> keeps working until it has no unfetched resources left in the queue or the
> max fetch limit is reached, and it takes as much time as it needs.
>
> Best Regards
> Alexander Aristov
>
>
> On 9 February 2011 04:17, .: Abhishek :. <[email protected]> wrote:
>
> > Hi all,
> >
> > I am just trying to figure out if there is some way I can set Nutch crawls
> > to run in a time interval, say crawl from 12:00 AM to 12:00 PM, and then
> > start the further processing (indexing and so on) after that.
> >
> > I think the Nutch job is tied to Hadoop's JobConf. I am not sure how this
> > could be done. Rather, if I am to use an external shell script for doing
> > this, how do I chain the crawl process and trigger further processing
> > after the crawl?
> >
> > Thanks,
> > Abi
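To make the advice above concrete, here is a minimal sketch of a `crawl.sh` that runs one generate/fetch/update cycle, plus crontab entries that keep crawling and indexing in separate time windows. This assumes a Nutch 1.x install driven via the `bin/nutch` command; the paths (`/opt/nutch`, `/data/crawl`) and the schedule are illustrative, not required by Nutch.

```shell
#!/bin/sh
# crawl.sh -- one generate/fetch/update cycle (sketch; paths are examples).
NUTCH_HOME=/opt/nutch      # assumed Nutch 1.x install location
CRAWL_DIR=/data/crawl      # assumed crawl data directory

# Generate a new fetch list from the crawl db into a fresh segment.
$NUTCH_HOME/bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments

# Pick the segment that was just created (segments are named by timestamp,
# so the newest one sorts last).
SEGMENT=$(ls -d $CRAWL_DIR/segments/* | tail -1)

# Fetch the segment, then fold the results back into the crawl db.
$NUTCH_HOME/bin/nutch fetch $SEGMENT
$NUTCH_HOME/bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT

# Illustrative crontab entries: crawl starting at midnight, indexing at noon.
# Install with `crontab -e`:
#   0 0  * * * /opt/nutch/scripts/crawl.sh
#   0 12 * * * /opt/nutch/scripts/indexing.sh
```

Because each phase is its own script, "pausing" is just a matter of not scheduling further crawl cycles: the crawldb keeps its state on disk, so `indexing.sh` can index (and move) whatever has been fetched so far, and the next scheduled `crawl.sh` run resumes from where the crawl left off.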

