Hi folks,

 I am planning to:

   1. Use the Quartz scheduler to crawl and fetch (in a single job) for a day
   or two, then pause it.
   2. Copy the crawldb and segments folders to a separate temp folder.
   3. Do link inversion and indexing on this temp folder.
   4. Then resume step 1.

 Does this work fine? Has anyone done this before?
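
The four steps could be sketched in a shell script like the one below. The directory names, the segment timestamp, the Solr URL, and the exact bin/nutch sub-command forms are assumptions for a Nutch 1.x-era layout, not a tested recipe; NUTCH defaults to echo so the sketch only prints the Nutch commands until you point it at a real install.

```shell
#!/bin/sh
set -e

# Demo layout so the sketch runs standalone; in a real install these
# dirs are produced by the scheduled crawl/fetch job (step 1).
mkdir -p crawl/crawldb crawl/segments/20110210120000

# NUTCH=echo just prints the commands; set NUTCH=bin/nutch for real use.
NUTCH=${NUTCH:-echo}
SNAP=crawl_snapshot

# Step 2: copy crawldb and segments to a separate temp folder while the
# Quartz-scheduled fetch job is paused.
rm -rf "$SNAP"
mkdir -p "$SNAP"
cp -r crawl/crawldb crawl/segments "$SNAP/"

# Step 3: link inversion and indexing run against the snapshot only, so
# the live crawl dir is untouched and fetching can resume (step 4).
$NUTCH invertlinks "$SNAP/linkdb" -dir "$SNAP/segments"
$NUTCH solrindex http://localhost:8983/solr "$SNAP/crawldb" \
    "$SNAP/linkdb" "$SNAP"/segments/*
```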

Cheers,
Abi


On Thu, Feb 10, 2011 at 10:29 PM, Markus Jelsma
<[email protected]> wrote:

> In Nutch 1.x you cannot abort and resume the fetch process.
>
> On Thursday 10 February 2011 15:27:05 .: Abishek :. wrote:
> > Thanks folks. Will try to do one of these...
> >
> > Could I also pause crawling for a while, index everything crawled up to
> > the point it was paused (moving the indexes out to a different
> > location), and then continue crawling from where it was paused?
> >
> >  Just a simple pause/resume kind of thing.
> >
> > On Thu, Feb 10, 2011 at 10:11 PM, Alexander Aristov
> > <[email protected]> wrote:
> > > Hi
> > >
> > > You may put the separate crawling phases into separate scripts, something like
> > >
> > > inject.sh
> > > crawl.sh
> > > indexing.sh
> > >
> > > and configure these scripts to start at a certain time using any
> > > scheduling tool; for example, I find the Linux cron scheduler very easy
> > > to use.
> > >
> > > You can configure the crawl to work between, say, 12.00 and 13.00.
> > > Note, though, that a crawl keeps working while it has unfetched
> > > resources in its queue or until the max fetch limit is reached, so it
> > > takes as much time as it needs.
> > >
> > > Best Regards
> > > Alexander Aristov
> > >
> > > On 9 February 2011 04:17, .: Abhishek :. <[email protected]> wrote:
> > > > Hi all,
> > > >
> > > >  I am just trying to figure out if there is some way I can set Nutch
> > > > crawls to run in a time interval, say crawl from 12:00 AM to 12:00 PM,
> > > > and then start the further processing (indexing and whatever else
> > > > follows the crawl) after that.
> > > >
> > > >  I think a Nutch job is tied to Hadoop's JobConf, so I am not sure how
> > > > this could be done. Alternatively, if I were to use an external shell
> > > > script for this, how do I chain the crawl process and trigger the
> > > > further processing after the crawl?
> > > >
> > > > Thanks,
> > > > Abi
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
