Thanks Markus. For recrawling, I see there are options that do not use
bin/nutch crawl, such as the "Can it recrawl?" section in
http://wiki.apache.org/nutch/Crawl

However, would there be a difference? Is there anything else to keep in
mind if we end up setting up a cron job on Linux to crawl every day, each
day triggering something like

 bin/nutch crawl urls -dir arndme -depth 4 -topN 3

So the cron job would call this command again and again.
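For what it's worth, a daily cron setup along those lines might be sketched as below. The paths, schedule, and script name are assumptions for illustration, not from this thread; only the bin/nutch crawl invocation is taken from the message above.

```shell
#!/bin/sh
# recrawl.sh -- hypothetical wrapper so cron runs with a known environment.
# Invoked from a crontab entry such as (every day at 02:00, path assumed):
#   0 2 * * * /opt/nutch/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1

NUTCH_HOME=/opt/nutch          # assumed install location
cd "$NUTCH_HOME" || exit 1

# Re-running crawl against the same -dir reuses the existing crawldb,
# so pages that are due for refetch are recrawled on each run.
bin/nutch crawl urls -dir arndme -depth 4 -topN 3
```

Running the crawl through a small wrapper script rather than putting the command directly in the crontab line keeps the environment (working directory, PATH, JAVA_HOME) under your control, which is a common source of cron-only failures.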

Regards | Vikas



On Tue, May 15, 2012 at 6:07 PM, Markus Jelsma
<[email protected]> wrote:

> On Tuesday 15 May 2012 17:39:31 Vikas Hazrati wrote:
> > So once the crawl (which abstracts iterative crawls until the depth is
> > reached) is finished, is there a way to trigger a recrawl as part of
> > some command-line option so that Nutch continues to run as a daemon, or
> > is a shell script the way out?
>
> Shell scripting is the way to go. Nutch will automatically recrawl pages
> that are due to be refetched.
>
> >
> > Regards | Vikas
> >
> > On Fri, May 11, 2012 at 8:26 PM, Lewis John Mcgibbney
> > <[email protected]> wrote:
> > > If you would like I could add you to the moderators group and you can
> > > word it how you wish.
> > >
> > > Please sign up to Jira, give me your Jira username on this page, and I
> > > will happily add you to the group.
> > >
> > > On the other-hand, if you don't wish to do this, then please reply
> > > here with your suggestion and I'll make sure something gets changed to
> > > accommodate your suggestions.
> > >
> > > Thanks
> > >
> > > > On Fri, May 11, 2012 at 2:52 PM, Matthias Paul
> > > > <[email protected]> wrote:
> > > > I was confused by this tutorial:
> > > > http://wiki.apache.org/nutch/NutchTutorial
> > >
> > > > Reading this page one might get to the conclusion that the crawl tool
> > > > can't do iterative crawling, because under "3.2 Using Individual
> > > > Commands for Whole-Web Crawling" there's the sentence "This also
> > > > permits ... incremental crawling", as if the crawl command described
> > > > before (3.1 Using the Crawl Command) couldn't do that.
> > > >
> > > > Could someone perhaps improve this part of the tutorial?
> > > >
> > > > Matthias
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, May 10, 2012 at 8:39 PM, Markus Jelsma
> > > >
> > > > <[email protected]> wrote:
> > > >> By default each crawl is iterative. The crawl command is nothing more
> > > >> than a wrapper around the individual crawl cycle commands. The depth
> > > >> parameter is nothing more than executing a single crawl cycle multiple
> > > >> times. This is, if I am not mistaken, also true for older releases,
> > > >> certainly 1.2 and above.
> > >
> > > >> On Thu, 10 May 2012 19:31:27 +0100, Lewis John Mcgibbney
> > > >> <[email protected]> wrote:
> > > >>> For the record, there is a patch pending review for Nutchgora which
> > > >>> will sort part of this for you as well.
> > > >>>
> > > >>> https://issues.apache.org/jira/browse/NUTCH-1301
> > > >>>
> > > >>> Susam Pal also contributed a patch for Nutchgora regarding
> > > >>> incremental indexing, but I can't find it just now, sorry.
> > > >>>
> > > >>> Lewis
> > > >>>
> > > >>>
> > > >>> On Thu, May 10, 2012 at 5:18 PM, Matthias Paul
> > > >>>
> > > >>> <[email protected]> wrote:
> > > >>>> Hi all,
> > > >>>>
> > > >>>> can the crawl-command also be used for iterative crawls?
> > > >>>> In older Nutch versions this was not possible, but in 1.5 it seems
> > > >>>> to work?
> > >
> > > >>>> Thanks
> > > >>>> Matthias
> > > >>
> > > >> --
> > > >> Markus Jelsma - CTO - Openindex
> > >
> > > --
> > > Lewis
> --
> Markus Jelsma - CTO - Openindex
>
>
