Re: Index while crawling

Gabriele Kahlout Tue, 22 Mar 2011 10:02:21 -0700

Okay, and what about the loop's terminating condition? If I'm crawling an
unlimited domain (the web), then depth is probably a good option, as
described on the wiki <http://wiki.apache.org/nutch/IntranetRecrawl> and on
so.com<http://stackoverflow.com/questions/2537874/nutch-how-to-crawl-by-small-patches>.
But if the domain is limited, i.e. we could finish crawling all of it and
just want to do so incrementally, then depth is no longer relevant.


Essentially I envision a while(true) loop that breaks when generate returns
no new URLs (Q: how can I detect this in the script?). But generate
doesn't seem to report this:

bin/nutch generate crawl/crawldb crawl/segments -topN 200
Generator: starting at 2011-03-22 17:58:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 200
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110322175835
Generator: finished at 2011-03-22 17:58:40, elapsed: 00:00:15
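
One way such a loop could be sketched is to compare the newest segment directory before and after each generate call and stop when nothing new appears. This rests on an assumption I have not verified for every Nutch version: that generate creates no segment directory when no URL is due for fetch. The names NUTCH, CRAWLDB, SEGMENTS, TOPN, latest_segment and crawl_until_done are illustrative only, and the index step is left as a comment because the command depends on the setup.

```shell
#!/bin/sh
# Sketch only: variable and function names are illustrative, not part of Nutch.
NUTCH=${NUTCH:-bin/nutch}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
SEGMENTS=${SEGMENTS:-crawl/segments}
TOPN=${TOPN:-200}

# Print the newest segment name, or nothing if there are no segments yet.
# Segment names are timestamps (e.g. 20110322175835), so lexical order
# is also chronological order.
latest_segment() {
    ls "$SEGMENTS" 2>/dev/null | sort | tail -n 1
}

# Repeat generate/fetch/parse/updatedb until generate adds no segment.
crawl_until_done() {
    while true; do
        before=$(latest_segment)
        "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" || break
        after=$(latest_segment)
        # Assumption: an unchanged listing means nothing was due for fetch.
        if [ "$after" = "$before" ]; then
            break
        fi
        "$NUTCH" fetch "$SEGMENTS/$after" || return 1
        "$NUTCH" parse "$SEGMENTS/$after" || return 1
        "$NUTCH" updatedb "$CRAWLDB" "$SEGMENTS/$after" || return 1
        # Run the index step for this segment here so that its pages
        # become searchable before the next iteration starts.
    done
}
```

After sourcing the file, calling crawl_until_done runs the cycle. If the installed version of generate signals "nothing selected" through its exit status, the `|| break` after generate already ends the loop in that case as well.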



On Tue, Mar 22, 2011 at 5:27 PM, Markus Jelsma <[email protected]> wrote:

> Use -topN N. You can also limitByHost via configuration.
>
> On Tuesday 22 March 2011 17:20:33 Gabriele Kahlout wrote:
> > On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma <[email protected]> wrote:
> > > On Tuesday 22 March 2011 14:14:06 Gabriele Kahlout wrote:
> > > > > Yes, you need to wait. You must finish the fetch, then parse the
> > > > > fetch and update the crawldb (and optionally the linkdb). Finally
> > > > > you must index and only then are your documents searchable.
> > > >
> > > > I can see injecting fewer URLs at a time, i.e. I complete an
> > > > inject-fetch-index cycle and then restart it with new URLs.
> > >
> > > You don't need to inject every cycle. Inject once then repeat the
> > > following
> >
> > Yes, but how do I limit the # urls fetched at each cycle?
> > Are we talking about -maxNumSegments?
> > $ bin/nutch generate
> > Usage: Generator <crawldb> <segments_dir> [-force] [-topN N]
> > [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm]
> > [*-maxNumSegments num*]
> >
> > > cycle:
> > > - fetch
> >
> > - parse
> >
> > > - update linkdb and crawldb
> > > - index
> > >
> > > > Q1: After the 1st iteration can I start searching, while the 2nd
> > > > iteration is in progress?
> > >
> > > Yes. Once you indexed the data you can start the 2nd iteration and
> > > search.
> > >
> > > > Q2: during the fetch of the 2nd iteration, what prevents fetch from
> > > > fetching again what was fetched in the 1st iteration (assuming it's
> > > > still before db.fetch.interval.default)?
> > >
> > > Well, if fetch_time + interval > NOW then it won't get fetched.
> > >
> > > > I'm not sure if fetching fewer segments and indexing them, and then
> > > > fetching more (i.e. iterating only fetch-index), is a better option,
> > > > such that after the 1st iteration I can start searching.
> > > >
> > > >
> > > > Thank you.
> > > >
> > > > > > >but remember that results don't come available for searching
> > > > > > >immediately after
> > > > > >
> > > > > > *fetching*. *All* pages must be fetched and then *indexed* first
> > > > > > to be searchable.
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
