On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma
<[email protected]> wrote:

>
>
> On Tuesday 22 March 2011 14:14:06 Gabriele Kahlout wrote:
> > >
> > > Yes, you need to wait. You must finish the fetch, then parse the fetch
> > > and update the crawldb (and optionally the linkdb). Finally you must
> > > index and only then are your documents searchable.
> > >
> >
> > I can see injecting fewer URLs at a time, i.e. I complete an
> > inject-fetch-index cycle and then restart it with new URLs.
>
> You don't need to inject every cycle. Inject once then repeat the following
>
Yes, but how do I limit the number of URLs fetched in each cycle?
Are we talking about -maxNumSegments?
$ bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers
numFetchers] [-adddays numDays] [-noFilter] [-noNorm][*-maxNumSegments num*]


> cycle:
> - fetch
> - parse
> - update linkdb and crawldb
> - index
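
To check that I understand the cycle, here is a rough sketch of one round as
a shell function. It assumes -topN (from the usage above) is the option that
caps how many URLs go into each generated segment, and that the index lives
in Solr. The crawl/ paths, the value 1000 and the Solr URL are placeholders of
mine, not taken from a real setup.

```shell
#!/bin/sh
# Sketch of one generate/fetch/parse/update/index round.
# Assumed (hypothetical) values: directory layout, TOPN and SOLR_URL.
NUTCH=${NUTCH:-bin/nutch}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
LINKDB=${LINKDB:-crawl/linkdb}
SEGMENTS=${SEGMENTS:-crawl/segments}
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr}
TOPN=${TOPN:-1000}

crawl_cycle() {
    # Pick at most TOPN due URLs from the crawldb and write a new segment.
    $NUTCH generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" || return 1
    # Segment directories are named by timestamp, so the newest sorts last.
    segment=$(ls -d "$SEGMENTS"/* | tail -n 1)
    $NUTCH fetch "$segment" || return 1
    $NUTCH parse "$segment" || return 1
    # Record the fetch results in the crawldb, then update the linkdb.
    $NUTCH updatedb "$CRAWLDB" "$segment" || return 1
    $NUTCH invertlinks "$LINKDB" "$segment" || return 1
    # Only after this step are the documents of this round searchable.
    $NUTCH solrindex "$SOLR_URL" "$CRAWLDB" "$LINKDB" "$segment" || return 1
}
```

I would run `bin/nutch inject crawl/crawldb urls` once, then call crawl_cycle
repeatedly and search after each round completes.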
>
> > Q1: After the 1st iteration can I start searching, while the 2nd iteration
> > is in progress?
>
> Yes. Once you have indexed the data you can start the 2nd iteration and search.
>
> > Q2: during the fetch of the 2nd iteration, what prevents fetch from
> > fetching again what was fetched in the 1st iteration (assuming it's still
> > before db.fetch.interval.default)?
>
> Well, if fetch_time + interval > NOW then it won't get fetched.
>
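
My understanding, which may be off: a URL is skipped while fetch_time +
interval is still in the future, and becomes due again once that moment has
passed. As a sketch of that check only (this is not Nutch's actual code, and
the function name and argument order are mine):

```shell
# is_due FETCH_TIME INTERVAL NOW   (all in seconds)
# Exit status 0 means the interval has elapsed and the page is due again;
# non-zero means it is still fresh and would be skipped.
is_due() {
    [ $(( $1 + $2 )) -le "$3" ]
}
```

For example, `is_due "$last_fetch" "$interval" "$(date +%s)" && echo refetch`,
where last_fetch and interval are hypothetical variables holding the last
fetch time and db.fetch.interval.default in seconds.
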
> >
> >
> > I'm not sure whether fetching fewer segments, indexing them, and then
> > fetching more (i.e. iterating only fetch-index) is a better option, so that
> > after the 1st iteration I can start searching.
> >
> >
> > Thank you.
> >
> > > > >but remember that results don't become available for searching
> > > > >immediately after
> > > >
> > > > *fetching*. *All* pages must be fetched and then *indexed* first to
> > > > be searchable.
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
>



-- 
Regards,
K. Gabriele

