On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma <[email protected]> wrote:
> On Tuesday 22 March 2011 14:14:06 Gabriele Kahlout wrote:
> > > Yes, you need to wait. You must finish the fetch, then parse the fetch
> > > and update the crawldb (and optionally the linkdb). Finally you must
> > > index and only then are your documents searchable.
> >
> > I can see injecting fewer urls at a time. I.e. I complete an
> > inject-fetch-index cycle and then re-start it with new urls.
>
> You don't need to inject every cycle. Inject once then repeat the following
>

Yes, but how do I limit the # of urls fetched at each cycle? Are we talking
about -maxNumSegments?

$ bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm][*-maxNumSegments num*]

> cycle:
> - fetch
> - parse
> - update linkdb and crawldb
> - index
>
> > Q1: After the 1st iteration can I start searching, while the 2nd iteration
> > is in progress?
>
> Yes. Once you indexed the data you can start the 2nd iteration and search.
>
> > Q2: during the fetch of the 2nd iteration, what prevents fetch from
> > fetching again what was fetched in the 1st iteration (assuming it's still
> > before db.fetch.interval.default)?
>
> Well, if fetch_time + interval > NOW then it won't get fetched.
>
> > I'm not sure if fetching fewer segments and indexing them, and then fetching
> > more (i.e. iterate only fetch-index) is a better option, such that after the
> > 1st iteration I can start searching.
> >
> > Thank you.
> >
> > > > > but remember that results don't become available for searching
> > > > > immediately after
> > > >
> > > > *fetching*. *all* pages must be fetched and then *indexed* first to
> > > > be searchable.
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

--
Regards,
K.
Gabriele
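
For readers following the thread, here is a rough sketch of the two ideas discussed: a page fetched in an earlier iteration is skipped while its last fetch time plus the re-fetch interval is still in the future, and the number of URLs per cycle is capped by keeping only the N best-scoring due entries (which is what the -topN option in the usage output is for, as far as I know). This is a simplified model in Python, not Nutch's actual code. `CrawlEntry`, `is_due` and `generate` are hypothetical names, and the 30-day interval and example URLs are only illustrative values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class CrawlEntry:
    """Hypothetical stand-in for one crawldb record (not a Nutch class)."""
    url: str
    last_fetch: Optional[datetime]  # None means the URL was never fetched
    score: float


def is_due(entry: CrawlEntry, interval: timedelta, now: datetime) -> bool:
    """A URL is due if it was never fetched or its interval has elapsed."""
    if entry.last_fetch is None:
        return True
    return entry.last_fetch + interval <= now


def generate(entries: List[CrawlEntry], top_n: int,
             interval: timedelta, now: datetime) -> List[CrawlEntry]:
    """Pick at most top_n due URLs, highest score first."""
    due = [e for e in entries if is_due(e, interval, now)]
    due.sort(key=lambda e: e.score, reverse=True)
    return due[:top_n]


now = datetime(2011, 3, 22, 14, 28)
interval = timedelta(days=30)  # illustrative re-fetch interval
crawldb = [
    CrawlEntry("http://example.com/a", now - timedelta(days=1), 2.0),   # fetched in iteration 1
    CrawlEntry("http://example.com/b", None, 1.0),                      # never fetched
    CrawlEntry("http://example.com/c", None, 3.0),                      # never fetched
    CrawlEntry("http://example.com/d", now - timedelta(days=45), 0.5),  # interval elapsed
]

batch = generate(crawldb, top_n=2, interval=interval, now=now)
print([e.url for e in batch])
# prints ['http://example.com/c', 'http://example.com/b']
```

Page "a" is left out because it was fetched one day ago, and "d" is due but falls outside the top 2 by score, so it would be picked up in a later cycle.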

