Okay, and what about the loop's terminating condition? If I'm crawling an unlimited domain (the web), then the depth is probably a good option, as described on the wiki <http://wiki.apache.org/nutch/IntranetRecrawl> and on Stack Overflow <http://stackoverflow.com/questions/2537874/nutch-how-to-crawl-by-small-patches>. But if the domain is limited, i.e. we could finish crawling all of it and just want to do so incrementally, then depth is no longer relevant.
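To make the two cases concrete, here is a rough sketch of the kind of loop I have in mind. It stops after a fixed number of rounds (the depth), or, with a depth of 0, keeps going until a generate step adds no new segment directory. The stop check is only a guess at a workaround: I'm assuming generate creates no segment directory when nothing is due for fetch, which I haven't verified. The names NUTCH, count_segments, crawl_rounds and CYCLES are made up for this sketch, and linkdb update and indexing are left out.

```shell
#!/bin/sh
# Sketch only: NUTCH, count_segments, crawl_rounds and CYCLES are
# hypothetical names, not part of Nutch.
NUTCH=${NUTCH:-bin/nutch}

# Print the number of entries in a segments directory (0 if it is missing).
count_segments() {
  if [ -d "$1" ]; then
    ls -1 "$1" | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# crawl_rounds CRAWLDB SEGMENTS_DIR TOPN MAX_ROUNDS
# MAX_ROUNDS=0 means no depth limit: loop until generate adds no segment.
# The number of completed rounds is left in the variable CYCLES.
crawl_rounds() {
  crawldb=$1
  segments=$2
  topn=$3
  max_rounds=$4
  CYCLES=0
  while true; do
    # Depth-style stop for the unlimited-domain case.
    if [ "$max_rounds" -gt 0 ] && [ "$CYCLES" -ge "$max_rounds" ]; then
      break
    fi
    before=$(count_segments "$segments")
    $NUTCH generate "$crawldb" "$segments" -topN "$topn"
    after=$(count_segments "$segments")
    # ASSUMPTION: no new segment directory means nothing was selected.
    if [ "$after" -le "$before" ]; then
      break
    fi
    # Segment names are timestamps, so the last one in sort order is newest.
    segment=$segments/$(ls -1 "$segments" | sort | tail -n 1)
    $NUTCH fetch "$segment"
    $NUTCH parse "$segment"
    $NUTCH updatedb "$crawldb" "$segment"
    CYCLES=$((CYCLES + 1))
  done
}
```

I would call it as `crawl_rounds crawl/crawldb crawl/segments 200 0` for the limited-domain case, or pass a positive last argument to cap the depth.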
Essentially I envision a while(true) that breaks out of the loop when generate returns no new URLs (Q: how can I know this in the script?). But generate doesn't seem to report this:

$ bin/nutch generate crawl/crawldb crawl/segments -topN 200
Generator: starting at 2011-03-22 17:58:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 200
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110322175835
Generator: finished at 2011-03-22 17:58:40, elapsed: 00:00:15

On Tue, Mar 22, 2011 at 5:27 PM, Markus Jelsma <[email protected]> wrote:

> Use -topN N. You can also limitByHost via configuration.
>
> On Tuesday 22 March 2011 17:20:33 Gabriele Kahlout wrote:
> > On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma <[email protected]> wrote:
> > > On Tuesday 22 March 2011 14:14:06 Gabriele Kahlout wrote:
> > > > > Yes, you need to wait. You must finish the fetch, then parse the
> > > > > fetch and update the crawldb (and optionally the linkdb). Finally
> > > > > you must index and only then are your documents searchable.
> > > >
> > > > I can see injecting fewer urls at a time. I.e. I complete a
> > > > inject-fetch-index cycle and then re-start it with new urls.
> > >
> > > You don't need to inject every cycle. Inject once then repeat the
> > > following
> >
> > Yes, but how do I limit the # urls fetched at each cycle?
> > Are we talking about -maxNumSegments?
> > $ bin/nutch generate
> > Usage: Generator <crawldb> <segments_dir> [-force] [-topN N]
> > [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm]
> > [*-maxNumSegments num*]
> >
> > > cycle:
> > > - fetch
> >
> > - parse
> >
> > > - update linkdb and crawldb
> > > - index
> > >
> > > > Q1: After the 1st iteration can I start searching, while the 2nd
> > > > iteration is in progress?
> > >
> > > Yes. Once you indexed the data you can start the 2nd iteration and
> > > search.
> > >
> > > > Q2: during the fetch of the 2nd iteration, what prevents fetch from
> > > > fetching again what was fetched in the 1st iteration (assuming it's
> > > > still before db.fetch.interval.default)?
> > >
> > > Well, if fetch_time + interval < NOW then it won't get fetched.
> > >
> > > > I'm not sure if fetching fewer segments and index them, and then
> > > > fetch more (i.e. iterate only fetch-index) is a better option, such
> > > > that after the 1st iteration I can start searching.
> > > >
> > > > Thank you.
> > > >
> > > > > > but remember that results don't come available for searching
> > > > > > immediately after *fetching*. *all* pages must be fetched and
> > > > > > then *indexed* first to be searchable.
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).

