Use -topN N. You can also limitByHost via configuration.

On Tuesday 22 March 2011 17:20:33 Gabriele Kahlout wrote:
> On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma <[email protected]> wrote:
> > On Tuesday 22 March 2011 14:14:06 Gabriele Kahlout wrote:
> > > > Yes, you need to wait. You must finish the fetch, then parse the
> > > > fetch and update the crawldb (and optionally the linkdb). Finally
> > > > you must index and only then are your documents searchable.
> > >
> > > I can see injecting fewer urls at a time. I.e. I complete an
> > > inject-fetch-index cycle and then re-start it with new urls.
> >
> > You don't need to inject every cycle. Inject once then repeat the
> > following
>
> Yes, but how do I limit the # urls fetched at each cycle?
> Are we talking about -maxNumSegments?
>
> $ bin/nutch generate
> Usage: Generator <crawldb> <segments_dir> [-force] [-topN N]
> [-numFetchers numFetchers] [-adddays numDays] [-noFilter]
> [-noNorm][*-maxNumSegments num*]
>
> > cycle:
> > - fetch
> > - parse
> > - update linkdb and crawldb
> > - index
> >
> > > Q1: After the 1st iteration can I start searching, while the 2nd
> > > iteration is in progress?
> >
> > Yes. Once you have indexed the data you can start the 2nd iteration
> > and search.
> >
> > > Q2: during the fetch of the 2nd iteration, what prevents fetch from
> > > fetching again what was fetched in the 1st iteration (assuming it's
> > > still before db.fetch.interval.default)?
> >
> > Well, if fetch_time + interval > NOW then it won't get fetched.
> >
> > > I'm not sure if fetching fewer segments and indexing them, and then
> > > fetching more (i.e. iterating only fetch-index) is a better option,
> > > such that after the 1st iteration I can start searching.
> > >
> > > Thank you.
> > >
> > > > > but remember that results don't become available for searching
> > > > > immediately after *fetching*. *All* pages must be fetched and
> > > > > then *indexed* first to be searchable.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
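
The cycle discussed in this thread (inject once, then repeat generate, fetch, parse, update and index, with -topN capping how many URLs each round selects) could be scripted roughly as below. This is only a sketch under stated assumptions: the crawl/ paths, the TOPN value, the Solr URL and the helper names run and crawl_cycle are made up for illustration, the segment name is a placeholder, and the solrindex step assumes a Solr-backed setup, which may not match every Nutch version. By default it only prints the commands (DRY_RUN=1) instead of running them.

```shell
#!/bin/sh
# Sketch of one generate/fetch/parse/update/index round, capped with -topN.
# All paths, TOPN and SOLR_URL are illustrative assumptions, not thread content.
NUTCH=${NUTCH:-bin/nutch}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
LINKDB=${LINKDB:-crawl/linkdb}
SEGMENTS=${SEGMENTS:-crawl/segments}
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr}
TOPN=${TOPN:-1000}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

# Segment directories are timestamp-named, so the last one sorted is the newest.
newest_segment() {
  ls -d "$SEGMENTS"/* 2>/dev/null | sort | tail -n 1
}

crawl_cycle() {
  # Select at most TOPN URLs that are due for fetching into a new segment.
  run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" || return 1
  if [ "$DRY_RUN" = 1 ]; then
    segment="$SEGMENTS/NEW_SEGMENT"   # placeholder: no segment exists in dry-run
  else
    segment=$(newest_segment)
    [ -n "$segment" ] || return 1
  fi
  run "$NUTCH" fetch "$segment" || return 1
  run "$NUTCH" parse "$segment" || return 1
  run "$NUTCH" updatedb "$CRAWLDB" "$segment" || return 1
  run "$NUTCH" invertlinks "$LINKDB" "$segment" || return 1
  # Assumed indexing step; documents become searchable only after this.
  run "$NUTCH" solrindex "$SOLR_URL" "$CRAWLDB" "$LINKDB" "$segment" || return 1
}

crawl_cycle
```

Calling crawl_cycle again starts the next round; searching can go on against whatever the previous round indexed. To execute for real, run with DRY_RUN=0 after a one-time bin/nutch inject.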

