Use -topN N to cap the number of URLs selected by each generate run. You
can also limit the number of URLs per host via configuration.
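
As a rough sketch of one such cycle (not a tested script): the crawl/
paths, the TOPN value and the run_cycle name are assumptions made up for
illustration; generate, fetch, parse, updatedb and invertlinks are the
usual bin/nutch commands. The indexing step is left as a comment because
it depends on which index backend you use.

```shell
#!/bin/sh
# Sketch of one generate-fetch-parse-update cycle, capped with -topN.
# All paths and the TOPN value are illustrative assumptions.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
LINKDB="${LINKDB:-crawl/linkdb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
TOPN="${TOPN:-1000}"

run_cycle() {
  # Select at most TOPN URLs that are due for fetching into a new segment.
  $NUTCH generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" || return 1

  # Segment directories are named by timestamp, so the last one is newest.
  segment="$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -n 1)"

  $NUTCH fetch "$segment" || return 1
  $NUTCH parse "$segment" || return 1
  $NUTCH updatedb "$CRAWLDB" "$segment" || return 1
  $NUTCH invertlinks "$LINKDB" "$segment" || return 1

  # Index the segment here (the command depends on your index backend);
  # only after that step are the new documents searchable.
}
```

Inject once, then call run_cycle as often as needed; each run selects at
most TOPN URLs.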

On Tuesday 22 March 2011 17:20:33 Gabriele Kahlout wrote:
> On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma
> <[email protected]> wrote:
> > On Tuesday 22 March 2011 14:14:06 Gabriele Kahlout wrote:
> > > > Yes, you need to wait. You must finish the fetch, then parse the
> > > > fetch and update the crawldb (and optionally the linkdb). Finally
> > > > you must index and only then are your documents searchable.
> > > 
> > > I can see injecting fewer URLs at a time, i.e. I complete an
> > > inject-fetch-index cycle and then restart it with new URLs.
> > 
> > You don't need to inject every cycle. Inject once then repeat the
> > following
> 
> Yes, but how do I limit the # urls fetched at each cycle?
> Are we talking about -maxNumSegments?
> $ bin/nutch generate
> Usage: Generator <crawldb> <segments_dir> [-force] [-topN N]
>   [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm]
>   [*-maxNumSegments num*]
> 
> > cycle:
> > - fetch
> 
> - parse
> 
> > - update linkdb and crawldb
> > - index
> > 
> > > Q1: After the 1st iteration can I start searching, while the 2nd
> > > iteration is in progress?
> > 
> > Yes. Once you have indexed the data you can start the 2nd iteration
> > and search.
> > 
> > > Q2: during the fetch of the 2nd iteration, what prevents fetch from
> > > fetching again what was fetched in the 1st iteration (assuming it's
> > > still before db.fetch.interval.default)?
> > 
> > Well, if fetch_time + interval > NOW then it won't get fetched.
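
A URL is only selected again once its last fetch time plus the fetch
interval has passed; until then generate skips it. A minimal sketch of
that comparison in shell arithmetic, with times in epoch seconds. is_due
is a hypothetical helper, not a Nutch command, and the real scheduling
code handles more cases than this.

```shell
# is_due FETCH_TIME INTERVAL NOW  (all in seconds)
# Succeeds when the URL is due for refetching, i.e. fetch_time + interval
# is not in the future; fails while the URL is still fresh.
is_due() {
  [ $(( $1 + $2 )) -le "$3" ]
}

# Example with a 30-day interval (2592000 seconds, an assumed value):
# a page fetched one hour ago is still fresh.
now=$(date +%s)
if is_due $(( now - 3600 )) 2592000 "$now"; then
  echo "due"
else
  echo "not due yet"
fi
```
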
> > 
> > > I'm not sure whether fetching fewer segments, indexing them, and
> > > then fetching more (i.e. iterating only fetch-index) is a better
> > > option, so that I can start searching after the 1st iteration.
> > > 
> > > 
> > > Thank you.
> > > 
> > > > > > > but remember that results don't become available for
> > > > > > > searching immediately after *fetching*. *All* pages must be
> > > > > > > fetched and then *indexed* first to be searchable.
> > > > 
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> > 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
