Re: Index while crawling

Markus Jelsma Tue, 22 Mar 2011 06:28:02 -0700


On Tuesday 22 March 2011 14:14:06 Gabriele Kahlout wrote:
> > 
> > Yes, you need to wait. You must finish the fetch, then parse the fetch
> > and update the crawldb (and optionally the linkdb). Finally you must
> > index and only then are your documents searchable.
> > 
> 
> I can see injecting fewer URLs at a time, i.e. I complete an
> inject-fetch-index cycle and then restart it with new URLs.


You don't need to inject every cycle. Inject once then repeat the following 
cycle:
- fetch
- parse
- update linkdb and crawldb
- index
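
The cycle above can be sketched as an ordered list of bin/nutch calls. This
is only an illustration: the subcommand names are recalled from the Nutch 1.x
command-line tool and should be checked against the usage output of your own
version, the "crawl" directory layout and the segment name are hypothetical,
and the segment is assumed to have been created beforehand (in Nutch 1.x that
is done by a separate generate step). Nothing is executed here; the function
only builds the argument lists.

```python
from typing import List


def cycle_commands(segment: str, crawl_dir: str = "crawl") -> List[List[str]]:
    """Return the bin/nutch invocations for one cycle, in order.

    Assumptions (not from the original mail): the crawl_dir layout and the
    exact subcommand arguments, which may differ between Nutch versions.
    """
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    indexes = f"{crawl_dir}/indexes"
    return [
        ["bin/nutch", "fetch", segment],                # fetch
        ["bin/nutch", "parse", segment],                # parse
        ["bin/nutch", "updatedb", crawldb, segment],    # update crawldb
        ["bin/nutch", "invertlinks", linkdb, segment],  # update linkdb
        ["bin/nutch", "index", indexes, crawldb, linkdb, segment],  # index
    ]


if __name__ == "__main__":
    # Hypothetical segment path, printed for review rather than run.
    for cmd in cycle_commands("crawl/segments/20110322142800"):
        print(" ".join(cmd))
```

Each list could be handed to subprocess.run(cmd, check=True) so that a
failing step stops the cycle before the index step runs.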

> Q1: After the 1st iteration can I start searching, while the 2nd iteration
> is in progress?

Yes. Once you have indexed the data from the 1st iteration, you can start the
2nd iteration and search while it runs.

> Q2: during the fetch of the 2nd iteration, what prevents fetch from
> fetching again what was fetched in the 1st iteration (assuming it's still
> before db.fetch.interval.default)?

Well, if fetch_time + interval > NOW (i.e. the interval has not yet elapsed)
then it won't get fetched again.
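
As a rough model of that rule: a page is skipped while fetch_time + interval
still lies in the future, and becomes eligible once that moment has passed.
The sketch below is not Nutch's actual scheduling code; the function name
is_due is hypothetical, and the 30-day interval is assumed to match the usual
db.fetch.interval.default of 2592000 seconds.

```python
from datetime import datetime, timedelta


def is_due(fetch_time: datetime, interval: timedelta, now: datetime) -> bool:
    """Return True once fetch_time + interval has been reached."""
    return fetch_time + interval <= now


if __name__ == "__main__":
    interval = timedelta(seconds=2592000)  # assumed default: 30 days
    fetched = datetime(2011, 3, 22, 14, 0)

    # One day later the interval has not elapsed, so the page is skipped.
    print(is_due(fetched, interval, datetime(2011, 3, 23, 14, 0)))

    # A month later the interval has elapsed, so the page is due again.
    print(is_due(fetched, interval, datetime(2011, 4, 22, 14, 0)))
```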

> 
> 
> I'm not sure if fetching fewer segments and indexing them, and then fetching
> more (i.e. iterating only fetch-index), is a better option, such that after
> the 1st iteration I can start searching.
> 
> 
> Thank you.
> 
> > > >but remember that results don't come available for searching
> > > >immediately after
> > > 
> > > *fetching*. *all* pages must be fetched and then *indexed* first to
> > > be searchable.
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
