RE: nutch solrindex doesn't index all the documents

Juan Felix Tue, 02 Nov 2010 12:18:40 -0700

Yes.

I have two scripts. The first is a recrawl script that runs the following
steps, in this order:


inject
generate
fetch
parse
updatedb
mergesegs
invertlinks
index
dedup
merge
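
For reference, the steps above could be wired together roughly as follows.
This is only a sketch, not my actual script: every directory name is a
hypothetical placeholder, the argument order is written from memory of the
Nutch 1.x usage messages (check each command's own usage output), and feeding
the merged segments into invertlinks and index is an assumption. By default
it only prints the commands.

```shell
#!/bin/sh
# Sketch of the recrawl steps. All paths are hypothetical placeholders,
# and there is no error handling.
NUTCH=${NUTCH:-bin/nutch}
URLS=${URLS:-urls}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
LINKDB=${LINKDB:-crawl/linkdb}
SEGMENTS=${SEGMENTS:-crawl/segments}
MERGED=${MERGED:-crawl/merged_segments}
INDEXES=${INDEXES:-crawl/indexes}
INDEX=${INDEX:-crawl/index}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, anything else = run them

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$NUTCH $*"
  else
    "$NUTCH" "$@"
  fi
}

newest_segment() {
  # Segment directories are named by timestamp, so the last one sorts newest.
  ls -d "$SEGMENTS"/* 2>/dev/null | tail -n 1
}

recrawl() {
  run inject "$CRAWLDB" "$URLS"
  run generate "$CRAWLDB" "$SEGMENTS"
  segment=$(newest_segment)
  segment=${segment:-$SEGMENTS/NEWEST}   # placeholder when no segment exists yet
  run fetch "$segment"
  run parse "$segment"
  run updatedb "$CRAWLDB" "$segment"
  run mergesegs "$MERGED" -dir "$SEGMENTS"
  run invertlinks "$LINKDB" -dir "$MERGED"
  run index "$INDEXES" "$CRAWLDB" "$LINKDB" "$MERGED"/*
  run dedup "$INDEXES"
  run merge "$INDEX" "$INDEXES"
}

recrawl
```

Running it with DRY_RUN=0 would execute the commands instead of printing them.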

The second one just calls the solrindex command:

bin/nutch solrindex mySolrUrl myDB myLink mySegments

So I'm indexing twice: the first script builds the Lucene index, and the
second script indexes into Solr.
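
The gap between the two counts from my original message (quoted below) can be
reproduced with a small sketch like this. The numbers are the ones I
reported. The two commented commands are only an assumption about how both
counts could be re-read on a live setup, reusing the placeholders myDB and
mySolrUrl from the solrindex call.

```shell
#!/bin/sh
# Counts taken from the original message.
fetched=75031   # fetched pages reported by the crawl db stats
indexed=74827   # documents reported by Solr
missing=$((fetched - indexed))
echo "missing from Solr: $missing"

# Assumed way to re-read both counts on a live setup (placeholders, not real paths):
#   bin/nutch readdb myDB -stats              # look at the db_fetched line
#   curl "mySolrUrl/select?q=*:*&rows=0"      # look at numFound in the response
```

This prints a difference of 204 documents.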

> Date: Tue, 2 Nov 2010 19:00:43 +0000
> Subject: Re: nutch solrindex doesn't index all the documents
> From: [email protected]
> To: [email protected]
> 
> Did you run the deduplication before indexing?
> 
> On 2 November 2010 00:23, Juan Felix <[email protected]> wrote:
> 
> >
> > Hi.
> >
> > I'm trying to index all the documents using the solrindex command, but
> > sometimes it doesn't index all of them.
> >
> > For example, the crawl db stats show 75,031 fetched pages, but after
> > indexing them into Solr, the number of documents in Solr is 74,827.
> >
> > Any idea? What happened to the other 204 pages that are not in Solr?
> >
> > Thanks
> > Juan Felix
> >
> 
> 
> 
> 
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
