Thanks for the reply, Markus,

By doc count I mean the number of documents indexed in Solr. This is the
same count that is shown in the stats when the indexing job completes
(Indexer: 31125 indexed (add/update)).

What is index-dummy, and how do I use it?


On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
wrote:

> Hello - are you sure you are observing docCount and not maxDoc? I don't
> remember having seen this kind of behaviour in the past years.
> If it is docCount, then I'd recommend using the index-dummy backend twice
> and diffing the results so you can see which documents are emitted, or
> not, between indexing jobs.
> That would allow you to find the records that change between jobs.
>
> Also, you mention indexing the same CrawlDB, but that is not all you
> index; the segments matter too. If you can reproduce it with the same CrawlDB
> and the same set of segments, unchanged, with index-dummy, it would be
> helpful. If the problem is only reproducible with different sets of
> segments, then there is no problem.
>
> Markus
>
>
>
> -----Original message-----
> > From:mark mark <[email protected]>
> > Sent: Monday 8th August 2016 19:39
> > To: [email protected]
> > Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
> >
> > Hi All,
> >
> > I am using Nutch 1.12 and have observed that indexing the same CrawlDB
> > multiple times gives a different indexed doc count each time.
> >
> > We indexed from the CrawlDB and noted the indexed doc count, then wiped
> > the whole index from Solr and indexed again. This time the number of
> > documents indexed was lower than before.
> >
> > I removed all our customized plugins, but the indexed doc count still
> > varies, and it is reproducible almost every time.
> >
> > Command I used for the crawl:
> > ./crawl seedPath crawlDir -1
> >
> > Command used for indexing to Solr:
> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter
> > -normalize -deleteGone
> >
> > Please suggest.
> >
> > Thanks Mark
> >
>
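
The "index twice and diff" step suggested in the quoted reply could be
sketched roughly as below. This is only an illustration under stated
assumptions: it assumes each indexing run left behind a plain-text file with
one "action<TAB>url" line per emitted document (for example "add" or
"delete" followed by the URL). That file layout and the helper names
load_actions and diff_runs are assumptions made for this sketch, not
something confirmed in the thread.

```python
def load_actions(lines):
    """Parse 'action<TAB>url' lines into a set of (action, url) pairs.

    The tab-separated layout is an assumption about the dump file format.
    """
    pairs = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        action, _, url = line.partition("\t")
        pairs.add((action, url))
    return pairs


def diff_runs(first_lines, second_lines):
    """Return (only_in_first, only_in_second) as sorted lists of pairs."""
    first = load_actions(first_lines)
    second = load_actions(second_lines)
    return sorted(first - second), sorted(second - first)
```

To use it, open the two dump files (the names are whatever each run wrote)
and pass the file objects to diff_runs. Any pair that appears in only one of
the two lists is a document that was emitted in one indexing job but not in
the other.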
