To add to my previous reply: by "indexing the same CrawlDB" I mean indexing the same set of segments with the same CrawlDB.

On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:

> Thanks for reply Markus,
>
> By doc count I mean document got indexed in solr this is also the same
> count which is shown as stats when indexing job completes (Indexer: 31125
> indexed (add/update)
>
> What is index-dummy and how to use this ?
>
>
> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
> wrote:
>
>> Hello - are you sure you are observing docCount and not maxDoc? I don't
>> remember having seen this kind of behaviour in the past years.
>> If it is docCount, then i'd recommend using index-dummy backend twice and
>> diffing their results so you can see which documents are emitted, or not,
>> between indexing jobs.
>> That would allow you to find the records that change between jobs.
>>
>> Also, you mention indexing the same CrawlDB but that is not just what you
>> index, the segments matter. If you can reproduce it with the same CrawlDB
>> and the same set of segments, unchanged, with index-dummy, it would be
>> helpful. If the problem is only reproducible with different sets of
>> segments, then there is no problem.
>>
>> Markus
>>
>>
>>
>> -----Original message-----
>> > From:mark mark <[email protected]>
>> > Sent: Monday 8th August 2016 19:39
>> > To: [email protected]
>> > Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
>> >
>> > Hi All,
>> >
>> > I am using nutch 1.12 , observed indexing same crawlDB multiple times
>> > gives different indexed doc count.
>> >
>> > We indexing from crawlDB and noted the indexed doc count, then wiped all
>> > index from solr and indexed again, this time number of document indexed
>> > were less then before.
>> >
>> > I removed all our customized plugins but indexed doc count still varies
>> > and it's reproducible almost every time.
>> >
>> > Command I used for crawl
>> > ./crawl seedPath crawlDir -1
>> >
>> > Command Used for Indexing to solr:
>> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
>> >
>> > Please suggest.
>> >
>> > Thanks Mark
>> >
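
For the "run index-dummy twice and diff" step Markus describes, here is a minimal sketch in Python of how the two outputs could be compared. It rests on assumptions that are not confirmed anywhere in this thread: that each indexing run leaves a plain-text file with one record per line in the form action<TAB>url (for example "add" or "delete" followed by the document URL), and that the two files were saved under the hypothetical names run1.txt and run2.txt. Adjust the parsing if the actual output looks different.

```python
import sys
from pathlib import Path


def load_actions(path):
    """Return the set of (action, url) pairs found in one dump file.

    Assumes one record per line, formatted as action<TAB>url.
    A line without a tab is kept as (line, "").
    """
    pairs = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action, _, url = line.partition("\t")
        pairs.add((action, url))
    return pairs


def diff_runs(path_a, path_b):
    """Return (records only in run A, records only in run B), both sorted."""
    a = load_actions(path_a)
    b = load_actions(path_b)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python diff_runs.py run1.txt run2.txt
    only_a, only_b = diff_runs(sys.argv[1], sys.argv[2])
    for action, url in only_a:
        print(f"only in first run:  {action}\t{url}")
    for action, url in only_b:
        print(f"only in second run: {action}\t{url}")
```

If both lists come back empty for the same CrawlDB and the same unchanged segments, the indexing jobs emitted the same records, and the difference in count would have to come from somewhere else.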

