Ah, this doesn't look right. Can you also check Sebastian Nagel's response to this thread? He explains a likely cause thoroughly, as usual.
Markus

-----Original message-----
> From: mark mark <[email protected]>
> Sent: Tuesday 9th August 2016 20:23
> To: [email protected]
> Subject: Re: Indexing Same CrawlDB Result In Different Indexed Doc Count
>
> Ok, I ran the dummy indexer 2 times; before running it the 2nd time I cleaned
> the indexes from Solr.
> The diff is that 5 docs are present in one run but not in the other.
>
> e.g.
>
> *First Time Indexing Report*
>
> add *http*://www.apple.com/ca/retail/chinookcentre/
> delete *https*://www.apple.com/ca/retail/chinookcentre/
>
> *Second Time Indexing Report*
>
> delete *https*://www.apple.com/ca/retail/chinookcentre/
>
> No add record for the http version.
>
> We have the same documents with both the https and http protocols.
>
> The first time it added the *http* version and deleted the *https* version of
> the same URL, which makes sense: in the crawlDB the http URL has status
> *db_fetched* and the https URL has status *db_duplicate*, so it deleted the
> duplicate.
>
> The second time it just deleted the https version but did not add the http
> version, which seems wrong; it should behave the same way as it did the
> first time.
>
> Thanks
> Mark
>
>
> On Mon, Aug 8, 2016 at 3:16 PM, Markus Jelsma <[email protected]> wrote:
>
> > Ok, so it is the same CrawlDB and the same set of segments. Index-dummy is
> > an indexing plugin, just as index-solr or index-elastic is. Index-dummy
> > just emits a file with the records added or deleted and is easy to use
> > with the diff program since it is sorted by URL. Make sure you use one
> > reducer in this case. With it you can track troublesome records and find
> > out what is going on.
> >
> > M.
> >
> >
> > -----Original message-----
> > > From: mark mark <[email protected]>
> > > Sent: Monday 8th August 2016 22:57
> > > To: [email protected]
> > > Subject: Re: Indexing Same CrawlDB Result In Different Indexed Doc Count
> > >
> > > To add to my previous reply: by "indexing the same crawlDB" I mean the
> > > same set of segments with the same crawlDB.
> > >
> > >
> > > On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
> > >
> > > > Thanks for the reply, Markus.
> > > >
> > > > By doc count I mean the documents that got indexed in Solr; this is
> > > > also the same count shown in the stats when the indexing job completes
> > > > (Indexer: 31125 indexed (add/update)).
> > > >
> > > > What is index-dummy and how do I use it?
> > > >
> > > >
> > > > On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]> wrote:
> > > >
> > > >> Hello - are you sure you are observing docCount and not maxDoc? I
> > > >> don't remember having seen this kind of behaviour in the past years.
> > > >> If it is docCount, then I'd recommend using the index-dummy backend
> > > >> twice and diffing their results so you can see which documents are
> > > >> emitted, or not, between indexing jobs.
> > > >> That would allow you to find the records that change between jobs.
> > > >>
> > > >> Also, you mention indexing the same CrawlDB, but that is not just what
> > > >> you index; the segments matter. If you can reproduce it with the same
> > > >> CrawlDB and the same set of segments, unchanged, with index-dummy,
> > > >> that would be helpful. If the problem is only reproducible with
> > > >> different sets of segments, then there is no problem.
> > > >>
> > > >> Markus
> > > >>
> > > >>
> > > >> -----Original message-----
> > > >> > From: mark mark <[email protected]>
> > > >> > Sent: Monday 8th August 2016 19:39
> > > >> > To: [email protected]
> > > >> > Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
> > > >> >
> > > >> > Hi All,
> > > >> >
> > > >> > I am using Nutch 1.12 and observed that indexing the same crawlDB
> > > >> > multiple times gives a different indexed doc count.
> > > >> >
> > > >> > We indexed from the crawlDB and noted the indexed doc count, then
> > > >> > wiped the whole index from Solr and indexed again; this time the
> > > >> > number of documents indexed was less than before.
> > > >> >
> > > >> > I removed all our customized plugins but the indexed doc count still
> > > >> > varies, and it's reproducible almost every time.
> > > >> >
> > > >> > Command I used for the crawl:
> > > >> > ./crawl seedPath crawlDir -1
> > > >> >
> > > >> > Command used for indexing to Solr:
> > > >> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/*
> > > >> > -filter -normalize -deleteGone
> > > >> >
> > > >> > Please suggest.
> > > >> >
> > > >> > Thanks
> > > >> > Mark
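For reference, the index-dummy diff workflow Markus describes might look roughly like the sketch below. It assumes the dummy index writer is enabled via plugin.includes in nutch-site.xml in place of the Solr writer, and that it writes its add/delete records to a local file (here called dummy-index.txt). The plugin id, the output file name and the single-reducer property are assumptions to verify against your Nutch 1.12 installation, not confirmed settings.

    # Minimal sketch, assuming the dummy index writer is enabled via
    # plugin.includes and writes its add/delete records to dummy-index.txt
    # (plugin id, output file and the -D reducer option are assumptions).

    # First run, forced to one reducer so the output is a single sorted file.
    ./nutch index -Dmapreduce.job.reduces=1 $CRAWLDB_PATH $CRAWLDB_DIR/segments/* \
        -filter -normalize -deleteGone
    mv dummy-index.txt run1.txt

    # Second run over the same CrawlDB and the same, unchanged segments.
    ./nutch index -Dmapreduce.job.reduces=1 $CRAWLDB_PATH $CRAWLDB_DIR/segments/* \
        -filter -normalize -deleteGone
    mv dummy-index.txt run2.txt

    # Records emitted in one run but not the other, e.g. a delete of the https
    # URL whose matching add of the http URL only appears in the first run.
    diff run1.txt run2.txt

Any URL that shows up in only one of the two files is a record worth tracing back through its CrawlDB status (db_fetched vs. db_duplicate) and the deduplication step.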

