Ok, I ran the dummy indexer twice, cleaning the indexes from Solr before the second run. The diff is that 5 docs are present in one run but not in the other.
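For reference, here is a minimal sketch of how such a diff can be produced with `comm`. The file names and the tab-separated add/delete record format are purely illustrative stand-ins for the two index-dummy output files, not actual Nutch output:

```shell
# Hypothetical sample records mimicking index-dummy's add/delete output
# (file names and record format are illustrative only)
printf 'add\thttp://www.apple.com/ca/retail/chinookcentre/\n' > run1.txt
printf 'delete\thttps://www.apple.com/ca/retail/chinookcentre/\n' >> run1.txt
printf 'delete\thttps://www.apple.com/ca/retail/chinookcentre/\n' > run2.txt

# comm expects sorted input; -3 suppresses lines common to both files,
# leaving only the records that differ between the two indexing runs
sort run1.txt > run1.sorted
sort run2.txt > run2.sorted
comm -3 run1.sorted run2.sorted
```

With these sample files, the only line printed is the `add` record that is present in the first run but missing from the second.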
For example:

*First Time Indexing Report*
add http://www.apple.com/ca/retail/chinookcentre/
delete https://www.apple.com/ca/retail/chinookcentre/

*Second Time Indexing Report*
delete https://www.apple.com/ca/retail/chinookcentre/
(no add record for the http version)

We have the same document under both the http and https protocols. The first time, it added the *http* version and deleted the *https* version of the same URL, which makes sense: in the CrawlDB the http URL has status *db_fetched* and the https URL has status *db_duplicate*, so it deleted the duplicate. The second time it only deleted the https version but did not add the http version, which seems wrong; it should behave the same way it did the first time.

Thanks
Mark

On Mon, Aug 8, 2016 at 3:16 PM, Markus Jelsma <[email protected]> wrote:
> Ok, so it is the same CrawlDB and the same set of segments. Index-dummy is
> an indexing plugin, just as index-solr or index-elastic is. Index-dummy
> just emits a file with records added or deleted, and it is easy to use for the
> diff program since it is sorted by URL. Make sure you use one reducer in
> this case. With it you can track troublesome records and find out what is
> going on.
>
> M.
>
> -----Original message-----
> > From: mark mark <[email protected]>
> > Sent: Monday 8th August 2016 22:57
> > To: [email protected]
> > Subject: Re: Indexing Same CrawlDB Result In Different Indexed Doc Count
> >
> > To add to my previous reply: by "indexing the same crawlDB" I mean the same
> > set of segments with the same crawlDB.
> >
> > On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
> >
> > > Thanks for the reply, Markus.
> > >
> > > By doc count I mean the number of documents indexed in Solr; this is also
> > > the same count shown in the stats when the indexing job completes
> > > (Indexer: 31125 indexed (add/update)).
> > >
> > > What is index-dummy and how do I use it?
> > >
> > > On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
> > > wrote:
> > >
> > >> Hello - are you sure you are observing docCount and not maxDoc? I don't
> > >> remember having seen this kind of behaviour in the past years.
> > >> If it is docCount, then I'd recommend using the index-dummy backend twice
> > >> and diffing the results so you can see which documents are emitted, or
> > >> not, between indexing jobs.
> > >> That would allow you to find the records that change between jobs.
> > >>
> > >> Also, you mention indexing the same CrawlDB, but that is not all that you
> > >> index; the segments matter. If you can reproduce it with the same CrawlDB
> > >> and the same set of segments, unchanged, with index-dummy, that would be
> > >> helpful. If the problem is only reproducible with different sets of
> > >> segments, then there is no problem.
> > >>
> > >> Markus
> > >>
> > >> -----Original message-----
> > >> > From: mark mark <[email protected]>
> > >> > Sent: Monday 8th August 2016 19:39
> > >> > To: [email protected]
> > >> > Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
> > >> >
> > >> > Hi All,
> > >> >
> > >> > I am using Nutch 1.12 and observed that indexing the same crawlDB
> > >> > multiple times gives a different indexed doc count.
> > >> >
> > >> > We indexed from the crawlDB and noted the indexed doc count, then wiped
> > >> > the whole index from Solr and indexed again; this time the number of
> > >> > documents indexed was less than before.
> > >> >
> > >> > I removed all our customized plugins, but the indexed doc count still
> > >> > varies, and it's reproducible almost every time.
> > >> >
> > >> > Command I used for the crawl:
> > >> > ./crawl seedPath crawlDir -1
> > >> >
> > >> > Command used for indexing to Solr:
> > >> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/*
> > >> > -filter -normalize -deleteGone
> > >> >
> > >> > Please suggest.
> > >> >
> > >> > Thanks
> > >> > Mark

