Ok, so it is the same CrawlDB and the same set of segments. index-dummy is an 
indexing plugin, just like index-solr or index-elastic. It simply writes a file 
listing the records that were added or deleted. Because that file is sorted by 
URL, it is easy to compare two runs with a diff program. Make sure you use one 
reducer in this case, so the output ends up in a single sorted file. With it you 
can track down troublesome records and find out what is going on.
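
For illustration, here is a minimal sketch of the comparison step in Python. It 
assumes each index-dummy run wrote a plain-text file with one record per line 
(an action and a URL); the file names and that line layout are assumptions 
made for the example, not something stated in this thread. A plain diff on the 
two files would work just as well.

```python
import sys
from pathlib import Path


def load_records(path):
    """Return the set of non-empty lines found in one output file."""
    text = Path(path).read_text(encoding="utf-8")
    return {line for line in text.splitlines() if line.strip()}


def diff_runs(first, second):
    """Return (records only in first, records only in second), both sorted."""
    a = load_records(first)
    b = load_records(second)
    return sorted(a - b), sorted(b - a)


# Hypothetical usage: python diff_runs.py run1.txt run2.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    only_first, only_second = diff_runs(sys.argv[1], sys.argv[2])
    for record in only_first:
        print("< " + record)
    for record in only_second:
        print("> " + record)
```

Any line printed with "<" or ">" is a record that one run emitted and the 
other did not, which is where to start looking.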

M.

 
 
-----Original message-----
> From:mark mark <[email protected]>
> Sent: Monday 8th August 2016 22:57
> To: [email protected]
> Subject: Re: Indexing Same CrawlDB Result In Different Indexed Doc Count
> 
> To add to my previous reply: by "indexing the same CrawlDB" I mean the same
> set of segments with the same CrawlDB.
> 
> 
> On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
> 
> > Thanks for the reply, Markus.
> >
> > By doc count I mean the number of documents indexed in Solr. This is the same
> > count that is shown in the stats when the indexing job completes (Indexer: 31125
> > indexed (add/update)).
> >
> > What is index-dummy, and how do I use it?
> >
> >
> > On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
> > wrote:
> >
> >> Hello - are you sure you are observing docCount and not maxDoc? I don't
> >> remember having seen this kind of behaviour in the past years.
> >> If it is docCount, then I'd recommend running the index-dummy backend twice
> >> and diffing the results so you can see which documents are emitted, or not,
> >> between indexing jobs.
> >> That would allow you to find the records that change between jobs.
> >>
> >> Also, you mention indexing the same CrawlDB, but that is not all you index;
> >> the segments matter too. If you can reproduce it with the same CrawlDB
> >> and the same set of segments, unchanged, with index-dummy, it would be
> >> helpful. If the problem is only reproducible with different sets of
> >> segments, then there is no problem.
> >>
> >> Markus
> >>
> >>
> >>
> >> -----Original message-----
> >> > From:mark mark <[email protected]>
> >> > Sent: Monday 8th August 2016 19:39
> >> > To: [email protected]
> >> > Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
> >> >
> >> > Hi All,
> >> >
> >> > I am using Nutch 1.12 and have observed that indexing the same CrawlDB
> >> > multiple times gives a different indexed doc count.
> >> >
> >> > We indexed from the CrawlDB and noted the indexed doc count, then wiped the
> >> > whole index from Solr and indexed again. This time the number of documents
> >> > indexed was lower than before.
> >> >
> >> > I removed all our customized plugins, but the indexed doc count still varies
> >> > and it's reproducible almost every time.
> >> >
> >> > Command I used for crawl
> >> > ./crawl seedPath crawlDir -1
> >> >
> >> > Command Used for Indexing to solr:
> >> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
> >> >
> >> > Please suggest.
> >> >
> >> > Thanks Mark
> >> >
> >>
> >
> >
> 
