Ah, this doesn't look right. Can you also check Sebastian Nagel's response to
this thread? He explains a likely cause thoroughly, as usual.

Markus

 
 
-----Original message-----
> From:mark mark <[email protected]>
> Sent: Tuesday 9th August 2016 20:23
> To: [email protected]
> Subject: Re: Indexing Same CrawlDB Result In Different Indexed Doc Count
> 
> OK, I ran the dummy indexer twice; before the second run I cleaned the
> index in Solr.
> The diff shows 5 docs that are present in one run but not in the other.
> 
> e.g.
> *First Time Indexing Report*
> 
> add *http*://www.apple.com/ca/retail/chinookcentre/
> delete *https*://www.apple.com/ca/retail/chinookcentre/
> 
> *Second Time Indexing Report*
> 
> delete *https*://www.apple.com/ca/retail/chinookcentre/
> 
> No add record for the http version.
> 
> 
> We have the same documents under both the https and the http protocol.
> 
> The first time it added the *http* version and deleted the *https* version of
> the same URL, which makes sense: in the crawlDB the http URL has status
> *db_fetched* and the https URL has status *db_duplicate*, so it deleted the
> duplicate.
> 
> The second time it only deleted the https version but did not add the http
> version, which seems wrong; it should behave the same way it did the first time.
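> 
> (For reference, the CrawlDB status of both variants can be dumped directly;
> a sketch using the same paths as in my first mail:)
> 
> ./nutch readdb $CRAWLDB_PATH -url http://www.apple.com/ca/retail/chinookcentre/
> ./nutch readdb $CRAWLDB_PATH -url https://www.apple.com/ca/retail/chinookcentre/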
> 
> Thanks Mark
> 
> 
> On Mon, Aug 8, 2016 at 3:16 PM, Markus Jelsma <[email protected]>
> wrote:
> 
> > Ok, so it is the same CrawlDB and the same set of segments. Index-dummy is
> > an indexing plugin, just like index-solr or index-elastic. It simply emits a
> > file with the records added or deleted, and since that file is sorted by URL
> > it is easy to feed to a diff program; make sure you use one reducer in this
> > case. With it you can track troublesome records and find out what is going on.
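> > 
> > A rough sketch of that workflow (the plugin id "indexer-dummy", the output
> > file name and how its location is configured are assumptions here; adjust
> > to your build):
> > 
> > # 1) add the dummy index writer to plugin.includes in conf/nutch-site.xml and
> > #    run the indexing job with a single reducer so the output is sorted by URL
> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
> > mv dummy-output run1.txt   # placeholder name for the dummy writer's output file
> > # 2) repeat the exact same command against the unchanged CrawlDB and segments
> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
> > mv dummy-output run2.txt
> > # 3) diff the two runs to see which add/delete records differ
> > diff run1.txt run2.txt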
> >
> > M.
> >
> >
> >
> > -----Original message-----
> > > From:mark mark <[email protected]>
> > > Sent: Monday 8th August 2016 22:57
> > > To: [email protected]
> > > Subject: Re: Indexing Same CrawlDB Result In Different Indexed Doc Count
> > >
> > > To add to my previous reply: by "indexing the same crawlDB" I mean the
> > > same set of segments with the same crawlDB.
> > >
> > >
> > > On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
> > >
> > > > Thanks for the reply, Markus,
> > > >
> > > > By doc count I mean the number of documents that got indexed in Solr;
> > > > this is also the count shown in the stats when the indexing job
> > > > completes (Indexer: 31125 indexed (add/update)).
> > > >
> > > > What is index-dummy and how do I use it?
> > > >
> > > >
> > > > On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
> > > > wrote:
> > > >
> > > >> Hello - are you sure you are observing docCount and not maxDoc? I don't
> > > >> remember having seen this kind of behaviour in the past years.
> > > >> If it is docCount, then I'd recommend using the index-dummy backend twice
> > > >> and diffing the results so you can see which documents are emitted, or
> > > >> not, between indexing jobs.
> > > >> That would allow you to find the records that change between jobs.
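> > > >> 
> > > >> As a side note, a quick way to compare the two counts is Solr's Luke
> > > >> handler (a sketch, assuming $SOLR_URL points at the core):
> > > >> 
> > > >> curl "$SOLR_URL/admin/luke?numTerms=0&wt=json"
> > > >> # "numDocs" is the live document count (docCount); "maxDoc" also counts
> > > >> # documents that are deleted but not yet merged away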
> > > >>
> > > >> Also, you mention indexing the same CrawlDB, but that is not all you
> > > >> index; the segments matter too. If you can reproduce it with the same
> > > >> CrawlDB and the same set of segments, unchanged, with index-dummy, that
> > > >> would be helpful. If the problem is only reproducible with different
> > > >> sets of segments, then there is no problem.
> > > >>
> > > >> Markus
> > > >>
> > > >>
> > > >>
> > > >> -----Original message-----
> > > >> > From:mark mark <[email protected]>
> > > >> > Sent: Monday 8th August 2016 19:39
> > > >> > To: [email protected]
> > > >> > Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
> > > >> >
> > > >> > Hi All,
> > > >> >
> > > >> > I am using Nutch 1.12 and have observed that indexing the same crawlDB
> > > >> > multiple times gives a different indexed doc count.
> > > >> >
> > > >> > We indexed from the crawlDB and noted the indexed doc count, then
> > > >> > wiped the whole index from Solr and indexed again; this time the
> > > >> > number of documents indexed was lower than before.
> > > >> >
> > > >> > I removed all our customized plugins, but the indexed doc count still
> > > >> > varies, and it's reproducible almost every time.
> > > >> >
> > > >> > Command I used for the crawl:
> > > >> > ./crawl seedPath crawlDir -1
> > > >> >
> > > >> > Command used for indexing to Solr:
> > > >> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
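> > > >> > 
> > > >> > (If it helps, a quick sanity check that the input really is identical
> > > >> > between the two indexing runs, using the same variables as above:)
> > > >> > 
> > > >> > ./nutch readdb $CRAWLDB_PATH -stats
> > > >> > ls $CRAWLDB_DIR/segments/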
> > > >> >
> > > >> > Please suggest.
> > > >> >
> > > >> > Thanks Mark
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
> 
