Ok, I ran the dummy indexer twice, cleaning the indexes from Solr before the second run. The diff is that 5 docs are present in one run but not in the other.
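For reference, here is a minimal sketch of how such a diff can be produced with `comm`. The file names and the tab-separated add/delete record format are purely illustrative stand-ins for the two index-dummy output files, not actual Nutch output:

```shell
# Hypothetical sample records mimicking index-dummy's add/delete output
# (file names and record format are illustrative only)
printf 'add\thttp://www.apple.com/ca/retail/chinookcentre/\n' > run1.txt
printf 'delete\thttps://www.apple.com/ca/retail/chinookcentre/\n' >> run1.txt
printf 'delete\thttps://www.apple.com/ca/retail/chinookcentre/\n' > run2.txt

# comm expects sorted input; -3 suppresses lines common to both files,
# leaving only the records that differ between the two indexing runs
sort run1.txt > run1.sorted
sort run2.txt > run2.sorted
comm -3 run1.sorted run2.sorted
```

With these sample files, the only line printed is the `add` record that is present in the first run but missing from the second.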
For example:

*First Time Indexing Report*
add http://www.apple.com/ca/retail/chinookcentre/
delete https://www.apple.com/ca/retail/chinookcentre/

*Second Time Indexing Report*
delete https://www.apple.com/ca/retail/chinookcentre/
(no add record for the http version)

We have the same document under both the http and https protocols. The first time, it added the *http* version and deleted the *https* version of the same URL, which makes sense: in the CrawlDB the http URL has status *db_fetched* and the https URL has status *db_duplicate*, so it deleted the duplicate. The second time it only deleted the https version but did not add the http version, which seems wrong; it should behave the same way it did the first time.

Thanks
Mark

On Mon, Aug 8, 2016 at 3:16 PM, Markus Jelsma <[email protected]> wrote:
> Ok, so it is the same CrawlDB and the same set of segments. Index-dummy is
> an indexing plugin, just as index-solr or index-elastic is. Index-dummy
> just emits a file with records added or deleted, and it is easy to use for the
> diff program since it is sorted by URL. Make sure you use one reducer in
> this case. With it you can track troublesome records and find out what is
> going on.
>
> M.
>
> -----Original message-----
> > From: mark mark <[email protected]>
> > Sent: Monday 8th August 2016 22:57
> > To: [email protected]
> > Subject: Re: Indexing Same CrawlDB Result In Different Indexed Doc Count
> >
> > To add to my previous reply: by "indexing the same crawlDB" I mean the same
> > set of segments with the same crawlDB.
> >
> > On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
> >
> > > Thanks for the reply, Markus.
> > >
> > > By doc count I mean the number of documents indexed in Solr; this is also
> > > the same count shown in the stats when the indexing job completes
> > > (Indexer: 31125 indexed (add/update)).
> > >
> > > What is index-dummy and how do I use it?
> > >
> > > On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
> > > wrote:
> > >
> > >> Hello - are you sure you are observing docCount and not maxDoc? I don't
> > >> remember having seen this kind of behaviour in the past years.
> > >> If it is docCount, then I'd recommend using the index-dummy backend twice
> > >> and diffing the results so you can see which documents are emitted, or
> > >> not, between indexing jobs.
> > >> That would allow you to find the records that change between jobs.
> > >>
> > >> Also, you mention indexing the same CrawlDB, but that is not all that you
> > >> index; the segments matter. If you can reproduce it with the same CrawlDB
> > >> and the same set of segments, unchanged, with index-dummy, that would be
> > >> helpful. If the problem is only reproducible with different sets of
> > >> segments, then there is no problem.
> > >>
> > >> Markus
> > >>
> > >> -----Original message-----
> > >> > From: mark mark <[email protected]>
> > >> > Sent: Monday 8th August 2016 19:39
> > >> > To: [email protected]
> > >> > Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
> > >> >
> > >> > Hi All,
> > >> >
> > >> > I am using Nutch 1.12 and observed that indexing the same crawlDB
> > >> > multiple times gives a different indexed doc count.
> > >> >
> > >> > We indexed from the crawlDB and noted the indexed doc count, then wiped
> > >> > the whole index from Solr and indexed again; this time the number of
> > >> > documents indexed was less than before.
> > >> >
> > >> > I removed all our customized plugins, but the indexed doc count still
> > >> > varies, and it's reproducible almost every time.
> > >> >
> > >> > Command I used for the crawl:
> > >> > ./crawl seedPath crawlDir -1
> > >> >
> > >> > Command used for indexing to Solr:
> > >> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/*
> > >> > -filter -normalize -deleteGone
> > >> >
> > >> > Please suggest.
> > >> >
> > >> > Thanks
> > >> > Mark

