Hi Sebastian,

Thanks for the reply.

What I observed is that this issue occurs with 1.12 only; it works perfectly on 1.11. I crawled on 1.11 and then indexed to Solr, wiped the Solr index, and reindexed from the same crawl segments; the indexed doc count was the same every time. When I do the same steps with 1.12, the indexed doc count varies.

Crawl:

  crawl $SEED_PATH $CRAWLDB_DIR $NUM_ROUNDS

Index:

  nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
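To spell out the reproduce-and-compare steps (the paths and the Solr calls below are illustrative, not our exact scripts):

  # 1. Crawl once; the segments are reused unchanged for every index run
  crawl $SEED_PATH $CRAWLDB_DIR $NUM_ROUNDS

  # 2. Index and note the doc count reported by Solr (numFound)
  nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
  curl "$SOLR_URL/select?q=*:*&rows=0&wt=json"

  # 3. Wipe the Solr index
  curl "$SOLR_URL/update?commit=true" -H "Content-Type: text/xml" \
       --data-binary "<delete><query>*:*</query></delete>"

  # 4. Reindex from the same segments and compare numFound
  nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
  curl "$SOLR_URL/select?q=*:*&rows=0&wt=json"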
Regards,
Mark

On Mon, Aug 8, 2016 at 11:26 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Mark, hi Markus,
>
> bin/crawl does for each cycle: generate, fetch, parse, updatedb, dedup,
> and then
>
>   bin/nutch index crawldb segments/timestamp_this_cycle
>   bin/nutch clean crawldb
>
> That's different from calling once
>
>   bin/nutch index ... crawldb segments/* -filter -normalize -deleteGone
>
> I see the following problems (there could be more):
>
> 1. bin/crawl does not pass -filter -normalize to the index command.
>    Of course, this could be the trivial reason if filter or normalizer
>    rules have been changed since the crawl was started. Also note: if
>    filters are changed while a crawl is running, URLs/documents are
>    deleted from the CrawlDb but are not deleted from the index.
>
> 2. How long was bin/crawl running? If it ran for a longer time, so that
>    some documents were refetched: indexing multiple segments is not
>    stable, see https://issues.apache.org/jira/browse/NUTCH-1416
>
> 3. "index + clean" may behave differently than "index -deleteGone".
>    But this is rather a bug.
>
> Sebastian
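(A sketch of the per-cycle behaviour Sebastian describes, applied after the fact to an existing crawl; crawl_dir is a placeholder, and this only approximates bin/crawl, it is not its actual code:

  # index each segment separately, cleaning after each one,
  # roughly as bin/crawl does per cycle (note: no -filter/-normalize)
  for segment in crawl_dir/segments/*; do
    bin/nutch index crawl_dir/crawldb "$segment"
    bin/nutch clean crawl_dir/crawldb
  done

  # versus one job over all segments at once:
  bin/nutch index crawl_dir/crawldb crawl_dir/segments/* -filter -normalize -deleteGone
)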
> On 08/08/2016 10:56 PM, mark mark wrote:
> > To add on my previous reply: by "indexing the same crawlDB" I mean the
> > same set of segments with the same crawldb.
> >
> > On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
> >
> >> Thanks for the reply, Markus.
> >>
> >> By doc count I mean the documents indexed in Solr; this is also the
> >> count shown in the stats when the indexing job completes
> >> (Indexer: 31125 indexed (add/update)).
> >>
> >> What is index-dummy and how do I use it?
> >>
> >> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]> wrote:
> >>
> >>> Hello - are you sure you are observing docCount and not maxDoc? I don't
> >>> remember having seen this kind of behaviour in the past years.
> >>> If it is docCount, then I'd recommend using the index-dummy backend
> >>> twice and diffing the results so you can see which documents are
> >>> emitted, or not, between indexing jobs. That would allow you to find
> >>> the records that change between jobs.
> >>>
> >>> Also, you mention indexing the same CrawlDB, but that is not just what
> >>> you index; the segments matter. If you can reproduce it with the same
> >>> CrawlDB and the same set of segments, unchanged, with index-dummy, that
> >>> would be helpful. If the problem is only reproducible with different
> >>> sets of segments, then there is no problem.
> >>>
> >>> Markus
> >>>
> >>> -----Original message-----
> >>>> From: mark mark <[email protected]>
> >>>> Sent: Monday 8th August 2016 19:39
> >>>> To: [email protected]
> >>>> Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I am using Nutch 1.12 and observed that indexing the same crawlDB
> >>>> multiple times gives different indexed doc counts.
> >>>>
> >>>> We indexed from the crawlDB and noted the indexed doc count, then
> >>>> wiped the index from Solr and indexed again; this time the number of
> >>>> documents indexed was less than before.
> >>>>
> >>>> I removed all our customized plugins, but the indexed doc count still
> >>>> varies and it's reproducible almost every time.
> >>>>
> >>>> Command I used for crawl:
> >>>>   ./crawl seedPath crawlDir -1
> >>>>
> >>>> Command used for indexing to Solr:
> >>>>   ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
> >>>>
> >>>> Please suggest.
> >>>>
> >>>> Thanks,
> >>>> Mark
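(Following up on Markus's index-dummy suggestion, a sketch of how two runs could be compared; the file names are placeholders, and it is an assumption that the indexer-dummy plugin is enabled in place of indexer-solr via plugin.includes in your 1.12 build - check conf/nutch-default.xml for the plugin's output-path property:

  # run the same index job twice with the dummy backend, writing to two
  # files, then compare which documents were emitted in each run
  sort run1-dummy-output.txt > run1.sorted
  sort run2-dummy-output.txt > run2.sorted
  diff run1.sorted run2.sorted
)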

