Hi Mark,

thanks. I have no idea what the reason could be. Nutch 1.12 comes with
an upgrade from Solr 4 to Solr 5. But since it is reproducible with
indexer-dummy, it may not be related to Solr at all.

One question: are there any special URL normalization rules for the
indexer?

Thanks,
Sebastian

On 08/10/2016 10:05 PM, mark mark wrote:
> Hi Sebastian,
>
> Thanks for the reply.
>
> The issue I observed is with 1.12 only; it works perfectly on 1.11.
>
> I crawled on 1.11 and then indexed to Solr, wiped the Solr index and
> reindexed from the same crawl segments; the indexed doc count was the
> same every time.
>
> When I do the same steps using 1.12, the indexed doc count varies.
>
> Crawl:
> crawl $SEED_PATH $CRAWLDB_DIR $NUM_ROUNDS
>
> Index:
> nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter
> -normalize -deleteGone
>
> Regards,
> Mark
>
> On Mon, Aug 8, 2016 at 11:26 PM, Sebastian Nagel <[email protected]>
> wrote:
>
>> Hi Mark, hi Markus,
>>
>> bin/crawl does for each cycle: generate, fetch, parse, updatedb,
>> dedup, and then
>>
>>   bin/nutch index crawldb segments/timestamp_this_cycle
>>   bin/nutch clean crawldb
>>
>> That's different from calling once
>>
>>   bin/nutch index ... crawldb segments/* -filter -normalize -deleteGone
>>
>> I see the following problems (there could be more):
>>
>> 1. bin/crawl does not pass
>>      -filter -normalize
>>    to the index command. Of course, this could be the trivial reason
>>    if filter or normalizer rules have been changed since the crawl was
>>    started. Also note: if filters are changed while a crawl is
>>    running, URLs/documents are deleted from the CrawlDb but are not
>>    deleted from the index.
>>
>> 2. How long was bin/crawl running? In case it ran for a longer time,
>>    so that some documents were refetched: indexing multiple segments
>>    is not stable, see https://issues.apache.org/jira/browse/NUTCH-1416
>>
>> 3. "index + clean" may behave differently than "index -deleteGone".
>>    But that would rather be a bug.
>>
>> Sebastian
>>
>>
>> On 08/08/2016 10:56 PM, mark mark wrote:
>>> To add to my previous reply: by "indexing the same crawlDB" I mean
>>> the same set of segments with the same crawldb.
>>>
>>>
>>> On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
>>>
>>>> Thanks for the reply, Markus.
>>>>
>>>> By doc count I mean the documents that got indexed in Solr; this is
>>>> also the count shown in the stats when the indexing job completes
>>>> (Indexer: 31125 indexed (add/update)).
>>>>
>>>> What is index-dummy and how do I use it?
>>>>
>>>>
>>>> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello - are you sure you are observing docCount and not maxDoc? I
>>>>> don't remember having seen this kind of behaviour in the past
>>>>> years.
>>>>> If it is docCount, then I'd recommend using the index-dummy
>>>>> backend twice and diffing the results so you can see which
>>>>> documents are emitted, or not, between indexing jobs.
>>>>> That would allow you to find the records that change between jobs.
>>>>>
>>>>> Also, you mention indexing the same CrawlDB, but that is not all
>>>>> you index; the segments matter. If you can reproduce it with the
>>>>> same CrawlDB and the same set of segments, unchanged, with
>>>>> index-dummy, that would be helpful. If the problem is only
>>>>> reproducible with different sets of segments, then there is no
>>>>> problem.
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>>
>>>>> -----Original message-----
>>>>>> From: mark mark <[email protected]>
>>>>>> Sent: Monday 8th August 2016 19:39
>>>>>> To: [email protected]
>>>>>> Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am using Nutch 1.12 and observed that indexing the same crawlDB
>>>>>> multiple times gives a different indexed doc count.
>>>>>>
>>>>>> We indexed from the crawlDB and noted the indexed doc count, then
>>>>>> wiped the whole index from Solr and indexed again; this time the
>>>>>> number of documents indexed was less than before.
>>>>>>
>>>>>> I removed all our customized plugins, but the indexed doc count
>>>>>> still varies, and it's reproducible almost every time.
>>>>>>
>>>>>> Command I used for the crawl:
>>>>>> ./crawl seedPath crawlDir -1
>>>>>>
>>>>>> Command used for indexing to Solr:
>>>>>> ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/*
>>>>>> -filter -normalize -deleteGone
>>>>>>
>>>>>> Please suggest.
>>>>>>
>>>>>> Thanks,
>>>>>> Mark
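
[Editor's note: the diff workflow Markus recommends above can be sketched
roughly as follows. This is only a sketch, not a tested recipe: it assumes
indexer-dummy has been enabled via plugin.includes in conf/nutch-site.xml
in place of the Solr writer, and "dummy-index.txt", "crawl/crawldb", and
"crawl/segments" are placeholder names; the dummy writer's actual output
location depends on its configuration.]

```shell
# Run the same indexing job twice with the dummy index writer, then diff
# the two runs to find documents that appear in one run but not the other.
# Assumption: indexer-dummy is enabled in plugin.includes and writes its
# output to dummy-index.txt (placeholder path).

bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
mv dummy-index.txt run1.txt

bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
mv dummy-index.txt run2.txt

# Sort before diffing: the MapReduce output order is not deterministic,
# so an unsorted diff would report spurious differences.
sort run1.txt > run1.sorted
sort run2.txt > run2.sorted
diff run1.sorted run2.sorted
```

If the diff is empty for the same CrawlDB and the same unchanged set of
segments, the count variation is likely on the Solr side; if it is not,
the differing records point at the documents that change between jobs.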

