Hi Sebastian,

No custom normalization.

Mark

On Thu, Aug 11, 2016 at 12:54 PM, Sebastian Nagel <[email protected]> wrote:
> Hi Mark,
>
> thanks. I have no idea what could be the reason.
> Nutch 1.12 comes with an upgrade from Solr 4 to Solr 5.
>
> But since it is reproducible with indexer-dummy, it may
> not be related to Solr at all.
>
> One question: are there any special URL normalization rules
> for the indexer?
>
> Thanks,
> Sebastian
>
> On 08/10/2016 10:05 PM, mark mark wrote:
>> Hi Sebastian,
>>
>> Thanks for the reply.
>>
>> What I observed is that this issue is with 1.12 only; it works
>> perfectly on 1.11.
>>
>> I crawled on 1.11 and then indexed on Solr, wiped the Solr index and
>> reindexed from the same crawl segments; the indexed doc count was the
>> same every time.
>>
>> When I do the same steps using 1.12, the indexed doc count varies.
>>
>> Crawl:
>> crawl $SEED_PATH $CRAWLDB_DIR $NUM_ROUNDS
>>
>> Index:
>> nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter
>> -normalize -deleteGone
>>
>> Regards
>> Mark
>>
>> On Mon, Aug 8, 2016 at 11:26 PM, Sebastian Nagel <[email protected]> wrote:
>>> Hi Mark, hi Markus,
>>>
>>> bin/crawl does for each cycle: generate, fetch, parse, updatedb,
>>> dedup, and then
>>>
>>>   bin/nutch index crawldb segments/timestamp_this_cycle
>>>   bin/nutch clean crawldb
>>>
>>> That's different from calling once
>>>
>>>   bin/nutch index ... crawldb segments/* -filter -normalize -deleteGone
>>>
>>> I see the following problems (there could be more):
>>>
>>> 1. bin/crawl does not pass
>>>      -filter -normalize
>>>    to the index command. Of course, this could be the trivial reason
>>>    if filter or normalizer rules have been changed since the crawl was
>>>    started. Also note: if filters are changed while a crawl is
>>>    running, URLs/documents are deleted from the CrawlDb but are not
>>>    deleted from the index.
>>>
>>> 2. How long was bin/crawl running? In case it was for a longer time,
>>>    so that some documents are refetched: indexing multiple segments
>>>    is not stable, see
>>>    https://issues.apache.org/jira/browse/NUTCH-1416
>>>
>>> 3. "index + clean" may behave differently than "index -deleteGone".
>>>    But this is rather a bug.
>>>
>>> Sebastian
>>>
>>> On 08/08/2016 10:56 PM, mark mark wrote:
>>>> To add to my previous reply: by "indexing the same crawlDB" I mean
>>>> the same set of segments with the same crawldb.
>>>>
>>>> On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
>>>>> Thanks for the reply, Markus.
>>>>>
>>>>> By doc count I mean the documents indexed in Solr; this is also the
>>>>> count shown in the stats when the indexing job completes (Indexer:
>>>>> 31125 indexed (add/update)).
>>>>>
>>>>> What is index-dummy and how do I use it?
>>>>>
>>>>> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]> wrote:
>>>>>> Hello - are you sure you are observing docCount and not maxDoc? I
>>>>>> don't remember having seen this kind of behaviour in the past
>>>>>> years.
>>>>>> If it is docCount, then I'd recommend using the index-dummy
>>>>>> backend twice and diffing the results so you can see which
>>>>>> documents are emitted, or not, between indexing jobs. That would
>>>>>> allow you to find the records that change between jobs.
>>>>>>
>>>>>> Also, you mention indexing the same CrawlDB, but that is not all
>>>>>> you index; the segments matter. If you can reproduce it with the
>>>>>> same CrawlDB and the same set of segments, unchanged, with
>>>>>> index-dummy, that would be helpful. If the problem is only
>>>>>> reproducible with different sets of segments, then there is no
>>>>>> problem.
>>>>>>
>>>>>> Markus
>>>>>>
>>>>>> -----Original message-----
>>>>>>> From: mark mark <[email protected]>
>>>>>>> Sent: Monday 8th August 2016 19:39
>>>>>>> To: [email protected]
>>>>>>> Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am using Nutch 1.12 and observed that indexing the same crawlDB
>>>>>>> multiple times gives different indexed doc counts.
>>>>>>>
>>>>>>> We indexed from the crawlDB and noted the indexed doc count, then
>>>>>>> wiped the whole index from Solr and indexed again; this time the
>>>>>>> number of documents indexed was less than before.
>>>>>>>
>>>>>>> I removed all our customized plugins, but the indexed doc count
>>>>>>> still varies, and it's reproducible almost every time.
>>>>>>>
>>>>>>> Command I used for the crawl:
>>>>>>> ./crawl seedPath crawlDir -1
>>>>>>>
>>>>>>> Command used for indexing to Solr:
>>>>>>> ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/*
>>>>>>> -filter -normalize -deleteGone
>>>>>>>
>>>>>>> Please suggest.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Mark

--
Thanks & Regards
Manish Verma
669-224-9924
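The index-dummy comparison Markus suggests could be scripted roughly like this. The dump file name `dummy-index.txt`, the `crawl/` paths, and the guard are illustrative assumptions, not verified Nutch defaults; check `plugin.includes` and the indexer-dummy output path in your `nutch-site.xml`.

```shell
#!/bin/sh
# Run the same indexing job twice with the dummy back end, then diff the
# two dumps. dummy-index.txt and the crawl/ paths are assumptions; adjust
# to your setup. Guarded so the sketch is a no-op where Nutch is absent.
if [ -x bin/nutch ]; then
  bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
  mv dummy-index.txt run1.txt
  bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
  mv dummy-index.txt run2.txt
fi

if [ -f run1.txt ] && [ -f run2.txt ]; then
  # Sort both dumps so ordering differences don't hide real ones, then
  # list documents emitted in one run but not the other.
  sort run1.txt > run1.sorted
  sort run2.txt > run2.sorted
  diff run1.sorted run2.sorted || true
fi
```

The documents reported only for one run are exactly the records Markus says to look at: those that change between otherwise identical indexing jobs.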


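For reference, the two workflows Sebastian contrasts (per-cycle indexing as done by bin/crawl versus one bulk index call) can be sketched as follows. The paths are illustrative and the commands mirror his description in this thread, not the exact bin/crawl source.

```shell
#!/bin/sh
# (a) What bin/crawl does, per Sebastian's description: index each
#     segment right after its cycle (without -filter/-normalize), then
#     purge gone documents with a separate clean step.
# (b) The one-shot variant from the original report: all segments at
#     once, with filtering and normalization applied at indexing time.
# Paths are illustrative; guarded as a no-op where Nutch is absent.
if [ -x bin/nutch ]; then
  # (a) per-cycle style
  for seg in crawl/segments/*; do
    bin/nutch index crawl/crawldb "$seg"
    bin/nutch clean crawl/crawldb
  done

  # (b) bulk style, as in the original report ($SOLR_URL set by the caller)
  bin/nutch solrindex "$SOLR_URL" crawl/crawldb crawl/segments/* \
    -filter -normalize -deleteGone
fi
```

Because (a) applies no filtering or normalization at index time and deletes gone documents in a separate pass, while (b) filters, normalizes, and deletes in one job over all segments, the two can legitimately produce different document counts, which is Sebastian's point.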