Hi Sebastian,

No custom normalization.

Mark

On Thu, Aug 11, 2016 at 12:54 PM, Sebastian Nagel <[email protected]> wrote:
> Hi Mark,
>
> thanks. I have no idea what could be the reason.
> Nutch 1.12 comes with an upgrade from Solr 4 to Solr 5.
>
> But since it is reproducible with indexer-dummy, it may
> not be related to Solr at all.
>
> One question: are there any special URL normalization rules
> for the indexer?
>
> Thanks,
> Sebastian
>
> On 08/10/2016 10:05 PM, mark mark wrote:
>> Hi Sebastian,
>>
>> Thanks for the reply.
>>
>> What I observed is that this issue is with 1.12 only; it works
>> perfectly on 1.11.
>>
>> I crawled on 1.11 and then indexed on Solr, wiped the Solr index and
>> reindexed from the same crawl segments; the indexed doc count was the
>> same every time.
>>
>> When I do the same steps using 1.12, the indexed doc count varies.
>>
>> Crawl:
>> crawl $SEED_PATH $CRAWLDB_DIR $NUM_ROUNDS
>>
>> Index:
>> nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter
>> -normalize -deleteGone
>>
>> Regards
>> Mark
>>
>> On Mon, Aug 8, 2016 at 11:26 PM, Sebastian Nagel <[email protected]> wrote:
>>> Hi Mark, hi Markus,
>>>
>>> bin/crawl does for each cycle: generate, fetch, parse, updatedb,
>>> dedup, and then
>>>
>>>   bin/nutch index crawldb segments/timestamp_this_cycle
>>>   bin/nutch clean crawldb
>>>
>>> That's different from calling once
>>>
>>>   bin/nutch index ... crawldb segments/* -filter -normalize -deleteGone
>>>
>>> I see the following problems (there could be more):
>>>
>>> 1. bin/crawl does not pass
>>>      -filter -normalize
>>>    to the index command. Of course, this could be the trivial reason
>>>    if filter or normalizer rules have been changed since the crawl was
>>>    started. Also note: if filters are changed while a crawl is
>>>    running, URLs/documents are deleted from the CrawlDb but are not
>>>    deleted from the index.
>>>
>>> 2. How long was bin/crawl running? In case it was for a longer time,
>>>    so that some documents are refetched: indexing multiple segments
>>>    is not stable, see
>>>    https://issues.apache.org/jira/browse/NUTCH-1416
>>>
>>> 3. "index + clean" may behave differently than "index -deleteGone".
>>>    But this is rather a bug.
>>>
>>> Sebastian
>>>
>>> On 08/08/2016 10:56 PM, mark mark wrote:
>>>> To add to my previous reply: by "indexing the same crawlDB" I mean
>>>> the same set of segments with the same crawldb.
>>>>
>>>> On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
>>>>> Thanks for the reply, Markus.
>>>>>
>>>>> By doc count I mean the documents indexed in Solr; this is also the
>>>>> count shown in the stats when the indexing job completes (Indexer:
>>>>> 31125 indexed (add/update)).
>>>>>
>>>>> What is index-dummy and how do I use it?
>>>>>
>>>>> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]> wrote:
>>>>>> Hello - are you sure you are observing docCount and not maxDoc? I
>>>>>> don't remember having seen this kind of behaviour in the past
>>>>>> years.
>>>>>> If it is docCount, then I'd recommend using the index-dummy
>>>>>> backend twice and diffing the results so you can see which
>>>>>> documents are emitted, or not, between indexing jobs. That would
>>>>>> allow you to find the records that change between jobs.
>>>>>>
>>>>>> Also, you mention indexing the same CrawlDB, but that is not all
>>>>>> you index; the segments matter. If you can reproduce it with the
>>>>>> same CrawlDB and the same set of segments, unchanged, with
>>>>>> index-dummy, that would be helpful. If the problem is only
>>>>>> reproducible with different sets of segments, then there is no
>>>>>> problem.
>>>>>>
>>>>>> Markus
>>>>>>
>>>>>> -----Original message-----
>>>>>>> From: mark mark <[email protected]>
>>>>>>> Sent: Monday 8th August 2016 19:39
>>>>>>> To: [email protected]
>>>>>>> Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am using Nutch 1.12 and observed that indexing the same crawlDB
>>>>>>> multiple times gives different indexed doc counts.
>>>>>>>
>>>>>>> We indexed from the crawlDB and noted the indexed doc count, then
>>>>>>> wiped the whole index from Solr and indexed again; this time the
>>>>>>> number of documents indexed was less than before.
>>>>>>>
>>>>>>> I removed all our customized plugins, but the indexed doc count
>>>>>>> still varies, and it's reproducible almost every time.
>>>>>>>
>>>>>>> Command I used for the crawl:
>>>>>>> ./crawl seedPath crawlDir -1
>>>>>>>
>>>>>>> Command used for indexing to Solr:
>>>>>>> ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/*
>>>>>>> -filter -normalize -deleteGone
>>>>>>>
>>>>>>> Please suggest.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Mark

--
Thanks & Regards
Manish Verma
669-224-9924
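The index-dummy comparison Markus suggests could be scripted roughly like this. The dump file name `dummy-index.txt`, the `crawl/` paths, and the guard are illustrative assumptions, not verified Nutch defaults; check `plugin.includes` and the indexer-dummy output path in your `nutch-site.xml`.

```shell
#!/bin/sh
# Run the same indexing job twice with the dummy back end, then diff the
# two dumps. dummy-index.txt and the crawl/ paths are assumptions; adjust
# to your setup. Guarded so the sketch is a no-op where Nutch is absent.
if [ -x bin/nutch ]; then
  bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
  mv dummy-index.txt run1.txt
  bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
  mv dummy-index.txt run2.txt
fi

if [ -f run1.txt ] && [ -f run2.txt ]; then
  # Sort both dumps so ordering differences don't hide real ones, then
  # list documents emitted in one run but not the other.
  sort run1.txt > run1.sorted
  sort run2.txt > run2.sorted
  diff run1.sorted run2.sorted || true
fi
```

The documents reported only for one run are exactly the records Markus says to look at: those that change between otherwise identical indexing jobs.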


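For reference, the two workflows Sebastian contrasts (per-cycle indexing as done by bin/crawl versus one bulk index call) can be sketched as follows. The paths are illustrative and the commands mirror his description in this thread, not the exact bin/crawl source.

```shell
#!/bin/sh
# (a) What bin/crawl does, per Sebastian's description: index each
#     segment right after its cycle (without -filter/-normalize), then
#     purge gone documents with a separate clean step.
# (b) The one-shot variant from the original report: all segments at
#     once, with filtering and normalization applied at indexing time.
# Paths are illustrative; guarded as a no-op where Nutch is absent.
if [ -x bin/nutch ]; then
  # (a) per-cycle style
  for seg in crawl/segments/*; do
    bin/nutch index crawl/crawldb "$seg"
    bin/nutch clean crawl/crawldb
  done

  # (b) bulk style, as in the original report ($SOLR_URL set by the caller)
  bin/nutch solrindex "$SOLR_URL" crawl/crawldb crawl/segments/* \
    -filter -normalize -deleteGone
fi
```

Because (a) applies no filtering or normalization at index time and deletes gone documents in a separate pass, while (b) filters, normalizes, and deletes in one job over all segments, the two can legitimately produce different document counts, which is Sebastian's point.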