Hi Sebastian,

Thanks for the reply.

What I observed is that this issue occurs with 1.12 only; it works
perfectly on 1.11.

I crawled on 1.11 and then indexed to Solr, wiped the Solr index, and
reindexed from the same crawl segments; the indexed doc count was the same
every time.

When I do the same steps using 1.12, the indexed doc count varies.

Crawl:
crawl $SEED_PATH $CRAWLDB_DIR $NUM_ROUNDS

Index:
nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
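
One way to make the count comparison between runs repeatable is to query Solr directly after each reindex. This is a hedged sketch: it assumes a default Solr select handler returning JSON (`wt=json`) and that SOLR_URL is set as in the command above; adjust for your schema and Solr version.

```shell
# Hypothetical helper: pull numFound out of a Solr JSON select response so
# the doc count after each reindex can be recorded and compared. The JSON
# shape assumed here is the default wt=json response.
count_docs() {
  sed -n 's/.*"numFound":\([0-9]*\).*/\1/p'
}

# Example usage (assumed URL and parameters):
# curl -s "$SOLR_URL/select?q=*:*&rows=0&wt=json" | count_docs
```

Recording this number after each run, from the same segments, should show whether the discrepancy is in what Nutch emits or in what Solr reports.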


Regards,
Mark

On Mon, Aug 8, 2016 at 11:26 PM, Sebastian Nagel <[email protected]
> wrote:

> Hi Mark, hi Markus,
>
> bin/crawl does for each cycle: generate, fetch, parse, updatedb, dedup,
> and then
>
>     bin/nutch index crawldb segments/timestamp_this_cycle
>     bin/nutch clean crawldb
>
> That's different from calling once
>
>     bin/nutch index ... crawldb segments/* -filter -normalize -deleteGone
>
> I see the following problems (there could be more):
>
> 1. bin/crawl does not pass
>      -filter -normalize
>    to the index command. Of course, this could be the trivial reason
>    if filter or normalizer rules have been changed since the crawl was
>    started. Also note: if filters are changed while a crawl is running,
>    URLs/documents are deleted from the CrawlDb but not from the index.
>
> 2. How long was bin/crawl running? If it ran long enough that some
>    documents were refetched: indexing multiple segments is not stable,
>    see https://issues.apache.org/jira/browse/NUTCH-1416
>
> 3. "index + clean" may behave differently from "index -deleteGone",
>    but that would rather be a bug.
>
> Sebastian
>
>
> On 08/08/2016 10:56 PM, mark mark wrote:
> > To add to my previous reply: by "indexing the same crawlDB" I mean the
> > same set of segments with the same crawldb.
> >
> >
> > On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
> >
> >> Thanks for the reply, Markus.
> >>
> >> By doc count I mean the number of documents indexed in Solr; this is
> >> also the count shown in the stats when the indexing job completes
> >> (Indexer: 31125 indexed (add/update)).
> >>
> >> What is index-dummy and how do I use it?
> >>
> >>
> >> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <
> [email protected]>
> >> wrote:
> >>
> >>> Hello - are you sure you are observing docCount and not maxDoc? I don't
> >>> remember having seen this kind of behaviour in the past years.
> >>> If it is docCount, then I'd recommend running the index-dummy backend
> >>> twice and diffing the results so you can see which documents are
> >>> emitted, or not, between indexing jobs. That would allow you to find
> >>> the records that change between jobs.
> >>>
> >>> Also, you mention indexing the same CrawlDB, but that is not all you
> >>> index; the segments matter. If you can reproduce it with the same
> >>> CrawlDB and the same set of segments, unchanged, with index-dummy, that
> >>> would be helpful. If the problem is only reproducible with different
> >>> sets of segments, then there is no problem.
> >>>
> >>> Markus
> >>>
> >>>
> >>>
> >>> -----Original message-----
> >>>> From:mark mark <[email protected]>
> >>>> Sent: Monday 8th August 2016 19:39
> >>>> To: [email protected]
> >>>> Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I am using Nutch 1.12 and observed that indexing the same crawlDB
> >>>> multiple times gives different indexed doc counts.
> >>>>
> >>>> We indexed from the crawlDB and noted the indexed doc count, then
> >>>> wiped the whole index from Solr and indexed again; this time the
> >>>> number of documents indexed was less than before.
> >>>>
> >>>> I removed all our customized plugins, but the indexed doc count still
> >>>> varies, and it's reproducible almost every time.
> >>>>
> >>>> Command I used for the crawl:
> >>>> ./crawl seedPath crawlDir -1
> >>>>
> >>>> Command used for indexing to Solr:
> >>>> ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/*
> >>>> -filter -normalize -deleteGone
> >>>>
> >>>> Please suggest.
> >>>>
> >>>> Thanks Mark
> >>>>
> >>>
> >>
> >>
> >
>
>
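
Markus's index-dummy suggestion from the thread above could be sketched roughly as follows. This is a hedged sketch: the file names run1.txt and run2.txt are assumptions, and it presumes the indexer-dummy plugin has been enabled and configured so that each indexing run writes its emitted documents to a separate plain-text file.

```shell
# Hypothetical comparison of two indexing runs captured with index-dummy.
# run1.txt and run2.txt are assumed dumps of the documents emitted by two
# identical indexing jobs; lines unique to either file are the documents
# that differ between runs. Requires bash (process substitution).
diff_index_runs() {
  # comm -3 suppresses lines common to both (sorted) inputs, leaving only
  # the records that appear in one run but not the other.
  comm -3 <(sort "$1") <(sort "$2")
}

# Example usage (assumed file names):
# diff_index_runs run1.txt run2.txt
```

Diffing the two dumps this way would pinpoint the exact records that change between jobs, which is the evidence Markus asked for.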
