Hi Mark, hi Markus,
bin/crawl runs for each cycle: generate, fetch, parse, updatedb, dedup, and then
bin/nutch index crawldb segments/timestamp_this_cycle
bin/nutch clean crawldb
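Spelled out with example paths (a rough sketch only; the real script passes
more options than shown here), one cycle amounts to:
bin/nutch generate crawlDir/crawldb crawlDir/segments
bin/nutch fetch crawlDir/segments/20160808123456
bin/nutch parse crawlDir/segments/20160808123456
bin/nutch updatedb crawlDir/crawldb crawlDir/segments/20160808123456
bin/nutch dedup crawlDir/crawldb
bin/nutch index crawlDir/crawldb crawlDir/segments/20160808123456
bin/nutch clean crawlDir/crawldb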
That's different from a single call to
bin/nutch index ... crawldb segments/* -filter -normalize -deleteGone
I see the following problems (there could be more):
1. bin/crawl does not pass
-filter -normalize
to the index command (a sketch follows below this list). Of course, this
could be the trivial explanation if filter or normalizer rules have been
changed since the crawl was started. Also note: if filters are changed while
a crawl is running, URLs/documents are deleted from the CrawlDb but not from
the index.
2. How long was bin/crawl running? If it ran long enough that some documents
were refetched: indexing multiple segments in a single job is not stable, see
https://issues.apache.org/jira/browse/NUTCH-1416
3. "index + clean" may behave different than "index -deleteGone"
But this is rather a bug.
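For 1. and 2., a sketch (untested, with paths and $SOLR_URL as in your
commands): re-index with filters and normalizers applied, one segment per
job, so re-fetched URLs are indexed in a deterministic order:
for seg in crawlDir/segments/*; do
  # segment names are timestamps, so the glob sorts oldest first
  # and the newest fetch of a URL is indexed last
  bin/nutch solrindex $SOLR_URL crawlDir/crawldb "$seg" -filter -normalize -deleteGone
done
Since -deleteGone handles deletions within the index job itself, no separate
clean step should be needed, which also sidesteps the difference in 3.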
Sebastian
On 08/08/2016 10:56 PM, mark mark wrote:
> To add to my previous reply: by "indexing the same crawlDB" I mean the same
> set of segments with the same crawlDB.
>
>
> On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
>
>> Thanks for reply Markus,
>>
>> By doc count I mean the number of documents indexed in Solr; this is also
>> the count shown in the stats when the indexing job completes (Indexer:
>> 31125 indexed (add/update)).
>>
>> What is index-dummy and how do I use it?
>>
>>
>> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
>> wrote:
>>
>>> Hello - are you sure you are observing docCount and not maxDoc? I don't
>>> remember having seen this kind of behaviour in the past years.
>>> If it is docCount, then I'd recommend using the index-dummy backend twice
>>> and diffing the two outputs so you can see which documents are emitted, or
>>> not, between indexing jobs.
>>> That would allow you to find the records that change between jobs.
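>>>
>>> For example, roughly (a sketch, not exact syntax: enable indexer-dummy in
>>> place of the Solr writer in plugin.includes, and substitute whatever
>>> output path you configure for the dummy writer; the property name varies
>>> by version):
>>>
>>> bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
>>> mv /tmp/dummy-index.txt run1.txt
>>> bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
>>> mv /tmp/dummy-index.txt run2.txt
>>> diff run1.txt run2.txt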
>>>
>>> Also, you mention indexing the same CrawlDB, but that is not all you
>>> index; the segments matter. If you can reproduce it with the same CrawlDB
>>> and the same set of segments, unchanged, with index-dummy, that would be
>>> helpful. If the problem is only reproducible with different sets of
>>> segments, then there is no problem.
>>>
>>> Markus
>>>
>>>
>>>
>>> -----Original message-----
>>>> From:mark mark <[email protected]>
>>>> Sent: Monday 8th August 2016 19:39
>>>> To: [email protected]
>>>> Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
>>>>
>>>> Hi All,
>>>>
>>>> I am using Nutch 1.12 and observed that indexing the same crawlDB
>>>> multiple times gives different indexed doc counts.
>>>>
>>>> We indexed from the crawlDB and noted the indexed doc count, then wiped
>>>> the whole index from Solr and indexed again; this time the number of
>>>> documents indexed was less than before.
>>>>
>>>> I removed all our customized plugins, but the indexed doc count still
>>>> varies, and it's reproducible almost every time.
>>>>
>>>> Command I used for the crawl:
>>>> ./crawl seedPath crawlDir -1
>>>>
>>>> Command used for indexing to Solr:
>>>> ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
>>>>
>>>> Please suggest.
>>>>
>>>> Thanks Mark
>>>>
>>>
>>
>>
>