Hi Mark,

thanks. I have no idea what the reason could be. Nutch 1.12 comes with
an upgrade from Solr 4 to Solr 5. But since it is reproducible with
indexer-dummy, it may not be related to Solr at all.

One question: are there any special URL normalization rules for the
indexer?

Thanks,
Sebastian

On 08/10/2016 10:05 PM, mark mark wrote:
> Hi Sebastian,
>
> Thanks for the reply.
>
> The issue I observed is with 1.12 only; it works perfectly on 1.11.
>
> I crawled on 1.11 and then indexed to Solr, wiped the Solr index and
> reindexed from the same crawl segments; the indexed doc count was the
> same every time.
>
> When I do the same steps using 1.12, the indexed doc count varies.
>
> Crawl:
> crawl $SEED_PATH $CRAWLDB_DIR $NUM_ROUNDS
>
> Index:
> nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter
> -normalize -deleteGone
>
> Regards,
> Mark
>
> On Mon, Aug 8, 2016 at 11:26 PM, Sebastian Nagel <[email protected]>
> wrote:
>
>> Hi Mark, hi Markus,
>>
>> bin/crawl does for each cycle: generate, fetch, parse, updatedb,
>> dedup, and then
>>
>>   bin/nutch index crawldb segments/timestamp_this_cycle
>>   bin/nutch clean crawldb
>>
>> That's different from calling once
>>
>>   bin/nutch index ... crawldb segments/* -filter -normalize -deleteGone
>>
>> I see the following problems (there could be more):
>>
>> 1. bin/crawl does not pass
>>      -filter -normalize
>>    to the index command. Of course, this could be the trivial reason
>>    if filter or normalizer rules have been changed since the crawl was
>>    started. Also note: if filters are changed while a crawl is
>>    running, URLs/documents are deleted from the CrawlDb but are not
>>    deleted from the index.
>>
>> 2. How long was bin/crawl running? In case it ran for a longer time,
>>    so that some documents were refetched: indexing multiple segments
>>    is not stable, see https://issues.apache.org/jira/browse/NUTCH-1416
>>
>> 3. "index + clean" may behave differently than "index -deleteGone".
>>    But that would rather be a bug.
>>
>> Sebastian
>>
>>
>> On 08/08/2016 10:56 PM, mark mark wrote:
>>> To add to my previous reply: by "indexing the same crawlDB" I mean
>>> the same set of segments with the same crawldb.
>>>
>>>
>>> On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
>>>
>>>> Thanks for the reply, Markus.
>>>>
>>>> By doc count I mean the documents that got indexed in Solr; this is
>>>> also the count shown in the stats when the indexing job completes
>>>> (Indexer: 31125 indexed (add/update)).
>>>>
>>>> What is index-dummy and how do I use it?
>>>>
>>>>
>>>> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello - are you sure you are observing docCount and not maxDoc? I
>>>>> don't remember having seen this kind of behaviour in the past
>>>>> years.
>>>>> If it is docCount, then I'd recommend using the index-dummy
>>>>> backend twice and diffing the results so you can see which
>>>>> documents are emitted, or not, between indexing jobs.
>>>>> That would allow you to find the records that change between jobs.
>>>>>
>>>>> Also, you mention indexing the same CrawlDB, but that is not all
>>>>> you index; the segments matter. If you can reproduce it with the
>>>>> same CrawlDB and the same set of segments, unchanged, with
>>>>> index-dummy, that would be helpful. If the problem is only
>>>>> reproducible with different sets of segments, then there is no
>>>>> problem.
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>>
>>>>> -----Original message-----
>>>>>> From: mark mark <[email protected]>
>>>>>> Sent: Monday 8th August 2016 19:39
>>>>>> To: [email protected]
>>>>>> Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am using Nutch 1.12 and observed that indexing the same crawlDB
>>>>>> multiple times gives a different indexed doc count.
>>>>>>
>>>>>> We indexed from the crawlDB and noted the indexed doc count, then
>>>>>> wiped the whole index from Solr and indexed again; this time the
>>>>>> number of documents indexed was less than before.
>>>>>>
>>>>>> I removed all our customized plugins, but the indexed doc count
>>>>>> still varies, and it's reproducible almost every time.
>>>>>>
>>>>>> Command I used for the crawl:
>>>>>> ./crawl seedPath crawlDir -1
>>>>>>
>>>>>> Command used for indexing to Solr:
>>>>>> ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/*
>>>>>> -filter -normalize -deleteGone
>>>>>>
>>>>>> Please suggest.
>>>>>>
>>>>>> Thanks,
>>>>>> Mark
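
[Editor's note: the diff workflow Markus recommends above can be sketched
roughly as follows. This is only a sketch, not a tested recipe: it assumes
indexer-dummy has been enabled via plugin.includes in conf/nutch-site.xml
in place of the Solr writer, and "dummy-index.txt", "crawl/crawldb", and
"crawl/segments" are placeholder names; the dummy writer's actual output
location depends on its configuration.]

```shell
# Run the same indexing job twice with the dummy index writer, then diff
# the two runs to find documents that appear in one run but not the other.
# Assumption: indexer-dummy is enabled in plugin.includes and writes its
# output to dummy-index.txt (placeholder path).

bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
mv dummy-index.txt run1.txt

bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
mv dummy-index.txt run2.txt

# Sort before diffing: the MapReduce output order is not deterministic,
# so an unsorted diff would report spurious differences.
sort run1.txt > run1.sorted
sort run2.txt > run2.sorted
diff run1.sorted run2.sorted
```

If the diff is empty for the same CrawlDB and the same unchanged set of
segments, the count variation is likely on the Solr side; if it is not,
the differing records point at the documents that change between jobs.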

