Hi Mark, hi Markus,
bin/crawl runs for each cycle: generate, fetch, parse, updatedb, dedup, and then
bin/nutch index crawldb segments/timestamp_this_cycle
bin/nutch clean crawldb
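Spelled out with example paths (a rough sketch only; the real script passes
more options than shown here), one cycle amounts to:
bin/nutch generate crawlDir/crawldb crawlDir/segments
bin/nutch fetch crawlDir/segments/20160808123456
bin/nutch parse crawlDir/segments/20160808123456
bin/nutch updatedb crawlDir/crawldb crawlDir/segments/20160808123456
bin/nutch dedup crawlDir/crawldb
bin/nutch index crawlDir/crawldb crawlDir/segments/20160808123456
bin/nutch clean crawlDir/crawldb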
That's different from a single call to
bin/nutch index ... crawldb segments/* -filter -normalize -deleteGone
I see the following problems (there could be more):
1. bin/crawl does not pass
-filter -normalize
to the index command (a sketch follows below this list). Of course, this
could be the trivial explanation if filter or normalizer rules have been
changed since the crawl was started. Also note: if filters are changed while
a crawl is running, URLs/documents are deleted from the CrawlDb but not from
the index.
2. How long was bin/crawl running? If it ran long enough that some documents
were refetched: indexing multiple segments in a single job is not stable, see
https://issues.apache.org/jira/browse/NUTCH-1416
3. "index + clean" may behave different than "index -deleteGone"
But this is rather a bug.
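For 1. and 2., a sketch (untested, with paths and $SOLR_URL as in your
commands): re-index with filters and normalizers applied, one segment per
job, so re-fetched URLs are indexed in a deterministic order:
for seg in crawlDir/segments/*; do
  # segment names are timestamps, so the glob sorts oldest first
  # and the newest fetch of a URL is indexed last
  bin/nutch solrindex $SOLR_URL crawlDir/crawldb "$seg" -filter -normalize -deleteGone
done
Since -deleteGone handles deletions within the index job itself, no separate
clean step should be needed, which also sidesteps the difference in 3.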
Sebastian
On 08/08/2016 10:56 PM, mark mark wrote:
> To add to my previous reply: by "indexing the same crawlDB" I mean the same
> set of segments with the same crawlDB.
>
>
> On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:
>
>> Thanks for reply Markus,
>>
>> By doc count I mean the number of documents indexed in Solr; this is also
>> the count shown in the stats when the indexing job completes (Indexer:
>> 31125 indexed (add/update)).
>>
>> What is index-dummy and how do I use it?
>>
>>
>> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
>> wrote:
>>
>>> Hello - are you sure you are observing docCount and not maxDoc? I don't
>>> remember having seen this kind of behaviour in the past years.
>>> If it is docCount, then I'd recommend using the index-dummy backend twice
>>> and diffing the two outputs so you can see which documents are emitted, or
>>> not, between indexing jobs.
>>> That would allow you to find the records that change between jobs.
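>>>
>>> For example, roughly (a sketch, not exact syntax: enable indexer-dummy in
>>> place of the Solr writer in plugin.includes, and substitute whatever
>>> output path you configure for the dummy writer; the property name varies
>>> by version):
>>>
>>> bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
>>> mv /tmp/dummy-index.txt run1.txt
>>> bin/nutch index crawl/crawldb crawl/segments/* -filter -normalize -deleteGone
>>> mv /tmp/dummy-index.txt run2.txt
>>> diff run1.txt run2.txt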
>>>
>>> Also, you mention indexing the same CrawlDB, but that is not all you
>>> index; the segments matter. If you can reproduce it with the same CrawlDB
>>> and the same set of segments, unchanged, with index-dummy, that would be
>>> helpful. If the problem is only reproducible with different sets of
>>> segments, then there is no problem.
>>>
>>> Markus
>>>
>>>
>>>
>>> -----Original message-----
>>>> From:mark mark <[email protected]>
>>>> Sent: Monday 8th August 2016 19:39
>>>> To: [email protected]
>>>> Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
>>>>
>>>> Hi All,
>>>>
>>>> I am using Nutch 1.12 and observed that indexing the same crawlDB
>>>> multiple times gives different indexed doc counts.
>>>>
>>>> We indexed from the crawlDB and noted the indexed doc count, then wiped
>>>> the whole index from Solr and indexed again; this time the number of
>>>> documents indexed was less than before.
>>>>
>>>> I removed all our customized plugins, but the indexed doc count still
>>>> varies, and it's reproducible almost every time.
>>>>
>>>> Command I used for the crawl:
>>>> ./crawl seedPath crawlDir -1
>>>>
>>>> Command used for indexing to Solr:
>>>> ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
>>>>
>>>> Please suggest.
>>>>
>>>> Thanks Mark
>>>>
>>>
>>
>>
>