Re: Indexing Same CrawlDB Result In Different Indexed Doc Count

mark mark Mon, 08 Aug 2016 13:57:18 -0700

To add to my previous reply: by "indexing the same crawlDB" I mean the same
set of segments with the same crawldb.



On Mon, Aug 8, 2016 at 1:51 PM, mark mark <[email protected]> wrote:

> Thanks for the reply, Markus.
>
> By doc count I mean the number of documents indexed in Solr. It is also the
> count shown in the stats when the indexing job completes (Indexer:  31125
>  indexed (add/update)).
>
> What is index-dummy, and how do I use it?
>
>
> On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma <[email protected]>
> wrote:
>
>> Hello - are you sure you are observing docCount and not maxDoc? I don't
>> remember having seen this kind of behaviour in the past years.
>> If it is docCount, then I'd recommend using the index-dummy backend twice
>> and diffing the results so you can see which documents are emitted, or not,
>> between indexing jobs.
>> That would allow you to find the records that change between jobs.
>>
>> Also, you mention indexing the same CrawlDB, but the CrawlDB is not all
>> you index; the segments matter too. If you can reproduce it with the same
>> CrawlDB and the same set of segments, unchanged, with index-dummy, it would
>> be helpful. If the problem is only reproducible with different sets of
>> segments, then there is no problem.
>>
>> Markus
>>
>>
>>
>> -----Original message-----
>> > From:mark mark <[email protected]>
>> > Sent: Monday 8th August 2016 19:39
>> > To: [email protected]
>> > Subject: Indexing Same CrawlDB Result In Different Indexed Doc Count
>> >
>> > Hi All,
>> >
>> > I am using Nutch 1.12 and have observed that indexing the same crawlDB
>> > multiple times gives a different indexed doc count.
>> >
>> > We indexed from the crawlDB and noted the indexed doc count, then wiped the
>> > whole index from Solr and indexed again. This time the number of documents
>> > indexed was less than before.
>> >
>> > I removed all our customized plugins, but the indexed doc count still
>> > varies, and it's reproducible almost every time.
>> >
>> > Command I used for the crawl:
>> > ./crawl seedPath crawlDir -1
>> >
>> > Command used for indexing to Solr:
>> > ./nutch solrindex $SOLR_URL $CRAWLDB_PATH $CRAWLDB_DIR/segments/* -filter -normalize -deleteGone
>> >
>> > Please suggest.
>> >
>> > Thanks Mark
>> >
>>
>
>
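
For the comparison Markus suggests above (run the dummy indexer twice, then diff), here is a minimal sketch of the diffing step. It assumes each run left behind a plain-text dump with one `action<TAB>url` line per document, for example `add<TAB>http://example.com/`. That line format, the function names, and the file names in the usage note are assumptions for illustration, so check what your Nutch version actually writes before relying on it.

```python
from collections import Counter
from pathlib import Path


def load_actions(path):
    """Count (action, url) pairs found in one dump file.

    Assumed line format: "<action>\t<url>", e.g. "add\thttp://example.com/".
    """
    counts = Counter()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action, _, url = line.partition("\t")
        counts[(action, url)] += 1
    return counts


def diff_runs(first, second):
    """Return (only_in_first, only_in_second) as sorted lists of (action, url).

    A pair that occurs more often in one run than in the other is listed
    once per extra occurrence, so duplicate emissions also show up.
    """
    a = load_actions(first)
    b = load_actions(second)
    only_in_first = sorted((a - b).elements())
    only_in_second = sorted((b - a).elements())
    return only_in_first, only_in_second
```

Usage would look like `only_first, only_second = diff_runs("run1.txt", "run2.txt")`, where the two (hypothetical) files are the dumps from two indexing jobs over the same CrawlDB and the same segments. If both lists come back empty, the jobs emitted the same documents and the difference is more likely on the Solr side.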
