There are two schools of thought on distributed tf/idf: the lazy way and
the exact way.

1) The lazy way says that if each shard (index) holds a consistent
number of docs, your tf/idf should come out approximately correct, even
though scoring pulls only from each index individually during calculation.
2) The exact way says we should know the global counts. To do this you
would need to store tf/idf outside of the indexes, create a custom
similarity in Lucene, override Nutch's default Similarity, and pass in
the distributed calculation when doing searches. This isn't impossible;
it just hasn't been implemented yet in Nutch.

To answer your question: the idf is relative to a single index,
according to Lucene's DefaultSimilarity, which Nutch overrides.
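To make the difference concrete, here is a minimal sketch using the
classic DefaultSimilarity idf formula, idf = 1 + ln(numDocs / (docFreq + 1)).
The shard sizes and document frequencies below are made-up numbers, just
to show how the same term scores differently per shard when the counts
aren't pooled.

```java
// Sketch only: shard-local vs. pooled ("global") idf, computed with the
// classic Lucene DefaultSimilarity formula idf = 1 + ln(numDocs / (docFreq + 1)).
// All counts are hypothetical illustration values, not real crawl data.
public class IdfSketch {
    public static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Two shards of 10M docs each; the term happens to be rarer on shard B.
        long shardDocs = 10_000_000L;
        double idfA = idf(50_000L, shardDocs); // term appears in 50k docs on shard A
        double idfB = idf(5_000L, shardDocs);  // term appears in 5k docs on shard B

        // Exact approach: pool docFreq and numDocs across both shards.
        double idfGlobal = idf(55_000L, 2 * shardDocs);

        System.out.printf("shard A idf = %.3f%n", idfA);
        System.out.printf("shard B idf = %.3f%n", idfB);
        System.out.printf("global  idf = %.3f%n", idfGlobal);
        // idfA and idfB differ, so identical documents on different shards
        // get different scores -- the gap the "exact" approach closes.
    }
}
```

With evenly sized shards and a roughly uniform document mix, the
shard-local values cluster around the global one, which is why the lazy
way is usually good enough in practice.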

Dennis

On 11/15/2010 07:25 PM, 朱诗雄 wrote:
> So the indexes are combined from batches of documents indexed on
> different servers. Is there any mechanism to make the df accurate?
> Or is the df just relative to the indexes from one batch?
>
> 2010/11/15 Dennis Kubes <[email protected]>:
>> Usually you wouldn't cut indexes. When doing distributed searching usually
>> you are crawling, processing, and indexing a batch of documents (say 10
>> million) at a time and pushing them out to a distributed search server on a
>> local file system along with their segments. Then you would move on to the
>> next batch and the next until you run out of available hardware resources.
>> Then you reset the crawldb so every document is crawlable again and you
>> start the process all over.
>>
>> There isn't an index cutter per se. You can use the segment merger to put
>> multiple segments together and then index that segment. I have found that
>> the shard approach above is a better option in most cases.
>>
>> Dennis
>>
>> On 11/14/2010 11:07 PM, 朱诗雄 wrote:
>>> hi, all
>>>
>>> I want to use Nutch for distributed searching, but I don't know how
>>> to cut indexes for it.
>>> Is there a guide for that?
>>>
>
>
