1) Diego's observation about IDF is absolutely correct here, but I don't
think he was pointing it out as a negative aspect of your new approach.
I think he just wanted to warn you about it.

BM25 uses the IDF of a term to estimate how important the term is in
context (given its document frequency in the corpus).
I don't think you should remove IDF from your similarity function. Actually,
the IDF value coming from the bigger index is closer to reality (since your
domain is the web, an ideal IDF would be the one calculated over the entire
web).

Of course, this is valid only if you like BM25 as a similarity function (and
if it is fit for purpose).
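
To make the IDF point concrete, here is a minimal sketch. It assumes the IDF
form that, to my knowledge, Lucene's BM25Similarity uses; the document counts
are hypothetical numbers chosen only for illustration, not taken from your
indexes.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF of a term, assuming the form log(1 + (N - n + 0.5) / (n + 0.5)).

    doc_freq  (n): number of documents containing the term
    doc_count (N): total number of documents in the index
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the same term in two indexes.
small = bm25_idf(doc_freq=50, doc_count=1_000)        # small experiment index
large = bm25_idf(doc_freq=500, doc_count=1_000_000)   # bigger shared index

print(f"small index idf={small:.3f}, large index idf={large:.3f}")
```

The same term gets a different weight depending on which index supplies the
statistics, which is why the bigger index gives the more realistic estimate.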

2) Regarding how to evaluate the experiments by experiment and
crawling cycle, the quickest way may be to make the crawlingCycle
field a dynamic field.
The name of the field will depend on the experimentId,
following a pattern such as: *_crawling_cycle
For experimentId = exp01, you will have the field: exp01_crawling_cycle.
For experimentId = exp02, you will have the field: exp02_crawling_cycle.
If I understood your evaluation-time queries, you will be able to check each
field depending on the experiment you are interested in.
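
As a rough sketch of the naming scheme, assuming the *_crawling_cycle pattern
above is declared as a dynamic field in your schema, the field name and a
filter query could be built like this. The helper names cycle_field and
cycle_filter are hypothetical, introduced only for this example.

```python
def cycle_field(experiment_id: str) -> str:
    """Build the concrete field name matched by the *_crawling_cycle pattern."""
    return f"{experiment_id}_crawling_cycle"


def cycle_filter(experiment_id: str, cycle: int) -> str:
    """Build a filter query (fq) string selecting pages tagged with a cycle."""
    return f"{cycle_field(experiment_id)}:{cycle}"


print(cycle_field("exp02"))       # exp02_crawling_cycle
print(cycle_filter("exp01", 1))   # exp01_crawling_cycle:1
```
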
My doubt about the incremental approach is that running a query such as:
"I want to know, for experiment 1, which pages were crawled in the
first cycle"
will not work, as you store only the last cycle that involved each page. So
the exact cycle ids assigned to the pages will not be known.
But I am not sure I fully understood your use case, so ignore my observation
if it is not useful.
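
A tiny simulation of that concern, using a made-up crawl log (the page names
and cycles are hypothetical) and assuming each re-crawl overwrites the single
stored cycle value:

```python
# Hypothetical crawl log for one experiment: (cycle, page) pairs in crawl order.
crawl_log = [(1, "a.html"), (1, "b.html"), (2, "a.html"), (3, "a.html")]

# Keep only the last cycle per page, as the incremental approach would.
last_cycle = {}
for cycle, page in crawl_log:
    last_cycle[page] = cycle  # each re-crawl overwrites the previous value

# "Which pages were crawled in the first cycle?"
first_cycle_pages = sorted(p for p, c in last_cycle.items() if c == 1)
print(first_cycle_pages)  # ['b.html']
```

Here a.html was crawled in cycle 1 too, but it no longer shows up because its
stored value has been overwritten by later cycles.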


Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
