Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-12 Thread Dawid Weiss


Hi Andrzej,

This was a very interesting experiment -- thanks for sharing the results 
with us.


The last range was the maximum in this case - Google wouldn't display 
any hit above 652 (which I find curious, too - because the total number 
of hits is, well, significantly higher - and Google claims to return up 
to the first 1000 results).


I believe this may have something to do with the way Google compacts 
URLs. My guess is that initially 1000 results are found and ranked. 
Pruning is then performed on that set, leaving just a subset of results 
for the user to select from.


If you try this (self-indulgent) query, with filtering enabled:

http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en&as_qdr=all&as_occt=any&as_dt=i&safe=active&start=900

You get: Results 781 - 782 of about 61,700

Now try disabling filtering:

http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en&as_qdr=all&as_occt=any&as_dt=i&safe=images&start=900

Then you get: Results 781 - 782 of about 65,500

Hmmm... still the same number of available results, but the total 
estimate is higher.


So far I had used URL parameters found on the advanced search page. Then 
I tried to display the omitted search results, as Google suggested. 
Interestingly, this led to:


http://www.google.com/search?q=dawid+weiss&hl=en&shb=t&filter=0&start=900

Results 541 - 549 of about 65,400 

And that's the maximum you can get.

Sorry, my initial intuition proved wrong -- there is no clear logic 
behind the maximum limit of results you can see (unless you can find 
some logic in the fact that I can see _more_ results when I _exclude_ 
repeated ones from the total).


Dawid








Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki

Dawid Weiss wrote:



Hi Andrzej,

This was a very interesting experiment -- thanks for sharing the 
results with us.


The last range was the maximum in this case - Google wouldn't display 
any hit above 652 (which I find curious, too - because the total 
number of hits is, well, significantly higher - and Google claims to 
return up to the first 1000 results).



I believe this may have something to do with the way Google compacts 
URLs. My guess is that initially 1000 results are found and ranked. 
Pruning is then performed on that set, leaving just a subset of results 
for the user to select from.




That was my guess, too ...

Sorry, my initial intuition proved wrong -- there is no clear logic 
behind the maximum limit of results you can see (unless you can find 
some logic in the fact that I can see _more_ results when I _exclude_ 
repeated ones from the total).



Well, trying not to sound too much like Spock... Fascinating :-), but 
the only logical conclusion is that at the user end we never deal with 
any hard results calculated directly from the hypothetical main index; 
we deal only with rough estimates from the estimated indexes. These 
change over time, and perhaps even with the group of servers that answered 
this particular query... My guess is that there could be different 
estimated indexes prepared for different values of the main boolean 
parameters, like filter=0...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Andrzej Bialecki

Hi,

I made an experiment with Google, to see if they use a similar approach.

I find the results to be most interesting. I selected a query which is 
guaranteed to give large result sets, but is more complicated than a 
single-term query: http com.


The total number of hits (approx) is 2,780,000,000. BTW, I find it 
curious that the last 3 or 6 digits always seem to be zeros ... there's 
some clever guesstimation involved here. The fact that Google Suggest is 
able to return results so quickly would support this suspicion.


When I ran the query for the first time, the response time was 0.29 sec. 
All subsequent queries retrieving the first 10 results are on the order 
of 0.07 sec.


This is for retrieving just the first page (the first 10 results). 
Retrieving results 10-20 also takes 0.08 sec, which suggests that this 
result was cached somewhere. Starting from results 20+ the response time 
increases (linearly?), although it varies wildly between requests, 
sometimes returning more quickly, sometimes taking the maximum time - which 
suggests that I'm hitting different servers each time. Also, if I wait 
~30 sec to 1 minute, the response times return to the values for the 
first-time run.
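
For reference, the timings below could be reproduced with a small program
along these lines (an illustrative sketch, not the tool actually used; a
real run would likely also need a browser-like User-Agent header to avoid
being rejected):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class GoogleTiming {
    public static void main(String[] args) throws Exception {
        int[] starts = {30, 50, 100, 200, 300, 500, 650};
        for (int i = 0; i < starts.length; i++) {
            URL url = new URL(
                "http://www.google.com/search?q=http+com&start=" + starts[i]);
            long t0 = System.nanoTime();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            InputStream in = conn.getInputStream();
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // drain the whole response so we time the complete page
            }
            in.close();
            long ms = (System.nanoTime() - t0) / 1000000;
            System.out.println("start=" + starts[i] + "  " + ms + " ms");
        }
    }
}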


start   first (sec)   repeated (sec)
30      0.14          0.08-0.21
50      0.29          0.11-0.22
100     0.36          0.22-0.45
200     0.73          0.49-0.65
300     0.96          0.64-0.98
500     1.36          1.43-1.87
650     2.24          1.49-1.85

The last range was the maximum in this case - Google wouldn't display 
any hit above 652 (which I find curious, too - because the total number 
of hits is, well, significantly higher - and Google claims to return up 
to the first 1000 results).


My impressions from this exercise are perhaps not so surprising: Google 
is highly optimized for retrieving the first couple of results, and the 
more results you want to retrieve, the worse the performance gets. Finally, 
you won't be able to retrieve any results above a couple of hundred, quite 
often fewer than the claimed 1000-result threshold.


As for the exact techniques of this optimization, we'll never know for 
sure, but it seems like something similar is going on to what you 
outlined in your email. I think it would be great to try it out.


Andrzej


Doug Cutting wrote:


Doug Cutting wrote:

Implementing something like this for Lucene would not be too 
difficult. The index would need to be re-sorted by document boost: 
documents would be re-numbered so that highly-boosted documents had 
low document numbers.



In particular, one could:

1. Create an array of int[maxDoc], with a[i] = i.
2. Sort the array with order(i,j) = boost(j) - boost(i), so that highly 
boosted documents sort first.
3. Implement a FilterIndexReader that re-numbers using the sorted 
array.  So, for example, the document numbers in the TermPositions 
will be a[old.doc()].  Each term's positions will need to be read 
entirely into memory and sorted to perform this renumbering (see the 
sketch after this list).
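
To make the renumbering concrete, here is a rough, self-contained sketch
of steps 1-3 (an editorial sketch, not code from this thread; boosts[] is
a hypothetical per-document boost array). One subtlety: sorting the
identity array yields a new-to-old ordering, so it must be inverted before
it can be applied as a[old.doc()]:

import java.util.Arrays;
import java.util.Comparator;

public class BoostRenumber {
    // Returns oldToNew[oldDoc] = newDoc, mapping high boosts to low numbers.
    static int[] buildOldToNew(final float[] boosts) {
        int maxDoc = boosts.length;
        // Step 1: identity permutation (boxed so a Comparator can be used).
        Integer[] byBoost = new Integer[maxDoc];
        for (int i = 0; i < maxDoc; i++) byBoost[i] = Integer.valueOf(i);
        // Step 2: sort document numbers by descending boost.
        Arrays.sort(byBoost, new Comparator<Integer>() {
            public int compare(Integer i, Integer j) {
                return Float.compare(boosts[j.intValue()], boosts[i.intValue()]);
            }
        });
        // Step 3 (preparation): byBoost[n] is the old doc that becomes new
        // doc n; invert it to get the old-to-new mapping.
        int[] oldToNew = new int[maxDoc];
        for (int n = 0; n < maxDoc; n++) {
            oldToNew[byBoost[n].intValue()] = n;
        }
        return oldToNew;
    }
}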


The IndexOptimizer.java class in the searcher package was an old 
attempt to create something like what Suel calls "fancy postings".  It 
creates an index with the top 10% scoring postings.  Since documents 
are not renumbered one can intermix postings from this with the full 
index.  So for example, one can first try searching using this index 
for terms that occur more than, e.g., 10k times, and use the full 
index for rarer words.  If that does not find 1000 hits then the full 
index must be searched.  Such an approach can be combined with using a 
pre-sorted index.
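
The routing could look roughly like the following sketch (my reading of
the idea above; the Index interface and its methods are illustrative
stand-ins, not actual Lucene API):

interface Index {
    int docFreq(String term);                 // how many docs contain the term
    int[] search(String[] terms, int limit);  // up to 'limit' scored hits
}

class TieredSearch {
    private static final int FREQ_CUTOFF = 10000; // "occur more than 10k times"
    private static final int WANTED = 1000;

    private final Index top;   // index holding only the top ~10% scoring postings
    private final Index full;  // the complete index

    TieredSearch(Index top, Index full) {
        this.top = top;
        this.full = full;
    }

    int[] search(String[] terms) {
        // The pruned index only pays off when every term is frequent;
        // a rare term has to consult the full postings anyway.
        for (int i = 0; i < terms.length; i++) {
            if (full.docFreq(terms[i]) <= FREQ_CUTOFF) {
                return full.search(terms, WANTED);
            }
        }
        int[] hits = top.search(terms, WANTED);
        // Too few hits in the pruned index: the full index must be searched.
        return (hits.length >= WANTED) ? hits : full.search(terms, WANTED);
    }
}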


I think the first thing would be to implement something like what Suel 
calls "first-1000".  Then we need to evaluate this and determine, for a 
query log, how different the results are.


Then a HitCollector can simply stop searching once a given number of 
hits has been found.
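
Since HitCollector.collect() offers no way to signal completion, one
workable idiom is to throw an unchecked exception from collect() and catch
it around the search call. A minimal sketch, assuming a boost-sorted index
so that the first hits collected are also the best ones:

import org.apache.lucene.search.HitCollector;

// Collects at most 'max' hits, then aborts the search by throwing;
// the caller catches StopCollecting and keeps the hits gathered so far.
class FirstNCollector extends HitCollector {
    static class StopCollecting extends RuntimeException {}

    private final int max;
    private int count = 0;

    FirstNCollector(int max) { this.max = max; }

    public void collect(int doc, float score) {
        // ... record (doc, score) in whatever structure the caller needs ...
        if (++count >= max) {
            throw new StopCollecting();
        }
    }
}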


Doug








--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Jérôme Charron
 The total number of hits (approx) is 2,780,000,000. BTW, I find it
 curious that the last 3 or 6 digits always seem to be zeros ... there's
 some clever guesstimation involved here. The fact that Google Suggest is
 able to return results so quickly would support this suspicion.

For more information about fake Google counts, I suggest you take a look 
at some tests performed by Jean Véronis, a French academic:
http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/