On 23/02/2017 01:07, kasilak wrote:
(1) You have mentioned above in your response that Lexicon/Postings are
stored as a sorted on-disk array with binary search. Is the sorted array
represented as a binary tree (underlying data structure)? Is the binary
search a binary tree search?

No, it's the classic binary search algorithm that works on a sorted array:

    https://en.wikipedia.org/wiki/Binary_search_algorithm
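
For illustration only, here is a minimal sketch of that algorithm in plain
C over a hypothetical sorted array of term strings (this is not Lucy's
code; the real implementation is linked below):

    #include <string.h>

    /* Classic binary search over a sorted array of terms.  Returns the
     * index of the first term >= target, or count if every term sorts
     * before the target.  Illustrative sketch only. */
    static size_t
    find_term(const char **terms, size_t count, const char *target) {
        size_t lo = 0, hi = count;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (strcmp(terms[mid], target) < 0) {
                lo = mid + 1;
            }
            else {
                hi = mid;
            }
        }
        return lo;
    }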

Can you point me to "C" source code where this
is done?

It's implemented in LexIndex_Seek:

    https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/LexIndex.c;h=0d7a0ea4f4d4f2d720df13be2919691d0977c8ab;hb=HEAD#l141

(2) While performing searches on an indexed cf.dat, is there a fixed cost
to decompress the "delta encoded numbers and strings"?

What do you mean by "fixed cost"?
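
For background, "delta encoding" here generally means storing each value
(for example a document number) as the gap from the previous one, so
decoding is a running sum over however many values are actually read. A
generic sketch of the idea, not Lucy's actual cf.dat layout:

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative delta decoding: each stored value is the gap from the
     * previous document number, so decoding is a single pass that adds
     * up the gaps.  Generic sketch, not Lucy's on-disk format. */
    static void
    decode_deltas(const uint32_t *gaps, size_t count, uint32_t *doc_ids) {
        uint32_t current = 0;
        for (size_t i = 0; i < count; i++) {
            current += gaps[i];      /* add the stored gap */
            doc_ids[i] = current;    /* recover the absolute doc number */
        }
    }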

(3) I am taking measurements using my toy benchmark. I have indexed 10,000
documents. In IxSearcher_Hits, I am changing num_wanted for a given query
string. My expectation was that changing num_wanted shouldn't affect
performance, measured in queries per second (QPS), i.e. the time taken to
search and report the query.
On the contrary, what I am observing is that when num_wanted is set to 10
hits, the QPS is higher than when num_wanted is set to 4000 hits.
If the TF/IDF (term frequency/inverse document frequency) ranking is always
constructed and reports the top 10 hits or the top 4000 hits, then my
expectation is that the QPS shouldn't change.
The only explanation I have is that the difference in QPS is because, in
the case of 10 hits, the search stops as soon as IxSearcher_Hits finds 10
hits (disregarding the ranking algorithm), whereas in the case of 4000 hits
it continues until it has found 4000 hits, or as many as the application
can find. This would mean IxSearcher_Hits doesn't use any inherent ranking
to return the relevant documents. Is my theory right?

No, IxSearcher_Hits always returns sorted documents. The search process can be divided into three steps:

1. Find all N documents matching a query.
2. Find and sort the top M documents according to the SortSpec. Lucy
   uses a priority queue implemented as a heap (HitQueue), so time
   complexity is O(N * log(M)).
3. Retrieve the stored fields for each of the top M documents.

Steps 2 and 3 are obviously faster for smaller values of M.
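
For step 2, here is a minimal sketch of the bounded min-heap idea behind a
priority queue like HitQueue, assuming hits are reduced to plain float
scores (illustrative only, not Lucy's actual HitQueue code):

    #include <stddef.h>

    /* A min-heap of capacity M keeps the current top M scores: the
     * smallest retained score sits at the root, so a new hit either
     * replaces it (if larger) or is discarded.  Feeding N hits through
     * this costs O(N * log(M)).  Generic sketch only. */
    typedef struct {
        float  *scores;   /* heap storage with room for max_size entries */
        size_t  size;     /* number of scores currently stored */
        size_t  max_size; /* M, the number of hits wanted */
    } TopMHeap;

    static void
    sift_down(TopMHeap *h, size_t i) {
        for (;;) {
            size_t left = 2 * i + 1, right = 2 * i + 2, smallest = i;
            if (left < h->size && h->scores[left] < h->scores[smallest]) {
                smallest = left;
            }
            if (right < h->size && h->scores[right] < h->scores[smallest]) {
                smallest = right;
            }
            if (smallest == i) { break; }
            float tmp = h->scores[i];
            h->scores[i] = h->scores[smallest];
            h->scores[smallest] = tmp;
            i = smallest;
        }
    }

    static void
    TopMHeap_insert(TopMHeap *h, float score) {
        if (h->size < h->max_size) {
            /* Heap not yet full: append the score and sift it up. */
            size_t i = h->size++;
            h->scores[i] = score;
            while (i > 0) {
                size_t parent = (i - 1) / 2;
                if (h->scores[parent] <= h->scores[i]) { break; }
                float tmp = h->scores[parent];
                h->scores[parent] = h->scores[i];
                h->scores[i] = tmp;
                i = parent;
            }
        }
        else if (score > h->scores[0]) {
            /* Heap full: keep the hit only if it beats the current
             * minimum, then restore the heap property. */
            h->scores[0] = score;
            sift_down(h, 0);
        }
    }

This is also why smaller num_wanted values are cheaper: with M = 10 the
queue stays tiny and most candidate hits are rejected with a single
comparison against the root, while with M = 4000 more hits are inserted
and each insertion walks a deeper heap.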

Nick
