On 09/02/2017 01:46, Kasi Lakshman Karthi Anbumony wrote:
> (1) Plan is to report the below metrics:
>   - Index creation: tokens/second
>     - Can I know how to obtain the tokens in the lucy_index created? Do
>       you think a better metric will be (Number of terms in the posting
>       list/second)? If so, how to obtain the number of terms in the
>       posting list?
AFAIK, the total number of terms in all input documents isn't available
because the term frequencies aren't stored separately. I'd simply use the
total size of the input documents in bytes and report bytes/second.
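A bytes/second figure needs only the input file sizes and a wall-clock timer around the indexing run. Here is a minimal sketch in Python, assuming the documents are plain files on disk. `run_indexer` is a hypothetical callable standing in for whatever launches your indexer; it is not part of Lucy.

```python
import os
import time


def total_input_bytes(paths):
    """Sum the on-disk size of all input documents, in bytes."""
    return sum(os.path.getsize(p) for p in paths)


def indexing_throughput(paths, run_indexer):
    """Return (total_bytes, seconds, bytes_per_second) for one indexing run.

    run_indexer is a hypothetical callable that indexes the given paths,
    e.g. by invoking your own indexer binary; it is not a Lucy API.
    """
    total = total_input_bytes(paths)
    start = time.perf_counter()
    run_indexer(paths)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate
```

Running this several times and reporting the median should smooth out disk cache effects.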
> (2) What are the different query types possible?
>   - vary document weighting
>     - Is it possible or is it fixed for a given lucy_index generated?
You can apply a boost to queries at query time:
http://lucy.apache.org/docs/c/Lucy/Search/Query.html#func_Set_Boost
And to fields and documents at indexing time:
http://lucy.apache.org/docs/c/Lucy/Plan/FieldType.html#func_Set_Boost
http://lucy.apache.org/docs/c/Lucy/Index/Indexer.html#func_Add_Doc
But for benchmarking purposes, it mostly matters whether you sort by score,
document id, or a field value. See
http://lucy.apache.org/docs/c/Lucy/Search/SortSpec.html
>   - vary relationship of terms (e.g., proximity)
>     - How to do it? Is there an operator like NEAR?
There's ProximityQuery but I'm not sure how it works:
http://lucy.apache.org/docs/c/LucyX/Search/ProximityQuery.html
>   - vary operations (e.g., AND, OR)
>     - I see that the support is available for boolean query parser. Can I
>       know whether for a given search instance I can have multiple boolean
>       queries like below?
Yes, that's possible.
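For the benchmark you could generate the boolean workload up front and feed every string to the same searcher. Below is a rough sketch in Python, assuming the query parser accepts uppercase AND/OR between terms. The sample terms and the `boolean_queries` helper are made up for illustration.

```python
from itertools import combinations


def boolean_queries(terms, operators=("AND", "OR")):
    """Build one query string per term pair and operator."""
    queries = []
    for left, right in combinations(terms, 2):
        for op in operators:
            queries.append(f"{left} {op} {right}")
    return queries


# Hypothetical sample terms; pick ones that actually occur in your corpus.
for q in boolean_queries(["apache", "lucy", "index"]):
    print(q)
```

Keeping the generated list fixed across runs makes timings for the AND and OR variants comparable.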
Nick