Re: [lucy-user] Lucy Benchmarking

Nick Wellnhofer Tue, 14 Feb 2017 05:47:58 -0800

On 14/02/2017 00:57, Kasi Lakshman Karthi Anbumony wrote:

(1) What is the data structure used to represent Lexicon? (Clownfish
supports hashtable. Does it mean Lucy uses hashtable?)

Lexicon is essentially a sorted on-disk array that is searched with binary
search. Clownfish::Hash, on the other hand, is an in-memory data structure.
Lucy doesn't build in-memory structures for most index data because this would
incur a huge startup penalty. This also makes it possible to work with indices
that don't fit in RAM, although performance deteriorates quickly in this case.
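To illustrate the idea of binary-searching a sorted on-disk array without loading it into memory, here is a minimal Python sketch. It is not Lucy's code or file format: the fixed-width record layout (`RECORD`) and the helpers `write_lexicon` and `lookup` are hypothetical names made up for this example.

```python
import os
import struct
import tempfile

# Hypothetical layout (not Lucy's): 16-byte NUL-padded term + 8-byte offset.
RECORD = struct.Struct(">16sQ")

def write_lexicon(path, entries):
    """Write (term, offset) pairs as fixed-width records sorted by term."""
    with open(path, "wb") as f:
        for term, offset in sorted(entries):
            f.write(RECORD.pack(term.encode("utf-8"), offset))

def lookup(path, term):
    """Binary-search the file for term; return its offset, or None."""
    key = term.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)          # read only the probed record
            raw, offset = RECORD.unpack(f.read(RECORD.size))
            found = raw.rstrip(b"\0")
            if found == key:
                return offset
            if found < key:
                lo = mid + 1
            else:
                hi = mid
    return None

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lexicon.bin")
    write_lexicon(path, [("banana", 20), ("apple", 10), ("cherry", 30)])
    print(lookup(path, "banana"))  # offset stored for "banana"
    print(lookup(path, "durian"))  # None, the term is absent
```

Each lookup touches only about log2(N) records, so nothing has to be built in memory at startup. That is the trade-off described above, at the cost of a disk read whenever a probed record is not already cached.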

(2) What is the data structure used to represent postings? (Clownfish
supports hashtable. Does it mean Lucy uses hashtable?)


Posting lists are stored in an on-disk array. The indices are found in Lexicon.

(3) Which compression method is used? Is it enabled by default?

Lexicon and posting list data is always compressed with delta encoding for
numbers and incremental encoding for strings.
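As a rough illustration of the two techniques named above, the following Python sketch shows delta encoding of a sorted number list and incremental (shared-prefix) encoding of a sorted term list. The function names are invented for this example, and the sketch says nothing about the byte-level layout Lucy actually writes.

```python
def delta_encode(numbers):
    """Store each number as the gap from its predecessor (input sorted)."""
    prev, out = 0, []
    for n in numbers:
        out.append(n - prev)
        prev = n
    return out

def delta_decode(gaps):
    """Rebuild the original numbers by summing the gaps."""
    total, out = 0, []
    for g in gaps:
        total += g
        out.append(total)
    return out

def incremental_encode(terms):
    """Store each term as (length of prefix shared with previous, suffix)."""
    prev, out = "", []
    for t in terms:
        n = 0
        while n < min(len(prev), len(t)) and prev[n] == t[n]:
            n += 1
        out.append((n, t[n:]))
        prev = t
    return out

def incremental_decode(pairs):
    """Rebuild the terms from (shared prefix length, suffix) pairs."""
    prev, out = "", []
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        out.append(prev)
    return out

doc_ids = [3, 7, 8, 15]
terms = ["search", "searched", "searching", "seat"]
print(delta_encode(doc_ids))      # [3, 4, 1, 7]
print(incremental_encode(terms))  # [(0, 'search'), (6, 'ed'), (6, 'ing'), (3, 't')]
```

Both schemes work well on sorted data: gaps between neighbouring numbers are small, and neighbouring terms share long prefixes, so the stored values are shorter than the originals.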

(4) Why there is no API (function call) to know the number of terms in
lexicon and posting list for a given cf.dat?

It's generally hard to tell why a certain feature wasn't implemented. The only
answer I can give is that no one deemed it important enough so far. But Lucy
is open-source software. So, basically, anyone can implement any features they
want.

(3) Can I know whether searching through lexicon/posting list is in-memory
process or IO process?

Lucy uses memory-mapped files to access most index data, so the distinction
between in-memory and IO-based operation blurs quite a bit.
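A small Python sketch of why that distinction blurs, using the standard `mmap` module. The file contents and the helper `read_u32_at` are hypothetical and only stand in for index data; this is not how Lucy itself is implemented.

```python
import mmap
import os
import struct
import tempfile

def read_u32_at(path, index):
    """Read the index-th big-endian 32-bit integer through a memory mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The slice looks like an ordinary in-memory read. Whether it
            # causes disk IO depends on whether the OS already has that
            # page cached.
            start = index * 4
            (value,) = struct.unpack(">I", mm[start:start + 4])
            return value

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(struct.pack(">4I", 10, 20, 30, 40))
    print(read_u32_at(path, 2))  # 30
```

The code never issues an explicit read for the mapped data. The operating system decides when pages are fetched from disk, which is why the same lookup can behave like a memory access on a warm cache and like an IO operation on a cold one.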


Nick
