On 14/02/2017 00:57, Kasi Lakshman Karthi Anbumony wrote:
(1) What is the data structure used to represent Lexicon? (Clownfish supports hashtable. Does it mean Lucy uses hashtable?)
Lexicon is essentially a sorted on-disk array that is searched with binary search. Clownfish::Hash, on the other hand, is an in-memory data structure. Lucy doesn't build in-memory structures for most index data because this would incur a huge startup penalty. This also makes it possible to work with indices that don't fit in RAM, although performance deteriorates quickly in this case.
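To illustrate the idea of binary search over a sorted on-disk array, here is a minimal Python sketch. It is not Lucy's actual file format or code: the fixed-width record layout (`RECORD`), the file name, and the helpers `write_sorted_terms` and `lookup` are all hypothetical, chosen only to keep the example short. The point is that each probe reads a single record from disk, so nothing has to be loaded into memory up front.

```python
import os
import struct
import tempfile

# Hypothetical record layout (NOT Lucy's format): a 16-byte, NUL-padded term
# followed by an unsigned 64-bit value. Terms longer than 16 bytes would be
# truncated by struct, so this sketch assumes short terms.
RECORD = struct.Struct("<16sQ")


def write_sorted_terms(path, entries):
    """Write (term, value) pairs as fixed-width records, sorted by term bytes."""
    encoded = sorted((term.encode("utf-8"), value) for term, value in entries)
    with open(path, "wb") as f:
        for key, value in encoded:
            f.write(RECORD.pack(key, value))


def lookup(path, term):
    """Binary-search the file for term, reading one record per probe."""
    key = term.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)
            raw_term, value = RECORD.unpack(f.read(RECORD.size))
            candidate = raw_term.rstrip(b"\0")
            if candidate == key:
                return value
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
    return None  # term is not in the file


lexicon_path = os.path.join(tempfile.mkdtemp(), "lexicon.bin")
write_sorted_terms(lexicon_path, [("lucy", 30), ("apache", 10), ("clownfish", 20)])
print(lookup(lexicon_path, "clownfish"))
print(lookup(lexicon_path, "missing"))
```

A lookup touches only about log2(N) records, which is why no in-memory structure has to be built when the index is opened.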
(2) What is the data structure used to represent postings? (Clownfish supports hashtable. Does it mean Lucy uses hashtable?)
Posting lists are stored in an on-disk array. The index of each term's posting list within that array is found in the Lexicon.
(3) Which compression method is used? Is it enabled by default?
Lexicon and posting list data is always compressed with delta encoding for numbers and incremental encoding for strings.
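As a generic illustration of the two techniques, here is a small Python sketch. It shows the general idea only and does not reproduce Lucy's on-disk encoding; all function names are made up for the example. Delta encoding stores each number as the difference from its predecessor, and incremental encoding stores each string as the length of the prefix it shares with the previous string plus the remaining suffix. Both work well on sorted data, where neighbours tend to be close together.

```python
import os


def delta_encode(numbers):
    """Store each number as the difference from the previous one."""
    prev = 0
    out = []
    for n in numbers:
        out.append(n - prev)
        prev = n
    return out


def delta_decode(deltas):
    """Rebuild the original numbers with a running sum."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out


def incremental_encode(terms):
    """Store each term as (shared prefix length, remaining suffix)."""
    prev = ""
    out = []
    for term in terms:
        shared = len(os.path.commonprefix([prev, term]))
        out.append((shared, term[shared:]))
        prev = term
    return out


def incremental_decode(pairs):
    """Rebuild each term from the previous term's prefix plus the suffix."""
    prev = ""
    out = []
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        out.append(prev)
    return out


doc_ids = [3, 7, 8, 15]
terms = ["search", "searcher", "searching", "seat"]
print(delta_encode(doc_ids))
print(incremental_encode(terms))
```

Small deltas and short suffixes take fewer bytes to store than the full values, which is where the space saving comes from.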
(4) Why is there no API (function call) to know the number of terms in lexicon and posting list for a given cf.dat?
It's generally hard to tell why a certain feature wasn't implemented. The only answer I can give is that no one has deemed it important enough so far. But Lucy is open-source software, so anyone is free to implement the features they want.
(5) Can I know whether searching through lexicon/posting list is in-memory process or IO process?
Lucy uses memory-mapped files to access most index data so the distinction between in-memory and IO-based operation blurs quite a bit.
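To show why the distinction blurs, here is a minimal Python sketch using the standard `mmap` module. It is not taken from Lucy: the file layout (a flat array of 32-bit integers) and the helpers `write_ints` and `read_int_at` are hypothetical. The code reads the mapped file as if it were a byte buffer in memory, while the operating system decides when pages are actually fetched from disk.

```python
import mmap
import os
import struct
import tempfile


def write_ints(path, values):
    """Write values as a flat array of little-endian unsigned 32-bit integers."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<I", v))


def read_int_at(path, index):
    """Read one integer through a read-only memory mapping of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return struct.unpack_from("<I", mapped, index * 4)[0]


ints_path = os.path.join(tempfile.mkdtemp(), "ints.bin")
write_ints(ints_path, [10, 20, 30, 40])
print(read_int_at(ints_path, 2))
```

If the touched pages are already in the OS page cache, the access is effectively a memory read; if not, it triggers disk IO behind the scenes.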
Nick
