On Thu, Feb 9, 2017 at 3:51 PM, Kasi Lakshman Karthi Anbumony <kasi.anbum...@gmail.com> wrote:
> As a follow on question, based on this link:
> https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html
>
> (1) Why the cf.dat has a document section?

The search needs to give something back to you to identify which documents
were hits. Lucy's internal document IDs change over time, so are not
suitable for that purpose. You need to at least store your own identifier,
even if you choose not to store other parts of the document.

> (2) Why is it not compressed?

It's not done by default, but there are extension points allowing that
behavior to be overridden. There's even example code which ships with Lucy
which does exactly what you suggest. It's in Perl, but could be ported to C.

    $REPO/perl/lib/LucyX/Index/ZlibDocReader.pm
    $REPO/perl/lib/LucyX/Index/ZlibDocWriter.pm

> I see most of the content of the books I have indexed being part of cf.dat
> file and can read the text as it is! Is this how the inverted indexing
> works?

The document storage part of a Lucy datastore is separate from the inverted
index. The inverted index data structures are definitely compressed, using
algorithms tuned to the task of search. The first part of the search yields
a set of internal Lucy document IDs, which are then used to look up
whatever's in document storage.

From a performance perspective, the cost to perform the inverted index
search is roughly proportional to the size of the corpus, whereas the cost
to retrieve the document content afterwards is proportional to the number of
documents retrieved. When scaling to larger collections, compressing the
inverted index is more important than compressing document storage, since
the number of documents searched grows while the number of documents
retrieved often stays the same.

Of course it may still be reasonable to compress document storage, depending
on usage pattern. But if for example you're only storing short identifiers,
there's no need.

Marvin Humphrey
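
For anyone who finds the two-stage lookup described above easier to follow in code, here is a toy sketch in Python. It is not Lucy's API or file format: `ToyIndex`, `add_doc`, `search` and the sample documents are all made up for illustration. It only shows the shape of the idea: the inverted index maps terms to internal document IDs, and a separate document store maps those IDs to whatever was stored (here, just the caller's own identifier).

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical toy model: an inverted index plus a separate doc store."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of internal doc IDs
        self.doc_store = {}               # internal doc ID -> stored fields
        self.next_id = 1

    def add_doc(self, external_id, text):
        doc_id = self.next_id
        self.next_id += 1
        # Index every term, but do not keep the text itself.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        # Store only the caller's own identifier.
        self.doc_store[doc_id] = {"id": external_id}
        return doc_id

    def search(self, term, limit=10):
        # Stage 1: the inverted index yields internal doc IDs.
        hits = sorted(self.postings.get(term.lower(), ()))
        # Stage 2: fetch stored fields only for the hits being returned.
        return [self.doc_store[doc_id]["id"] for doc_id in hits[:limit]]


index = ToyIndex()
index.add_doc("moby-dick", "Call me Ishmael")
index.add_doc("pride", "It is a truth universally acknowledged")
index.add_doc("bleak-house", "London Michaelmas term lately over call")
print(index.search("call"))  # ['moby-dick', 'bleak-house']
```

In this sketch the work in stage 2 is bounded by `limit`, no matter how many documents have been indexed, which is the reason given above for why compressing document storage matters less than compressing the inverted index.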