codecs for sorted indexes
Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
Re: codecs for sorted indexes
Do you mean you are pre-sorting the documents (by what criteria?) yourself, before adding them to the index? In which case... you should already be seeing some benefits (smaller index size) than had you randomly added them (ie the vInts should take fewer bytes), I think. (Probably the savings would be greater for better intblock codecs like PForDelta, SimpleX, but I'm not sure...). Or do you mean having a codec re-sort the documents (on flush/merge)? I think this should be possible w/ the Codec API... but nobody has tried it yet that I know of. Note that the bulkpostings branch is effectively dead (nobody is iterating on it, and we've removed the old bulk API from trunk), but there is likely a GSoC project to add a PForDelta codec to trunk: https://issues.apache.org/jira/browse/LUCENE-3892 Mike McCandless http://blog.mikemccandless.com On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
Re: codecs for sorted indexes
Hello Michael, Yes, we are pre-sorting the documents before adding them to the index. We have a score associated to every document (not an IR score but a document-related score that reflects its importance). Therefore, the document with the biggest score will have the lowest docid (we add it first to the index). We do this in order to apply early termination effectively. With the actual coded, we haven't seen much of a difference in terms of space when we have the index sorted vs not sorted. So, the question would be: if we force the docids to be sorted, what is the best way to encode them?. We don't really care if the codec doesn't work for cases where the documents are not sorted (i.e. if it throws an exception if documents are not ordered when creating the index). Our idea here is that it may be possible to trade off generality but achieve very significant improvements for the specific case. Would something along the lines of RLE coding work? i.e. if we have to store docids 1 to 1500, we can represent it as 1::1499 (it would be 2 ints to represent 1500 docids). Thanks a lot for your help, Carlos On Thu, Apr 12, 2012 at 6:19 PM, Michael McCandless luc...@mikemccandless.com wrote: Do you mean you are pre-sorting the documents (by what criteria?) yourself, before adding them to the index? In which case... you should already be seeing some benefits (smaller index size) than had you randomly added them (ie the vInts should take fewer bytes), I think. (Probably the savings would be greater for better intblock codecs like PForDelta, SimpleX, but I'm not sure...). Or do you mean having a codec re-sort the documents (on flush/merge)? I think this should be possible w/ the Codec API... but nobody has tried it yet that I know of. Note that the bulkpostings branch is effectively dead (nobody is iterating on it, and we've removed the old bulk API from trunk), but there is likely a GSoC project to add a PForDelta codec to trunk: https://issues.apache.org/jira/browse/LUCENE-3892 Mike McCandless http://blog.mikemccandless.com On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that the in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
Re: codecs for sorted indexes
On Thu, Apr 12, 2012 at 6:35 PM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello Michael, Yes, we are pre-sorting the documents before adding them to the index. We have a score associated to every document (not an IR score but a document-related score that reflects its importance). Therefore, the document with the biggest score will have the lowest docid (we add it first to the index). We do this in order to apply early termination effectively. With the actual coded, we haven't seen much of a difference in terms of space when we have the index sorted vs not sorted. I wouldn't expect that you will see space savings when you sort this way. The techniques I was mentioning involve sorting documents by other factors instead (such as grouping related documents from the same website together: idea being they probably share many of the same terms): this hopefully creates smaller document deltas that require less bits to represent. -- lucidimagination.com