codecs for sorted indexes

2012-04-12 Thread Carlos Gonzalez-Cadenas
Hello,

We're using a sorted index in order to implement early termination
efficiently over an index of hundreds of millions of documents. As of now,
we're using the default codecs coming with Lucene 4, but we believe that
due to the fact that the docids are sorted, we should be able to do much
better in terms of storage and achieve much better performance, especially
decompression performance.

In particular, Robert Muir is commenting on these lines here:

https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411

We're aware that the in the bulkpostings branch there are different codecs
being implemented and different experiments being done. We don't know
whether we should implement our own codec (i.e. using some RLE-like
techniques) or we should use one of the codecs implemented there (PFOR,
Simple64, ...).

Can you please give us some advice on this?

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


Re: codecs for sorted indexes

2012-04-12 Thread Michael McCandless
Do you mean you are pre-sorting the documents (by what criteria?)
yourself, before adding them to the index?

In which case... you should already be seeing some benefits (smaller
index size) than had you randomly added them (ie the vInts should
take fewer bytes), I think.  (Probably the savings would be greater
for better intblock codecs like PForDelta, SimpleX, but I'm not
sure...).

Or do you mean having a codec re-sort the documents (on flush/merge)?
I think this should be possible w/ the Codec API... but nobody has
tried it yet that I know of.

Note that the bulkpostings branch is effectively dead (nobody is
iterating on it, and we've removed the old bulk API from trunk), but
there is likely a GSoC project to add a PForDelta codec to trunk:

https://issues.apache.org/jira/browse/LUCENE-3892

Mike McCandless

http://blog.mikemccandless.com



On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
c...@experienceon.com wrote:
 Hello,

 We're using a sorted index in order to implement early termination
 efficiently over an index of hundreds of millions of documents. As of now,
 we're using the default codecs coming with Lucene 4, but we believe that
 due to the fact that the docids are sorted, we should be able to do much
 better in terms of storage and achieve much better performance, especially
 decompression performance.

 In particular, Robert Muir is commenting on these lines here:

 https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411

 We're aware that the in the bulkpostings branch there are different codecs
 being implemented and different experiments being done. We don't know
 whether we should implement our own codec (i.e. using some RLE-like
 techniques) or we should use one of the codecs implemented there (PFOR,
 Simple64, ...).

 Can you please give us some advice on this?

 Thanks
 Carlos

 Carlos Gonzalez-Cadenas
 CEO, ExperienceOn - New generation search
 http://www.experienceon.com

 Mobile: +34 652 911 201
 Skype: carlosgonzalezcadenas
 LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


Re: codecs for sorted indexes

2012-04-12 Thread Carlos Gonzalez-Cadenas
Hello Michael,

Yes, we are pre-sorting the documents before adding them to the index. We
have a score associated to every document (not an IR score but a
document-related score that reflects its importance). Therefore, the
document with the biggest score will have the lowest docid (we add it first
to the index). We do this in order to apply early termination effectively.
With the actual coded, we haven't seen much of a difference in terms of
space when we have the index sorted vs not sorted.

So, the question would be: if we force the docids to be sorted, what is the
best way to encode them?. We don't really care if the codec doesn't work
for cases where the documents are not sorted (i.e. if it throws an
exception if documents are not ordered when creating the index). Our idea
here is that it may be possible to trade off generality but achieve very
significant improvements for the specific case.

Would something along the lines of RLE coding work? i.e. if we have to
store docids 1 to 1500, we can represent it as 1::1499 (it would be 2
ints to represent 1500 docids).

Thanks a lot for your help,
Carlos

On Thu, Apr 12, 2012 at 6:19 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Do you mean you are pre-sorting the documents (by what criteria?)
 yourself, before adding them to the index?



 In which case... you should already be seeing some benefits (smaller
 index size) than had you randomly added them (ie the vInts should
 take fewer bytes), I think.  (Probably the savings would be greater
 for better intblock codecs like PForDelta, SimpleX, but I'm not
 sure...).

 Or do you mean having a codec re-sort the documents (on flush/merge)?
 I think this should be possible w/ the Codec API... but nobody has
 tried it yet that I know of.

 Note that the bulkpostings branch is effectively dead (nobody is
 iterating on it, and we've removed the old bulk API from trunk), but
 there is likely a GSoC project to add a PForDelta codec to trunk:

https://issues.apache.org/jira/browse/LUCENE-3892

 Mike McCandless

 http://blog.mikemccandless.com



 On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
 c...@experienceon.com wrote:
  Hello,
 
  We're using a sorted index in order to implement early termination
  efficiently over an index of hundreds of millions of documents. As of
 now,
  we're using the default codecs coming with Lucene 4, but we believe that
  due to the fact that the docids are sorted, we should be able to do much
  better in terms of storage and achieve much better performance,
 especially
  decompression performance.
 
  In particular, Robert Muir is commenting on these lines here:
 
 
 https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411
 
  We're aware that the in the bulkpostings branch there are different
 codecs
  being implemented and different experiments being done. We don't know
  whether we should implement our own codec (i.e. using some RLE-like
  techniques) or we should use one of the codecs implemented there (PFOR,
  Simple64, ...).
 
  Can you please give us some advice on this?
 
  Thanks
  Carlos
 
  Carlos Gonzalez-Cadenas
  CEO, ExperienceOn - New generation search
  http://www.experienceon.com
 
  Mobile: +34 652 911 201
  Skype: carlosgonzalezcadenas
  LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas



Re: codecs for sorted indexes

2012-04-12 Thread Robert Muir
On Thu, Apr 12, 2012 at 6:35 PM, Carlos Gonzalez-Cadenas
c...@experienceon.com wrote:
 Hello Michael,

 Yes, we are pre-sorting the documents before adding them to the index. We
 have a score associated to every document (not an IR score but a
 document-related score that reflects its importance). Therefore, the
 document with the biggest score will have the lowest docid (we add it first
 to the index). We do this in order to apply early termination effectively.
 With the actual coded, we haven't seen much of a difference in terms of
 space when we have the index sorted vs not sorted.

I wouldn't expect that you will see space savings when you sort this way.

The techniques I was mentioning involve sorting documents by other
factors instead (such as grouping related documents from the same
website together: idea being they probably share many of the same
terms): this hopefully creates smaller document deltas that require
less bits to represent.

-- 
lucidimagination.com