[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-06-01 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014922#comment-14014922
 ] 

Shai Erera commented on LUCENE-5688:


Ahh, I see now that you only wrote a DVFormat, not a Codec. In that case I 
agree: apps should plug it in per-field, and it doesn't need to wrap 
another format. Can you perhaps make the Consumer/Producer package-private? I 
think only the Format needs to be public.

About the Binary field: indeed it doesn't write the data if a BytesRef is missing, 
but it does write all the meta information, e.g. the missing bitset and the 
addresses (in case the BytesRefs aren't of equal length). So I think sparseness 
should be really sparse. But I'm fine if you leave that out for now - we first 
need to make sure the numeric field performs and that there are real gains (even 
if only during indexing).

 NumericDocValues fields with sparse data can be compressed better 
 --

 Key: LUCENE-5688
 URL: https://issues.apache.org/jira/browse/LUCENE-5688
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Varun Thacker
Priority: Minor
 Attachments: LUCENE-5688.patch, LUCENE-5688.patch


 I ran into this problem where I had a dynamic field in Solr and indexed data 
 into lots of fields. For each field only a few documents had actual values, 
 and for the remaining documents the default value (0) got indexed. Now when I 
 merge segments, the index size jumps up.
 For example, I have 10 segments, each with 1 DV field. When I merge them into 
 1, that segment will contain all 10 DV fields with lots of 0s.
 This was the motivation behind trying to come up with a compression scheme 
 for a use case like this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-30 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013414#comment-14013414
 ] 

Varun Thacker commented on LUCENE-5688:
---

Hi Shai,

Thanks for reviewing.

bq. Perhaps you can also experiment with a tiny hash-map, using plain 
int[]+long[] or a pair of packed arrays, instead of the binary search. I 
am writing one now because I am experimenting with improvements to updatable 
DocValues. It's based on Solr's HashDocSet which I modify to act as an 
int-to-long map. I can share the code here if you want

Sure, this approach looks promising as well - faster access at the cost of more 
memory. Perhaps we could provide both options in the same codec.

bq. Another thing, maybe this codec should wrap another and delegate to it in 
case the number of docs-with-values exceeds some threshold? For instance, 
ignoring packing, the default DV encodes 8 bytes per document, while this codec 
encodes 12 bytes (doc+value) per document that has a value. So I'm thinking that 
unless the field is really sparse, we might prefer the default encoding. We 
should fold that into the benchmark as well.

I thought about it, but since we are writing a codec dedicated to sparse values, 
and not adding it as an optimization to the default codec, I did not include it 
in my patch. If you feel that we should, I will add it.

A couple of other general doubts that I had:
- Currently only addNumericField is implemented. Looking at the 
Lucene45DocValuesConsumer, addBinaryField does not write a missing value, so 
can the same code be reused?
- For addSortedField and addSortedSetField, would addTermsDict be the only 
method that needs to change?




[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-29 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012177#comment-14012177
 ] 

Shai Erera commented on LUCENE-5688:


bq. It does a binary search on the position data which is read using 
MonotonicBlockPackedReader.

Perhaps you can also experiment with a tiny hash-map, using plain int[]+long[] 
or a pair of packed arrays, instead of the binary search. I am writing one 
now because I am experimenting with improvements to updatable DocValues. It's 
based on Solr's {{HashDocSet}}, which I modify to act as an int-to-long map. I 
can share the code here if you want.
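As a rough illustration, the kind of int-to-long map described above could look like this. This is a hedged sketch, not Solr's actual {{HashDocSet}}; the class name and sizing policy are illustrative, and it assumes doc IDs are non-negative:

```java
import java.util.Arrays;

// Minimal open-addressing int-to-long map with linear probing.
// Keys are doc IDs (>= 0), so -1 can serve as the empty-slot marker.
final class IntToLongMap {
    private static final int EMPTY = -1;
    private final int[] keys;    // doc IDs
    private final long[] values; // the value stored for each doc ID
    private final int mask;      // capacity - 1 (capacity is a power of two)

    IntToLongMap(int expectedSize) {
        // Size to a power of two at least twice the expected entry count,
        // keeping the load factor at or below 0.5.
        int cap = Integer.highestOneBit(Math.max(2, expectedSize * 2) - 1) << 1;
        keys = new int[cap];
        values = new long[cap];
        mask = cap - 1;
        Arrays.fill(keys, EMPTY);
    }

    void put(int doc, long value) {
        int slot = doc & mask;
        // Probe until we find an empty slot or the doc's existing slot.
        while (keys[slot] != EMPTY && keys[slot] != doc) {
            slot = (slot + 1) & mask;
        }
        keys[slot] = doc;
        values[slot] = value;
    }

    /** Returns the stored value, or defaultValue if doc has no entry. */
    long get(int doc, long defaultValue) {
        int slot = doc & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == doc) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return defaultValue;
    }
}
```

With two parallel arrays the per-entry cost is 12 bytes plus the empty slots, which is the "more memory for faster access" tradeoff mentioned in the follow-up.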

bq. Also I am not too familiar with lucene-util, but is there a test which 
benchmarks DocValues read times? Should be interesting to see the read-time 
difference.

Luceneutil has a search-task benchmark (searchBench.py) which you can use. I 
recently augmented it (while benchmarking updatable DV) with a 
sort-by-DocValues task, so I think you can use that to exercise the sparse DV. 
Once you're ready to run the benchmark, let me know and I can share the tasks 
file with you. You will also need to modify the indexer to create sparse DVs 
(make it configurable), as currently when DV is turned on, each document is 
indexed with a set of fields.

About the patch: I see you always encode a bitset + the values (sparse). I 
wonder whether, if you used the hashtable approach I described above, you could 
just encode the docs that have a value. Then in the producer you can load them 
into memory (it's supposed to be small) and implement both getDocsWithField and 
getNumeric. It will impact docsWithField, but it's worth benchmarking I think.

Another thing: maybe this codec should wrap another and delegate to it in case 
the number of docs-with-values exceeds some threshold? For instance, ignoring 
packing, the default DV encodes 8 bytes per document, while this codec encodes 
12 bytes (doc+value) per document that has a value. So I'm thinking that 
unless the field is really sparse, we might prefer the default encoding. We 
should fold that into the benchmark as well.
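The back-of-envelope numbers above give a rough cut-off: 12n bytes (sparse) beats 8 * maxDoc bytes (default) only while n/maxDoc is below roughly 2/3. A tiny sketch of such a threshold check follows; the byte costs are the illustrative figures from this comment (ignoring packing), not measured constants:

```java
// Rough size comparison between the default and sparse encodings.
final class SparseHeuristic {
    // Default DV: roughly one 8-byte long per document (ignoring packing).
    static final int DENSE_BYTES_PER_DOC = 8;
    // Sparse DV: a 4-byte doc ID plus an 8-byte value, per doc with a value.
    static final int SPARSE_BYTES_PER_VALUE = 12;

    /** True if the sparse encoding is expected to be smaller on disk. */
    static boolean preferSparse(int docsWithValue, int maxDoc) {
        return (long) docsWithValue * SPARSE_BYTES_PER_VALUE
             < (long) maxDoc * DENSE_BYTES_PER_DOC;
    }
}
```

A wrapping format could consult a check like this at flush time and delegate to the default format above the threshold.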




[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-21 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004562#comment-14004562
 ] 

Adrien Grand commented on LUCENE-5688:
--

+1 to using binary search on an in-memory {{MonotonicBlockPackedReader}} to 
implement sparse doc values. 




[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003220#comment-14003220
 ] 

Robert Muir commented on LUCENE-5688:
-

I think this is a duplicate of LUCENE-4921 ?

I guess the main thing is to differentiate between sparse data and thousands 
and thousands of fields, which usually hints at the problem not being in Lucene 
:)




[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003226#comment-14003226
 ] 

Robert Muir commented on LUCENE-5688:
-

Varun, I don't think we should make a long[] of size maxDoc in RAM here just to 
save some space on disk.




[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003231#comment-14003231
 ] 

Grant Ingersoll commented on LUCENE-5688:
-

bq. Varun, I don't think we should make a long[] of size maxDoc in RAM here 
just to save some space on disk.

In a large index, this can be quite significant, FWIW. Agreed on the long[] in 
RAM, but it would be good to have a better way of controlling the on-disk 
behavior.




[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-20 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003234#comment-14003234
 ] 

Varun Thacker commented on LUCENE-5688:
---

Completely overlooked LUCENE-4921. Should I mark this as a duplicate and post 
the same patch there?

bq. Varun, I don't think we should make a long[] of size maxDoc in RAM here 
just to save some space on disk.

I felt the same way when I was writing it, but that was the easiest way to get 
a quick patch out. I will try to think of a better way to achieve this. Do you 
have any suggestions? 




[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003321#comment-14003321
 ] 

Robert Muir commented on LUCENE-5688:
-

Otherwise you hardly load anything in RAM, so it's extremely trappy to do 
this.

As I mentioned, the obvious approach is O(log N), like Android's SparseArray: 
array 1 is the increasing list of documents that have a value (it can be a 
MonotonicBlockPackedReader), and you can binary-search it to find your index 
into the real values.

You have to decide how 'missing' should be represented. Currently it is 1 
bit per document as well; if it stays that way, you can check that first (which 
is the typical case) before binary searching.
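The lookup described above could be written roughly as follows. This is a hedged illustration: the array and bitset names are made up, plain int[]/long[] arrays stand in for what would on disk be a MonotonicBlockPackedReader and packed values, and java.util.BitSet stands in for the missing bitset:

```java
import java.util.Arrays;
import java.util.BitSet;

// O(log N) sparse numeric lookup: a sorted array of doc IDs that have a
// value, a parallel array of values, and a per-document missing bitset.
final class SparseLookup {
    private final int[] docs;          // sorted, increasing doc IDs with a value
    private final long[] values;       // values[i] belongs to docs[i]
    private final BitSet docsWithField; // bit set for docs that have a value

    SparseLookup(int[] docs, long[] values, BitSet docsWithField) {
        this.docs = docs;
        this.values = values;
        this.docsWithField = docsWithField;
    }

    long get(int docID) {
        // Check the missing bit first: for a sparse field "missing" is the
        // common case, and this avoids the binary search entirely.
        if (!docsWithField.get(docID)) {
            return 0L; // NumericDocValues returns 0 for docs without a value
        }
        // The bitset said the doc has a value, so the search must succeed.
        int idx = Arrays.binarySearch(docs, docID);
        return values[idx];
    }
}
```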

In all cases this has performance implications (slower access), and it isn't 
specific to numerics (all DV fields could be sparse). So I think it's best to 
start outside of the default codec rather than trying to do it automatically. 
Not everyone will want the space-time tradeoff.
