[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014922#comment-14014922 ]

Shai Erera commented on LUCENE-5688:

Ahh, I see now that you only wrote a DVFormat, not a Codec. In that case I agree: apps should plug it in per-field, and it doesn't need to wrap another format. Can you perhaps make the Consumer/Producer package-private? I think only the Format needs to be public.

About the Binary field: indeed it doesn't write the data if a BytesRef is missing, but it does write all the meta information, e.g. the missing bitset and the addresses (in case the BytesRefs aren't of equal length). So I think sparseness should be really sparse. But I'm fine if you leave that out for now -- we first need to make sure the numeric field performs well and that there are real gains (even if only during indexing).

NumericDocValues fields with sparse data can be compressed better
--
Key: LUCENE-5688
URL: https://issues.apache.org/jira/browse/LUCENE-5688
Project: Lucene - Core
Issue Type: Improvement
Reporter: Varun Thacker
Priority: Minor
Attachments: LUCENE-5688.patch, LUCENE-5688.patch

I ran into this problem where I had a dynamic field in Solr and indexed data into lots of fields. For each field only a few documents had actual values, and for the remaining documents the default value (0) got indexed. Now when I merge segments, the index size jumps up. For example, I have 10 segments, each with 1 DV field. When I merge those segments into 1, that segment will contain all 10 DV fields with lots of 0s. This was the motivation behind trying to come up with a compression scheme for a use case like this.

--
This message was sent by Atlassian JIRA (v6.2#6252)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013414#comment-14013414 ]

Varun Thacker commented on LUCENE-5688:
---

Hi Shai, thanks for reviewing.

bq. Perhaps you can also experiment with a tiny hash-map, using plain int[]+long[] or a pair of packed arrays, instead of the binary search tree. I am writing one now because I am experimenting with improvements to updatable DocValues. It's based on Solr's HashDocSet which I modify to act as an int-to-long map. I can share the code here if you want

Sure, this approach also looks promising: faster access vs. more memory. Perhaps we could provide both options in the same codec.

bq. Another thing, maybe this codec should wrap another and delegate to it in case the number of docs-with-values exceeds some threshold? For instance, ignoring packing, the default DV encodes 8 bytes per document, while this codec encodes 12 bytes (doc+value) per document which has a value. So I'm thinking that unless the field is really sparse, we might prefer the default encoding. We should fold that as well into the benchmark.

I thought about it, but since we are writing a codec dedicated to sparse values, and not adding it as an optimization to the default codec, I did not include it in my patch. If you feel we should, then I will add it.

A couple of other general doubts that I had:
- Currently only addNumericField is implemented. Looking at Lucene45DocValuesConsumer, addBinaryField does not write missing values, so the same code can be reused?
- For addSortedField and addSortedSetField, is addTermsDict the only method which would need to change?
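For reference, the int-to-long map Shai describes (derived from Solr's HashDocSet, using plain int[]+long[]) could look roughly like the open-addressing sketch below. The class name and sizing policy are illustrative only, not Shai's actual code; it assumes non-negative doc ids and no more than expectedSize entries.

```java
// Hypothetical sketch of an int-to-long map for sparse doc values,
// modeled loosely on Solr's HashDocSet. Not from the actual patch.
public class IntToLongMap {
    private static final int EMPTY = -1; // doc ids are always >= 0
    private final int[] keys;
    private final long[] values;
    private final int mask;

    public IntToLongMap(int expectedSize) {
        // power-of-two capacity, kept at most ~50% full for short probe chains
        int capacity = Integer.highestOneBit(Math.max(2, expectedSize) - 1) << 2;
        keys = new int[capacity];
        values = new long[capacity];
        java.util.Arrays.fill(keys, EMPTY);
        mask = capacity - 1;
    }

    public void put(int doc, long value) {
        int slot = doc & mask;
        while (keys[slot] != EMPTY && keys[slot] != doc) {
            slot = (slot + 1) & mask; // linear probing
        }
        keys[slot] = doc;
        values[slot] = value;
    }

    /** Returns the stored value, or 0 (the NumericDocValues default) if absent. */
    public long get(int doc) {
        int slot = doc & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == doc) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return 0L;
    }
}
```

This is the "faster access vs. more memory" side of the trade-off: O(1) expected lookups, but the table carries empty slots that the binary-search layout avoids.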
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012177#comment-14012177 ]

Shai Erera commented on LUCENE-5688:

bq. It does a binary search on the position data which is read using MonotonicBlockPackedReader.

Perhaps you can also experiment with a tiny hash-map, using plain int[]+long[] or a pair of packed arrays, instead of the binary search. I am writing one now because I am experimenting with improvements to updatable DocValues. It's based on Solr's {{HashDocSet}}, which I modify to act as an int-to-long map. I can share the code here if you want.

bq. Also I am not too familiar with lucene-util but is there a test which benchmarks DocValue read times? Should be interesting to see the read time difference.

Luceneutil has a search task benchmark (searchBench.py) which you can use. I recently augmented it (while benchmarking updatable DV) with a sort-by-DocValues task, so I think you can use that to exercise the sparse DV. Once you're ready to run the benchmark let me know, and I can share the tasks file with you. You will also need to modify the indexer to create sparse DVs (make it configurable), as currently when DV is turned on, each document is indexed with the full set of fields.

About the patch: I see you always encode a bitset + the values (sparse). I wonder whether, if you used a hashtable approach as I described above, you could just encode the docs that have a value. Then in the producer you can load them into memory (it's supposed to be small) and implement both getDocsWithField and getNumeric. It will impact docsWithField, but it's worth benchmarking I think.

Another thing: maybe this codec should wrap another and delegate to it in case the number of docs-with-values exceeds some threshold? For instance, ignoring packing, the default DV encodes 8 bytes per document, while this codec encodes 12 bytes (doc+value) per document which has a value.
So I'm thinking that unless the field is really sparse, we might prefer the default encoding. We should fold that as well into the benchmark.
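Ignoring packing, the break-even point Shai sketches above falls directly out of the per-document costs: the dense encoding pays ~8 bytes for every document, while the sparse one pays ~12 bytes (4-byte doc id + 8-byte value) only for documents with a value, so sparse wins below roughly 2/3 density. A hypothetical sketch of that threshold check (not from the patch, names illustrative):

```java
// Back-of-the-envelope version of the delegation threshold, ignoring
// packing: dense costs ~8 bytes/doc, sparse ~12 bytes per doc-with-value.
public class SparseThreshold {
    static long denseBytes(int maxDoc) {
        return 8L * maxDoc;
    }

    static long sparseBytes(int docsWithValue) {
        return 12L * docsWithValue;
    }

    /** True when the sparse encoding would be smaller on disk. */
    static boolean preferSparse(int maxDoc, int docsWithValue) {
        return sparseBytes(docsWithValue) < denseBytes(maxDoc);
    }
}
```

In practice a wrapping codec would likely pick a much lower cutoff than 2/3, since the sparse encoding also trades away access speed.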
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004562#comment-14004562 ]

Adrien Grand commented on LUCENE-5688:
--

+1 to using binary search on an in-memory {{MonotonicBlockPackedReader}} to implement sparse doc values.
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003220#comment-14003220 ]

Robert Muir commented on LUCENE-5688:
-

I think this is a duplicate of LUCENE-4921? I guess the main thing is to differentiate between sparse data and thousands and thousands of fields, which usually hints at the problem not being in Lucene :)
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003226#comment-14003226 ]

Robert Muir commented on LUCENE-5688:
-

Varun, I don't think we should make a long[] of size maxDoc in RAM here just to save some space on disk.
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003231#comment-14003231 ]

Grant Ingersoll commented on LUCENE-5688:
-

bq. Varun, I don't think we should make a long[] of size maxDoc in RAM here just to save some space on disk.

In a large index, this can be quite significant, FWIW. Agreed on the long[] in RAM, but it would be good to have a better way of controlling the on-disk behavior.
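To ballpark why this is significant: a long[] of size maxDoc costs 8 bytes per document per sparse field, whether or not the document has a value. A purely illustrative helper (not from any patch):

```java
// Rough heap cost of the long[maxDoc] approach objected to above:
// 8 bytes per document per sparse field, regardless of sparsity.
public class DvRamCost {
    /** Heap bytes for one long[maxDoc], ignoring array-header overhead. */
    static long bytesForLongArray(int maxDoc) {
        return 8L * maxDoc;
    }
}
```

For example, a 100M-doc index with 10 such fields would hold roughly 8 GB on the heap, even if almost every entry is the default 0.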
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003234#comment-14003234 ]

Varun Thacker commented on LUCENE-5688:
---

Completely overlooked LUCENE-4921. Should I mark this as a duplicate and post the same patch there?

bq. Varun, I don't think we should make a long[] of size maxDoc in RAM here just to save some space on disk.

I felt the same way when I was writing it, but that was the easiest way to get a quick patch out. I will try to think of a better way to achieve this. Do you have any suggestions?
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003321#comment-14003321 ]

Robert Muir commented on LUCENE-5688:
-

Otherwise you load hardly anything in RAM, so it's extremely trappy to do this. As I mentioned, the obvious approach is O(log N), like Android's SparseArray: array 1 is the increasing doc ids that have a value (can be a MonotonicBlockPackedReader). You can binary-search that to find your value in the array of real values.

You have to decide how 'missing' should be represented. Currently it will be 1 bit per document as well. If it stays that way, you can check that first (which is the typical case) before binary searching.

In all cases this has performance implications (slower access), and it isn't specific to numerics (all DV fields could be sparse). So I think it's best to start outside of the default codec rather than trying to do it automatically. Not everyone will want the space-time tradeoff.
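The O(log N) scheme described above can be sketched in plain Java, with simple arrays standing in for the MonotonicBlockPackedReader and the on-disk missing bitset; all names here are hypothetical, not the patch's actual classes:

```java
// Illustrative sketch of sparse numeric doc values: doc ids with a value
// are stored in increasing order, values in a parallel array, and a
// docsWithField bitset is checked first to short-circuit the common
// "missing" case before any binary search runs.
public class SparseNumericValues {
    private final int[] docs;    // increasing doc ids that have a value
    private final long[] values; // values[i] belongs to docs[i]
    private final java.util.BitSet docsWithField;

    public SparseNumericValues(int[] docs, long[] values, int maxDoc) {
        this.docs = docs;
        this.values = values;
        this.docsWithField = new java.util.BitSet(maxDoc);
        for (int doc : docs) {
            docsWithField.set(doc);
        }
    }

    public long get(int doc) {
        if (!docsWithField.get(doc)) {
            return 0L; // typical case: no value, O(1), no search needed
        }
        int idx = java.util.Arrays.binarySearch(docs, doc);
        return values[idx]; // idx >= 0: the bitset guaranteed membership
    }
}
```

The bitset keeps docsWithField at 1 bit per document, matching the current representation, while the doc/value arrays only pay for documents that actually have a value.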