[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014922#comment-14014922 ]

Shai Erera commented on LUCENE-5688:

Ahh, I see now that you only wrote a DVFormat, not a Codec. In that case I agree: apps should plug it in per-field, and it doesn't need to wrap another format. Can you perhaps make the Consumer/Producer package-private? I think only the Format needs to be public.

About the Binary field: indeed it doesn't write the data if a BytesRef is missing, but it does write all the meta information, e.g. the missing bitset and the addresses (in case the BytesRefs aren't of equal length). So I think sparseness should be really sparse. But I'm fine if you leave that out for now -- we first need to make sure the numeric field performs well and that there are real gains (even if only during indexing).

NumericDocValues fields with sparse data can be compressed better
--
Key: LUCENE-5688
URL: https://issues.apache.org/jira/browse/LUCENE-5688
Project: Lucene - Core
Issue Type: Improvement
Reporter: Varun Thacker
Priority: Minor
Attachments: LUCENE-5688.patch, LUCENE-5688.patch

I ran into this problem where I had a dynamic field in Solr and indexed data into lots of fields. For each field only a few documents had actual values, and for the remaining documents the default value (0) got indexed. Now when I merge segments, the index size jumps up. For example, I have 10 segments, each with 1 DV field. When I merge those segments into 1, that segment will contain all 10 DV fields with lots of 0s. This was the motivation behind trying to come up with a compression scheme for a use case like this.

--
This message was sent by Atlassian JIRA (v6.2#6252)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013414#comment-14013414 ]

Varun Thacker commented on LUCENE-5688:
---

Hi Shai, thanks for reviewing.

bq. Perhaps you can also experiment with a tiny hash-map, using plain int[]+long[] or a pair of packed arrays, instead of the binary search tree. I am writing one now because I am experimenting with improvements to updatable DocValues. It's based on Solr's HashDocSet which I modify to act as an int-to-long map. I can share the code here if you want

Sure, this approach also looks promising: faster access vs. more memory. Perhaps we could provide both options in the same codec.

bq. Another thing, maybe this codec should wrap another and delegate to it in case the number of docs-with-values exceeds some threshold? For instance, ignoring packing, the default DV encodes 8 bytes per document, while this codec encodes 12 bytes (doc+value) per document which has a value. So I'm thinking that unless the field is really sparse, we might prefer the default encoding. We should fold that as well into the benchmark.

I thought about it, but since we are writing a codec dedicated to sparse values, and not adding it as an optimization to the default codec, I did not include it in my patch. If you feel we should, then I will add it.

A couple of other general doubts that I had:
- Currently only addNumericField is implemented. Looking at Lucene45DocValuesConsumer, addBinaryField does not write missing values, so the same code can be reused?
- For addSortedField and addSortedSetField, is addTermsDict the only method which would need to change?
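For reference, the int-to-long map Shai describes (derived from Solr's HashDocSet, using plain int[]+long[]) could look roughly like the open-addressing sketch below. The class name and sizing policy are illustrative only, not Shai's actual code; it assumes non-negative doc ids and no more than expectedSize entries.

```java
// Hypothetical sketch of an int-to-long map for sparse doc values,
// modeled loosely on Solr's HashDocSet. Not from the actual patch.
public class IntToLongMap {
    private static final int EMPTY = -1; // doc ids are always >= 0
    private final int[] keys;
    private final long[] values;
    private final int mask;

    public IntToLongMap(int expectedSize) {
        // power-of-two capacity, kept at most ~50% full for short probe chains
        int capacity = Integer.highestOneBit(Math.max(2, expectedSize) - 1) << 2;
        keys = new int[capacity];
        values = new long[capacity];
        java.util.Arrays.fill(keys, EMPTY);
        mask = capacity - 1;
    }

    public void put(int doc, long value) {
        int slot = doc & mask;
        while (keys[slot] != EMPTY && keys[slot] != doc) {
            slot = (slot + 1) & mask; // linear probing
        }
        keys[slot] = doc;
        values[slot] = value;
    }

    /** Returns the stored value, or 0 (the NumericDocValues default) if absent. */
    public long get(int doc) {
        int slot = doc & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == doc) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return 0L;
    }
}
```

This is the "faster access vs. more memory" side of the trade-off: O(1) expected lookups, but the table carries empty slots that the binary-search layout avoids.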
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012177#comment-14012177 ]

Shai Erera commented on LUCENE-5688:

bq. It does a binary search on the position data which is read using MonotonicBlockPackedReader.

Perhaps you can also experiment with a tiny hash-map, using plain int[]+long[] or a pair of packed arrays, instead of the binary search. I am writing one now because I am experimenting with improvements to updatable DocValues. It's based on Solr's {{HashDocSet}}, which I modify to act as an int-to-long map. I can share the code here if you want.

bq. Also I am not too familiar with lucene-util but is there a test which benchmarks DocValue read times? Should be interesting to see the read time difference.

Luceneutil has a search task benchmark (searchBench.py) which you can use. I recently augmented it (while benchmarking updatable DV) with a sort-by-DocValues task, so I think you can use that to exercise the sparse DV. Once you're ready to run the benchmark let me know, and I can share the tasks file with you. You will also need to modify the indexer to create sparse DVs (make it configurable), as currently when DV is turned on, each document is indexed with the full set of fields.

About the patch: I see you always encode a bitset + the values (sparse). I wonder whether, if you used a hashtable approach as I described above, you could just encode the docs that have a value. Then in the producer you can load them into memory (it's supposed to be small) and implement both getDocsWithField and getNumeric. It will impact docsWithField, but it's worth benchmarking I think.

Another thing: maybe this codec should wrap another and delegate to it in case the number of docs-with-values exceeds some threshold? For instance, ignoring packing, the default DV encodes 8 bytes per document, while this codec encodes 12 bytes (doc+value) per document which has a value.
So I'm thinking that unless the field is really sparse, we might prefer the default encoding. We should fold that as well into the benchmark.
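Ignoring packing, the break-even point Shai sketches above falls directly out of the per-document costs: the dense encoding pays ~8 bytes for every document, while the sparse one pays ~12 bytes (4-byte doc id + 8-byte value) only for documents with a value, so sparse wins below roughly 2/3 density. A hypothetical sketch of that threshold check (not from the patch, names illustrative):

```java
// Back-of-the-envelope version of the delegation threshold, ignoring
// packing: dense costs ~8 bytes/doc, sparse ~12 bytes per doc-with-value.
public class SparseThreshold {
    static long denseBytes(int maxDoc) {
        return 8L * maxDoc;
    }

    static long sparseBytes(int docsWithValue) {
        return 12L * docsWithValue;
    }

    /** True when the sparse encoding would be smaller on disk. */
    static boolean preferSparse(int maxDoc, int docsWithValue) {
        return sparseBytes(docsWithValue) < denseBytes(maxDoc);
    }
}
```

In practice a wrapping codec would likely pick a much lower cutoff than 2/3, since the sparse encoding also trades away access speed.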
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004562#comment-14004562 ]

Adrien Grand commented on LUCENE-5688:
--

+1 to using binary search on an in-memory {{MonotonicBlockPackedReader}} to implement sparse doc values.
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003220#comment-14003220 ]

Robert Muir commented on LUCENE-5688:
-

I think this is a duplicate of LUCENE-4921? I guess the main thing is to differentiate between sparse data and thousands and thousands of fields, which usually hints at the problem not being in Lucene :)
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003226#comment-14003226 ]

Robert Muir commented on LUCENE-5688:
-

Varun, I don't think we should make a long[] of size maxDoc in RAM here just to save some space on disk.
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003231#comment-14003231 ]

Grant Ingersoll commented on LUCENE-5688:
-

bq. Varun, I don't think we should make a long[] of size maxDoc in RAM here just to save some space on disk.

In a large index, this can be quite significant, FWIW. Agreed on the long[] in RAM, but it would be good to have a better way of controlling the on-disk behavior.
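To ballpark why this is significant: a long[] of size maxDoc costs 8 bytes per document per sparse field, whether or not the document has a value. A purely illustrative helper (not from any patch):

```java
// Rough heap cost of the long[maxDoc] approach objected to above:
// 8 bytes per document per sparse field, regardless of sparsity.
public class DvRamCost {
    /** Heap bytes for one long[maxDoc], ignoring array-header overhead. */
    static long bytesForLongArray(int maxDoc) {
        return 8L * maxDoc;
    }
}
```

For example, a 100M-doc index with 10 such fields would hold roughly 8 GB on the heap, even if almost every entry is the default 0.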
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003234#comment-14003234 ]

Varun Thacker commented on LUCENE-5688:
---

Completely overlooked LUCENE-4921. Should I mark this as a duplicate and post the same patch there?

bq. Varun, I don't think we should make a long[] of size maxDoc in RAM here just to save some space on disk.

I felt the same way when I was writing it, but that was the easiest way to get a quick patch out. I will try to think of a better way to achieve this. Do you have any suggestions?
[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better
[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003321#comment-14003321 ]

Robert Muir commented on LUCENE-5688:
-

Otherwise you load hardly anything in RAM, so it's extremely trappy to do this. As I mentioned, the obvious approach is O(log N), like Android's SparseArray: array 1 is the increasing doc ids that have a value (can be a MonotonicBlockPackedReader). You can binary-search that to find your value in the array of real values.

You have to decide how 'missing' should be represented. Currently it will be 1 bit per document as well. If it stays that way, you can check that first (which is the typical case) before binary searching.

In all cases this has performance implications (slower access), and it isn't specific to numerics (all DV fields could be sparse). So I think it's best to start outside of the default codec rather than trying to do it automatically. Not everyone will want the space-time tradeoff.
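The O(log N) scheme described above can be sketched in plain Java, with simple arrays standing in for the MonotonicBlockPackedReader and the on-disk missing bitset; all names here are hypothetical, not the patch's actual classes:

```java
// Illustrative sketch of sparse numeric doc values: doc ids with a value
// are stored in increasing order, values in a parallel array, and a
// docsWithField bitset is checked first to short-circuit the common
// "missing" case before any binary search runs.
public class SparseNumericValues {
    private final int[] docs;    // increasing doc ids that have a value
    private final long[] values; // values[i] belongs to docs[i]
    private final java.util.BitSet docsWithField;

    public SparseNumericValues(int[] docs, long[] values, int maxDoc) {
        this.docs = docs;
        this.values = values;
        this.docsWithField = new java.util.BitSet(maxDoc);
        for (int doc : docs) {
            docsWithField.set(doc);
        }
    }

    public long get(int doc) {
        if (!docsWithField.get(doc)) {
            return 0L; // typical case: no value, O(1), no search needed
        }
        int idx = java.util.Arrays.binarySearch(docs, doc);
        return values[idx]; // idx >= 0: the bitset guaranteed membership
    }
}
```

The bitset keeps docsWithField at 1 bit per document, matching the current representation, while the doc/value arrays only pay for documents that actually have a value.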