[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user asfgit closed the pull request at:

https://github.com/apache/lucene-solr/pull/513

---
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user s1monw commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239136654 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,280 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. 
Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. + * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the + * same value for a numeric field we only store the value once. + */ +final class FieldUpdatesBuffer { + private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class); + private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(String.class); + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. 
+ private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo; + private long[] numericValues; // this will be null if we are buffering binaries + private FixedBitSet hasValues; + private String[] fields; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +this.bytesUsed.addAndGet(SELF_SHALLOW_SIZE); +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +fields = new String[] {initialValue.term.field}; +bytesUsed.addAndGet(sizeOfString(initialValue.term.field)); +docsUpTo = new int[] {docUpTo}; +if (initialValue.hasValue == false) { + hasValues = new FixedBitSet(1); + bytesUsed.addAndGet(hasValues.ramBytesUsed()); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? null : new BytesRefArray(bytesUsed); + } + + private static long sizeOfString(String string) { +return STRING_SHALLOW_SIZE + (string.length() * Character.BYTES); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue()) { + numericValues = new long[] {initialValue.getValue()}; +} else { + numericValues = new long[] {0}; +} +bytesUsed.addAndGet(Long.BYTES); + } + +
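The class comment quoted above describes keeping per-update metadata (field, docUpTo, numeric value) in constant space for as long as every update shares the same value. A minimal sketch of that idea follows, assuming a hypothetical `SharedIntColumn` class that is not part of the pull request: it keeps a one-element array while all values agree and switches to one slot per update on the first divergence.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration, not the Lucene class: stores a single int
 * while all appended values are equal and expands lazily otherwise.
 */
final class SharedIntColumn {
  private int[] values; // length 1 means "every value so far is values[0]"
  private int size = 1;

  SharedIntColumn(int first) {
    values = new int[] {first};
  }

  void add(int value) {
    int ord = size++;
    if (values.length == 1) {
      if (values[0] == value) {
        return; // still uniform, nothing extra to store
      }
      // first divergence: materialize one slot per update seen so far
      int shared = values[0];
      values = new int[Math.max(8, ord + 1)];
      Arrays.fill(values, 0, ord, shared);
    } else if (ord >= values.length) {
      values = Arrays.copyOf(values, values.length * 2);
    }
    values[ord] = value;
  }

  int get(int ord) {
    if (ord < 0 || ord >= size) {
      throw new IndexOutOfBoundsException("ord=" + ord + " size=" + size);
    }
    return values.length == 1 ? values[0] : values[ord];
  }

  int size() {
    return size;
  }

  boolean isShared() {
    return values.length == 1;
  }

  public static void main(String[] args) {
    SharedIntColumn docsUpTo = new SharedIntColumn(Integer.MAX_VALUE);
    docsUpTo.add(Integer.MAX_VALUE); // common case: applies to all docs
    docsUpTo.add(42);                // first differing value triggers expansion
    System.out.println(docsUpTo.isShared() + " " + docsUpTo.get(2));
  }
}
```

The trade-off in this sketch is one extra branch per read in exchange for constant memory in the uniform case; whether the pull request uses the same growth policy is not visible in the quoted diff.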
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239127024 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,280 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. 
Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. + * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the + * same value for a numeric field we only store the value once. + */ +final class FieldUpdatesBuffer { + private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class); + private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(String.class); + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. 
+ private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo; + private long[] numericValues; // this will be null if we are buffering binaries + private FixedBitSet hasValues; + private String[] fields; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +this.bytesUsed.addAndGet(SELF_SHALLOW_SIZE); +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +fields = new String[] {initialValue.term.field}; +bytesUsed.addAndGet(sizeOfString(initialValue.term.field)); +docsUpTo = new int[] {docUpTo}; +if (initialValue.hasValue == false) { + hasValues = new FixedBitSet(1); + bytesUsed.addAndGet(hasValues.ramBytesUsed()); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? null : new BytesRefArray(bytesUsed); + } + + private static long sizeOfString(String string) { +return STRING_SHALLOW_SIZE + (string.length() * Character.BYTES); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue()) { + numericValues = new long[] {initialValue.getValue()}; +} else { + numericValues = new long[] {0}; +} +bytesUsed.addAndGet(Long.BYTES); + } + +
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239123489 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. 
Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. + * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the + * same value for a numeric field we only store the value once. + */ +final class FieldUpdatesBuffer { + private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class); + private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOf(String.class); + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. 
+ private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo; + private long[] numericValues; // this will be null if we are buffering binaries + private FixedBitSet hasValues; + private String[] fields; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +this.bytesUsed.addAndGet(SELF_SHALLOW_SIZE); +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +fields = new String[] {initialValue.term.field}; +bytesUsed.addAndGet(sizeOfString(initialValue.term.field)); +docsUpTo = new int[] {docUpTo}; +if (initialValue.hasValue == false) { + hasValues = new FixedBitSet(1); + bytesUsed.addAndGet(hasValues.ramBytesUsed()); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? null : new BytesRefArray(bytesUsed); + } + + private long sizeOfString(String string) { +return STRING_SHALLOW_SIZE + (string.length() * Character.BYTES); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue()) { + numericValues = new long[] {initialValue.getValue()}; +} else { + numericValues = new long[] {0}; +} +bytesUsed.addAndGet(Long.BYTES); + } + + FieldUpdatesBuffer(Counter
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239116677

--- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the
+ * same value for a numeric field we only store the value once.
+ */
+final class FieldUpdatesBuffer {
+  private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class);
+  private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOf(String.class);
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values without de-duplication
+  // which is also not a common case to keep updating the same value more than once...
+  // we might pay a higher price in terms of memory in certain cases but will gain
+  // on CPU for those. We also save on not needing to sort in order to apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+  private int[] docsUpTo;
+  private long[] numericValues; // this will be null if we are buffering binaries
+  private FixedBitSet hasValues;
+  private String[] fields;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) {
+    this.bytesUsed = bytesUsed;
+    this.bytesUsed.addAndGet(SELF_SHALLOW_SIZE);
+    termValues = new BytesRefArray(bytesUsed);
+    termValues.append(initialValue.term.bytes);
+    fields = new String[] {initialValue.term.field};
+    bytesUsed.addAndGet(sizeOfString(initialValue.term.field));
+    docsUpTo = new int[] {docUpTo};
+    if (initialValue.hasValue == false) {
+      hasValues = new FixedBitSet(1);
+      bytesUsed.addAndGet(hasValues.ramBytesUsed());
+    }
+    this.isNumeric = isNumeric;
+    byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  private long sizeOfString(String string) {
--- End diff --

static?
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239114521

--- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java ---
@@ -288,15 +184,24 @@ void clear() {
     deleteTerms.clear();
     deleteQueries.clear();
     deleteDocIDs.clear();
-    numericUpdates.clear();
-    binaryUpdates.clear();
     numTermDeletes.set(0);
-    numNumericUpdates.set(0);
-    numBinaryUpdates.set(0);
-    bytesUsed.set(0);
+    numFieldUpdates.set(0);
+    fieldUpdates.clear();
+    bytesUsed.addAndGet(-bytesUsed.get());
+    fieldUpdatesBytesUsed.addAndGet(-fieldUpdatesBytesUsed.get());
   }

   boolean any() {
-    return deleteTerms.size() > 0 || deleteDocIDs.size() > 0 || deleteQueries.size() > 0 || numericUpdates.size() > 0 || binaryUpdates.size() > 0;
+    return deleteTerms.size() > 0 || deleteDocIDs.size() > 0 || deleteQueries.size() > 0 || numFieldUpdates.get() > 0;
+  }
+
+  @Override
+  public long ramBytesUsed() {
+    return bytesUsed.get() + fieldUpdatesBytesUsed.get();
+  }
+
+  public void clearDeletedDocIds() {
--- End diff --

no need for the public modifier?
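The diff above replaces `bytesUsed.set(0)` with `bytesUsed.addAndGet(-bytesUsed.get())` and reports memory as the sum of two counters. A minimal sketch of that accounting pattern is below. It is an assumption-laden illustration: `TwoCounterAccounting` is a hypothetical class, and `AtomicLong` stands in for the counter type used in the pull request, presumably one that offers only add and get operations.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration of two-counter RAM accounting; not the Lucene class. */
final class TwoCounterAccounting {
  // AtomicLong is a stand-in chosen for this sketch only
  private final AtomicLong bytesUsed = new AtomicLong();
  private final AtomicLong fieldUpdatesBytesUsed = new AtomicLong();

  void addDelete(long bytes) {
    bytesUsed.addAndGet(bytes);
  }

  void addFieldUpdate(long bytes) {
    fieldUpdatesBytesUsed.addAndGet(bytes);
  }

  /** Total is the sum of both counters, as in the quoted ramBytesUsed(). */
  long ramBytesUsed() {
    return bytesUsed.get() + fieldUpdatesBytesUsed.get();
  }

  void clear() {
    // reset without a setter: add the negated current value
    bytesUsed.addAndGet(-bytesUsed.get());
    fieldUpdatesBytesUsed.addAndGet(-fieldUpdatesBytesUsed.get());
  }

  public static void main(String[] args) {
    TwoCounterAccounting acct = new TwoCounterAccounting();
    acct.addDelete(128);
    acct.addFieldUpdate(64);
    System.out.println(acct.ramBytesUsed());
    acct.clear();
    System.out.println(acct.ramBytesUsed());
  }
}
```

Note that the get-then-add reset is two separate operations, so in this sketch it only yields exactly zero when no other thread modifies the counter in between.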
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239121847 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. 
Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. + * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the + * same value for a numeric field we only store the value once. + */ +final class FieldUpdatesBuffer { + private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class); + private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOf(String.class); + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. 
+ private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo; + private long[] numericValues; // this will be null if we are buffering binaries + private FixedBitSet hasValues; + private String[] fields; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +this.bytesUsed.addAndGet(SELF_SHALLOW_SIZE); +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +fields = new String[] {initialValue.term.field}; +bytesUsed.addAndGet(sizeOfString(initialValue.term.field)); +docsUpTo = new int[] {docUpTo}; +if (initialValue.hasValue == false) { + hasValues = new FixedBitSet(1); + bytesUsed.addAndGet(hasValues.ramBytesUsed()); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? null : new BytesRefArray(bytesUsed); + } + + private long sizeOfString(String string) { +return STRING_SHALLOW_SIZE + (string.length() * Character.BYTES); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue()) { + numericValues = new long[] {initialValue.getValue()}; +} else { + numericValues = new long[] {0}; +} +bytesUsed.addAndGet(Long.BYTES); + } + + FieldUpdatesBuffer(Counter
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239115033

--- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the
+ * same value for a numeric field we only store the value once.
+ */
+final class FieldUpdatesBuffer {
+  private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class);
+  private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOf(String.class);
--- End diff --

I think you meant `shallowSizeOfInstance` here too?
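The comment above presumably concerns the difference between measuring a given object and measuring instances of a given class: a method that takes `Object` happily accepts `String.class`, but the argument is then the `Class` object itself rather than a `String`. The sketch below illustrates only that overload pitfall, using two hypothetical helper methods (`measuredType` and `measuredTypeOfInstance`) that are stand-ins and do not call any Lucene API.

```java
/** Hypothetical illustration of the Object-vs-Class parameter pitfall. */
final class ShallowSizePitfall {

  /** Stand-in for an estimator that inspects the object it is handed. */
  static String measuredType(Object instance) {
    return instance.getClass().getName();
  }

  /** Stand-in for an estimator that is told which class's instances to measure. */
  static String measuredTypeOfInstance(Class<?> clazz) {
    return clazz.getName();
  }

  public static void main(String[] args) {
    // Passing String.class as an Object measures the Class object itself
    System.out.println(measuredType(String.class));           // java.lang.Class
    // Passing it as a Class<?> targets String instances, which is the intent
    System.out.println(measuredTypeOfInstance(String.class)); // java.lang.String
  }
}
```

Both calls compile without warnings, which is why this kind of mix-up tends to be caught in review rather than by the compiler.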
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239122170 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. 
Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. + * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the + * same value for a numeric field we only store the value once. + */ +final class FieldUpdatesBuffer { + private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class); + private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOf(String.class); + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. 
+ private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo; + private long[] numericValues; // this will be null if we are buffering binaries + private FixedBitSet hasValues; + private String[] fields; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +this.bytesUsed.addAndGet(SELF_SHALLOW_SIZE); +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +fields = new String[] {initialValue.term.field}; +bytesUsed.addAndGet(sizeOfString(initialValue.term.field)); +docsUpTo = new int[] {docUpTo}; +if (initialValue.hasValue == false) { + hasValues = new FixedBitSet(1); + bytesUsed.addAndGet(hasValues.ramBytesUsed()); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? null : new BytesRefArray(bytesUsed); + } + + private long sizeOfString(String string) { +return STRING_SHALLOW_SIZE + (string.length() * Character.BYTES); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue()) { + numericValues = new long[] {initialValue.getValue()}; +} else { + numericValues = new long[] {0}; +} +bytesUsed.addAndGet(Long.BYTES); + } + + FieldUpdatesBuffer(Counter
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user s1monw commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239062234 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; + private long[] numericValues; // this will be null if we are buffering binaries + private boolean[] hasValues; + private String[] fields; + private final String firstField; + private final boolean firstHasValue; + private long firstNumericValue; + private final int firstDocUpTo; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +firstField = initialValue.term.field; +firstDocUpTo = docUpTo; +firstHasValue = initialValue.hasValue; +if (firstHasValue == false) { + hasValues = new boolean[] {false}; + bytesUsed.addAndGet(1); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? 
null : new BytesRefArray(bytesUsed); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue) { + firstNumericValue = initialValue.getValue(); +} + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, false); +if (initialValue.hasValue()) { + byteValues.append(initialValue.getValue()); +} + } + + void add(String field, int docUpTo, int ord, boolean hasValue) { +if (this.firstField.equals(field) == false || fields != null) { + if (fields == null) { +int
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user s1monw commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239062190 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; + private long[] numericValues; // this will be null if we are buffering binaries + private boolean[] hasValues; + private String[] fields; + private final String firstField; + private final boolean firstHasValue; + private long firstNumericValue; + private final int firstDocUpTo; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +firstField = initialValue.term.field; +firstDocUpTo = docUpTo; +firstHasValue = initialValue.hasValue; +if (firstHasValue == false) { + hasValues = new boolean[] {false}; + bytesUsed.addAndGet(1); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? 
null : new BytesRefArray(bytesUsed); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue) { + firstNumericValue = initialValue.getValue(); +} + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, false); +if (initialValue.hasValue()) { + byteValues.append(initialValue.getValue()); +} + } + + void add(String field, int docUpTo, int ord, boolean hasValue) { +if (this.firstField.equals(field) == false || fields != null) { + if (fields == null) { +int
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user s1monw commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239036539 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; + private long[] numericValues; // this will be null if we are buffering binaries + private boolean[] hasValues; --- End diff -- Yeah, so I wasn't sure. I can explore it.
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user s1monw commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239036461 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { --- End diff -- I will look into this!
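The class comment quoted above describes keeping per-update metadata (field, docUpTo, hasValue) in constant space while every update shares the first value. A minimal sketch of that idea follows, assuming a hypothetical `LazyIntColumn` class that is not part of the pull request: it stores the first value once and only allocates an array when a later value differs.

```java
import java.util.Arrays;

// Hypothetical illustration, not the FieldUpdatesBuffer code from the PR:
// one int per update, but no array while every entry equals the first one.
final class LazyIntColumn {
  private final int first;
  private int[] values; // stays null while all entries share `first`
  private int size = 1; // the first value counts as entry 0

  LazyIntColumn(int first) {
    this.first = first;
  }

  void add(int value) {
    if (values == null && value != first) {
      // First divergence: allocate and back-fill the entries seen so far.
      values = new int[Math.max(8, size + 1)];
      Arrays.fill(values, 0, size, first);
    }
    if (values != null) {
      if (size == values.length) {
        values = Arrays.copyOf(values, size * 2);
      }
      values[size] = value;
    }
    size++;
  }

  int get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    return values == null ? first : values[index];
  }

  boolean isShared() {
    return values == null;
  }

  int size() {
    return size;
  }
}
```

With `docUpTo=Integer.MAX_VALUE` on every update, such a column would never allocate; a single differing value switches it to one array slot per update.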
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user s1monw commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239036263 --- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java --- @@ -288,15 +186,24 @@ void clear() { deleteTerms.clear(); deleteQueries.clear(); deleteDocIDs.clear(); -numericUpdates.clear(); -binaryUpdates.clear(); numTermDeletes.set(0); -numNumericUpdates.set(0); -numBinaryUpdates.set(0); -bytesUsed.set(0); +numFieldUpdates.set(0); +fieldUpdates.clear(); +bytesUsed.addAndGet(-bytesUsed.get()); --- End diff -- Agreed, I will open a follow-up.
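The hunk above replaces `bytesUsed.set(0)` with `bytesUsed.addAndGet(-bytesUsed.get())`. A minimal sketch of that reset pattern follows, assuming a counter type that exposes only `addAndGet` and `get`; `ByteCounter` and `CounterReset` are hypothetical stand-ins, not Lucene classes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counter with no set(long) method, mirroring the shape used in the diff.
final class ByteCounter {
  private final AtomicLong count = new AtomicLong();

  long addAndGet(long delta) {
    return count.addAndGet(delta);
  }

  long get() {
    return count.get();
  }
}

final class CounterReset {
  // Resets the counter by subtracting its current value.
  // The read and the subtraction are two separate steps, so this only
  // yields zero when no other thread updates the counter in between.
  static void reset(ByteCounter counter) {
    counter.addAndGet(-counter.get());
  }
}
```

Under single-threaded access the result matches `set(0)`; with concurrent writers the counter can end up non-zero after the call.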
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user mikemccand commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239027156 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; + private long[] numericValues; // this will be null if we are buffering binaries + private boolean[] hasValues; + private String[] fields; + private final String firstField; + private final boolean firstHasValue; + private long firstNumericValue; + private final int firstDocUpTo; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +firstField = initialValue.term.field; +firstDocUpTo = docUpTo; +firstHasValue = initialValue.hasValue; +if (firstHasValue == false) { + hasValues = new boolean[] {false}; + bytesUsed.addAndGet(1); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? 
null : new BytesRefArray(bytesUsed); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue) { + firstNumericValue = initialValue.getValue(); +} + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, false); +if (initialValue.hasValue()) { + byteValues.append(initialValue.getValue()); +} + } + + void add(String field, int docUpTo, int ord, boolean hasValue) { +if (this.firstField.equals(field) == false || fields != null) { + if (fields == null) { +
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user mikemccand commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239025447 --- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java --- @@ -42,7 +41,7 @@ // instance on DocumentWriterPerThread, or via sync'd code by // DocumentsWriterDeleteQueue -class BufferedUpdates { +class BufferedUpdates implements Accountable { --- End diff -- Ha! Despite all the crazy accounting we were doing here we didn't implement this before :)
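For context on what implementing an accounting interface involves, here is a minimal sketch. It assumes an interface with a single `ramBytesUsed()` method, modeled by a local stand-in `RamAccountable` rather than Lucene's own `Accountable`; `TrackedBuffer` and its byte counting are hypothetical and much simpler than the accounting in `BufferedUpdates`.

```java
// Hypothetical stand-in for an accounting interface; not the Lucene type.
interface RamAccountable {
  long ramBytesUsed();
}

// A buffer that already tracks its own byte usage only needs to expose it.
final class TrackedBuffer implements RamAccountable {
  private long bytesUsed; // running total maintained on every mutation
  private int count;

  void addTerm(byte[] term) {
    bytesUsed += term.length; // simplified: ignores object headers and array overhead
    count++;
  }

  void clear() {
    bytesUsed = 0;
    count = 0;
  }

  int size() {
    return count;
  }

  @Override
  public long ramBytesUsed() {
    return bytesUsed;
  }
}
```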
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user mikemccand commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239027692 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; + private long[] numericValues; // this will be null if we are buffering binaries + private boolean[] hasValues; + private String[] fields; + private final String firstField; + private final boolean firstHasValue; + private long firstNumericValue; + private final int firstDocUpTo; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +firstField = initialValue.term.field; +firstDocUpTo = docUpTo; +firstHasValue = initialValue.hasValue; +if (firstHasValue == false) { + hasValues = new boolean[] {false}; + bytesUsed.addAndGet(1); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? 
null : new BytesRefArray(bytesUsed); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue) { + firstNumericValue = initialValue.getValue(); +} + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, false); +if (initialValue.hasValue()) { + byteValues.append(initialValue.getValue()); +} + } + + void add(String field, int docUpTo, int ord, boolean hasValue) { +if (this.firstField.equals(field) == false || fields != null) { + if (fields == null) { +
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user mikemccand commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239027983 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; + private long[] numericValues; // this will be null if we are buffering binaries + private boolean[] hasValues; + private String[] fields; + private final String firstField; + private final boolean firstHasValue; + private long firstNumericValue; + private final int firstDocUpTo; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +firstField = initialValue.term.field; +firstDocUpTo = docUpTo; +firstHasValue = initialValue.hasValue; +if (firstHasValue == false) { + hasValues = new boolean[] {false}; + bytesUsed.addAndGet(1); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? 
null : new BytesRefArray(bytesUsed); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue) { + firstNumericValue = initialValue.getValue(); +} + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, false); +if (initialValue.hasValue()) { + byteValues.append(initialValue.getValue()); +} + } + + void add(String field, int docUpTo, int ord, boolean hasValue) { +if (this.firstField.equals(field) == false || fields != null) { + if (fields == null) { +
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user mikemccand commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239025745 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; --- End diff -- An explicit `null` initializer is not needed in Java.
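The point in the comment above can be checked with a small sketch: instance fields of reference type default to `null`, so the explicit initializer changes nothing. `DefaultInit` is a hypothetical class used only for illustration.

```java
// Hypothetical illustration of default field initialization in Java.
final class DefaultInit {
  private int[] implicitlyNull;        // no initializer: defaults to null
  private int[] explicitlyNull = null; // redundant, same resulting state

  static boolean bothNull() {
    DefaultInit d = new DefaultInit();
    return d.implicitlyNull == null && d.explicitlyNull == null;
  }
}
```

This applies to fields only; a local variable has no default and must be assigned before use.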
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user mikemccand commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239026711 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; + private long[] numericValues; // this will be null if we are buffering binaries + private boolean[] hasValues; + private String[] fields; + private final String firstField; + private final boolean firstHasValue; + private long firstNumericValue; + private final int firstDocUpTo; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +firstField = initialValue.term.field; +firstDocUpTo = docUpTo; +firstHasValue = initialValue.hasValue; +if (firstHasValue == false) { + hasValues = new boolean[] {false}; + bytesUsed.addAndGet(1); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? 
null : new BytesRefArray(bytesUsed); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue) { + firstNumericValue = initialValue.getValue(); +} + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, false); +if (initialValue.hasValue()) { + byteValues.append(initialValue.getValue()); +} + } + + void add(String field, int docUpTo, int ord, boolean hasValue) { +if (this.firstField.equals(field) == false || fields != null) { + if (fields == null) { +
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r238792876 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+ private final Counter bytesUsed;
+ private int numUpdates = 1;
+ // we use a very simple approach and store the update term values without de-duplication
+ // which is also not a common case to keep updating the same value more than once...
+ // we might pay a higher price in terms of memory in certain cases but will gain
+ // on CPU for those. We also save on not needing to sort in order to apply the terms in order
+ // since by definition we store them in order.
+ private final BytesRefArray termValues;
+ private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+ private int[] docsUpTo = null;
+ private long[] numericValues; // this will be null if we are buffering binaries
+ private boolean[] hasValues;
--- End diff --
Would a bitset be better?
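To make the trade-off behind this question concrete: a `boolean[]` typically spends one byte per flag on common JVMs, while a bitset packs 64 flags into each `long`. The sketch below is only an illustration and is not code from the PR. It uses `java.util.BitSet` as a stand-in for whatever bitset the patch might adopt, and the class name `HasValueFlags` and its methods are hypothetical.

```java
import java.util.BitSet;

// Hypothetical illustration, not code from the PR: tracks which updates
// carry a value using one bit per update instead of one boolean per update.
class HasValueFlags {
  private final BitSet bits = new BitSet();
  private int size = 0;

  // Records the flag for the next update, in insertion order.
  void append(boolean hasValue) {
    if (hasValue) {
      bits.set(size);
    }
    size++;
  }

  boolean get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index=" + index + " size=" + size);
    }
    return bits.get(index);
  }

  int size() {
    return size;
  }

  // Approximate payload in bytes: one long per 64 flags (object headers ignored).
  long approxBytes() {
    return ((size + 63L) / 64L) * Long.BYTES;
  }
}
```

For 100 buffered updates this payload estimate is 16 bytes, against roughly 100 bytes for a `boolean[]`, at the cost of a shift and mask on every read.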
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user jpountz commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r239014808 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
--- End diff --
This would be a bit easier to read for me if you introduced tiny abstractions around an (initially null) array + a default value, as all conditions make the code a bit hard to follow.
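One way to read this suggestion is as a small helper that hides the "null array means every entry equals the first value" convention behind `append` and `get`. The following is a minimal sketch under that assumption. The class `IntsWithDefault` and all of its members are hypothetical and do not appear in the PR.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the PR: an int sequence that stays in
// constant space while every appended value equals the first one, and only
// allocates a backing array once a different value shows up.
class IntsWithDefault {
  private final int first;
  private int[] values; // null while all values equal `first`
  private int size = 1; // the first value counts as element 0

  IntsWithDefault(int first) {
    this.first = first;
  }

  void append(int value) {
    if (values == null && value != first) {
      // First deviation: materialize the array and back-fill the shared value.
      values = new int[Math.max(8, size + 1)];
      Arrays.fill(values, 0, size, first);
    }
    if (values != null) {
      if (size == values.length) {
        values = Arrays.copyOf(values, size * 2);
      }
      values[size] = value;
    }
    size++;
  }

  int get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index=" + index + " size=" + size);
    }
    return values == null ? first : values[index];
  }

  int size() {
    return size;
  }

  // True while no backing array has been allocated.
  boolean isConstant() {
    return values == null;
  }
}
```

With a helper of this shape, callers would no longer need to branch on whether the array has been allocated. The same pattern could plausibly be repeated for `long`, `boolean` and `String` sequences.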
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user shaie commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r238938229 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo. + * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case + * where all values for a specific field is shared this also stores numeric values only once if all updates share + * the same value. + */ +final class FieldUpdatesBuffer { + private final Counter bytesUsed; + private int numUpdates = 1; + // we use a very simple approach and store the update term values without de-duplication + // which is also not a common case to keep updating the same value more than once... + // we might pay a higher price in terms of memory in certain cases but will gain + // on CPU for those. We also save on not needing to sort in order to apply the terms in order + // since by definition we store them in order. + private final BytesRefArray termValues; + private final BytesRefArray byteValues; // this will be null if we are buffering numerics + private int[] docsUpTo = null; + private long[] numericValues; // this will be null if we are buffering binaries + private boolean[] hasValues; + private String[] fields; + private final String firstField; + private final boolean firstHasValue; + private long firstNumericValue; + private final int firstDocUpTo; + private final boolean isNumeric; + + private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) { +this.bytesUsed = bytesUsed; +termValues = new BytesRefArray(bytesUsed); +termValues.append(initialValue.term.bytes); +firstField = initialValue.term.field; +firstDocUpTo = docUpTo; +firstHasValue = initialValue.hasValue; +if (firstHasValue == false) { + hasValues = new boolean[] {false}; + bytesUsed.addAndGet(1); +} +this.isNumeric = isNumeric; +byteValues = isNumeric ? 
null : new BytesRefArray(bytesUsed); + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, true); +if (initialValue.hasValue) { + firstNumericValue = initialValue.getValue(); +} + } + + FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) { +this(bytesUsed, initialValue, docUpTo, false); +if (initialValue.hasValue()) { + byteValues.append(initialValue.getValue()); +} + } + + void add(String field, int docUpTo, int ord, boolean hasValue) { +if (this.firstField.equals(field) == false || fields != null) { + if (fields == null) { +int
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user shaie commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/513#discussion_r238937830 --- Diff: lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.Arrays; + +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefArray; +import org.apache.lucene.util.BytesRefIterator; +import org.apache.lucene.util.Counter; +import org.apache.lucene.util.RamUsageEstimator; + +/** + * This class efficiently buffers numeric and binary field updates and stores + * terms, values and metadata in a memory efficient way without creating large amounts + * of objects. Update terms are stored without de-duplicating the update term. + * In general we try to optimize for several use-cases. For instance we try to use constant + * space for update terms field since the common case always updates on the same field. Also for docUpTo + * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE. 
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
--- End diff --
The last sentence about "soft_deletes" reads as too cumbersome to me. Can we just say "Lastly, if all updates share the same value for a field, the value is stored only once."? Or something along those lines.
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
Github user shaie commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/513#discussion_r238937433
--- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java ---
@@ -288,15 +186,24 @@ void clear() {
 deleteTerms.clear();
 deleteQueries.clear();
 deleteDocIDs.clear();
-numericUpdates.clear();
-binaryUpdates.clear();
 numTermDeletes.set(0);
-numNumericUpdates.set(0);
-numBinaryUpdates.set(0);
-bytesUsed.set(0);
+numFieldUpdates.set(0);
+fieldUpdates.clear();
+bytesUsed.addAndGet(-bytesUsed.get());
--- End diff --
Unrelated to this PR, but it feels like `Counter` could expose a `reset()` method and internally set the value to 0, instead of everyone doing this juggling.
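The "juggling" refers to zeroing a counter by adding the negation of its current value. A rough sketch of the suggested convenience follows. Note that `ResettableCounter` is a hypothetical stand-in built on `AtomicLong`, not Lucene's `Counter`, and the `reset()` method shown is an assumption about what such an API could look like.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a byte counter; this is NOT Lucene's Counter class.
class ResettableCounter {
  private final AtomicLong count = new AtomicLong();

  long addAndGet(long delta) {
    return count.addAndGet(delta);
  }

  long get() {
    return count.get();
  }

  // The convenience suggested in the review: set the value to 0 in one step
  // and return the value that was held just before the reset.
  long reset() {
    return count.getAndSet(0L);
  }
}
```

In this sketch `reset()` is a single atomic operation, whereas `addAndGet(-get())` is a read followed by a separate write, so a concurrent update between the two could leave a non-zero residue. Whether that matters depends on how the counter is shared.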
[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...
GitHub user s1monw opened a pull request:
https://github.com/apache/lucene-solr/pull/513

LUCENE-8590: Optimize DocValues update datastructures

Today we are using a LinkedHashMap to buffer doc-values updates in BufferedUpdates. On the one hand this uses an object-based data structure, and on the other it requires re-encoding the data into a more compact representation once the BufferedUpdates are frozen. This change uses a more compact representation for the updates already in the BufferedUpdates, in a parallel-array-like data structure that can be reused in FrozenBufferedDeletes. It also adds a much simpler-to-use API to consume the updates and allows for internal memory optimizations for common-case updates.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/s1monw/lucene-solr improve_buffered_updates

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucene-solr/pull/513.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #513

commit 6646d1ce1227fae937a2d9560afcc973baa84243
Author: Simon Willnauer
Date: 2018-12-04T16:36:46Z

LUCENE-8590: Optimize DocValues update datastructures

Today we are using a LinkedHashMap to buffer doc-values updates in BufferedUpdates. On the one hand this uses an object-based data structure, and on the other it requires re-encoding the data into a more compact representation once the BufferedUpdates are frozen. This change uses a more compact representation for the updates already in the BufferedUpdates, in a parallel-array-like data structure that can be reused in FrozenBufferedDeletes. It also adds a much simpler-to-use API to consume the updates and allows for internal memory optimizations for common-case updates.
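As a rough illustration of what "parallel-array like" means here, the sketch below buffers updates in primitive arrays kept in insertion order, so no per-update object is allocated as it would be for entries in a map. This is a simplified assumption for illustration only: `ParallelUpdateBuffer` and its methods are hypothetical, and the PR's actual class additionally stores terms, fields and binary values.

```java
import java.util.Arrays;

// Hypothetical sketch, not the PR's FieldUpdatesBuffer: keeps (docUpTo, value)
// pairs in two primitive arrays that share the same index per update.
class ParallelUpdateBuffer {
  private int[] docsUpTo = new int[4];
  private long[] values = new long[4];
  private int size = 0;

  // Appends one update; both arrays grow together so indices stay aligned.
  void add(int docUpTo, long value) {
    if (size == docsUpTo.length) {
      int newLength = size * 2;
      docsUpTo = Arrays.copyOf(docsUpTo, newLength);
      values = Arrays.copyOf(values, newLength);
    }
    docsUpTo[size] = docUpTo;
    values[size] = value;
    size++;
  }

  int size() {
    return size;
  }

  int docUpTo(int index) {
    return docsUpTo[checkIndex(index)];
  }

  long value(int index) {
    return values[checkIndex(index)];
  }

  private int checkIndex(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index=" + index + " size=" + size);
    }
    return index;
  }
}
```

Because the updates already sit in compact arrays in the order they arrived, a frozen view could in principle reuse them directly instead of re-encoding, which is the benefit the description claims for the new layout.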