[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/lucene-solr/pull/513


---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread s1monw
Github user s1monw commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239136654
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the
+ * same value for a numeric field we only store the value once.
+ */
+final class FieldUpdatesBuffer {
+  private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class);
+  private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(String.class);
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values without de-duplication
+  // which is also not a common case to keep updating the same value more than once...
+  // we might pay a higher price in terms of memory in certain cases but will gain
+  // on CPU for those. We also save on not needing to sort in order to apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+  private int[] docsUpTo;
+  private long[] numericValues; // this will be null if we are buffering binaries
+  private FixedBitSet hasValues;
+  private String[] fields;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) {
+    this.bytesUsed = bytesUsed;
+    this.bytesUsed.addAndGet(SELF_SHALLOW_SIZE);
+    termValues = new BytesRefArray(bytesUsed);
+    termValues.append(initialValue.term.bytes);
+    fields = new String[] {initialValue.term.field};
+    bytesUsed.addAndGet(sizeOfString(initialValue.term.field));
+    docsUpTo = new int[] {docUpTo};
+    if (initialValue.hasValue == false) {
+      hasValues = new FixedBitSet(1);
+      bytesUsed.addAndGet(hasValues.ramBytesUsed());
+    }
+    this.isNumeric = isNumeric;
+    byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  private static long sizeOfString(String string) {
+    return STRING_SHALLOW_SIZE + (string.length() * Character.BYTES);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+    this(bytesUsed, initialValue, docUpTo, true);
+    if (initialValue.hasValue()) {
+      numericValues = new long[] {initialValue.getValue()};
+    } else {
+      numericValues = new long[] {0};
+    }
+    bytesUsed.addAndGet(Long.BYTES);
+  }
+
+  
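
The class Javadoc quoted above says the buffer tries to use constant space for the update field name while every update targets the same field. A minimal sketch of that idea follows, assuming a single-element array stands for "all updates share this field" and is expanded only when a different field first shows up. `FieldNamesSketch`, `add`, `fieldAt` and `slots` are hypothetical names for illustration and are not taken from the pull request.

```java
import java.util.Arrays;

// Hypothetical sketch, not the PR's FieldUpdatesBuffer.
final class FieldNamesSketch {
  // Length 1 means: every update so far used fields[0].
  private String[] fields;
  private int numUpdates = 1;

  FieldNamesSketch(String initialField) {
    fields = new String[] {initialField};
  }

  void add(String field) {
    int ord = numUpdates++;
    if (fields.length == 1) {
      if (fields[0].equals(field)) {
        return; // common case: nothing extra to store
      }
      // First differing field: switch to one slot per update and back-fill.
      String shared = fields[0];
      fields = new String[ord + 1];
      Arrays.fill(fields, 0, ord, shared);
    } else if (ord >= fields.length) {
      fields = Arrays.copyOf(fields, fields.length * 2);
    }
    fields[ord] = field;
  }

  String fieldAt(int ord) {
    return fields.length == 1 ? fields[0] : fields[ord];
  }

  int slots() {
    return fields.length;
  }
}
```

As long as callers only ever update one field, `slots()` stays at 1 regardless of how many updates are buffered; the per-update array is paid for only once a second field appears.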

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239127024
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,280 @@

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239123489
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, if all updates share the
+ * same value for a numeric field we only store the value once.
+ */
+final class FieldUpdatesBuffer {
+  private static final long SELF_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOfInstance(FieldUpdatesBuffer.class);
+  private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOf(String.class);
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values without de-duplication
+  // which is also not a common case to keep updating the same value more than once...
+  // we might pay a higher price in terms of memory in certain cases but will gain
+  // on CPU for those. We also save on not needing to sort in order to apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+  private int[] docsUpTo;
+  private long[] numericValues; // this will be null if we are buffering binaries
+  private FixedBitSet hasValues;
+  private String[] fields;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) {
+    this.bytesUsed = bytesUsed;
+    this.bytesUsed.addAndGet(SELF_SHALLOW_SIZE);
+    termValues = new BytesRefArray(bytesUsed);
+    termValues.append(initialValue.term.bytes);
+    fields = new String[] {initialValue.term.field};
+    bytesUsed.addAndGet(sizeOfString(initialValue.term.field));
+    docsUpTo = new int[] {docUpTo};
+    if (initialValue.hasValue == false) {
+      hasValues = new FixedBitSet(1);
+      bytesUsed.addAndGet(hasValues.ramBytesUsed());
+    }
+    this.isNumeric = isNumeric;
+    byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  private long sizeOfString(String string) {
+    return STRING_SHALLOW_SIZE + (string.length() * Character.BYTES);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+    this(bytesUsed, initialValue, docUpTo, true);
+    if (initialValue.hasValue()) {
+      numericValues = new long[] {initialValue.getValue()};
+    } else {
+      numericValues = new long[] {0};
+    }
+    bytesUsed.addAndGet(Long.BYTES);
+  }
+
+  FieldUpdatesBuffer(Counter 

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239116677
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,281 @@
+  private long sizeOfString(String string) {
--- End diff --

static?
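
The suggestion can be sketched as follows: the helper reads no instance state, so marking it `static` makes that explicit. This is only an illustration; `StringSizeSketch` is a hypothetical class, and the constant 24 is an assumed placeholder for the value the PR obtains from `RamUsageEstimator`. The `(long)` cast is also an addition of this sketch, used to keep the multiplication in 64-bit arithmetic.

```java
// Hypothetical sketch of a static size helper; not the PR's code.
final class StringSizeSketch {
  // Assumed placeholder value; the PR computes this via RamUsageEstimator.
  static final long STRING_SHALLOW_SIZE = 24;

  // No instance fields are read here, so the method can be static.
  static long sizeOfString(String string) {
    return STRING_SHALLOW_SIZE + ((long) string.length() * Character.BYTES);
  }
}
```

With the assumed constant, `StringSizeSketch.sizeOfString("id")` evaluates to 24 + 2 * 2 = 28.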





[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239114521
  
--- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java ---
@@ -288,15 +184,24 @@ void clear() {
     deleteTerms.clear();
     deleteQueries.clear();
     deleteDocIDs.clear();
-    numericUpdates.clear();
-    binaryUpdates.clear();
     numTermDeletes.set(0);
-    numNumericUpdates.set(0);
-    numBinaryUpdates.set(0);
-    bytesUsed.set(0);
+    numFieldUpdates.set(0);
+    fieldUpdates.clear();
+    bytesUsed.addAndGet(-bytesUsed.get());
+    fieldUpdatesBytesUsed.addAndGet(-fieldUpdatesBytesUsed.get());
   }
   
   boolean any() {
-    return deleteTerms.size() > 0 || deleteDocIDs.size() > 0 || deleteQueries.size() > 0 || numericUpdates.size() > 0 || binaryUpdates.size() > 0;
+    return deleteTerms.size() > 0 || deleteDocIDs.size() > 0 || deleteQueries.size() > 0 || numFieldUpdates.get() > 0;
+  }
+
+  @Override
+  public long ramBytesUsed() {
+    return bytesUsed.get() + fieldUpdatesBytesUsed.get();
+  }
+
+  public void clearDeletedDocIds() {
--- End diff --

no need for the public modifier?
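
A rough sketch of the bookkeeping in the hunk above, with the method in question left package-private as the comment suggests. It assumes the enclosing class is only used from its own package. `BufferedUpdatesSketch`, `addDocID`, `addFieldUpdateBytes` and `BYTES_PER_DOC_ID` are hypothetical, `AtomicLong` stands in for Lucene's `Counter`, and the body of `clearDeletedDocIds` is a guess, since the diff does not show it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the accounting shown in the diff.
final class BufferedUpdatesSketch {
  // Simplification: ignores list and boxing overhead.
  static final int BYTES_PER_DOC_ID = Integer.BYTES;

  private final AtomicLong bytesUsed = new AtomicLong();
  private final AtomicLong fieldUpdatesBytesUsed = new AtomicLong();
  private final List<Integer> deleteDocIDs = new ArrayList<>();

  void addDocID(int docID) {
    deleteDocIDs.add(docID);
    bytesUsed.addAndGet(BYTES_PER_DOC_ID);
  }

  void addFieldUpdateBytes(long bytes) {
    fieldUpdatesBytesUsed.addAndGet(bytes);
  }

  // Package-private on purpose: no caller outside the package needs it.
  void clearDeletedDocIds() {
    long freed = (long) deleteDocIDs.size() * BYTES_PER_DOC_ID;
    bytesUsed.addAndGet(-freed);
    deleteDocIDs.clear();
  }

  void clear() {
    deleteDocIDs.clear();
    // Reset by subtracting the current value, as in the diff.
    bytesUsed.addAndGet(-bytesUsed.get());
    fieldUpdatesBytesUsed.addAndGet(-fieldUpdatesBytesUsed.get());
  }

  long ramBytesUsed() {
    return bytesUsed.get() + fieldUpdatesBytesUsed.get();
  }
}
```

Keeping the two counters separate lets field-update memory be tracked on its own while `ramBytesUsed()` still reports the combined figure.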





[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239121847
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,281 @@

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239115033
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,281 @@
+  private static final long STRING_SHALLOW_SIZE = RamUsageEstimator.shallowSizeOf(String.class);
--- End diff --

I think you meant `shallowSizeOfInstance` here too?
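
The difference matters because `shallowSizeOf(Object)` measures the object it is handed, so passing `String.class` measures a `java.lang.Class` object rather than a `String` instance. The sketch below mimics the two overload shapes without depending on Lucene; `ShallowSizeSketch`, `measuredType` and `measuredInstanceType` are hypothetical names that only report which type would be measured.

```java
// Hypothetical illustration of the overload mix-up; does not call Lucene.
final class ShallowSizeSketch {
  // Shaped like shallowSizeOf(Object): looks at the object passed in.
  static String measuredType(Object instance) {
    return instance.getClass().getName();
  }

  // Shaped like shallowSizeOfInstance(Class): looks at instances of clazz.
  static String measuredInstanceType(Class<?> clazz) {
    return clazz.getName();
  }
}
```

Here `measuredType(String.class)` reports `java.lang.Class`, while `measuredInstanceType(String.class)` reports `java.lang.String`, which is the type the constant is meant to describe.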





[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239122170
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,281 @@

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread s1monw
Github user s1monw commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239062234
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values 
without de-duplication
+  // which is also not a common case to keep updating the same value more 
than once...
+  // we might pay a higher price in terms of memory in certain cases but 
will gain
+  // on CPU for those. We also save on not needing to sort in order to 
apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are 
buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering 
binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate 
initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+  if (fields == null) {
+int 

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread s1monw
Github user s1monw commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239062190
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values 
without de-duplication
+  // which is also not a common case to keep updating the same value more 
than once...
+  // we might pay a higher price in terms of memory in certain cases but 
will gain
+  // on CPU for those. We also save on not needing to sort in order to 
apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are 
buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering 
binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate 
initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+  if (fields == null) {
+int 

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread s1monw
Github user s1monw commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239036539
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values 
without de-duplication
+  // which is also not a common case to keep updating the same value more 
than once...
+  // we might pay a higher price in terms of memory in certain cases but 
will gain
+  // on CPU for those. We also save on not needing to sort in order to 
apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are 
buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering 
binaries
+  private boolean[] hasValues;
--- End diff --

Yeah, I wasn't sure about that. I can explore it.
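
For context: the field under discussion is declared as `boolean[] hasValues` in this revision, while the 280-line revision quoted elsewhere in this thread declares it as `FixedBitSet`. The sketch below only illustrates that general trade-off, one bit per flag instead of one array element per flag. It is not Lucene code: `HasValueFlags` is a hypothetical name, and `java.util.BitSet` is used as a stand-in for `FixedBitSet` so the example runs without Lucene on the classpath.

```java
import java.util.BitSet;

// Hypothetical stand-in for the per-update "has a value" flags; not Lucene code.
final class HasValueFlags {
  // One bit per update, packed into long words by BitSet.
  private final BitSet bits = new BitSet();
  // Number of flags recorded so far (BitSet alone cannot tell trailing false bits apart).
  private int size;

  // Records the flag for the next update, in arrival order.
  void add(boolean hasValue) {
    if (hasValue) {
      bits.set(size);
    }
    size++;
  }

  // Returns the flag recorded for the update with the given ordinal.
  boolean get(int ord) {
    if (ord < 0 || ord >= size) {
      throw new IndexOutOfBoundsException("ord=" + ord + " size=" + size);
    }
    return bits.get(ord);
  }

  int size() {
    return size;
  }

  public static void main(String[] args) {
    HasValueFlags flags = new HasValueFlags();
    flags.add(true);
    flags.add(false);
    flags.add(true);
    System.out.println(flags.get(0) + " " + flags.get(1) + " " + flags.get(2));
  }
}
```

Whether the saving matters depends on how many updates are buffered; for a handful of entries the object overhead of the bit set can outweigh the packed storage.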





[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread s1monw
Github user s1monw commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239036461
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
--- End diff --

I will look into this!
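
The Javadoc quoted above says the buffer tries to use constant space for the update term's field when every update targets the same field. A minimal sketch of that idea follows, assuming a single-slot array that is expanded only when the first differing value arrives. `SharedOrPerEntry` and its methods are hypothetical names for illustration and are not taken from the pull request.

```java
import java.util.Arrays;

// Hypothetical sketch: one slot while every entry is equal, one slot per entry otherwise.
final class SharedOrPerEntry {
  // Length 1 means "all entries so far share values[0]".
  // After expansion the length is always at least 4, so the two states cannot be confused.
  private String[] values;
  private int count = 1;

  SharedOrPerEntry(String first) {
    values = new String[] {first};
  }

  // Appends the value for the next entry, in arrival order.
  void add(String value) {
    int ord = count++;
    if (values.length == 1) {
      if (values[0].equals(value)) {
        return; // still shared: nothing to store, space stays constant
      }
      // First differing value: materialize one slot per entry seen so far.
      String shared = values[0];
      values = new String[Math.max(ord + 1, 4)];
      Arrays.fill(values, 0, ord, shared);
    } else if (ord == values.length) {
      values = Arrays.copyOf(values, values.length * 2);
    }
    values[ord] = value;
  }

  // Returns the value recorded for the entry with the given ordinal.
  String get(int ord) {
    if (ord < 0 || ord >= count) {
      throw new IndexOutOfBoundsException("ord=" + ord + " count=" + count);
    }
    return values.length == 1 ? values[0] : values[ord];
  }

  boolean isShared() {
    return values.length == 1;
  }

  public static void main(String[] args) {
    SharedOrPerEntry fields = new SharedOrPerEntry("id");
    fields.add("id");
    System.out.println(fields.isShared()); // still one slot
    fields.add("version");
    System.out.println(fields.get(0) + " " + fields.get(1) + " " + fields.get(2));
  }
}
```

The same pattern would apply to `docUpTo` and to numeric values that are identical across updates, which is the soft-deletes case the Javadoc mentions.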





[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread s1monw
Github user s1monw commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239036263
  
--- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java ---
@@ -288,15 +186,24 @@ void clear() {
 deleteTerms.clear();
 deleteQueries.clear();
 deleteDocIDs.clear();
-numericUpdates.clear();
-binaryUpdates.clear();
 numTermDeletes.set(0);
-numNumericUpdates.set(0);
-numBinaryUpdates.set(0);
-bytesUsed.set(0);
+numFieldUpdates.set(0);
+fieldUpdates.clear();
+bytesUsed.addAndGet(-bytesUsed.get());
--- End diff --

Agreed, I will open a follow-up.
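
For readers following along, the last added line of the hunk resets the counter by adding the negation of its current value, since only `addAndGet` and `get` are called on it here. The sketch below shows that idiom using `java.util.concurrent.atomic.AtomicLong` as a stand-in so it runs without Lucene; `ResetByDelta` is a hypothetical name, and what the follow-up will change is not stated in this thread.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of resetting a counter that is only driven via addAndGet and get.
final class ResetByDelta {
  // Brings the counter back to zero and returns the new value.
  // The read and the add are two separate steps, so this is only safe when
  // no other thread modifies the counter in between.
  static long reset(AtomicLong counter) {
    return counter.addAndGet(-counter.get());
  }

  public static void main(String[] args) {
    AtomicLong bytesUsed = new AtomicLong(4096);
    System.out.println(reset(bytesUsed));
  }
}
```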





[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239027156
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values 
without de-duplication
+  // which is also not a common case to keep updating the same value more 
than once...
+  // we might pay a higher price in terms of memory in certain cases but 
will gain
+  // on CPU for those. We also save on not needing to sort in order to 
apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are 
buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering 
binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate 
initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+  if (fields == null) {
+

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239025447
  
--- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java ---
@@ -42,7 +41,7 @@
 // instance on DocumentWriterPerThread, or via sync'd code by
 // DocumentsWriterDeleteQueue
 
-class BufferedUpdates {
+class BufferedUpdates implements Accountable {
--- End diff --

Ha! Despite all the crazy accounting we were doing here, we didn't implement this before :)





[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239027692
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values 
without de-duplication
+  // which is also not a common case to keep updating the same value more 
than once...
+  // we might pay a higher price in terms of memory in certain cases but 
will gain
+  // on CPU for those. We also save on not needing to sort in order to 
apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are 
buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering 
binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate 
initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+  if (fields == null) {
+

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239027983
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values 
without de-duplication
+  // which is also not a common case to keep updating the same value more 
than once...
+  // we might pay a higher price in terms of memory in certain cases but 
will gain
+  // on CPU for those. We also save on not needing to sort in order to 
apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are 
buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering 
binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate 
initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+  if (fields == null) {
+

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239025745
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values without de-duplication
+  // which is also not a common case to keep updating the same value more than once...
+  // we might pay a higher price in terms of memory in certain cases but will gain
+  // on CPU for those. We also save on not needing to sort in order to apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+  private int[] docsUpTo = null;
--- End diff --

An explicit `null` initializer is not needed in Java; object fields default to `null`.
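
As a small illustration of the point above (the class and field names here are made up for the example and are not part of the patch), a reference-typed field without an initializer already starts out as `null`, so the two declarations below behave the same:

```java
/** Minimal demo: Java assigns default values to fields that have no initializer. */
final class DefaultFieldValues {
  private int[] implicitlyNull;        // no initializer: defaults to null
  private int[] explicitlyNull = null; // redundant initializer, same result

  /** Returns true when both fields hold null, which is always the case here. */
  boolean bothNull() {
    return implicitlyNull == null && explicitlyNull == null;
  }
}
```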


---




[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239026711
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values without de-duplication
+  // which is also not a common case to keep updating the same value more than once...
+  // we might pay a higher price in terms of memory in certain cases but will gain
+  // on CPU for those. We also save on not needing to sort in order to apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+  if (fields == null) {
+

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r238792876
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values without de-duplication
+  // which is also not a common case to keep updating the same value more than once...
+  // we might pay a higher price in terms of memory in certain cases but will gain
+  // on CPU for those. We also save on not needing to sort in order to apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering binaries
+  private boolean[] hasValues;
--- End diff --

Would a bitset be better here?
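
To make the trade-off behind this question concrete, here is a rough sketch. It uses `java.util.BitSet` purely as a stand-in (a later revision of the file quoted in this thread imports `org.apache.lucene.util.FixedBitSet`, but nothing below relies on that class), and the helper names are invented for the example. The size estimates assume a typical JVM that stores one byte per `boolean[]` entry and ignore object headers.

```java
import java.util.BitSet;

/** Illustrative comparison only; java.util.BitSet stands in for whichever bitset a patch might use. */
final class HasValuesSketch {
  /** Approximate payload bytes of a boolean[] with n entries (one byte per entry on typical JVMs). */
  static long booleanArrayBytes(int n) {
    return n;
  }

  /** Payload bytes of a bitset holding n bits, stored as 64-bit words. */
  static long bitSetBytes(int n) {
    return ((n + 63L) / 64L) * Long.BYTES;
  }

  /** Copies per-update "has value" flags into a bitset, one bit per update. */
  static BitSet toBitSet(boolean[] flags) {
    BitSet bits = new BitSet(flags.length);
    for (int i = 0; i < flags.length; i++) {
      if (flags[i]) {
        bits.set(i);
      }
    }
    return bits;
  }
}
```

Under those assumptions the bitset needs roughly one eighth of the payload memory, at the cost of a shift and mask on every read.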


---




[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread jpountz
Github user jpountz commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239014808
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
--- End diff --

This would be a bit easier for me to read if you introduced tiny 
abstractions around an (initially null) array plus a default value, as all 
the conditions make the code a bit hard to follow.
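
One possible shape for such an abstraction, as a sketch only: `LazyIntArray` and its methods are hypothetical names, not code from the patch. The idea is that the backing array stays `null` while every appended value equals the default, and is allocated and back-filled the first time a different value shows up.

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not part of Lucene): an int "array" that stores only a
 * default value until some appended entry differs from it.
 */
final class LazyIntArray {
  private final int defaultValue;
  private int[] values; // stays null while every entry equals defaultValue
  private int size;

  LazyIntArray(int defaultValue) {
    this.defaultValue = defaultValue;
  }

  /** Appends a value; the backing array is allocated on the first non-default value. */
  void append(int value) {
    if (values == null && value != defaultValue) {
      // Materialize: all earlier entries were equal to the default.
      values = new int[Math.max(8, size + 1)];
      Arrays.fill(values, 0, size, defaultValue);
    }
    if (values != null) {
      if (size == values.length) {
        values = Arrays.copyOf(values, size * 2);
      }
      values[size] = value;
    }
    size++;
  }

  /** Returns the entry at index, answering from the default while nothing is materialized. */
  int get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index: " + index + ", size: " + size);
    }
    return values == null ? defaultValue : values[index];
  }

  int size() {
    return size;
  }

  boolean isMaterialized() {
    return values != null;
  }
}
```

With a helper like this, a caller would append and read values without checking for a `null` array itself; whether the extra indirection is worth it for this class is a judgment call for the patch author.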


---




[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-04 Thread shaie
Github user shaie commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r238938229
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values without de-duplication
+  // which is also not a common case to keep updating the same value more than once...
+  // we might pay a higher price in terms of memory in certain cases but will gain
+  // on CPU for those. We also save on not needing to sort in order to apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+  if (fields == null) {
+int 

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-04 Thread shaie
Github user shaie commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r238937830
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and stores
+ * terms, values and metadata in a memory efficient way without creating large amounts
+ * of objects. Update terms are stored without de-duplicating the update term.
+ * In general we try to optimize for several use-cases. For instance we try to use constant
+ * space for update terms field since the common case always updates on the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have a value. Lastly, the soft_deletes case
--- End diff --

The last sentence about "soft_deletes" reads as too cumbersome to me. Can we 
just say "Lastly, if all updates share the same value for a field, the value is 
stored only once."? Or something along those lines.


---




[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-04 Thread shaie
Github user shaie commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r238937433
  
--- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java 
---
@@ -288,15 +186,24 @@ void clear() {
 deleteTerms.clear();
 deleteQueries.clear();
 deleteDocIDs.clear();
-numericUpdates.clear();
-binaryUpdates.clear();
 numTermDeletes.set(0);
-numNumericUpdates.set(0);
-numBinaryUpdates.set(0);
-bytesUsed.set(0);
+numFieldUpdates.set(0);
+fieldUpdates.clear();
+bytesUsed.addAndGet(-bytesUsed.get());
--- End diff --

Unrelated to this PR, but it feels like `Counter` could expose a `reset()` 
method that internally sets the value to 0, instead of everyone doing this 
juggling.
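
A minimal sketch of the suggestion, using a stand-in class: `SimpleCounter` below is not Lucene's `Counter`, and `reset()` is the hypothetical method being proposed, not an existing API. It only contrasts the two ways of getting back to zero.

```java
/** Stand-in for a byte-usage counter; not Lucene's actual Counter class. */
final class SimpleCounter {
  private long count;

  /** Adds delta (which may be negative) and returns the new value. */
  long addAndGet(long delta) {
    count += delta;
    return count;
  }

  long get() {
    return count;
  }

  /** Hypothetical convenience method along the lines suggested in the review. */
  void reset() {
    count = 0;
  }
}
```

With this stand-in, `counter.addAndGet(-counter.get())` and `counter.reset()` leave the same value behind. The sketch is single-threaded; a counter shared across threads would need an atomic variant of both operations.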


---




[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-04 Thread s1monw
GitHub user s1monw opened a pull request:

https://github.com/apache/lucene-solr/pull/513

LUCENE-8590: Optimize DocValues update datastructures

Today we are using a LinkedHashMap to buffer doc-values updates in
BufferedUpdates. On the one hand this uses an Object-based data structure,
and on the other it requires re-encoding the data into a more compact
representation once the BufferedUpdates are frozen. This change uses a more
compact representation for the updates already in BufferedUpdates, in a
parallel-array-like data structure that can be reused in
FrozenBufferedDeletes. It also adds a much simpler-to-use API to consume the
updates and allows for internal memory optimizations for common-case updates.
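
For readers unfamiliar with the term, the following is a simplified illustration of a parallel-array layout. It is not the class from this pull request; `ParallelArrayUpdates` and its accessors are invented names, and plain `String` terms are used instead of Lucene's byte-based term storage. Index `i` in each array describes the same update, so no per-update object is allocated.

```java
import java.util.Arrays;

/** Simplified illustration of a parallel-array update buffer; not the class from the patch. */
final class ParallelArrayUpdates {
  private String[] terms = new String[4];
  private long[] values = new long[4];
  private int[] docsUpTo = new int[4];
  private int size;

  /** Appends one update; all three arrays grow together so indices stay aligned. */
  void add(String term, long value, int docUpTo) {
    if (size == terms.length) {
      int newLength = size * 2;
      terms = Arrays.copyOf(terms, newLength);
      values = Arrays.copyOf(values, newLength);
      docsUpTo = Arrays.copyOf(docsUpTo, newLength);
    }
    terms[size] = term;
    values[size] = value;
    docsUpTo[size] = docUpTo;
    size++;
  }

  int size() {
    return size;
  }

  String term(int i) {
    return terms[i];
  }

  long value(int i) {
    return values[i];
  }

  int docUpTo(int i) {
    return docsUpTo[i];
  }
}
```

Compared with a map of update objects, this keeps the primitive values in contiguous arrays and preserves insertion order by construction.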

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/s1monw/lucene-solr improve_buffered_updates

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/513.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #513


commit 6646d1ce1227fae937a2d9560afcc973baa84243
Author: Simon Willnauer 
Date:   2018-12-04T16:36:46Z

LUCENE-8590: Optimize DocValues update datastructures

Today we are using a LinkedHashMap to buffer doc-values updates in
BufferedUpdates. On the one hand this uses an Object-based data structure,
and on the other it requires re-encoding the data into a more compact
representation once the BufferedUpdates are frozen. This change uses a more
compact representation for the updates already in BufferedUpdates, in a
parallel-array-like data structure that can be reused in
FrozenBufferedDeletes. It also adds a much simpler-to-use API to consume the
updates and allows for internal memory optimizations for common-case updates.




---
