[GitHub] lucene-solr issue #527: LUCENE-8609: Allow getting consistent docstats from ...

2018-12-13 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/527
  
LGTM


---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #526: LUCENE-8608: Extract utility class to iterate...

2018-12-13 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/526#discussion_r241372586
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java ---
@@ -821,4 +759,93 @@ public String toString() {
   boolean any() {
 return deleteTerms.size() > 0 || deleteQueries.length > 0 || 
fieldUpdatesCount > 0 ;
   }
+
+  /**
+   * This class helps iterating a term dictionary and consuming all the 
docs for each terms.
+   * It accepts a field, value tuple and returns a {@link 
DocIdSetIterator} if the field has an entry
+   * for the given value. It has an optimized way of iterating the term 
dictionary if the terms are
+   * passed in sorted order and makes sure terms and postings are reused 
as much as possible.
+   */
+  static final class TermDocsIterator {
+private final TermsProvider provider;
+private String field;
+private TermsEnum termsEnum;
+private PostingsEnum postingsEnum;
+private final boolean sortedTerms;
+private BytesRef readerTerm;
+
+@FunctionalInterface
+interface TermsProvider {
+  Terms terms(String field) throws IOException;
+}
+
+TermDocsIterator(Fields fields, boolean sortedTerms) {
+  this(fields::terms, sortedTerms);
+}
+
+TermDocsIterator(LeafReader reader, boolean sortedTerms) {
+  this(reader::terms, sortedTerms);
+}
+
+private TermDocsIterator(TermsProvider provider, boolean sortedTerms) {
+  this.sortedTerms = sortedTerms;
+  this.provider = provider;
+}
+
+private void nextField(String field) throws IOException {
+  if (this.field == null || this.field.equals(field) == false) {
+this.field = field;
+Terms terms = provider.terms(field);
+if (terms != null) {
+  termsEnum = terms.iterator();
+  if (sortedTerms) {
+readerTerm = termsEnum.next();
+  }
+} else {
+  termsEnum = null;
+}
+  }
+}
+
+DocIdSetIterator nextTerm(String field, BytesRef term) throws 
IOException {
+  nextField(field);
+  if (termsEnum != null) {
+if (sortedTerms) {
+  // in the sorted case we can take advantage of the "seeking 
forward" property
--- End diff --

Maybe we should confirm the incoming `term` is `>=` the last term, under assert?
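A minimal sketch of that check, with hypothetical names (`SortedSeeker`, `seek`) standing in for the PR's actual API; `compare()` approximates `BytesRef#compareTo`'s unsigned byte ordering:

```java
// Sketch of the suggested assert: when the caller promises sorted input,
// remember the previously requested term and check each new request is >= it.
// SortedSeeker and seek are illustrative names, not the PR's code.
final class SortedSeeker {
    private byte[] lastTerm;

    void seek(byte[] term) {
        // the invariant the review proposes to verify under assert
        assert lastTerm == null || compare(lastTerm, term) <= 0
            : "terms must be passed in sorted order";
        lastTerm = term;
    }

    // unsigned lexicographic comparison over byte arrays
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

Being assert-only, the check costs nothing in production but catches callers that violate the sorted-order contract in tests.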


---




[GitHub] lucene-solr pull request #526: LUCENE-8608: Extract utility class to iterate...

2018-12-13 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/526#discussion_r241372971
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java ---
@@ -821,4 +759,93 @@ public String toString() {
[... same diff context as quoted above ...]
+DocIdSetIterator nextTerm(String field, BytesRef term) throws IOException {
+  nextField(field);
+  if (termsEnum != null) {
+if (sortedTerms) {
+  // in the sorted case we can take advantage of the "seeking forward" property
+  // this allows us depending on the term dict impl to reuse data-strucutres internally
+  // which speed up iteration over terms and docs significantly.
+  int cmp = term.compareTo(readerTerm);
+  if (cmp < 0) {
+return null;
--- End diff --

Maybe add comment `// requested term does not exist in this segment`?
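The `cmp < 0` early return follows from forward-only iteration over sorted input: once the cursor has passed the position where the term would sort, the term cannot exist in this segment. A stand-alone sketch of that pattern, using plain strings and a hypothetical `ForwardLookup` class in place of `TermsEnum`/`BytesRef`:

```java
// Sketch of forward-only lookup over a sorted dictionary: the cursor only
// ever advances, so each lookup is amortized O(1) when the requested terms
// are themselves sorted. ForwardLookup is an illustrative name.
final class ForwardLookup {
    private final String[] sortedDict;
    private int cursor = 0;

    ForwardLookup(String[] sortedDict) {
        this.sortedDict = sortedDict;
    }

    // callers must pass terms in sorted order; returns true if term exists
    boolean lookup(String term) {
        while (cursor < sortedDict.length && sortedDict[cursor].compareTo(term) < 0) {
            cursor++;  // seek forward, never backward
        }
        // cursor now sits at the first entry >= term; if that entry is
        // greater, the requested term does not exist in this dictionary
        return cursor < sortedDict.length && sortedDict[cursor].equals(term);
    }
}
```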


---




[GitHub] lucene-solr pull request #526: LUCENE-8608: Extract utility class to iterate...

2018-12-13 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/526#discussion_r241372770
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java ---
@@ -821,4 +759,93 @@ public String toString() {
[... same diff context as quoted above ...]
+DocIdSetIterator nextTerm(String field, BytesRef term) throws IOException {
+  nextField(field);
+  if (termsEnum != null) {
+if (sortedTerms) {
+  // in the sorted case we can take advantage of the "seeking forward" property
+  // this allows us depending on the term dict impl to reuse data-strucutres internally
--- End diff --

`data-strucutres` -> `data-structures`


---




[GitHub] lucene-solr pull request #526: LUCENE-8608: Extract utility class to iterate...

2018-12-13 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/526#discussion_r241373334
  
--- Diff: 
lucene/core/src/test/org/apache/lucene/index/TestFrozenBufferedUpdates.java ---
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import com.carrotsearch.randomizedtesting.generators.RandomPicks;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.document.StringField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.index.FrozenBufferedUpdates.TermDocsIterator;
+import org.apache.lucene.util.TestUtil;
+
+public class TestFrozenBufferedUpdates extends LuceneTestCase {
+
+  public void testTermDocsIterator() throws IOException {
--- End diff --

Nice random test :)


---




[GitHub] lucene-solr pull request #526: LUCENE-8608: Extract utility class to iterate...

2018-12-13 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/526#discussion_r241372182
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java ---
@@ -821,4 +759,93 @@ public String toString() {
[... same diff context as quoted above ...]
+private void nextField(String field) throws IOException {
--- End diff --

Maybe simply name `setField` or `changeField`?  `nextField` made me think it was somehow following along in sorted field order or something.


---




[GitHub] lucene-solr issue #482: LUCENE-8539: fix some typos and improve style in Tes...

2018-12-06 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/482
  
Great, thanks @diegoceccarelli -- I will merge soon!


---




[GitHub] lucene-solr pull request #516: LUCENE-8594: DV update are broken for updates...

2018-12-06 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/516#discussion_r239426275
  
--- Diff: 
lucene/backward-codecs/src/java/org/apache/lucene/codecs/lucene70/Lucene70Codec.java
 ---
@@ -66,7 +67,7 @@ public PostingsFormat getPostingsFormatForField(String 
field) {
   private final DocValuesFormat docValuesFormat = new 
PerFieldDocValuesFormat() {
 @Override
 public DocValuesFormat getDocValuesFormatForField(String field) {
-  throw new IllegalStateException("This codec should only be used for 
reading, not writing");
+  return defaultDVFormat;
--- End diff --

Ahh this is because the DV updates must also be written with the old codec?


---




[GitHub] lucene-solr issue #513: LUCENE-8590: Optimize DocValues update datastructure...

2018-12-05 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/513
  
Wow, that's a massive reduction!


---




[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239027156
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, the soft_deletes case
+ * where all values for a specific field is shared this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values 
without de-duplication
+  // which is also not a common case to keep updating the same value more 
than once...
+  // we might pay a higher price in terms of memory in certain cases but 
will gain
+  // on CPU for those. We also save on not needing to sort in order to 
apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are 
buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering 
binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate 
initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239025447
  
--- Diff: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java 
---
@@ -42,7 +41,7 @@
 // instance on DocumentWriterPerThread, or via sync'd code by
 // DocumentsWriterDeleteQueue
 
-class BufferedUpdates {
+class BufferedUpdates implements Accountable {
--- End diff --

Ha!  Despite all the crazy accounting we were doing here we didn't implement this before :)


---




[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239027692
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
[... same new-file diff as quoted above ...]

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239027983
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
[... same new-file diff as quoted above ...]

[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239025745
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
[... same new-file diff as quoted above ...]
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are buffering numerics
+  private int[] docsUpTo = null;
--- End diff --

`null` init not needed in java.
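The javadoc quoted in this thread describes keeping the first update's metadata inline and materializing per-update arrays only when a later update diverges. A generic sketch of that space-saving pattern, with hypothetical names and plain `int` values (the real class buffers terms, docUpTo and DocValues payloads):

```java
import java.util.Arrays;

// Sketch of the first-value optimization: store the first value inline and
// only allocate a per-update array once some later update actually differs.
// FirstValueBuffer is an illustrative name, not the PR's code.
final class FirstValueBuffer {
    private final int firstValue;
    private int[] values;  // lazily allocated only on divergence
    private int count = 1; // the first value is buffered implicitly

    FirstValueBuffer(int firstValue) {
        this.firstValue = firstValue;
    }

    void add(int value) {
        if (values == null && value != firstValue) {
            // first divergent value: materialize the array retroactively,
            // back-filling the earlier (all identical) values
            values = new int[Math.max(4, count + 1)];
            Arrays.fill(values, 0, count, firstValue);
        }
        if (values != null) {
            if (count == values.length) {
                values = Arrays.copyOf(values, count * 2);
            }
            values[count] = value;
        }
        count++;
    }

    int get(int index) {
        // common case: every update shared the first value, constant space
        return values == null ? firstValue : values[index];
    }
}
```

If every update carries the same value (the soft-deletes case mentioned in the javadoc), `values` is never allocated and the buffer stays at constant size regardless of the number of updates.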


---




[GitHub] lucene-solr pull request #513: LUCENE-8590: Optimize DocValues update datast...

2018-12-05 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/513#discussion_r239026711
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefArray;
+import org.apache.lucene.util.BytesRefIterator;
+import org.apache.lucene.util.Counter;
+import org.apache.lucene.util.RamUsageEstimator;
+
+/**
+ * This class efficiently buffers numeric and binary field updates and 
stores
+ * terms, values and metadata in a memory efficient way without creating 
large amounts
+ * of objects. Update terms are stored without de-duplicating the update 
term.
+ * In general we try to optimize for several use-cases. For instance we 
try to use constant
+ * space for update terms field since the common case always updates on 
the same field. Also for docUpTo
+ * we try to optimize for the case when updates should be applied to all 
docs ie. docUpTo=Integer.MAX_VALUE.
+ * In other cases each update will likely have a different docUpTo.
+ * Along the same lines this impl optimizes the case when all updates have 
a value. Lastly, for the soft_deletes case,
+ * where all values for a specific field are shared, this also stores 
numeric values only once if all updates share
+ * the same value.
+ */
+final class FieldUpdatesBuffer {
+  private final Counter bytesUsed;
+  private int numUpdates = 1;
+  // we use a very simple approach and store the update term values 
without de-duplication
+  // which is also not a common case to keep updating the same value more 
than once...
+  // we might pay a higher price in terms of memory in certain cases but 
will gain
+  // on CPU for those. We also save on not needing to sort in order to 
apply the terms in order
+  // since by definition we store them in order.
+  private final BytesRefArray termValues;
+  private final BytesRefArray byteValues; // this will be null if we are 
buffering numerics
+  private int[] docsUpTo = null;
+  private long[] numericValues; // this will be null if we are buffering 
binaries
+  private boolean[] hasValues;
+  private String[] fields;
+  private final String firstField;
+  private final boolean firstHasValue;
+  private long firstNumericValue;
+  private final int firstDocUpTo;
+  private final boolean isNumeric;
+
+  private FieldUpdatesBuffer(Counter bytesUsed, DocValuesUpdate 
initialValue, int docUpTo, boolean isNumeric) {
+this.bytesUsed = bytesUsed;
+termValues = new BytesRefArray(bytesUsed);
+termValues.append(initialValue.term.bytes);
+firstField = initialValue.term.field;
+firstDocUpTo = docUpTo;
+firstHasValue = initialValue.hasValue;
+if (firstHasValue == false) {
+  hasValues = new boolean[] {false};
+  bytesUsed.addAndGet(1);
+}
+this.isNumeric = isNumeric;
+byteValues = isNumeric ? null : new BytesRefArray(bytesUsed);
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.NumericDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, true);
+if (initialValue.hasValue) {
+  firstNumericValue = initialValue.getValue();
+}
+  }
+
+  FieldUpdatesBuffer(Counter bytesUsed, 
DocValuesUpdate.BinaryDocValuesUpdate initialValue, int docUpTo) {
+this(bytesUsed, initialValue, docUpTo, false);
+if (initialValue.hasValue()) {
+  byteValues.append(initialValue.getValue());
+}
+  }
+
+  void add(String field, int docUpTo, int ord, boolean hasValue) {
+if (this.firstField.equals(field) == false || fields != null) {
+

[GitHub] lucene-solr pull request #:

2018-12-04 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/b689cf146bb1a5db022926ea1a3ff5f1ce75a9d6#commitcomment-31555170
  
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java on 
line 133:
Don't need the extra `(` ... `)`.


---




[GitHub] lucene-solr pull request #:

2018-12-04 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/b689cf146bb1a5db022926ea1a3ff5f1ce75a9d6#commitcomment-31555166
  
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java on 
line 129:
Here too.


---




[GitHub] lucene-solr pull request #:

2018-12-04 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/b689cf146bb1a5db022926ea1a3ff5f1ce75a9d6#commitcomment-31555162
  
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java on 
line 113:
Also here.


---




[GitHub] lucene-solr pull request #:

2018-12-04 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/b689cf146bb1a5db022926ea1a3ff5f1ce75a9d6#commitcomment-31554358
  
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java on 
line 97:
Insert space between `){`.


---




[GitHub] lucene-solr pull request #:

2018-12-04 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/b689cf146bb1a5db022926ea1a3ff5f1ce75a9d6#commitcomment-31554312
  
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java on 
line 51:
`null` is the default for java already.


---




[GitHub] lucene-solr pull request #:

2018-12-04 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/b689cf146bb1a5db022926ea1a3ff5f1ce75a9d6#commitcomment-31554308
  
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java on 
line 45:
It's also not a common case to keep updating the same value more than once 
...


---




[GitHub] lucene-solr pull request #:

2018-12-04 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/b689cf146bb1a5db022926ea1a3ff5f1ce75a9d6#commitcomment-31554289
  
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java:
In lucene/core/src/java/org/apache/lucene/index/FieldUpdatesBuffer.java on 
line 34:
Remove one of the `several`s?


---




[GitHub] lucene-solr issue #482: LUCENE-8539: fix some typos and improve style in Tes...

2018-11-26 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/482
  
Hmm, `ant precommit` is a little angry:

```
[forbidden-apis] Forbidden method invocation: 
java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default 
locale]
[forbidden-apis]   in org.apache.lucene.analysis.TestStopFilter 
(TestStopFilter.java:61)
[forbidden-apis] Forbidden method invocation: 
java.util.Collections#shuffle(java.util.List) [Use shuffle(List, Random) 
instead so that it can be reproduced]
[forbidden-apis]   in org.apache.lucene.analysis.TestStopFilter 
(TestStopFilter.java:130)
[forbidden-apis] Forbidden method invocation: 
java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default 
locale]
[forbidden-apis]   in org.apache.lucene.analysis.TestStopFilter 
(TestStopFilter.java:184)
[forbidden-apis] Forbidden method invocation: 
java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default 
locale]
[forbidden-apis]   in org.apache.lucene.analysis.TestStopFilter 
(TestStopFilter.java:186)
```

Can you address these issues and commit / squash?  Thanks @diegoceccarelli!
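
Both forbidden-apis complaints have standard fixes: pass an explicit `Locale` to `String.format`, and pass the test's `Random` to `Collections.shuffle` so runs stay reproducible. A minimal sketch of the corrected calls (the method names here are illustrative, not from the patch):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Random;

class ForbiddenApiFixes {
    // Locale.ROOT makes the output independent of the JVM's default locale.
    static String describe(int gap) {
        return String.format(Locale.ROOT, "position gap: %d", gap);
    }

    // An explicit Random makes the shuffle reproducible from a seed, which
    // is why the test framework forbids the single-argument overload.
    static List<Integer> shuffled(List<Integer> input, long seed) {
        List<Integer> copy = new ArrayList<>(input);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }
}
```

In Lucene tests the seed would come from the `random()` instance supplied by the test framework rather than a literal.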


---




[GitHub] lucene-solr issue #482: LUCENE-8539: fix some typos and improve style in Tes...

2018-11-26 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/482
  
@diegoceccarelli this looks great -- I'll merge soon.  Thanks!


---




[GitHub] lucene-solr pull request #500: LUCENE-8517: do not wrap FixedShingleFilter w...

2018-11-22 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/500#discussion_r235819263
  
--- Diff: 
lucene/test-framework/src/java/org/apache/lucene/analysis/ValidatingTokenFilter.java
 ---
@@ -50,6 +56,9 @@
   private final OffsetAttribute offsetAtt = 
getAttribute(OffsetAttribute.class);
   private final CharTermAttribute termAtt = 
getAttribute(CharTermAttribute.class);
 
+  // record all the Tokens seen so they can be dumped on failure
--- End diff --

Update comment to `record last MAX_DEBUG_TOKENS...`?


---




[GitHub] lucene-solr pull request #503: LUCENE-8571: Don't block on FrozenBufferedUpd...

2018-11-21 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/503#discussion_r235350971
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/FrozenBufferedUpdates.java ---
@@ -214,150 +216,174 @@ public FrozenBufferedUpdates(InfoStream infoStream, 
BufferedUpdates updates, Seg
 
   /** Translates a frozen packet of delete term/query, or doc values
*  updates, into their actual docIDs in the index, and applies the 
change.  This is a heavy
-   *  operation and is done concurrently by incoming indexing threads. */
+   *  operation and is done concurrently by incoming indexing threads.
+   *  This method will return immediately without blocking if another 
thread is currently
+   *  applying the package. In order to ensure the packet has been 
applied, {@link #forceApply(IndexWriter)}
+   *  must be called.
+   *  */
   @SuppressWarnings("try")
-  public synchronized void apply(IndexWriter writer) throws IOException {
-if (applied.getCount() == 0) {
-  // already done
-  return;
+  boolean tryApply(IndexWriter writer) throws IOException {
+if (applyLock.tryLock()) {
+  try {
+return forceApply(writer);
+  } finally {
+applyLock.unlock();
+  }
 }
+return false;
+  }
 
-long startNS = System.nanoTime();
+  /** Translates a frozen packet of delete term/query, or doc values
+   *  updates, into their actual docIDs in the index, and applies the 
change.  This is a heavy
+   *  operation and is done concurrently by incoming indexing threads.
+   *  */
+  boolean forceApply(IndexWriter writer) throws IOException {
--- End diff --

Can we make this `void`?  It always just `return true` now right?
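
The `tryApply`/`forceApply` pair quoted above follows the usual non-blocking `tryLock` idiom: attempt the lock, return immediately if another thread is already doing the work, and keep a blocking variant for callers that must observe the result. A simplified, self-contained sketch of the idiom (the names mirror the patch, but this is not the Lucene code):

```java
import java.util.concurrent.locks.ReentrantLock;

class TryApplyPattern {
    private final ReentrantLock applyLock = new ReentrantLock();
    private int applied;  // stands in for "packet has been applied" state

    /** Returns false immediately if another thread holds the lock. */
    boolean tryApply() {
        if (applyLock.tryLock()) {
            try {
                forceApply();
                return true;
            } finally {
                applyLock.unlock();
            }
        }
        return false;
    }

    /** Blocking variant; void, since it always succeeds once the lock is held. */
    void forceApply() {
        applyLock.lock();  // ReentrantLock allows re-acquisition by the owner
        try {
            applied++;
        } finally {
            applyLock.unlock();
        }
    }

    int appliedCount() {
        return applied;
    }
}
```

The reentrancy matters here: `tryApply` calls `forceApply` while already holding the lock, which only works because `ReentrantLock` is reentrant.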


---




[GitHub] lucene-solr pull request #502: LUCENE-8569: Never count soft-deletes if read...

2018-11-20 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/502#discussion_r235148814
  
--- Diff: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java ---
@@ -4398,42 +4426,41 @@ private int mergeMiddle(MergePolicy.OneMerge merge, 
MergePolicy mergePolicy) thr
 
   // Let the merge wrap readers
   List mergeReaders = new ArrayList<>();
-  int softDeleteCount = 0;
+  Counter softDeleteCount = Counter.newCounter(false);
   for (int r = 0; r < merge.readers.size(); r++) {
 SegmentReader reader = merge.readers.get(r);
 CodecReader wrappedReader = merge.wrapForMerge(reader);
 validateMergeReader(wrappedReader);
 if (softDeletesEnabled) {
+
   if (reader != wrappedReader) { // if we don't have a wrapped 
reader we won't preserve any soft-deletes
 Bits hardLiveDocs = merge.hardLiveDocs.get(r);
-Bits wrappedLiveDocs = wrappedReader.getLiveDocs();
-int hardDeleteCount = 0;
-DocIdSetIterator softDeletedDocs = 
DocValuesFieldExistsQuery.getDocValuesDocIdSetIterator(config.getSoftDeletesField(),
 wrappedReader);
-if (softDeletedDocs != null) {
-  int docId;
-  while ((docId = softDeletedDocs.nextDoc()) != 
DocIdSetIterator.NO_MORE_DOCS) {
-if (wrappedLiveDocs == null || wrappedLiveDocs.get(docId)) 
{
-  if (hardLiveDocs == null || hardLiveDocs.get(docId)) {
-softDeleteCount++;
-  } else {
-hardDeleteCount++;
+if (hardLiveDocs != null) { // we only need to do this 
accounting if we have mixed deletes
+  Bits wrappedLiveDocs = wrappedReader.getLiveDocs();
+  int hardDeleteCount = countSoftDeletes(wrappedReader, 
wrappedLiveDocs, hardLiveDocs, softDeleteCount);
+  // Wrap the wrapped reader again if we have excluded some 
hard-deleted docs
+  if (hardLiveDocs != null && hardDeleteCount > 0) {
+Bits liveDocs = wrappedLiveDocs == null ? hardLiveDocs : 
new Bits() {
+  @Override
+  public boolean get(int index) {
+return hardLiveDocs.get(index) && 
wrappedLiveDocs.get(index);
   }
-}
+
+  @Override
+  public int length() {
+return hardLiveDocs.length();
+  }
+};
+wrappedReader = 
FilterCodecReader.wrapLiveDocs(wrappedReader, liveDocs, wrappedReader.numDocs() 
- hardDeleteCount);
   }
-}
-// Wrap the wrapped reader again if we have excluded some 
hard-deleted docs
-if (hardLiveDocs != null && hardDeleteCount > 0) {
-  Bits liveDocs = wrappedLiveDocs == null ? hardLiveDocs : new 
Bits() {
-@Override
-public boolean get(int index) {
-  return hardLiveDocs.get(index) && 
wrappedLiveDocs.get(index);
-}
-@Override
-public int length() {
-  return hardLiveDocs.length();
-}
-  };
-  wrappedReader = 
FilterCodecReader.wrapLiveDocs(wrappedReader, liveDocs, wrappedReader.numDocs() 
- hardDeleteCount);
+} else {
+  int carryOverSoftDeletes = 
reader.getSegmentInfo().getSoftDelCount() - wrappedReader.numDeletedDocs();
+  if (carryOverSoftDeletes < 0) {
+throw new IllegalStateException("carry-over soft-deletes 
must be positive");
--- End diff --

`AssertionError` instead?  This means we have a bug right?


---




[GitHub] lucene-solr pull request #497: LUCENE-8026: ExitableDirectoryReader does not...

2018-11-12 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/497#discussion_r232832055
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/ExitableDirectoryReader.java ---
@@ -100,13 +109,150 @@ public CacheHelper getCoreCacheHelper() {
 
   }
 
+  /**
+   * Wrapper class for another PointValues implementation that is used by 
ExitableFields.
+   */
+  public static class ExitablePointValues extends PointValues {
+
+private final PointValues in;
+private final QueryTimeout queryTimeout;
+
+public ExitablePointValues(PointValues in, QueryTimeout queryTimeout) {
+  this.in = in;
+  this.queryTimeout = queryTimeout;
+  checkAndThrow();
+}
+
+/**
+ * Throws {@link ExitingReaderException} if {@link 
QueryTimeout#shouldExit()} returns true,
+ * or if {@link Thread#interrupted()} returns true.
+ */
+private void checkAndThrow() {
+  if (queryTimeout.shouldExit()) {
+throw new ExitingReaderException("The request took too long to 
iterate over point values. Timeout: "
++ queryTimeout.toString()
++ ", PointValues=" + in
+);
+  } else if (Thread.interrupted()) {
+throw new ExitingReaderException("Interrupted while iterating over 
point values. PointValues=" + in);
+  }
+}
+
+@Override
+public void intersect(IntersectVisitor visitor) throws IOException {
+  checkAndThrow();
+  in.intersect(new ExitableIntersectVisitor(visitor, queryTimeout));
+}
+
+@Override
+public long estimatePointCount(IntersectVisitor visitor) {
+  checkAndThrow();
+  return in.estimatePointCount(visitor);
+}
+
+@Override
+public byte[] getMinPackedValue() throws IOException {
+  checkAndThrow();
+  return in.getMinPackedValue();
+}
+
+@Override
+public byte[] getMaxPackedValue() throws IOException {
+  checkAndThrow();
+  return in.getMaxPackedValue();
+}
+
+@Override
+public int getNumDataDimensions() throws IOException {
+  checkAndThrow();
+  return in.getNumDataDimensions();
+}
+
+@Override
+public int getNumIndexDimensions() throws IOException {
+  checkAndThrow();
+  return in.getNumIndexDimensions();
+}
+
+@Override
+public int getBytesPerDimension() throws IOException {
+  checkAndThrow();
+  return in.getBytesPerDimension();
+}
+
+@Override
+public long size() {
+  checkAndThrow();
+  return in.size();
+}
+
+@Override
+public int getDocCount() {
+  checkAndThrow();
+  return in.getDocCount();
+}
+  }
+
+  public static class ExitableIntersectVisitor implements 
PointValues.IntersectVisitor {
+
+public static final int MAX_CALLS_BEFORE_QUERY_TIMEOUT_CHECK = 10;
+
+private final PointValues.IntersectVisitor in;
+private final QueryTimeout queryTimeout;
+private int calls = 0;
--- End diff --

Minor: you don't need the `= 0` -- java does that for you by default.


---




[GitHub] lucene-solr pull request #496: LUCENE-8463: Early-terminate queries sorted b...

2018-11-12 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/496#discussion_r232664496
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java ---
@@ -68,6 +68,20 @@ public void setScorer(Scorable scorer) throws 
IOException {
   }
 
   static boolean canEarlyTerminate(Sort searchSort, Sort indexSort) {
+return canEarlyTerminateOnDocId(searchSort, indexSort) ||
+   canEarlyTerminateOnPrefix(searchSort, indexSort);
+  }
+
+  private static boolean canEarlyTerminateOnDocId(Sort searchSort, Sort 
indexSort) {
+final SortField[] fields1 = searchSort.getSort();
+final SortField[] fields2 = indexSort.getSort();
+return fields1.length == 1 &&
--- End diff --

It'd be weird, but we can also safely early terminate even if there are 
other sort fields after `docid` right (since `docid` is a total sort)?  So we 
don't need to insist the length is 1?
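
The reasoning here, that once the leading sort key is a total order any trailing sort fields are dead code, can be illustrated with plain comparators (an analogy only; this is not Lucene's `Sort` API):

```java
import java.util.Arrays;
import java.util.Comparator;

class TotalOrderPrefix {
    /** True if both comparators order the array identically. */
    static boolean sameOrder(Integer[] data, Comparator<Integer> a, Comparator<Integer> b) {
        Integer[] x = data.clone();
        Integer[] y = data.clone();
        Arrays.sort(x, a);
        Arrays.sort(y, b);
        return Arrays.equals(x, y);
    }

    static boolean tieBreakerIsDeadCode() {
        Integer[] docs = {5, 3, 9, 1, 7};  // unique values, like doc ids
        Comparator<Integer> docIdOnly = Comparator.naturalOrder();
        // Because the primary key never ties, the secondary key is never consulted.
        Comparator<Integer> docIdThenParity =
            Comparator.<Integer>naturalOrder().thenComparing(d -> d % 2);
        return sameOrder(docs, docIdOnly, docIdThenParity);
    }
}
```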


---




[GitHub] lucene-solr pull request #497: LUCENE-8026: ExitableDirectoryReader does not...

2018-11-12 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/497#discussion_r232661641
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/ExitableDirectoryReader.java ---
@@ -100,13 +109,97 @@ public CacheHelper getCoreCacheHelper() {
 
   }
 
+  /**
+   * Wrapper class for another PointValues implementation that is used by 
ExitableFields.
+   */
+  public static class ExitablePointValues extends PointValues {
+
+private final PointValues in;
+private final QueryTimeout queryTimeout;
+
+public ExitablePointValues(PointValues in, QueryTimeout queryTimeout) {
+  this.in = in;
+  this.queryTimeout = queryTimeout;
+  checkAndThrow();
+}
+
+/**
+ * Throws {@link ExitingReaderException} if {@link 
QueryTimeout#shouldExit()} returns true,
+ * or if {@link Thread#interrupted()} returns true.
+ */
+private void checkAndThrow() {
+  if (queryTimeout.shouldExit()) {
+throw new ExitingReaderException("The request took too long to 
iterate over terms. Timeout: "
++ queryTimeout.toString()
++ ", PointValues=" + in
+);
+  } else if (Thread.interrupted()) {
+throw new ExitingReaderException("Interrupted while iterating over 
terms. PointValues=" + in);
--- End diff --

Same here?


---




[GitHub] lucene-solr pull request #497: LUCENE-8026: ExitableDirectoryReader does not...

2018-11-12 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/497#discussion_r232661519
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/ExitableDirectoryReader.java ---
@@ -100,13 +109,97 @@ public CacheHelper getCoreCacheHelper() {
 
   }
 
+  /**
+   * Wrapper class for another PointValues implementation that is used by 
ExitableFields.
+   */
+  public static class ExitablePointValues extends PointValues {
+
+private final PointValues in;
+private final QueryTimeout queryTimeout;
+
+public ExitablePointValues(PointValues in, QueryTimeout queryTimeout) {
+  this.in = in;
+  this.queryTimeout = queryTimeout;
+  checkAndThrow();
+}
+
+/**
+ * Throws {@link ExitingReaderException} if {@link 
QueryTimeout#shouldExit()} returns true,
+ * or if {@link Thread#interrupted()} returns true.
+ */
+private void checkAndThrow() {
+  if (queryTimeout.shouldExit()) {
+throw new ExitingReaderException("The request took too long to 
iterate over terms. Timeout: "
--- End diff --

Maybe change `to iterate over terms` to `to iterate over points`?


---




[GitHub] lucene-solr pull request #497: LUCENE-8026: ExitableDirectoryReader does not...

2018-11-12 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/497#discussion_r232661927
  
--- Diff: 
lucene/core/src/test/org/apache/lucene/index/TestExitableDirectoryReader.java 
---
@@ -160,5 +160,78 @@ public void testExitableFilterIndexReader() throws 
Exception {
 
 directory.close();
   }
+
+  /**
+   * Tests timing out of PointValues queries
+   *
+   * @throws Exception on error
+   */
+  @Ignore("this test relies on wall clock time and sometimes false fails")
--- End diff --

I wonder if we could make a mock clock and then have deterministic control 
and be able to re-enable these tests?
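
A deterministic version of these tests could inject a fake clock into the timeout instead of reading wall time, so the test advances time explicitly rather than sleeping. A hedged sketch of what such a mock-clock timeout might look like (this type does not exist in Lucene; it only mimics the `shouldExit()` contract):

```java
import java.util.concurrent.atomic.AtomicLong;

class MockClockTimeout {
    private final AtomicLong nowNanos = new AtomicLong();  // fake clock, starts at 0
    private final long deadlineNanos;

    MockClockTimeout(long timeoutNanos) {
        this.deadlineNanos = timeoutNanos;
    }

    /** The test advances time explicitly instead of sleeping. */
    void advance(long nanos) {
        nowNanos.addAndGet(nanos);
    }

    /** Mirrors QueryTimeout.shouldExit(), but is fully deterministic. */
    boolean shouldExit() {
        return nowNanos.get() >= deadlineNanos;
    }
}
```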


---




[GitHub] lucene-solr pull request #497: LUCENE-8026: ExitableDirectoryReader does not...

2018-11-12 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/497#discussion_r232662145
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/index/ExitableDirectoryReader.java ---
@@ -100,13 +109,97 @@ public CacheHelper getCoreCacheHelper() {
 
   }
 
+  /**
+   * Wrapper class for another PointValues implementation that is used by 
ExitableFields.
+   */
+  public static class ExitablePointValues extends PointValues {
+
+private final PointValues in;
+private final QueryTimeout queryTimeout;
+
+public ExitablePointValues(PointValues in, QueryTimeout queryTimeout) {
+  this.in = in;
+  this.queryTimeout = queryTimeout;
+  checkAndThrow();
+}
+
+/**
+ * Throws {@link ExitingReaderException} if {@link 
QueryTimeout#shouldExit()} returns true,
+ * or if {@link Thread#interrupted()} returns true.
+ */
+private void checkAndThrow() {
+  if (queryTimeout.shouldExit()) {
+throw new ExitingReaderException("The request took too long to 
iterate over terms. Timeout: "
++ queryTimeout.toString()
++ ", PointValues=" + in
+);
+  } else if (Thread.interrupted()) {
+throw new ExitingReaderException("Interrupted while iterating over 
terms. PointValues=" + in);
+  }
+}
+
+@Override
+public void intersect(IntersectVisitor visitor) throws IOException {
+  checkAndThrow();
+  in.intersect(visitor);
--- End diff --

A lot of time/effort can be spent in this (recursive) intersect call -- 
should we also wrap the `IntersectVisitor` and sometimes check for timeout?
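
A cheap way to do that is to count callbacks and consult the timeout only every N calls, the approach the `ExitableIntersectVisitor` quoted elsewhere in this thread takes with `MAX_CALLS_BEFORE_QUERY_TIMEOUT_CHECK`. A Lucene-free sketch of the counting idea (the interfaces are simplified stand-ins):

```java
import java.util.function.BooleanSupplier;

class SampledTimeoutCheck {
    static final int CALLS_BETWEEN_CHECKS = 10;

    private final BooleanSupplier shouldExit;  // stands in for QueryTimeout
    private int calls;   // implicitly 0
    private int checks;  // how many times the timeout was actually consulted

    SampledTimeoutCheck(BooleanSupplier shouldExit) {
        this.shouldExit = shouldExit;
    }

    /** Called once per visited value; the timeout is checked on every 10th call. */
    void visit() {
        if (++calls == CALLS_BETWEEN_CHECKS) {
            calls = 0;
            checks++;
            if (shouldExit.getAsBoolean()) {
                throw new RuntimeException("query timed out");
            }
        }
    }

    int checksPerformed() {
        return checks;
    }
}
```

This trades a little timeout precision (up to N-1 extra callbacks before expiry is noticed) for removing the per-callback check overhead.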


---




[GitHub] lucene-solr issue #411: Debugging PriorityQueue.java

2018-06-28 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/411
  
Great, thanks @rsaavedraf!


---




[GitHub] lucene-solr issue #411: Debugging PriorityQueue.java

2018-06-28 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/411
  
OK I merged this change into Lucene master & 7.x, and added a simple unit 
test.

Thanks @rsaavedraf!


---




[GitHub] lucene-solr issue #411: Debugging PriorityQueue.java

2018-06-28 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/411
  
Change looks great; I'll push.  Thanks @rsaavedraf!


---




[GitHub] lucene-solr pull request #398: Lucene 8343 data type migration

2018-06-14 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/398#discussion_r195387148
  
--- Diff: 
lucene/suggest/src/java/org/apache/lucene/search/suggest/Lookup.java ---
@@ -53,7 +53,7 @@
 public final Object highlightKey;
 
 /** the key's weight */
-public final long value;
+public final double value;
--- End diff --

Maybe improve javadocs here, explaining that this is not just the weight 
originally supplied during indexing, for some suggesters?


---




[GitHub] lucene-solr pull request #398: Lucene 8343 data type migration

2018-06-14 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/398#discussion_r195386599
  
--- Diff: 
lucene/suggest/src/java/org/apache/lucene/search/suggest/InputIterator.java ---
@@ -34,7 +34,7 @@
 public interface InputIterator extends BytesRefIterator {
 
   /** A term's weight, higher numbers mean better suggestions. */
--- End diff --

Maybe add javadocs explaining what `null` means?  Though, why do we need to 
allow for `null`?  Shouldn't the iterator not return a suggestion that has no 
weight?


---




[GitHub] lucene-solr pull request #286: [LUCENE-8075] Possible null pointer dereferen...

2017-12-12 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/286#discussion_r156538059
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/codecs/blocktree/IntersectTermsEnum.java 
---
@@ -106,37 +106,37 @@ public IntersectTermsEnum(FieldReader fr, Automaton 
automaton, RunAutomaton runA
 if (fr.index == null) {
   fstReader = null;
--- End diff --

Sorry for the slow response here @imgpulak and @jpountz, but Adrien is 
right: `fr.index` can never be null anymore.  So I think we should change the 
code to `assert` it's never null and only do the `else` clause of the current 
`if` statement?


---




[GitHub] lucene-solr issue #239: Make FacetField.TYPE public

2017-08-27 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/239
  
OK I pushed this for 7.1 and 8.0.  Can you close this PR now?  Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] lucene-solr issue #239: Make FacetField.TYPE public

2017-08-27 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/239
  
Thanks @msokolov; I'll look.

On that cryptic error you saw, that might mean your ivy version is too old: 
see http://lucene.472066.n3.nabble.com/BUILD-FAILED-solr-6-5-0-td4330787.html


---



[GitHub] lucene-solr issue #239: Make FacetField.TYPE public

2017-08-24 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/239
  
OK `ant precommit` is angry because there are no javadocs for `TYPE` ... 
could you please add one?  Thanks.


---



[GitHub] lucene-solr issue #239: Make FacetField.TYPE public

2017-08-24 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/239
  
Thanks @msokolov, that makes sense ... I'll merge.


---



[GitHub] lucene-solr pull request #:

2017-01-05 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/c9aaa771821eb0ca8dd6ce4650784d824fbe8c39#commitcomment-20370748
  
Hi @xcl3721, the functionality from `TrackingIndexWriter` has been folded 
into `IndexWriter` in 6.2 ... its APIs now return a `long` sequence number 
which you can pass to `ControlledRealTimeReopenThread`.
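The handshake described above can be modeled with plain JDK types. This is a toy sketch of the sequence-number pattern, not Lucene's actual API; all names here are illustrative, and the real `IndexWriter` / `ControlledRealTimeReopenThread` do considerably more (blocking waits, generations per refresh, etc.):

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the sequence-number handshake: every "update" returns a
// monotonically increasing long; a refresher records how far it has caught
// up, and an update is visible once the refreshed counter passes its seqNo.
public class SeqNoModel {
  private final AtomicLong nextSeqNo = new AtomicLong();
  private volatile long refreshedSeqNo = -1;

  long updateDocument() {              // stands in for an IndexWriter update call
    return nextSeqNo.getAndIncrement();
  }

  void refresh() {                     // stands in for the reopen thread's work
    refreshedSeqNo = nextSeqNo.get() - 1;
  }

  boolean isVisible(long seqNo) {      // stands in for waiting on a generation
    return seqNo <= refreshedSeqNo;
  }

  public static void main(String[] args) {
    SeqNoModel m = new SeqNoModel();
    long seq = m.updateDocument();
    System.out.println(m.isVisible(seq)); // not yet refreshed
    m.refresh();
    System.out.println(m.isVisible(seq)); // visible after refresh
  }
}
```

The point of returning a `long` from each update is exactly this: the caller can later ask "has the reader been reopened past my change?" without any extra bookkeeping.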





[GitHub] lucene-solr issue #130: LUCENE-7603: branch_6x Support Graph Token Streams i...

2017-01-03 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/130
  
I've merged this into Lucene's branch_6x; thank you @mattweber!  Can you 
please close this now?





[GitHub] lucene-solr issue #129: LUCENE-7603: Support Graph Token Streams in QueryBui...

2017-01-03 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/129
  
I've merged this into Lucene's master (7.0), and I'm working on 6.x (#130) 
now.  Thanks @mattweber! Can you close this?





[GitHub] lucene-solr issue #129: LUCENE-7603: Support Graph Token Streams in QueryBui...

2016-12-31 Thread mikemccand
Github user mikemccand commented on the issue:

https://github.com/apache/lucene-solr/pull/129
  
This change looks great to me!  What an awesome improvement, to properly 
use graph token streams at search time so multi-token synonyms are correct.
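The core idea, enumerating every finite string (token path) of a graph token stream, can be sketched with a plain depth-first walk. This is only a conceptual illustration with hypothetical data; the real `GraphTokenStreamFiniteStrings` builds an `Automaton` from position/position-length attributes and iterates it with `FiniteStringsIterator`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy token graph for "wifi network": the multi-token synonym "wi fi"
// adds a side path, so the graph has two root-to-accept paths.
public class TokenGraphPaths {
  record Edge(String token, int to) {}

  // adjacency: state -> outgoing edges; state 3 (no edges) is the accept state
  static final Map<Integer, List<Edge>> EDGES = Map.of(
      0, List.of(new Edge("wifi", 2), new Edge("wi", 1)),
      1, List.of(new Edge("fi", 2)),
      2, List.of(new Edge("network", 3)),
      3, List.of());

  static void collect(int state, List<String> path, List<String> out) {
    if (EDGES.get(state).isEmpty()) {     // accept state: one finite string done
      out.add(String.join(" ", path));
      return;
    }
    for (Edge e : EDGES.get(state)) {     // branch once per transition
      path.add(e.token());
      collect(e.to(), path, out);
      path.remove(path.size() - 1);       // backtrack
    }
  }

  public static void main(String[] args) {
    List<String> strings = new ArrayList<>();
    collect(0, new ArrayList<>(), strings);
    System.out.println(strings); // [wifi network, wi fi network]
  }
}
```

Each enumerated path becomes its own token stream at query-build time, which is why multi-token synonyms match correctly instead of being sausaged into one flat position list.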

I'll push this in a few days once I'm back home unless someone pushes first 
(@dsmiley feel free)...

Thank you @mattweber!





[GitHub] lucene-solr pull request #129: LUCENE-7603: Support Graph Token Streams in Q...

2016-12-30 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/129#discussion_r94217475
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.graph;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.BytesTermAttribute;
+import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
+import 
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
+import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
+import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.automaton.Automaton;
+import org.apache.lucene.util.automaton.FiniteStringsIterator;
+import org.apache.lucene.util.automaton.Operations;
+
+import static 
org.apache.lucene.util.automaton.Operations.DEFAULT_MAX_DETERMINIZED_STATES;
+
+/**
+ * Creates a list of {@link TokenStream} where each stream is the tokens 
that make up a finite string in graph token stream.  To do this,
+ * the graph token stream is converted to an {@link Automaton} and from 
there we use a {@link FiniteStringsIterator} to collect the various
+ * token streams for each finite string.
+ */
+public class GraphTokenStreamFiniteStrings {
+  /* TODO:
+ Most of this is a combination of code from TermAutomatonQuery and 
TokenStreamToTermAutomatonQuery. Would be
+ good to make this so it could be shared. */
+  private final Automaton.Builder builder;
+  Automaton det;
+  private final Map<BytesRef, Integer> termToID = new HashMap<>();
+  private final Map<Integer, BytesRef> idToTerm = new HashMap<>();
+  private final Map<Integer, Integer> idToInc = new HashMap<>();
+
+  public GraphTokenStreamFiniteStrings() {
+this.builder = new Automaton.Builder();
+  }
+
+  private static class BytesRefArrayTokenStream extends TokenStream {
+private final BytesTermAttribute termAtt = 
addAttribute(BytesTermAttribute.class);
+private final PositionIncrementAttribute posIncAtt = 
addAttribute(PositionIncrementAttribute.class);
+
+private final BytesRef[] terms;
+private final int[] increments;
+private int offset;
+
+BytesRefArrayTokenStream(BytesRef[] terms, int[] increments) {
+  this.terms = terms;
+  this.increments = increments;
+  assert terms.length == increments.length;
+  offset = 0;
+}
+
+@Override
+public boolean incrementToken() throws IOException {
+  if (offset < terms.length) {
+clearAttributes();
+termAtt.setBytesRef(terms[offset]);
+posIncAtt.setPositionIncrement(increments[offset]);
+offset = offset + 1;
+return true;
+  }
+
+  return false;
+}
+  }
+
+  /**
+   * Gets the list of finite string token streams from the given input 
graph token stream.
+   */
+  public List<TokenStream> getTokenStreams(final TokenStream in) throws IOException {
+// build the automaton
+build(in);
+
+List<TokenStream> tokenStreams = new ArrayList<>();
+final FiniteStringsIterator finiteStrings = new 
FiniteStringsIterator(det);
+for (IntsRef string; (string = finiteStrings.next()) != null; ) {
+  final BytesRef[] tokens = new BytesRef[string.length];
+  final int[] increments = new int[string.length];
+  for (int idx = string.offset, len = string.offset + string.length; 
idx < len; idx++) {
+int 

[GitHub] lucene-solr pull request #129: LUCENE-7603: Support Graph Token Streams in Q...

2016-12-30 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/129#discussion_r94217160
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.graph;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.BytesTermAttribute;
+import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
+import 
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
+import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
+import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.automaton.Automaton;
+import org.apache.lucene.util.automaton.FiniteStringsIterator;
+import org.apache.lucene.util.automaton.Operations;
+
+import static 
org.apache.lucene.util.automaton.Operations.DEFAULT_MAX_DETERMINIZED_STATES;
+
+/**
+ * Creates a list of {@link TokenStream} where each stream is the tokens 
that make up a finite string in graph token stream.  To do this,
+ * the graph token stream is converted to an {@link Automaton} and from 
there we use a {@link FiniteStringsIterator} to collect the various
+ * token streams for each finite string.
+ */
+public class GraphTokenStreamFiniteStrings {
+  /* TODO:
+ Most of this is a combination of code from TermAutomatonQuery and 
TokenStreamToTermAutomatonQuery. Would be
+ good to make this so it could be shared. */
+  private final Automaton.Builder builder;
+  Automaton det;
+  private final Map<BytesRef, Integer> termToID = new HashMap<>();
+  private final Map<Integer, BytesRef> idToTerm = new HashMap<>();
+  private final Map<Integer, Integer> idToInc = new HashMap<>();
+
+  public GraphTokenStreamFiniteStrings() {
+this.builder = new Automaton.Builder();
+  }
+
+  private static class BytesRefArrayTokenStream extends TokenStream {
+private final BytesTermAttribute termAtt = 
addAttribute(BytesTermAttribute.class);
+private final PositionIncrementAttribute posIncAtt = 
addAttribute(PositionIncrementAttribute.class);
+
+private final BytesRef[] terms;
+private final int[] increments;
+private int offset;
+
+BytesRefArrayTokenStream(BytesRef[] terms, int[] increments) {
+  this.terms = terms;
+  this.increments = increments;
+  assert terms.length == increments.length;
+  offset = 0;
+}
+
+@Override
+public boolean incrementToken() throws IOException {
+  if (offset < terms.length) {
+clearAttributes();
+termAtt.setBytesRef(terms[offset]);
+posIncAtt.setPositionIncrement(increments[offset]);
+offset = offset + 1;
+return true;
+  }
+
+  return false;
+}
+  }
+
+  /**
+   * Gets the list of finite string token streams from the given input 
graph token stream.
+   */
+  public List<TokenStream> getTokenStreams(final TokenStream in) throws IOException {
--- End diff --

Could we make this method private, make this class's constructor private, 
and add a `static` method here, the sole public method on this class, that 
receives the incoming `TokenStream` and returns the resulting `TokenStream[]`?  
Otherwise the API is sort of awkward, since e.g. this method seems like a getter 
Otherwise the API is sort of awkard, since e.g. this method seems like a getter 
yet it's doing lots of side-effects under the hood ...
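The suggested shape, private constructor plus a single public static entry point, can be sketched generically. Names and the trivial body are hypothetical; this only shows the API pattern being proposed, not Lucene code:

```java
import java.util.List;

// Sketch of the proposed API: the stateful, side-effecting construction is
// hidden behind a private constructor, and the one public static method
// hands back only the finished result, so no half-built instance escapes.
public class FiniteStringsSketch {
  private final List<String> result;

  private FiniteStringsSketch(List<String> input) {
    // all the mutation / side effects happen here, invisible to callers
    this.result = input.stream().map(String::toUpperCase).toList();
  }

  // The sole public method on the class.
  public static List<String> getTokenStreams(List<String> input) {
    return new FiniteStringsSketch(input).result;
  }
}
```

This avoids the getter-that-mutates smell: callers never see the intermediate object, only input in, output out.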



[GitHub] lucene-solr pull request #129: LUCENE-7603: Support Graph Token Streams in Q...

2016-12-30 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/129#discussion_r94215751
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java
 ---
@@ -114,21 +127,20 @@ private void build(final TokenStream in) throws 
IOException {
 in.reset();
 
 int pos = -1;
-int lastPos = 0;
+int lastIncr = 1;
 int maxOffset = 0;
 int maxPos = -1;
 int state = -1;
 while (in.incrementToken()) {
   int posInc = posIncAtt.getPositionIncrement();
-  assert pos > -1 || posInc > 0;
 
-  if (posInc > 1) {
-throw new IllegalArgumentException("cannot handle holes; to accept 
any term, use '*' term");
-  }
+  // always use inc 1 while building, but save original increment
+  int fakePosInc = posInc > 1 ? 1 : posInc;
--- End diff --

Maybe just `Math.min(1, posInc)` instead?
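The two forms are equivalent for every input, which is why the simplification is safe. A minimal check (method names are made up for the sketch):

```java
public class PosIncClamp {
  // Form from the diff: clamp increments greater than 1 down to 1.
  static int clampTernary(int posInc) {
    return posInc > 1 ? 1 : posInc;
  }

  // Suggested simplification: identical result for all ints.
  static int clampMin(int posInc) {
    return Math.min(1, posInc);
  }

  public static void main(String[] args) {
    for (int posInc : new int[] {0, 1, 2, 5, 100}) {
      if (clampTernary(posInc) != clampMin(posInc)) {
        throw new AssertionError("mismatch at " + posInc);
      }
    }
    System.out.println("equivalent");
  }
}
```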





[GitHub] lucene-solr pull request #129: LUCENE-7603: Support Graph Token Streams in Q...

2016-12-30 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/129#discussion_r94216469
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java
 ---
@@ -210,82 +215,20 @@ private void finish() {
*/
   private void finish(int maxDeterminizedStates) {
 Automaton automaton = builder.finish();
-
-// System.out.println("before det:\n" + automaton.toDot());
-
-Transition t = new Transition();
-
-// TODO: should we add "eps back to initial node" for all states,
-// and det that?  then we don't need to revisit initial node at
-// every position?  but automaton could blow up?  And, this makes it
-// harder to skip useless positions at search time?
-
-if (anyTermID != -1) {
-
-  // Make sure there are no leading or trailing ANY:
-  int count = automaton.initTransition(0, t);
-  for (int i = 0; i < count; i++) {
-automaton.getNextTransition(t);
-if (anyTermID >= t.min && anyTermID <= t.max) {
-  throw new IllegalStateException("automaton cannot lead with an 
ANY transition");
-}
-  }
-
-  int numStates = automaton.getNumStates();
-  for (int i = 0; i < numStates; i++) {
-count = automaton.initTransition(i, t);
-for (int j = 0; j < count; j++) {
-  automaton.getNextTransition(t);
-  if (automaton.isAccept(t.dest) && anyTermID >= t.min && 
anyTermID <= t.max) {
-throw new IllegalStateException("automaton cannot end with an 
ANY transition");
-  }
-}
-  }
-
-  int termCount = termToID.size();
-
-  // We have to carefully translate these transitions so automaton
-  // realizes they also match all other terms:
-  Automaton newAutomaton = new Automaton();
-  for (int i = 0; i < numStates; i++) {
-newAutomaton.createState();
-newAutomaton.setAccept(i, automaton.isAccept(i));
-  }
-
-  for (int i = 0; i < numStates; i++) {
-count = automaton.initTransition(i, t);
-for (int j = 0; j < count; j++) {
-  automaton.getNextTransition(t);
-  int min, max;
-  if (t.min <= anyTermID && anyTermID <= t.max) {
-// Match any term
-min = 0;
-max = termCount - 1;
-  } else {
-min = t.min;
-max = t.max;
-  }
-  newAutomaton.addTransition(t.source, t.dest, min, max);
-}
-  }
-  newAutomaton.finishState();
-  automaton = newAutomaton;
-}
-
 det = Operations.removeDeadStates(Operations.determinize(automaton, 
maxDeterminizedStates));
   }
 
-  private int getTermID(BytesRef term) {
+  private int getTermID(int incr, BytesRef term) {
 Integer id = termToID.get(term);
-if (id == null) {
+if (incr > 1 || id == null) {
--- End diff --

Hmm, doesn't this mean that if the same term shows up with a different 
`incr`, it will get a different `id` assigned?  But I think that is actually 
fine, since nowhere here do we depend on or expect that the same term must have 
the same id.





[GitHub] lucene-solr pull request #129: LUCENE-7603: Support Graph Token Streams in Q...

2016-12-30 Thread mikemccand
Github user mikemccand commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/129#discussion_r94216511
  
--- Diff: 
lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java
 ---
@@ -174,25 +191,13 @@ private void setAccept(int state, boolean accept) {
   /**
* Adds a transition to the automaton.
*/
-  private void addTransition(int source, int dest, String term) {
-addTransition(source, dest, new BytesRef(term));
-  }
-
-  /**
-   * Adds a transition to the automaton.
-   */
-  private void addTransition(int source, int dest, BytesRef term) {
+  private int addTransition(int source, int dest, int incr, BytesRef term) 
{
 if (term == null) {
--- End diff --

This can become an assert?
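The trade-off behind the suggestion: Java asserts are disabled by default at runtime, so they cost nothing in production, but they still trip under `-ea` (which Lucene's test runner enables), which fits an internal invariant that callers should never violate. A generic before/after sketch, with a `String` standing in for the real `BytesRef` parameter:

```java
public class TransitionCheck {
  // Before (as in the diff): an explicit runtime check.
  static void addTransitionChecked(String term) {
    if (term == null) {
      throw new IllegalArgumentException("term must not be null");
    }
    // ... build the transition from term ...
  }

  // After the suggested change: an assert, since all callers are internal
  // and a null term would be a bug in this class, not user error.
  static void addTransitionAsserted(String term) {
    assert term != null : "term must not be null";
    // ... build the transition from term ...
  }

  public static void main(String[] args) {
    addTransitionChecked("foo");
    addTransitionAsserted("foo");
    System.out.println("ok");
  }
}
```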





[GitHub] lucene-solr pull request #:

2016-08-10 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/3816a0eb2bbd1929523ae27db3c90d0942ed5f5f#commitcomment-18587727
  
Thank you for raising it @makeyang!





[GitHub] lucene-solr pull request:

2016-02-04 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/a48ba5090beafc725ab04e67996e42f0b3f31348#commitcomment-15900830
  
Thanks @iorixxx.





[GitHub] lucene-solr pull request:

2016-02-02 Thread mikemccand
Github user mikemccand commented on the pull request:


https://github.com/apache/lucene-solr/commit/a48ba5090beafc725ab04e67996e42f0b3f31348#commitcomment-15840601
  
Yes it would @iorixxx!  Maybe open an issue / PR at Lucene for this?  
Thanks.

There are many places doing the conversion "directly" now.

