[GitHub] [lucene] zhaih commented on pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-02 Thread GitBox


zhaih commented on pull request #163:
URL: https://github.com/apache/lucene/pull/163#issuecomment-853538280


   Thank you @mikemccand, @dweiss and @bruno-roustant for reviewing this PR! 
Since this PR is more of an optimization for adversarial cases, we don't want 
to sacrifice the performance of our normal use cases. I will take some time to 
benchmark these changes (this one as well as a few others that are not yet 
presented) first and see what numbers we get before deciding how to move on 
with this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7321) Character Mapping

2021-06-02 Thread Marcus Eagan (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356136#comment-17356136
 ] 

Marcus Eagan commented on LUCENE-7321:
--

Hi [~iprovalo] I'm curious if you have been maintaining this patch through 
version `8` for your company? If so, do you want to revive this discussion?

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>
>
> One of the challenges in search is recall of an item with a common typing 
> variant.  These cases can be as simple as lower/upper case in most languages, 
> accented characters, or more complex morphological phenomena like prefix 
> omission, or constructing a character with some combining mark.  This 
> component addresses the cases that are not covered by the ASCII folding 
> component, or that are more complex to design with other tools.  The idea is 
> that a linguist could provide the mappings in a tab-delimited file, which can 
> then be used directly by Solr.
> The mappings are maintained in the tab-delimited file, which could be just a 
> copy-paste from an Excel spreadsheet.  This gives the linguists the 
> opportunity to create the mappings, and then the developer includes them in 
> the Solr configuration.  There are a few cases, when the mappings grow 
> complex, where some additional debugging may be required.  The mappings can 
> map any sequence of characters to any other sequence of characters.
> Some of the cases I discuss in the detailed document are handling the voiced 
> vowels for Japanese; common typing substitutions for Korean, Russian, Polish; 
> transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding 
> for Japanese.  In the appendix, I give an example of implementing a Russian 
> lightweight stemmer using this component.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-02 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356092#comment-17356092
 ] 

Zach Chen commented on LUCENE-9976:
---

Hi Dawid and Michael! I tried again with the command line above with 1000 
iterations, but for some reason it still didn't reproduce for me.
{code:java}
xichen@Xis-MacBook-Pro lucene % ./gradlew test -Ptests.iters=1000 --tests 
TestExpressionSorts.testQueries -Dtests.seed=FF571CE915A0955 
-Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true 
-Dtests.asserts=true -p lucene/expressions/
Starting a Gradle Daemon, 7 busy and 18 incompatible Daemons could not be 
reused, use --status for details


> Task :randomizationInfo
Running tests with randomization seed: tests.seed=FF571CE915A0955


> Task :lucene:expressions:test
:lucene:expressions:test (SUCCESS): 1000 test(s)
The slowest tests (exceeding 500 ms) during this run:
   6.62s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:159F353910AC3564]} (:lucene:expressions)
   6.56s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:993EFB36FB8A23F3]} (:lucene:expressions)
   6.22s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:C9E931CFB8A6C82E]} (:lucene:expressions)
   6.21s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:2854FA7396FAF62F]} (:lucene:expressions)
   5.84s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:5515E173B4FD16BA]} (:lucene:expressions)
   5.65s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:A8C1890BB457C90F]} (:lucene:expressions)
   5.62s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:A44F7F3F8B79B2DB]} (:lucene:expressions)
   5.57s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:328FA3364F99C839]} (:lucene:expressions)
   5.56s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:9D8BCE5B3371B6E2]} (:lucene:expressions)
   5.55s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:2E635F6265446CED]} (:lucene:expressions)
The slowest suites (exceeding 1s) during this run:
  2662.21s TestExpressionSorts (:lucene:expressions)


BUILD SUCCESSFUL in 45m 1s{code}

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Resolved] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation

2021-06-02 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-9823.
--
Fix Version/s: main (9.0)
   Resolution: Fixed

> SynonymQuery rewrite can change field boost calculation
> ---
>
> Key: LUCENE-9823
> URL: https://issues.apache.org/jira/browse/LUCENE-9823
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
>  Labels: newdev
> Fix For: main (9.0)
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> SynonymQuery accepts a boost per term, which acts as a multiplier on the term 
> frequency in the document. When rewriting a SynonymQuery with a single term, 
> we create a BoostQuery wrapping a TermQuery. This changes the meaning of the 
> boost: it now multiplies the final TermQuery score instead of multiplying the 
> term frequency before it's passed to the score calculation.
> This is a small point, but maybe it's worth avoiding rewriting a single-term 
> SynonymQuery unless the boost is 1.0.
> The same consideration affects CombinedFieldQuery in sandbox.
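The difference described in the issue above can be illustrated with a small, self-contained sketch. It does not use Lucene: `saturate` is a simplified BM25-style term-frequency saturation (no IDF, no length normalization), and the class and method names are hypothetical. The point is only that a boost applied to the term frequency and a boost applied to the final score agree when the boost is 1.0 and generally disagree otherwise.

```java
final class BoostPlacement {
  // Simplified BM25-style saturation; k1 and the missing IDF and length
  // normalization are simplifying assumptions made for this illustration.
  static double saturate(double tf, double k1) {
    return tf / (tf + k1);
  }

  // Boost multiplies the term frequency before scoring (SynonymQuery-style).
  static double boostOnFreq(double tf, double boost, double k1) {
    return saturate(tf * boost, k1);
  }

  // Boost multiplies the final score (a BoostQuery wrapping a TermQuery).
  static double boostOnScore(double tf, double boost, double k1) {
    return boost * saturate(tf, k1);
  }

  public static void main(String[] args) {
    double k1 = 1.2;
    for (double boost : new double[] {1.0, 0.5, 2.0}) {
      System.out.printf(
          "boost=%.1f onFreq=%.4f onScore=%.4f%n",
          boost, boostOnFreq(3, boost, k1), boostOnScore(3, boost, k1));
    }
  }
}
```

Because the saturation function is non-linear, the two placements only coincide for a boost of 1.0, which is consistent with the suggestion to skip the rewrite otherwise.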






[jira] [Commented] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation

2021-06-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356058#comment-17356058
 ] 

ASF subversion and git services commented on LUCENE-9823:
-

Commit 89034ad8cf8019c62a0a4ed1e477cd52e1277e60 in lucene's branch 
refs/heads/main from Naoto MINAMI
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=89034ad ]

LUCENE-9823: Prevent unsafe rewrites for SynonymQuery and CombinedFieldQuery. 
(#160)

Before, rewriting could slightly change the scoring when weights were
specified. We now rewrite less aggressively to avoid changing the query's
behavior.

> SynonymQuery rewrite can change field boost calculation
> ---
>
> Key: LUCENE-9823
> URL: https://issues.apache.org/jira/browse/LUCENE-9823
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
>  Labels: newdev
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> SynonymQuery accepts a boost per term, which acts as a multiplier on the term 
> frequency in the document. When rewriting a SynonymQuery with a single term, 
> we create a BoostQuery wrapping a TermQuery. This changes the meaning of the 
> boost: it now multiplies the final TermQuery score instead of multiplying the 
> term frequency before it's passed to the score calculation.
> This is a small point, but maybe it's worth avoiding rewriting a single-term 
> SynonymQuery unless the boost is 1.0.
> The same consideration affects CombinedFieldQuery in sandbox.






[GitHub] [lucene] jtibshirani merged pull request #160: LUCENE-9823: Fix not to rewrite boosted single term SynonymQuery

2021-06-02 Thread GitBox


jtibshirani merged pull request #160:
URL: https://github.com/apache/lucene/pull/160


   





[GitHub] [lucene] jtibshirani commented on a change in pull request #166: LUCENE-9905: Move HNSW build parameters to codec

2021-06-02 Thread GitBox


jtibshirani commented on a change in pull request #166:
URL: https://github.com/apache/lucene/pull/166#discussion_r644395730



##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorFormat.java
##
@@ -76,14 +79,55 @@
   static final int VERSION_START = 0;
   static final int VERSION_CURRENT = VERSION_START;
 
-  /** Sole constructor */
+  static final String BEAM_WIDTH_KEY =
+  Lucene90HnswVectorFormat.class.getSimpleName() + ".beam_width";
+  static final String MAX_CONN_KEY = 
Lucene90HnswVectorFormat.class.getSimpleName() + ".max_conn";
+
+  /**
+   * Controls how many of the nearest neighbor candidates are connected to the 
new node. See {@link
+   * HnswGraph} for details.
+   */
+  private final int maxConn;
+
+  /**
+   * The number of candidate neighbors to track while searching the graph for 
each newly inserted
+   * node. See {@link HnswGraph} for details.
+   */
+  private final int beamWidth;
+
   public Lucene90HnswVectorFormat() {
 super("Lucene90HnswVectorFormat");
+this.maxConn = HnswGraphBuilder.DEFAULT_MAX_CONN;
+this.beamWidth = HnswGraphBuilder.DEFAULT_BEAM_WIDTH;
+  }
+
+  public Lucene90HnswVectorFormat(int maxConn, int beamWidth) {
+super("Lucene90HnswVectorFormat");
+this.maxConn = maxConn;
+this.beamWidth = beamWidth;
   }
 
   @Override
   public VectorWriter fieldsWriter(SegmentWriteState state) throws IOException 
{
-return new Lucene90HnswVectorWriter(state);
+SegmentInfo segmentInfo = state.segmentInfo;
+putFormatAttribute(segmentInfo, MAX_CONN_KEY, String.valueOf(maxConn));
+putFormatAttribute(segmentInfo, BEAM_WIDTH_KEY, String.valueOf(beamWidth));
+return new Lucene90HnswVectorWriter(state, maxConn, beamWidth);
+  }
+
+  private void putFormatAttribute(SegmentInfo si, String key, String value) {
+String previousValue = si.putAttribute(key, value);
+if (previousValue != null && previousValue.equals(value) == false) {

Review comment:
   I'm not sure that writing and validating these format attributes is 
necessary, since we don't use them when reading. It just seemed nice (and low 
cost) to have the construction parameters available in the segment infos for 
debugging.
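For readers who want to see the write-and-validate pattern in isolation, here is a minimal sketch that uses a plain `HashMap` in place of `SegmentInfo`. The class name `FormatAttributes`, the `get` helper, and the choice of `IllegalStateException` are assumptions for illustration; the quoted diff is truncated before the body of the `if`, so the actual behavior on a mismatch is not shown there.

```java
import java.util.HashMap;
import java.util.Map;

final class FormatAttributes {
  // Stand-in for the attribute map kept by a segment info (assumption).
  private final Map<String, String> attributes = new HashMap<>();

  // Stores the value and rejects a conflicting value written earlier
  // under the same key. Writing the same value twice is allowed.
  void putFormatAttribute(String key, String value) {
    String previous = attributes.put(key, value);
    if (previous != null && !previous.equals(value)) {
      throw new IllegalStateException(
          "found existing value for " + key + ": " + previous + ", new value: " + value);
    }
  }

  String get(String key) {
    return attributes.get(key);
  }
}
```

In this sketch the check costs one map lookup per attribute, which matches the "low cost" remark in the review comment.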







[GitHub] [lucene] jtibshirani opened a new pull request #166: LUCENE-9905: Move HNSW build parameters to codec

2021-06-02 Thread GitBox


jtibshirani opened a new pull request #166:
URL: https://github.com/apache/lucene/pull/166


   Previously, the max connections and beam width parameters could be
   configured as field type attributes. This PR moves them to be parameters on
   Lucene90HnswVectorFormat, to avoid exposing details of the vector format
   implementation in the API.
   





[GitHub] [lucene] chlorochrule commented on pull request #160: LUCENE-9823: Fix not to rewrite boosted single term SynonymQuery

2021-06-02 Thread GitBox


chlorochrule commented on pull request #160:
URL: https://github.com/apache/lucene/pull/160#issuecomment-853248112


   Thanks for explaining and reviewing again!
   I fixed it :)





[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-06-02 Thread GitBox


zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r644172488



##
File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -255,6 +260,32 @@ private boolean releaseBufferedToken() {
 return false;
   }
 
+  /**
+   * Free output nodes before the given outputs. Free inputs nodes before the 
minimum input node for
+   * this output.
+   *
+   * @param output target output node
+   */
+  private void freeBefore(OutputNode output) {
+// We've released all of the tokens that end at the current output,
+// so free all output nodes before this. Input nodes are more complex.
+// The second shingled tokens with alternate paths can appear later in the 
output graph than
+// than some of their alternate path tokens.
+// Because of this case we can only free from the minimum because the 
minimum node will have
+// come from before the second shingled token.
+// This means we have to hold onto input nodes who's tokens get stacked on 
previous nodes until
+// we've completely passed those inputs.
+// Related tests testShingledGap, testShingledGapWithHoles
+outputFrom++;
+int freeBefore = Collections.min(output.inputNodes);
+// This will catch a node being freed early if it's input to the next 
output.
+// Could a freed early node be input to a later output?
+assert outputNodes.get(outputFrom).inputNodes.stream().filter(n -> 
freeBefore < n).count() > 0
+: "FreeBefore " + output.inputNodes.get(0) + " will free in use nodes";

Review comment:
   Isn't this still the old assertion that needs to be changed?







[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-06-02 Thread GitBox


zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r644165757



##
File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -362,6 +394,48 @@ public boolean incrementToken() throws IOException {
 }
   }
 
+  private OutputNode recoverFromHole(InputNode src, int startOffset) {
+// This means the "from" node of this token was never seen as a "to" node,
+// which should only happen if we just crossed a hole.  This is a 
challenging
+// case for us because we normally rely on the full dependencies expressed
+// by the arcs to assign outgoing node IDs.  It would be better if tokens
+// were never dropped but instead just marked deleted with a new
+// TermDeletedAttribute (boolean valued) ... but until that future, we have
+// a hack here to forcefully jump the output node ID:
+assert src.outputNode == -1;
+src.node = inputFrom;
+
+int maxOutIndex = outputNodes.getMaxPos();
+OutputNode outSrc = outputNodes.get(maxOutIndex);
+// There are two types of holes, neighbor holes and consumed holes. A 
neighbor hole is between
+// two tokens, it looks like a->*hole*->b.
+// A consumed hole is between the start a long token and the next token 
that is "under" the path
+// of the long token.
+// It looks like :___abc__
+//   ||
+//   |V
+// *hole*->b->c
+// A consumed hole should have the outputsrc node of the short token after 
the hole be the out
+// dest
+// of the long token as that's how we'd resolve it if the missing token 
were present.
+// neighbor holes should start a new output node and continue as if the 
hole didn't
+// exist.
+// Related tests testAltPathLastStepHoleFollowedByHole, 
testAltPathFirstStepHole,

Review comment:
   Thank you for linking the tests here!







[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-06-02 Thread GitBox


zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r644165438



##
File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -362,6 +378,40 @@ public boolean incrementToken() throws IOException {
 }
   }
 
+  private OutputNode recoverFromHole(InputNode src, int startOffset) {
+// This means the "from" node of this token was never seen as a "to" node,
+// which should only happen if we just crossed a hole.  This is a 
challenging
+// case for us because we normally rely on the full dependencies expressed
+// by the arcs to assign outgoing node IDs.  It would be better if tokens
+// were never dropped but instead just marked deleted with a new
+// TermDeletedAttribute (boolean valued) ... but until that future, we have
+// a hack here to forcefully jump the output node ID:
+assert src.outputNode == -1;
+src.node = inputFrom;
+
+int maxOutIndex = outputNodes.getMaxPos();
+OutputNode outSrc = outputNodes.get(maxOutIndex);
+// There are two types of holes, neighbor holes and consumed holes. A 
neighbor hole is between

Review comment:
   Thank you, that helps a lot!







[GitHub] [lucene] jtibshirani commented on a change in pull request #160: LUCENE-9823: Fix not to rewrite boosted single term SynonymQuery

2021-06-02 Thread GitBox


jtibshirani commented on a change in pull request #160:
URL: https://github.com/apache/lucene/pull/160#discussion_r644158197



##
File path: lucene/core/src/test/org/apache/lucene/search/TestSynonymQuery.java
##
@@ -466,4 +467,26 @@ public void testRandomTopDocs() throws IOException {
 reader.close();
 dir.close();
   }
+
+  public void testRewrite() throws IOException {
+// zero length SynonymQuery is rewritten
+SynonymQuery q = new SynonymQuery.Builder("f").build();

Review comment:
   A small comment: in most other rewrite tests (like 
`TestBoostQuery#testRewrite`) we check the higher-level call 
`IndexSearcher#rewrite`. It'd be nice to do this to be consistent and to 
exercise the full rewrite logic.

##
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##
@@ -237,14 +236,6 @@ public Query rewrite(IndexReader reader) throws 
IOException {
 if (fieldTerms.length == 1) {

Review comment:
   I like the simplification below of removing the rewrite to synonym query 
(which is not perfectly accurate). I think we also need to fix or remove this 
check `if (fieldTerms.length == 1) { ... }`, since it's only accurate when the 
field weight is 1.0f.







[GitHub] [lucene] zhaih commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-02 Thread GitBox


zhaih commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r644146024



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;
+  private boolean changed;
+  private int[] arrayCache = new int[0];
+
+  StateSet(int capacity) {
+inner = new IntIntHashMap(capacity);
+  }
+
+  // Adds this state to the set
+  void incr(int num) {
+if (inner.addTo(num, 1) == 1) {
+  changed = true;
+}
+  }
+
+  // Removes this state from the set, if count decrs to 0
+  void decr(int num) {
+assert inner.containsKey(num);
+int keyIndex = inner.indexOf(num);
+int count = inner.indexGet(keyIndex) - 1;
+if (count == 0) {
+  inner.remove(num);
+  changed = true;
+} else {
+  inner.indexReplace(keyIndex, count);
+}
+  }
+
+  void computeHash() {
+if (changed == false) {
+  return;
+}
+hashCode = inner.size();
+for (IntCursor cursor : inner.keys()) {
+  hashCode += BitMixer.mix(cursor.value);
+}
+  }
+
+  /**
+   * Create a snapshot of this int set associated with a given state. The 
snapshot will not retain
+   * any frequency information about the elements of this set, only existence.
+   *
+   * It is the caller's responsibility to ensure that the hashCode and data 
are up to date via
+   * the {@link #computeHash()} method before calling this method.
+   *
+   * @param state the state to associate with the frozen set.
+   * @return A new FrozenIntSet with the same values as this set.
+   */
+  FrozenIntSet freeze(int state) {
+if (changed == false) {
+  assert arrayCache != null;

Review comment:
   @mikemccand We actually might fall inside this `if`. Before we call 
`freeze`, we perform a `get` on the `newStates` hashmap using this `StateSet`. 
That lookup might call `getArray()` if there's a hash code collision, and 
`changed` should be set to `false` there.
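The interaction between the `changed` flag and the cached array can be sketched without hppc. The class below is a hypothetical stand-in, not the PR's `StateSet`: it uses `java.util.HashMap`, and it assumes that `getArray()` returns a sorted copy of the keys and clears the flag. It shows how a later caller can observe `changed == false` simply because an earlier `getArray()` call already refreshed the cache.

```java
import java.util.HashMap;
import java.util.Map;

final class CountingIntSet {
  private final Map<Integer, Integer> counts = new HashMap<>();
  private boolean changed;
  private int[] arrayCache = new int[0];

  // Adds one reference to the key; the key set changes only on the first one.
  void incr(int key) {
    if (counts.merge(key, 1, Integer::sum) == 1) {
      changed = true;
    }
  }

  // Drops one reference; the key set changes only when the count reaches 0.
  void decr(int key) {
    Integer count = counts.get(key);
    if (count == null) {
      throw new IllegalArgumentException("key not present: " + key);
    }
    if (count == 1) {
      counts.remove(key);
      changed = true;
    } else {
      counts.put(key, count - 1);
    }
  }

  // Rebuilds the cached key array only if the key set changed since the
  // last call, then clears the flag (sorting is an assumption of this sketch).
  int[] getArray() {
    if (changed) {
      arrayCache = counts.keySet().stream().mapToInt(Integer::intValue).sorted().toArray();
      changed = false;
    }
    return arrayCache;
  }

  boolean isChanged() {
    return changed;
  }
}
```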







[GitHub] [lucene] zhaih commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-02 Thread GitBox


zhaih commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r644142000



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;
+  private boolean changed;
+  private int[] arrayCache = new int[0];
+
+  StateSet(int capacity) {
+inner = new IntIntHashMap(capacity);
+  }
+
+  // Adds this state to the set
+  void incr(int num) {
+if (inner.addTo(num, 1) == 1) {
+  changed = true;

Review comment:
   +1







[GitHub] [lucene] zhaih commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-02 Thread GitBox


zhaih commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r644141734



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
##
@@ -676,7 +677,7 @@ public static Automaton determinize(Automaton a, int 
maxDeterminizedStates) {
 // a.writeDot("/l/la/lucene/core/detin.dot");
 
 // Same initial values and state will always have the same hashCode
-FrozenIntSet initialset = new FrozenIntSet(new int[] {0}, 683, 0);
+FrozenIntSet initialset = new FrozenIntSet(new int[] {0}, BitMixer.mix(0) 
+ 1, 0);

Review comment:
   Just to keep the hash code the same as the one used in `StateSet`. I think 0 
should be the hash code used for a 0-length array?
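The hashing scheme visible in the quoted `computeHash()` (start from the set size, then add a mixed value per key) is independent of iteration order, which is what lets an unsorted set and a frozen array agree on their hash codes. A minimal sketch follows; `mix` here is the 32-bit finalizer from MurmurHash3, used as a stand-in so the example has no dependencies, and it should not be read as a statement about what hppc's `BitMixer.mix` returns. The class name is hypothetical and the keys are assumed to be distinct.

```java
final class OrderIndependentHash {
  // Stand-in integer mixer (MurmurHash3 32-bit finalizer); an assumption
  // made so that the sketch is self-contained.
  static int mix(int k) {
    k ^= k >>> 16;
    k *= 0x85ebca6b;
    k ^= k >>> 13;
    k *= 0xc2b2ae35;
    k ^= k >>> 16;
    return k;
  }

  // Hash of a set of distinct keys: the size plus the sum of mixed keys.
  // Integer addition is commutative, so key order does not matter.
  static int hash(int... keys) {
    int h = keys.length;
    for (int key : keys) {
      h += mix(key);
    }
    return h;
  }
}
```

Under this scheme the single-element set `{0}` hashes to `mix(0) + 1` and the empty set hashes to 0, which mirrors the reasoning in the comment above.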







[GitHub] [lucene] zhaih commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-02 Thread GitBox


zhaih commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r644138606



##
File path: lucene/core/build.gradle
##
@@ -20,6 +20,8 @@ apply plugin: 'java-library'
 description = 'Lucene core library'
 
 dependencies {
+  implementation 'com.carrotsearch:hppc'

Review comment:
   @bruno-roustant Thank you for the advice! Unfortunately, I tried WormMap 
yesterday (with hppc 0.9.0.RC2) and I didn't see any benefit in the adversarial 
test case. Just to educate myself, is removal also a fast operation?










[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-02 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355852#comment-17355852
 ] 

Haoyu Zhai commented on LUCENE-9983:


+1 to having a set of regexps that we can benchmark; I'm also a little worried 
that the PR might make the normal cases worse.

[~broustant] That is a good idea. I tried using a 128-entry array as a map for 
the first 128 states, and it doesn't help the adversarial cases (I also pulled 
out some stats and found that the adversarial cases have far more states than 
that). But I think we might see some benefit in the normal cases once we have 
the benchmark set up.
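
To make the array idea concrete, here is a rough sketch, assuming plain 
java.util collections instead of HPPC. The class name HybridStateCounts and 
its methods are hypothetical and are not code from the PR: small state ids 
are counted in a fixed 128-entry array, everything else falls back to a hash 
map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not code from the PR): dense array for small state ids, hash map for the rest. */
final class HybridStateCounts {
  private static final int DENSE_LIMIT = 128; // the array size mentioned in the comment

  private final int[] dense = new int[DENSE_LIMIT];
  private final Map<Integer, Integer> sparse = new HashMap<>();
  private int distinct; // number of states with a non-zero count

  /** Adds one reference to a (non-negative) state id. */
  void incr(int state) {
    if (state < DENSE_LIMIT) {
      if (dense[state]++ == 0) {
        distinct++;
      }
    } else if (sparse.merge(state, 1, Integer::sum) == 1) {
      distinct++;
    }
  }

  /** Drops one reference; the state disappears once its count reaches zero. */
  void decr(int state) {
    if (count(state) == 0) {
      throw new IllegalStateException("state not present: " + state);
    }
    if (state < DENSE_LIMIT) {
      if (--dense[state] == 0) {
        distinct--;
      }
    } else if (sparse.merge(state, -1, Integer::sum) == 0) {
      sparse.remove(state);
      distinct--;
    }
  }

  /** Current reference count of a state (0 if absent). */
  int count(int state) {
    return state < DENSE_LIMIT ? dense[state] : sparse.getOrDefault(state, 0);
  }

  /** Number of distinct states currently present. */
  int size() {
    return distinct;
  }
}
```

As noted in the comment, a fast path like this can only pay off when most 
state ids stay below the array size, which is not the case for the 
adversarial regexps.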

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to look up in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  
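
The key-only identity described in the issue can be sketched as follows. 
This is a minimal illustration under stated assumptions, not the code from 
the PR: it uses java.util.HashMap rather than HPPC, and the class name 
KeyOnlyStateSet and the mix function are hypothetical. The point it shows is 
that an order-independent hash over the keys makes sorting unnecessary.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a counted set of NFA states whose identity depends only on its keys. */
final class KeyOnlyStateSet {
  // state -> reference count; the count is bookkeeping and never affects hashCode/equals
  private final Map<Integer, Integer> counts = new HashMap<>();

  /** Adds one reference to the given state. */
  void incr(int state) {
    counts.merge(state, 1, Integer::sum);
  }

  /** Drops one reference; the state leaves the set when its count reaches zero. */
  void decr(int state) {
    Integer c = counts.get(state);
    if (c == null) {
      throw new IllegalStateException("state not present: " + state);
    }
    if (c == 1) {
      counts.remove(state);
    } else {
      counts.put(state, c - 1);
    }
  }

  int size() {
    return counts.size();
  }

  @Override
  public int hashCode() {
    // Order-independent: a sum of per-key hashes, so iteration order does not matter.
    int h = counts.size();
    for (int state : counts.keySet()) {
      h += mix(state);
    }
    return h;
  }

  @Override
  public boolean equals(Object o) {
    // Compare keys only, ignoring the counts.
    return o instanceof KeyOnlyStateSet
        && counts.keySet().equals(((KeyOnlyStateSet) o).counts.keySet());
  }

  // Simple integer bit mixer; an arbitrary choice for this sketch, not HPPC's BitMixer.
  private static int mix(int k) {
    k *= 0x9E3779B9;
    return k ^ (k >>> 16);
  }
}
```

Two sets built in different insertion orders, or with different counts for 
the same states, compare equal and hash identically, which is all the 
"has this powerset been seen" lookup needs.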



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2021-06-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355808#comment-17355808
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit eecd1971fa748c2593e8a452484af5ba5d598915 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=eecd197 ]

LUCENE-9905: Allow Lucene90Codec to be configured with a per-field vector 
format (#164)

Previously only AssertingCodec could handle a per-field vector format. This PR
also strengthens the checks in TestPerFieldVectorFormat.

> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: main (9.0)
>Reporter: Julie Tibshirani
>Priority: Blocker
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field \(?\)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing + storing a HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/ writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]
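
The proposal above can be sketched in a few lines. This is only an 
illustration with hypothetical stand-in types (VectorFormatSketch, 
HnswFormatSketch, PerFieldFormatSketch); it does not use or reproduce 
Lucene's actual codec API. It shows the two ideas from the list: the HNSW 
parameters become constructor arguments of an HNSW-specific format, and a 
per-field wrapper picks a format for each field.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a vector format; not Lucene's actual API. */
interface VectorFormatSketch {
  String name();
}

/** HNSW-specific format that owns its parameters instead of reading them from field attributes. */
final class HnswFormatSketch implements VectorFormatSketch {
  final int maxConn;
  final int beamWidth;

  HnswFormatSketch(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  @Override
  public String name() {
    return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
  }
}

/** Chooses a format per field, falling back to a default when no override is registered. */
final class PerFieldFormatSketch {
  private final VectorFormatSketch defaultFormat;
  private final Map<String, VectorFormatSketch> overrides = new HashMap<>();

  PerFieldFormatSketch(VectorFormatSketch defaultFormat) {
    this.defaultFormat = defaultFormat;
  }

  /** Registers a field-specific format and returns this for chaining. */
  PerFieldFormatSketch override(String field, VectorFormatSketch format) {
    overrides.put(field, format);
    return this;
  }

  VectorFormatSketch formatFor(String field) {
    return overrides.getOrDefault(field, defaultFormat);
  }
}
```

With this shape, trying another algorithm means adding another 
VectorFormatSketch implementation rather than another enum constant that 
every format has to understand.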






[GitHub] [lucene] jtibshirani merged pull request #164: LUCENE-9905: Allow Lucene90Codec to be configured with a per-field vector format

2021-06-02 Thread GitBox


jtibshirani merged pull request #164:
URL: https://github.com/apache/lucene/pull/164


   





[GitHub] [lucene] jtibshirani commented on pull request #164: LUCENE-9905: Allow Lucene90Codec to be configured with a per-field vector format

2021-06-02 Thread GitBox


jtibshirani commented on pull request #164:
URL: https://github.com/apache/lucene/pull/164#issuecomment-853136935


   Thanks for reviewing!





[jira] [Commented] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355764#comment-17355764
 ] 

Michael McCandless commented on LUCENE-9987:


OK good to know, thanks everyone!

> JVM 11.0.6 crash while trying to read term vectors in CheckIndex?
> -
>
> Key: LUCENE-9987
> URL: https://issues.apache.org/jira/browse/LUCENE-9987
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: hs_err_pid529873.log, hs_err_pid536810.log
>
>
> [This build|https://jenkins.thetaphi.de/job/Lucene-main-Linux/30482/] failed 
> with JVM crash:
> {noformat}
> Current thread (0x7f68780d24e0):  JavaThread 
> "TEST-TestExpressionSorts.testQueries-seed#[25E6600265A5C2D4]" 
> [_thread_in_Java, id=530195, stack(0x7f68ae1f2000,0x7f68ae2f3000)]
> Stack: [0x7f68ae1f2000,0x7f68ae2f3000],  sp=0x7f68ae2ef800,  free 
> space=1014k
> Native frames: (J=compiled Java code, A=aot compiled Java code, 
> j=interpreted, Vv=VM code, C=native code)
> J 4987% c2 
> org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingTermVectorsReader.positionIndex(IILorg/apache/lucene/util/LongValues;[I)[[I
>  (136 bytes) @ 0x7f69347af02e [0x7f69347ae480+0x0bae]
> J 4952 c1 
> org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingTermVectorsReader.get(I)Lorg/apache/lucene/index/Fields;
>  (2695 bytes) @ 0x7f692db07754 [0x7f692db025c0+0x5194]
> J 4895 c1 
> org.apache.lucene.codecs.asserting.AssertingTermVectorsFormat$AssertingTermVectorsReader.get(I)Lorg/apache/lucene/index/Fields;
>  (26 bytes) @ 0x7f692daace2c [0x7f692daacd20+0x010c]
> j  
> org.apache.lucene.index.CheckIndex.testTermVectors(Lorg/apache/lucene/index/CodecReader;Ljava/io/PrintStream;ZZZ)Lorg/apache/lucene/index/CheckIndex$Status$TermVectorStatus;+96
> j  
> org.apache.lucene.index.CheckIndex.checkIndex(Ljava/util/List;)Lorg/apache/lucene/index/CheckIndex$Status;+1718
> j  
> org.apache.lucene.util.TestUtil.checkIndex(Lorg/apache/lucene/store/Directory;ZZLjava/io/ByteArrayOutputStream;)Lorg/apache/lucene/index/CheckIndex$Status;+67
> j  org.apache.lucene.store.MockDirectoryWrapper.close()V+276
> j  org.apache.lucene.expressions.TestExpressionSorts.tearDown()V+11
> v  ~StubRoutines::call_stub
> V  [libjvm.so+0x887569]  JavaCalls::call_helper(JavaValue*, methodHandle 
> const&, JavaCallArguments*, Thread*)+0x3b9
> V  [libjvm.so+0xcb1a2d]  invoke(InstanceKlass*, methodHandle const&, Handle, 
> bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) [clone 
> .constprop.80]+0x43d
> V  [libjvm.so+0xcb2a62]  Reflection::invoke_method(oopDesc*, Handle, 
> objArrayHandle, Thread*)+0x102
> V  [libjvm.so+0x93b0cc]  JVM_InvokeMethod+0xfc
> j  
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+0
>  java.base@11.0.6
> j  
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+100
>  java.base@11.0.6
> j  
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+6
>  java.base@11.0.6
> j  
> java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+59
>  java.base@11.0.6
> j  
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)V+69
> j  com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate()V+69
> j  org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate()V+20 
> {noformat}






[GitHub] [lucene] chlorochrule commented on pull request #160: LUCENE-9823: Fix not to rewrite boosted single term SynonymQuery

2021-06-02 Thread GitBox


chlorochrule commented on pull request #160:
URL: https://github.com/apache/lucene/pull/160#issuecomment-853065004


   Thanks for reviewing, @jtibshirani!
   I added `TestSynonymQuery#testRewrite` and fixed the same problem in 
`CombinedFieldQuery`. I may not have fully understood the meaning of:
   > The same consideration affects CombinedFieldQuery in sandbox.
   
   If the fix in bc85a2a is incorrect, please explain what the same 
consideration is. 
   
   Thanks.





[GitHub] [lucene-solr] janhoy commented on pull request #2503: Re-introduce ant precommit github action in 8x branch

2021-06-02 Thread GitBox


janhoy commented on pull request #2503:
URL: https://github.com/apache/lucene-solr/pull/2503#issuecomment-853050898


   Thanks, now let's see if this works...





[GitHub] [lucene-solr] janhoy merged pull request #2503: Re-introduce ant precommit github action in 8x branch

2021-06-02 Thread GitBox


janhoy merged pull request #2503:
URL: https://github.com/apache/lucene-solr/pull/2503


   





[jira] [Commented] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355745#comment-17355745
 ] 

Uwe Schindler commented on LUCENE-9987:
---

Again: JDK 11 is unstable and was used to drill down the failures. JDK 16 is 
quite stable, but I would still not run in production with ZGC or Shenandoah.




[jira] [Commented] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355737#comment-17355737
 ] 

Uwe Schindler commented on LUCENE-9987:
---

ZGC in Java 11 is a very early release. So all we know from this is: unusable 
with stable Java. I have this variant running so that Oracle can analyse it.

It got better with later versions, so the checks helped.




[jira] [Resolved] (LUCENE-9962) DrillSideways users should be able to opt-out of "drill down" facet collecting

2021-06-02 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-9962.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> DrillSideways users should be able to opt-out of "drill down" facet collecting
> --
>
> Key: LUCENE-9962
> URL: https://issues.apache.org/jira/browse/LUCENE-9962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The {{DrillSideways}} search methods will _always_ populate a 
> {{FacetsCollector}} for the "drill down" dimensions in addition to the "drill 
> sideways" dimensions. For most cases, this makes sense, but it would be nice 
> if users had a way to opt-out of this collection. It's possible a user may 
> not care to do any faceting on "drill down" dims, or may have custom needs 
> for facet collecting on the "drill downs." For the latter case, the user 
> might want to provide a {{Collector}}/{{CollectorManager}} that does facet 
> collecting with some custom logic (e.g., behind a 
> {{MultiCollector}}/{{MultiCollectorManager}}), in which case the population 
> of an additional {{FacetsCollector}} in {{DrillSideways}} is wasteful.
> The {{DrillSidewaysScorer}} already supports a {{null}} 
> {{drillDownCollector}} gracefully, so this change should mostly just involve 
> creating a {{protected}} method in {{DrillSideways}} for the purpose of 
> creating a "drill down" {{FacetsCollector}} that users can override by 
> providing {{null}}.
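
The opt-out mechanism described above can be sketched like this. All types 
here (HitRecorder, DrillSidewaysSketch, NoDrillDownFacets) are hypothetical 
stand-ins and not Lucene's classes; the sketch only assumes what the issue 
states, namely that the scoring side already tolerates a null drill-down 
collector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a facets collector; not Lucene's class. */
final class HitRecorder {
  final List<Integer> docs = new ArrayList<>();

  void collect(int doc) {
    docs.add(doc);
  }
}

/** Hypothetical stand-in for the searching class with a protected factory method. */
class DrillSidewaysSketch {
  /** Subclasses may override this and return null to skip drill-down facet collection. */
  protected HitRecorder createDrillDownCollector() {
    return new HitRecorder();
  }

  /** Runs a fake search over the given hits; returns the collector, or null if opted out. */
  final HitRecorder search(int[] matchingDocs) {
    HitRecorder drillDown = createDrillDownCollector();
    for (int doc : matchingDocs) {
      if (drillDown != null) { // a null collector is handled gracefully
        drillDown.collect(doc);
      }
    }
    return drillDown;
  }
}

/** Opts out of drill-down collection by returning null from the factory method. */
final class NoDrillDownFacets extends DrillSidewaysSketch {
  @Override
  protected HitRecorder createDrillDownCollector() {
    return null;
  }
}
```

The default behaviour is unchanged for existing users; only a subclass that 
overrides the factory method skips the extra collection work.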






[jira] [Updated] (LUCENE-9979) Implement negation of facet path in DrillDownQuery

2021-06-02 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-9979:

Component/s: (was: core/search)
 modules/facet

> Implement negation of facet path in DrillDownQuery
> --
>
> Key: LUCENE-9979
> URL: https://issues.apache.org/jira/browse/LUCENE-9979
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Nicola Buso
>Priority: Major
>  Labels: faceted-search
> Attachments: 0001-Implement-negate-facet-path-in-DrillDownQuery.patch
>
>
> Suppose the following facet values tree:
> Facet
>  - V1
>  -- V1.1
>  -- V1.2
>  -- V1.3
>  -- V1.4
>  -- (not topK values)
>  - V2
>  -- V2.1
>  -- V2.2
>  -- V2.3
>  -- V2.4
>  -- (not topK values)
> Use case:
> 1 - select V1 => all V1.x are selected
> 2 - de-select V1.1
> The implementation of the negation of value V1.1 is missing in 
> DrillDownQuery; it would be nice to implement it.
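
To illustrate the use case above, here is a small sketch of the intended 
semantics only: select everything under V1, then exclude V1.1. It works on 
plain string paths and does not touch DrillDownQuery; the class 
FacetNegationSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of "select a facet path, then negate one child path". */
final class FacetNegationSketch {
  /** True if path equals prefix or lies beneath it in the facet tree. */
  static boolean isUnder(String[] path, String[] prefix) {
    if (path.length < prefix.length) {
      return false;
    }
    return Arrays.equals(Arrays.copyOf(path, prefix.length), prefix);
  }

  /** Keeps the paths under include that are not under exclude. */
  static List<String[]> select(List<String[]> paths, String[] include, String[] exclude) {
    List<String[]> out = new ArrayList<>();
    for (String[] path : paths) {
      if (isUnder(path, include) && !isUnder(path, exclude)) {
        out.add(path);
      }
    }
    return out;
  }
}
```

For example, with include = {"Facet", "V1"} and exclude = {"Facet", "V1", 
"V1.1"}, a document tagged Facet/V1/V1.2 is kept while one tagged 
Facet/V1/V1.1 is dropped.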






[jira] [Resolved] (LUCENE-9944) Implement alternative drill sideways faceting with provided CollectorManager

2021-06-02 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-9944.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> Implement alternative drill sideways faceting with provided CollectorManager
> 
>
> Key: LUCENE-9944
> URL: https://issues.apache.org/jira/browse/LUCENE-9944
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Today, if a user of {{DrillSideways}} wants to provide their own 
> {{CollectorManager}} when invoking {{search}}, they get this alternate, 
> "concurrent" implementation that creates N copies of the provided 
> {{DrillDownQuery}} (where N is the number of drill-down dimensions) and runs 
> them all concurrently. This is a very different implementation than the one a 
> user would get if providing a {{Collector}} instead. Additionally, an 
> {{ExecutorService}} must be provided when constructing a {{DrillSideways}} 
> instance if the user wants to bring their own {{CollectorManager}} 
> (otherwise, they'll get an unfriendly NPE when calling {{search}}).
> I propose adding an implementation to {{DrillSideways}} that will run the 
> "non-concurrent" algorithm in the case that a user wants to provide their own 
> {{CollectorManager}} but doesn't want to provide an {{ExecutorService}} (and 
> doesn't want the concurrent algorithm).
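
A rough sketch of the proposed "non-concurrent" path, using hypothetical 
stand-in types (ManagerSketch, CountingCollector, SequentialSearchSketch) 
rather than Lucene's CollectorManager: the manager creates a single 
collector, the search runs on the calling thread, and reduce is applied to a 
one-element list, so no executor is needed.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a collector manager: creates collectors and merges their results. */
interface ManagerSketch<C, T> {
  C newCollector();

  T reduce(List<C> collectors);
}

/** Trivial collector that only counts hits. */
final class CountingCollector {
  int hits;

  void collect(int doc) {
    hits++;
  }
}

/** Runs the search on the calling thread with exactly one collector. */
final class SequentialSearchSketch {
  static <T> T search(int[] matchingDocs, ManagerSketch<CountingCollector, T> manager) {
    CountingCollector collector = manager.newCollector();
    for (int doc : matchingDocs) {
      collector.collect(doc);
    }
    // No executor involved: reduce over a single-element list.
    return manager.reduce(Collections.singletonList(collector));
  }
}
```

The caller keeps the manager-style API (and its typed result) without 
having to supply an executor or accept the concurrent algorithm.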






[jira] [Commented] (LUCENE-9944) Implement alternative drill sideways faceting with provided CollectorManager

2021-06-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355718#comment-17355718
 ] 

ASF subversion and git services commented on LUCENE-9944:
-

Commit 8b60641bcac14663a75f8efe5667c506347acda5 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8b60641 ]

LUCENE-9944: Allow DrillSideways users to pass a CollectorManager without 
requiring an ExecutorService (and concurrent DrillSideways approach). (#142)






[GitHub] [lucene] gsmiller merged pull request #142: LUCENE-9944: Allow DrillSideways users to pass a CollectorManager without requiring an ExecutorService (and concurrent DrillSideways approach).

2021-06-02 Thread GitBox


gsmiller merged pull request #142:
URL: https://github.com/apache/lucene/pull/142


   





[GitHub] [lucene] dweiss commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-02 Thread GitBox


dweiss commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r643959663



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;
+  private boolean changed;
+  private int[] arrayCache = new int[0];
+
+  StateSet(int capacity) {
+    inner = new IntIntHashMap(capacity);
+  }
+
+  // Adds this state to the set
+  void incr(int num) {
+    if (inner.addTo(num, 1) == 1) {
+      changed = true;
+    }
+  }
+
+  // Removes this state from the set, if count decrs to 0
+  void decr(int num) {
+    assert inner.containsKey(num);
+    int keyIndex = inner.indexOf(num);
+    int count = inner.indexGet(keyIndex) - 1;
+    if (count == 0) {
+      inner.remove(num);
+      changed = true;
+    } else {
+      inner.indexReplace(keyIndex, count);
+    }
+  }
+
+  void computeHash() {
+    if (changed == false) {
+      return;
+    }
+    hashCode = inner.size();
+    for (IntCursor cursor : inner.keys()) {
+      hashCode += BitMixer.mix(cursor.value);
+    }
+  }
+
+  /**
+   * Create a snapshot of this int set associated with a given state. The snapshot will not retain
+   * any frequency information about the elements of this set, only existence.
+   *
+   * It is the caller's responsibility to ensure that the hashCode and data are up to date via
+   * the {@link #computeHash()} method before calling this method.
+   *
+   * @param state the state to associate with the frozen set.
+   * @return A new FrozenIntSet with the same values as this set.
+   */
+  FrozenIntSet freeze(int state) {
+    if (changed == false) {
+      assert arrayCache != null;

Review comment:
   bq. why hasn't this been "published in a public revision" yet :)
   
   Don't know. Life gets in the way.
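
For readers following the diff above, here is a minimal sketch of the same idea: a reference-counted int set whose hash depends only on membership, not on insertion order or counts. It is an illustration under stated assumptions, not the code from the PR: `CountingIntSet` and `mix` are hypothetical names, `java.util.HashMap` stands in for HPPC's `IntIntHashMap`, and a MurmurHash3-style finalizer stands in for `BitMixer.mix`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the StateSet in the diff, built on java.util only. */
final class CountingIntSet {
  private final Map<Integer, Integer> counts = new HashMap<>();
  private boolean changed = true; // set whenever membership (not just a count) changes
  private int hash;

  /** Adds one reference to num; membership changes only on the 0 -> 1 transition. */
  void incr(int num) {
    if (counts.merge(num, 1, Integer::sum) == 1) {
      changed = true;
    }
  }

  /** Drops one reference to num; the key is removed once its count reaches 0. */
  void decr(int num) {
    Integer count = counts.get(num);
    if (count == null) {
      throw new IllegalStateException("not in set: " + num);
    }
    if (count == 1) {
      counts.remove(num);
      changed = true;
    } else {
      counts.put(num, count - 1);
    }
  }

  int size() {
    return counts.size();
  }

  /** Order-independent hash: the set size plus the sum of the mixed keys. */
  int hash() {
    if (changed) {
      int h = counts.size();
      for (int key : counts.keySet()) {
        h += mix(key);
      }
      hash = h;
      changed = false;
    }
    return hash;
  }

  /** Assumed mixer (MurmurHash3 32-bit finalizer); the diff uses HPPC's BitMixer.mix. */
  static int mix(int k) {
    k ^= k >>> 16;
    k *= 0x85ebca6b;
    k ^= k >>> 13;
    k *= 0xc2b2ae35;
    k ^= k >>> 16;
    return k;
  }
}
```

Because the hash is a sum over keys, two sets with the same members hash identically without sorting, which appears to be the point of the issue title.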




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a change in pull request #142: LUCENE-9944: Allow DrillSideways users to pass a CollectorManager without requiring an ExecutorService (and concurrent DrillSidew

2021-06-02 Thread GitBox


gsmiller commented on a change in pull request #142:
URL: https://github.com/apache/lucene/pull/142#discussion_r643959302



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java
##
@@ -131,8 +185,24 @@ public boolean isCacheable(LeafReaderContext ctx) {
   public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
 Scorer baseScorer = baseWeight.scorer(context);
 
+int drillDownCount = drillDowns.length;
+
+// TODO: If the caller provided a FacetsCollectorManager instead of directly providing
+// FacetsCollectors, we assume this will be invoked during a concurrent search. Ideally
+// we'd only create new FacetsCollectors for each "leaf slice" that will be concurrently
+// searched, as opposed to each actual leaf, but we don't have that information at this
+// level so we always provide a new FacetsCollector. There might be a better way to
+// refactor this logic.

Review comment:
   Thanks for this suggestion! I realized I never responded. I suspect it 
wouldn't make all that much of a practical difference if `DrillSidewaysQuery` 
creates new `FacetsCollector`s for each `BulkScorer` it instantiates (as this 
implementation currently does), but it feels just a bit "cleaner" if it only 
created new FCs for each `LeafSlice` since that's the granularity of 
concurrency within `IndexSearcher`. It would take a bit of refactoring to get 
this working since the `FacetsCollector`s are a bit of a (package-private) 
implementation detail of `DrillSidewaysQuery` right now, but I'm going to try 
to experiment with this a bit more to see what it looks like. Thanks!
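
To make the per-leaf versus per-slice trade-off concrete, here is a small model in plain Java. It does not use Lucene: `CountingCollector` is a hypothetical stand-in for `FacetsCollector`, and each `int[]` stands for one slice whose entries are the hit counts of its leaves. Both strategies reduce to the same total; they differ only in how many collector objects are created.

```java
import java.util.ArrayList;
import java.util.List;

final class SliceCollectorsSketch {
  /** Hypothetical stand-in for a FacetsCollector: it only counts hits. */
  static final class CountingCollector {
    int hits;
  }

  /** One collector per leaf, which is what the TODO says the code currently does. */
  static List<CountingCollector> perLeaf(List<int[]> slices) {
    List<CountingCollector> collectors = new ArrayList<>();
    for (int[] leafHitCounts : slices) {
      for (int leafHits : leafHitCounts) {
        CountingCollector collector = new CountingCollector();
        collector.hits += leafHits;
        collectors.add(collector);
      }
    }
    return collectors;
  }

  /** One collector per slice, shared by all leaves searched within that slice. */
  static List<CountingCollector> perSlice(List<int[]> slices) {
    List<CountingCollector> collectors = new ArrayList<>();
    for (int[] leafHitCounts : slices) {
      CountingCollector collector = new CountingCollector();
      for (int leafHits : leafHitCounts) {
        collector.hits += leafHits;
      }
      collectors.add(collector);
    }
    return collectors;
  }

  /** Merges the partial results, in the spirit of CollectorManager.reduce. */
  static int reduce(List<CountingCollector> collectors) {
    int total = 0;
    for (CountingCollector collector : collectors) {
      total += collector.hits;
    }
    return total;
  }
}
```

Sharing one collector within a slice is safe in this model only because a slice is processed sequentially; that matches the comment's remark that the slice is the unit of concurrency.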
   







[jira] [Commented] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355714#comment-17355714
 ] 

Dawid Weiss commented on LUCENE-9987:
-

See build failure history here:
https://jenkins.thetaphi.de/job/Lucene-main-Linux/

> JVM 11.0.6 crash while trying to read term vectors in CheckIndex?
> -
>
> Key: LUCENE-9987
> URL: https://issues.apache.org/jira/browse/LUCENE-9987
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: hs_err_pid529873.log, hs_err_pid536810.log
>
>
> [This build|https://jenkins.thetaphi.de/job/Lucene-main-Linux/30482/] failed 
> with JVM crash:
> {noformat}
> Current thread (0x7f68780d24e0):  JavaThread 
> "TEST-TestExpressionSorts.testQueries-seed#[25E6600265A5C2D4]" 
> [_thread_in_Java, id=530195, stack(0x7f68ae1f2000,0x7f68ae2f3000)]
> Stack: [0x7f68ae1f2000,0x7f68ae2f3000],  sp=0x7f68ae2ef800,  free 
> space=1014k
> Native frames: (J=compiled Java code, A=aot compiled Java code, 
> j=interpreted, Vv=VM code, C=native code)
> J 4987% c2 
> org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingTermVectorsReader.positionIndex(IILorg/apache/lucene/util/LongValues;[I)[[I
>  (136 bytes) @ 0x7f69347af02e [0x7f69347ae480+0x0bae]
> J 4952 c1 
> org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingTermVectorsReader.get(I)Lorg/apache/lucene/index/Fields;
>  (2695 bytes) @ 0x7f692db07754 [0x7f692db025c0+0x5194]
> J 4895 c1 
> org.apache.lucene.codecs.asserting.AssertingTermVectorsFormat$AssertingTermVectorsReader.get(I)Lorg/apache/lucene/index/Fields;
>  (26 bytes) @ 0x7f692daace2c [0x7f692daacd20+0x010c]
> j  
> org.apache.lucene.index.CheckIndex.testTermVectors(Lorg/apache/lucene/index/CodecReader;Ljava/io/PrintStream;ZZZ)Lorg/apache/lucene/index/CheckIndex$Status$TermVectorStatus;+96
> j  
> org.apache.lucene.index.CheckIndex.checkIndex(Ljava/util/List;)Lorg/apache/lucene/index/CheckIndex$Status;+1718
> j  
> org.apache.lucene.util.TestUtil.checkIndex(Lorg/apache/lucene/store/Directory;ZZLjava/io/ByteArrayOutputStream;)Lorg/apache/lucene/index/CheckIndex$Status;+67
> j  org.apache.lucene.store.MockDirectoryWrapper.close()V+276
> j  org.apache.lucene.expressions.TestExpressionSorts.tearDown()V+11
> v  ~StubRoutines::call_stub
> V  [libjvm.so+0x887569]  JavaCalls::call_helper(JavaValue*, methodHandle 
> const&, JavaCallArguments*, Thread*)+0x3b9
> V  [libjvm.so+0xcb1a2d]  invoke(InstanceKlass*, methodHandle const&, Handle, 
> bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) [clone 
> .constprop.80]+0x43d
> V  [libjvm.so+0xcb2a62]  Reflection::invoke_method(oopDesc*, Handle, 
> objArrayHandle, Thread*)+0x102
> V  [libjvm.so+0x93b0cc]  JVM_InvokeMethod+0xfc
> j  
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+0
>  java.base@11.0.6
> j  
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+100
>  java.base@11.0.6
> j  
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+6
>  java.base@11.0.6
> j  
> java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+59
>  java.base@11.0.6
> j  
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)V+69
> j  com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate()V+69
> j  org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate()V+20 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355713#comment-17355713
 ] 

Dawid Weiss commented on LUCENE-9987:
-

This is happening all the time on jenkins. Always with ZGC.




[GitHub] [lucene] gsmiller commented on a change in pull request #142: LUCENE-9944: Allow DrillSideways users to pass a CollectorManager without requiring an ExecutorService (and concurrent DrillSidew

2021-06-02 Thread GitBox


gsmiller commented on a change in pull request #142:
URL: https://github.com/apache/lucene/pull/142#discussion_r643949418



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -233,11 +251,32 @@ public ScoreMode scoreMode() {
 }
 searcher.search(dsq, hitCollector);
 
+FacetsCollector drillDownCollector;
+if (drillDownCollectorManager != null) {
+  drillDownCollector = drillDownCollectorManager.reduce(dsq.managedDrillDownCollectors);
+} else {
+  drillDownCollector = null;
+}
+
+FacetsCollector[] drillSidewaysCollectors = new FacetsCollector[numDims];
+int numSlices = dsq.managedDrillSidewaysCollectors.size();
+
+for (int dim = 0; dim < numDims; dim++) {
+  List<FacetsCollector> facetsCollectorsForDim = new ArrayList<>(numSlices);
+
+  for (int slice = 0; slice < numSlices; slice++) {
+facetsCollectorsForDim.add(dsq.managedDrillSidewaysCollectors.get(slice)[dim]);
+  }
+
+  drillSidewaysCollectors[dim] =
+  drillSidewaysFacetsCollectorManagers[dim].reduce(facetsCollectorsForDim);
+}
+}
+
 return new DrillSidewaysResult(
 buildFacetsResult(
 drillDownCollector,
 drillSidewaysCollectors,
-drillDownDims.keySet().toArray(new String[drillDownDims.size()])),
+drillDownDims.keySet().toArray(new String[0])),

Review comment:
   I agree; it's very strange! 
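
The nested loop in the hunk above regroups collectors from "one array per slice, indexed by dimension" into "one list per dimension, indexed by slice" before each dimension's manager reduces its list. A generic sketch of that regrouping, with hypothetical names (`TransposeSketch`, `byDimension`) and no Lucene types:

```java
import java.util.ArrayList;
import java.util.List;

final class TransposeSketch {
  /**
   * Regroups per-slice arrays into per-dimension lists:
   * perSlice.get(slice)[dim] becomes result.get(dim).get(slice).
   */
  static <T> List<List<T>> byDimension(List<T[]> perSlice, int numDims) {
    int numSlices = perSlice.size();
    List<List<T>> perDim = new ArrayList<>(numDims);
    for (int dim = 0; dim < numDims; dim++) {
      List<T> forDim = new ArrayList<>(numSlices);
      for (int slice = 0; slice < numSlices; slice++) {
        forDim.add(perSlice.get(slice)[dim]);
      }
      perDim.add(forDim);
    }
    return perDim;
  }
}
```

In the diff, each per-dimension list is then handed to that dimension's collector manager; the sketch stops at the regrouping step.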







[jira] [Commented] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355694#comment-17355694
 ] 

Robert Muir commented on LUCENE-9987:
-

ZAC though. Workaround: don't use ZGC
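
To confirm whether a given test JVM is actually running under ZGC before applying the workaround, the standard management API can list the active collectors. This is a sketch only: `GcCheck` is a hypothetical helper, and the assumption that ZGC's collector beans have names starting with "ZGC" should be verified on the JDK in question.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcCheck {
  /** Names of the garbage collector beans registered in the running JVM. */
  static List<String> activeCollectorNames() {
    List<String> names = new ArrayList<>();
    for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
      names.add(bean.getName());
    }
    return names;
  }

  /** Assumption: ZGC's collector beans have names that start with "ZGC". */
  static boolean looksLikeZgc(List<String> collectorNames) {
    for (String name : collectorNames) {
      if (name.startsWith("ZGC")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> names = activeCollectorNames();
    System.out.println("collectors: " + names + ", ZGC suspected: " + looksLikeZgc(names));
  }
}
```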




[GitHub] [lucene] msokolov commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-06-02 Thread GitBox


msokolov commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r643925065



##
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -273,6 +260,32 @@ private boolean releaseBufferedToken() {
 return false;
   }
 
+  /**
+   * Free output nodes before the given outputs. Free inputs nodes before the minimum input node for
+   * this output.
+   *
+   * @param output target output node
+   */
+  private void freeBefore(OutputNode output) {
+// We've released all of the tokens that end at the current output,
+// so free all output nodes before this. Input nodes are more complex.
+// The second shingled tokens with alternate paths can appear later in the output graph than

Review comment:
   "than than"

##
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -273,6 +260,32 @@ private boolean releaseBufferedToken() {
 return false;
   }
 
+  /**
+   * Free output nodes before the given outputs. Free inputs nodes before the minimum input node for

Review comment:
   "Free input nodes" I think

##
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -310,8 +323,11 @@ public boolean incrementToken() throws IOException {
   int outMax = outputNodes.getMaxPos();
   // If positionIncrement > 1 this node should be at the end of the flattened graph
   if (positionIncrement > 1 && src.outputNode < outMax) {
-// We crossed a gap that we need to account for. This node exists from a length >1 path
-// jumping to get here.
+// If there was a hole at the end of an alternate path then the input and output nodes

Review comment:
   minor, but: if you use block comments, then our autoformatter won't 
apply its annoying line-break rules :)

##
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -273,6 +260,32 @@ private boolean releaseBufferedToken() {
 return false;
   }
 
+  /**
+   * Free output nodes before the given outputs. Free inputs nodes before the minimum input node for
+   * this output.
+   *
+   * @param output target output node
+   */
+  private void freeBefore(OutputNode output) {
+// We've released all of the tokens that end at the current output,
+// so free all output nodes before this. Input nodes are more complex.
+// The second shingled tokens with alternate paths can appear later in the output graph than
+// than some of their alternate path tokens.
+// Because of this case we can only free from the minimum because the minimum node will have
+// come from before the second shingled token.
+// This means we have to hold onto input nodes who's tokens get stacked on previous nodes until

Review comment:
   "whose"

##
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -392,12 +408,20 @@ private OutputNode recoverFromHole(InputNode src, int startOffset) {
 int maxOutIndex = outputNodes.getMaxPos();
 OutputNode outSrc = outputNodes.get(maxOutIndex);
 // There are two types of holes, neighbor holes and consumed holes. A neighbor hole is between
-// two tokens. A consumed hole is
-// between the start a long token and the next token that is "under" the path of the long token.
-// A consumed hole should have the outputsrc node of the short token be the out dest
+// two tokens, it looks like a->*hole*->b.
+// A consumed hole is between the start a long token and the next token that is "under" the path
+// of the long token.
+// It looks like :___abc__
+//   ||

Review comment:
   Have you run the formatter? I think it might mess these pictures up 
unless you use block comments







[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-02 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355687#comment-17355687
 ] 

Michael McCandless commented on LUCENE-9976:


Spooky that the seed only sometimes reproduces!  Are there threads involved in 
these tests?  If not, maybe there is some sneaky "randomness bug" in the test, 
though I thought we buttoned down the forbidden APIs here.
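
As background for why a seed is expected to reproduce: a single-threaded consumer of a seeded generator sees the same sequence on every run, so any divergence has to come from elsewhere (thread scheduling, or an unseeded source of randomness). A minimal sketch with plain `java.util.Random`; `SeedSketch` is a hypothetical name, and this is not how the randomizedtesting framework derives its per-test randomness.

```java
import java.util.Random;

final class SeedSketch {
  /** Draws n values from a generator seeded with the given seed. */
  static long[] draw(long seed, int n) {
    Random random = new Random(seed);
    long[] values = new long[n];
    for (int i = 0; i < n; i++) {
      values[i] = random.nextLong();
    }
    return values;
  }
}
```

Two calls with the same seed return equal arrays. If several threads shared one such generator, the values each thread saw would depend on scheduling, which is one way a fixed seed can still fail to reproduce.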

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Commented] (LUCENE-9986) Create a simple "real world" regexp benchmark

2021-06-02 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355684#comment-17355684
 ] 

Michael Sokolov commented on LUCENE-9986:
-

[This SO 
post|https://stackoverflow.com/questions/15819919/where-can-i-find-unit-tests-for-regular-expressions-in-multiple-languages]
 links to many test suites in various open source projects. Not sure which 
would be best/best licensed for copying here?

> Create a simple "real world" regexp benchmark
> -
>
> Key: LUCENE-9986
> URL: https://issues.apache.org/jira/browse/LUCENE-9986
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> For issues like LUCENE-9983, where we are struggling to decide which 
> low-level optimizations to make for our (complicated!) {{determinize}} 
> method, it would really help to have a large, real-world corpus of regexps to 
> evaluate performance metrics of our automata operations, like CPU and HEAP 
> required to parse the regexp and determinize.
> Does anyone know of such an existing, hopefully compatibly licensed, corpus?
> Probably we would add these benchmarks to {{luceneutil}}.
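
A rough idea of the kind of harness the issue describes, sketched with the JDK's own `java.util.regex.Pattern` as a stand-in. This is an assumption-heavy illustration: it does not call Lucene's regexp parser or `determinize`, `RegexCompileTimer` is a hypothetical name, the two patterns in `main` are made up rather than drawn from any real corpus, and a serious benchmark would need warm-up and repeated runs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RegexCompileTimer {
  /** Returns, for each pattern in the corpus, the wall-clock nanoseconds spent compiling it. */
  static long[] compileNanos(List<String> corpus) {
    long[] nanos = new long[corpus.size()];
    for (int i = 0; i < corpus.size(); i++) {
      long start = System.nanoTime();
      Pattern.compile(corpus.get(i));
      nanos[i] = System.nanoTime() - start;
    }
    return nanos;
  }

  public static void main(String[] args) {
    // Made-up sample patterns; a real run would load a large corpus from a file.
    List<String> corpus = Arrays.asList("[a-z]+@[a-z]+\\.com", "(foo|bar)*baz");
    System.out.println(Arrays.toString(compileNanos(corpus)));
  }
}
```

Heap usage, the other metric the issue mentions, is not measured here.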






[jira] [Updated] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-9987:
---
Attachment: hs_err_pid536810.log




[jira] [Updated] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-9987:
---
Attachment: hs_err_pid529873.log




[jira] [Created] (LUCENE-9987) JVM 11.0.6 crash while trying to read term vectors in CheckIndex?

2021-06-02 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-9987:
--

 Summary: JVM 11.0.6 crash while trying to read term vectors in 
CheckIndex?
 Key: LUCENE-9987
 URL: https://issues.apache.org/jira/browse/LUCENE-9987
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


[This build|https://jenkins.thetaphi.de/job/Lucene-main-Linux/30482/] failed 
with JVM crash:
{noformat}
Current thread (0x7f68780d24e0):  JavaThread 
"TEST-TestExpressionSorts.testQueries-seed#[25E6600265A5C2D4]" 
[_thread_in_Java, id=530195, stack(0x7f68ae1f2000,0x7f68ae2f3000)]

Stack: [0x7f68ae1f2000,0x7f68ae2f3000],  sp=0x7f68ae2ef800,  free 
space=1014k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, 
Vv=VM code, C=native code)
J 4987% c2 
org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingTermVectorsReader.positionIndex(IILorg/apache/lucene/util/LongValues;[I)[[I
 (136 bytes) @ 0x7f69347af02e [0x7f69347ae480+0x0bae]
J 4952 c1 
org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingTermVectorsReader.get(I)Lorg/apache/lucene/index/Fields;
 (2695 bytes) @ 0x7f692db07754 [0x7f692db025c0+0x5194]
J 4895 c1 
org.apache.lucene.codecs.asserting.AssertingTermVectorsFormat$AssertingTermVectorsReader.get(I)Lorg/apache/lucene/index/Fields;
 (26 bytes) @ 0x7f692daace2c [0x7f692daacd20+0x010c]
j  
org.apache.lucene.index.CheckIndex.testTermVectors(Lorg/apache/lucene/index/CodecReader;Ljava/io/PrintStream;ZZZ)Lorg/apache/lucene/index/CheckIndex$Status$TermVectorStatus;+96
j  
org.apache.lucene.index.CheckIndex.checkIndex(Ljava/util/List;)Lorg/apache/lucene/index/CheckIndex$Status;+1718
j  
org.apache.lucene.util.TestUtil.checkIndex(Lorg/apache/lucene/store/Directory;ZZLjava/io/ByteArrayOutputStream;)Lorg/apache/lucene/index/CheckIndex$Status;+67
j  org.apache.lucene.store.MockDirectoryWrapper.close()V+276
j  org.apache.lucene.expressions.TestExpressionSorts.tearDown()V+11
v  ~StubRoutines::call_stub
V  [libjvm.so+0x887569]  JavaCalls::call_helper(JavaValue*, methodHandle 
const&, JavaCallArguments*, Thread*)+0x3b9
V  [libjvm.so+0xcb1a2d]  invoke(InstanceKlass*, methodHandle const&, Handle, 
bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) [clone 
.constprop.80]+0x43d
V  [libjvm.so+0xcb2a62]  Reflection::invoke_method(oopDesc*, Handle, 
objArrayHandle, Thread*)+0x102
V  [libjvm.so+0x93b0cc]  JVM_InvokeMethod+0xfc
j  
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+0
 java.base@11.0.6
j  
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+100
 java.base@11.0.6
j  
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+6
 java.base@11.0.6
j  
java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+59
 java.base@11.0.6
j  
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)V+69
j  com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate()V+69
j  org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate()V+20 
{noformat}






[GitHub] [lucene] msokolov commented on a change in pull request #164: LUCENE-9905: Allow Lucene90Codec to be configured with a per-field vector format

2021-06-02 Thread GitBox


msokolov commented on a change in pull request #164:
URL: https://github.com/apache/lucene/pull/164#discussion_r643912033



##
File path: 
lucene/core/src/test/org/apache/lucene/codecs/perfield/TestPerFieldVectorFormat.java
##
@@ -52,53 +52,54 @@ protected Codec getCodec() {
 return codec;
   }
 
-  // just a simple trivial test
   public void testTwoFieldsTwoFormats() throws IOException {
 Analyzer analyzer = new MockAnalyzer(random());
 
 try (Directory directory = newDirectory()) {
   // we don't use RandomIndexWriter because it might add more values than 
we expect 1
   IndexWriterConfig iwc = newIndexWriterConfig(analyzer);
-  final VectorFormat fast = TestUtil.getDefaultVectorFormat();
-  final VectorFormat slow = VectorFormat.forName("Asserting");
+  VectorFormat defaultFormat = TestUtil.getDefaultVectorFormat();
+  VectorFormat emptyFormat = VectorFormat.EMPTY;
   iwc.setCodec(
   new AssertingCodec() {
 @Override
 public VectorFormat getVectorFormatForField(String field) {
-  if ("v1".equals(field)) {
-return fast;
+  if ("empty".equals(field)) {
+return emptyFormat;
   } else {
-return slow;
+return defaultFormat;
   }
 }
   });
+
   try (IndexWriter iwriter = new IndexWriter(directory, iwc)) {
 Document doc = new Document();
 doc.add(newTextField("id", "1", Field.Store.YES));
-doc.add(new VectorField("v1", new float[] {1, 2, 3}));
+doc.add(new VectorField("field", new float[] {1, 2, 3}));
 iwriter.addDocument(doc);
-doc = new Document();
+iwriter.commit();
+
+// Check that we use the empty vector format, which doesn't support 
writes
+doc.clear();
 doc.add(newTextField("id", "2", Field.Store.YES));
-doc.add(new VectorField("v2", new float[] {4, 5, 6}));
-iwriter.addDocument(doc);
+doc.add(new VectorField("empty", new float[] {4, 5, 6}));
+expectThrows(
+RuntimeException.class,
+() -> {
+  iwriter.addDocument(doc);
+  iwriter.commit();
+});
   }
 
-  // Now search the index:
+  // Now search for the field that was successfully indexed
   try (IndexReader ireader = DirectoryReader.open(directory)) {
 TopDocs hits1 =
 ireader
 .leaves()
 .get(0)
 .reader()
-.searchNearestVectors("v1", new float[] {1, 2, 3}, 10, 1);
+.searchNearestVectors("field", new float[] {1, 2, 3}, 10, 1);
 assertEquals(1, hits1.scoreDocs.length);
-TopDocs hits2 =

Review comment:
   weird, what was this doing here :)







[GitHub] [lucene] msokolov commented on a change in pull request #142: LUCENE-9944: Allow DrillSideways users to pass a CollectorManager without requiring an ExecutorService (and concurrent DrillSidew

2021-06-02 Thread GitBox


msokolov commented on a change in pull request #142:
URL: https://github.com/apache/lucene/pull/142#discussion_r643904768



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -233,11 +251,32 @@ public ScoreMode scoreMode() {
 }
 searcher.search(dsq, hitCollector);
 
+FacetsCollector drillDownCollector;
+if (drillDownCollectorManager != null) {
+  drillDownCollector = 
drillDownCollectorManager.reduce(dsq.managedDrillDownCollectors);
+} else {
+  drillDownCollector = null;
+}
+
+FacetsCollector[] drillSidewaysCollectors = new FacetsCollector[numDims];
+int numSlices = dsq.managedDrillSidewaysCollectors.size();
+
+for (int dim = 0; dim < numDims; dim++) {
+  List<FacetsCollector> facetsCollectorsForDim = new 
ArrayList<>(numSlices);
+
+  for (int slice = 0; slice < numSlices; slice++) {
+
facetsCollectorsForDim.add(dsq.managedDrillSidewaysCollectors.get(slice)[dim]);
+  }
+
+  drillSidewaysCollectors[dim] =
+  
drillSidewaysFacetsCollectorManagers[dim].reduce(facetsCollectorsForDim);
+}
+
 return new DrillSidewaysResult(
 buildFacetsResult(
 drillDownCollector,
 drillSidewaysCollectors,
-drillDownDims.keySet().toArray(new String[drillDownDims.size()])),
+drillDownDims.keySet().toArray(new String[0])),

Review comment:
   This is the weirdest Java idiom. I've never understood why creating a 
useless zero-length array is the accepted way to handle type-safety?? Yet 
apparently it is.
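
For readers unfamiliar with the idiom in question, here is a minimal sketch of the two spellings. The class, method, and map names are hypothetical and are not taken from DrillSideways. `Collection.toArray(T[])` only needs the argument to learn the runtime element type: when the array passed in is too small, it allocates a new array of that type and the correct length, so an empty array is enough.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of the two toArray spellings; not Lucene code. */
public class ToArrayIdiom {

  /** Passes an empty array; toArray allocates a String[] of the right length itself. */
  public static String[] withEmptyArray(Map<String, Integer> dims) {
    return dims.keySet().toArray(new String[0]);
  }

  /** Passes a pre-sized array; toArray fills and returns that same array. */
  public static String[] withSizedArray(Map<String, Integer> dims) {
    return dims.keySet().toArray(new String[dims.size()]);
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so both calls see the keys in the same order.
    Map<String, Integer> dims = new LinkedHashMap<>();
    dims.put("author", 0);
    dims.put("year", 1);
    System.out.println(Arrays.toString(withEmptyArray(dims)));
    System.out.println(Arrays.equals(withEmptyArray(dims), withSizedArray(dims)));
  }
}
```

Both spellings return the same contents; the zero-length form simply avoids stating the size twice and leaves the allocation to the collection.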







[GitHub] [lucene] mikemccand commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-02 Thread GitBox


mikemccand commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r643866949



##
File path: lucene/core/build.gradle
##
@@ -20,6 +20,8 @@ apply plugin: 'java-library'
 description = 'Lucene core library'
 
 dependencies {
+  implementation 'com.carrotsearch:hppc'

Review comment:
   > @bruno-roustant came up with some clever new hashing improvements 
recently - these are not published as a public revision but you can get them 
from the repository and compile it locally. See this for details:
   > 
   > https://issues.carrot2.org/browse/HPPC-176
   
   Whoa, this new "worm" hashing looks great!  That is frequently a great 
tradeoff (slower put, faster get)?  Hmm why hasn't this been "published in a 
public revision" yet :)

##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;
+  private boolean changed;
+  private int[] arrayCache = new int[0];
+
+  StateSet(int capacity) {
+inner = new IntIntHashMap(capacity);
+  }
+
+  // Adds this state to the set
+  void incr(int num) {
+if (inner.addTo(num, 1) == 1) {
+  changed = true;
+}
+  }
+
+  // Removes this state from the set, if count decrs to 0
+  void decr(int num) {
+assert inner.containsKey(num);
+int keyIndex = inner.indexOf(num);
+int count = inner.indexGet(keyIndex) - 1;
+if (count == 0) {
+  inner.remove(num);
+  changed = true;
+} else {
+  inner.indexReplace(keyIndex, count);
+}
+  }
+
+  void computeHash() {
+if (changed == false) {
+  return;
+}
+hashCode = inner.size();
+for (IntCursor cursor : inner.keys()) {
+  hashCode += BitMixer.mix(cursor.value);
+}
+  }
+
+  /**
+   * Create a snapshot of this int set associated with a given state. The 
snapshot will not retain
+   * any frequency information about the elements of this set, only existence.
+   *
+   * It is the caller's responsibility to ensure that the hashCode and data 
are up to date via
+   * the {@link #computeHash()} method before calling this method.
+   *
+   * @param state the state to associate with the frozen set.
+   * @return A new FrozenIntSet with the same values as this set.
+   */
+  FrozenIntSet freeze(int state) {
+if (changed == false) {
+  assert arrayCache != null;

Review comment:
   Hmm do we actually fall inside this `if`?  I would think we shouldn't 
ever hit this -- we shouldn't call `freeze` unless something had in fact 
changed?  Or, if we are, something else might be wrong?  Or maybe I am simply 
confused ;)

##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper 

[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-02 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355644#comment-17355644
 ] 

Michael McCandless commented on LUCENE-9983:


OK I opened LUCENE-9986.

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to look up, in the (growing) 
> determinized automaton, whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  
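
The requirement described in the issue (add/delete keys, with hashCode and equals computed over keys only and independent of insertion order) can be sketched as follows. This is only an illustration under stated assumptions: it uses java.util.HashMap rather than HPPC, the class name `KeyOnlyCountedSet` and the `mix` helper are hypothetical, and it is not the StateSet from the pull request.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a reference-counted int set compared by keys only. */
public class KeyOnlyCountedSet {
  // key = NFA state, value = how many times it has been added (the reference count)
  private final Map<Integer, Integer> counts = new HashMap<>();

  /** Adds one reference to the state. */
  public void incr(int state) {
    counts.merge(state, 1, Integer::sum);
  }

  /** Drops one reference; the state leaves the set when its count reaches zero. */
  public void decr(int state) {
    Integer count = counts.get(state);
    if (count == null) {
      throw new IllegalStateException("state not present: " + state);
    }
    if (count == 1) {
      counts.remove(state);
    } else {
      counts.put(state, count - 1);
    }
  }

  /** Sums a mixed hash per key, so the result does not depend on iteration order. */
  @Override
  public int hashCode() {
    int h = counts.size();
    for (int key : counts.keySet()) {
      h += mix(key);
    }
    return h;
  }

  /** Compares key sets only; reference counts are deliberately ignored. */
  @Override
  public boolean equals(Object other) {
    return other instanceof KeyOnlyCountedSet
        && counts.keySet().equals(((KeyOnlyCountedSet) other).counts.keySet());
  }

  /** Simple multiplicative bit mixer (an assumption, chosen only for illustration). */
  static int mix(int k) {
    k *= 0x9E3779B9;
    return k ^ (k >>> 16);
  }

  public static void main(String[] args) {
    KeyOnlyCountedSet a = new KeyOnlyCountedSet();
    a.incr(3);
    a.incr(7);
    a.incr(7);
    KeyOnlyCountedSet b = new KeyOnlyCountedSet();
    b.incr(7);
    b.incr(3);
    // Same keys, different order and counts: equal, with equal hash codes.
    System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
  }
}
```

Because addition is commutative, no sorting is needed to get a stable hash; the cost moves to the per-key mixing and to the key-set comparison in equals.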






[jira] [Created] (LUCENE-9986) Create a simple "real world" regexp benchmark

2021-06-02 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-9986:
--

 Summary: Create a simple "real world" regexp benchmark
 Key: LUCENE-9986
 URL: https://issues.apache.org/jira/browse/LUCENE-9986
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


For issues like LUCENE-9983, where we are struggling to decide which low-level 
optimizations to make for our (complicated!) {{determinize}} method, it would 
really help to have a large, real-world corpus of regexps to evaluate 
performance metrics of our automata operations, like CPU and HEAP required to 
parse the regexp and determinize.

Does anyone know of such an existing, hopefully compatibly licensed, corpus?

Probably we would add these benchmarks to {{luceneutil}}.






[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-02 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355642#comment-17355642
 ] 

Michael McCandless commented on LUCENE-9983:


In the sort of "opposite extreme" case, where someone calls det on an already 
"happens to be determinized" NFA (I think we already catch if someone tries to 
det an Automaton that we already previously det'd, and skip it?), I think we 
would see much more balanced {{incr}}/{{decr}} versus {{freeze}}?
{quote}The algorithmic complexity is one thing but if these sets are short (and 
they will be, right?) then it's a small constant. 
{quote}
Yeah, +1, they will "usually" be very short sets, I think, in the 
non-adversarial cases.

I think we are badly missing a "representative" set of "real-world" regexps to 
use as a benchmarking corpus, to make decisions about optimizations like this. 
I love that this adversarial regexp goes so much faster with [~zhai7631]'s PR, 
but I'm worried that it might then make the more normal, real-world, 
non-adversarial cases slower.

Does anyone know of an existing "corpus" of "real-world" regexps by any chance 
;)  I will open a dedicated issue for this.







[GitHub] [lucene] glawson0 commented on pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-06-02 Thread GitBox


glawson0 commented on pull request #157:
URL: https://github.com/apache/lucene/pull/157#issuecomment-852856814


   I've fleshed out the comments for the 4 change areas, explaining each area 
and which tests exercise them. Do they help? Are there areas you feel aren't 
fully explained or that I could be clearer on?





[GitHub] [lucene] glawson0 commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-06-02 Thread GitBox


glawson0 commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r643770929



##
File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -362,6 +378,40 @@ public boolean incrementToken() throws IOException {
 }
   }
 
+  private OutputNode recoverFromHole(InputNode src, int startOffset) {
+// This means the "from" node of this token was never seen as a "to" node,
+// which should only happen if we just crossed a hole.  This is a 
challenging
+// case for us because we normally rely on the full dependencies expressed
+// by the arcs to assign outgoing node IDs.  It would be better if tokens
+// were never dropped but instead just marked deleted with a new
+// TermDeletedAttribute (boolean valued) ... but until that future, we have
+// a hack here to forcefully jump the output node ID:
+assert src.outputNode == -1;
+src.node = inputFrom;
+
+int maxOutIndex = outputNodes.getMaxPos();
+OutputNode outSrc = outputNodes.get(maxOutIndex);
+// There are two types of holes, neighbor holes and consumed holes. A 
neighbor hole is between

Review comment:
   I've added some ASCII graphs to the comment. Do those help? They're a 
little weird since I have tokens in node positions, which isn't quite right.







[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-02 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355600#comment-17355600
 ] 

Bruno Roustant commented on LUCENE-9983:


How many states are manipulated?
If the states are numbered from 0 to N, and we keep most of the states during 
the computation, or N is not too high, then should we use an array instead of a 
map, where array[state] is the "reference count"? We wouldn't have to sort the 
set of states for the equality check because iterating the array already visits 
the states in order (skipping states with a reference count of 0).
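
A minimal sketch of this array idea, assuming states are dense integers in [0, numStates). The class and method names (`ArrayStateSet`, `freeze`) are hypothetical and only loosely mirror the names used in the PR discussion; this is not code from Lucene.

```java
import java.util.Arrays;

/** Hypothetical sketch: reference counts held in an array indexed by state. */
public class ArrayStateSet {
  private final int[] refCount; // refCount[state] = number of live references
  private int size;             // number of states with refCount > 0

  public ArrayStateSet(int numStates) {
    refCount = new int[numStates];
  }

  /** Adds one reference to the state. */
  public void incr(int state) {
    if (refCount[state]++ == 0) {
      size++;
    }
  }

  /** Drops one reference; the state leaves the set when its count reaches zero. */
  public void decr(int state) {
    if (refCount[state] == 0) {
      throw new IllegalStateException("state not present: " + state);
    }
    if (--refCount[state] == 0) {
      size--;
    }
  }

  /** Returns the member states in ascending order without any explicit sort. */
  public int[] freeze() {
    int[] states = new int[size];
    int upto = 0;
    for (int state = 0; state < refCount.length; state++) {
      if (refCount[state] > 0) {
        states[upto++] = state;
      }
    }
    return states;
  }

  public static void main(String[] args) {
    ArrayStateSet set = new ArrayStateSet(8);
    set.incr(5);
    set.incr(2);
    set.incr(5);
    System.out.println(Arrays.toString(set.freeze()));
  }
}
```

The trade-off is that `freeze` scans all N slots rather than only the members, so this only pays off when N is small or most states are present, which is the condition stated in the comment above.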







[GitHub] [lucene-solr] janhoy opened a new pull request #2503: Re-introduce ant precommit github action in 8x branch

2021-06-02 Thread GitBox


janhoy opened a new pull request #2503:
URL: https://github.com/apache/lucene-solr/pull/2503


   This PR re-introduces the `ant precommit` github action for branch_8x, which 
was removed when "wiping" master branch after the split.





[jira] [Comment Edited] (LUCENE-9985) Upgrade Jetty to 9.4.41

2021-06-02 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355533#comment-17355533
 ] 

Jan Høydahl edited comment on LUCENE-9985 at 6/2/21, 7:49 AM:
--

I tag this change in Lucene's CHANGES under 8.9 section, since SOLR-15316 
backport will also upgrade Lucene Replicator's jetty version. This PR will only 
be merged to lucene/main and will thus not need a separate backport, since the 
lucene CHANGES entry is also part of the solr-lucene backport, see 
https://github.com/apache/lucene-solr/pull/2502


was (Author: janhoy):
I tag this change in Lucene's CHANGES under 8.9 section, since SOLR-15316 
backport will also upgrade Lucene Replicator's jetty version. This PR will only 
be merged to lucene/main and will thus not need a separate backport, even if 
CHANGES entry is for 8.9. Any objections?

> Upgrade Jetty to 9.4.41
> ---
>
> Key: LUCENE-9985
> URL: https://issues.apache.org/jira/browse/LUCENE-9985
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As Solr is upgrading jetty dependency in 8.9 (shared with lucene), Lucene 
> main should also do the same






[GitHub] [lucene-solr] janhoy opened a new pull request #2502: SOLR-15316 Update Jetty to 9.4.41 (backport 8x)

2021-06-02 Thread GitBox


janhoy opened a new pull request #2502:
URL: https://github.com/apache/lucene-solr/pull/2502


   See https://issues.apache.org/jira/browse/SOLR-15316
   
   This is a backport of SOLR-15316 with mostly ivy changes. But in this 8x 
branch, the upgrade also affects lucene-replicator module. So I filed 
LUCENE-9985 to make sure lucene 9 (main) does not downgrade jetty again for 
that module :) Therefore I also added the LUCENE-9985 changes entry to this PR, 
since Lucene 8.9 is the first that has jetty 9.4.41 for the replicator...





[jira] [Commented] (LUCENE-9985) Upgrade Jetty to 9.4.41

2021-06-02 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355533#comment-17355533
 ] 

Jan Høydahl commented on LUCENE-9985:
-

I tag this change in Lucene's CHANGES under 8.9 section, since SOLR-15316 
backport will also upgrade Lucene Replicator's jetty version. This PR will only 
be merged to lucene/main and will thus not need a separate backport, even if 
CHANGES entry is for 8.9. Any objections?







[GitHub] [lucene] janhoy opened a new pull request #165: LUCENE-9985 Upgrade Jetty to 9.4.41

2021-06-02 Thread GitBox


janhoy opened a new pull request #165:
URL: https://github.com/apache/lucene/pull/165


   See https://issues.apache.org/jira/browse/LUCENE-9985





[jira] [Created] (LUCENE-9985) Upgrade Jetty to 9.4.41

2021-06-02 Thread Jira
Jan Høydahl created LUCENE-9985:
---

 Summary: Upgrade Jetty to 9.4.41
 Key: LUCENE-9985
 URL: https://issues.apache.org/jira/browse/LUCENE-9985
 Project: Lucene - Core
  Issue Type: Task
Reporter: Jan Høydahl
Assignee: Jan Høydahl


As Solr is upgrading jetty dependency in 8.9 (shared with lucene), Lucene main 
should also do the same






[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-02 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355511#comment-17355511
 ] 

Dawid Weiss commented on LUCENE-9976:
-

Hi [~zacharymorn]! Hmm... I optimistically assumed it would reproduce on 
that seed because it did the first time I re-ran it... but indeed, it's not 
reproducible. I do have a good ratio of failures with tests.iters though:
{code}
gradlew test -Ptests.iters=10 --tests TestExpressionSorts.testQueries 
-Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
-Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
{code}
results in (sample):
{code}
10 tests completed, 2 failed

> Task :lucene:expressions:test FAILED

ERROR: The following test(s) have failed:
  - org.apache.lucene.expressions.TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:537BBD158B33BCFB]} (:lucene:expressions)
Test output: 
C:\Work\apache\lucene\main\lucene\expressions\build\test-results\test\outputs\OUTPUT-org.apache.lucene.expressions.TestExpressionSorts.txt
Reproduce with: gradlew :lucene:expressions:test --tests 
"org.apache.lucene.expressions.TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:537BBD158B33BCFB]}" -Ptests.jvms=12 
-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=FF571CE915A0955 
-Ptests.iters=10 -Ptests.multiplier=2 -Ptests.nightly=true 
-Ptests.file.encoding=ISO-8859-1

  - org.apache.lucene.expressions.TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:C25B270A7CC74D2E]} (:lucene:expressions)
Test output: 
C:\Work\apache\lucene\main\lucene\expressions\build\test-results\test\outputs\OUTPUT-org.apache.lucene.expressions.TestExpressionSorts.txt
Reproduce with: gradlew :lucene:expressions:test --tests 
"org.apache.lucene.expressions.TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:C25B270A7CC74D2E]}" -Ptests.jvms=12 
-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=FF571CE915A0955 
-Ptests.iters=10 -Ptests.multiplier=2 -Ptests.nightly=true 
-Ptests.file.encoding=ISO-8859-1
{code}

So something is definitely going on there. :(

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[GitHub] [lucene] glawson0 commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-06-02 Thread GitBox


glawson0 commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r643674911



##
File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -193,14 +194,25 @@ private boolean releaseBufferedToken() {
 }
 if (inputNode.tokens.size() == 0) {
   assert inputNode.nextOut == 0;
-  assert output.nextOut == 0;
   // Hole dest nodes should never be merged since 1) we always
   // assign them to a new output position, and 2) since they never
   // have arriving tokens they cannot be pushed:
-  assert output.inputNodes.size() == 1 : output.inputNodes.size();
+  // skip hole sources, but don't free until every input is checked
+  if (output.inputNodes.size() > 1) {
+output.inputNodes.remove(output.nextOut);
+if (output.nextOut < output.inputNodes.size()) {
+  continue;
+}
+  }
+
   outputFrom++;
-  inputNodes.freeBefore(output.inputNodes.get(0));
+  int freeBefore = Collections.min(output.inputNodes);
+  assert outputNodes.get(outputFrom).inputNodes.stream().filter(n -> 
freeBefore < n).count()

Review comment:
   You're correct. The assertion as written here only checks that at least 
one node is OK rather than all of them, and then prints the wrong message.
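
   To illustrate the difference, here is a minimal sketch. It assumes the 
truncated assertion in the diff compares the filtered count with zero; the 
class and method names are hypothetical and are not part of 
FlattenGraphFilter.

```java
import java.util.List;

// Hypothetical illustration only; not code from FlattenGraphFilter.
class FreeBeforeCheck {

  // "At least one" semantics: true as soon as a single node is above freeBefore.
  // Assumption: this mirrors a check of the form filter(...).count() > 0.
  static boolean anyAbove(List<Integer> inputNodes, int freeBefore) {
    return inputNodes.stream().filter(n -> freeBefore < n).count() > 0;
  }

  // "All nodes" semantics: true only if every node is above freeBefore.
  static boolean allAbove(List<Integer> inputNodes, int freeBefore) {
    return inputNodes.stream().allMatch(n -> freeBefore < n);
  }

  public static void main(String[] args) {
    List<Integer> nodes = List.of(3, 7, 9);
    int freeBefore = 5;
    // Node 3 is not above 5, so the two checks disagree on this input.
    System.out.println("anyAbove = " + anyAbove(nodes, freeBefore));
    System.out.println("allAbove = " + allAbove(nodes, freeBefore));
  }
}
```

   With nodes 3, 7 and 9 and a threshold of 5, the first check passes while 
the second fails, which is the kind of case the count-based form would miss. 
Whether the comparison should be strict is a separate question that this 
sketch does not try to settle.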



