[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front

2020-01-03 Thread GitBox
madrob commented on a change in pull request #1042: LUCENE-9068: Build 
FuzzyQuery automata up-front
URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362972308
 
 

 ##
 File path: 
lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java
 ##
 @@ -32,6 +32,7 @@
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.DisjunctionMaxQuery;
 import org.apache.lucene.search.FuzzyQuery;
 
 Review comment:
   This is an unused import failing recommit still.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front

2020-01-03 Thread GitBox
madrob commented on a change in pull request #1042: LUCENE-9068: Build 
FuzzyQuery automata up-front
URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362974069
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java
 ##
 @@ -92,76 +105,62 @@
* 
* @param terms Delivers terms.
* @param atts {@link AttributeSource} created by the rewrite method of 
{@link MultiTermQuery}
-   * thats contains information about competitive boosts during rewrite. It is 
also used
-   * to cache DFAs between segment transitions.
+   *  that contains information about competitive boosts during 
rewrite
* @param term Pattern term.
* @param maxEdits Maximum edit distance.
-   * @param prefixLength Length of required common prefix. Default value is 0.
+   * @param automata An array of levenshtein automata to match against terms,
+   * see {@link #buildAutomata(String, int[], int, boolean, 
int)}
* @throws IOException if there is a low-level IO error
*/
-  public FuzzyTermsEnum(Terms terms, AttributeSource atts, Term term, 
-  final int maxEdits, final int prefixLength, boolean transpositions) 
throws IOException {
-if (maxEdits < 0 || maxEdits > 
LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE) {
-  throw new IllegalArgumentException("max edits must be 0.." + 
LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE + ", inclusive; got: " + 
maxEdits);
-}
-if (prefixLength < 0) {
-  throw new IllegalArgumentException("prefixLength cannot be less than 0");
-}
+  public FuzzyTermsEnum(Terms terms, AttributeSource atts, Term term, int 
termLength,
+  final int maxEdits, CompiledAutomaton[] automata) throws IOException {
+
 this.maxEdits = maxEdits;
 this.terms = terms;
 this.term = term;
-
-// convert the string into a utf32 int[] representation for fast 
comparisons
-this.termText = stringToUTF32(term.text());
-this.termLength = termText.length;
+this.atts = atts;
+this.termLength = termLength;
 
-this.dfaAtt = atts.addAttribute(LevenshteinAutomataAttribute.class);
 this.maxBoostAtt = 
atts.addAttribute(MaxNonCompetitiveBoostAttribute.class);
+this.boostAtt = atts.addAttribute(BoostAttribute.class);
 
-// NOTE: boostAtt must pulled from attributes() not from atts!  This is 
because TopTermsRewrite looks for boostAtt from this TermsEnum's
-// private attributes() and not the global atts passed to us from 
MultiTermQuery:
-this.boostAtt = attributes().addAttribute(BoostAttribute.class);
-
-//The prefix could be longer than the word.
-//It's kind of silly though.  It means we must match the entire word.
-this.realPrefixLength = prefixLength > termLength ? termLength : 
prefixLength;
-this.transpositions = transpositions;
-
-CompiledAutomaton[] prevAutomata = dfaAtt.automata();
-if (prevAutomata == null) {
-  prevAutomata = new CompiledAutomaton[maxEdits+1];
-  Automaton[] automata = buildAutomata(termText, prefixLength, 
transpositions, maxEdits);
-  for (int i = 0; i <= maxEdits; i++) {
-try {
-  prevAutomata[i] = new CompiledAutomaton(automata[i], true, false);
-} catch (TooComplexToDeterminizeException e) {
-  throw new FuzzyTermsException(term.text(), e);
-}
-  }
-  // first segment computes the automata, and we share with subsequent 
segments via this Attribute:
-  dfaAtt.setAutomata(prevAutomata);
-}
+this.automata = automata;
 
-this.automata = prevAutomata;
 bottom = maxBoostAtt.getMaxNonCompetitiveBoost();
 bottomTerm = maxBoostAtt.getCompetitiveTerm();
 bottomChanged(null);
   }
 
   /**
-   * Builds a binary Automaton to match a fuzzy term
-   * @param textthe term to match
-   * @param prefixLengthlength of a required common prefix
-   * @param transpositions  {@code true} if transpositions should count as a 
single edit
-   * @param maxEditsthe maximum edit distance of matching terms
+   * Sets the maximum non-competitive boost, which may allow switching to a
+   * lower max-edit automaton at run time
+   */
+  public void setMaxNonCompetitiveBoost(float boost) {
+this.maxBoostAtt.setMaxNonCompetitiveBoost(boost);
 
 Review comment:
   Thinking on this further... would it make sense to somehow check the terms 
before building all of the automata, and then possibly skip building the ed=2 
case to begin with? Maybe checking the terms is more expensive than building 
the automaton, so we only care about saving time on how much we use the 
automaton at runtime? I'm not sure here and haven't done the profiling 
necessary to be confident.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, 

[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front

2020-01-03 Thread GitBox
madrob commented on a change in pull request #1042: LUCENE-9068: Build 
FuzzyQuery automata up-front
URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362860716
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/FuzzyQuery.java
 ##
 @@ -101,7 +106,19 @@ public FuzzyQuery(Term term, int maxEdits, int 
prefixLength, int maxExpansions,
 this.prefixLength = prefixLength;
 this.transpositions = transpositions;
 this.maxExpansions = maxExpansions;
+int[] codePoints = FuzzyTermsEnum.stringToUTF32(term.text());
+this.termLength = codePoints.length;
+this.automata = FuzzyTermsEnum.buildAutomata(term.text(), codePoints, 
prefixLength, transpositions, maxEdits);
 setRewriteMethod(new 
MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions));
+this.ramBytesUsed = calculateRamBytesUsed(term, this.automata);
+  }
+
+  private static long calculateRamBytesUsed(Term term, CompiledAutomaton[] 
automata) {
 
 Review comment:
   Is this going to be a few primitives short? I'm not very well versed in how 
accurate the RAM usage is expected to be, so I'm not sure if that matters or 
not.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front

2020-01-03 Thread GitBox
madrob commented on a change in pull request #1042: LUCENE-9068: Build 
FuzzyQuery automata up-front
URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362871453
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java
 ##
 @@ -92,76 +105,62 @@
* 
* @param terms Delivers terms.
* @param atts {@link AttributeSource} created by the rewrite method of 
{@link MultiTermQuery}
-   * thats contains information about competitive boosts during rewrite. It is 
also used
-   * to cache DFAs between segment transitions.
+   *  that contains information about competitive boosts during 
rewrite
* @param term Pattern term.
* @param maxEdits Maximum edit distance.
-   * @param prefixLength Length of required common prefix. Default value is 0.
+   * @param automata An array of levenshtein automata to match against terms,
+   * see {@link #buildAutomata(String, int[], int, boolean, 
int)}
* @throws IOException if there is a low-level IO error
*/
-  public FuzzyTermsEnum(Terms terms, AttributeSource atts, Term term, 
-  final int maxEdits, final int prefixLength, boolean transpositions) 
throws IOException {
-if (maxEdits < 0 || maxEdits > 
LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE) {
-  throw new IllegalArgumentException("max edits must be 0.." + 
LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE + ", inclusive; got: " + 
maxEdits);
-}
-if (prefixLength < 0) {
-  throw new IllegalArgumentException("prefixLength cannot be less than 0");
-}
+  public FuzzyTermsEnum(Terms terms, AttributeSource atts, Term term, int 
termLength,
+  final int maxEdits, CompiledAutomaton[] automata) throws IOException {
+
 this.maxEdits = maxEdits;
 this.terms = terms;
 this.term = term;
-
-// convert the string into a utf32 int[] representation for fast 
comparisons
-this.termText = stringToUTF32(term.text());
-this.termLength = termText.length;
+this.atts = atts;
+this.termLength = termLength;
 
-this.dfaAtt = atts.addAttribute(LevenshteinAutomataAttribute.class);
 this.maxBoostAtt = 
atts.addAttribute(MaxNonCompetitiveBoostAttribute.class);
+this.boostAtt = atts.addAttribute(BoostAttribute.class);
 
-// NOTE: boostAtt must pulled from attributes() not from atts!  This is 
because TopTermsRewrite looks for boostAtt from this TermsEnum's
-// private attributes() and not the global atts passed to us from 
MultiTermQuery:
-this.boostAtt = attributes().addAttribute(BoostAttribute.class);
-
-//The prefix could be longer than the word.
-//It's kind of silly though.  It means we must match the entire word.
-this.realPrefixLength = prefixLength > termLength ? termLength : 
prefixLength;
-this.transpositions = transpositions;
-
-CompiledAutomaton[] prevAutomata = dfaAtt.automata();
-if (prevAutomata == null) {
-  prevAutomata = new CompiledAutomaton[maxEdits+1];
-  Automaton[] automata = buildAutomata(termText, prefixLength, 
transpositions, maxEdits);
-  for (int i = 0; i <= maxEdits; i++) {
-try {
-  prevAutomata[i] = new CompiledAutomaton(automata[i], true, false);
-} catch (TooComplexToDeterminizeException e) {
-  throw new FuzzyTermsException(term.text(), e);
-}
-  }
-  // first segment computes the automata, and we share with subsequent 
segments via this Attribute:
-  dfaAtt.setAutomata(prevAutomata);
-}
+this.automata = automata;
 
-this.automata = prevAutomata;
 bottom = maxBoostAtt.getMaxNonCompetitiveBoost();
 bottomTerm = maxBoostAtt.getCompetitiveTerm();
 bottomChanged(null);
   }
 
   /**
-   * Builds a binary Automaton to match a fuzzy term
-   * @param textthe term to match
-   * @param prefixLengthlength of a required common prefix
-   * @param transpositions  {@code true} if transpositions should count as a 
single edit
-   * @param maxEditsthe maximum edit distance of matching terms
+   * Sets the maximum non-competitive boost, which may allow switching to a
+   * lower max-edit automaton at run time
+   */
+  public void setMaxNonCompetitiveBoost(float boost) {
+this.maxBoostAtt.setMaxNonCompetitiveBoost(boost);
 
 Review comment:
   Does this need to call `bottomChanged`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front

2020-01-03 Thread GitBox
madrob commented on a change in pull request #1042: LUCENE-9068: Build 
FuzzyQuery automata up-front
URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362860923
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/FuzzyQuery.java
 ##
 @@ -150,26 +167,13 @@ public boolean getTranspositions() {
 return transpositions;
   }
 
-  /**
-   * Expert: Constructs an equivalent Automaton accepting terms matched by 
this query
-   */
-  public Automaton toAutomaton() {
-return FuzzyTermsEnum.buildAutomaton(term.text(), prefixLength, 
transpositions, maxEdits);
-  }
-
   @Override
   public void visit(QueryVisitor visitor) {
 if (visitor.acceptField(field)) {
   if (maxEdits == 0 || prefixLength >= term.text().length()) {
 visitor.consumeTerms(this, term);
   } else {
-// Note: we're rebuilding the automaton here, so this can be expensive
-try {
-  visitor.consumeTermsMatching(this, field,
-  new ByteRunAutomaton(toAutomaton(), false, 
Operations.DEFAULT_MAX_DETERMINIZED_STATES));
-} catch (TooComplexToDeterminizeException e) {
-  throw new FuzzyTermsEnum.FuzzyTermsException(term.text(), e);
-}
+automata[automata.length - 1].visit(visitor, this, field);
 
 Review comment:
    


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front

2020-01-02 Thread GitBox
madrob commented on a change in pull request #1042: LUCENE-9068: Build 
FuzzyQuery automata up-front
URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362638865
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java
 ##
 @@ -58,30 +60,42 @@
   // which we use to know when we can reduce the automaton from ed=2 to ed=1, 
or ed=0 if only single top term is collected:
   private final MaxNonCompetitiveBoostAttribute maxBoostAtt;
 
-  // We use this to share the pre-built (once for the query) Levenshtein 
automata across segments:
-  private final LevenshteinAutomataAttribute dfaAtt;
+  private final CompiledAutomaton[] automata;
   
   private float bottom;
   private BytesRef bottomTerm;
-  private final CompiledAutomaton automata[];
 
   private BytesRef queuedBottom;
 
-  final int termLength;
+  private final int termLength;
 
   // Maximum number of edits we will accept.  This is either 2 or 1 (or, 
degenerately, 0) passed by the user originally,
   // but as we collect terms, we can lower this (e.g. from 2 to 1) if we 
detect that the term queue is full, and all
   // collected terms are ed=1:
   private int maxEdits;
 
-  final Terms terms;
-  final Term term;
-  final int termText[];
-  final int realPrefixLength;
+  private final Terms terms;
+  private final Term term;
+
+  /**
+   * Constructor for enumeration of all terms from specified 
reader which share a prefix of
+   * length prefixLength with term and which have at 
most {@code maxEdits} edits.
+   * 
+   * After calling the constructor the enumeration is already pointing to the 
first
+   * valid term if such a term exists.
+   *
+   * @param terms Delivers terms.
+   * @param term Pattern term.
+   * @param maxEdits Maximum edit distance.
+   * @param prefixLength the length of the required common prefix
+   * @param transitions whether transitions should count as a single edit
 
 Review comment:
   nit: why was this renamed from transpositions?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front

2020-01-02 Thread GitBox
madrob commented on a change in pull request #1042: LUCENE-9068: Build 
FuzzyQuery automata up-front
URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362639329
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java
 ##
 @@ -344,67 +363,4 @@ public BytesRef term() throws IOException {
 return actualEnum.term();
   }
 
-  /**
-   * reuses compiled automata across different segments,
-   * because they are independent of the index
-   * @lucene.internal */
-  public static interface LevenshteinAutomataAttribute extends Attribute {
-public CompiledAutomaton[] automata();
-public void setAutomata(CompiledAutomaton[] automata);
-  }
-
-  /** 
-   * Stores compiled automata as a list (indexed by edit distance)
-   * @lucene.internal */
-  public static final class LevenshteinAutomataAttributeImpl extends 
AttributeImpl implements LevenshteinAutomataAttribute {
-private CompiledAutomaton[] automata;
-  
-@Override
-public CompiledAutomaton[] automata() {
-  return automata;
-}
-
-@Override
-public void setAutomata(CompiledAutomaton[] automata) {
-  this.automata = automata;
-}
-
-@Override
-public void clear() {
-  automata = null;
-}
-
-@Override
-public int hashCode() {
-  if (automata == null) {
-return 0;
-  } else {
-return automata.hashCode();
-  }
-}
-
-@Override
-public boolean equals(Object other) {
-  if (this == other)
-return true;
-  if (!(other instanceof LevenshteinAutomataAttributeImpl))
-return false;
-  return Arrays.equals(automata, ((LevenshteinAutomataAttributeImpl) 
other).automata);
 
 Review comment:
   nit: we have some unused imports after this deletion.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org