[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front
madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362972308 ## File path: lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java ## @@ -32,6 +32,7 @@ import org.apache.lucene.search.BooleanQuery; import org.apache.lucene.search.DisjunctionMaxQuery; import org.apache.lucene.search.FuzzyQuery; Review comment: This is an unused import failing recommit still. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front
madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362974069 ## File path: lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java ## @@ -92,76 +105,62 @@ * * @param terms Delivers terms. * @param atts {@link AttributeSource} created by the rewrite method of {@link MultiTermQuery} - * thats contains information about competitive boosts during rewrite. It is also used - * to cache DFAs between segment transitions. + * that contains information about competitive boosts during rewrite * @param term Pattern term. * @param maxEdits Maximum edit distance. - * @param prefixLength Length of required common prefix. Default value is 0. + * @param automata An array of levenshtein automata to match against terms, + * see {@link #buildAutomata(String, int[], int, boolean, int)} * @throws IOException if there is a low-level IO error */ - public FuzzyTermsEnum(Terms terms, AttributeSource atts, Term term, - final int maxEdits, final int prefixLength, boolean transpositions) throws IOException { -if (maxEdits < 0 || maxEdits > LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE) { - throw new IllegalArgumentException("max edits must be 0.." + LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE + ", inclusive; got: " + maxEdits); -} -if (prefixLength < 0) { - throw new IllegalArgumentException("prefixLength cannot be less than 0"); -} + public FuzzyTermsEnum(Terms terms, AttributeSource atts, Term term, int termLength, + final int maxEdits, CompiledAutomaton[] automata) throws IOException { + this.maxEdits = maxEdits; this.terms = terms; this.term = term; - -// convert the string into a utf32 int[] representation for fast comparisons -this.termText = stringToUTF32(term.text()); -this.termLength = termText.length; +this.atts = atts; +this.termLength = termLength; -this.dfaAtt = atts.addAttribute(LevenshteinAutomataAttribute.class); this.maxBoostAtt = atts.addAttribute(MaxNonCompetitiveBoostAttribute.class); +this.boostAtt = atts.addAttribute(BoostAttribute.class); -// NOTE: boostAtt must pulled from attributes() not from atts! This is because TopTermsRewrite looks for boostAtt from this TermsEnum's -// private attributes() and not the global atts passed to us from MultiTermQuery: -this.boostAtt = attributes().addAttribute(BoostAttribute.class); - -//The prefix could be longer than the word. -//It's kind of silly though. It means we must match the entire word. -this.realPrefixLength = prefixLength > termLength ? termLength : prefixLength; -this.transpositions = transpositions; - -CompiledAutomaton[] prevAutomata = dfaAtt.automata(); -if (prevAutomata == null) { - prevAutomata = new CompiledAutomaton[maxEdits+1]; - Automaton[] automata = buildAutomata(termText, prefixLength, transpositions, maxEdits); - for (int i = 0; i <= maxEdits; i++) { -try { - prevAutomata[i] = new CompiledAutomaton(automata[i], true, false); -} catch (TooComplexToDeterminizeException e) { - throw new FuzzyTermsException(term.text(), e); -} - } - // first segment computes the automata, and we share with subsequent segments via this Attribute: - dfaAtt.setAutomata(prevAutomata); -} +this.automata = automata; -this.automata = prevAutomata; bottom = maxBoostAtt.getMaxNonCompetitiveBoost(); bottomTerm = maxBoostAtt.getCompetitiveTerm(); bottomChanged(null); } /** - * Builds a binary Automaton to match a fuzzy term - * @param textthe term to match - * @param prefixLengthlength of a required common prefix - * @param transpositions {@code true} if transpositions should count as a single edit - * @param maxEditsthe maximum edit distance of matching terms + * Sets the maximum non-competitive boost, which may allow switching to a + * lower max-edit automaton at run time + */ + public void setMaxNonCompetitiveBoost(float boost) { +this.maxBoostAtt.setMaxNonCompetitiveBoost(boost); Review comment: Thinking on this further... would it make sense to somehow check the terms before building all of the automata, and then possibly skip building the ed=2 case to begin with? Maybe checking the terms is more expensive than building the automaton, so we only care about saving time on how much we use the automaton at runtime? I'm not sure here and haven't done the profiling necessary to be confident. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service,
[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front
madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362860716 ## File path: lucene/core/src/java/org/apache/lucene/search/FuzzyQuery.java ## @@ -101,7 +106,19 @@ public FuzzyQuery(Term term, int maxEdits, int prefixLength, int maxExpansions, this.prefixLength = prefixLength; this.transpositions = transpositions; this.maxExpansions = maxExpansions; +int[] codePoints = FuzzyTermsEnum.stringToUTF32(term.text()); +this.termLength = codePoints.length; +this.automata = FuzzyTermsEnum.buildAutomata(term.text(), codePoints, prefixLength, transpositions, maxEdits); setRewriteMethod(new MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)); +this.ramBytesUsed = calculateRamBytesUsed(term, this.automata); + } + + private static long calculateRamBytesUsed(Term term, CompiledAutomaton[] automata) { Review comment: Is this going to be a few primitives short? I'm not very well versed in how accurate the RAM usage is expected to be, so I'm not sure if that matters or not. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front
madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362871453 ## File path: lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java ## @@ -92,76 +105,62 @@ * * @param terms Delivers terms. * @param atts {@link AttributeSource} created by the rewrite method of {@link MultiTermQuery} - * thats contains information about competitive boosts during rewrite. It is also used - * to cache DFAs between segment transitions. + * that contains information about competitive boosts during rewrite * @param term Pattern term. * @param maxEdits Maximum edit distance. - * @param prefixLength Length of required common prefix. Default value is 0. + * @param automata An array of levenshtein automata to match against terms, + * see {@link #buildAutomata(String, int[], int, boolean, int)} * @throws IOException if there is a low-level IO error */ - public FuzzyTermsEnum(Terms terms, AttributeSource atts, Term term, - final int maxEdits, final int prefixLength, boolean transpositions) throws IOException { -if (maxEdits < 0 || maxEdits > LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE) { - throw new IllegalArgumentException("max edits must be 0.." + LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE + ", inclusive; got: " + maxEdits); -} -if (prefixLength < 0) { - throw new IllegalArgumentException("prefixLength cannot be less than 0"); -} + public FuzzyTermsEnum(Terms terms, AttributeSource atts, Term term, int termLength, + final int maxEdits, CompiledAutomaton[] automata) throws IOException { + this.maxEdits = maxEdits; this.terms = terms; this.term = term; - -// convert the string into a utf32 int[] representation for fast comparisons -this.termText = stringToUTF32(term.text()); -this.termLength = termText.length; +this.atts = atts; +this.termLength = termLength; -this.dfaAtt = atts.addAttribute(LevenshteinAutomataAttribute.class); this.maxBoostAtt = atts.addAttribute(MaxNonCompetitiveBoostAttribute.class); +this.boostAtt = atts.addAttribute(BoostAttribute.class); -// NOTE: boostAtt must pulled from attributes() not from atts! This is because TopTermsRewrite looks for boostAtt from this TermsEnum's -// private attributes() and not the global atts passed to us from MultiTermQuery: -this.boostAtt = attributes().addAttribute(BoostAttribute.class); - -//The prefix could be longer than the word. -//It's kind of silly though. It means we must match the entire word. -this.realPrefixLength = prefixLength > termLength ? termLength : prefixLength; -this.transpositions = transpositions; - -CompiledAutomaton[] prevAutomata = dfaAtt.automata(); -if (prevAutomata == null) { - prevAutomata = new CompiledAutomaton[maxEdits+1]; - Automaton[] automata = buildAutomata(termText, prefixLength, transpositions, maxEdits); - for (int i = 0; i <= maxEdits; i++) { -try { - prevAutomata[i] = new CompiledAutomaton(automata[i], true, false); -} catch (TooComplexToDeterminizeException e) { - throw new FuzzyTermsException(term.text(), e); -} - } - // first segment computes the automata, and we share with subsequent segments via this Attribute: - dfaAtt.setAutomata(prevAutomata); -} +this.automata = automata; -this.automata = prevAutomata; bottom = maxBoostAtt.getMaxNonCompetitiveBoost(); bottomTerm = maxBoostAtt.getCompetitiveTerm(); bottomChanged(null); } /** - * Builds a binary Automaton to match a fuzzy term - * @param textthe term to match - * @param prefixLengthlength of a required common prefix - * @param transpositions {@code true} if transpositions should count as a single edit - * @param maxEditsthe maximum edit distance of matching terms + * Sets the maximum non-competitive boost, which may allow switching to a + * lower max-edit automaton at run time + */ + public void setMaxNonCompetitiveBoost(float boost) { +this.maxBoostAtt.setMaxNonCompetitiveBoost(boost); Review comment: Does this need to call `bottomChanged`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front
madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362860923 ## File path: lucene/core/src/java/org/apache/lucene/search/FuzzyQuery.java ## @@ -150,26 +167,13 @@ public boolean getTranspositions() { return transpositions; } - /** - * Expert: Constructs an equivalent Automaton accepting terms matched by this query - */ - public Automaton toAutomaton() { -return FuzzyTermsEnum.buildAutomaton(term.text(), prefixLength, transpositions, maxEdits); - } - @Override public void visit(QueryVisitor visitor) { if (visitor.acceptField(field)) { if (maxEdits == 0 || prefixLength >= term.text().length()) { visitor.consumeTerms(this, term); } else { -// Note: we're rebuilding the automaton here, so this can be expensive -try { - visitor.consumeTermsMatching(this, field, - new ByteRunAutomaton(toAutomaton(), false, Operations.DEFAULT_MAX_DETERMINIZED_STATES)); -} catch (TooComplexToDeterminizeException e) { - throw new FuzzyTermsEnum.FuzzyTermsException(term.text(), e); -} +automata[automata.length - 1].visit(visitor, this, field); Review comment: This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front
madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362638865 ## File path: lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java ## @@ -58,30 +60,42 @@ // which we use to know when we can reduce the automaton from ed=2 to ed=1, or ed=0 if only single top term is collected: private final MaxNonCompetitiveBoostAttribute maxBoostAtt; - // We use this to share the pre-built (once for the query) Levenshtein automata across segments: - private final LevenshteinAutomataAttribute dfaAtt; + private final CompiledAutomaton[] automata; private float bottom; private BytesRef bottomTerm; - private final CompiledAutomaton automata[]; private BytesRef queuedBottom; - final int termLength; + private final int termLength; // Maximum number of edits we will accept. This is either 2 or 1 (or, degenerately, 0) passed by the user originally, // but as we collect terms, we can lower this (e.g. from 2 to 1) if we detect that the term queue is full, and all // collected terms are ed=1: private int maxEdits; - final Terms terms; - final Term term; - final int termText[]; - final int realPrefixLength; + private final Terms terms; + private final Term term; + + /** + * Constructor for enumeration of all terms from specified reader which share a prefix of + * length prefixLength with term and which have at most {@code maxEdits} edits. + * + * After calling the constructor the enumeration is already pointing to the first + * valid term if such a term exists. + * + * @param terms Delivers terms. + * @param term Pattern term. + * @param maxEdits Maximum edit distance. + * @param prefixLength the length of the required common prefix + * @param transitions whether transitions should count as a single edit Review comment: nit: why was this renamed from transpositions? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front
madrob commented on a change in pull request #1042: LUCENE-9068: Build FuzzyQuery automata up-front URL: https://github.com/apache/lucene-solr/pull/1042#discussion_r362639329 ## File path: lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java ## @@ -344,67 +363,4 @@ public BytesRef term() throws IOException { return actualEnum.term(); } - /** - * reuses compiled automata across different segments, - * because they are independent of the index - * @lucene.internal */ - public static interface LevenshteinAutomataAttribute extends Attribute { -public CompiledAutomaton[] automata(); -public void setAutomata(CompiledAutomaton[] automata); - } - - /** - * Stores compiled automata as a list (indexed by edit distance) - * @lucene.internal */ - public static final class LevenshteinAutomataAttributeImpl extends AttributeImpl implements LevenshteinAutomataAttribute { -private CompiledAutomaton[] automata; - -@Override -public CompiledAutomaton[] automata() { - return automata; -} - -@Override -public void setAutomata(CompiledAutomaton[] automata) { - this.automata = automata; -} - -@Override -public void clear() { - automata = null; -} - -@Override -public int hashCode() { - if (automata == null) { -return 0; - } else { -return automata.hashCode(); - } -} - -@Override -public boolean equals(Object other) { - if (this == other) -return true; - if (!(other instanceof LevenshteinAutomataAttributeImpl)) -return false; - return Arrays.equals(automata, ((LevenshteinAutomataAttributeImpl) other).automata); Review comment: nit: we have some unused imports after this deletion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org