[jira] [Updated] (LUCENE-8102) CompiledAutomaton performance for determining common suffix
[ https://issues.apache.org/jira/browse/LUCENE-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Poppe updated LUCENE-8102: - Description: We're using the automaton package as part of Elasticsearch for doing regexp queries. Our business requires us to process rather complex regular expressions, for example (we have more complex examples, but this one illustrates the problem): {noformat} (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d {noformat} With a large enough value of maxDeterminizedStates, this works. The problem we're having is that the conversion of this regular expression to a CompiledAutomaton takes very long. Almost all of the time goes into determining the common suffix for the Automaton (which is "d" in this example) - calculated with a call to Operations.getCommonSuffixBytesRef. This suffix is only used as an optimization. Skipping the calculation of this suffix allows us to process these kinds of queries. - Would it be possible to introduce a way to skip the calculation of this common suffix (ideally something we control from within our query to Elasticsearch)? - Or would it be possible to take a look at this getCommonSuffixBytesRef operation, to see if it can be optimized? Most of the time goes to determinizing the reversed automaton - maybe this can be avoided somehow? Reaction from Mike McCandless on the mailing list: This is just an optimization; maybe we should expose an option to disable it? Or maybe we can find the common suffix on an NFA instead, to avoid determinization? was: We're using the automaton package as part of Elasticsearch for doing regexp queries. Our business requires us to process rather complex regular expressions, for example (we have more complex examples, but this one illustrates the problem): (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d With a large enough value of maxDeterminizedStates, this works. The problem we're having is that the conversion of this regular expression to a CompiledAutomaton takes very long. Almost all of the time goes into determining the common suffix for the Automaton (which is "d" in this example) - calculated with a call to Operations.getCommonSuffixBytesRef. This suffix is only used as an optimization. Skipping the calculation of this suffix allows us to process these kinds of queries. - Would it be possible to introduce a way to skip the calculation of this common suffix (ideally something we control from within our query to Elasticsearch)? - Or would it be possible to take a look at this getCommonSuffixBytesRef operation, to see if it can be optimized? Most of the time goes to determinizing the reversed automaton - maybe this can be avoided somehow? Reaction from Mike McCandless on the mailing list: This is just an optimization; maybe we should expose an option to disable it? Or maybe we can find the common suffix on an NFA instead, to avoid determinization? > CompiledAutomaton performance for determining common suffix > --- > > Key: LUCENE-8102 > URL: https://issues.apache.org/jira/browse/LUCENE-8102 > Project: Lucene - Core > Issue Type: Improvement > Components: core/FSTs >Affects Versions: 7.1 >Reporter: Thomas Poppe > > We're using the automaton package as part of Elasticsearch for doing regexp > queries. Our business requires us to process rather complex regular > expressions, for example (we have more complex examples, but this one > illustrates the problem): > {noformat} > (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d > {noformat} > With a large enough value of maxDeterminizedStates, this works. The problem > we're having is that the conversion of this regular expression to a > CompiledAutomaton takes very long. Almost all of the time goes into > determining the common suffix for the Automaton (which is "d" in this > example) - calculated with a call to Operations.getCommonSuffixBytesRef. > This suffix is only used as an optimization. Skipping the calculation of > this suffix allows us to process these kinds of queries. > - Would it be possible to introduce a way to skip the calculation of this > common suffix (ideally something we control from within our query to > Elasticsearch)? > - Or would it be possible to take a look at this getCommonSuffixBytesRef > operation, to see if it can be optimized? Most of the time goes to > determinizing the reversed automaton - maybe this can be avoided somehow? > Reaction from Mike McCandless on the mailing list: > This is just an optimization; maybe we should expose an option to disable it? > Or maybe we can find the common suffix on an NFA instead, to avoid > determinization? -- This message was sent by Atlassian JIRA (v6.4.14#64029) --
[jira] [Created] (LUCENE-8102) CompiledAutomaton performance for determining common suffix
Thomas Poppe created LUCENE-8102: Summary: CompiledAutomaton performance for determining common suffix Key: LUCENE-8102 URL: https://issues.apache.org/jira/browse/LUCENE-8102 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Affects Versions: 7.1 Reporter: Thomas Poppe We're using the automaton package as part of Elasticsearch for doing regexp queries. Our business requires us to process rather complex regular expressions, for example (we have more complex examples, but this one illustrates the problem): (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d With a large enough value of maxDeterminizedStates, this works. The problem we're having is that the conversion of this regular expression to a CompiledAutomaton takes very long. Almost all of the time goes into determining the common suffix for the Automaton (which is "d" in this example) - calculated with a call to Operations.getCommonSuffixBytesRef. This suffix is only used as an optimization. Skipping the calculation of this suffix allows us to process these kinds of queries. - Would it be possible to introduce a way to skip the calculation of this common suffix (ideally something we control from within our query to Elasticsearch)? - Or would it be possible to take a look at this getCommonSuffixBytesRef operation, to see if it can be optimized? Most of the time goes to determinizing the reversed automaton - maybe this can be avoided somehow? Reaction from Mike McCandless on the mailing list: This is just an optimization; maybe we should expose an option to disable it? Or maybe we can find the common suffix on an NFA instead, to avoid determinization? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119540#comment-16119540 ] Thomas Poppe edited comment on LUCENE-7921 at 8/9/17 9:46 AM: -- Thanks for your comment Dawid. One more thing I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize - but I guess you were already suggesting that. was (Author: thomaspoppe): Thanks for your comment Dawid. One more think I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize - but I guess you were already suggesting that. > More efficient way to transform a RegExp to an Automaton > > > Key: LUCENE-7921 > URL: https://issues.apache.org/jira/browse/LUCENE-7921 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 6.5.1 >Reporter: Thomas Poppe >Priority: Minor > Attachments: capture-7.png, capture-8.png > > > Consider the following example: > {code:title=ToAutomatonExample.java|borderStyle=solid} > public static void main(String[] args) { > org.apache.lucene.util.automaton.RegExp regExp = > new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); > org.apache.lucene.util.automaton.Automaton automaton = > regExp.toAutomaton(); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + automaton.getNumTransitions()); > System.out.println("---"); > try { > regExp = new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); > automaton = regExp.toAutomaton(); > System.out.println("Will not happen..."); > } catch > (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { > automaton = regExp.toAutomaton(1_000_000); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + > automaton.getNumTransitions()); > System.out.println("---"); > } > } > {code} > Both regular expressions are equivalent, but it's much more efficient to > "unroll" the repetition. It might be possible to optimize the > Regex#toAutomaton() method to handle this repetition without going over the > default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119540#comment-16119540 ] Thomas Poppe edited comment on LUCENE-7921 at 8/9/17 7:56 AM: -- Thanks for your comment Dawid. One more think I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize - but I guess you were already suggesting that. was (Author: thomaspoppe): Thanks for your comment Dawid. One more think I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize. > More efficient way to transform a RegExp to an Automaton > > > Key: LUCENE-7921 > URL: https://issues.apache.org/jira/browse/LUCENE-7921 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 6.5.1 >Reporter: Thomas Poppe >Priority: Minor > > Consider the following example: > {code:title=ToAutomatonExample.java|borderStyle=solid} > public static void main(String[] args) { > org.apache.lucene.util.automaton.RegExp regExp = > new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); > org.apache.lucene.util.automaton.Automaton automaton = > regExp.toAutomaton(); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + automaton.getNumTransitions()); > System.out.println("---"); > try { > regExp = new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); > automaton = regExp.toAutomaton(); > System.out.println("Will not happen..."); > } catch > (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { > automaton = regExp.toAutomaton(1_000_000); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + > automaton.getNumTransitions()); > System.out.println("---"); > } > } > {code} > Both regular expressions are equivalent, but it's much more efficient to > "unroll" the repetition. It might be possible to optimize the > Regex#toAutomaton() method to handle this repetition without going over the > default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119540#comment-16119540 ] Thomas Poppe commented on LUCENE-7921: -- Thanks for your comment Dawid. One more think I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize. > More efficient way to transform a RegExp to an Automaton > > > Key: LUCENE-7921 > URL: https://issues.apache.org/jira/browse/LUCENE-7921 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 6.5.1 >Reporter: Thomas Poppe >Priority: Minor > > Consider the following example: > {code:title=ToAutomatonExample.java|borderStyle=solid} > public static void main(String[] args) { > org.apache.lucene.util.automaton.RegExp regExp = > new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); > org.apache.lucene.util.automaton.Automaton automaton = > regExp.toAutomaton(); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + automaton.getNumTransitions()); > System.out.println("---"); > try { > regExp = new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); > automaton = regExp.toAutomaton(); > System.out.println("Will not happen..."); > } catch > (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { > automaton = regExp.toAutomaton(1_000_000); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + > automaton.getNumTransitions()); > System.out.println("---"); > } > } > {code} > Both regular expressions are equivalent, but it's much more efficient to > "unroll" the repetition. It might be possible to optimize the > Regex#toAutomaton() method to handle this repetition without going over the > default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119523#comment-16119523 ] Thomas Poppe commented on LUCENE-7921: -- It's the opposite: unrolling gets you the benefit. I was hoping more for the conclusion that none of the cases should be throwing the exception - as the regexp is not that complex, and neither is the resulting automaton. Elasticsearch has no problems executing it (in the unrolled variant). > More efficient way to transform a RegExp to an Automaton > > > Key: LUCENE-7921 > URL: https://issues.apache.org/jira/browse/LUCENE-7921 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 6.5.1 >Reporter: Thomas Poppe >Priority: Minor > > Consider the following example: > {code:title=ToAutomatonExample.java|borderStyle=solid} > public static void main(String[] args) { > org.apache.lucene.util.automaton.RegExp regExp = > new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); > org.apache.lucene.util.automaton.Automaton automaton = > regExp.toAutomaton(); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + automaton.getNumTransitions()); > System.out.println("---"); > try { > regExp = new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); > automaton = regExp.toAutomaton(); > System.out.println("Will not happen..."); > } catch > (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { > automaton = regExp.toAutomaton(1_000_000); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + > automaton.getNumTransitions()); > System.out.println("---"); > } > } > {code} > Both regular expressions are equivalent, but it's much more efficient to > "unroll" the repetition. It might be possible to optimize the > Regex#toAutomaton() method to handle this repetition without going over the > default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Poppe updated LUCENE-7921: - Description: Consider the following example: {code:title=ToAutomatonExample.java|borderStyle=solid} public static void main(String[] args) { org.apache.lucene.util.automaton.RegExp regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); org.apache.lucene.util.automaton.Automaton automaton = regExp.toAutomaton(); System.out.println("states: " + automaton.getNumStates()); System.out.println("transitions: " + automaton.getNumTransitions()); System.out.println("---"); try { regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); automaton = regExp.toAutomaton(); System.out.println("Will not happen..."); } catch (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { automaton = regExp.toAutomaton(1_000_000); System.out.println("states: " + automaton.getNumStates()); System.out.println("transitions: " + automaton.getNumTransitions()); System.out.println("---"); } } {code} Both regular expressions are equivalent, but it's much more efficient to "unroll" the repetition. It might be possible to optimize the Regex#toAutomaton() method to handle this repetition without going over the default number of determinized states, and using less memory and CPU? was: Consider the following example: public static void main(String[] args) { org.apache.lucene.util.automaton.RegExp regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); org.apache.lucene.util.automaton.Automaton automaton = regExp.toAutomaton(); System.out.println("states: " + automaton.getNumStates()); System.out.println("transitions: " + automaton.getNumTransitions()); System.out.println("---"); try { regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); automaton = regExp.toAutomaton(); System.out.println("Will not happen..."); } catch (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { automaton = regExp.toAutomaton(1_000_000); System.out.println("states: " + automaton.getNumStates()); System.out.println("transitions: " + automaton.getNumTransitions()); System.out.println("---"); } } Both regular expressions are equivalent, but it's much more efficient to "unroll" the repetition. It might be possible to optimize the Regex#toAutomaton() method to handle this repetition without going over the default number of determinized states, and using less memory and CPU? > More efficient way to transform a RegExp to an Automaton > > > Key: LUCENE-7921 > URL: https://issues.apache.org/jira/browse/LUCENE-7921 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 6.5.1 >Reporter: Thomas Poppe >Priority: Minor > > Consider the following example: > {code:title=ToAutomatonExample.java|borderStyle=solid} > public static void main(String[] args) { > org.apache.lucene.util.automaton.RegExp regExp = > new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); > org.apache.lucene.util.automaton.Automaton automaton = > regExp.toAutomaton(); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + automaton.getNumTransitions()); > System.out.println("---"); > try { > regExp = new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); > automaton = regExp.toAutomaton(); > System.out.println("Will not happen..."); > } catch > (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { > automaton = regExp.toAutomaton(1_000_000); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + > automaton.getNumTransitions()); > System.out.println("---"); > } > } > {code} > Both regular expressions are equivalent, but it's much more efficient to > "unroll" the repetition. It might be possible to optimize the > Regex#toAutomaton() method to handle this repetition without going over the > default n
[jira] [Created] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
Thomas Poppe created LUCENE-7921: Summary: More efficient way to transform a RegExp to an Automaton Key: LUCENE-7921 URL: https://issues.apache.org/jira/browse/LUCENE-7921 Project: Lucene - Core Issue Type: Improvement Affects Versions: 6.5.1 Reporter: Thomas Poppe Priority: Minor Consider the following example: public static void main(String[] args) { org.apache.lucene.util.automaton.RegExp regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); org.apache.lucene.util.automaton.Automaton automaton = regExp.toAutomaton(); System.out.println("states: " + automaton.getNumStates()); System.out.println("transitions: " + automaton.getNumTransitions()); System.out.println("---"); try { regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); automaton = regExp.toAutomaton(); System.out.println("Will not happen..."); } catch (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { automaton = regExp.toAutomaton(1_000_000); System.out.println("states: " + automaton.getNumStates()); System.out.println("transitions: " + automaton.getNumTransitions()); System.out.println("---"); } } Both regular expressions are equivalent, but it's much more efficient to "unroll" the repetition. It might be possible to optimize the Regex#toAutomaton() method to handle this repetition without going over the default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org