[jira] [Comment Edited] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119540#comment-16119540 ] Thomas Poppe edited comment on LUCENE-7921 at 8/9/17 9:46 AM: -- Thanks for your comment Dawid. One more thing I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize - but I guess you were already suggesting that. was (Author: thomaspoppe): Thanks for your comment Dawid. One more think I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize - but I guess you were already suggesting that. > More efficient way to transform a RegExp to an Automaton > > > Key: LUCENE-7921 > URL: https://issues.apache.org/jira/browse/LUCENE-7921 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 6.5.1 >Reporter: Thomas Poppe >Priority: Minor > Attachments: capture-7.png, capture-8.png > > > Consider the following example: > {code:title=ToAutomatonExample.java|borderStyle=solid} > public static void main(String[] args) { > org.apache.lucene.util.automaton.RegExp regExp = > new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); > org.apache.lucene.util.automaton.Automaton automaton = > regExp.toAutomaton(); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + automaton.getNumTransitions()); > System.out.println("---"); > try { > regExp = new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); > automaton = regExp.toAutomaton(); > System.out.println("Will not happen..."); > } catch > (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { > automaton = regExp.toAutomaton(1_000_000); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + > automaton.getNumTransitions()); > System.out.println("---"); > } > } > {code} > Both regular expressions are equivalent, but it's much more efficient to > "unroll" the repetition. It might be possible to optimize the > Regex#toAutomaton() method to handle this repetition without going over the > default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119580#comment-16119580 ] Dawid Weiss edited comment on LUCENE-7921 at 8/9/17 8:41 AM: - I think we should optimize {{Operations.repeat}} so that it produces saner input for minimization (single start state and epsilon arcs if x>=1 in \{x,y\}). It still wouldn't be the same behavior (concatenation would see different input clauses of a larger regexp), but it should be less costly (fewer states). was (Author: dweiss): I think we should optimize {{Operations.repeat}} so that it produces saner input for minimization (single start state and epsilon arcs if x>=1 in {x,y}). It still wouldn't be the same behavior (concatenation would see different input clauses of a larger regexp), but it should be less costly (fewer states). > More efficient way to transform a RegExp to an Automaton > > > Key: LUCENE-7921 > URL: https://issues.apache.org/jira/browse/LUCENE-7921 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 6.5.1 >Reporter: Thomas Poppe >Priority: Minor > Attachments: capture-7.png, capture-8.png > > > Consider the following example: > {code:title=ToAutomatonExample.java|borderStyle=solid} > public static void main(String[] args) { > org.apache.lucene.util.automaton.RegExp regExp = > new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); > org.apache.lucene.util.automaton.Automaton automaton = > regExp.toAutomaton(); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + automaton.getNumTransitions()); > System.out.println("---"); > try { > regExp = new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); > automaton = regExp.toAutomaton(); > System.out.println("Will not happen..."); > } catch > (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { > automaton = regExp.toAutomaton(1_000_000); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + > automaton.getNumTransitions()); > System.out.println("---"); > } > } > {code} > Both regular expressions are equivalent, but it's much more efficient to > "unroll" the repetition. It might be possible to optimize the > Regex#toAutomaton() method to handle this repetition without going over the > default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton
[ https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119540#comment-16119540 ] Thomas Poppe edited comment on LUCENE-7921 at 8/9/17 7:56 AM: -- Thanks for your comment Dawid. One more think I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize - but I guess you were already suggesting that. was (Author: thomaspoppe): Thanks for your comment Dawid. One more think I would like to note: the second case also takes more memory and CPU to convert to an automaton, so there might be an opportunity to optimize. > More efficient way to transform a RegExp to an Automaton > > > Key: LUCENE-7921 > URL: https://issues.apache.org/jira/browse/LUCENE-7921 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 6.5.1 >Reporter: Thomas Poppe >Priority: Minor > > Consider the following example: > {code:title=ToAutomatonExample.java|borderStyle=solid} > public static void main(String[] args) { > org.apache.lucene.util.automaton.RegExp regExp = > new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); > org.apache.lucene.util.automaton.Automaton automaton = > regExp.toAutomaton(); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + automaton.getNumTransitions()); > System.out.println("---"); > try { > regExp = new > org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); > automaton = regExp.toAutomaton(); > System.out.println("Will not happen..."); > } catch > (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { > automaton = regExp.toAutomaton(1_000_000); > System.out.println("states: " + automaton.getNumStates()); > System.out.println("transitions: " + > automaton.getNumTransitions()); > System.out.println("---"); > } > } > {code} > Both regular expressions are equivalent, but it's much more efficient to > "unroll" the repetition. It might be possible to optimize the > Regex#toAutomaton() method to handle this repetition without going over the > default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org