[jira] [Comment Edited] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton

2017-08-09 Thread Thomas Poppe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119540#comment-16119540
 ] 

Thomas Poppe edited comment on LUCENE-7921 at 8/9/17 9:46 AM:
--

Thanks for your comment Dawid.  One more thing I would like to note: the second 
case also takes more memory and CPU to convert to an automaton, so there might 
be an opportunity to optimize - but I guess you were already suggesting that.


was (Author: thomaspoppe):
Thanks for your comment Dawid.  One more think I would like to note: the second 
case also takes more memory and CPU to convert to an automaton, so there might 
be an opportunity to optimize - but I guess you were already suggesting that.

> More efficient way to transform a RegExp to an Automaton
> 
>
> Key: LUCENE-7921
> URL: https://issues.apache.org/jira/browse/LUCENE-7921
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 6.5.1
>Reporter: Thomas Poppe
>Priority: Minor
> Attachments: capture-7.png, capture-8.png
>
>
> Consider the following example:
> {code:title=ToAutomatonExample.java|borderStyle=solid}
> public static void main(String[] args) {
> org.apache.lucene.util.automaton.RegExp regExp =
> new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
> org.apache.lucene.util.automaton.Automaton automaton = 
> regExp.toAutomaton();
> System.out.println("states: " + automaton.getNumStates());
> System.out.println("transitions: " + automaton.getNumTransitions());
> System.out.println("---");
> try {
> regExp = new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
> automaton = regExp.toAutomaton();
> System.out.println("Will not happen...");
> } catch 
> (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
> automaton = regExp.toAutomaton(1_000_000);
> System.out.println("states: " + automaton.getNumStates());
> System.out.println("transitions: " + 
> automaton.getNumTransitions());
> System.out.println("---");
> }
> }
> {code}
> Both regular expressions are equivalent, but it's much more efficient to 
> "unroll" the repetition.  It might be possible to optimize the 
> Regex#toAutomaton() method to handle this repetition without going over the 
> default number of determinized states, and using less memory and CPU?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton

2017-08-09 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119580#comment-16119580
 ] 

Dawid Weiss edited comment on LUCENE-7921 at 8/9/17 8:41 AM:
-

I think we should optimize {{Operations.repeat}} so that it produces saner 
input for minimization (single start state and epsilon arcs if x>=1 in 
\{x,y\}). It still wouldn't be the same behavior (concatenation would see 
different input clauses of a larger regexp), but it should be less costly 
(fewer states).


was (Author: dweiss):
I think we should optimize {{Operations.repeat}} so that it produces saner 
input for minimization (single start state and epsilon arcs if x>=1 in {x,y}). 
It still wouldn't be the same behavior (concatenation would see different input 
clauses of a larger regexp), but it should be less costly (fewer states).

> More efficient way to transform a RegExp to an Automaton
> 
>
> Key: LUCENE-7921
> URL: https://issues.apache.org/jira/browse/LUCENE-7921
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 6.5.1
>Reporter: Thomas Poppe
>Priority: Minor
> Attachments: capture-7.png, capture-8.png
>
>
> Consider the following example:
> {code:title=ToAutomatonExample.java|borderStyle=solid}
> public static void main(String[] args) {
> org.apache.lucene.util.automaton.RegExp regExp =
> new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
> org.apache.lucene.util.automaton.Automaton automaton = 
> regExp.toAutomaton();
> System.out.println("states: " + automaton.getNumStates());
> System.out.println("transitions: " + automaton.getNumTransitions());
> System.out.println("---");
> try {
> regExp = new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
> automaton = regExp.toAutomaton();
> System.out.println("Will not happen...");
> } catch 
> (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
> automaton = regExp.toAutomaton(1_000_000);
> System.out.println("states: " + automaton.getNumStates());
> System.out.println("transitions: " + 
> automaton.getNumTransitions());
> System.out.println("---");
> }
> }
> {code}
> Both regular expressions are equivalent, but it's much more efficient to 
> "unroll" the repetition.  It might be possible to optimize the 
> Regex#toAutomaton() method to handle this repetition without going over the 
> default number of determinized states, and using less memory and CPU?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton

2017-08-09 Thread Thomas Poppe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119540#comment-16119540
 ] 

Thomas Poppe edited comment on LUCENE-7921 at 8/9/17 7:56 AM:
--

Thanks for your comment Dawid.  One more think I would like to note: the second 
case also takes more memory and CPU to convert to an automaton, so there might 
be an opportunity to optimize - but I guess you were already suggesting that.


was (Author: thomaspoppe):
Thanks for your comment Dawid.  One more think I would like to note: the second 
case also takes more memory and CPU to convert to an automaton, so there might 
be an opportunity to optimize.

> More efficient way to transform a RegExp to an Automaton
> 
>
> Key: LUCENE-7921
> URL: https://issues.apache.org/jira/browse/LUCENE-7921
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 6.5.1
>Reporter: Thomas Poppe
>Priority: Minor
>
> Consider the following example:
> {code:title=ToAutomatonExample.java|borderStyle=solid}
> public static void main(String[] args) {
> org.apache.lucene.util.automaton.RegExp regExp =
> new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
> org.apache.lucene.util.automaton.Automaton automaton = 
> regExp.toAutomaton();
> System.out.println("states: " + automaton.getNumStates());
> System.out.println("transitions: " + automaton.getNumTransitions());
> System.out.println("---");
> try {
> regExp = new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
> automaton = regExp.toAutomaton();
> System.out.println("Will not happen...");
> } catch 
> (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
> automaton = regExp.toAutomaton(1_000_000);
> System.out.println("states: " + automaton.getNumStates());
> System.out.println("transitions: " + 
> automaton.getNumTransitions());
> System.out.println("---");
> }
> }
> {code}
> Both regular expressions are equivalent, but it's much more efficient to 
> "unroll" the repetition.  It might be possible to optimize the 
> Regex#toAutomaton() method to handle this repetition without going over the 
> default number of determinized states, and using less memory and CPU?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org