[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791481#action_12791481
 ] 

Hadoop QA commented on PIG-965:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428184/poregex2.patch
  against trunk revision 890596.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to cause Findbugs to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/130/testReport/
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/130/console

This message is automatically generated.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Benjamin Francisoud
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-16 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791561#action_12791561
 ] 

Thejas M Nair commented on PIG-965:
---

+1 . new patch looks good 
Hudson findbugs and core-tests will fail because it does not include the 
attached jar while compiling.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-16 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791668#action_12791668
 ] 

Olga Natkovich commented on PIG-965:


Thanks, Thejas. I will run test-commit tests manually and commit if it passes.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-15 Thread Ankit Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791096#action_12791096
 ] 

Ankit Modi commented on PIG-965:


Here are numbers comparing comparing optimization 12 against optimization 1  
dk.brics

dk.brics.Runautomaton is as fast as optimization 2 and also provides similar 
speeds in a set of additional expressions.

|| Query || svn_trunk || std_dev || Optimization 1  2 || std_dev || 
Optimization 1  brics.RunAutomaton || std_dev ||
| .\*ABCD.\* |  33.87 |  0.71 | 18.77 | 0.71 | 18.94 | 0.02 |
| .\*ABCD | 30.06 | 2.91 | 18.44 | 0.05 | 18.94 | 0.03 |
| ABCD.\* |  21.93 | 2.91 | 18.35 | 0.1 | 18.85 | 0.04 |

Values are averaged over 3 runs.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791225#action_12791225
 ] 

Hadoop QA commented on PIG-965:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428066/poregex2.patch
  against trunk revision 890596.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to cause Findbugs to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/127/testReport/
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/127/console

This message is automatically generated.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790104#action_12790104
 ] 

Hadoop QA commented on PIG-965:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427913/poregex2.patch
  against trunk revision 889870.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to cause Findbugs to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/testReport/
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/console

This message is automatically generated.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790545#action_12790545
 ] 

Ankit Modi commented on PIG-965:


* NonConstantRegex - I did not think of equals. But I added a length check 
before as it could find out change in length faster and to best of my knowledge 
its a getMethod. And yes as you mentioned equals will check for same object and 
instanceOf which is not useful in our case.

* The numbers published above are using dk.brics.automaton.RunAutomaton. Do you 
want me to publish numbers for more set of regexs ?

I'll create a patch for rest of the comments.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789380#action_12789380
 ] 

Hadoop QA commented on PIG-965:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427730/automaton.jar
  against trunk revision 889346.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/117/console

This message is automatically generated.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-11 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789668#action_12789668
 ] 

Thejas M Nair commented on PIG-965:
---

Review comments: 

* The regex will always be on the rhs. So we don't need the code/classes which 
tries to determine which side has the regular expression based on which side 
has constant.

* in determineBestRegexMethod,  need to add (? to the list of regex strings 
not supported in dk.bricks (in javaRegexOnly) . It has special meanings in java 
regex, which is not honored by dk.brics .

* in determineBestRegexMethod,  We are dealing with cases like \d (choose 
java regex), \\d (choose dk.brics), but not dealing with \\\d (which should 
be choose java regex). ie we need to go back until we find a non '\' char.

* in RegexInit.compile(..), the following message is more appropriate at debug 
level, not at info . At info level, it might also confuse the user.
+log.info(Got an IllegalArgumentException for Pattern:  + 
pattern );
+log.info(e.getMessage());
+log.info(Switching to java.util.regex );

* The following comment in PORegex.java seems to be out of place . 
 // This is a BinaryComparisonOperator hence there can only be two inputs



 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-02 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784849#action_12784849
 ] 

Thejas M Nair commented on PIG-965:
---

In the above performance numbers, I assume optimization 2 (custom string 
comparison) is used only for the regex .*ABCD.* , while optimization 1 
(re-using compiled pattern) is used with dk.brics.automaton as well. Can you 
please confirm ?

From the performance numbers, it looks like we don't need to do optimization 
2. We can just use dk.brics.automaton for the common regexes as well and keep 
the pig code simpler.



 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi

 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-01 Thread Ankit Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784596#action_12784596
 ] 

Ankit Modi commented on PIG-965:


I implemented a patch with optimization 1 and 2 mentioned above and another 
patch with optimization 1,2 and dk.brics.automaton.

dk.brics.automaton does not support all features of java.util.regex hence the 
second patch considers that and switches to java.util.regex if the regex can 
only be handled by java.util.regex.

Here are the numbers

||Regex||   svn_trunk   ||Optimization 1 and 2||
dk.brics.automaton|| comments ||
| .\*ABCD.\* | 92.74 | 50.92| 49.32 | Here only optimization 2 is 
used |
| .\*[A-F]{2,3}.\*  |152.3| 133.48| 105.93 | dk.brics.automaton is used |
| A.B.C.D | 54.492 | 44.46 | 44.66 | dk.brics.automaton is used |
|   .\*([A-F]{4})\w\*\1.\* | 129.29 | 112.89 | 109.43 | java.util.regex used in 
all cases |
|   .\*\[A-F\]\{4\}\w\*[N-Z]\{3\}.\* | 129.63 | 108.11 | 54.42 | 
dk.brics.automaton used |


These results were obtained using Local Mode on 1 Billion lines of data of 
following format
f1:Chararray(100) of random chars from [A-Z]
f2:int random integer

dk.brics.automaton provides good performance in case of complex regex. 


 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi

 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-10-17 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766985#action_12766985
 ] 

Thejas M Nair commented on PIG-965:
---

I found another regex library that is supposed to be faster than 
java.util.regex . - dk.brics.automaton.RegExp (BSD license, used in apache 
nutch).  It does not support all features of java regex, but it is a candidate 
that can be used for purposes of this patch (common simpler regexes).

It is faster than java regex, but much slower than 'optimization2' (see numbers 
in code comments below)
{code}
String prefix = 123;
Pattern p =  Pattern.compile(123.*);
RegExp r = new RegExp(123.*);
Automaton a = r.toAutomaton();

while((str = in.readLine()) != null ){

  // optimization 1 - takes 30 secs
//if((p.matcher(str).matches()))
//matches++;


//optimization 2 - takes 15 secs
//int len = prefix.length();
//boolean matched = true;
//for(int i=0; ilen; i++){
//if(prefix.charAt(i) != str.charAt(i)){
//matched = false;
//break;
//}
//}
//if(matched)
//matches++;

// dk.brics.automaton - takes 25 secs
//if(a.run(str))
//matches++;

tot++;
}
{code}


 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair

 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-09-18 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757242#action_12757242
 ] 

Thejas M Nair commented on PIG-965:
---

The 'common' use case to which these optimization apply usually has a constant 
string specifying the pattern. It makes sense to use this optimization only 
(specifically optimization 2) in such cases, so that the worst case is not 
worse off.

Another thing to check is if there are alternative faster regex implementations 
.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair

 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-09-17 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756673#action_12756673
 ] 

Thejas M Nair commented on PIG-965:
---

Hive like clause implementation is here - 
http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop
/hive/ql/udf/UDFLike.java?revision=802066view=markup

I ran simple tests with a simple java program to see the impact of these 
optimizations. Optimization 1 reduces runtime to 1/2, optimization 2 reduces 
runtime to 1/4 . 

{code}
int matches =0;
int tot = 0;
String prefix = 123;
Pattern p =  Pattern.compile(123.*);
while((str = in.readLine()) != null ){



//without proposed optimizations
//test setups 1 and 2 took 9secs, 126 secs respectively
//if(str.matches(123.*))
//matches++;



// with optimization 1
//test sestups 1, 2 took  4, 57 secs respectively
//if((p.matcher(str).matches()))
//matches++;


// with optimization 1
//test sestups 1, 2 took  2.5, 25 secs respectively
//takes 2.5, 25 secs
//int len = prefix.length();
//boolean matched = true;
//for(int i=0; ilen; i++){
//if(prefix.charAt(i) != str.charAt(i)){
//matched = false;
//break;
//}
//}
//if(matched)
//matches++;

tot++;
}
   }
System.out.println(matches  + matches +  tot  + tot);
{code}

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair

 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.