[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164870#comment-16164870
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/907


> Improve performance of filter operator for pattern matching
> ---
>
> Key: DRILL-5697
> URL: https://issues.apache.org/jira/browse/DRILL-5697
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Flow
> Affects Versions: 1.11.0
> Reporter: Padma Penumarthy
> Assignee: Padma Penumarthy
> Labels: ready-to-commit
>
> Queries that filter with the SQL LIKE operator use the Java regex library for 
> pattern matching. However, for patterns such as %abc (ends with abc), abc% 
> (starts with abc), and %abc% (contains abc), implementing the match with simple 
> code instead of the regex library was observed to give a good performance 
> boost (4-6x). The idea is to use special-case code for simple, common patterns 
> and fall back to the Java regex library for complicated ones. That will 
> provide a good performance benefit for the most common cases.
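
A minimal sketch of the special-case idea described in the issue, assuming values are plain `String`s (Drill itself operates on UTF-8 buffers, which this does not model). The class and method names are hypothetical and are not taken from the patch.

```java
import java.util.regex.Pattern;

// Illustrative only: names are hypothetical, not Drill's.
final class SimpleLikeSketch {
  // abc%  : value starts with the constant
  static boolean startsWith(String value, String constant) {
    return value.startsWith(constant);
  }

  // %abc  : value ends with the constant
  static boolean endsWith(String value, String constant) {
    return value.endsWith(constant);
  }

  // %abc% : value contains the constant anywhere
  static boolean contains(String value, String constant) {
    return value.contains(constant);
  }

  // Fallback for everything else: a Java regex compiled once up front.
  static boolean regexMatches(Pattern compiled, String value) {
    return compiled.matcher(value).matches();
  }
}
```

The three simple checks avoid the regex engine entirely; only patterns that do not fit one of these shapes would go through `regexMatches`.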



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140898#comment-16140898
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/907
  
@kkhatua Kunal, we can add more patterns later if we want. For now, let us 
get the simplest cases done first. 




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140893#comment-16140893
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/907
  
@paul-rogers Paul, thanks a lot for the review. I made changes as per your 
comments. Please review updated diffs. 




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133513#comment-16133513
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134035511
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java
 ---
@@ -57,22 +57,120 @@ private StringFunctions() {}
 @Output BitHolder out;
 @Workspace java.util.regex.Matcher matcher;
 @Workspace org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper 
charSequenceWrapper;
+@Workspace 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternInfo patternInfo;
 
 @Override
 public void setup() {
-  matcher = 
java.util.regex.Pattern.compile(org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlToRegexLike(
 //
-  
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(pattern.start,
  pattern.end,  pattern.buffer))).matcher("");
+  patternInfo = 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlToRegexLike(
+  
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(pattern.start,
 pattern.end, pattern.buffer));
   charSequenceWrapper = new 
org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper();
-  matcher.reset(charSequenceWrapper);
+
+  // Use java regex and compile pattern only if it is not a simple 
pattern.
+  if (patternInfo.getPatternType() == 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternType.NOT_SIMPLE) {
+java.lang.String javaPatternString = 
patternInfo.getJavaPatternString();
+matcher = 
java.util.regex.Pattern.compile(javaPatternString).matcher("");
+matcher.reset(charSequenceWrapper);
+  }
 }
 
 @Override
 public void eval() {
   charSequenceWrapper.setBuffer(input.start, input.end, input.buffer);
   // Reusing same charSequenceWrapper, no need to pass it in.
   // This saves one method call since reset(CharSequence) calls reset()
-  matcher.reset();
-  out.value = matcher.matches()? 1:0;
+
+  // Not a simple case. Just use Java regex.
+  if (patternInfo.getPatternType() == 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternType.NOT_SIMPLE) {
+matcher.reset();
+out.value = matcher.matches() ? 1 : 0;
+  }
+
+  // This is a simple pattern that ends with a constant string i.e. 
%ABC
+  // Compare the characters starting from end.
+  if (patternInfo.getPatternType() == 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternType.ENDS_WITH) {
--- End diff --

Chains of ifs are rather old-school. Do a switch on the enum. And, do it in 
the pattern class so it can be unit tested.
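
A sketch of what "switch on the enum, in the pattern class" could look like. It assumes `String` inputs and uses hypothetical names (`PatternInfoSketch`, `Type`); it is not the code from the pull request.

```java
import java.util.regex.Pattern;

// Hypothetical pattern class; the enum constants mirror the ones in the diff.
final class PatternInfoSketch {
  enum Type { STARTS_WITH, ENDS_WITH, CONTAINS, NOT_SIMPLE }

  private final Type type;
  private final String constant; // literal text for the simple cases
  private final Pattern regex;   // compiled only for NOT_SIMPLE

  PatternInfoSketch(Type type, String constant, String javaRegex) {
    this.type = type;
    this.constant = constant;
    this.regex = (type == Type.NOT_SIMPLE) ? Pattern.compile(javaRegex) : null;
  }

  // One switch replaces the chain of ifs, and can be tested without Drill.
  boolean matches(String value) {
    switch (type) {
      case STARTS_WITH:
        return value.startsWith(constant);
      case ENDS_WITH:
        return value.endsWith(constant);
      case CONTAINS:
        return value.contains(constant);
      case NOT_SIMPLE:
        return regex.matcher(value).matches();
      default:
        throw new IllegalStateException("Unknown pattern type: " + type);
    }
  }
}
```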




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133508#comment-16133508
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134032343
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/RegexpUtil.java 
---
@@ -47,18 +47,55 @@
   "[:alnum:]", "\\p{Alnum}"
   };
 
+  // type of pattern string.
+  public enum sqlPatternType {
--- End diff --

Class name format: `SqlPatternType`




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133514#comment-16133514
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134035089
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/RegexpUtil.java 
---
@@ -96,20 +145,46 @@ public static String sqlToRegexLike(
 || (nextChar == '%')
 || (nextChar == escapeChar)) {
   javaPattern.append(nextChar);
+  simplePattern.append(nextChar);
   i++;
 } else {
   throw invalidEscapeSequence(sqlPattern, i);
 }
   } else if (c == '_') {
+// if we find _, it is not simple pattern, we are looking for only 
%
+notSimple = true;
 javaPattern.append('.');
   } else if (c == '%') {
+if (i == 0) {
+  // % at the start could potentially be one of the simple cases 
i.e. ENDS_WITH.
+  endsWith = true;
+} else if (i == (len-1)) {
+  // % at the end could potentially be one of the simple cases 
i.e. STARTS_WITH
+  startsWith = true;
+} else {
+  // If we find % anywhere other than start or end, it is not a 
simple case.
+  notSimple = true;
+}
 javaPattern.append(".");
 javaPattern.append('*');
   } else {
 javaPattern.append(c);
+simplePattern.append(c);
   }
 }
-return javaPattern.toString();
+
+if (!notSimple) {
--- End diff --

Yeah, the zillion-flags approach is too complex to follow. We really need a 
good old state machine.




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133516#comment-16133516
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134035728
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java
 ---
@@ -85,23 +183,118 @@ public void eval() {
 @Output BitHolder out;
 @Workspace java.util.regex.Matcher matcher;
 @Workspace org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper 
charSequenceWrapper;
+@Workspace 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternInfo patternInfo;
 
 @Override
 public void setup() {
-  matcher = 
java.util.regex.Pattern.compile(org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlToRegexLike(
 //
+  patternInfo = 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlToRegexLike(
   
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(pattern.start,
  pattern.end,  pattern.buffer),
-  
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(escape.start,
  escape.end,  escape.buffer))).matcher("");
+  
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(escape.start,
  escape.end,  escape.buffer));
   charSequenceWrapper = new 
org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper();
-  matcher.reset(charSequenceWrapper);
+
+  // Use java regex and compile pattern only if it is not a simple 
pattern.
+  if (patternInfo.getPatternType() == 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternType.NOT_SIMPLE) {
+java.lang.String javaPatternString = 
patternInfo.getJavaPatternString();
+matcher = 
java.util.regex.Pattern.compile(javaPatternString).matcher("");
+matcher.reset(charSequenceWrapper);
+  }
 }
 
 @Override
 public void eval() {
   charSequenceWrapper.setBuffer(input.start, input.end, input.buffer);
   // Reusing same charSequenceWrapper, no need to pass it in.
   // This saves one method call since reset(CharSequence) calls reset()
-  matcher.reset();
-  out.value = matcher.matches()? 1:0;
+
+  // Not a simple case. Just use Java regex.
+  if (patternInfo.getPatternType() == 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternType.NOT_SIMPLE) {
--- End diff --

We are doing a switch (actually, chain of ifs) per value. This is a tight 
inner loop. Far better to simply generate an instance of the proper class and 
call a single method to do the work.
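
To illustrate the point about the inner loop, here is a rough sketch in which the decision is made once and each value costs a single method call. The interface and helper names are hypothetical, and `String` values stand in for Drill's buffers.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: pick the matcher once, then call one method per value.
final class HotLoopSketch {
  interface LikeMatcher {
    boolean match(String value);
  }

  // Chosen once, analogous to work done in setup().
  static LikeMatcher forContains(String text) {
    return value -> value.contains(text);
  }

  static LikeMatcher forRegex(String javaRegex) {
    Pattern compiled = Pattern.compile(javaRegex);
    return value -> compiled.matcher(value).matches();
  }

  // Analogous to eval() being called per value: no type checks in the loop.
  static int countMatches(LikeMatcher matcher, String[] values) {
    int count = 0;
    for (String value : values) {
      if (matcher.match(value)) {
        count++;
      }
    }
    return count;
  }
}
```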




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133511#comment-16133511
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134035411
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java
 ---
@@ -57,22 +57,120 @@ private StringFunctions() {}
 @Output BitHolder out;
 @Workspace java.util.regex.Matcher matcher;
 @Workspace org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper 
charSequenceWrapper;
+@Workspace 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternInfo patternInfo;
 
 @Override
 public void setup() {
-  matcher = 
java.util.regex.Pattern.compile(org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlToRegexLike(
 //
-  
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(pattern.start,
  pattern.end,  pattern.buffer))).matcher("");
+  patternInfo = 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlToRegexLike(
+  
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(pattern.start,
 pattern.end, pattern.buffer));
   charSequenceWrapper = new 
org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper();
-  matcher.reset(charSequenceWrapper);
+
+  // Use java regex and compile pattern only if it is not a simple 
pattern.
+  if (patternInfo.getPatternType() == 
org.apache.drill.exec.expr.fn.impl.RegexpUtil.sqlPatternType.NOT_SIMPLE) {
--- End diff --

You have an enum to describe the cases and a class to capture the info. 
That class is the perfect place to encode how to process each case. In 
particular, the pattern class should act as a factory for a pattern executor: 
it creates an instance of the class needed to do the work. That will also 
allow this code to be unit tested without needing all of Drill.
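
One possible shape for such a factory, sketched with `java.util.function.Predicate` as the executor type. All names here are hypothetical, and the sketch assumes `String` inputs rather than Drill's buffers.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical pattern class that hands out the executor for its own type.
final class PatternFactorySketch {
  enum Type { STARTS_WITH, ENDS_WITH, CONTAINS, NOT_SIMPLE }

  private final Type type;
  private final String text; // constant text, or a Java regex for NOT_SIMPLE

  PatternFactorySketch(Type type, String text) {
    this.type = type;
    this.text = text;
  }

  // The switch runs once here, not once per value.
  Predicate<String> newMatcher() {
    switch (type) {
      case STARTS_WITH:
        return value -> value.startsWith(text);
      case ENDS_WITH:
        return value -> value.endsWith(text);
      case CONTAINS:
        return value -> value.contains(text);
      default: {
        Pattern compiled = Pattern.compile(text);
        return value -> compiled.matcher(value).matches();
      }
    }
  }
}
```

Because nothing in the class depends on Drill, the matcher it returns can be exercised directly from a plain unit test.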




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133517#comment-16133517
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134035974
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestStringFunctions.java
 ---
@@ -157,6 +157,967 @@ public void testRegexpReplace() throws Exception {
   }
 
   @Test
+  public void testLikeStartsWith() throws Exception {
+
+// all ASCII.
+testBuilder()
--- End diff --

The regex parsing and execution code is becoming complex. Let's test it 
with a true unit test, not just a system-level test using a query. See the test 
frameworks available. We can also discuss in person.
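
As a rough idea of what a query-free check might look like, the sketch below tests a simplified LIKE-to-regex translation directly. The translation is a stand-in written for this example (it handles only `%` and `_`, with no escape character) and is not Drill's `RegexpUtil`; a real test would use one of the project's test frameworks rather than the hand-rolled `check` helper.

```java
import java.util.regex.Pattern;

// Hypothetical, framework-free checks against a simplified translation.
final class LikeTranslationCheck {
  // Stand-in for the code under test: % -> .*, _ -> ., everything else literal.
  static String likeToRegex(String sqlPattern) {
    StringBuilder regex = new StringBuilder();
    for (int i = 0; i < sqlPattern.length(); i++) {
      char c = sqlPattern.charAt(i);
      if (c == '%') {
        regex.append(".*");
      } else if (c == '_') {
        regex.append('.');
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return regex.toString();
  }

  static void check(String sqlPattern, String value, boolean expected) {
    boolean actual = Pattern.compile(likeToRegex(sqlPattern), Pattern.DOTALL)
        .matcher(value).matches();
    if (actual != expected) {
      throw new AssertionError(sqlPattern + " vs " + value + ": expected " + expected);
    }
  }

  static void runAll() {
    check("abc%", "abcdef", true);
    check("abc%", "xabc", false);
    check("%abc", "xyzabc", true);
    check("%abc%", "xxabcxx", true);
    check("a_c", "abc", true);
    check("a_c", "abbc", false);
    check("a.c", "abc", false); // a literal dot must not act as a wildcard
  }
}
```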




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133515#comment-16133515
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134034498
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/RegexpUtil.java 
---
@@ -76,12 +113,24 @@ public static String sqlToRegexLike(
   /**
* Translates a SQL LIKE pattern to Java regex pattern.
*/
-  public static String sqlToRegexLike(
+  public static sqlPatternInfo sqlToRegexLike(
   String sqlPattern,
   char escapeChar) {
 int i;
 final int len = sqlPattern.length();
 final StringBuilder javaPattern = new StringBuilder(len + len);
+final StringBuilder simplePattern = new StringBuilder(len);
+
+// Figure out the pattern type and build simplePatternString
+// as we are going through the sql pattern string
+// to build java regex pattern string. This is better instead of using
+// regex later for determining if a pattern is simple or not.
+// Saves CPU cycles.
+sqlPatternType patternType = sqlPatternType.NOT_SIMPLE;
+boolean startsWith = false;
+boolean endsWith = false;
+boolean notSimple = false;
--- End diff --

Or, since your enum represents terminal states, create a new enum with 
internal states: CONST_ONLY, WILDCARD, COMPLEX, with transitions
```
initial: CONST_ONLY
all constant characters, CONST_ONLY --> CONST_ONLY
%, CONST_ONLY --> WILDCARD
any other special char, any state --> COMPLEX
%, WILDCARD --> COMPLEX
```
Or, even better, define a simple recursive descent parser in which states 
are encoded as methods rather than as state variables.
```
parseConstant() ...
- parseWildcard()
- parseComplex()

parseWildcard() ...
- parseComplex()

parseComplex() ...
```

Here, I'm ignoring the details of detecting abc, abc%, ab%c, %abc. These 
can also be represented as states with the resulting transitions.
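
The transition table above can be sketched as a small enum-driven scanner. This is only an illustration under stated assumptions: escape handling is omitted, `_` is the only "other special char", and a constant character seen in the WILDCARD state is assumed to leave the state unchanged (the table does not say). Like the table, it does not yet distinguish abc%, %abc and %abc%; for example, %abc% ends in COMPLEX because of its second %.

```java
// Hypothetical sketch of the three internal states and their transitions.
final class LikeStateSketch {
  enum State { CONST_ONLY, WILDCARD, COMPLEX }

  static State next(State current, char c) {
    if (current == State.COMPLEX || c == '_') {
      return State.COMPLEX; // COMPLEX is absorbing; _ is always complex
    }
    if (c == '%') {
      // First % moves to WILDCARD, a second % moves to COMPLEX.
      return current == State.CONST_ONLY ? State.WILDCARD : State.COMPLEX;
    }
    return current; // constant character: assumed to keep the current state
  }

  static State scan(String sqlPattern) {
    State state = State.CONST_ONLY;
    for (int i = 0; i < sqlPattern.length(); i++) {
      state = next(state, sqlPattern.charAt(i));
    }
    return state;
  }
}
```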




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133512#comment-16133512
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134032943
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/RegexpUtil.java 
---
@@ -96,20 +145,46 @@ public static String sqlToRegexLike(
 || (nextChar == '%')
 || (nextChar == escapeChar)) {
   javaPattern.append(nextChar);
+  simplePattern.append(nextChar);
   i++;
 } else {
   throw invalidEscapeSequence(sqlPattern, i);
 }
   } else if (c == '_') {
+// if we find _, it is not simple pattern, we are looking for only 
%
+notSimple = true;
--- End diff --

`type = NOT_SIMPLE`




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133510#comment-16133510
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134032668
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/RegexpUtil.java 
---
@@ -76,12 +113,24 @@ public static String sqlToRegexLike(
   /**
* Translates a SQL LIKE pattern to Java regex pattern.
*/
-  public static String sqlToRegexLike(
+  public static sqlPatternInfo sqlToRegexLike(
   String sqlPattern,
   char escapeChar) {
 int i;
 final int len = sqlPattern.length();
 final StringBuilder javaPattern = new StringBuilder(len + len);
+final StringBuilder simplePattern = new StringBuilder(len);
+
+// Figure out the pattern type and build simplePatternString
+// as we are going through the sql pattern string
+// to build java regex pattern string. This is better instead of using
+// regex later for determining if a pattern is simple or not.
+// Saves CPU cycles.
+sqlPatternType patternType = sqlPatternType.NOT_SIMPLE;
+boolean startsWith = false;
+boolean endsWith = false;
+boolean notSimple = false;
--- End diff --

These are not independent states. Probably better to use your enum, with 
the initial value as null (unknown).




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133509#comment-16133509
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r134034748
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/RegexpUtil.java 
---
@@ -96,20 +145,46 @@ public static String sqlToRegexLike(
 || (nextChar == '%')
 || (nextChar == escapeChar)) {
   javaPattern.append(nextChar);
+  simplePattern.append(nextChar);
   i++;
 } else {
   throw invalidEscapeSequence(sqlPattern, i);
 }
   } else if (c == '_') {
+// if we find _, it is not simple pattern, we are looking for only 
%
+notSimple = true;
 javaPattern.append('.');
   } else if (c == '%') {
+if (i == 0) {
+  // % at the start could potentially be one of the simple cases 
i.e. ENDS_WITH.
+  endsWith = true;
+} else if (i == (len-1)) {
--- End diff --

A bit of a funky way to do this. Might as well actually wait until the end. 
This is why we need states (as an enum or via recursive descent). At the end:

If all constants: CONST
If one wildcard: one of the simple cases
Otherwise: COMPLEX
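
A sketch of deferring the decision to the end of the scan: count the wildcards first, then classify once. Names are hypothetical and escape characters are ignored. Treating %abc% (two wildcards, one at each end) as CONTAINS is an assumption borrowed from the enum in the pull request, and a bare % or %% is simply left to the COMPLEX fallback here.

```java
// Hypothetical classifier that decides only after the whole pattern is seen.
final class LikeClassifierSketch {
  enum Type { CONSTANT, STARTS_WITH, ENDS_WITH, CONTAINS, COMPLEX }

  static Type classify(String sqlPattern) {
    if (sqlPattern.indexOf('_') >= 0) {
      return Type.COMPLEX; // single-character wildcard: not a simple case
    }
    int wildcards = 0;
    for (int i = 0; i < sqlPattern.length(); i++) {
      if (sqlPattern.charAt(i) == '%') {
        wildcards++;
      }
    }
    boolean leading = sqlPattern.startsWith("%");
    boolean trailing = sqlPattern.endsWith("%");
    int len = sqlPattern.length();

    if (wildcards == 0) {
      return Type.CONSTANT;
    }
    if (wildcards == 1 && len > 1) {
      if (trailing) {
        return Type.STARTS_WITH; // abc%
      }
      if (leading) {
        return Type.ENDS_WITH;   // %abc
      }
    }
    if (wildcards == 2 && leading && trailing && len > 2) {
      return Type.CONTAINS;      // %abc%
    }
    return Type.COMPLEX;         // ab%c, %, %%, a%b%c, ...
  }
}
```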




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131755#comment-16131755
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r133859377
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/RegexpUtil.java 
---
@@ -47,18 +47,55 @@
   "[:alnum:]", "\\p{Alnum}"
   };
 
+  // type of pattern string.
+  public enum sqlPatternType {
+STARTS_WITH, // Starts with a constant string followed by any string values (ABC%)
+ENDS_WITH, // Ends with a constant string, starts with any string values (%ABC)
+CONTAINS, // Contains a constant string, starts and ends with any string values (%ABC%)
--- End diff --

You should add a pattern of the form 'Starts with a constant, ends with 
another constant, and has any string in between'
(ABC%XYZ)
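
For illustration, a pattern of the form ABC%XYZ could be matched without regex roughly as follows. This is a hypothetical helper on `String` values, not part of the patch. The length check matters: without it, a value shorter than prefix plus suffix could pass both tests by letting them overlap.

```java
// Hypothetical matcher for "constant prefix, anything, constant suffix".
final class PrefixSuffixSketch {
  static boolean matches(String value, String prefix, String suffix) {
    // The prefix and suffix must not overlap inside the value.
    return value.length() >= prefix.length() + suffix.length()
        && value.startsWith(prefix)
        && value.endsWith(suffix);
  }
}
```

With prefix `AB` and suffix `BA`, the value `ABA` starts with one and ends with the other but is rejected by the length check, which agrees with what the regex `AB.*BA` would do.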




[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131756#comment-16131756
 ] 

ASF GitHub Bot commented on DRILL-5697:
---

Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/907#discussion_r133859807
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/RegexpUtil.java 
---
@@ -96,20 +145,46 @@ public static String sqlToRegexLike(
 || (nextChar == '%')
 || (nextChar == escapeChar)) {
   javaPattern.append(nextChar);
+  simplePattern.append(nextChar);
   i++;
 } else {
   throw invalidEscapeSequence(sqlPattern, i);
 }
   } else if (c == '_') {
+// if we find _, it is not simple pattern, we are looking for only %
+notSimple = true;
 javaPattern.append('.');
   } else if (c == '%') {
+if (i == 0) {
+  // % at the start could potentially be one of the simple cases i.e. ENDS_WITH.
+  endsWith = true;
+} else if (i == (len-1)) {
+  // % at the end could potentially be one of the simple cases i.e. STARTS_WITH
+  startsWith = true;
+} else {
+  // If we find % anywhere other than start or end, it is not a simple case.
--- End diff --

Consider ABC%XYZ.
It might be worthwhile to decide whether to use a special-case pattern or fall back 
to Java's regex library based on the number of occurrences of '%'.
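
One way this idea could look, as a rough sketch only: count the '%' characters and look at where they sit. STARTS_WITH, ENDS_WITH and CONTAINS mirror the enum in the diff; the other enum values, the class name and the method name are hypothetical. The sketch assumes a pattern with no escape character and sends anything containing '_' to the regex path.

```java
// Hypothetical sketch, not code from the pull request.
final class LikeClassifierSketch {
  enum Kind { CONSTANT, STARTS_WITH, ENDS_WITH, CONTAINS, PREFIX_SUFFIX, COMPLEX }

  private LikeClassifierSketch() {}

  static Kind classify(String sqlPattern) {
    // '_' matches any single character, so leave it to the regex library.
    if (sqlPattern.indexOf('_') >= 0) {
      return Kind.COMPLEX;
    }
    int len = sqlPattern.length();
    int percentCount = 0;
    for (int i = 0; i < len; i++) {
      if (sqlPattern.charAt(i) == '%') {
        percentCount++;
      }
    }
    boolean first = len > 0 && sqlPattern.charAt(0) == '%';
    boolean last = len > 0 && sqlPattern.charAt(len - 1) == '%';
    switch (percentCount) {
      case 0:
        return Kind.CONSTANT;            // abc
      case 1:
        if (len == 1) {
          return Kind.COMPLEX;           // bare %, no constant to compare
        }
        if (first) {
          return Kind.ENDS_WITH;         // %abc
        }
        if (last) {
          return Kind.STARTS_WITH;       // abc%
        }
        return Kind.PREFIX_SUFFIX;       // abc%xyz
      case 2:
        // Only %abc% is simple; a%b%c or %% fall back to regex.
        return (first && last && len > 2) ? Kind.CONTAINS : Kind.COMPLEX;
      default:
        return Kind.COMPLEX;
    }
  }
}
```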


[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-15 Thread Karthikeyan Manivannan (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128070#comment-16128070
 ] 

Karthikeyan Manivannan commented on DRILL-5697:
---

Where should I be looking if I want to see what the Baseline code does?


[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-14 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126475#comment-16126475
 ] 

Padma Penumarthy commented on DRILL-5697:
-

Yes, it is 6 times worse. The reason could be how we create the regex pattern 
string from the SQL LIKE pattern string.


[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122402#comment-16122402
 ] 

Padma Penumarthy commented on DRILL-5697:
-

I ran a number of experiments to figure out the best approach.

Basically, here is what we do for a "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all ASCII, 
we read each byte directly as a character from PlatformDependent. Otherwise, we decode 
the UTF-8 bytes, copy them to a charBuffer, and read characters from that. 
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.
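
The two steps above can be sketched roughly as follows. This is not Drill's wrapper class: it assumes a plain byte[] instead of reading through PlatformDependent, uses a decoded String where the real code uses a charBuffer, and the class name is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Drill's actual wrapper class.
final class Utf8CharSequenceSketch implements CharSequence {
  private final byte[] bytes;     // raw input, read directly when all ASCII
  private final String decoded;   // decoded copy, built only for non-ASCII input

  Utf8CharSequenceSketch(byte[] utf8) {
    boolean allAscii = true;
    for (byte b : utf8) {
      if (b < 0) {                // high bit set: part of a multi-byte sequence
        allAscii = false;
        break;
      }
    }
    this.bytes = utf8;
    this.decoded = allAscii ? null : new String(utf8, StandardCharsets.UTF_8);
  }

  @Override
  public int length() {
    return decoded == null ? bytes.length : decoded.length();
  }

  @Override
  public char charAt(int index) {
    // ASCII fast path: one byte is one character.
    return decoded == null ? (char) bytes[index] : decoded.charAt(index);
  }

  @Override
  public CharSequence subSequence(int start, int end) {
    return toString().subSequence(start, end);
  }

  @Override
  public String toString() {
    return decoded == null ? new String(bytes, StandardCharsets.US_ASCII) : decoded;
  }
}
```

Because java.util.regex.Matcher accepts any CharSequence, an instance of such a wrapper can be passed straight to Pattern.matcher(...).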

All the numbers below are processing times for the filter operation.

Baseline:
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

For all-ASCII input, getByte does a bounds check every time we call it, so I 
wanted to see whether getting the bytes in one shot is better. That did not help much 
with performance. In fact, it made things worse for the 'a%' type of match.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)

Next, I used find() instead of matcher.matches(). The numbers are better, but not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)
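
For reference, the difference between the two calls can be sketched as below for the '%a' case. The anchored search pattern is an assumption about how a LIKE pattern might be rewritten for find(); it is not the code used in the experiment, and the class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the code used in the experiment.
final class FindVsMatchesSketch {
  // LIKE '%a' as a whole-input match: the leading .* has to cover the input.
  private static final Pattern FULL = Pattern.compile(".*a", Pattern.DOTALL);
  // The same predicate as a search for 'a' anchored at the end of input.
  private static final Pattern SEARCH = Pattern.compile("a\\z");

  private FindVsMatchesSketch() {}

  static boolean viaMatches(CharSequence input) {
    return FULL.matcher(input).matches();
  }

  static boolean viaFind(CharSequence input) {
    return SEARCH.matcher(input).find();
  }
}
```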

Next, I tried always building the charBuffer (even if the input is all ASCII) and 
using the String functions startsWith, endsWith and contains.
The numbers are better, but not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 (vs 9.794s baseline)

I also tried the Google RE2 library and got much worse numbers than what we are 
getting with the Java regex library.

Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)
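
A minimal sketch of what such comparison functions could look like. It compares UTF-8 bytes in a plain byte[] rather than reading from Drill's buffers, and the class and method names are hypothetical; the actual functions are in the pull request. Comparing bytes is enough for these three cases because a valid UTF-8 pattern can only match a valid UTF-8 input on character boundaries.

```java
// Hypothetical sketch, not the functions from the pull request.
final class SimpleLikeSketch {
  private SimpleLikeSketch() {}

  // LIKE 'pat%'
  static boolean startsWith(byte[] input, byte[] pat) {
    if (pat.length > input.length) {
      return false;
    }
    for (int i = 0; i < pat.length; i++) {
      if (input[i] != pat[i]) {
        return false;
      }
    }
    return true;
  }

  // LIKE '%pat'
  static boolean endsWith(byte[] input, byte[] pat) {
    int offset = input.length - pat.length;
    if (offset < 0) {
      return false;
    }
    for (int i = 0; i < pat.length; i++) {
      if (input[offset + i] != pat[i]) {
        return false;
      }
    }
    return true;
  }

  // LIKE '%pat%': naive scan, trying every start position.
  static boolean contains(byte[] input, byte[] pat) {
    for (int start = 0; start + pat.length <= input.length; start++) {
      int j = 0;
      while (j < pat.length && input[start + j] == pat[j]) {
        j++;
      }
      if (j == pat.length) {
        return true;
      }
    }
    return false;
  }
}
```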
