[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2018-01-10 Thread sachouche
Github user sachouche closed the pull request at:

https://github.com/apache/drill/pull/1001


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-25 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146952578
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,37 +17,166 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
+  final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
+
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot achieve
+// better optimizations (especially vectorization)
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
--- End diff --

I ran two types of tests to evaluate the generic vs custom method:
Test1 - A match does exist
o Custom method is 3x faster because the code will go to the "else" part
o A nested loop is always slower than unrolled code (the loop is also 
correlated with the outer one); please refer to this article 
(https://en.wikipedia.org/wiki/Loop_unrolling) on the benefits of loop unrolling
o Older match function performed in 59sec

Test2- A match doesn't exist
o Custom method and generic one perform in 15sec; this is because both 
perform a comparison and proceed to the next iteration
o Older match function performed in 45sec 


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-25 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146948741
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,37 +17,166 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
+  final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
+
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot achieve
+// better optimizations (especially vectorization)
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
--- End diff --

how does  matcherN perform compared to matcher1, matcher2, matcher3 for 
pattern lengths 1, 2 and 3 ? If matcherN performs well for patternLengths 1, 2 
and 3, we can just have one matcher instead of multiple for different pattern 
lengths. 


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-24 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146708658
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,37 +17,166 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
+  final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
+
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot achieve
+// better optimizations (especially vectorization)
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
--- End diff --

Padma, I have two reasons to follow the added complexity
1) The new code is encapsulated within the Contains matching logic; doesn't 
increase code complexity
2) 
o I created a test with the original match logic, pattern and input were 
Strings though passed as CharSequence
o Ran the test with the new and old method (1 billion iterations) on MacOS
o pattern length 
o The old match method performed in 43sec where as the new one performed in 
15sec
o The reason for the speedup is the custom matcher functions have less 
instructions (load and comparison)


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-24 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146705325
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end - out.start)) {
+out.asciiMode = 
org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can 
conclude this string is ASCII
--- End diff --

- As previously stated (when responding to Paul'd comment), the expression 
framework is able to use the same VarCharHolder input variable when it is 
shared amongst multiple expressions
- If the original column was of type var-binary, then the expression 
framework will include a cast to var-char
- The cast logic will also compute the string length
- Using this information to deduce whether the string is pure ASCII or not
- UTF-8 encoding uses 1 byte for ASCII and 2, 3, or 4 for other character 
sets
- If the encoded length and character length are equal, then this means 
this is an ASCII string 


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146390381
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/CharSequenceWrapper.java
 ---
@@ -53,13 +55,13 @@
   // The end offset into the drill buffer
   private int end;
   // Indicates that the current byte buffer contains only ascii chars
-  private boolean usAscii;
+  private boolean useAscii;
 
   public CharSequenceWrapper() {
   }
 
   public CharSequenceWrapper(int start, int end, DrillBuf buffer) {
-setBuffer(start, end, buffer);
+setBuffer(start, end, buffer, -1);
--- End diff --

what does -1 mean ? Shouldn't it be one of CHAR_MODE_IS_ASCII, 
CHAR_MODE_UNKNOWN or CHAR_MODE_NOT_ASCII ?



---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146392275
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,37 +17,166 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
+  final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
+
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot achieve
+// better optimizations (especially vectorization)
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
--- End diff --

I am not sure if it is a good idea to write special case code for each 
pattern length. It is not easy to maintain. Can you please give more details 
how this is improving performance ?  Are we getting better performance for 
patternLength 1 or 2 or 3 or N or all ? If so, how much and why ? 


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146388018
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end - out.start)) {
+out.asciiMode = 
org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can 
conclude this string is ASCII
--- End diff --

can you please add comments here ? I am not able to understand this change. 


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-19 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145746250
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end-out.start)) {
+out.asciiMode = 1; // we can conclude this string is ASCII
--- End diff --

will do!


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145575103
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end-out.start)) {
+out.asciiMode = 1; // we can conclude this string is ASCII
--- End diff --

Drill likes spaces around operators: `out.end - out.start`


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145577107
  
--- Diff: exec/vector/src/main/codegen/templates/ValueHolders.java ---
@@ -33,34 +33,40 @@
  * This class is generated using freemarker and the ${.template_name} 
template.
  */
 public final class ${className} implements ValueHolder{
-  
+
   public static final MajorType TYPE = 
Types.${mode.name?lower_case}(MinorType.${minor.class?upper_case});
+  public MajorType getType() {return TYPE;}
 
 <#if mode.name == "Repeated">
-
+
 /** The first index (inclusive) into the Vector. **/
 public int start;
-
+
 /** The last index (exclusive) into the Vector. **/
 public int end;
-
+
 /** The Vector holding the actual values. **/
 public ${minor.class}Vector vector;
-
+
 <#else>
 public static final int WIDTH = ${type.width};
-
+
 <#if mode.name == "Optional">public int isSet;
 <#assign fields = minor.fields!type.fields />
 <#list fields as field>
 public ${field.type} ${field.name};
 
-
+
+<#if minor.class == "VarChar">
+// -1: unknown, 0: not ascii, 1: is ascii
+public int asciiMode = -1;
--- End diff --

Drill is complex, nothing is as it seems. Drill has a very elaborate 
mechanism to rewrite byte codes to do scalar replacement. That is, holders are 
added to the Java source during code generation, but then are removed during 
byte code transforms. (Yes, that is silly given that Java 8 does a fine job on 
its own; but it is how Drill works today...)

Will adding this variable change that behavior? Will the scalar replacement 
logic know to not replace this object? If so, will that hurt performance, or 
with the JVM go ahead and figure it out on its own? Is all the code that sets 
the variable inlined via code generation so that scalar replacement is possible?

Yes, indeed, nothing in Drill is as simple as it seems...


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145578320
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,36 +17,133 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
   }
 
   @Override
-  public int match() {
-final int txtLength = charSequenceWrapper.length();
-int patternIndex = 0;
-int txtIndex = 0;
+  public final int match() {
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot vectorize
+// the generate code.
+if (patternLength == 1) {
--- End diff --

Pattern length is fixed per expression; need not be checked per row. See 
more below.


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145576718
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,36 +17,133 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
   }
 
   @Override
-  public int match() {
-final int txtLength = charSequenceWrapper.length();
-int patternIndex = 0;
-int txtIndex = 0;
+  public final int match() {
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot vectorize
+// the generate code.
+if (patternLength == 1) {
+  return match_1();
+} else if (patternLength == 2) {
+  return match_2();
+} else if (patternLength == 3) {
+  return match_3();
+} else {
+  return match_N();
+}
+  }
+
+  private final int match_1() {
--- End diff --

See note about UTF-8. If we don't care about the match position (that is, 
we don't need `strpos()`, and all we care is whether it matches or not, then we 
can do the work on the undecoded UTF-8 bytes, saving a large amount of 
complexity.


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145577808
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,36 +17,133 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
   }
 
   @Override
-  public int match() {
-final int txtLength = charSequenceWrapper.length();
-int patternIndex = 0;
-int txtIndex = 0;
+  public final int match() {
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot vectorize
+// the generate code.
+if (patternLength == 1) {
+  return match_1();
+} else if (patternLength == 2) {
+  return match_2();
+} else if (patternLength == 3) {
+  return match_3();
+} else {
+  return match_N();
+}
+  }
+
+  private final int match_1() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length();
+final char first_patt_char = patternString.charAt(0);
+
+// simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+// so, we can just directly compare.
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
+  }
+  return 1;
+}
+return 0;
+  }
+
+  private final int match_2() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length() - 1;
+final char first_patt_char = patternString.charAt(0);
+
+// simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+// so, we can just directly compare.
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
+  } else {
+char ch2_1 = sequenceWrapper.charAt(idx+1);
+char ch2_2 = patternString.charAt(1);
--- End diff --

We want speed. Instead of getting the second character multiple times, is 
it better to get it once up front? I suppose that depends on the hit rate. 
Average hit rate may be 2 % (1/ ~64). So if our input is smaller than 64 
characters, we'll have, on average one hit so we pay the second character cost 
once. At 128 or above, we'll pay the cost two or more times. But, maybe the JVM 
can optimize away the second and subsequent accesses?

Actually, let's take a step back. The pattern is fixed. We parsed the 
pattern to decide to use this particular class. Should we instead create a 
1-char, 2-char and n-char matcher class so we get the second character (for the 
2-char case) only once, and we eliminate the extra per-value if-check?


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145574785
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end-out.start)) {
+out.asciiMode = 1; // we can conclude this string is ASCII
--- End diff --

The values -1, 0, and 1 are magic numbers here. Because we want speed, we 
don't want to use an enum. But, can we define symbols for our values: 
`IS_ASCII`, `NOT_ASCII`, `ASCII_UNKNOWN`? (You decide on the names...)


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145577894
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,36 +17,133 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
   }
 
   @Override
-  public int match() {
-final int txtLength = charSequenceWrapper.length();
-int patternIndex = 0;
-int txtIndex = 0;
+  public final int match() {
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot vectorize
+// the generate code.
+if (patternLength == 1) {
+  return match_1();
+} else if (patternLength == 2) {
+  return match_2();
+} else if (patternLength == 3) {
+  return match_3();
+} else {
+  return match_N();
+}
+  }
+
+  private final int match_1() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length();
+final char first_patt_char = patternString.charAt(0);
+
+// simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+// so, we can just directly compare.
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
+  }
+  return 1;
+}
+return 0;
+  }
+
+  private final int match_2() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length() - 1;
+final char first_patt_char = patternString.charAt(0);
+
+// simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+// so, we can just directly compare.
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
+  } else {
+char ch2_1 = sequenceWrapper.charAt(idx+1);
+char ch2_2 = patternString.charAt(1);
+
+if (ch2_1 == ch2_2) {
+  return 1;
+}
+  }
+}
+return 0;
+  }
+
+  private final int match_3() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length() -2;
+final char first_patt_char = patternString.charAt(0);
 
 // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
 // so, we can just directly compare.
-while (patternIndex < patternLength && txtIndex < txtLength) {
-  if (patternString.charAt(patternIndex) != 
charSequenceWrapper.charAt(txtIndex)) {
-// Go back if there is no match
-txtIndex = txtIndex - patternIndex;
-patternIndex = 0;
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
   } else {
-patternIndex++;
+char ch2_1 = sequenceWrapper.charAt(idx+1);
+char ch2_2 = patternString.charAt(1);
+char ch3_1 = sequenceWrapper.charAt(idx+2);
+char ch3_2 = patternString.charAt(2);
+
+if (ch2_1 == ch2_2 && ch3_1 == ch3_2) {
+  return 1;
+}
   }
-  txtIndex++;
+}
+return 0;
+  }
+
+  private final int match_N() {
+
+if (patternLength == 0) {
--- End diff --

Can't this be optimized away in the pattern parser stage? We should not be 
calling this function if we had a zero-length pattern.


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread sachouche
GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1001

JIRA DRILL-5879: Like operator performance improvements

- Recently, custom code has been added to handle common search patterns 
(Like operator)
- Contains, Starts With, and Ends With
- Custom code is way faster than the generic RegEx based implementation
- This pull request is another attempt to improve the Contains pattern 
since it is more CPU intensive

Query: select  from table where colA like '%a%' or 
colA like '%xyz%';
Improvement Opportunities
Avoid isAscii computation (full access of the input string) since we're 
dealing with the same column twice
Optimize the "contains" for-loop
Implementation Details
1)
Added a new integer variable "asciiMode" to the VarCharHolder class
The default value is -1 which indicates this info is not known
Otherwise this value will be set to either 1 or 0 based on the string being 
in ASCII mode or Unicode
The execution plan already shares the same VarCharHolder instance for all 
evaluations of the same column value
The asciiMode will be correctly set during the first LIKE evaluation and 
will be reused across other LIKE evaluations
2)
The "Contains" LIKE operation is quite expensive as the code needs to 
access the input string to perform character based comparisons
Created 4 versions of the same for-loop to a) make the loop simpler to 
optimize (Vectorization) and b) minimize comparisons
Benchmarks
Lineitem table 100GB
Query: select l_returnflag, count from dfs.`` where l_comment not 
like '%a%' or l_comment like '%the%' group by l_returnflag
Before changes: 33sec
After changes : 27sec


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill yodlee-cherry-pick

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1001.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1001


commit c2b05b2e8665daf3f7b43d49c428539b3753595f
Author: Salim Achouche 
Date:   2017-10-18T18:40:18Z

JIRA 5879: Like operator performance improvements




---