[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche closed the pull request at: https://github.com/apache/drill/pull/1001 ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146952578 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,37 +17,166 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; + final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); + +// The idea is to write loops with simple condition checks to allow the Java Hotspot achieve +// better optimizations (especially vectorization) +if (patternLength == 1) { + matcherFcn = new Matcher1(); --- End diff -- I ran two types of tests to evaluate the generic vs custom method: Test1 - A match does exist o Custom method is 3x faster because the code will go to the "else" part o A nested loop is always slower than unrolled code (the loop is also correlated with the outer one); please refer to this article (https://en.wikipedia.org/wiki/Loop_unrolling) on the benefits of loop unrolling o Older match function performed in 59sec Test2- A match doesn't exist o Custom method and generic one perform in 15sec; this is because both perform a comparison and proceed to the next iteration o Older match function performed in 45sec ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146948741 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,37 +17,166 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; + final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); + +// The idea is to write loops with simple condition checks to allow the Java Hotspot achieve +// better optimizations (especially vectorization) +if (patternLength == 1) { + matcherFcn = new Matcher1(); --- End diff -- how does matcherN perform compared to matcher1, matcher2, matcher3 for pattern lengths 1, 2 and 3 ? If matcherN performs well for patternLengths 1, 2 and 3, we can just have one matcher instead of multiple for different pattern lengths. ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146708658 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,37 +17,166 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; + final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); + +// The idea is to write loops with simple condition checks to allow the Java Hotspot achieve +// better optimizations (especially vectorization) +if (patternLength == 1) { + matcherFcn = new Matcher1(); --- End diff -- Padma, I have two reasons to follow the added complexity 1) The new code is encapsulated within the Contains matching logic; doesn't increase code complexity 2) o I created a test with the original match logic, pattern and input were Strings though passed as CharSequence o Ran the test with the new and old method (1 billion iterations) on MacOS o pattern length o The old match method performed in 43sec where as the new one performed in 15sec o The reason for the speedup is the custom matcher functions have less instructions (load and comparison) ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146705325 --- Diff: exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java --- @@ -73,6 +73,9 @@ public void eval() { out.start = in.start; if (charCount <= length.value || length.value == 0 ) { out.end = in.end; + if (charCount == (out.end - out.start)) { +out.asciiMode = org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can conclude this string is ASCII --- End diff -- - As previously stated (when responding to Paul'd comment), the expression framework is able to use the same VarCharHolder input variable when it is shared amongst multiple expressions - If the original column was of type var-binary, then the expression framework will include a cast to var-char - The cast logic will also compute the string length - Using this information to deduce whether the string is pure ASCII or not - UTF-8 encoding uses 1 byte for ASCII and 2, 3, or 4 for other character sets - If the encoded length and character length are equal, then this means this is an ASCII string ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146390381 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/CharSequenceWrapper.java --- @@ -53,13 +55,13 @@ // The end offset into the drill buffer private int end; // Indicates that the current byte buffer contains only ascii chars - private boolean usAscii; + private boolean useAscii; public CharSequenceWrapper() { } public CharSequenceWrapper(int start, int end, DrillBuf buffer) { -setBuffer(start, end, buffer); +setBuffer(start, end, buffer, -1); --- End diff -- what does -1 mean ? Shouldn't it be one of CHAR_MODE_IS_ASCII, CHAR_MODE_UNKNOWN or CHAR_MODE_NOT_ASCII ? ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146392275 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,37 +17,166 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; + final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); + +// The idea is to write loops with simple condition checks to allow the Java Hotspot achieve +// better optimizations (especially vectorization) +if (patternLength == 1) { + matcherFcn = new Matcher1(); --- End diff -- I am not sure if it is a good idea to write special case code for each pattern length. It is not easy to maintain. Can you please give more details how this is improving performance ? Are we getting better performance for patternLength 1 or 2 or 3 or N or all ? If so, how much and why ? ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146388018 --- Diff: exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java --- @@ -73,6 +73,9 @@ public void eval() { out.start = in.start; if (charCount <= length.value || length.value == 0 ) { out.end = in.end; + if (charCount == (out.end - out.start)) { +out.asciiMode = org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can conclude this string is ASCII --- End diff -- can you please add comments here ? I am not able to understand this change. ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145746250 --- Diff: exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java --- @@ -73,6 +73,9 @@ public void eval() { out.start = in.start; if (charCount <= length.value || length.value == 0 ) { out.end = in.end; + if (charCount == (out.end-out.start)) { +out.asciiMode = 1; // we can conclude this string is ASCII --- End diff -- will do! ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145575103 --- Diff: exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java --- @@ -73,6 +73,9 @@ public void eval() { out.start = in.start; if (charCount <= length.value || length.value == 0 ) { out.end = in.end; + if (charCount == (out.end-out.start)) { +out.asciiMode = 1; // we can conclude this string is ASCII --- End diff -- Drill likes spaces around operators: `out.end - out.start` ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145577107 --- Diff: exec/vector/src/main/codegen/templates/ValueHolders.java --- @@ -33,34 +33,40 @@ * This class is generated using freemarker and the ${.template_name} template. */ public final class ${className} implements ValueHolder{ - + public static final MajorType TYPE = Types.${mode.name?lower_case}(MinorType.${minor.class?upper_case}); + public MajorType getType() {return TYPE;} <#if mode.name == "Repeated"> - + /** The first index (inclusive) into the Vector. **/ public int start; - + /** The last index (exclusive) into the Vector. **/ public int end; - + /** The Vector holding the actual values. **/ public ${minor.class}Vector vector; - + <#else> public static final int WIDTH = ${type.width}; - + <#if mode.name == "Optional">public int isSet; <#assign fields = minor.fields!type.fields /> <#list fields as field> public ${field.type} ${field.name}; - + +<#if minor.class == "VarChar"> +// -1: unknown, 0: not ascii, 1: is ascii +public int asciiMode = -1; --- End diff -- Drill is complex, nothing is as it seems. Drill has a very elaborate mechanism to rewrite byte codes to do scalar replacement. That is, holders are added to the Java source during code generation, but then are removed during byte code transforms. (Yes, that is silly given that Java 8 does a fine job on its own; but it is how Drill works today...) Will adding this variable change that behavior? Will the scalar replacement logic know to not replace this object? If so, will that hurt performance, or with the JVM go ahead and figure it out on its own? Is all the code that sets the variable inlined via code generation so that scalar replacement is possible? Yes, indeed, nothing in Drill is as simple as it seems... ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145578320 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,36 +17,133 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); } @Override - public int match() { -final int txtLength = charSequenceWrapper.length(); -int patternIndex = 0; -int txtIndex = 0; + public final int match() { +// The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize +// the generate code. +if (patternLength == 1) { --- End diff -- Pattern length is fixed per expression; need not be checked per row. See more below. ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145576718 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,36 +17,133 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); } @Override - public int match() { -final int txtLength = charSequenceWrapper.length(); -int patternIndex = 0; -int txtIndex = 0; + public final int match() { +// The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize +// the generate code. +if (patternLength == 1) { + return match_1(); +} else if (patternLength == 2) { + return match_2(); +} else if (patternLength == 3) { + return match_3(); +} else { + return match_N(); +} + } + + private final int match_1() { --- End diff -- See note about UTF-8. If we don't care about the match position (that is, we don't need `strpos()`, and all we care is whether it matches or not, then we can do the work on the undecoded UTF-8 bytes, saving a large amount of complexity. ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145577808 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,36 +17,133 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); } @Override - public int match() { -final int txtLength = charSequenceWrapper.length(); -int patternIndex = 0; -int txtIndex = 0; + public final int match() { +// The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize +// the generate code. +if (patternLength == 1) { + return match_1(); +} else if (patternLength == 2) { + return match_2(); +} else if (patternLength == 3) { + return match_3(); +} else { + return match_N(); +} + } + + private final int match_1() { +final CharSequence sequenceWrapper = charSequenceWrapper; +final int lengthToProcess = sequenceWrapper.length(); +final char first_patt_char = patternString.charAt(0); + +// simplePattern string has meta characters i.e % and _ and escape characters removed. +// so, we can just directly compare. +for (int idx = 0; idx < lengthToProcess; idx++) { + char input_char = sequenceWrapper.charAt(idx); + + if (first_patt_char != input_char) { +continue; + } + return 1; +} +return 0; + } + + private final int match_2() { +final CharSequence sequenceWrapper = charSequenceWrapper; +final int lengthToProcess = sequenceWrapper.length() - 1; +final char first_patt_char = patternString.charAt(0); + +// simplePattern string has meta characters i.e % and _ and escape characters removed. +// so, we can just directly compare. +for (int idx = 0; idx < lengthToProcess; idx++) { + char input_char = sequenceWrapper.charAt(idx); + + if (first_patt_char != input_char) { +continue; + } else { +char ch2_1 = sequenceWrapper.charAt(idx+1); +char ch2_2 = patternString.charAt(1); --- End diff -- We want speed. Instead of getting the second character multiple times, is it better to get it once up front? I suppose that depends on the hit rate. Average hit rate may be 2 % (1/ ~64). So if our input is smaller than 64 characters, we'll have, on average one hit so we pay the second character cost once. At 128 or above, we'll pay the cost two or more times. But, maybe the JVM can optimize away the second and subsequent accesses? Actually, let's take a step back. The pattern is fixed. We parsed the pattern to decide to use this particular class. Should we instead create a 1-char, 2-char and n-char matcher class so we get the second character (for the 2-char case) only once, and we eliminate the extra per-value if-check? ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145574785 --- Diff: exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java --- @@ -73,6 +73,9 @@ public void eval() { out.start = in.start; if (charCount <= length.value || length.value == 0 ) { out.end = in.end; + if (charCount == (out.end-out.start)) { +out.asciiMode = 1; // we can conclude this string is ASCII --- End diff -- The values -1, 0, and 1 are magic numbers here. Because we want speed, we don't want to use an enum. But, can we define symbols for our values: `IS_ASCII`, `NOT_ASCII`, `ASCII_UNKNOWN`? (You decide on the names...) ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145577894 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,36 +17,133 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); } @Override - public int match() { -final int txtLength = charSequenceWrapper.length(); -int patternIndex = 0; -int txtIndex = 0; + public final int match() { +// The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize +// the generate code. +if (patternLength == 1) { + return match_1(); +} else if (patternLength == 2) { + return match_2(); +} else if (patternLength == 3) { + return match_3(); +} else { + return match_N(); +} + } + + private final int match_1() { +final CharSequence sequenceWrapper = charSequenceWrapper; +final int lengthToProcess = sequenceWrapper.length(); +final char first_patt_char = patternString.charAt(0); + +// simplePattern string has meta characters i.e % and _ and escape characters removed. +// so, we can just directly compare. +for (int idx = 0; idx < lengthToProcess; idx++) { + char input_char = sequenceWrapper.charAt(idx); + + if (first_patt_char != input_char) { +continue; + } + return 1; +} +return 0; + } + + private final int match_2() { +final CharSequence sequenceWrapper = charSequenceWrapper; +final int lengthToProcess = sequenceWrapper.length() - 1; +final char first_patt_char = patternString.charAt(0); + +// simplePattern string has meta characters i.e % and _ and escape characters removed. +// so, we can just directly compare. +for (int idx = 0; idx < lengthToProcess; idx++) { + char input_char = sequenceWrapper.charAt(idx); + + if (first_patt_char != input_char) { +continue; + } else { +char ch2_1 = sequenceWrapper.charAt(idx+1); +char ch2_2 = patternString.charAt(1); + +if (ch2_1 == ch2_2) { + return 1; +} + } +} +return 0; + } + + private final int match_3() { +final CharSequence sequenceWrapper = charSequenceWrapper; +final int lengthToProcess = sequenceWrapper.length() -2; +final char first_patt_char = patternString.charAt(0); // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. -while (patternIndex < patternLength && txtIndex < txtLength) { - if (patternString.charAt(patternIndex) != charSequenceWrapper.charAt(txtIndex)) { -// Go back if there is no match -txtIndex = txtIndex - patternIndex; -patternIndex = 0; +for (int idx = 0; idx < lengthToProcess; idx++) { + char input_char = sequenceWrapper.charAt(idx); + + if (first_patt_char != input_char) { +continue; } else { -patternIndex++; +char ch2_1 = sequenceWrapper.charAt(idx+1); +char ch2_2 = patternString.charAt(1); +char ch3_1 = sequenceWrapper.charAt(idx+2); +char ch3_2 = patternString.charAt(2); + +if (ch2_1 == ch2_2 && ch3_1 == ch3_2) { + return 1; +} } - txtIndex++; +} +return 0; + } + + private final int match_N() { + +if (patternLength == 0) { --- End diff -- Can't this be optimized away in the pattern parser stage? We should not be calling this function if we had a zero-length pattern. ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1001 JIRA DRILL-5879: Like operator performance improvements - Recently, custom code has been added to handle common search patterns (Like operator) - Contains, Starts With, and Ends With - Custom code is way faster than the generic RegEx based implementation - This pull request is another attempt to improve the Contains pattern since it is more CPU intensive Query: select from table where colA like '%a%' or colA like '%xyz%'; Improvement Opportunities Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice Optimize the "contains" for-loop Implementation Details 1) Added a new integer variable "asciiMode" to the VarCharHolder class The default value is -1 which indicates this info is not known Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations 2) The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons Benchmarks Lineitem table 100GB Query: select l_returnflag, count from dfs.`` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag Before changes: 33sec After changes : 27sec You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill yodlee-cherry-pick Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1001.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1001 commit c2b05b2e8665daf3f7b43d49c428539b3753595f Author: Salim AchoucheDate: 2017-10-18T18:40:18Z JIRA 5879: Like operator performance improvements ---