[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249465#comment-16249465 ] ASF GitHub Bot commented on DRILL-5899: --- Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/1015 > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > Labels: ready-to-commit > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243315#comment-16243315 ] ASF GitHub Bot commented on DRILL-5899: --- Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1015 @paul-rogers updated with latest review comments taken care of. Please take a look. > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243238#comment-16243238 ] ASF GitHub Bot commented on DRILL-5899: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1015#discussion_r149554195 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,37 +17,48 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { - final String patternString; - CharSequence charSequenceWrapper; - final int patternLength; - - public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; -this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +import io.netty.buffer.DrillBuf; + +public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + + public SqlPatternContainsMatcher(String patternString) { +super(patternString); } @Override - public int match() { -final int txtLength = charSequenceWrapper.length(); -int patternIndex = 0; -int txtIndex = 0; - -// simplePattern string has meta characters i.e % and _ and escape characters removed. -// so, we can just directly compare. -while (patternIndex < patternLength && txtIndex < txtLength) { - if (patternString.charAt(patternIndex) != charSequenceWrapper.charAt(txtIndex)) { -// Go back if there is no match -txtIndex = txtIndex - patternIndex; -patternIndex = 0; - } else { -patternIndex++; + public int match(int start, int end, DrillBuf drillBuf) { + +if (patternLength == 0) { // Everything should match for null pattern string + return 1; +} + +final int txtLength = end - start; + +// no match if input string length is less than pattern length +if (txtLength < patternLength) { + return 0; +} + +outer: +for (int txtIndex = 0; txtIndex < txtLength; txtIndex++) { + + // boundary check + if (txtIndex + patternLength > txtLength) { --- End diff -- Better: ``` int end = txtLength - patternLength; for (int txtIndex = 0; txtIndex < end; txtIndex++) { ``` And omit the boundary check on every iteration. That is, no reason to iterate past the last possible match, then use an if-statement to shorten the loop. Just shorten the loop. > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243237#comment-16243237 ] ASF GitHub Bot commented on DRILL-5899: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1015#discussion_r149552453 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/AbstractSqlPatternMatcher.java --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.expr.fn.impl; + +import com.google.common.base.Charsets; +import org.apache.drill.common.exceptions.UserException; +import java.nio.ByteBuffer; +import java.nio.CharBuffer; +import java.nio.charset.CharacterCodingException; +import java.nio.charset.CharsetEncoder; +import static org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.logger; + +// To get good performance for most commonly used pattern matches --- End diff -- Javadoc? ``` /** * This is a Javadoc comment and appears in generated documentation. */ // This is a plain comment and does not appear in documentation. ``` > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243235#comment-16243235 ] ASF GitHub Bot commented on DRILL-5899: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1015#discussion_r149554356 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternEndsWithMatcher.java --- @@ -17,33 +17,30 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternEndsWithMatcher implements SqlPatternMatcher { - final String patternString; - CharSequence charSequenceWrapper; - final int patternLength; - - public SqlPatternEndsWithMatcher(String patternString, CharSequence charSequenceWrapper) { -this.charSequenceWrapper = charSequenceWrapper; -this.patternString = patternString; -this.patternLength = patternString.length(); +import io.netty.buffer.DrillBuf; + +public class SqlPatternEndsWithMatcher extends AbstractSqlPatternMatcher { + + public SqlPatternEndsWithMatcher(String patternString) { +super(patternString); } @Override - public int match() { -int txtIndex = charSequenceWrapper.length(); -int patternIndex = patternLength; -boolean matchFound = true; // if pattern is empty string, we always match. + public int match(int start, int end, DrillBuf drillBuf) { + +if ( (end - start) < patternLength) { // No match if input string length is less than pattern length. --- End diff -- `( (end - start)` --> `(end - start` > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243239#comment-16243239 ] ASF GitHub Bot commented on DRILL-5899: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1015#discussion_r149555002 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternEndsWithMatcher.java --- @@ -17,33 +17,30 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternEndsWithMatcher implements SqlPatternMatcher { - final String patternString; - CharSequence charSequenceWrapper; - final int patternLength; - - public SqlPatternEndsWithMatcher(String patternString, CharSequence charSequenceWrapper) { -this.charSequenceWrapper = charSequenceWrapper; -this.patternString = patternString; -this.patternLength = patternString.length(); +import io.netty.buffer.DrillBuf; + +public class SqlPatternEndsWithMatcher extends AbstractSqlPatternMatcher { + + public SqlPatternEndsWithMatcher(String patternString) { +super(patternString); } @Override - public int match() { -int txtIndex = charSequenceWrapper.length(); -int patternIndex = patternLength; -boolean matchFound = true; // if pattern is empty string, we always match. + public int match(int start, int end, DrillBuf drillBuf) { + +if ( (end - start) < patternLength) { // No match if input string length is less than pattern length. + return 0; +} // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. -while (patternIndex > 0 && txtIndex > 0) { - if (charSequenceWrapper.charAt(--txtIndex) != patternString.charAt(--patternIndex)) { -matchFound = false; -break; +for (int index = 1; index <= patternLength; index++) { --- End diff -- ``` int txtStart = end - patternLength; if (txtStart < start) { return 0; } for (int index = 0; index < patternLength; index++) { ... patternByteBuffer.get(index) ... drillBuf.getByte(txtStart + index) ... ``` > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243236#comment-16243236 ] ASF GitHub Bot commented on DRILL-5899: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1015#discussion_r149552506 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/AbstractSqlPatternMatcher.java --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.expr.fn.impl; + +import com.google.common.base.Charsets; +import org.apache.drill.common.exceptions.UserException; +import java.nio.ByteBuffer; +import java.nio.CharBuffer; +import java.nio.charset.CharacterCodingException; +import java.nio.charset.CharsetEncoder; +import static org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.logger; + +// To get good performance for most commonly used pattern matches +// i.e. CONSTANT('ABC'), STARTSWITH('%ABC'), ENDSWITH('ABC%') and CONTAINS('%ABC%'), +// we have simple pattern matchers. +// Idea is to have our own implementation for simple pattern matchers so we can +// avoid heavy weight regex processing, skip UTF-8 decoding and char conversion. +// Instead, we encode the pattern string and do byte comparison against native memory. +// Overall, this approach +// gives us orders of magnitude performance improvement for simple pattern matches. +// Anything that is not simple is considered +// complex pattern and we use Java regex for complex pattern matches. + +public abstract class AbstractSqlPatternMatcher implements SqlPatternMatcher { + final String patternString; --- End diff -- `protected final` > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243050#comment-16243050 ] ASF GitHub Bot commented on DRILL-5899: --- Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1015 @paul-rogers Thanks a lot for the review. Updated the PR with code review comments. Please take a look. Overall, good improvement with this change. Here are the numbers. select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a' 1.4 sec vs 7 sec select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%' 6.5 sec vs 13.5 sec select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%' 1.4 sec vs 5.8 sec select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a' 1.1.65 sec vs 5.8 sec I think for "contains", improvement is not as much as others, probably because of nested for loops. @sachouche changes on top of these changes can improve further. > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239919#comment-16239919 ] ASF GitHub Bot commented on DRILL-5899: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1015 Second long list of detailed comments sent privately to keep dev list traffic down. > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5899) Simple pattern matchers can work with DrillBuf directly
[ https://issues.apache.org/jira/browse/DRILL-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237125#comment-16237125 ] ASF GitHub Bot commented on DRILL-5899: --- Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1015 @sachouche @paul-rogers Thanks for the review. I updated the PR with review comments. I made one more change. Previously, I was copying native memory buffer into byte array and using it. Instead, if we go to native memory directly, performance is significantly better. In fact, it is 3 times faster :-) Please review updated changes. > Simple pattern matchers can work with DrillBuf directly > --- > > Key: DRILL-5899 > URL: https://issues.apache.org/jira/browse/DRILL-5899 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow >Reporter: Padma Penumarthy >Assignee: Padma Penumarthy >Priority: Critical > > For the 4 simple patterns we have i.e. startsWith, endsWith, contains and > constant,, we do not need the overhead of charSequenceWrapper. We can work > with DrillBuf directly. This will save us from doing isAscii check and UTF8 > decoding for each row. > UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid > character. So, instead of decoding varChar from each row we are processing, > encode the patternString once during setup and do raw byte comparison. > Instead of bounds checking and reading one byte at a time, we get the whole > buffer in one shot and use that for comparison. > This improved overall performance for filter operator by around 20%. -- This message was sent by Atlassian JIRA (v6.4.14#64029)