[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r140055857 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -1054,8 +1057,36 @@ public void setMax(Object max) { return nulls; } -@Override public boolean hasSingleValue() { - return (max != null && min != null && max.equals(min)); +/** + * Checks that the column chunk has a single value. + * Returns {@code true} if {@code min} and {@code max} are the same but not null + * and nulls count is 0 or equal to the rows count. + * + * Returns {@code true} if {@code min} and {@code max} are null and the number of null values + * in the column chunk is equal to the rows count. + * + * Comparison of nulls and rows count is needed for the cases: + * + * column with primitive type has single value and null values + * + * column with primitive type has only null values, min/max couldn't be null, + * but column has single value + * + * + * @param rowCount rows count in column chunk + * @return true if column has single value + */ +@Override +public boolean hasSingleValue(long rowCount) { + if (nulls != null) { +if (min != null) { + // Objects.deepEquals() is used here, since min and max may be byte arrays + return Objects.deepEquals(min, max) && (nulls == 0 || nulls == rowCount); --- End diff -- - if (min != null), then nulls cannot be equal to rowCount - In this case, only nulls == 0 should be checked ---
[GitHub] drill issue #805: Drill-4139: Exception while trying to prune partition. jav...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/805 +1 ---
[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/976 Sure! Regards, Salim From: dprofeta <notificati...@github.com> Sent: Friday, October 6, 2017 8:52:51 AM To: apache/drill Cc: Salim Achouche; Mention Subject: Re: [apache/drill] DRILL-5797: Choose parquet reader from read columns (#976) @sachouche<https://github.com/sachouche> Can you review it? â You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<https://github.com/apache/drill/pull/976#issuecomment-334795292>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AckoK82kUi_xZmBVaiKuujb-rxzaL2sIks5spkzTgaJpZM4PwLDB>. ---
[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/976#discussion_r143837353 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java --- @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns); } - private static boolean isComplex(ParquetMetadata footer) { -MessageType schema = footer.getFileMetaData().getSchema(); + private static boolean isComplex(ParquetMetadata footer, List columns) { +if (Utilities.isStarQuery(columns)) { + MessageType schema = footer.getFileMetaData().getSchema(); -for (Type type : schema.getFields()) { - if (!type.isPrimitive()) { -return true; + for (Type type : schema.getFields()) { +if (!type.isPrimitive()) { + return true; +} } -} -for (ColumnDescriptor col : schema.getColumns()) { - if (col.getMaxRepetitionLevel() > 0) { -return true; + for (ColumnDescriptor col : schema.getColumns()) { +if (col.getMaxRepetitionLevel() > 0) { + return true; +} + } + return false; +} else { + for (SchemaPath column : columns) { +if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) { --- End diff -- Can you please use the already defined schema variable instead of invoking "footer.getFileMetaData().getSchema()" multiple times. ---
[GitHub] drill pull request #970: DRILL-5832: Migrate OperatorFixture to use SystemOp...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/970#discussion_r143017778 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java --- @@ -84,14 +82,7 @@ private RecordReader recordReader; private DrillParquetRecordMaterializer recordMaterializer; private int recordCount; - private List primitiveVectors; private OperatorContext operatorContext; --- End diff -- Can you explain the reason you are removing the existing logic? ---
[GitHub] drill pull request #970: DRILL-5832: Migrate OperatorFixture to use SystemOp...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/970#discussion_r143020963 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java --- @@ -184,25 +183,26 @@ public void testAllScalarTypes() throws Exception { try { // read all of the types with the complex reader - test(String.format("alter session set %s = true", ExecConstants.PARQUET_NEW_RECORD_READER)); + alterSession(ExecConstants.PARQUET_NEW_RECORD_READER, true); --- End diff -- Paul, - I noticed the key PARQUET_NEW_RECORD_READER is erroneous - There are currently two readers o The old one is used when nested data is used as it can handle all parquet use-cases o The new reader only deals with Flat parquet data types - We might want to rename the keys as the new reader cannot always be substituted with the old one ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001 Paul, - I think you misunderstood the proposal - Let me use an example - select .. c1 like '%pattern1%' OR c1 like '%pattern2'.. - Assume c1 has 3 values [v1, v2, v3] - The generated code will create a new VarcharHolder instance for each value iteration - For v1: VarCharHolder vc1 is created, ascii-mode computed for pattern1, ascii-mode computation reused for pattern2, pattern3, etc since we're evaluating the same value - For v2: VarCharHolder vc1 is created, ascii-mode computed for pattern1, ascii-mode computation reused for pattern2, pattern3, etc since we're evaluating the same value - DITO for v3 Note that the test-suite has similar test cases that you are proposing; they all passed. ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001 - It was 50% for 1) and 50% for 2) - Notice this breakdown depends on o The number of Contains pattern for the same value (impacts 1)) o The pattern length (impacts both 1) and 2)) ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1001 JIRA DRILL-5879: Like operator performance improvements - Recently, custom code has been added to handle common search patterns (Like operator) - Contains, Starts With, and Ends With - Custom code is way faster than the generic RegEx based implementation - This pull request is another attempt to improve the Contains pattern since it is more CPU intensive Query: select from table where colA like '%a%' or colA like '%xyz%'; Improvement Opportunities Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice Optimize the "contains" for-loop Implementation Details 1) Added a new integer variable "asciiMode" to the VarCharHolder class The default value is -1 which indicates this info is not known Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations 2) The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons Benchmarks Lineitem table 100GB Query: select l_returnflag, count from dfs.`` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag Before changes: 33sec After changes : 27sec You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill yodlee-cherry-pick Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1001.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1001 commit c2b05b2e8665daf3f7b43d49c428539b3753595f Author: Salim Achouche <sachouc...@gmail.com> Date: 2017-10-18T18:40:18Z JIRA 5879: Like operator performance improvements ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001 Paul, again thanks for the detailed review: - I was able to address most of the feedback except for one - I agree that expressions that can operate directly on the encoded UTF-8 string should ideally perform checks on bytes and not characters - Having said that, such a change is more involved and should be done properly o The SqlPatternContainsMatcher currently gets a CharSequence as input o We should enhance the expression framework so that matchers can a) express their capabilities and b) receive the expected data type (Character or Byte sequences) o Note also there is an impact on the test-suite since StringBuffer are being used to directly test the matcher functionality ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1072 DRILL-5879: Improved SQL Pattern Contains Performance **BACKGROUND** - JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) goal is to improve the "Contains" pattern performance - [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task) was created subsequently to avoid the ASCII / Unicode pre-processing overhead. - This pull-request addresses the algorithmic part of this functionality **ANALYSIS** - Contains has O(n*m) complexity - There are two ways to optimize the associated runtime 1) Minimize the number of instructions, pipelining, and CPU stalls 2) Use smarter algorithms (compared to the naive one); for example, the Boyer-Moore algorithm (which is implemented by several popular Open Source tools such as grep) **IMPLEMENTATION** Our approach contains both suggestions (listed in the analysis) - Created five matchers that are chosen based on the pattern length - The first three, are based on pattern 1) and target patterns with length [1..4[ - The forth one, has a similar runtime then the current implementation and targets patterns with length [4..10[ - We use the the Boyer-Moore algorithm for patterns with a length larger than 9 bytes NOTE - the JDK doesn't use this algorithm because of two main reasons - Two extra arrays are necessary with a size relative to the supported character-set and the pattern length (this would be particularly costly for unicode as this would require 64k entries) - Each Contains (or indexOf) invocation would require memory and initialization overhead - Drill doesn't have this issue as a) initialization overhead is amortized since the pattern will be matched against many input values and b) our Contains logic is centered around bytes so the memory overhead is around 256 integers per fragment **PERFORMANCE IMPROVEMENTS** - We observe at least 25% improvement per Contains operation for matchers with pattern length lower than 4; 100% for negative cases (as the code never accesses a secondary for-loop) - It was noticed the Boyer-Moor algorithm performs poorly for small patterns as lookup accesses erase the associated optimizations; this algorithm performs extremely well when the pattern length increases as unlike the naive implementation it is able to use the lookup table to jump beyond the next character (since the matching phase has already gained so insight). - One popular introduction to this algorithm is the following use-case - Input: - Pattern: aax You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-5879 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1072.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1072 commit 50ac5140420057692cac8a33f181eb16580a28e6 Author: Salim Achouche <sachouc...@gmail.com> Date: 2017-12-13T22:24:45Z DRILL-5879: Improved SQL Pattern Contains Performance ---
[GitHub] drill issue #1015: DRILL-5889: Simple pattern matchers can work with DrillBu...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1015 Padma, DRILL-5889 is the wrong JIRA (sqlline loses RPC..); I think your JIRA is 5899 Regards, Salim From: Padma Penumarthy <notificati...@github.com> Sent: Wednesday, November 1, 2017 11:15:20 AM To: apache/drill Cc: Salim Achouche; Mention Subject: Re: [apache/drill] DRILL-5889: Simple pattern matchers can work with DrillBuf directly (#1015) @sachouche<https://github.com/sachouche> @paul-rogers<https://github.com/paul-rogers> can you please review ? â You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<https://github.com/apache/drill/pull/1015#issuecomment-341192974>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AckoK8UN4uDlGPJGg6fpypIbPxKEOqGKks5syLU4gaJpZM4QJzty>. ---
[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/976 Looking at the stack trace: - The code definitely is initializing a column of type REPEATABLE - The Fast Reader didn't expect this scenario so it used a default container (NullableVarBinary) for VL binary DT Why this is happening? - The code in ReadState::buildReader() is processing all selected columns - This information is obtained from the ParquetSchema - Looking at the code, this seems a case-sensitivity issue - The ParquetSchema is case-insensitive whereas the Parquet GroupType is not - Damien added a catch handler (column not found) to handle use-cases where we are projecting non-existing columns - This basically is leading to an unforeseen use-case - Assume column XYZ is complex - User uses an alias (xyz) - The new code will allow this column to pass and treat is as simple - The ParquetSchema is being case insensitive will process this column - and thus the exception in the test suite Suggested Fix - Create a map (key to-lower-case) and register all current row-group columns - Use this map to locate a selected column type ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1060 DRILL-5846: Improve parquet performance for Flat Data Types Performance improvements for the Parquet Scanner (Flat Data Types). The are two flags to control this performance enhancement (disabled by default): Option I - Config Name: store.parquet.flat.reader.bulk Config Type : boolean Description : Enables bulk processing to minimize memory checks and improve JVM HotSpot optimizations Option II - Config Name: scan.optimized.implicit.columns Config Type : boolean Description : Memory optimization when storing duplicate value (implicit columns have duplicate values within a batch). Code profiling indicated this step represented one third of Parquet processing (when implicit columns are processed). You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill drill-parquet-improv Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1060.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1060 commit 0a3a8b053be85c570ff24237d4737f37668383bd Author: Salim Achouche <sachouc...@gmail.com> Date: 2017-12-04T01:53:08Z DRILL-5846: Improve parquet performance for Flat Data Types ---
[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/976 +1 looks good! ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146952578 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,37 +17,166 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; + final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); + +// The idea is to write loops with simple condition checks to allow the Java Hotspot achieve +// better optimizations (especially vectorization) +if (patternLength == 1) { + matcherFcn = new Matcher1(); --- End diff -- I ran two types of tests to evaluate the generic vs custom method: Test1 - A match does exist o Custom method is 3x faster because the code will go to the "else" part o A nested loop is always slower than unrolled code (the loop is also correlated with the outer one); please refer to this article (https://en.wikipedia.org/wiki/Loop_unrolling) on the benefits of loop unrolling o Older match function performed in 59sec Test2- A match doesn't exist o Custom method and generic one perform in 15sec; this is because both perform a comparison and proceed to the next iteration o Older match function performed in 45sec ---
[GitHub] drill issue #1008: drill-5890: Fixed a file descriptor leak in Drill's test-...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1008 @parthchandra can you please review this change? thanks! ---
[GitHub] drill pull request #1008: drill-5890: Fixed a file descriptor leak in Drill'...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1008 drill-5890: Fixed a file descriptor leak in Drill's test-suite Problem Description - The Drill test-suite uses two surefire processes to run tests - This has the advantage of avoiding class reloading if the JVM exited after running a test class - The side effect of this approach, is that resource leaks could be problematic - When running the Drill's test-suite on MacOS (Sierra) my tests failed with a max FD descriptors reached - Had to increase the maximum number of open FDs for the whole machine and per process - The process is described in the following [link](https://superuser.com/questions/302754/increase-the-maximum-number-of-open-file-descriptors-in-snow-leopard) - Two limit files "limit.maxfiles.plist" and "limit.maxproc.plist" have to be created under "/Library/LaunchDaemons" - Originally, I had to set the maximum number of FDs per process to a large value (100,000 and the system to 200,000) for the tests to succeed FD Leak Cause Debugging the Drill test suite, it was noticed - A base class BaseTestQuery has a @BeforeClass and @AfterClass TestNG tags - This means that each Drill test class extending from BaseTestQuery will have a setup method called before any tests are executed and a cleanup method invoked when all the tests are done (or a fatal error in between) - The OpenClient() method was starting a DrillBit and creating a client connection to it - DrillBit's BootStrapContext class was initializing two Netty EventLoopGroup objects which internally opened 20 FDs each - It was noticed that one of them was not getting de-initialized Fix - Added logic within the BootStrapContext object to shutdown the EventLoopGroup objects if they have been already shutdown (and are not in the process of being shutdown) - The fix tries to shut both objects because the container class should ideally manage the lifecycle of its objects; at least, the code should clearly articulate lifecycle management responsibilities to avoid leaks - Used the "shutdownGracefully" method since it was a) already used by our code and b) is advertised to have sensible timeout values - The added shutdown calls are being invoked only when consumer objects have been also shutdown - Running the tests show that the number of FDs per surefire process doesn't extend beyond few hundreds (majority created for JAR files loading) You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill drill-5890 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1008.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1008 commit 5f8bad865a78ab265015ca21184d6c59e22d1c95 Author: Salim Achouche <sachouc...@gmail.com> Date: 2017-10-24T17:12:19Z drill-5890: Fixed a file descriptor leak in Drill's test-suite ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146708658 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -17,37 +17,166 @@ */ package org.apache.drill.exec.expr.fn.impl; -public class SqlPatternContainsMatcher implements SqlPatternMatcher { +public final class SqlPatternContainsMatcher implements SqlPatternMatcher { final String patternString; CharSequence charSequenceWrapper; final int patternLength; + final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) { -this.patternString = patternString; +this.patternString = patternString; this.charSequenceWrapper = charSequenceWrapper; -patternLength = patternString.length(); +patternLength= patternString.length(); + +// The idea is to write loops with simple condition checks to allow the Java Hotspot achieve +// better optimizations (especially vectorization) +if (patternLength == 1) { + matcherFcn = new Matcher1(); --- End diff -- Padma, I have two reasons to follow the added complexity 1) The new code is encapsulated within the Contains matching logic; doesn't increase code complexity 2) o I created a test with the original match logic, pattern and input were Strings though passed as CharSequence o Ran the test with the new and old method (1 billion iterations) on MacOS o pattern length o The old match method performed in 43sec where as the new one performed in 15sec o The reason for the speedup is the custom matcher functions have less instructions (load and comparison) ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r146705325 --- Diff: exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java --- @@ -73,6 +73,9 @@ public void eval() { out.start = in.start; if (charCount <= length.value || length.value == 0 ) { out.end = in.end; + if (charCount == (out.end - out.start)) { +out.asciiMode = org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can conclude this string is ASCII --- End diff -- - As previously stated (when responding to Paul'd comment), the expression framework is able to use the same VarCharHolder input variable when it is shared amongst multiple expressions - If the original column was of type var-binary, then the expression framework will include a cast to var-char - The cast logic will also compute the string length - Using this information to deduce whether the string is pure ASCII or not - UTF-8 encoding uses 1 byte for ASCII and 2, 3, or 4 for other character sets - If the encoded length and character length are equal, then this means this is an ASCII string ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158080071 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; --- End diff -- This is not a problem as the for-loop condition (index = 0; index < lengthToProcess; ..) will prevent any processing to take place unless the correct condition is met. The key point here is that the input length is always provided by the "end - start" formula. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158072762 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { --- End diff -- will do. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158074965 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { --- End diff -- These optimizations are useless with modern compilers. This would have been useful if the inputByte value was the same.. then we would save few instructions. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158080247 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); --- End diff -- same answer. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158074289 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); --- End diff -- HeapByteBuffer will always create an internal buffer. Empty string will have a byte array with zero bytes. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158085723 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158086368 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1237 @vrozov, **What are we trying to solve / improve** - Drill is currently not properly reporting memory held in Fragment's receive queues - This makes it hard to analyze OOM conditions This is what I want to see - Every operator reporting on the resources it is currently using (needed) - Fragment held resources (other than the ones already reported by the child operators) - Drilbit level (metadata caches, web-server, ..) - I am ok to incrementally reach this goal **Data Exchange Logistic** - Ideally, the data exchange fabric should be decoupled from the Drill Receive / Send operators - The fabric should be handling all the aspects of pre-fetch / pressuring and so forth - It will tune to the speed of producers / consumers when writing / reading data from it - This infrastructure should have its own resource management and reporting capabilities **Operator based Reporting** - Receive and Send operators shall not worry about batches they didn't consume yet - Doing so is counter productive as the Data Exchange fabric will interpret a "drain" operation as the operator "needing" more data. - For example, the merge-receiver should not be managing the receive queues; it should only advertise the pattern of data consumption and let the exchange fabric figure out the rest. The main difference in the two approaches, is that essentially, you are preaching for Receive and Send operators to control all aspects of communication whereas I am preaching for decoupling such aspects. ---
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1237 That was not my intention as my current change aimed at describing the system the way it is. @parthchandra, any feedback? ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184728292 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); + batch = getNextBatch(); + + // skip over empty batches. we do this since these are basically control messages. + while (batch != null && batch.getHeader().getDef().getRecordCount() == 0 --- End diff -- Ignore this comment as I thought you were releasing the returned batch. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184747702 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/memory/AllocationManager.java --- @@ -253,10 +261,12 @@ public boolean transferBalance(final BufferLedger target) { target.historicalLog.recordEvent("incoming(from %s)", owningLedger.allocator.name); } -boolean overlimit = target.allocator.forceAllocate(size); +// Release first to handle the case where the current and target allocators were part of the same +// parent / child tree. allocator.releaseBytes(size); +boolean allocationFit = target.allocator.forceAllocate(size); --- End diff -- - The change of order is an optimization for a parent / child relationship as if we don't release first, then we could unnecessarily go over the memory budget (double counting). - The force-alloc() / free() failures should never happen on normal conditions; when they do, the best thing to do is to exit. I still prefer not to promote the target allocator till it is 100% successful. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184727914 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); --- End diff -- Ok good point, as I have seen both practices being done within the Drill code. Though, I don't think this is a big deal as I don't see startWait() failing as it merely invokes nano time. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184730050 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RawFragmentBatch.java --- @@ -77,4 +83,46 @@ public long getByteCount() { public boolean isAckSent() { return ackSent.get(); } + + /** + * Transfer ownership of this DrillBuf to the target allocator. This is done for better memory + * accounting (that is, the operator should be charged with the body's Drillbuf memory). + * + * NOTES - + * + * This operation is a NOOP when a) the current allocator (associated with the DrillBuf) is not the + * owning allocator or b) the target allocator is already the owner + * When transfer happens, a new RawFragmentBatch instance is allocated; this is done for proper + * DrillBuf reference count accounting + * The RPC handling code caches a reference to this RawFragmentBatch object instance; release() + * calls should be routed to the previous DrillBuf + * + * + * @param targetAllocator target allocator + * @return a new {@link RawFragmentBatch} object instance on success (where the buffer ownership has + * been switched to the target allocator); otherwise this operation is a NOOP (current instance + * returned) + */ + public RawFragmentBatch transferBodyOwnership(BufferAllocator targetAllocator) { +if (body == null) { + return this; // NOOP +} + +if (!body.getLedger().isOwningLedger() + || body.getLedger().isOwner(targetAllocator)) { + + return this; +} + +int writerIndex = body.writerIndex(); +TransferResult transferResult = body.transferOwnership(targetAllocator); + +// Set the index and increment reference count +transferResult.buffer.writerIndex(writerIndex); + +// Clear the current Drillbuffer since caller will perform release() on the new one +body.release(); + +return new RawFragmentBatch(getHeader(), transferResult.buffer, getSender(), false); --- End diff -- We can take up such an enhancement as as part of another JIRA as any changes within the RPC layer have to be thoroughly tested. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184726839 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -201,6 +208,11 @@ public IterOutcome next() { context.getExecutorState().fail(ex); return IterOutcome.STOP; } finally { + + if (batch != null) { +batch.release(); +batch = null; --- End diff -- The point of this pattern is that if you would like to continue using this object then be prepared to know what can and what cannot be used. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184807153 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); --- End diff -- I see; I will then fix any such occurrences when opportunity presents itself as I have seen both patterns in the Drill code base. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184572864 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { --- End diff -- why is that? this method is used once and doing so will make us duplicate exception handling code. I would appreciate if you provide reason(s) when you make suggestions to avoid the back and forth. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184574405 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); + batch = getNextBatch(); + + // skip over empty batches. we do this since these are basically control messages. + while (batch != null && batch.getHeader().getDef().getRecordCount() == 0 --- End diff -- Sure. FYI - We cannot release the batch within this method. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184573646 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); --- End diff -- Sure, but does it really matter or is it your personal preference? ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184575686 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); + batch = getNextBatch(); + + // skip over empty batches. we do this since these are basically control messages. + while (batch != null && batch.getHeader().getDef().getRecordCount() == 0 + && (!first || batch.getHeader().getDef().getFieldCount() == 0)) { +batch = getNextBatch(); + } +} finally { + stats.stopWait(); +} +return batch; + } + @Override public IterOutcome next() { batchLoader.resetRecordCount(); stats.startProcessing(); -try{ - RawFragmentBatch batch; - try { -stats.startWait(); -batch = getNextBatch(); -// skip over empty batches. we do this since these are basically control messages. -while (batch != null && batch.getHeader().getDef().getRecordCount() == 0 -&& (!first || batch.getHeader().getDef().getFieldCount() == 0)) { - batch = getNextBatch(); -} - } finally { -stats.stopWait(); - } +RawFragmentBatch batch = null; +try { + batch = getNextNotEmptyBatch(); first = false; if (batch == null) { --- End diff -- Sure. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184574622 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -201,6 +208,11 @@ public IterOutcome next() { context.getExecutorState().fail(ex); return IterOutcome.STOP; } finally { + + if (batch != null) { +batch.release(); +batch = null; --- End diff -- This pattern ensures that after release the batch cannot be accessed anymore (e.g., additions to the current code). ---
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1237 @vrozov, your observation is valid, we need more JIRAs to fix the reporting problem **Current Fix** - At this time, the UnorderedReceiver didn't account for any consumed memory - This fix, taxes the operator only when it consumes buffers **Potential Enhancements** Solution I - Create a new fragment child Allocator which will own the received (not yet consumed) batches - Improve the UI to report this allocator size - This solution is simple and complementary to the current work Solution II - Have the receiver operator drain the batch queue - Essentially, the receiver will have a private queue from where to consume batches - I personally don't like this solution as it prevents us from improving the Drill network protocol - Draining the batches for the sake of reporting will make it harder for the network layer to prefetch batches in an optimal manner; the queue size should be an indicator on how many pending batches there are. ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1001#discussion_r145746250 --- Diff: exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java --- @@ -73,6 +73,9 @@ public void eval() { out.start = in.start; if (charCount <= length.value || length.value == 0 ) { out.end = in.end; + if (charCount == (out.end-out.start)) { +out.asciiMode = 1; // we can conclude this string is ASCII --- End diff -- will do! ---
[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1060 Before I reply to the provided comments I want first to thank both Parth and Paul for taking time to review this Pull Request. @parthchandra Regarding the High Level Comments - FS Access Manager The attached document listed several possible improvements in terms of CPU, Partitioning (changes to the query plan), and IO layer. I didn't want to club all the changes in a single pull request; instead I started with the Pull Request with the highest priority (in terms of performance improvement) which is CPU related enhancements. - Drill Memory Package I understand where you're coming from; I have analyzed this task and came up with a solution which I felt was the cleanest. Let me take you through my thinking process and the alternative solutions that I have considered: a) Why did I use this low level APIs? The Bulk Parquet reader uses the Drill Buffer APIs to copy data chunks to the CPU cache; this means that memory sanity checks are always performed. During code profiling I have noticed the Hotspot was using too many instructions (byte based assembly was used); I have researched ways to improve this and found some Hadoop and Compression java code which utilized 64bit access instructions when manipulating byte arrays (e.g., [Snappy](https://github.com/airlift/aircompressor/blob/master/src/main/java/io/airlift/compress/snappy/SnappyRawCompressor.java)). This makes a huge difference b) Safety As indicated previously, native data is always accessed through the Drill Buffer APIs (that is, accessed data is sanitized). During bulk processing, the new utility has additional checks which are triggered when assertions are enabled. This means all test-suite and QA tests (which run with assertions On) will be able to catch any faulty memory access; note also the checks are simpler since they only pertain to byte arrays boundary checks. c) Abstraction I agree that the Drill code base should follow a common code path when accessing native memory (minimize chances of introducing bugs), though I felt the Drill Memory package was the main abstraction since it was centralized and thus easier to test. d) Alternatives I have considered using the Netty io.netty.util.internal.PlatformDependent0 to access the Java intrinsics APIs though this class has a package scope and the wrapper public class io.netty.util.internal.PlatformDependent doesn't expose this low level functionality. Thus the work around would have been to create MemoryUtils class under the io.netty.util.internal package. I didn't think this was a cleaner approach than just using Java's intrinsic APIs directly (note also that Java9 will expose such intrinsics under a public JDK package instead of the sun.misc. Please let me know if you rather prefer the io.netty.util.internal route instead. - Changes to the Vector Code I had few discussions with Paul about the need to reconcile DRILL-5657; the agreement was to proceed with the current implementation and then apply refactoring (future releases) as my changes are backward compatible (new public APIs added). - Arrow Integration I will create a task to analyze the impact of the Arrow Integration on the Parquet Reader. - UserBitShared.proto This class was updated to deal with the implicit columns optimization. I have added two fields to the NullableVarChar class to capture the logical number of rows and the fact this feature was active. This allowed me to use different Accessor and Mutator implementation. You and I had a discussion about this and the agreement was to: a) The new optimization is configurable (current default is off); though we might change the default at some point b) Detect C++ based clients and turn of this optimization (that is, the implicit columns enhancements); I didn't implement this task yet as I forgot the name of the property that you have indicated (this should be quite fast to implement though) - Testing Yes, QA helped me to run the test-suite with the two new options enabled. The new changes didn't modify any of the encoding/compression logic but I'll be happy to add any test(s) you deem necessary.. - Code Style Thank you for the pointers; I'll stick to your feedback in future changes to ease the review process. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r160514755 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VLAbstractEntryReader.java --- @@ -0,0 +1,215 @@ +/*** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + **/ +package org.apache.drill.exec.store.parquet.columnreaders; + +import java.nio.ByteBuffer; + +import org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.ColumnPrecisionInfo; +import org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.PageDataInfo; +import org.apache.drill.exec.util.MemoryUtils; + +/** Abstract class for sub-classes implementing several algorithms for loading a Bulk Entry */ +abstract class VLAbstractEntryReader { + + /** byte buffer used for buffering page data */ + protected final ByteBuffer buffer; + /** Page Data Information */ + protected final PageDataInfo pageInfo; + /** expected precision type: fixed or variable length */ + protected final ColumnPrecisionInfo columnPrecInfo; + /** Bulk entry */ + protected final VLColumnBulkEntry entry; + + /** + * CTOR. + * @param _buffer byte buffer for data buffering (within CPU cache) + * @param _pageInfo page being processed information + * @param _columnPrecInfo column precision information + * @param _entry reusable bulk entry object + */ + VLAbstractEntryReader(ByteBuffer _buffer, +PageDataInfo _pageInfo, +ColumnPrecisionInfo _columnPrecInfo, +VLColumnBulkEntry _entry) { + +this.buffer = _buffer; +this.pageInfo = _pageInfo; +this.columnPrecInfo = _columnPrecInfo; +this.entry = _entry; + } + + /** + * @param valuesToRead maximum values to read within the current page + * @return a bulk entry object + */ + abstract VLColumnBulkEntry getEntry(int valsToReadWithinPage); + + /** + * Indicates whether to use bulk processing + */ + protected final boolean bulkProcess() { +return columnPrecInfo.bulkProcess; + } + + /** + * Loads new data into the buffer if empty or the force flag is set. + * + * @param force flag to force loading new data into the buffer + */ + protected final boolean load(boolean force) { + +if (!force && buffer.hasRemaining()) { + return true; // NOOP +} + +// We're here either because the buffer is empty or we want to force a new load operation. +// In the case of force, there might be unprocessed data (still in the buffer) which is fine +// since the caller updates the page data buffer's offset only for the data it has consumed; this +// means unread data will be loaded again but this time will be positioned in the beginning of the +// buffer. This can happen only for the last entry in the buffer when either of its length or value +// is incomplete. +buffer.clear(); + +int remaining = remainingPageData(); +int toCopy= remaining > buffer.capacity() ? buffer.capacity() : remaining; + +if (toCopy == 0) { + return false; +} + +pageInfo.pageData.getBytes(pageInfo.pageDataOff, buffer.array(), buffer.position(), toCopy); --- End diff -- @parthchandra I know this is counter intuitive but this path makes it faster and I had VTune to prove it; let me explain it: - The logic is to fetch small chunks of data from main memory to the L1 cache (using 4k buffers) for processing - This logic doesn't introduce any extra memory bandwidth penalty since in all cases data has to move from main memory to the CPU and back to main memory (if modified) - Note that the CPU
[GitHub] drill pull request #1087: Attempt to fix memory leak in Parquet
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1087 Attempt to fix memory leak in Parquet ** Problem Description ** This is an extremely rare leak which I was able to emulate by putting a sleep in the AsyncPageReader right after reading the page and before enqueue in the result queue. This is how this issue could manifest itself in real life scenario: - AsyncPageReader reads a page into a buffer but didn't enqueue yet the result (thread got preempted) - Parquet Scan thread blocked waiting on the task (Future object dequeued) - Cancel received and Scan thread interrupted - Future.get() returns (Future object is lost) - Scan thread executes release logic - Scan thread is not able to interrupt the AsyncPageReader thread since the future object is lost - AsyncPageReader thread resumes and enqueues the DrillBuf in the result queue - This results in a leak since this buffer is not properly released ** Fix Description ** - The fix is straightforward as we peek the Future object during the blocking get() method - This way, an exception (such as an interrupt) will leave the Future object in the task queue - The cleanup logic will be able to guarantee the DrillBuf object is either GCed by the AsyncPageReader or ParquetScan thread You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6079 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1087.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1087 commit 52030d1d9cc3b8992a10ade8c7126d66e785043a Author: Salim Achouche <sachouche2@...> Date: 2017-12-22T19:50:56Z Attempt to fix memory leak in Parquet ---
[GitHub] drill issue #1087: Attempt to fix memory leak in Parquet
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1087 @parthchandra can you please review this fix? ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161028252 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java --- @@ -471,8 +519,8 @@ public void close() throws Exception { container.clear(); mutator.clear(); if (currentReader != null) { - currentReader.close(); -} +currentReader.close(); + } --- End diff -- done. ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001 Created another pull request #1072to merge my changes with the one done with Padma's. ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche closed the pull request at: https://github.com/apache/drill/pull/1001 ---
[GitHub] drill issue #1087: Attempt to fix memory leak in Parquet
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1087 Thank you Arina for catching this; I created the commit for QA before vacation so that they could verify the fix. At that time, I didn't have an Apache JIRA. I have now updated the comment to reflect the JIRA id DRILL-6079 which is also the name of this remote branch. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160763490 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160753323 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160765882 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestSqlPatterns.java --- @@ -446,6 +446,61 @@ public void testSqlPatternComplex() { assertEquals(1, sqlPatternComplex.match(0, byteBuffer.limit(), drillBuf)); // should match } + @Test + public void testSqlPatternContainsMUltipleMatchers() { --- End diff -- Paul, the test-suite already has many other tests for SQL matching. But I agree, they now might hit only few of these matcher. I will add more tests. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160755097 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160758370 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160764681 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160759375 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160754694 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { --- End diff -- Good point Paul! ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162121115 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; --- End diff -- renamed all such variables to XXXPatternBytes ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162119757 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ --- End diff -- cut-and-paste issue :-) ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162120115 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); --- End diff -- removed. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162123690 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } + return 0; +} + } + + /** Handles patterns with length two */ + private final class MatcherTwo extends MatcherFcn { + +private MatcherTwo() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; --- End diff -- Good idea; done this for all the matchers. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162124388 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } + return 0; +} + } + + /** Handles patterns with length two */ + private final class MatcherTwo extends MatcherFcn { + +private MatcherTwo() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); -final int txtLength = end - start; +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + if (secondInByte == secondPattByte) { +return 1; + } +} + } return 0; } + } + /** Handles patterns with length three */ + private final class MatcherT
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162124359 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } + return 0; +} + } + + /** Handles patterns with length two */ + private final class MatcherTwo extends MatcherFcn { + +private MatcherTwo() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); -final int txtLength = end - start; +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + if (secondInByte == secondPattByte) { +return 1; + } +} + } return 0; } + } + /** Handles patterns with length three */ + private final class MatcherT
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r162828788 --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java --- @@ -68,96 +85,441 @@ private final UInt1Vector bits = new UInt1Vector(bitsField, allocator); private final ${valuesName} values = new ${minor.class}Vector(field, allocator); + private final Mutator mutator = new MutatorImpl(); + private final Accessor accessor= new AccessorImpl(); + + <#if type.major == "VarLen" && minor.class == "VarChar"> + private final Mutator dupMutator = new DupValsOnlyMutator(); + /** Accessor instance for duplicate values vector */ + private final Accessor dupAccessor = new DupValsOnlyAccessor(); + /** Optimization for cases where all values are identical */ + private boolean duplicateValuesOnly; + /** logical number of values */ + private int logicalNumValues; + /** logical value capacity */ + private int logicalValueCapacity; + /** Mutator instance for duplicate values vector */ + + /** true if this vector holds the same value albeit repeated */ + public boolean isDuplicateValsOnly() { --- End diff -- Tried that (that was my first attempt) but the main issue is that the Drill factory creates vectors based on the column metadata alone. This design has the advantage of enabling / rolling-back optimizations transparently from the consumer. I also made sure there is no performance penalty (or at least minimal). ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r162828863 --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java --- @@ -68,96 +85,441 @@ private final UInt1Vector bits = new UInt1Vector(bitsField, allocator); private final ${valuesName} values = new ${minor.class}Vector(field, allocator); + private final Mutator mutator = new MutatorImpl(); + private final Accessor accessor= new AccessorImpl(); + + <#if type.major == "VarLen" && minor.class == "VarChar"> + private final Mutator dupMutator = new DupValsOnlyMutator(); + /** Accessor instance for duplicate values vector */ + private final Accessor dupAccessor = new DupValsOnlyAccessor(); + /** Optimization for cases where all values are identical */ + private boolean duplicateValuesOnly; + /** logical number of values */ + private int logicalNumValues; + /** logical value capacity */ + private int logicalValueCapacity; + /** Mutator instance for duplicate values vector */ + + /** true if this vector holds the same value albeit repeated */ + public boolean isDuplicateValsOnly() { +return duplicateValuesOnly; + } - private final Mutator mutator = new Mutator(); - private final Accessor accessor = new Accessor(); + /** + * Sets this vector duplicate values mode; the {@link #clear()} method wil also be called as a side effect + * of this operation + */ + public void setDuplicateValsOnly(boolean valsOnly) { +clear(); +duplicateValuesOnly = valsOnly; + } - public ${className}(MaterializedField field, BufferAllocator allocator) { -super(field, allocator); + /** {@inheritDoc} */ + @Override + public int getValueCapacity(){ +if (!isDuplicateValsOnly()) { +return Math.min(bits.getValueCapacity(), values.getValueCapacity()); + } +return logicalValueCapacity; } + /** {@inheritDoc} */ @Override - public FieldReader getReader(){ -return reader; + public void close() { +bits.close(); +values.close(); +super.close(); } + /** {@inheritDoc} */ @Override - public int getValueCapacity(){ -return Math.min(bits.getValueCapacity(), values.getValueCapacity()); + public void clear() { +bits.clear(); +values.clear(); +super.clear(); +if (isDuplicateValsOnly()) { + logicalNumValues = 0; + logicalValueCapacity = 0; + duplicateValuesOnly = false; +} } + /** {@inheritDoc} */ @Override - public DrillBuf[] getBuffers(boolean clear) { -final DrillBuf[] buffers = ObjectArrays.concat(bits.getBuffers(false), values.getBuffers(false), DrillBuf.class); -if (clear) { - for (final DrillBuf buffer:buffers) { -buffer.retain(1); + public int getBufferSizeFor(final int valueCount) { +assert valueCount >= 0; + +if (valueCount == 0) { + return 0; +} +if (!isDuplicateValsOnly()) { + return values.getBufferSizeFor(valueCount) + bits.getBufferSizeFor(valueCount); +} +return values.getBufferSizeFor(1) + bits.getBufferSizeFor(1); + } + + /** {@inheritDoc} */ + @Override + public ${valuesName} getValuesVector() { +if (!isDuplicateValsOnly()) { + return values; +} +throw new UnsupportedOperationException(); + } + + /** {@inheritDoc} */ + @Override + public void setInitialCapacity(int numRecords) { +assert numRecords >= 0; +if (!isDuplicateValsOnly()) { + bits.setInitialCapacity(numRecords); + values.setInitialCapacity(numRecords); +} else { + bits.setInitialCapacity(numRecords > 0 ? 1 : 0); + values.setInitialCapacity(numRecords > 0 ? 1 : 0); + logicalValueCapacity = numRecords; +} + } + + /** {@inheritDoc} */ + @Override + public Accessor getAccessor() { +if (!isDuplicateValsOnly()) { + return accessor; +} +return dupAccessor; + } + + /** {@inheritDoc} */ + @Override + public Mutator getMutator() { +if (!isDuplicateValsOnly()) { + return mutator; +} +return dupMutator; + } + + public void copyFrom(int fromIndex, int thisIndex, Nullable${minor.class}Vector from) { +final Accessor fromAccessor = from.getAccessor(); +if (isDuplicateValsOnly()) { + // We currently don't have a use-case where a target dup-vector is provisioned by a non-dup source vector + assert isD
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r162828312 --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java --- @@ -51,6 +57,17 @@ public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); + /** + * Optimization to set contiguous values nullable state in a bulk manner; cannot define this array + * within the Mutator class as Java doesn't allow static initialization within a non static inner class. + */ + private static final int DEFINED_VALUES_ARRAY_LEN = 1 << 10; + private static final byte[] DEFINED_VALUES_ARRAY = new byte[DEFINED_VALUES_ARRAY_LEN]; --- End diff -- I rather have it within the sub-class which defines the associated functionality (that is nullable DT); the wasted memory is also negligible. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r162828174 --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java --- @@ -51,6 +57,17 @@ public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); + /** + * Optimization to set contiguous values nullable state in a bulk manner; cannot define this array + * within the Mutator class as Java doesn't allow static initialization within a non static inner class. + */ + private static final int DEFINED_VALUES_ARRAY_LEN = 1 << 10; --- End diff -- 1 << 10 is 1024; not sure how did you arrive at 10,000,000,000.. ---
[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1060 @paul-rogers with regard to the design aspects that you brought up: ** Corrections about the proposed Design ** Your analysis somehow assumes the Vector is the one driving the loading operation. This is not the case, a) it is the reader which invokes the bulk save API b) the Vector save method runs a loop requesting the rest of the data. The reader can any time stop the loading phase; currently the reader uses the batch size to make this decision. The plan is to enhance this logic by having a feedback from the Vector save method about memory usage (e.g., save current row set with the condition the memory usage doesn't get X bytes). I believe I have raised this design decision early on and the agreement was to implement the memory batch restrictions in a next Drill release. ** Agile Development ** Through the years I came to appreciate the value of agile development. One needs to prioritize design activities; it is completely valid to start with a design (which is not guaranteed to be perfect) and then refine it overtime as long as the underlying implementation doesn't spill to other modules (encapsulation). This has the merit of showcasing incremental improvements and allowing the developer to get new insight as they have a better understanding. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161065237 --- Diff: protocol/src/main/protobuf/UserBitShared.proto --- @@ -148,6 +148,8 @@ message SerializedField { optional int32 value_count = 4; optional int32 var_byte_length = 5; optional int32 buffer_length = 7; + optional bool is_dup = 8; + optional int32 logical_value_count = 9; --- End diff -- Just had an offline conversation with Paul and Kunal. The agreement is that we should introduce a versioning mechanism at the protocol level so that clients could advertise their version identifier; this way the server can use this information to turn on / off features based on the client capabilities. The task is only to create the preliminary versioning infrastructure; more sophistication can be added later on. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161045748 --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java --- @@ -386,7 +391,7 @@ public void reAlloc() { throw new OversizedAllocationException("Unable to expand the buffer. Max allowed buffer size is reached."); } -logger.trace("Reallocating VarChar, new size {}", newAllocationSize); +logger.trace("Reallocating VarChar, new size {}",newAllocationSize); --- End diff -- Fixed. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161039122 --- Diff: exec/vector/src/main/codegen/templates/FixedValueVectors.java --- @@ -874,6 +880,46 @@ public void setSafe(int index, BigDecimal value) { set(index, value); } +/** + * Copies the bulk input into this value vector and extends its capacity if necessary. + * @param input bulk input + */ +public void setSafe(VLBulkInput input) { + setSafe(input, null); +} + +/** + * Copies the bulk input into this value vector and extends its capacity if necessary. The callback + * mechanism allows decoration as caller is invoked for each bulk entry. + * + * @param input bulk input + * @param callback a bulk input callback object (optional) + */ +public void setSafe(VLBulkInput input, VLBulkInput.BulkInputCallback callback) { --- End diff -- This code is not Parquet specific. Instead, it can be triggered by any Reader which desires to load data in a bulk fashion. Vectors currently expose Mutator APIs for loading single values; I see no good reason which prevent us from passing bulk values instead of a single one at a time which prevent us from code optimization. Look at ByBuffer APIs they allow you to pass single byte values but also byte arrays. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161036145 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Please refer to my reply to Parth. I personally think the memory package should be our main abstraction (of course along with guidelines). Putting all memory access APIs within the DrillBuf class will only clutter that class.. I rather have a centralized memory utility class for advanced processing which do not involve a DrillBuf. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161040070 --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java --- @@ -309,7 +314,7 @@ public void setInitialCapacity(final int valueCount) { if (size > MAX_ALLOCATION_SIZE) { throw new OversizedAllocationException("Requested amount of memory is more than max allowed allocation size"); } -allocationSizeInBytes = (int) size; +allocationSizeInBytes = (int)size; --- End diff -- Fixed. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161032779 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableColumnReader.java --- @@ -165,17 +181,133 @@ + "Run Length: {} \t Null Run Length: {} \t readCount: {} \t writeCount: {} \t " + "recordsReadInThisIteration: {} \t valuesReadInCurrentPass: {} \t " + "totalValuesRead: {} \t readStartInBytes: {} \t readLength: {} \t pageReader.byteLength: {} \t " - + "definitionLevelsRead: {} \t pageReader.currentPageCount: {}", + + "currPageValuesProcessed: {} \t pageReader.currentPageCount: {}", recordsToReadInThisPass, runLength, nullRunLength, readCount, writeCount, recordsReadInThisIteration, valuesReadInCurrentPass, totalValuesRead, readStartInBytes, readLength, pageReader.byteLength, - definitionLevelsRead, pageReader.currentPageCount); + currPageValuesProcessed, pageReader.currentPageCount); } valueVec.getMutator().setValueCount(valuesReadInCurrentPass); } + private final void processPagesBulk(long recordsToReadInThisPass) throws IOException { +readStartInBytes = 0; +readLength = 0; +readLengthInBits = 0; +recordsReadInThisIteration = 0; +vectorData = castedBaseVector.getBuffer(); + +// values need to be spaced out where nulls appear in the column +// leaving blank space for nulls allows for random access to values +// to optimize copying data out of the buffered disk stream, runs of defined values +// are located and copied together, rather than copying individual values + +int valueCount = 0; +final int maxValuesToProcess = Math.min((int) recordsToReadInThisPass, valueVec.getValueCapacity()); + +// To handle the case where the page has been already loaded +if (pageReader.definitionLevels != null && currPageValuesProcessed == 0) { + definitionLevelWrapper.set(pageReader.definitionLevels, pageReader.currentPageCount); +} + +while (valueCount < maxValuesToProcess) { + + // read a page if needed + if (!pageReader.hasPage() || (currPageValuesProcessed == pageReader.currentPageCount)) { +if (!pageReader.next()) { + break; +} + +//New page. Reset the definition level. +currPageValuesProcessed= 0; +recordsReadInThisIteration = 0; +readStartInBytes = 0; + +// Update the Definition Level reader +definitionLevelWrapper.set(pageReader.definitionLevels, pageReader.currentPageCount); + } + + definitionLevelWrapper.readFirstIntegerIfNeeded(); + + int numNullValues = 0; + int numNonNullValues= 0; + final int remaining = maxValuesToProcess - valueCount; + int currBatchSz = Math.min(remaining, (pageReader.currentPageCount - currPageValuesProcessed)); + assert currBatchSz > 0; + + // Let's skip the next run of nulls if any ... + int idx; + for (idx = 0; idx < currBatchSz; ++idx) { +if (definitionLevelWrapper.readCurrInteger() == 1) { + break; // non-value encountered +} +definitionLevelWrapper.nextIntegerIfNotEOF(); + } + numNullValues += idx; + + // Write the nulls if any --- End diff -- This is the original logic for processing fixed precision columns (except more efficient); the intent is to figure out how many null values there are before processing non-null values. We could have avoided this for-loop if we controlled the Parquet ValuesReader API; I was tempted to do that but decided to prioritize my tasks. The good news is that I recorded all such improvement opportunities and will try to tackle them at some point. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161030963 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java --- @@ -449,7 +451,7 @@ public void close() throws Exception { // been created. Gracefully handle that case. if (options != null) { - options.close(); -} +options.close(); } } +} --- End diff -- The issue was with the close() method which was not properly indented. Fixed it. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164532290 --- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java --- @@ -703,7 +703,18 @@ protected void _setLong(int index, long value) { @Override public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) { -udle.getBytes(index + offset, dst, dstIndex, length); +final int BULK_COPY_THR = 1024; --- End diff -- You are right, I'll put more information on this optimization: - During code profiling, I have noticed that getBytes() doesn't perform well when called with small length (lower than 1k) - Its throughput improves as the length increases Analysis : - Java exposes two intrinsics for writing to direct memory: putByte and copyMemory - The JVM is able to inline memory access (no function call) for putByte - copyMemory is a bulk API and this internally invokes libc memcpy (requires function call) - The rational is that we are willing to incur a function call if the associated processing is larger than the overhead; this is almost never the case for small memory accesses. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164525752 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Vlad, Can you please clarify your point? Are you suggesting that every memory access has to be via DrillBuf even if the data is loaded in Java heap and the developer is performing an optimization. If this is your goal, then I would think the danger is that the DrillBuf class will be cluttered and contracts hard to understand: - Notice that DrillBuf performs expensive sanity checks (boundary and reference count checks); this is due to Drill memory optimizations for minimizing the overall memory usage. - When the code is processing already load data, these checks are not applicable anymore as this data is private. - I could have easily moved the new APIs there but I strongly felt this was inappropriate.. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164509658 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Vlad, - The reader has loaded data from direct memory into CPU level-1 cache through DrillBuf - The data is stored (4k at a time) in a Java byte[] - The only reason for using MemoryUtil is to perform 64bit processing (I have previously indicated that many other processing intensive applications such as Snappy and Hadoop Comparison use the same pattern) - I solely added this logic after noticing the Hotspot was generating 8bit instructions Again, the intent is not to access data from DrillBuf (the data has been already loaded into CPU cache); this is mainly an optimization step. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r165698800 --- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java --- @@ -703,7 +703,18 @@ protected void _setLong(int index, long value) { @Override public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) { -udle.getBytes(index + offset, dst, dstIndex, length); +final int BULK_COPY_THR = 1024; --- End diff -- Vlad, - I had a chat with @bitblender and he explains that Java was invoking a stub (not a function call) to perform copyMemory; he agreed copyMemory will be slower for small buffers and the task was to determine the cutoff point - My tests (I will send you my test) indicate that a length of 1024bytes is the length were copyMemory starts performing exactly as getByte() NOTE - I am using JRE 1.8; static buffers initialized once; payload 1MB (1048576bytes) and loop-count of 102400; MacOS High Sierra; 1 thread, 4GB MX, MS ---
[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1106 Thanks @paul-rogers for the information; I went through the PR and noticed the following: - BatchSchema invokes MaterializedField.isEquivalent() - With my fix, both methods consider nested columns but they have several differences 1) RecordBatchLoader requires sameness as this information is used to reuse the value vectors; if old and new batch are deemed same, then the value vectors are reloaded using the load(...) API. The metadata better be the same or a runtime exception will occur 2) RecordBatchLoader isSame(...) API compares two different java objects: SerializedField (obtained from protobufs) and already materialized value vectors MaterializedField 3) RecordBatchLoader isSame(...) API tolerates unordered fields (within the same level) but not MaterializedField.isEquivalent() method 4) MaterializedField.isEquivalent() ignores hidden columns such "$bits" and "$offsets" but not RecordBatchLoader isSame(...) I think moving forward, the best way to prevent bugs with regard to schema changes is by maintaining a document that establishes all the rules. This will allow QA to refine their tests and catch current limitations. ---
[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1106 Thank you Aman and Paul for reviewing this JIRA. I have created a new JIRA [DRILL-6131](https://issues.apache.org/jira/browse/DRILL-6131) to track the new SchemaComparator utility for sharing schema comparison logic. ---
[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...
Github user sachouche closed the pull request at: https://github.com/apache/drill/pull/1106 ---
[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...
GitHub user sachouche reopened a pull request: https://github.com/apache/drill/pull/1106 DRILL-6129: Fixed query failure due to nested column data type change Problem Description - - The Drillbit was able to successfully send batches containing different metadata (for nested columns) - This was the case when one or multiple scanners were involved - The issue happened within the client where value vectors are cached across batches - The load(...) API is responsible for updating values vectors when a new batch arrives - The RecordBatchLoader class is used to detect schema changes ; if this is the case, then previous value vectors are discarded and new ones created - There is a bug with the current implementation where only first level columns are compared Fix - - The fix is to improve the schema diff logic by including nested columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6129 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1106 commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4 Author: Salim Achouche <sachouche2@...> Date: 2018-02-01T02:59:58Z DRILL-6129: Fixed query failure due to nested column data type change ---
[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1106 DRILL-6129: Fixed query failure due to nested column data type change Problem Description - - The Drillbit was able to successfully send batches containing different metadata (for nested columns) - This was the case when one or multiple scanners were involved - The issue happened within the client where value vectors are cached across batches - The load(...) API is responsible for updating values vectors when a new batch arrives - The RecordBatchLoader class is used to detect schema changes ; if this is the case, then previous value vectors are discarded and new ones created - There is a bug with the current implementation where only first level columns are compared Fix - - The fix is to improve the schema diff logic by including nested columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6129 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1106 commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4 Author: Salim Achouche <sachouche2@...> Date: 2018-02-01T02:59:58Z DRILL-6129: Fixed query failure due to nested column data type change ---
[GitHub] drill issue #1011: Drill 1170: Drill-on-YARN
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1011 +1 I have reviewed the code and overall looks good. My main feedback is that the current implementation doesn't currently support secure clusters (at least didn't see any logic associated with that). Yarn applications have issues staying up for a long time because of ticket renewal limitations. We might want to create another enhancement JIRA to support such use-cases. ---
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1107#discussion_r166448594 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java --- @@ -77,7 +77,7 @@ private ExecConstants() { public static final String SPILL_DIRS = "drill.exec.spill.directories"; public static final String OUTPUT_BATCH_SIZE = "drill.exec.memory.operator.output_batch_size"; - public static final LongValidator OUTPUT_BATCH_SIZE_VALIDATOR = new RangeLongValidator(OUTPUT_BATCH_SIZE, 1024, 512 * 1024 * 1024); + public static final LongValidator OUTPUT_BATCH_SIZE_VALIDATOR = new RangeLongValidator(OUTPUT_BATCH_SIZE, 1, 512 * 1024 * 1024); --- End diff -- Theoretically, 2Gb-1 should be the largest value as our value vectors use integers as offset values. ---
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1107#discussion_r166450263 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java --- @@ -102,20 +105,78 @@ private final List comparators; private final JoinRelType joinType; private JoinWorker worker; + private final long outputBatchSize; private static final String LEFT_INPUT = "LEFT INPUT"; private static final String RIGHT_INPUT = "RIGHT INPUT"; + private class MergeJoinMemoryManager extends AbstractRecordBatchMemoryManager { +private int leftRowWidth; +private int rightRowWidth; + +/** + * mergejoin operates on one record at a time from the left and right batches + * using RecordIterator abstraction. We have a callback mechanism to get notified + * when new batch is loaded in record iterator. + * This can get called in the middle of current output batch we are building. + * when this gets called, adjust number of output rows for the current batch and + * update the value to be used for subsequent batches. + */ +@Override +public void update(int inputIndex) { + switch(inputIndex) { +case 0: + final RecordBatchSizer leftSizer = new RecordBatchSizer(left); + leftRowWidth = leftSizer.netRowWidth(); + break; +case 1: + final RecordBatchSizer rightSizer = new RecordBatchSizer(right); + rightRowWidth = rightSizer.netRowWidth(); +default: + break; + } + + final int newOutgoingRowWidth = leftRowWidth + rightRowWidth; + + // If outgoing row width is 0, just return. This is possible for empty batches or + // when first set of batches come with OK_NEW_SCHEMA and no data. + if (newOutgoingRowWidth == 0) { +return; + } + + // update the value to be used for next batch(es) + setOutputRowCount(Math.min(ValueVector.MAX_ROW_COUNT, + Math.max(RecordBatchSizer.safeDivide(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR, newOutgoingRowWidth), MIN_NUM_ROWS))); --- End diff -- Our goal is to eventually use Paul's framework for controlling batch sizes. I feel the BatchRecordMemoryManager terminology is a bit ambitious (kind of misleading). ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158084806 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. - for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1168#discussion_r174870229 --- Diff: exec/jdbc-all/pom.xml --- @@ -473,6 +473,8 @@ org/yaml/** hello/** webapps/** + **/org/apache/calcite/avatica/metrics/** + **/org/apache/calcite/avatica/org/** --- End diff -- The Apache jar size has shrunk by ~1MB from over 34MB to 33393541 bytes; I guess we can decrease the limit from 35MB to 34MB. ---
[GitHub] drill issue #1168: DRILL-6246: Reduced the size of the jdbc-all jar file
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1168 @sohami can you please review this pull request? Thanks! ---
[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1168 DRILL-6246: Reduced the size of the jdbc-all jar file - The jdbc-all client jar has been growing in size from version to the next (~20MB in version 1.10, ~27MB in version 1.12, and 34MB in version 1.13 - Note the exact size of the size depends on the maven profile used during compilation (e.g., "mapr" which have more excludes) - Originally, I tried to exclude the following calcite/avatica packages (responsible for the recent size increase): metrics, proto, org/apache/, com/fasterxml, com/google but some tests failed as the calcite code was referencing some of them - In this pull request, I have excluded the following packages: calcite/avatica/org and calcite/avatica/metrics You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6246 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1168.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1168 commit 6b6188a6c1677595c479f67fc2e33af091c36818 Author: Salim Achouche <sachouche2@...> Date: 2018-03-14T02:16:06Z DRILL-6246: Reduced the size of the jdbc-all jar file ---
[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1168#discussion_r174889287 --- Diff: exec/jdbc-all/pom.xml --- @@ -473,6 +473,8 @@ org/yaml/** hello/** webapps/** + **/org/apache/calcite/avatica/metrics/** + **/org/apache/calcite/avatica/org/** --- End diff -- done! ---
[GitHub] drill issue #1170: DRILL-6223: Fixed several Drillbit failures due to schema...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1170 @amansinha100 can you please review this pull request? Thanks! ---
[GitHub] drill pull request #1170: DRILL-6223: Fixed several Drillbit failures due to...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1170 DRILL-6223: Fixed several Drillbit failures due to schema changes Fixed several Issues due to Schema changes: 1) Changes in complex data types Drill Query Failing when selecting all columns from a Complex Nested Data File (Parquet) Set). There are differences in Schema among the files: The Parquet files exhibit differences both at the first level and within nested data types A select * will not cause an exception but using a limit clause will Note also this issue seems to happen only when multiple Drillbit minor fragments are involved (concurrency higher than one) 2) Dangling columns (both simple and complex) This situation can be easily reproduced for: - Select STAR queries which involve input data with different schemas - LIMIT or / and PROJECT operators are used - The data will be read from more than one minor fragment - This is because individual readers have logic to handle such use-cases but not downstream operators - So is reader-1 sends one batch with F1, F2, and F3 - The reader-2 sends batch F2, F3 - Then the LIMIT and PROJECT operator will fail to cleanup the dangling column F1 which will cause failures when downstream operators copy logic attempts copy the stale column F1 - This pull request adds logic to detect and eliminate dangling columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6223 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1170.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1170 commit d986b6c7588c107bb7e49d2fc8eb3f25a60e1214 Author: Salim Achouche <sachouche2@...> Date: 2018-02-21T02:17:14Z DRILL-6223: Fixed several Drillbit failures due to schema changes ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r179296101 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/MemoryUtils.java --- @@ -0,0 +1,199 @@ +/** --- End diff -- Vlad, can you please explain what you meant? are you only referring for specific members or is it a general comment? ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r179295602 --- Diff: exec/java-exec/pom.xml --- @@ -836,6 +836,14 @@ org.apache.maven.plugins maven-surefire-plugin + +org.apache.maven.plugins +maven-compiler-plugin +2.3.2 + +-XDignore.symbol.file --- End diff -- Will do! ---
[GitHub] drill issue #1168: DRILL-6246: Reduced the size of the jdbc-all jar file
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1168 No I didn't. Regards, Salim Regards, Salim -Original Message- From: Kunal Khatua [notificati...@github.com] Received: Tuesday, 10 Apr 2018, 14:01 To: apache/drill [dr...@noreply.github.com] CC: Salim Achouche [sachou...@mapr.com]; Mention [ment...@noreply.github.com] Subject: Re: [apache/drill] DRILL-6246: Reduced the size of the jdbc-all jar file (#1168) @sachouche<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_sachouche=DwMCaQ=cskdkSMqhcnjZxdQVpwTXg=Yz1mbrEFs0HJzxHJZ69JBLCsKtk9LmaS_d_ncP1HZwY=ZuRTiFKVp4XyNeUONWTmOtorxN8OgUnpJeF5Nnbzpi8=ORom6yiAaNx9O8pqbz4PEqbCngKTh3c4MO1H6btm4s4=> have you had a chance to test the generated JDBC driver with Spotfire/SQuirreL ? â You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_drill_pull_1168-23issuecomment-2D380245152=DwMCaQ=cskdkSMqhcnjZxdQVpwTXg=Yz1mbrEFs0HJzxHJZ69JBLCsKtk9LmaS_d_ncP1HZwY=ZuRTiFKVp4XyNeUONWTmOtorxN8OgUnpJeF5Nnbzpi8=Q3TGflgHce0_-5S2QAAdrJp8SQnAobk1W0ZFJwD50-Q=>, or mute the thread<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AckoK6qKn0CR0-2DszS5136weV3xs2MjZYks5tnR2UgaJpZM4Ssiu1=DwMCaQ=cskdkSMqhcnjZxdQVpwTXg=Yz1mbrEFs0HJzxHJZ69JBLCsKtk9LmaS_d_ncP1HZwY=ZuRTiFKVp4XyNeUONWTmOtorxN8OgUnpJeF5Nnbzpi8=xgHCT0e9lj785056Kkt2KgkhNR3pvO4wPNI_B3J5S54=>. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r183898352 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -182,13 +184,18 @@ public IterOutcome next() { return IterOutcome.OUT_OF_MEMORY; } + // Transfer the ownership of this raw-batch to this operator for proper memory statistics reporting + batch = batch.transferBodyOwnership(oContext.getAllocator()); --- End diff -- @HanumathRao, - This will not affect the outcome since the unordered-receiver is a child of the context allocator (the minor fragment) - Doing the check earlier seemed cleaner as there is no point of changing ownership when there is already an out-of-memory condition - I think the code was written this way since implicitly, the network handlers are receiving the batch and then performing a change of ownership (from RPC to context allocator); this step could lead to an out-of-memory condition at the fragment level - If your point is about reporting, that is attributing the OOM condition to the child operator, then I guess I don't have a problem with that. @parthchandra what do you think? ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1237 DRILL-6348: Fixed code so that Unordered Receiver reports its memory ⦠Problem Description - - The Unordered Receiver doesn't report any memory usage because the RPC infrastructure associated the allocated memory ownership to the minor fragment container Proposed Fix - - Modified the code to change the received DrillBuf memory ownership from the parent fragment to the operator (as discussed with @parthchandra) - Made sure that the buffer accounting logic is preserved @parthchandra, can you please review this pull request? Thanks! You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6348 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1237.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1237 commit 9ab501a3fc3ebbda44c1478ad9db6711192b8ca1 Author: Salim Achouche <sachouche2@...> Date: 2018-04-23T01:02:35Z DRILL-6348: Fixed code so that Unordered Receiver reports its memory usage ---