from:"sachouche"

[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-09-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r140055857
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1054,8 +1057,36 @@ public void setMax(Object max) {
   return nulls;
 }
 
-@Override public boolean hasSingleValue() {
-  return (max != null && min != null && max.equals(min));
+/**
+ * Checks that the column chunk has a single value.
+ * Returns {@code true} if {@code min} and {@code max} are the same 
but not null
+ * and nulls count is 0 or equal to the rows count.
+ * 
+ * Returns {@code true} if {@code min} and {@code max} are null and 
the number of null values
+ * in the column chunk is equal to the rows count.
+ * 
+ * Comparison of nulls and rows count is needed for the cases:
+ * 
+ * column with primitive type has single value and null values
+ *
+ * column with primitive type has only null values, min/max 
couldn't be null,
+ * but column has single value
+ * 
+ *
+ * @param rowCount rows count in column chunk
+ * @return true if column has single value
+ */
+@Override
+public boolean hasSingleValue(long rowCount) {
+  if (nulls != null) {
+if (min != null) {
+  // Objects.deepEquals() is used here, since min and max may be 
byte arrays
+  return Objects.deepEquals(min, max) && (nulls == 0 || nulls == 
rowCount);
--- End diff --

- if (min != null), then nulls cannot be equal to rowCount
- In this case, only nulls == 0 should be checked 


---

[GitHub] drill issue #805: Drill-4139: Exception while trying to prune partition. jav...

2017-09-20 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/805
  
+1


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

2017-10-06 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/976
  
Sure!


Regards,

Salim


From: dprofeta <notificati...@github.com>
Sent: Friday, October 6, 2017 8:52:51 AM
To: apache/drill
Cc: Salim Achouche; Mention
Subject: Re: [apache/drill] DRILL-5797: Choose parquet reader from read 
columns (#976)


@sachouche<https://github.com/sachouche> Can you review it?

â
You are receiving this because you were mentioned.
Reply to this email directly, view it on 
GitHub<https://github.com/apache/drill/pull/976#issuecomment-334795292>, or 
mute the 
thread<https://github.com/notifications/unsubscribe-auth/AckoK82kUi_xZmBVaiKuujb-rxzaL2sIks5spkzTgaJpZM4PwLDB>.



---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

2017-10-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/976#discussion_r143837353
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
 return new ScanBatch(rowGroupScan, context, oContext, readers, 
implicitColumns);
   }
 
-  private static boolean isComplex(ParquetMetadata footer) {
-MessageType schema = footer.getFileMetaData().getSchema();
+  private static boolean isComplex(ParquetMetadata footer, 
List columns) {
+if (Utilities.isStarQuery(columns)) {
+  MessageType schema = footer.getFileMetaData().getSchema();
 
-for (Type type : schema.getFields()) {
-  if (!type.isPrimitive()) {
-return true;
+  for (Type type : schema.getFields()) {
+if (!type.isPrimitive()) {
+  return true;
+}
   }
-}
-for (ColumnDescriptor col : schema.getColumns()) {
-  if (col.getMaxRepetitionLevel() > 0) {
-return true;
+  for (ColumnDescriptor col : schema.getColumns()) {
+if (col.getMaxRepetitionLevel() > 0) {
+  return true;
+}
+  }
+  return false;
+} else {
+  for (SchemaPath column : columns) {
+if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) 
{
--- End diff --

Can you please use the already defined schema variable instead of invoking 
"footer.getFileMetaData().getSchema()" multiple times.


---

[GitHub] drill pull request #970: DRILL-5832: Migrate OperatorFixture to use SystemOp...

2017-10-05 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/970#discussion_r143017778
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java
 ---
@@ -84,14 +82,7 @@
   private RecordReader recordReader;
   private DrillParquetRecordMaterializer recordMaterializer;
   private int recordCount;
-  private List primitiveVectors;
   private OperatorContext operatorContext;
--- End diff --

Can you explain the reason you are removing the existing logic?


---

[GitHub] drill pull request #970: DRILL-5832: Migrate OperatorFixture to use SystemOp...

2017-10-05 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/970#discussion_r143020963
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
 ---
@@ -184,25 +183,26 @@ public void testAllScalarTypes() throws Exception {
 
 try {
   // read all of the types with the complex reader
-  test(String.format("alter session set %s = true", 
ExecConstants.PARQUET_NEW_RECORD_READER));
+  alterSession(ExecConstants.PARQUET_NEW_RECORD_READER, true);
--- End diff --

Paul,

- I noticed the key PARQUET_NEW_RECORD_READER is erroneous
- There are currently two readers
   o The old one is used when nested data is used as it can handle all 
parquet use-cases
   o The new reader only deals with Flat parquet data types
- We might want to rename the keys as the new reader cannot always be 
substituted with the old one 



---

[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
Paul,

- I think you misunderstood the proposal
- Let me use an example
- select .. c1 like '%pattern1%' OR c1 like '%pattern2'..
- Assume c1 has 3 values [v1, v2, v3]
- The generated code will create a new VarcharHolder instance for each 
value iteration
- For v1: VarCharHolder vc1 is created, ascii-mode computed for pattern1, 
ascii-mode computation reused for pattern2, pattern3, etc since we're 
evaluating the same value
- For v2: VarCharHolder vc1 is created, ascii-mode computed for pattern1, 
ascii-mode computation reused for pattern2, pattern3, etc since we're 
evaluating the same value
- DITO for v3

Note that the test-suite has similar test cases that you are proposing; 
they all passed.  


---

[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
- It was 50% for 1) and 50% for 2)
- Notice this breakdown depends on 
   o The number of Contains pattern for the same value (impacts 1))
   o The pattern length (impacts both 1) and 2))  


---

[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1001

JIRA DRILL-5879: Like operator performance improvements

- Recently, custom code has been added to handle common search patterns 
(Like operator)
- Contains, Starts With, and Ends With
- Custom code is way faster than the generic RegEx based implementation
- This pull request is another attempt to improve the Contains pattern 
since it is more CPU intensive

Query: select  from table where colA like '%a%' or 
colA like '%xyz%';
Improvement Opportunities
Avoid isAscii computation (full access of the input string) since we're 
dealing with the same column twice
Optimize the "contains" for-loop
Implementation Details
1)
Added a new integer variable "asciiMode" to the VarCharHolder class
The default value is -1 which indicates this info is not known
Otherwise this value will be set to either 1 or 0 based on the string being 
in ASCII mode or Unicode
The execution plan already shares the same VarCharHolder instance for all 
evaluations of the same column value
The asciiMode will be correctly set during the first LIKE evaluation and 
will be reused across other LIKE evaluations
2)
The "Contains" LIKE operation is quite expensive as the code needs to 
access the input string to perform character based comparisons
Created 4 versions of the same for-loop to a) make the loop simpler to 
optimize (Vectorization) and b) minimize comparisons
Benchmarks
Lineitem table 100GB
Query: select l_returnflag, count from dfs.`` where l_comment not 
like '%a%' or l_comment like '%the%' group by l_returnflag
Before changes: 33sec
After changes : 27sec


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill yodlee-cherry-pick

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1001.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1001


commit c2b05b2e8665daf3f7b43d49c428539b3753595f
Author: Salim Achouche <sachouc...@gmail.com>
Date:   2017-10-18T18:40:18Z

JIRA 5879: Like operator performance improvements




---

[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-23 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
Paul, again thanks for the detailed review:

- I was able to address most of the feedback except for one
- I agree that expressions that can operate directly on the encoded UTF-8 
string should ideally perform  checks on bytes and not characters
- Having said that, such a change is more involved and should be done 
properly
   o The SqlPatternContainsMatcher currently gets a CharSequence as input
   o We should enhance the expression framework so that matchers can a) 
express their capabilities and b) receive the expected data type (Character or 
Byte sequences)
  o Note also there is an impact on the test-suite since StringBuffer are 
being used to directly test the matcher functionality 


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-13 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1072

DRILL-5879: Improved SQL Pattern Contains Performance

**BACKGROUND**
- JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) goal 
is to improve the "Contains" pattern performance
- [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task) 
was created subsequently to avoid the ASCII / Unicode pre-processing overhead.
- This pull-request addresses the algorithmic part of this functionality

**ANALYSIS**
- Contains has O(n*m) complexity
- There are two ways to optimize the associated runtime
1) Minimize the number of instructions, pipelining, and CPU stalls
2) Use smarter algorithms (compared to the naive one); for example, the 
Boyer-Moore algorithm (which is implemented by several popular Open Source 
tools such as grep)

**IMPLEMENTATION**
Our approach contains both suggestions (listed in the analysis)
- Created five matchers that are chosen based on the pattern length
- The first three, are based on pattern 1) and target patterns with length 
[1..4[ 
- The forth one, has a similar runtime then the current implementation and 
targets patterns with length [4..10[
- We use the the Boyer-Moore algorithm for patterns with a length larger 
than 9 bytes
NOTE - the JDK doesn't use this algorithm because of two main  reasons
- Two extra arrays are necessary with a size relative to the supported 
character-set and the pattern length (this would be particularly costly for 
unicode as this would require 64k entries)
- Each Contains (or indexOf) invocation would require memory and 
initialization overhead
- Drill doesn't have this issue as a) initialization overhead is amortized 
since the pattern will be matched against many input values and b) our Contains 
logic is centered around bytes so the memory overhead is around 256 integers 
per fragment

**PERFORMANCE IMPROVEMENTS**
- We observe at least 25% improvement per Contains operation for matchers 
with pattern length lower than 4; 100% for negative cases (as the code never 
accesses a secondary for-loop)
- It was noticed the Boyer-Moor algorithm performs poorly for small 
patterns as lookup accesses erase the associated optimizations; this algorithm 
performs extremely well when the pattern length increases as unlike the naive 
implementation it is able to use the lookup table to jump beyond the next 
character (since the matching phase has already gained so insight).
- One popular introduction to this algorithm is the following use-case
- Input: 
- Pattern: aax



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-5879

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1072


commit 50ac5140420057692cac8a33f181eb16580a28e6
Author: Salim Achouche <sachouc...@gmail.com>
Date:   2017-12-13T22:24:45Z

DRILL-5879: Improved SQL Pattern Contains Performance




---

[GitHub] drill issue #1015: DRILL-5889: Simple pattern matchers can work with DrillBu...

2017-11-01 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1015
  
Padma, DRILL-5889 is the wrong JIRA (sqlline loses RPC..); I think your 
JIRA is 5899


Regards,

Salim


From: Padma Penumarthy <notificati...@github.com>
Sent: Wednesday, November 1, 2017 11:15:20 AM
To: apache/drill
Cc: Salim Achouche; Mention
Subject: Re: [apache/drill] DRILL-5889: Simple pattern matchers can work 
with DrillBuf directly (#1015)


@sachouche<https://github.com/sachouche> 
@paul-rogers<https://github.com/paul-rogers> can you please review ?

â
You are receiving this because you were mentioned.
Reply to this email directly, view it on 
GitHub<https://github.com/apache/drill/pull/1015#issuecomment-341192974>, or 
mute the 
thread<https://github.com/notifications/unsubscribe-auth/AckoK8UN4uDlGPJGg6fpypIbPxKEOqGKks5syLU4gaJpZM4QJzty>.



---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

2017-11-01 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/976
  
Looking at the stack trace:
- The code definitely is initializing a column of type REPEATABLE
- The Fast Reader didn't expect this scenario so it used a default 
container (NullableVarBinary) for VL binary DT

Why this is happening?
- The code in ReadState::buildReader() is processing all selected columns
- This information is obtained from the ParquetSchema
- Looking at the code, this seems a case-sensitivity issue
- The ParquetSchema is case-insensitive whereas the Parquet GroupType is not
- Damien added a catch handler (column not found) to handle use-cases where 
we are projecting non-existing columns
- This basically is leading to an unforeseen use-case
- Assume column XYZ is complex
- User uses an alias (xyz)
- The new code will allow this column to pass and treat is as simple
- The ParquetSchema is being case insensitive will process this column
- and thus the exception in the test suite

Suggested Fix
- Create a map (key to-lower-case) and register all current row-group 
columns
- Use this map to locate a selected column type



---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2017-12-03 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1060

DRILL-5846: Improve parquet performance for Flat Data Types

Performance improvements for the Parquet Scanner (Flat Data Types). The are 
two flags to control this performance enhancement (disabled by default):
Option I -
Config Name: store.parquet.flat.reader.bulk
Config Type  : boolean
Description   : Enables bulk processing to minimize memory checks and 
improve JVM HotSpot optimizations

Option II -
Config Name: scan.optimized.implicit.columns 
Config Type  : boolean
Description   : Memory optimization when storing duplicate value (implicit 
columns have duplicate values within a batch). Code profiling indicated this 
step represented one third of Parquet processing (when implicit columns are 
processed).  


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill drill-parquet-improv

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1060.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1060


commit 0a3a8b053be85c570ff24237d4737f37668383bd
Author: Salim Achouche <sachouc...@gmail.com>
Date:   2017-12-04T01:53:08Z

DRILL-5846: Improve parquet performance for Flat Data Types




---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

2017-10-29 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/976
  
+1
looks good!


---

[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-25 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146952578
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,37 +17,166 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
+  final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
+
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot achieve
+// better optimizations (especially vectorization)
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
--- End diff --

I ran two types of tests to evaluate the generic vs custom method:
Test1 - A match does exist
o Custom method is 3x faster because the code will go to the "else" part
o A nested loop is always slower than unrolled code (the loop is also 
correlated with the outer one); please refer to this article 
(https://en.wikipedia.org/wiki/Loop_unrolling) on the benefits of loop unrolling
o Older match function performed in 59sec

Test2- A match doesn't exist
o Custom method and generic one perform in 15sec; this is because both 
perform a comparison and proceed to the next iteration
o Older match function performed in 45sec 


---

[GitHub] drill issue #1008: drill-5890: Fixed a file descriptor leak in Drill's test-...

2017-10-24 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1008
  
@parthchandra can you please review this change? thanks!


---

[GitHub] drill pull request #1008: drill-5890: Fixed a file descriptor leak in Drill'...

2017-10-24 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1008

drill-5890: Fixed a file descriptor leak in Drill's test-suite

Problem Description
- The Drill test-suite uses two surefire processes to run tests
- This has the advantage of avoiding class reloading if the JVM exited 
after running a test class
- The side effect of this approach, is that resource leaks could be 
problematic
- When running the Drill's test-suite on MacOS (Sierra) my tests failed 
with a max FD descriptors reached
- Had to increase the maximum number of open FDs for the whole machine and 
per process
- The process is described in the following 
[link](https://superuser.com/questions/302754/increase-the-maximum-number-of-open-file-descriptors-in-snow-leopard)
- Two limit files "limit.maxfiles.plist" and "limit.maxproc.plist" have to 
be created under "/Library/LaunchDaemons"
- Originally, I had to set the maximum number of FDs per process to a large 
value (100,000 and the system to 200,000) for the tests to succeed

FD Leak Cause
Debugging the Drill test suite, it was noticed 
- A base class BaseTestQuery has a @BeforeClass and @AfterClass TestNG tags 
- This means that each Drill test class extending from BaseTestQuery will 
have a setup method called before any tests are executed and a cleanup method 
invoked when all the tests are done (or a fatal error in between)
- The OpenClient() method was starting a DrillBit and creating a client 
connection to it
- DrillBit's BootStrapContext class was initializing two Netty 
EventLoopGroup objects which internally opened 20 FDs each
- It was noticed that one of them was not getting de-initialized

Fix
- Added logic within the BootStrapContext object to shutdown the 
EventLoopGroup objects if they have been already shutdown (and are not in the 
process of being shutdown)
- The fix tries to shut both objects because the container class should 
ideally manage the lifecycle of its objects; at least, the code should clearly 
articulate lifecycle management responsibilities to avoid leaks
- Used the "shutdownGracefully" method since it was a) already used by our 
code and b) is advertised to have sensible timeout values
- The added shutdown calls are being invoked only when consumer objects 
have been also shutdown
- Running the tests show that the number of FDs per surefire process 
doesn't extend beyond few hundreds (majority created for JAR files loading)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill drill-5890

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1008.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1008


commit 5f8bad865a78ab265015ca21184d6c59e22d1c95
Author: Salim Achouche <sachouc...@gmail.com>
Date:   2017-10-24T17:12:19Z

drill-5890: Fixed a file descriptor leak in Drill's test-suite




---

[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-24 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146708658
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,37 +17,166 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
+  final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
+
+// The idea is to write loops with simple condition checks to allow 
the Java Hotspot achieve
+// better optimizations (especially vectorization)
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
--- End diff --

Padma, I have two reasons to follow the added complexity
1) The new code is encapsulated within the Contains matching logic; doesn't 
increase code complexity
2) 
o I created a test with the original match logic, pattern and input were 
Strings though passed as CharSequence
o Ran the test with the new and old method (1 billion iterations) on MacOS
o pattern length 
o The old match method performed in 43sec where as the new one performed in 
15sec
o The reason for the speedup is the custom matcher functions have less 
instructions (load and comparison)


---

[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-24 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r146705325
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end - out.start)) {
+out.asciiMode = 
org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can 
conclude this string is ASCII
--- End diff --

- As previously stated (when responding to Paul'd comment), the expression 
framework is able to use the same VarCharHolder input variable when it is 
shared amongst multiple expressions
- If the original column was of type var-binary, then the expression 
framework will include a cast to var-char
- The cast logic will also compute the string length
- Using this information to deduce whether the string is pure ASCII or not
- UTF-8 encoding uses 1 byte for ASCII and 2, 3, or 4 for other character 
sets
- If the encoded length and character length are equal, then this means 
this is an ASCII string 


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158080071
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
--- End diff --

This is not a problem as the for-loop condition (index = 0; index < 
lengthToProcess; ..) will prevent any processing to take place unless the 
correct condition is met. The key point here is that the input length is always 
provided by the "end - start" formula. 


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158072762
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
--- End diff --

will do.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158074965
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
--- End diff --

These optimizations are useless with modern compilers. This would have been 
useful if the inputByte value was the same.. then we would save few 
instructions.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158080247
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
--- End diff --

same answer.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158074289
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
--- End diff --

HeapByteBuffer will always create an internal buffer. Empty string will 
have a byte array with zero bytes.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158085723
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158086368
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...

2018-04-27 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1237
  
@vrozov,

**What are we trying to solve / improve**
- Drill is currently not properly reporting memory held in Fragment's 
receive queues
- This makes it hard to analyze OOM conditions
This is what I want to see
- Every operator reporting on the resources it is currently using (needed)
- Fragment held resources (other than the ones already reported by the 
child operators)
- Drilbit level (metadata caches, web-server, ..)
- I am ok to incrementally reach this goal

**Data Exchange Logistic**
- Ideally, the data exchange fabric should be decoupled from the Drill 
Receive / Send operators
- The fabric should be handling all the aspects of pre-fetch / pressuring 
and so forth
- It will tune to the speed of producers / consumers when writing / reading 
data from it
- This infrastructure should have its own resource management and reporting 
capabilities

**Operator based Reporting**
- Receive and Send operators shall not worry about batches they didn't 
consume yet
- Doing so is counter productive as the Data Exchange fabric will interpret 
a "drain" operation as the operator "needing" more data. 
- For example, the merge-receiver should not be managing the receive 
queues; it should only advertise the pattern of data consumption and let the 
exchange fabric figure out the rest. 

The main difference in the two approaches, is that essentially, you are 
preaching for Receive and Send operators to control all aspects of 
communication whereas I am preaching for decoupling such aspects.


---

[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...

2018-04-27 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1237
  
That was not my intention as my current change aimed at describing the 
system the way it is. 

@parthchandra, any feedback?


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-27 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184728292
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws 
IOException {
 }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+RawFragmentBatch batch;
+try {
+  stats.startWait();
+  batch = getNextBatch();
+
+  // skip over empty batches. we do this since these are basically 
control messages.
+  while (batch != null && batch.getHeader().getDef().getRecordCount() 
== 0
--- End diff --

Ignore this comment as I thought you were releasing the returned batch.


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-27 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184747702
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/memory/AllocationManager.java
 ---
@@ -253,10 +261,12 @@ public boolean transferBalance(final BufferLedger 
target) {
   target.historicalLog.recordEvent("incoming(from %s)", 
owningLedger.allocator.name);
 }
 
-boolean overlimit = target.allocator.forceAllocate(size);
+// Release first to handle the case where the current and target 
allocators were part of the same
+// parent / child tree.
 allocator.releaseBytes(size);
+boolean allocationFit = target.allocator.forceAllocate(size);
--- End diff --

- The change of order is an optimization for a parent / child relationship 
as if we don't release first, then we could unnecessarily go over the memory 
budget (double counting).
- The force-alloc() / free() failures should never happen on normal 
conditions; when they do, the best thing to do is to exit. I still prefer not 
to promote the target allocator till it is 100% successful.




---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-27 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184727914
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws 
IOException {
 }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+RawFragmentBatch batch;
+try {
+  stats.startWait();
--- End diff --

Ok good point, as I have seen both practices being done within the Drill 
code. Though, I don't think this is a big deal as I don't see startWait() 
failing as it merely invokes nano time.


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-27 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184730050
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RawFragmentBatch.java 
---
@@ -77,4 +83,46 @@ public long getByteCount() {
   public boolean isAckSent() {
 return ackSent.get();
   }
+
+  /**
+   * Transfer ownership of this DrillBuf to the target allocator. This is 
done for better memory
+   * accounting (that is, the operator should be charged with the body's 
Drillbuf memory).
+   *
+   * NOTES -
+   * 
+   * This operation is a NOOP when a) the current allocator 
(associated with the DrillBuf) is not the
+   * owning allocator or b) the target allocator is already the owner
+   * When transfer happens, a new RawFragmentBatch instance is 
allocated; this is done for proper
+   * DrillBuf reference count accounting
+   * The RPC handling code caches a reference to this RawFragmentBatch 
object instance; release()
+   * calls should be routed to the previous DrillBuf
+   * 
+   *
+   * @param targetAllocator target allocator
+   * @return a new {@link RawFragmentBatch} object instance on success 
(where the buffer ownership has
+   * been switched to the target allocator); otherwise this 
operation is a NOOP (current instance
+   * returned)
+   */
+  public RawFragmentBatch transferBodyOwnership(BufferAllocator 
targetAllocator) {
+if (body == null) {
+  return this; // NOOP
+}
+
+if (!body.getLedger().isOwningLedger()
+ || body.getLedger().isOwner(targetAllocator)) {
+
+  return this;
+}
+
+int writerIndex   = body.writerIndex();
+TransferResult transferResult = 
body.transferOwnership(targetAllocator);
+
+// Set the index and increment reference count
+transferResult.buffer.writerIndex(writerIndex);
+
+// Clear the current Drillbuffer since caller will perform release() 
on the new one
+body.release();
+
+return new RawFragmentBatch(getHeader(), transferResult.buffer, 
getSender(), false);
--- End diff --

We can take up such an enhancement as as part of another JIRA as any 
changes within the RPC layer have to be thoroughly tested.


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-27 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184726839
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -201,6 +208,11 @@ public IterOutcome next() {
   context.getExecutorState().fail(ex);
   return IterOutcome.STOP;
 } finally {
+
+  if (batch != null) {
+batch.release();
+batch = null;
--- End diff --

The point of this pattern is that if you would like to continue using this 
object then be prepared to know what can and what cannot be used.


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-27 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184807153
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws 
IOException {
 }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+RawFragmentBatch batch;
+try {
+  stats.startWait();
--- End diff --

I see; I will then fix any such occurrences when opportunity presents 
itself as I have seen both patterns in the Drill code base.


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184572864
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws 
IOException {
 }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
--- End diff --

why is that? this method is used once and doing so will make us duplicate 
exception handling code. I would appreciate if you provide reason(s) when you 
make suggestions to avoid the back and forth.


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184574405
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws 
IOException {
 }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+RawFragmentBatch batch;
+try {
+  stats.startWait();
+  batch = getNextBatch();
+
+  // skip over empty batches. we do this since these are basically 
control messages.
+  while (batch != null && batch.getHeader().getDef().getRecordCount() 
== 0
--- End diff --

Sure. 

FYI - We cannot release the batch within this method.


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184573646
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws 
IOException {
 }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+RawFragmentBatch batch;
+try {
+  stats.startWait();
--- End diff --

Sure, but does it really matter or is it your personal preference?


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184575686
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws 
IOException {
 }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+RawFragmentBatch batch;
+try {
+  stats.startWait();
+  batch = getNextBatch();
+
+  // skip over empty batches. we do this since these are basically 
control messages.
+  while (batch != null && batch.getHeader().getDef().getRecordCount() 
== 0
+  && (!first || batch.getHeader().getDef().getFieldCount() == 0)) {
+batch = getNextBatch();
+  }
+} finally {
+  stats.stopWait();
+}
+return batch;
+  }
+
   @Override
   public IterOutcome next() {
 batchLoader.resetRecordCount();
 stats.startProcessing();
-try{
-  RawFragmentBatch batch;
-  try {
-stats.startWait();
-batch = getNextBatch();
 
-// skip over empty batches. we do this since these are basically 
control messages.
-while (batch != null && 
batch.getHeader().getDef().getRecordCount() == 0
-&& (!first || batch.getHeader().getDef().getFieldCount() == 
0)) {
-  batch = getNextBatch();
-}
-  } finally {
-stats.stopWait();
-  }
+RawFragmentBatch batch = null;
+try {
 
+  batch = getNextNotEmptyBatch();
   first = false;
 
   if (batch == null) {
--- End diff --

Sure.


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184574622
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -201,6 +208,11 @@ public IterOutcome next() {
   context.getExecutorState().fail(ex);
   return IterOutcome.STOP;
 } finally {
+
+  if (batch != null) {
+batch.release();
+batch = null;
--- End diff --

This pattern ensures that after release the batch cannot be accessed 
anymore (e.g., additions to the current code).


---

[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...

2018-04-26 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1237
  
@vrozov, your observation is valid, we need more JIRAs to fix the reporting 
problem
**Current Fix**
- At this time, the UnorderedReceiver didn't account for any consumed memory
- This fix, taxes the operator only when it consumes buffers

**Potential Enhancements**
Solution I
- Create a new fragment child Allocator which will own the received (not 
yet consumed) batches
- Improve the UI to report this allocator size
- This solution is simple and complementary to the current work

Solution II
- Have the receiver operator drain the batch queue
- Essentially, the receiver will have a private queue from where to consume 
batches
- I personally don't like this solution as it prevents us from improving 
the Drill network protocol
- Draining the batches for the sake of reporting will make it harder for 
the network layer to prefetch batches in an optimal manner; the queue size 
should be an indicator on how many pending batches there are. 


---

[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-19 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145746250
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end-out.start)) {
+out.asciiMode = 1; // we can conclude this string is ASCII
--- End diff --

will do!


---

[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...

2018-01-09 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1060
  
Before I reply to the provided comments I want first to thank both Parth 
and Paul for taking time to review this Pull Request.
@parthchandra Regarding the High Level Comments

- FS Access Manager
The attached document listed several possible improvements in terms of CPU, 
Partitioning (changes to the query plan), and IO layer. I didn't want to club 
all the changes in a single pull request; instead I started with the Pull 
Request with the highest priority (in terms of performance improvement) which 
is CPU related enhancements.

- Drill Memory Package
I understand where you're coming from; I have analyzed this task and came 
up with a solution which I felt was the cleanest. Let me take you through my 
thinking process and the alternative solutions that I have considered:
a) Why did I use this low level APIs?
The Bulk Parquet reader uses the Drill Buffer APIs to copy data chunks to 
the CPU cache; this means that memory sanity checks are always performed. 
During code profiling I have noticed the Hotspot was using too many 
instructions (byte based assembly was used); I have researched ways to improve 
this and found some Hadoop and Compression java code which utilized 64bit 
access instructions when manipulating byte arrays (e.g., 
[Snappy](https://github.com/airlift/aircompressor/blob/master/src/main/java/io/airlift/compress/snappy/SnappyRawCompressor.java)).
  This makes a huge difference

b) Safety
As indicated previously, native data is always accessed through the Drill 
Buffer APIs (that is, accessed data is sanitized). During bulk processing, the 
new utility has additional checks which are triggered when assertions are 
enabled. This means all test-suite and QA tests (which run with assertions On) 
will be able to catch any faulty memory access; note also the checks are 
simpler since they only pertain to byte arrays boundary checks.

c) Abstraction
I agree that the Drill code base should follow a common code path when 
accessing native memory (minimize chances of introducing bugs), though I felt 
the Drill Memory package was the main abstraction since it was centralized and 
thus easier to test.

d) Alternatives
I have considered using the Netty io.netty.util.internal.PlatformDependent0 
to access the Java intrinsics APIs though this class has a package scope and 
the wrapper public class io.netty.util.internal.PlatformDependent doesn't 
expose this low level functionality. Thus the work around would have been to 
create MemoryUtils class under the io.netty.util.internal package. I didn't 
think this was a cleaner approach than just using Java's intrinsic APIs 
directly (note also that Java9 will expose such intrinsics under a public JDK 
package instead of the sun.misc. Please let me know if you rather prefer the 
io.netty.util.internal route instead.

- Changes to the Vector Code
I had few discussions with Paul about the need to reconcile DRILL-5657; the 
agreement was to proceed with the current implementation and then apply 
refactoring (future releases) as my changes are backward compatible (new public 
APIs added). 

- Arrow Integration
I will create a task to analyze the impact of the Arrow Integration on the 
Parquet Reader.

- UserBitShared.proto
This class was updated to deal with the implicit columns optimization. I 
have added two fields to the NullableVarChar class to capture the logical 
number of rows and the fact this feature was active. This allowed me to use 
different Accessor and Mutator implementation. You and I had a discussion about 
this and the agreement was to:
a) The new optimization is configurable (current default is off); though we 
might change the default at some point
b) Detect C++ based clients and turn of this optimization (that is, the 
implicit columns enhancements); I didn't implement this task yet as I forgot 
the name of the property that you have indicated (this should be quite fast to 
implement though)

- Testing
Yes, QA helped me to run the test-suite with the two new options enabled. 
The new changes didn't modify any of the encoding/compression logic but I'll be 
happy to add any test(s) you deem necessary.. 

- Code Style
Thank you for the pointers; I'll stick to your feedback in future changes 
to ease the review process.





---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-09 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r160514755
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VLAbstractEntryReader.java
 ---
@@ -0,0 +1,215 @@

+/***
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ 
**/
+package org.apache.drill.exec.store.parquet.columnreaders;
+
+import java.nio.ByteBuffer;
+
+import 
org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.ColumnPrecisionInfo;
+import 
org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.PageDataInfo;
+import org.apache.drill.exec.util.MemoryUtils;
+
+/** Abstract class for sub-classes implementing several algorithms for 
loading a Bulk Entry */
+abstract class VLAbstractEntryReader {
+
+  /** byte buffer used for buffering page data */
+  protected final ByteBuffer buffer;
+  /** Page Data Information */
+  protected final PageDataInfo pageInfo;
+  /** expected precision type: fixed or variable length */
+  protected final ColumnPrecisionInfo columnPrecInfo;
+  /** Bulk entry */
+  protected final VLColumnBulkEntry entry;
+
+  /**
+   * CTOR.
+   * @param _buffer byte buffer for data buffering (within CPU cache)
+   * @param _pageInfo page being processed information
+   * @param _columnPrecInfo column precision information
+   * @param _entry reusable bulk entry object
+   */
+  VLAbstractEntryReader(ByteBuffer _buffer,
+PageDataInfo _pageInfo,
+ColumnPrecisionInfo _columnPrecInfo,
+VLColumnBulkEntry _entry) {
+
+this.buffer = _buffer;
+this.pageInfo   = _pageInfo;
+this.columnPrecInfo = _columnPrecInfo;
+this.entry  = _entry;
+  }
+
+  /**
+   * @param valuesToRead maximum values to read within the current page
+   * @return a bulk entry object
+   */
+  abstract VLColumnBulkEntry getEntry(int valsToReadWithinPage);
+
+  /**
+   * Indicates whether to use bulk processing
+   */
+  protected final boolean bulkProcess() {
+return columnPrecInfo.bulkProcess;
+  }
+
+  /**
+   * Loads new data into the buffer if empty or the force flag is set.
+   *
+   * @param force flag to force loading new data into the buffer
+   */
+  protected final boolean load(boolean force) {
+
+if (!force && buffer.hasRemaining()) {
+  return true; // NOOP
+}
+
+// We're here either because the buffer is empty or we want to force a 
new load operation.
+// In the case of force, there might be unprocessed data (still in the 
buffer) which is fine
+// since the caller updates the page data buffer's offset only for the 
data it has consumed; this
+// means unread data will be loaded again but this time will be 
positioned in the beginning of the
+// buffer. This can happen only for the last entry in the buffer when 
either of its length or value
+// is incomplete.
+buffer.clear();
+
+int remaining = remainingPageData();
+int toCopy= remaining > buffer.capacity() ? buffer.capacity() : 
remaining;
+
+if (toCopy == 0) {
+  return false;
+}
+
+pageInfo.pageData.getBytes(pageInfo.pageDataOff, buffer.array(), 
buffer.position(), toCopy);
--- End diff --

@parthchandra I know this is counter intuitive but this path makes it 
faster and I had VTune to prove it; let me explain it:
- The logic is to fetch small chunks of data from main memory to the L1 
cache (using 4k buffers) for processing
- This logic doesn't introduce any extra memory bandwidth penalty since in 
all cases data has to move from main memory to the CPU and back to main memory 
(if modified)
- Note that the CPU

[GitHub] drill pull request #1087: Attempt to fix memory leak in Parquet

2018-01-09 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1087

Attempt to fix memory leak in Parquet

** Problem Description **
This is an extremely rare leak which I was able to emulate by putting a 
sleep in the AsyncPageReader right after reading the page and before enqueue in 
the result queue. This is how this issue could manifest itself in real life 
scenario:
- AsyncPageReader reads a page into a buffer but didn't enqueue yet the 
result (thread got preempted)
- Parquet Scan thread blocked waiting on the task (Future object dequeued)
- Cancel received and Scan thread interrupted 
- Future.get() returns (Future object is lost)
- Scan thread executes release logic
- Scan thread is not able to interrupt the AsyncPageReader thread since the 
future object is lost
-  AsyncPageReader thread resumes and enqueues the DrillBuf in the result 
queue
- This results in a leak since this buffer is not properly released

** Fix Description **
- The fix is straightforward as we peek the Future object during the 
blocking get() method
- This way, an exception (such as an interrupt) will leave the Future 
object in the task queue
- The cleanup logic will be able to guarantee the DrillBuf object is either 
GCed by the AsyncPageReader or ParquetScan thread

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-6079

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1087


commit 52030d1d9cc3b8992a10ade8c7126d66e785043a
Author: Salim Achouche <sachouche2@...>
Date:   2017-12-22T19:50:56Z

Attempt to fix memory leak in Parquet




---

[GitHub] drill issue #1087: Attempt to fix memory leak in Parquet

2018-01-09 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1087
  
@parthchandra can you please review this fix?


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-11 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r161028252
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java 
---
@@ -471,8 +519,8 @@ public void close() throws Exception {
 container.clear();
 mutator.clear();
 if (currentReader != null) {
-  currentReader.close();
-}
+currentReader.close();
+  }
--- End diff --

done.


---

[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2018-01-10 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
Created another pull request #1072to merge my changes with the one done 
with Padma's.


---

[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2018-01-10 Thread sachouche

Github user sachouche closed the pull request at:

https://github.com/apache/drill/pull/1001


---

[GitHub] drill issue #1087: Attempt to fix memory leak in Parquet

2018-01-10 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1087
  
Thank you Arina for catching this; I created the commit for QA before 
vacation so that they could verify the fix. At that time, I didn't have an 
Apache JIRA. I have now updated the comment to reflect the JIRA id DRILL-6079 
which is also the name of this remote branch.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r160763490
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r160753323
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r160765882
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestSqlPatterns.java
 ---
@@ -446,6 +446,61 @@ public void testSqlPatternComplex() {
 assertEquals(1, sqlPatternComplex.match(0, byteBuffer.limit(), 
drillBuf)); // should match
   }
 
+  @Test
+  public void testSqlPatternContainsMUltipleMatchers() {
--- End diff --

Paul, the test-suite already has many other tests for SQL matching. But I 
agree, they now might hit only few of these matcher. I will add more tests.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r160755097
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r160758370
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r160764681
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r160759375
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-10 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r160754694
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
--- End diff --

Good point Paul!


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-17 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r162121115
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
--- End diff --

renamed all such variables to XXXPatternBytes


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-17 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r162119757
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
--- End diff --

cut-and-paste issue :-)


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-17 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r162120115
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
--- End diff --

removed.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-17 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r162123690
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length two */
+  private final class MatcherTwo extends MatcherFcn {
+
+private MatcherTwo() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
--- End diff --

Good idea; done this for all the matchers.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-17 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r162124388
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length two */
+  private final class MatcherTwo extends MatcherFcn {
+
+private MatcherTwo() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
 
-final int txtLength = end - start;
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
+}
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length three */
+  private final class MatcherT

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-17 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r162124359
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length two */
+  private final class MatcherTwo extends MatcherFcn {
+
+private MatcherTwo() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
 
-final int txtLength = end - start;
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
+}
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length three */
+  private final class MatcherT

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-21 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r162828788
  
--- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java 
---
@@ -68,96 +85,441 @@
 
   private final UInt1Vector bits = new UInt1Vector(bitsField, allocator);
   private final ${valuesName} values = new ${minor.class}Vector(field, 
allocator);
+  private final Mutator mutator  = new MutatorImpl();
+  private final Accessor accessor= new AccessorImpl();
+
+  <#if type.major == "VarLen" && minor.class == "VarChar">
+  private final Mutator dupMutator   = new DupValsOnlyMutator();
+  /** Accessor instance for duplicate values vector */
+  private final Accessor dupAccessor = new DupValsOnlyAccessor();
+  /** Optimization for cases where all values are identical */
+  private boolean duplicateValuesOnly;
+  /** logical number of values */
+  private int logicalNumValues;
+  /** logical value capacity */
+  private int logicalValueCapacity;
+  /** Mutator instance for duplicate values vector */
+
+  /** true if this vector holds the same value albeit repeated */
+  public boolean isDuplicateValsOnly() {
--- End diff --

Tried that (that was my first attempt) but the main issue is that the Drill 
factory creates vectors based on the column metadata alone. This design has the 
advantage of enabling / rolling-back optimizations transparently from the 
consumer. I also made sure there is no performance penalty (or at least 
minimal).


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-21 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r162828863
  
--- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java 
---
@@ -68,96 +85,441 @@
 
   private final UInt1Vector bits = new UInt1Vector(bitsField, allocator);
   private final ${valuesName} values = new ${minor.class}Vector(field, 
allocator);
+  private final Mutator mutator  = new MutatorImpl();
+  private final Accessor accessor= new AccessorImpl();
+
+  <#if type.major == "VarLen" && minor.class == "VarChar">
+  private final Mutator dupMutator   = new DupValsOnlyMutator();
+  /** Accessor instance for duplicate values vector */
+  private final Accessor dupAccessor = new DupValsOnlyAccessor();
+  /** Optimization for cases where all values are identical */
+  private boolean duplicateValuesOnly;
+  /** logical number of values */
+  private int logicalNumValues;
+  /** logical value capacity */
+  private int logicalValueCapacity;
+  /** Mutator instance for duplicate values vector */
+
+  /** true if this vector holds the same value albeit repeated */
+  public boolean isDuplicateValsOnly() {
+return duplicateValuesOnly;
+  }
 
-  private final Mutator mutator = new Mutator();
-  private final Accessor accessor = new Accessor();
+  /**
+   * Sets this vector duplicate values mode; the {@link #clear()} method 
wil also be called as a side effect
+   *  of this operation
+   */
+  public void setDuplicateValsOnly(boolean valsOnly) {
+clear();
+duplicateValuesOnly = valsOnly;
+  }
 
-  public ${className}(MaterializedField field, BufferAllocator allocator) {
-super(field, allocator);
+  /** {@inheritDoc} */
+  @Override
+  public int getValueCapacity(){
+if (!isDuplicateValsOnly()) {
+return Math.min(bits.getValueCapacity(), values.getValueCapacity());
+  }
+return logicalValueCapacity;
   }
 
+  /** {@inheritDoc} */
   @Override
-  public FieldReader getReader(){
-return reader;
+  public void close() {
+bits.close();
+values.close();
+super.close();
   }
 
+  /** {@inheritDoc} */
   @Override
-  public int getValueCapacity(){
-return Math.min(bits.getValueCapacity(), values.getValueCapacity());
+  public void clear() {
+bits.clear();
+values.clear();
+super.clear();
+if (isDuplicateValsOnly()) {
+  logicalNumValues = 0;
+  logicalValueCapacity = 0;
+  duplicateValuesOnly  = false;
+}
   }
 
+  /** {@inheritDoc} */
   @Override
-  public DrillBuf[] getBuffers(boolean clear) {
-final DrillBuf[] buffers = ObjectArrays.concat(bits.getBuffers(false), 
values.getBuffers(false), DrillBuf.class);
-if (clear) {
-  for (final DrillBuf buffer:buffers) {
-buffer.retain(1);
+  public int getBufferSizeFor(final int valueCount) {
+assert valueCount >= 0;
+
+if (valueCount == 0) {
+  return 0;
+}
+if (!isDuplicateValsOnly()) {
+  return values.getBufferSizeFor(valueCount) + 
bits.getBufferSizeFor(valueCount);
+}
+return values.getBufferSizeFor(1) + bits.getBufferSizeFor(1);
+  }
+
+  /** {@inheritDoc} */
+  @Override
+  public ${valuesName} getValuesVector() {
+if (!isDuplicateValsOnly()) {
+  return values;
+}
+throw new UnsupportedOperationException();
+  }
+
+  /** {@inheritDoc} */
+  @Override
+  public void setInitialCapacity(int numRecords) {
+assert numRecords >= 0;
+if (!isDuplicateValsOnly()) {
+  bits.setInitialCapacity(numRecords);
+  values.setInitialCapacity(numRecords);
+} else {
+  bits.setInitialCapacity(numRecords > 0 ? 1 : 0);
+  values.setInitialCapacity(numRecords > 0 ? 1 : 0);
+  logicalValueCapacity = numRecords;
+}
+  }
+
+  /** {@inheritDoc} */
+  @Override
+  public Accessor getAccessor() {
+if (!isDuplicateValsOnly()) {
+  return accessor;
+}
+return dupAccessor;
+  }
+
+  /** {@inheritDoc} */
+  @Override
+  public Mutator getMutator() {
+if (!isDuplicateValsOnly()) {
+  return mutator;
+}
+return dupMutator;
+  }
+
+  public void copyFrom(int fromIndex, int thisIndex, 
Nullable${minor.class}Vector from) {
+final Accessor fromAccessor = from.getAccessor();
+if (isDuplicateValsOnly()) {
+  // We currently don't have a use-case where a target dup-vector is 
provisioned by a non-dup source vector
+  assert isD

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-21 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r162828312
  
--- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java 
---
@@ -51,6 +57,17 @@
 public final class ${className} extends BaseDataValueVector implements 
<#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, 
NullableVector {
   private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(${className}.class);
 
+  /**
+   * Optimization to set contiguous values nullable state in a bulk 
manner; cannot define this array
+   * within the Mutator class as Java doesn't allow static initialization 
within a non static inner class.
+   */
+  private static final int DEFINED_VALUES_ARRAY_LEN = 1 << 10;
+  private static final byte[] DEFINED_VALUES_ARRAY  = new 
byte[DEFINED_VALUES_ARRAY_LEN];
--- End diff --

I rather have it within the sub-class which defines the associated 
functionality (that is nullable DT); the wasted memory is also negligible.


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-21 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r162828174
  
--- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java 
---
@@ -51,6 +57,17 @@
 public final class ${className} extends BaseDataValueVector implements 
<#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, 
NullableVector {
   private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(${className}.class);
 
+  /**
+   * Optimization to set contiguous values nullable state in a bulk 
manner; cannot define this array
+   * within the Mutator class as Java doesn't allow static initialization 
within a non static inner class.
+   */
+  private static final int DEFINED_VALUES_ARRAY_LEN = 1 << 10;
--- End diff --

1 << 10 is 1024; not sure how did you arrive at 10,000,000,000..


---

[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...

2018-01-12 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1060
  
@paul-rogers with regard to the design aspects that you brought up:
** Corrections about the proposed Design **
Your analysis somehow assumes the Vector is the one driving the loading 
operation. This is not the case, a) it is the reader which invokes the bulk 
save API b) the Vector save method runs a loop requesting the rest of the data. 
The reader can any time stop the loading phase; currently the reader uses the 
batch size to make this decision. The plan is to enhance this logic by having a 
feedback from the Vector save method about memory usage (e.g., save current row 
set with the condition the memory usage doesn't get X bytes). I believe I have 
raised this design decision early on and the agreement was to implement the 
memory batch restrictions in a next Drill release.

** Agile Development **
Through the years I came to appreciate the value of agile development. One 
needs to prioritize design activities; it is completely valid to start with a 
design (which is not guaranteed to be perfect) and then refine it overtime as 
long as the underlying implementation doesn't spill to other modules 
(encapsulation). This has the merit of showcasing incremental improvements and 
allowing the developer to get new insight as they have a better understanding.
   


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-11 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r161065237
  
--- Diff: protocol/src/main/protobuf/UserBitShared.proto ---
@@ -148,6 +148,8 @@ message SerializedField {
   optional int32 value_count = 4;
   optional int32 var_byte_length = 5;
   optional int32 buffer_length = 7;
+  optional bool is_dup = 8;
+  optional int32 logical_value_count = 9;
--- End diff --

Just had an offline conversation with Paul and Kunal. The agreement is that 
we should introduce a versioning mechanism at the protocol level so that 
clients could advertise their version identifier; this way the server can use 
this information to turn on / off features based on the client capabilities. 
The task is only to create the preliminary versioning infrastructure; more 
sophistication can be added later on. 


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-11 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r161045748
  
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java 
---
@@ -386,7 +391,7 @@ public void reAlloc() {
   throw new OversizedAllocationException("Unable to expand the buffer. 
Max allowed buffer size is reached.");
 }
 
-logger.trace("Reallocating VarChar, new size {}", newAllocationSize);
+logger.trace("Reallocating VarChar, new size {}",newAllocationSize);
--- End diff --

Fixed.


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-11 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r161039122
  
--- Diff: exec/vector/src/main/codegen/templates/FixedValueVectors.java ---
@@ -874,6 +880,46 @@ public void setSafe(int index, BigDecimal value) {
   set(index, value);
 }
 
+/**
+ * Copies the bulk input into this value vector and extends its 
capacity if necessary.
+ * @param input bulk input
+ */
+public  void setSafe(VLBulkInput input) {
+  setSafe(input, null);
+}
+
+/**
+ * Copies the bulk input into this value vector and extends its 
capacity if necessary. The callback
+ * mechanism allows decoration as caller is invoked for each bulk 
entry.
+ *
+ * @param input bulk input
+ * @param callback a bulk input callback object (optional)
+ */
+public  void setSafe(VLBulkInput input, 
VLBulkInput.BulkInputCallback callback) {
--- End diff --

This code is not Parquet specific. Instead, it can be triggered by any 
Reader which desires to load data in a bulk fashion. Vectors currently expose 
Mutator APIs for loading single values; I see no good reason which prevent us 
from passing bulk values instead of a single one at a time which prevent us 
from code optimization. Look at ByBuffer APIs they allow you to pass single 
byte values but also byte arrays.


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-11 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r161036145
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java ---
@@ -0,0 +1,186 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+import java.lang.reflect.Field;
+import java.nio.ByteOrder;
+
+import sun.misc.Unsafe;
+
+/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned 
platforms */
+@SuppressWarnings("restriction")
+public final class MemoryUtils {
+
+  // Ensure this is a little-endian hardware */
+  static {
+if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
+  throw new IllegalStateException("Drill only runs on LittleEndian 
systems.");
+}
+  }
+
+  /** Java's unsafe object */
+  private static Unsafe UNSAFE;
+
+  static {
+try {
+  Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
+  theUnsafe.setAccessible(true);
+  UNSAFE = (Unsafe) theUnsafe.get(null);
+}
+catch (Exception e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  /** Byte arrays offset */
+  private static final long BYTE_ARRAY_OFFSET = 
UNSAFE.arrayBaseOffset(byte[].class);
+
+  /** Number of bytes in a long */
+  public static final int LONG_NUM_BYTES  = 8;
+  /** Number of bytes in an int */
+  public static final int INT_NUM_BYTES   = 4;
+  /** Number of bytes in a short */
+  public static final int SHORT_NUM_BYTES = 2;
+

+//
+// APIs

+//
+
+  /**
+   * @param data source byte array
+   * @param index index within the byte array
+   * @return short value starting at data+index
+   */
+  public static short getShort(byte[] data, int index) {
--- End diff --

Please refer to my reply to Parth. I personally think the memory package 
should be our main abstraction (of course along with guidelines). Putting all 
memory access APIs within the DrillBuf class will only clutter that class.. I 
rather have a centralized memory utility class for advanced processing which do 
not involve a DrillBuf.


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-11 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r161040070
  
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java 
---
@@ -309,7 +314,7 @@ public void setInitialCapacity(final int valueCount) {
 if (size > MAX_ALLOCATION_SIZE) {
   throw new OversizedAllocationException("Requested amount of memory 
is more than max allowed allocation size");
 }
-allocationSizeInBytes = (int) size;
+allocationSizeInBytes = (int)size;
--- End diff --

Fixed.


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-11 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r161032779
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableColumnReader.java
 ---
@@ -165,17 +181,133 @@
   + "Run Length: {} \t Null Run Length: {} \t readCount: {} \t 
writeCount: {} \t "
   + "recordsReadInThisIteration: {} \t 
valuesReadInCurrentPass: {} \t "
   + "totalValuesRead: {} \t readStartInBytes: {} \t 
readLength: {} \t pageReader.byteLength: {} \t "
-  + "definitionLevelsRead: {} \t pageReader.currentPageCount: 
{}",
+  + "currPageValuesProcessed: {} \t 
pageReader.currentPageCount: {}",
   recordsToReadInThisPass, runLength, nullRunLength, readCount,
   writeCount, recordsReadInThisIteration, valuesReadInCurrentPass,
   totalValuesRead, readStartInBytes, readLength, 
pageReader.byteLength,
-  definitionLevelsRead, pageReader.currentPageCount);
+  currPageValuesProcessed, pageReader.currentPageCount);
 
 }
 
 valueVec.getMutator().setValueCount(valuesReadInCurrentPass);
   }
 
+  private final void processPagesBulk(long recordsToReadInThisPass) throws 
IOException {
+readStartInBytes   = 0;
+readLength = 0;
+readLengthInBits   = 0;
+recordsReadInThisIteration = 0;
+vectorData = castedBaseVector.getBuffer();
+
+// values need to be spaced out where nulls appear in the column
+// leaving blank space for nulls allows for random access to values
+// to optimize copying data out of the buffered disk stream, runs of 
defined values
+// are located and copied together, rather than copying individual 
values
+
+int valueCount   = 0;
+final int maxValuesToProcess = Math.min((int) recordsToReadInThisPass, 
valueVec.getValueCapacity());
+
+// To handle the case where the page has been already loaded
+if (pageReader.definitionLevels != null && currPageValuesProcessed == 
0) {
+  definitionLevelWrapper.set(pageReader.definitionLevels, 
pageReader.currentPageCount);
+}
+
+while (valueCount < maxValuesToProcess) {
+
+  // read a page if needed
+  if (!pageReader.hasPage() || (currPageValuesProcessed == 
pageReader.currentPageCount)) {
+if (!pageReader.next()) {
+  break;
+}
+
+//New page. Reset the definition level.
+currPageValuesProcessed= 0;
+recordsReadInThisIteration = 0;
+readStartInBytes   = 0;
+
+// Update the Definition Level reader
+definitionLevelWrapper.set(pageReader.definitionLevels, 
pageReader.currentPageCount);
+  }
+
+  definitionLevelWrapper.readFirstIntegerIfNeeded();
+
+  int numNullValues   = 0;
+  int numNonNullValues= 0;
+  final int remaining = maxValuesToProcess - valueCount;
+  int currBatchSz = Math.min(remaining, 
(pageReader.currentPageCount - currPageValuesProcessed));
+  assert currBatchSz > 0;
+
+  // Let's skip the next run of nulls if any ...
+  int idx;
+  for (idx = 0; idx < currBatchSz; ++idx) {
+if (definitionLevelWrapper.readCurrInteger() == 1) {
+  break; // non-value encountered
+}
+definitionLevelWrapper.nextIntegerIfNotEOF();
+  }
+  numNullValues += idx;
+
+  // Write the nulls if any
--- End diff --

This is the original logic for processing fixed precision columns (except 
more efficient); the intent is to figure out how many null values there are 
before processing non-null values. We could have avoided this for-loop if we 
controlled the Parquet ValuesReader API; I was tempted to do that but decided 
to prioritize my tasks. The good news is that I recorded all such improvement 
opportunities and will try to tackle them at some point. 


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-11 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r161030963
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java
 ---
@@ -449,7 +451,7 @@ public void close() throws Exception {
 // been created. Gracefully handle that case.
 
 if (options != null) {
-  options.close();
-}
+options.close();
   }
 }
+}
--- End diff --

The issue was with the close() method which was not properly indented. 
Fixed it.


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-29 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164532290
  
--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
@@ -703,7 +703,18 @@ protected void _setLong(int index, long value) {
 
   @Override
   public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int 
length) {
-udle.getBytes(index + offset, dst, dstIndex, length);
+final int BULK_COPY_THR = 1024;
--- End diff --

You are right, I'll put more information on this optimization:
- During code profiling, I have noticed that getBytes() doesn't perform 
well when called with small length (lower than 1k)
- Its throughput improves as the length increases

Analysis :
- Java exposes two intrinsics for writing to direct memory: putByte and 
copyMemory
- The JVM is able to inline memory access (no function call) for putByte 
- copyMemory is a bulk API and this internally invokes libc memcpy 
(requires function call)
- The rational is that we are willing to incur a function call if the 
associated processing is larger than the overhead; this is almost never the 
case for small memory accesses. 


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-29 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164525752
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java ---
@@ -0,0 +1,186 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+import java.lang.reflect.Field;
+import java.nio.ByteOrder;
+
+import sun.misc.Unsafe;
+
+/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned 
platforms */
+@SuppressWarnings("restriction")
+public final class MemoryUtils {
+
+  // Ensure this is a little-endian hardware */
+  static {
+if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
+  throw new IllegalStateException("Drill only runs on LittleEndian 
systems.");
+}
+  }
+
+  /** Java's unsafe object */
+  private static Unsafe UNSAFE;
+
+  static {
+try {
+  Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
+  theUnsafe.setAccessible(true);
+  UNSAFE = (Unsafe) theUnsafe.get(null);
+}
+catch (Exception e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  /** Byte arrays offset */
+  private static final long BYTE_ARRAY_OFFSET = 
UNSAFE.arrayBaseOffset(byte[].class);
+
+  /** Number of bytes in a long */
+  public static final int LONG_NUM_BYTES  = 8;
+  /** Number of bytes in an int */
+  public static final int INT_NUM_BYTES   = 4;
+  /** Number of bytes in a short */
+  public static final int SHORT_NUM_BYTES = 2;
+

+//
+// APIs

+//
+
+  /**
+   * @param data source byte array
+   * @param index index within the byte array
+   * @return short value starting at data+index
+   */
+  public static short getShort(byte[] data, int index) {
--- End diff --

Vlad,

Can you please clarify your point? Are you suggesting that every memory 
access has to be via DrillBuf even if the data is loaded in Java heap and the 
developer is performing an optimization. If this is your goal, then I would 
think the danger is that the DrillBuf class will be cluttered and contracts 
hard to understand:
- Notice that DrillBuf performs expensive sanity checks (boundary and 
reference count checks); this is due to Drill memory optimizations for 
minimizing the overall memory usage.
- When the code is processing already load data, these checks are not 
applicable anymore as this data is private.
- I could have easily moved the new APIs there but I strongly felt this was 
inappropriate.. 


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-01-29 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r164509658
  
--- Diff: 
exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java ---
@@ -0,0 +1,186 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+import java.lang.reflect.Field;
+import java.nio.ByteOrder;
+
+import sun.misc.Unsafe;
+
+/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned 
platforms */
+@SuppressWarnings("restriction")
+public final class MemoryUtils {
+
+  // Ensure this is a little-endian hardware */
+  static {
+if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
+  throw new IllegalStateException("Drill only runs on LittleEndian 
systems.");
+}
+  }
+
+  /** Java's unsafe object */
+  private static Unsafe UNSAFE;
+
+  static {
+try {
+  Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
+  theUnsafe.setAccessible(true);
+  UNSAFE = (Unsafe) theUnsafe.get(null);
+}
+catch (Exception e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  /** Byte arrays offset */
+  private static final long BYTE_ARRAY_OFFSET = 
UNSAFE.arrayBaseOffset(byte[].class);
+
+  /** Number of bytes in a long */
+  public static final int LONG_NUM_BYTES  = 8;
+  /** Number of bytes in an int */
+  public static final int INT_NUM_BYTES   = 4;
+  /** Number of bytes in a short */
+  public static final int SHORT_NUM_BYTES = 2;
+

+//
+// APIs

+//
+
+  /**
+   * @param data source byte array
+   * @param index index within the byte array
+   * @return short value starting at data+index
+   */
+  public static short getShort(byte[] data, int index) {
--- End diff --

Vlad,

- The reader has loaded data from direct memory into CPU level-1 cache 
through DrillBuf
- The data is stored (4k at a time) in a Java byte[]
- The only reason for using MemoryUtil is to perform 64bit processing (I 
have previously indicated that many other processing intensive applications 
such as Snappy and Hadoop Comparison use the same pattern)
- I solely added this logic after noticing the Hotspot was generating 8bit 
instructions 

Again, the intent is not to access data from DrillBuf (the data has been 
already loaded into CPU cache); this is mainly an optimization step.  


---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-02-02 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r165698800
  
--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
@@ -703,7 +703,18 @@ protected void _setLong(int index, long value) {
 
   @Override
   public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int 
length) {
-udle.getBytes(index + offset, dst, dstIndex, length);
+final int BULK_COPY_THR = 1024;
--- End diff --

Vlad,

- I had a chat with @bitblender and he explains that Java was invoking a 
stub (not a function call) to perform copyMemory; he agreed copyMemory will be 
slower for small buffers and the task was to determine the cutoff point
- My tests (I will send you my test) indicate that a length of 1024bytes is 
the length were copyMemory starts performing exactly as getByte()

NOTE - I am using JRE 1.8; static buffers initialized once; payload 1MB 
(1048576bytes) and loop-count of 102400; MacOS High Sierra; 1 thread, 4GB MX, MS


---

[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...

2018-02-01 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1106
  
Thanks @paul-rogers for the information; I went through the PR and noticed 
the following:

- BatchSchema invokes MaterializedField.isEquivalent()
- With my fix, both methods consider nested columns but they have several 
differences

1) RecordBatchLoader requires sameness as this information is used to reuse 
the value vectors; if old and new batch are deemed same, then the value vectors 
are reloaded using the load(...) API. The metadata better be the same or a 
runtime exception will occur

2) RecordBatchLoader isSame(...) API compares two different java objects: 
SerializedField (obtained from protobufs) and already materialized value 
vectors MaterializedField

3) RecordBatchLoader isSame(...) API tolerates unordered fields (within the 
same level) but not MaterializedField.isEquivalent() method

4) MaterializedField.isEquivalent() ignores hidden columns such "$bits" and 
"$offsets" but not RecordBatchLoader isSame(...)

I think moving forward, the best way to prevent bugs with regard to schema 
changes is by maintaining a document that establishes all the rules. This will 
allow QA to refine their tests and catch current limitations.  



---

[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...

2018-02-01 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1106
  
Thank you Aman and Paul for reviewing this JIRA. I have created a new JIRA 
[DRILL-6131](https://issues.apache.org/jira/browse/DRILL-6131) to track the new 
SchemaComparator utility for sharing schema comparison logic.


---

[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...

2018-02-01 Thread sachouche

Github user sachouche closed the pull request at:

https://github.com/apache/drill/pull/1106


---

[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...

2018-02-01 Thread sachouche

GitHub user sachouche reopened a pull request:

https://github.com/apache/drill/pull/1106

DRILL-6129: Fixed query failure due to nested column data type change

Problem Description -
- The Drillbit was able to successfully send batches containing different 
metadata (for nested columns)
- This was the case when one or multiple scanners were involved
- The issue happened within the client where value vectors are cached 
across batches
- The load(...) API is responsible for updating values vectors when a new 
batch arrives
- The RecordBatchLoader class is used to detect schema changes ; if this is 
the case, then previous value vectors are discarded and new ones created
- There is a bug with the current implementation where only first level 
columns are compared

Fix -
- The fix is to improve the schema diff logic by including nested columns

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-6129

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1106


commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4
Author: Salim Achouche <sachouche2@...>
Date:   2018-02-01T02:59:58Z

DRILL-6129: Fixed query failure due to nested column data type change




---

[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...

2018-01-31 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1106

DRILL-6129: Fixed query failure due to nested column data type change

Problem Description -
- The Drillbit was able to successfully send batches containing different 
metadata (for nested columns)
- This was the case when one or multiple scanners were involved
- The issue happened within the client where value vectors are cached 
across batches
- The load(...) API is responsible for updating values vectors when a new 
batch arrives
- The RecordBatchLoader class is used to detect schema changes ; if this is 
the case, then previous value vectors are discarded and new ones created
- There is a bug with the current implementation where only first level 
columns are compared

Fix -
- The fix is to improve the schema diff logic by including nested columns

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-6129

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1106


commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4
Author: Salim Achouche <sachouche2@...>
Date:   2018-02-01T02:59:58Z

DRILL-6129: Fixed query failure due to nested column data type change




---

[GitHub] drill issue #1011: Drill 1170: Drill-on-YARN

2018-02-05 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1011
  
+1
I have reviewed the code and overall looks good. My main feedback is that 
the current implementation doesn't currently support secure clusters (at least 
didn't see any logic associated with that). Yarn applications have issues 
staying up for a long time because of ticket renewal limitations. We might want 
to create another enhancement JIRA to support such use-cases.


---

[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...

2018-02-06 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1107#discussion_r166448594
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -77,7 +77,7 @@ private ExecConstants() {
   public static final String SPILL_DIRS = "drill.exec.spill.directories";
 
   public static final String OUTPUT_BATCH_SIZE = 
"drill.exec.memory.operator.output_batch_size";
-  public static final LongValidator OUTPUT_BATCH_SIZE_VALIDATOR = new 
RangeLongValidator(OUTPUT_BATCH_SIZE, 1024, 512 * 1024 * 1024);
+  public static final LongValidator OUTPUT_BATCH_SIZE_VALIDATOR = new 
RangeLongValidator(OUTPUT_BATCH_SIZE, 1, 512 * 1024 * 1024);
--- End diff --

Theoretically, 2Gb-1 should be the largest value as our value vectors use 
integers as offset values.


---

[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...

2018-02-06 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1107#discussion_r166450263
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java
 ---
@@ -102,20 +105,78 @@
   private final List comparators;
   private final JoinRelType joinType;
   private JoinWorker worker;
+  private final long outputBatchSize;
 
   private static final String LEFT_INPUT = "LEFT INPUT";
   private static final String RIGHT_INPUT = "RIGHT INPUT";
 
+  private class MergeJoinMemoryManager extends 
AbstractRecordBatchMemoryManager {
+private int leftRowWidth;
+private int rightRowWidth;
+
+/**
+ * mergejoin operates on one record at a time from the left and right 
batches
+ * using RecordIterator abstraction. We have a callback mechanism to 
get notified
+ * when new batch is loaded in record iterator.
+ * This can get called in the middle of current output batch we are 
building.
+ * when this gets called, adjust number of output rows for the current 
batch and
+ * update the value to be used for subsequent batches.
+ */
+@Override
+public void update(int inputIndex) {
+  switch(inputIndex) {
+case 0:
+  final RecordBatchSizer leftSizer = new RecordBatchSizer(left);
+  leftRowWidth = leftSizer.netRowWidth();
+  break;
+case 1:
+  final RecordBatchSizer rightSizer = new RecordBatchSizer(right);
+  rightRowWidth = rightSizer.netRowWidth();
+default:
+  break;
+  }
+
+  final int newOutgoingRowWidth = leftRowWidth + rightRowWidth;
+
+  // If outgoing row width is 0, just return. This is possible for 
empty batches or
+  // when first set of batches come with OK_NEW_SCHEMA and no data.
+  if (newOutgoingRowWidth == 0) {
+return;
+  }
+
+  // update the value to be used for next batch(es)
+  setOutputRowCount(Math.min(ValueVector.MAX_ROW_COUNT,
+
Math.max(RecordBatchSizer.safeDivide(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR,
 newOutgoingRowWidth), MIN_NUM_ROWS)));
--- End diff --

Our goal is to eventually use Paul's framework for controlling batch sizes. 
I feel the BatchRecordMemoryManager terminology is a bit ambitious (kind of 
misleading).


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158084806
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3

[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...

2018-03-15 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1168#discussion_r174870229
  
--- Diff: exec/jdbc-all/pom.xml ---
@@ -473,6 +473,8 @@
org/yaml/**
hello/**
webapps/**
+   **/org/apache/calcite/avatica/metrics/**
+   **/org/apache/calcite/avatica/org/**
--- End diff --

The Apache jar size has shrunk by ~1MB from over 34MB to 33393541 bytes; I 
guess we can decrease the limit from 35MB to 34MB.


---

[GitHub] drill issue #1168: DRILL-6246: Reduced the size of the jdbc-all jar file

2018-03-15 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1168
  
@sohami can you please review this pull request?

Thanks!


---

[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...

2018-03-15 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1168

DRILL-6246: Reduced the size of the jdbc-all jar file

- The jdbc-all client jar has been growing in size from version to the next 
(~20MB in version 1.10, ~27MB in version 1.12, and 34MB in version 1.13
- Note the exact size of the size depends on the maven profile used during 
compilation (e.g., "mapr" which have more excludes)
- Originally, I tried to exclude the following calcite/avatica packages 
(responsible for the recent size increase): metrics, proto, org/apache/, 
com/fasterxml, com/google but some tests failed as the calcite code was 
referencing some of them
- In this pull request, I have excluded the following packages: 
calcite/avatica/org and calcite/avatica/metrics

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-6246

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1168.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1168


commit 6b6188a6c1677595c479f67fc2e33af091c36818
Author: Salim Achouche <sachouche2@...>
Date:   2018-03-14T02:16:06Z

DRILL-6246: Reduced the size of the jdbc-all jar file




---

[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...

2018-03-15 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1168#discussion_r174889287
  
--- Diff: exec/jdbc-all/pom.xml ---
@@ -473,6 +473,8 @@
org/yaml/**
hello/**
webapps/**
+   **/org/apache/calcite/avatica/metrics/**
+   **/org/apache/calcite/avatica/org/**
--- End diff --

done!


---

[GitHub] drill issue #1170: DRILL-6223: Fixed several Drillbit failures due to schema...

2018-03-15 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1170
  
@amansinha100 can you please review this pull request?

Thanks!


---

[GitHub] drill pull request #1170: DRILL-6223: Fixed several Drillbit failures due to...

2018-03-15 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1170

DRILL-6223: Fixed several Drillbit failures due to schema changes

Fixed several Issues due to Schema changes:
1) Changes in complex data types
Drill Query Failing when selecting all columns from a Complex Nested Data 
File (Parquet) Set). There are differences in Schema among the files:

The Parquet files exhibit differences both at the first level and within 
nested data types
A select * will not cause an exception but using a limit clause will
Note also this issue seems to happen only when multiple Drillbit minor 
fragments are involved (concurrency higher than one)

2) Dangling columns (both simple and complex)
This situation can be easily reproduced for:
- Select STAR queries which involve input data with different schemas
- LIMIT or / and PROJECT operators are used
- The data will be read from more than one minor fragment
- This is because individual readers have logic to handle such use-cases 
but not downstream operators
- So is reader-1 sends one batch with F1, F2, and F3
- The reader-2 sends batch F2, F3
- Then the LIMIT and PROJECT operator will fail to cleanup the dangling 
column F1 which will cause failures when downstream operators copy logic 
attempts copy the stale column F1
- This pull request adds logic to detect and eliminate dangling columns   

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-6223

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1170.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1170


commit d986b6c7588c107bb7e49d2fc8eb3f25a60e1214
Author: Salim Achouche <sachouche2@...>
Date:   2018-02-21T02:17:14Z

DRILL-6223: Fixed several Drillbit failures due to schema changes




---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179296101
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/MemoryUtils.java
 ---
@@ -0,0 +1,199 @@
+/**
--- End diff --

Vlad, can you please explain what you meant? are you only referring for 
specific members or is it a general comment?



---

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179295602
  
--- Diff: exec/java-exec/pom.xml ---
@@ -836,6 +836,14 @@
 org.apache.maven.plugins
 maven-surefire-plugin
   
+  
+org.apache.maven.plugins
+maven-compiler-plugin
+2.3.2
+
+-XDignore.symbol.file
--- End diff --

Will do!


---

[GitHub] drill issue #1168: DRILL-6246: Reduced the size of the jdbc-all jar file

2018-04-10 Thread sachouche

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1168
  
No I didn't.

Regards,
Salim

Regards,
Salim

-Original Message-
From: Kunal Khatua [notificati...@github.com]
Received: Tuesday, 10 Apr 2018, 14:01
To: apache/drill [dr...@noreply.github.com]
CC: Salim Achouche [sachou...@mapr.com]; Mention 
[ment...@noreply.github.com]
Subject: Re: [apache/drill] DRILL-6246: Reduced the size of the jdbc-all 
jar file (#1168)



@sachouche<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_sachouche=DwMCaQ=cskdkSMqhcnjZxdQVpwTXg=Yz1mbrEFs0HJzxHJZ69JBLCsKtk9LmaS_d_ncP1HZwY=ZuRTiFKVp4XyNeUONWTmOtorxN8OgUnpJeF5Nnbzpi8=ORom6yiAaNx9O8pqbz4PEqbCngKTh3c4MO1H6btm4s4=>
 have you had a chance to test the generated JDBC driver with Spotfire/SQuirreL 
?

â
You are receiving this because you were mentioned.
Reply to this email directly, view it on 
GitHub<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_drill_pull_1168-23issuecomment-2D380245152=DwMCaQ=cskdkSMqhcnjZxdQVpwTXg=Yz1mbrEFs0HJzxHJZ69JBLCsKtk9LmaS_d_ncP1HZwY=ZuRTiFKVp4XyNeUONWTmOtorxN8OgUnpJeF5Nnbzpi8=Q3TGflgHce0_-5S2QAAdrJp8SQnAobk1W0ZFJwD50-Q=>,
 or mute the 
thread<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AckoK6qKn0CR0-2DszS5136weV3xs2MjZYks5tnR2UgaJpZM4Ssiu1=DwMCaQ=cskdkSMqhcnjZxdQVpwTXg=Yz1mbrEFs0HJzxHJZ69JBLCsKtk9LmaS_d_ncP1HZwY=ZuRTiFKVp4XyNeUONWTmOtorxN8OgUnpJeF5Nnbzpi8=xgHCT0e9lj785056Kkt2KgkhNR3pvO4wPNI_B3J5S54=>.



---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-24 Thread sachouche

Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r183898352
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -182,13 +184,18 @@ public IterOutcome next() {
 return IterOutcome.OUT_OF_MEMORY;
   }
 
+  // Transfer the ownership of this raw-batch to this operator for 
proper memory statistics reporting
+  batch = batch.transferBodyOwnership(oContext.getAllocator());
--- End diff --

@HanumathRao,

- This will not affect the outcome since the unordered-receiver is a child 
of the context allocator (the minor fragment)
- Doing the check earlier seemed cleaner as there is no point of changing 
ownership when there is already an out-of-memory condition
- I think the code was written this way since implicitly, the network 
handlers are receiving the batch and then performing a change of ownership 
(from RPC to context allocator); this step could lead to an out-of-memory 
condition at the fragment level
- If your point is about reporting, that is attributing the OOM condition 
to the child operator, then I guess I don't have a problem with that.
@parthchandra what do you think?
 


---

[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-22 Thread sachouche

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1237

DRILL-6348: Fixed code so that Unordered Receiver reports its memory â¦

Problem Description -
- The Unordered Receiver doesn't report any memory usage because the RPC 
infrastructure associated the allocated memory ownership to the minor fragment 
container

Proposed Fix -
- Modified the code to change the received DrillBuf memory ownership from 
the parent fragment to the operator (as discussed with @parthchandra)
- Made sure that the buffer accounting logic is preserved

@parthchandra, can you please review this pull request?

Thanks!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-6348

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1237


commit 9ab501a3fc3ebbda44c1478ad9db6711192b8ca1
Author: Salim Achouche <sachouche2@...>
Date:   2018-04-23T01:02:35Z

DRILL-6348: Fixed code so that Unordered Receiver reports its memory usage




---

1 2 >

1 - 100 of 119 matches

Mail list logo