[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348039#comment-16348039 ] ASF GitHub Bot commented on DRILL-6032: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165263759 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- I can think of three reasons to use the sizer: * Type logic is complex: we have multiple sets of rules depending on the data type. Best to encapsulate the logic in a single place. So, either 1) use the "sizer", or 2) move the logic from the "sizer" to a common utility. * Column size is tricky as it depends on `DataMode`. The size of a `Required INT` is 4. The (total memory) size of an `Optional INT` is 5. For a `Repeated INT`? You need to know the average array cardinality, which the "sizer" provides (by analyzing an input batch.) * As discussed, variable-width columns (`VARCHAR`, `VARBINARY` for HBase) have no known size. We really have to completely forget about that awful "50" estimate. We can only estimate size from input, which is, again, what the "sizer" does. Of course, all the above only works if you actually sample the input. A current limitation (and good enhancement) is that the Sizer is aware of just one batch. The sort (the first user of the "sizer") needed only aggregate row size, so it just kept track of the widest row ever seen. If you need detailed column information, you may want another layer: one that aggregates information across batches. 
(For arrays and variable-width columns, you can take the weighted average or the maximum depending on your needs.) Remember, if the purpose of this number is to estimate memory use, then you have to add a 33% (average) allowance for internal fragmentation. (Each vector is, on average, 75% full.) > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSizer to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
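The sizing rules in the comment above can be sketched as a tiny helper. This is an illustrative sketch, not Drill's actual `RecordBatchSizer` API: the class and method names are hypothetical, and the 4-byte per-row offset overhead for the repeated mode is an assumption. It applies the per-`DataMode` byte counts and the 4/3 fragmentation allowance described in the comment.

```java
// Hypothetical sketch of the per-column sizing rules described above.
// Not Drill's RecordBatchSizer; numbers follow the comment:
//   Required INT = 4 bytes, Optional INT = 4 + 1 "bits" byte,
//   Repeated INT = 4 * average cardinality + 4 offset bytes per row (assumed),
// then a 4/3 multiplier because vectors are on average ~75% full.
public class ColumnSizeSketch {
    static double perRowBytes(String mode, int valueWidth, double avgCardinality) {
        switch (mode) {
            case "REQUIRED": return valueWidth;
            case "OPTIONAL": return valueWidth + 1;                   // extra "bits" vector byte
            case "REPEATED": return valueWidth * avgCardinality + 4;  // offsets (assumed)
            default: throw new IllegalArgumentException(mode);
        }
    }
    static long estimateMemory(double perRowBytes, long rowCount) {
        // 33% allowance for internal fragmentation (vectors ~75% full)
        return (long) Math.ceil(perRowBytes * rowCount * 4.0 / 3.0);
    }
    public static void main(String[] args) {
        System.out.println(perRowBytes("REQUIRED", 4, 0)); // 4.0
        System.out.println(perRowBytes("OPTIONAL", 4, 0)); // 5.0
        System.out.println(estimateMemory(5.0, 3000));     // 20000
    }
}
```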
[jira] [Commented] (DRILL-6129) Query fails on nested data type schema change
[ https://issues.apache.org/jira/browse/DRILL-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348024#comment-16348024 ] ASF GitHub Bot commented on DRILL-6129: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1106 Note that a similar bug was recently fixed in (as I recall) the Merge Receiver. As part of this fix, would be good to either: 1. Determine if we have more copies of this logic besides the Merge Receiver (previously fixed) and the client code (fixed here.) 2. Refactor the code so that all use cases use a common set of code for this task. In any event, would be good to compare this code with that done in the Merge Receiver to ensure that we are using a common approach. See `exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java` in PR #968. The two sets of code appear similar, depending on what `isSameSchema()` does with a list of `MaterializedField`s. But, please take a look. > Query fails on nested data type schema change > - > > Key: DRILL-6129 > URL: https://issues.apache.org/jira/browse/DRILL-6129 > Project: Apache Drill > Issue Type: Bug > Components: Client - CLI >Affects Versions: 1.10.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Minor > Fix For: 1.13.0 > > > Use-Case - > * Assume two parquet files with similar schemas except for a nested column > * Schema file1 > ** int64 field1 > ** optional group field2 > *** optional group field2.1 (LIST) > repeated group list > * optional group element > ** optional int64 child_field > * Schema file2 > ** int64 field1 > ** optional group field2 > *** optional group field2.1 (LIST) > repeated group list > * optional group element > ** optional group child_field > *** optional int64 child_field_f1 > *** optional int64 child_field_f1 > * Essentially child_field changed from an int64 to a group of fields > > Observed Query Failure > select * from ; > Error: Unexpected RuntimeException: 
java.lang.IllegalArgumentException: The > field $bits$(UINT1:REQUIRED) doesn't match the provided metadata major_type { > minor_type: MAP > mode: REQUIRED > Note that selecting one file at a time succeeds which seems to indicate the > issue has to do with the schema change logic. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6124) testCountDownLatch can be null in PartitionerDecorator depending on user's injection controls config
[ https://issues.apache.org/jira/browse/DRILL-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347965#comment-16347965 ] ASF GitHub Bot commented on DRILL-6124: --- Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1103 You are right @arina-ielchiieva . Thanks for catching this, I will close the PR and mark the jira as invalid. > testCountDownLatch can be null in PartitionerDecorator depending on user's > injection controls config > > > Key: DRILL-6124 > URL: https://issues.apache.org/jira/browse/DRILL-6124 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Minor > Fix For: 1.13.0 > > > In PartitionerDecorator we get a latch from the injector with the following > code. > testCountDownLatch = injector.getLatch(context.getExecutionControls(), > "partitioner-sender-latch"); > However, if there is no injection site defined in the user's drill > configuration then testCountDownLatch will be null. So we have to check if it > is null in order to avoid NPEs -- This message was sent by Atlassian JIRA (v7.6.3#76005)
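The null guard the report proposes can be sketched as below. This is an illustrative stand-in, not Drill's `PartitionerDecorator` code (and note the JIRA above was ultimately closed as invalid): the method name is hypothetical, and `CountDownLatch` stands in for the injected test latch.

```java
// Hypothetical illustration of the null guard DRILL-6124 proposed:
// injector.getLatch(...) can return null when the user's execution
// controls define no injection site, so every use must be guarded.
import java.util.concurrent.CountDownLatch;

public class LatchGuard {
    // Stands in for the testCountDownLatch field; may legitimately be null.
    static void countDownIfConfigured(CountDownLatch testCountDownLatch) {
        if (testCountDownLatch != null) { // guard avoids the NPE
            testCountDownLatch.countDown();
        }
    }
    public static void main(String[] args) {
        countDownIfConfigured(null); // no-op instead of NullPointerException
        CountDownLatch latch = new CountDownLatch(1);
        countDownIfConfigured(latch);
        System.out.println(latch.getCount()); // 0
    }
}
```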
[jira] [Commented] (DRILL-6124) testCountDownLatch can be null in PartitionerDecorator depending on user's injection controls config
[ https://issues.apache.org/jira/browse/DRILL-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347966#comment-16347966 ] ASF GitHub Bot commented on DRILL-6124: --- Github user ilooner closed the pull request at: https://github.com/apache/drill/pull/1103 > testCountDownLatch can be null in PartitionerDecorator depending on user's > injection controls config > > > Key: DRILL-6124 > URL: https://issues.apache.org/jira/browse/DRILL-6124 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Minor > Fix For: 1.13.0 > > > In PartitionerDecorator we get a latch from the injector with the following > code. > testCountDownLatch = injector.getLatch(context.getExecutionControls(), > "partitioner-sender-latch"); > However, if there is no injection site defined in the user's drill > configuration then testCountDownLatch will be null. So we have to check if it > is null in order to avoid NPEs -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (DRILL-6128) Wrong Result with Nested Loop Join
[ https://issues.apache.org/jira/browse/DRILL-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347960#comment-16347960 ] Sorabh Hamirwasia edited comment on DRILL-6128 at 2/1/18 3:36 AM: -- I did some more investigation on this, and it looks like we recently added generated code for the *doEval* method of NestedLoopJoin as part of DRILL-5375. The right shift happens inside the *doEval* method: it identifies the right side container as a HyperContainer and does a right shift on the index to get the batchIndex. This will only work if the creator of the ExpandableHyperContainer ensures the value vectors inside it are fully packed. So far the other operators using ExpandableHyperContainers are HashJoin, MergingRecordBatch and Sort/TopN (using PriorityQueue). From discussion with [~amansinha100] it looks like while building the HashTable we use HyperContainers of BatchHolders, but we make sure that each BatchHolder is fully filled before adding another one to the container. Hence it works fine with respect to generated code accessing records from it. It would be good to make sure PriorityQueue is also doing something like this. *Current Nested Loop Behavior:* NestedLoopJoin adds the right side input batches to a HyperContainer (rightContainer) without ensuring it is fully packed. It also maintains a list of record counts for each batch in rightCounts. Later these are passed to the generated code using BatchReference. During EvaluationVisitor, when it sees that rightContainer is a HyperContainer, it does a right shift on the passed batchIndex. *Currently we have the following ways to fix this issue:* 1) Make sure all operators using a HyperContainer always fully pack it (which looks to be the case today). In that case Nested Loop Join has to do something similar: while adding batches to rightContainer, make sure each batch is fully packed. 
2) Operators which are not fully packing the hyper container should use BatchReference for generated code and should also keep track of the list of record counts in each batch along with the hyper container. Whenever only the index of the record is passed to the generated code it should generate the index like: rightIndex = (batchIndex << 16) + recordWithinBatchIndex. When both batch index and record index are passed separately, it should generate batchIndex = batchIndex << 16 and pass recordWithinBatchIndex separately. The latter is the case for NestedLoopJoin's current implementation. was (Author: shamirwasia): I did some more investigation on this, and it looks like we recently added generated code for the *doEval* method of NestedLoopJoin. The right shift happens inside the *doEval* method: it identifies the right side container as a HyperContainer and does a right shift on the index to get the batchIndex. This will only work if the creator of the ExpandableHyperContainer ensures the value vectors inside it are fully packed. So far the other operators using ExpandableHyperContainers are HashJoin, MergingRecordBatch and Sort/TopN (using PriorityQueue). From discussion with [~amansinha100] it looks like while building the HashTable we use HyperContainers of BatchHolders, but we make sure that each BatchHolder is fully filled before adding another one to the container. Hence it works fine with respect to generated code accessing records from it. It would be good to make sure PriorityQueue is also doing something like this. *Current Nested Loop Behavior:* NestedLoopJoin adds the right side input batches to a HyperContainer (rightContainer) without ensuring it is fully packed. It also maintains a list of record counts for each batch in rightCounts. Later these are passed to the generated code using BatchReference. During EvaluationVisitor, when it sees that rightContainer is a HyperContainer, it does a right shift on the passed batchIndex. 
*Currently we have the following ways to fix this issue:* 1) Make sure all operators using a HyperContainer always fully pack it (which looks to be the case today). In that case Nested Loop Join has to do something similar: while adding batches to rightContainer, make sure each batch is fully packed. 2) Operators which are not fully packing the hyper container should use BatchReference for generated code and should also keep track of the list of record counts in each batch along with the hyper container. Whenever only the index of the record is passed to the generated code it should generate the index like: rightIndex = (batchIndex << 16) + recordWithinBatchIndex. When both batch index and record index are passed separately, it should generate batchIndex = batchIndex << 16 and pass recordWithinBatchIndex separately. The latter is the case for NestedLoopJoin's current implementation. > Wrong Result with Nested Loop Join > -- > > Key: DRILL-6128
[jira] [Commented] (DRILL-6128) Wrong Result with Nested Loop Join
[ https://issues.apache.org/jira/browse/DRILL-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347960#comment-16347960 ] Sorabh Hamirwasia commented on DRILL-6128: -- I did some more investigation on this, and it looks like we recently added generated code for the *doEval* method of NestedLoopJoin. The right shift happens inside the *doEval* method: it identifies the right side container as a HyperContainer and does a right shift on the index to get the batchIndex. This will only work if the creator of the ExpandableHyperContainer ensures the value vectors inside it are fully packed. So far the other operators using ExpandableHyperContainers are HashJoin, MergingRecordBatch and Sort/TopN (using PriorityQueue). From discussion with [~amansinha100] it looks like while building the HashTable we use HyperContainers of BatchHolders, but we make sure that each BatchHolder is fully filled before adding another one to the container. Hence it works fine with respect to generated code accessing records from it. It would be good to make sure PriorityQueue is also doing something like this. *Current Nested Loop Behavior:* NestedLoopJoin adds the right side input batches to a HyperContainer (rightContainer) without ensuring it is fully packed. It also maintains a list of record counts for each batch in rightCounts. Later these are passed to the generated code using BatchReference. During EvaluationVisitor, when it sees that rightContainer is a HyperContainer, it does a right shift on the passed batchIndex. *Currently we have the following ways to fix this issue:* 1) Make sure all operators using a HyperContainer always fully pack it (which looks to be the case today). In that case Nested Loop Join has to do something similar: while adding batches to rightContainer, make sure each batch is fully packed. 
2) Operators which are not fully packing the hyper container should use BatchReference for generated code and should also keep track of the list of record counts in each batch along with the hyper container. Whenever only the index of the record is passed to the generated code it should generate the index like: rightIndex = (batchIndex << 16) + recordWithinBatchIndex. When both batch index and record index are passed separately, it should generate batchIndex = batchIndex << 16 and pass recordWithinBatchIndex separately. The latter is the case for NestedLoopJoin's current implementation. > Wrong Result with Nested Loop Join > -- > > Key: DRILL-6128 > URL: https://issues.apache.org/jira/browse/DRILL-6128 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators >Reporter: Sorabh Hamirwasia >Assignee: Sorabh Hamirwasia >Priority: Major > > Nested Loop Join produces wrong results if there are multiple batches on the > right side. It builds an ExpandableHyperContainer to hold all the right side > of batches. Then for each record on left side input evaluates the condition > with all records on right side and emits the output if condition is satisfied. > The main loop inside > [populateOutgoingBatch|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java#L106] > calls *doEval* with correct indexes to evaluate records on both the sides. > In generated code of *doEval* for some reason there is a right shift of 16 > done on the rightBatchIndex (sample shared below). > {code:java} > public boolean doEval(int leftIndex, int rightBatchIndex, int > rightRecordIndexWithinBatch) > throws SchemaChangeException > { > { >IntHolder out3 = new IntHolder(); >{ > out3 .value = vv0 .getAccessor().get((leftIndex)); >} >IntHolder out7 = new IntHolder(); >{ > out7 .value = > > vv4[((rightBatchIndex)>>>16)].getAccessor().get(((rightRecordIndexWithinBatch)& > 65535)); >} > .. > .. 
> }{code} > > When the actual loop is processing the second batch, inside the eval method the index > with right shift becomes 0 and it ends up evaluating the condition w.r.t. the first > right batch again. So if there is more than one batch (up to 65535) on the right > side, doEval will always consider the first batch for condition evaluation. But > the output data will be based on the correct batch, so there will be issues like > OutOfBound and WrongData. Cases can be: > Let's say: *rightBatchIndex*: index of right batch to consider, > *rightRecordIndexWithinBatch*: index of record in right batch at > rightBatchIndex > 1) First right batch comes with zero data and with OK_NEW_SCHEMA (let's say > because of filter in the operator tree). Next right batch has > 0 data. So > when we call doEval for the second batch (*rightBatchIndex = 1*) and the first record > in it (i.e.
[jira] [Updated] (DRILL-6129) Query fails on nested data type schema change
[ https://issues.apache.org/jira/browse/DRILL-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] salim achouche updated DRILL-6129: -- Reviewer: Aman Sinha > Query fails on nested data type schema change > - > > Key: DRILL-6129 > URL: https://issues.apache.org/jira/browse/DRILL-6129 > Project: Apache Drill > Issue Type: Bug > Components: Client - CLI >Affects Versions: 1.10.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Minor > Fix For: 1.13.0 > > > Use-Case - > * Assume two parquet files with similar schemas except for a nested column > * Schema file1 > ** int64 field1 > ** optional group field2 > *** optional group field2.1 (LIST) > repeated group list > * optional group element > ** optional int64 child_field > * Schema file2 > ** int64 field1 > ** optional group field2 > *** optional group field2.1 (LIST) > repeated group list > * optional group element > ** optional group child_field > *** optional int64 child_field_f1 > *** optional int64 child_field_f1 > * Essentially child_field changed from an int64 to a group of fields > > Observed Query Failure > select * from ; > Error: Unexpected RuntimeException: java.lang.IllegalArgumentException: The > field $bits$(UINT1:REQUIRED) doesn't match the provided metadata major_type { > minor_type: MAP > mode: REQUIRED > Note that selecting one file at a time succeeds which seems to indicate the > issue has to do with the schema change logic. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6129) Query fails on nested data type schema change
[ https://issues.apache.org/jira/browse/DRILL-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347950#comment-16347950 ] ASF GitHub Bot commented on DRILL-6129: --- Github user priteshm commented on the issue: https://github.com/apache/drill/pull/1106 @amansinha100 can you please review it? > Query fails on nested data type schema change > - > > Key: DRILL-6129 > URL: https://issues.apache.org/jira/browse/DRILL-6129 > Project: Apache Drill > Issue Type: Bug > Components: Client - CLI >Affects Versions: 1.10.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Minor > Fix For: 1.13.0 > > > Use-Case - > * Assume two parquet files with similar schemas except for a nested column > * Schema file1 > ** int64 field1 > ** optional group field2 > *** optional group field2.1 (LIST) > repeated group list > * optional group element > ** optional int64 child_field > * Schema file2 > ** int64 field1 > ** optional group field2 > *** optional group field2.1 (LIST) > repeated group list > * optional group element > ** optional group child_field > *** optional int64 child_field_f1 > *** optional int64 child_field_f1 > * Essentially child_field changed from an int64 to a group of fields > > Observed Query Failure > select * from ; > Error: Unexpected RuntimeException: java.lang.IllegalArgumentException: The > field $bits$(UINT1:REQUIRED) doesn't match the provided metadata major_type { > minor_type: MAP > mode: REQUIRED > Note that selecting one file at a time succeeds which seems to indicate the > issue has to do with the schema change logic. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6129) Query fails on nested data type schema change
[ https://issues.apache.org/jira/browse/DRILL-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347948#comment-16347948 ] ASF GitHub Bot commented on DRILL-6129: --- GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1106 DRILL-6129: Fixed query failure due to nested column data type change Problem Description - - The Drillbit was able to successfully send batches containing different metadata (for nested columns) - This was the case when one or multiple scanners were involved - The issue happened within the client where value vectors are cached across batches - The load(...) API is responsible for updating value vectors when a new batch arrives - The RecordBatchLoader class is used to detect schema changes; if this is the case, then previous value vectors are discarded and new ones created - There is a bug with the current implementation where only first-level columns are compared Fix - - The fix is to improve the schema diff logic by including nested columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6129 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1106 commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4 Author: Salim Achouche Date: 2018-02-01T02:59:58Z DRILL-6129: Fixed query failure due to nested column data type change > Query fails on nested data type schema change > - > > Key: DRILL-6129 > URL: https://issues.apache.org/jira/browse/DRILL-6129 > Project: Apache Drill > Issue Type: Bug > Components: Client - CLI >Affects Versions: 1.10.0 >Reporter: salim achouche >Assignee: salim achouche >Priority: Minor > Fix For: 1.13.0 > > > Use-Case - > * Assume two parquet files with similar schemas except for a nested column > * Schema 
file1 > ** int64 field1 > ** optional group field2 > *** optional group field2.1 (LIST) > repeated group list > * optional group element > ** optional int64 child_field > * Schema file2 > ** int64 field1 > ** optional group field2 > *** optional group field2.1 (LIST) > repeated group list > * optional group element > ** optional group child_field > *** optional int64 child_field_f1 > *** optional int64 child_field_f1 > * Essentially child_field changed from an int64 to a group of fields > > Observed Query Failure > select * from ; > Error: Unexpected RuntimeException: java.lang.IllegalArgumentException: The > field $bits$(UINT1:REQUIRED) doesn't match the provided metadata major_type { > minor_type: MAP > mode: REQUIRED > Note that selecting one file at a time succeeds which seems to indicate the > issue has to do with the schema change logic. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
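The core idea of the fix above (compare schemas recursively rather than only at the top level) can be sketched as follows. This is an illustrative stand-in, not Drill's `MaterializedField`/`BatchSchema` API; the `Field` class and its `isSame` method are hypothetical names.

```java
// Hypothetical sketch: a schema comparison that descends into nested
// children, so a change like child_field going from int64 to a group
// (DRILL-6129's case) is detected instead of being missed.
import java.util.ArrayList;
import java.util.List;

public class SchemaDiff {
    static class Field {
        final String name;
        final String type; // e.g. "INT64", "MAP"
        final List<Field> children = new ArrayList<>();
        Field(String name, String type) { this.name = name; this.type = type; }
        // True only if name, type, and the entire child subtree match.
        boolean isSame(Field other) {
            if (!name.equals(other.name) || !type.equals(other.type)
                    || children.size() != other.children.size()) {
                return false;
            }
            for (int i = 0; i < children.size(); i++) {
                if (!children.get(i).isSame(other.children.get(i))) {
                    return false;
                }
            }
            return true;
        }
    }
    public static void main(String[] args) {
        // file1's child_field is an int64; file2's is a group of fields.
        Field v1 = new Field("child_field", "INT64");
        Field v2 = new Field("child_field", "MAP");
        v2.children.add(new Field("child_field_f1", "INT64"));
        System.out.println(v1.isSame(v2)); // false: schema change detected
        System.out.println(v1.isSame(new Field("child_field", "INT64"))); // true
    }
}
```

A top-level-only comparison would report these two fields as equal whenever the names match, which is the bug being fixed.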
[jira] [Created] (DRILL-6129) Query fails on nested data type schema change
salim achouche created DRILL-6129: - Summary: Query fails on nested data type schema change Key: DRILL-6129 URL: https://issues.apache.org/jira/browse/DRILL-6129 Project: Apache Drill Issue Type: Bug Components: Client - CLI Affects Versions: 1.10.0 Reporter: salim achouche Assignee: salim achouche Fix For: 1.13.0 Use-Case - * Assume two parquet files with similar schemas except for a nested column * Schema file1 ** int64 field1 ** optional group field2 *** optional group field2.1 (LIST) repeated group list * optional group element ** optional int64 child_field * Schema file2 ** int64 field1 ** optional group field2 *** optional group field2.1 (LIST) repeated group list * optional group element ** optional group child_field *** optional int64 child_field_f1 *** optional int64 child_field_f1 * Essentially child_field changed from an int64 to a group of fields Observed Query Failure select * from ; Error: Unexpected RuntimeException: java.lang.IllegalArgumentException: The field $bits$(UINT1:REQUIRED) doesn't match the provided metadata major_type { minor_type: MAP mode: REQUIRED Note that selecting one file at a time succeeds which seems to indicate the issue has to do with the schema change logic. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347848#comment-16347848 ] ASF GitHub Bot commented on DRILL-6106: --- Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/1099 > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.12.0 >Reporter: Reudismam Rolim de Sousa >Assignee: Reudismam Rolim de Sousa >Priority: Minor > Labels: ready-to-commit > Fix For: 1.13.0 > > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
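The caching behavior behind DRILL-6106 can be shown with a small demo. The Java Language Specification guarantees that `Integer.valueOf` reuses cached instances for values in [-128, 127], while a constructor always allocates a new object; values outside that range may or may not be cached.

```java
// Why valueOf is preferred over the boxed-type constructors:
// valueOf can return a cached instance, avoiding an allocation.
public class ValueOfDemo {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        System.out.println(a == b);      // true: cached instance reused
        Integer c = Integer.valueOf(1000);
        Integer d = Integer.valueOf(1000);
        System.out.println(c == d);      // usually false: outside the guaranteed cache range
        System.out.println(c.equals(d)); // true: value equality always holds
    }
}
```

The same pattern applies to `Long.valueOf`, `Short.valueOf`, `Boolean.valueOf`, and friends, which is what the DRILL-6106 change swapped in.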
[jira] [Created] (DRILL-6128) Wrong Result with Nested Loop Join
Sorabh Hamirwasia created DRILL-6128: Summary: Wrong Result with Nested Loop Join Key: DRILL-6128 URL: https://issues.apache.org/jira/browse/DRILL-6128 Project: Apache Drill Issue Type: Bug Components: Execution - Relational Operators Reporter: Sorabh Hamirwasia Assignee: Sorabh Hamirwasia Nested Loop Join produces wrong results if there are multiple batches on the right side. It builds an ExpandableHyperContainer to hold all the right-side batches. Then for each record on the left side it evaluates the condition against all records on the right side and emits the output if the condition is satisfied. The main loop inside [populateOutgoingBatch|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java#L106] calls *doEval* with correct indexes to evaluate records on both sides. In the generated code of *doEval*, for some reason there is a right shift of 16 done on the rightBatchIndex (sample shared below). {code:java} public boolean doEval(int leftIndex, int rightBatchIndex, int rightRecordIndexWithinBatch) throws SchemaChangeException { { IntHolder out3 = new IntHolder(); { out3 .value = vv0 .getAccessor().get((leftIndex)); } IntHolder out7 = new IntHolder(); { out7 .value = vv4[((rightBatchIndex)>>>16)].getAccessor().get(((rightRecordIndexWithinBatch)& 65535)); } .. .. }{code} When the actual loop is processing the second batch, inside the eval method the index with right shift becomes 0 and it ends up evaluating the condition w.r.t. the first right batch again. So if there is more than one batch (up to 65535) on the right side, doEval will always consider the first batch for condition evaluation. But the output data will be based on the correct batch, so there will be issues like OutOfBound and WrongData. 
Cases can be: Let's say: *rightBatchIndex*: index of right batch to consider, *rightRecordIndexWithinBatch*: index of record in right batch at rightBatchIndex 1) First right batch comes with zero data and with OK_NEW_SCHEMA (let's say because of a filter in the operator tree). Next right batch has > 0 data. So when we call doEval for the second batch (*rightBatchIndex = 1*) and the first record in it (i.e. *rightRecordIndexWithinBatch = 0*), actual evaluation will happen using the first batch (since *rightBatchIndex >>> 16 = 0*). On accessing the record at *rightRecordIndexWithinBatch* in the first batch it will throw *IndexOutOfBoundsException* since the first batch has no records. 2) Let's say there are 2 batches on the right side. Also let's say the first batch contains 3 records (with id_right=1/2/3) and the 2nd batch also contains 3 records (with id_right=10/20/30). Also let's say there is 1 batch on the left side with 3 records (with id_left=1/2/3). Then in this case the NestedLoopJoin (with equality condition) will end up producing 6 records instead of 3. It produces the first 3 records based on matches between left records and the first right batch's records. But for the 2nd right batch it will evaluate id_left=id_right against the first batch instead, will again find matches, and will produce another 3 records. 
*Example:* *Left Batch Data:* {code:java} Batch1: { "id_left": 1, "cost_left": 11, "name_left": "item11" } { "id_left": 2, "cost_left": 21, "name_left": "item21" } { "id_left": 3, "cost_left": 31, "name_left": "item31" }{code} *Right Batch Data:* {code:java} Batch 1: { "id_right": 1, "cost_right": 10, "name_right": "item1" } { "id_right": 2, "cost_right": 20, "name_right": "item2" } { "id_right": 3, "cost_right": 30, "name_right": "item3" } {code} {code:java} Batch 2: { "id_right": 4, "cost_right": 40, "name_right": "item4" } { "id_right": 4, "cost_right": 40, "name_right": "item4" } { "id_right": 4, "cost_right": 40, "name_right": "item4" }{code} *Produced output:* {code:java} { "id_left": 1, "cost_left": 11, "name_left": "item11", "id_right": 1, "cost_right": 10, "name_right": "item1" } { "id_left": 1, "cost_left": 11, "name_left": "item11", "id_right": 4, "cost_right": 40, "name_right": "item4" } { "id_left": 2, "cost_left": 21, "name_left": "item21", "id_right": 2, "cost_right": 20, "name_right": "item2" } { "id_left": 2, "cost_left": 21, "name_left": "item21", "id_right": 4, "cost_right": 40, "name_right": "item4" } { "id_left": 3, "cost_left": 31, "name_left": "item31", "id_right": 3, "cost_right": 30, "name_right": "item3" } { "id_left": 3, "cost_left": 31, "name_left": "item31", "id_right": 4, "cost_right": 40, "name_right": "item4" }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
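The compound-index scheme discussed in the DRILL-6128 comments (batch index in the high 16 bits, record index in the low 16 bits, matching the `>>> 16` and `& 65535` seen in the generated `doEval`) can be sketched as follows. The class and method names here are illustrative, not Drill's generated code.

```java
// Sketch of the 16/16-bit compound index for hyper-batch addressing:
// high 16 bits = batch index, low 16 bits = record index within batch
// (so at most 65535 of each).
public class HyperBatchIndex {
    static int encode(int batchIndex, int recordWithinBatchIndex) {
        return (batchIndex << 16) | (recordWithinBatchIndex & 0xFFFF);
    }
    static int batchOf(int compoundIndex) {
        return compoundIndex >>> 16;   // the right shift seen in doEval
    }
    static int recordOf(int compoundIndex) {
        return compoundIndex & 0xFFFF; // the "& 65535" seen in doEval
    }
    public static void main(String[] args) {
        int rightIndex = encode(2, 5); // record 5 of right batch 2
        System.out.println(batchOf(rightIndex));  // 2
        System.out.println(recordOf(rightIndex)); // 5
        // The bug's shape: passing a bare batch index where a compound
        // index is expected decodes to batch 0, so batch 0 is read again.
        System.out.println(batchOf(1)); // 0
    }
}
```

Note the parentheses in `(batchIndex << 16) + recordWithinBatchIndex` matter: in Java, `+` binds tighter than `<<`, so the unparenthesized form shifts by `16 + recordWithinBatchIndex` instead.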
[jira] [Assigned] (DRILL-6111) NullPointerException with Kafka Storage Plugin
[ https://issues.apache.org/jira/browse/DRILL-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khatua reassigned DRILL-6111: --- Assignee: Bhallamudi Venkata Siva Kamesh > NullPointerException with Kafka Storage Plugin > -- > > Key: DRILL-6111 > URL: https://issues.apache.org/jira/browse/DRILL-6111 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Other >Affects Versions: 1.12.0 >Reporter: Jared Stehler >Assignee: Bhallamudi Venkata Siva Kamesh >Priority: Major > > I'm unable to query using the kafka storage plugin; queries are failing with > a NPE which *seems* like a json typo: > {code:java} > org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: > NullPointerException > Fragment 1:2 > [Error Id: 49d5f72f-0187-480b-8b29-6eeeb5adc88f on 10.80.53.16:31820] > at > org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:586) > ~[drill-common-1.12.0.jar:1.12.0] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:298) > [drill-java-exec-1.12.0.jar:1.12.0] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160) > [drill-java-exec-1.12.0.jar:1.12.0] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:267) > [drill-java-exec-1.12.0.jar:1.12.0] > at > org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) > [drill-common-1.12.0.jar:1.12.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [na:1.8.0_131] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_131] > at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131] > Caused by: com.fasterxml.jackson.databind.JsonMappingException: Instantiation > of [simple type, class org.apache.drill.exec.store.kafka.KafkaSubScan] value > failed (java.lang.NullPointerException): null > at [Source: { > "pop" : "single-sender", > "@id" 
: 0, > "receiver-major-fragment" : 0, > "receiver-minor-fragment" : 0, > "child" : { > "pop" : "selection-vector-remover", > "@id" : 1, > "child" : { > "pop" : "limit", > "@id" : 2, > "child" : { > "pop" : "kafka-partition-scan", > "@id" : 3, > "userName" : "", > "columns" : [ "`*`" ], > "partitionSubScanSpecList" : [ { > "topicName" : "ingest-prime", > "partitionId" : 5, > "startOffset" : 8824294, > "endOffset" : 8874172 > }, { > "topicName" : "ingest-prime", > "partitionId" : 1, > "startOffset" : 8826346, > "endOffset" : 8874623 > }, { > "topicName" : "ingest-prime", > "partitionId" : 6, > "startOffset" : 8824744, > "endOffset" : 8874617 > } ], > "initialAllocation" : 100, > "maxAllocation" : 100, > "KafkaStoragePluginConfig" : { > "type" : "kafka", > "kafkaConsumerProps" : { > "key.deserializer" : > "org.apache.kafka.common.serialization.ByteArrayDeserializer", > "auto.offset.reset" : "earliest", > "bootstrap.servers" : > "kafkas.dev3.master.us-west-2.prod.aws.intellify.io:9092", > "enable.auto.commit" : "true", > "group.id" : "drill-query-consumer-1", > "value.deserializer" : > "org.apache.kafka.common.serialization.ByteArrayDeserializer", > "session.timeout.ms" : "3" > }, > "enabled" : true > }, > "cost" : 0.0 > }, > "first" : 0, > "last" : 2, > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : 2.0 > }, > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : 2.0 > }, > "destination" : "CgsxMC44MC41My4xNhDM+AEYzfgBIM74ATIGMS4xMi4wOAA=", > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : 2.0 > }; line: 49, column: 7] (through reference chain: > org.apache.drill.exec.physical.config.SingleSender["child"]->org.apache.drill.exec.physical.config.SelectionVectorRemover["child"]->org.apache.drill.exec.physical.config.Limit["child"]) > at > com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:263) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > 
com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.wrapAsJsonMappingException(StdValueInstantiator.java:453) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.rewrapCtorProblem(StdValueInstantiator.java:472) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:258) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:135) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:444) >
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347506#comment-16347506 ] ASF GitHub Bot commented on DRILL-6106: --- Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1099 @reudismam Travis fails in other PRs as well. See #1105. > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.12.0 >Reporter: Reudismam Rolim de Sousa >Assignee: Reudismam Rolim de Sousa >Priority: Minor > Labels: ready-to-commit > Fix For: 1.13.0 > > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
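The caching that motivates this change is easy to demonstrate with the JDK wrapper types: `valueOf` consults an internal cache (guaranteed at least for `Integer` values in `-128..127`, and always for `Boolean`), while the constructor allocates a new object every time. A minimal, self-contained sketch:

```java
public class ValueOfDemo {
    public static void main(String[] args) {
        // Integer.valueOf() consults an internal cache (guaranteed for -128..127),
        // so repeated requests for the same small value return the same object.
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        if (a != b) {
            throw new AssertionError("expected valueOf to return the cached instance");
        }
        // Boolean.valueOf() always returns one of the two canonical instances.
        if (Boolean.valueOf(true) != Boolean.TRUE) {
            throw new AssertionError("expected the canonical Boolean.TRUE");
        }
        System.out.println("valueOf caching verified");
    }
}
```

Outside the cached range `valueOf` may still allocate, so the win is for frequently requested small values, which is exactly the common case for flags and small counters.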
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347504#comment-16347504 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165168294 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -84,13 +85,6 @@ public abstract class HashAggTemplate implements HashAggregator { protected static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HashAggregator.class); - private static final int VARIABLE_MAX_WIDTH_VALUE_SIZE = 50; - private static final int VARIABLE_MIN_WIDTH_VALUE_SIZE = 8; - - private static final boolean EXTRA_DEBUG_1 = false; --- End diff -- Oh but there is! slf4j and logback have a feature called markers, which lets you associate a tag with a log statement. When printing logs you can then filter by level as well as by marker. There is a working example here: https://examples.javacodegeeks.com/enterprise-java/slf4j/slf4j-markers-example/ . I will update the log statements to use markers in this PR. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
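For reference, the marker approach mentioned above looks roughly like this. On the Java side a statement is tagged via the slf4j API, e.g. `logger.debug(MarkerFactory.getMarker("HASH_AGG_DEBUG"), "spilling partition {}", i)`, and logback can then accept or deny events by marker with a turbo filter. This is a hedged sketch assuming slf4j-api and logback-classic on the classpath; the marker name `HASH_AGG_DEBUG` and the pattern are made up for illustration:

```xml
<!-- logback.xml sketch: accept any event carrying the HASH_AGG_DEBUG marker,
     even though the root level would otherwise suppress DEBUG output. -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder><pattern>%marker %-5level %msg%n</pattern></encoder>
  </appender>

  <!-- Turbo filters run before level filtering, so ACCEPT here bypasses
       the WARN threshold for marked statements only. -->
  <turboFilter class="ch.qos.logback.classic.turbo.MarkerFilter">
    <Marker>HASH_AGG_DEBUG</Marker>
    <OnMatch>ACCEPT</OnMatch>
    <OnMismatch>NEUTRAL</OnMismatch>
  </turboFilter>

  <root level="WARN">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```

Flipping `OnMatch` to `DENY` silences the marked statements instead, which gives the same "extra debug" switch the deleted `EXTRA_DEBUG_1` constants provided, but configurable without recompiling.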
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347492#comment-16347492 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164617150 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -84,13 +85,6 @@ public abstract class HashAggTemplate implements HashAggregator { protected static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HashAggregator.class); - private static final int VARIABLE_MAX_WIDTH_VALUE_SIZE = 50; - private static final int VARIABLE_MIN_WIDTH_VALUE_SIZE = 8; - - private static final boolean EXTRA_DEBUG_1 = false; --- End diff -- The logging framework only gives error/warning/debug/trace ... there is no option for a user configurable level > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347493#comment-16347493 ] ASF GitHub Bot commented on DRILL-6032: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165166234 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java --- @@ -255,7 +254,6 @@ private HashAggregator createAggregatorInternal() throws SchemaChangeException, groupByOutFieldIds[i] = container.add(vv); } -int extraNonNullColumns = 0; // each of SUM, MAX and MIN gets an extra bigint column --- End diff -- Maybe do this work as a separate PR (for DRILL-5728) ? Else it would delay this PR, and overload it ... > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347447#comment-16347447 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165161146 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java --- @@ -255,7 +254,6 @@ private HashAggregator createAggregatorInternal() throws SchemaChangeException, groupByOutFieldIds[i] = container.add(vv); } -int extraNonNullColumns = 0; // each of SUM, MAX and MIN gets an extra bigint column --- End diff -- Thanks for catching this. Then we should fix the underlying problem instead of passing around additional parameters to work around the issue. I will work on fixing the codegen for the BatchHolder as part of this PR. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347420#comment-16347420 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165156589 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- @ilooner That is the point. If we know the exact value, why do we need the RecordBatchSizer? We should use the RecordBatchSizer when we need sizing information for a batch (in most cases, the incoming batch). In this case, you are allocating memory for value vectors for the batch you are building. For fixed-width columns, you can get the column width for each type you are allocating memory for using TypeHelper.getSize. For variable-width columns, TypeHelper.getSize assumes 50 bytes. If you want to adjust the memory you allocate for variable-width columns in the outgoing batch based on the incoming batch, that is when you use the RecordBatchSizer on an actual incoming batch to figure out the average size of that column. You can also use the RecordBatchSizer on the incoming batch to figure out how many values to allocate memory for in the outgoing batch. Note that, with your change, just-created value vectors with variable-width columns will return an estSize of 1, which is not what you want. 
> Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
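The mode-dependent widths discussed in this thread (a Required INT is 4 bytes, an Optional INT adds a 1-byte validity entry, and a Repeated INT needs a 4-byte offset entry plus the average array cardinality, which can only be learned by sampling input) can be sketched as a small estimator. This is an illustrative formula under those assumptions, not Drill's actual RecordBatchSizer logic; the class and method names are hypothetical:

```java
public class ColumnWidthSketch {
    enum DataMode { REQUIRED, OPTIONAL, REPEATED }

    /**
     * Illustrative per-row size estimate for a fixed-width column.
     * REQUIRED: just the value bytes. OPTIONAL: value bytes plus a 1-byte
     * validity ("bits") entry. REPEATED: a 4-byte offset entry plus the
     * value bytes scaled by the average array cardinality observed in a
     * sampled input batch.
     */
    static double estRowWidth(int valueWidth, DataMode mode, double avgCardinality) {
        switch (mode) {
            case REQUIRED: return valueWidth;
            case OPTIONAL: return valueWidth + 1;
            case REPEATED: return 4 + valueWidth * avgCardinality;
            default: throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(estRowWidth(4, DataMode.REQUIRED, 0)); // Required INT: 4.0
        System.out.println(estRowWidth(4, DataMode.OPTIONAL, 0)); // Optional INT: 5.0
        System.out.println(estRowWidth(4, DataMode.REPEATED, 3)); // Repeated INT, avg 3 elements: 16.0
    }
}
```

The REPEATED case is why a sizer has to look at real data: the offset entry is fixed, but the per-row data bytes depend entirely on the observed average cardinality.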
[jira] [Updated] (DRILL-6125) PartitionSenderRootExec can leak memory because close method is not synchronized
[ https://issues.apache.org/jira/browse/DRILL-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Farkas updated DRILL-6125: -- Fix Version/s: 1.13.0 > PartitionSenderRootExec can leak memory because close method is not > synchronized > > > Key: DRILL-6125 > URL: https://issues.apache.org/jira/browse/DRILL-6125 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Minor > Fix For: 1.13.0 > > > PartitionSenderRootExec creates a PartitionerDecorator and saves it in the > *partitioner* field. The creation of the partitioner happens in the > createPartitioner method, which is called by the main fragment > thread. The partitioner field is accessed by the fragment thread during > normal execution, but it can also be accessed by the receivingFragmentFinished > method, which is a callback executed by the event processor thread. Because > multiple threads can access the partitioner field, synchronization is done on > creation and when receivingFragmentFinished is called. However, the close method can > also be called by the event processor thread, and it does not > synchronize before accessing the partitioner field. Without that synchronization, > the event processor thread may hold a stale reference to the > partitioner when a query is cancelled, so the current partitioner may not be > cleared and a memory leak may occur. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
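The fix described in DRILL-6125 amounts to guarding every access to the shared field with the same lock, including close(). A minimal sketch of the pattern — the class and field names here are simplified stand-ins, not Drill's actual PartitionSenderRootExec code:

```java
public class SenderSketch {
    // One lock guards every access to the shared field, from both the
    // fragment thread (create) and the event-processor thread (close).
    private final Object lock = new Object();
    private AutoCloseable partitioner;

    void createPartitioner(AutoCloseable p) {
        synchronized (lock) {
            partitioner = p;
        }
    }

    // Synchronizing here ensures the closing thread sees the *current*
    // partitioner rather than a stale reference -- the leak described
    // in DRILL-6125 comes from close() skipping this step.
    void close() throws Exception {
        synchronized (lock) {
            if (partitioner != null) {
                partitioner.close();
                partitioner = null;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        SenderSketch sender = new SenderSketch();
        final boolean[] released = { false };
        sender.createPartitioner(() -> { released[0] = true; });
        sender.close();
        if (!released[0]) {
            throw new AssertionError("partitioner leaked");
        }
        System.out.println("partitioner released");
    }
}
```

Without the `synchronized` block in close(), the Java memory model does not guarantee the closing thread observes the latest write to `partitioner`, so the current instance can escape cleanup entirely.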
[jira] [Updated] (DRILL-6124) testCountDownLatch can be null in PartitionerDecorator depending on user's injection controls config
[ https://issues.apache.org/jira/browse/DRILL-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Farkas updated DRILL-6124: -- Affects Version/s: (was: 1.12.0) 1.13.0 > testCountDownLatch can be null in PartitionerDecorator depending on user's > injection controls config > > > Key: DRILL-6124 > URL: https://issues.apache.org/jira/browse/DRILL-6124 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Minor > Fix For: 1.13.0 > > > In PartitionerDecorator we get a latch from the injector with the following > code: > testCountDownLatch = injector.getLatch(context.getExecutionControls(), > "partitioner-sender-latch"); > However, if there is no injection site defined in the user's Drill > configuration, then testCountDownLatch will be null, so we have to check for > null in order to avoid NPEs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
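The null guard described in DRILL-6124 is straightforward; a sketch with a simplified stand-in class (not Drill's actual PartitionerDecorator, whose latch comes from the injection-controls framework):

```java
import java.util.concurrent.CountDownLatch;

public class LatchGuard {
    // May be null when no injection site is configured, per DRILL-6124.
    private final CountDownLatch testCountDownLatch;

    LatchGuard(CountDownLatch latch) {
        this.testCountDownLatch = latch;
    }

    void awaitIfConfigured() throws InterruptedException {
        // The guard is the whole fix: only touch the latch when it exists.
        if (testCountDownLatch != null) {
            testCountDownLatch.await();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // No injection site configured: must not throw an NPE.
        new LatchGuard(null).awaitIfConfigured();
        // Configured and already counted down: returns immediately.
        CountDownLatch latch = new CountDownLatch(1);
        latch.countDown();
        new LatchGuard(latch).awaitIfConfigured();
        System.out.println("ok");
    }
}
```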
[jira] [Updated] (DRILL-6125) PartitionSenderRootExec can leak memory because close method is not synchronized
[ https://issues.apache.org/jira/browse/DRILL-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Farkas updated DRILL-6125: -- Affects Version/s: 1.13.0 > PartitionSenderRootExec can leak memory because close method is not > synchronized > > > Key: DRILL-6125 > URL: https://issues.apache.org/jira/browse/DRILL-6125 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Minor > > PartitionSenderRootExec creates a PartitionerDecorator and saves it in the > *partitioner* field. The creation of the partitioner happens in the > createPartitioner method, which is called by the main fragment > thread. The partitioner field is accessed by the fragment thread during > normal execution, but it can also be accessed by the receivingFragmentFinished > method, which is a callback executed by the event processor thread. Because > multiple threads can access the partitioner field, synchronization is done on > creation and when receivingFragmentFinished is called. However, the close method can > also be called by the event processor thread, and it does not > synchronize before accessing the partitioner field. Without that synchronization, > the event processor thread may hold a stale reference to the > partitioner when a query is cancelled, so the current partitioner may not be > cleared and a memory leak may occur. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6125) PartitionSenderRootExec can leak memory because close method is not synchronized
[ https://issues.apache.org/jira/browse/DRILL-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Farkas updated DRILL-6125: -- Reviewer: Arina Ielchiieva > PartitionSenderRootExec can leak memory because close method is not > synchronized > > > Key: DRILL-6125 > URL: https://issues.apache.org/jira/browse/DRILL-6125 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > > PartitionSenderRootExec creates a PartitionerDecorator and saves it in the > *partitioner* field. The creation of the partitioner happens in the > createPartitioner method, which is called by the main fragment > thread. The partitioner field is accessed by the fragment thread during > normal execution, but it can also be accessed by the receivingFragmentFinished > method, which is a callback executed by the event processor thread. Because > multiple threads can access the partitioner field, synchronization is done on > creation and when receivingFragmentFinished is called. However, the close method can > also be called by the event processor thread, and it does not > synchronize before accessing the partitioner field. Without that synchronization, > the event processor thread may hold a stale reference to the > partitioner when a query is cancelled, so the current partitioner may not be > cleared and a memory leak may occur. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6125) PartitionSenderRootExec can leak memory because close method is not synchronized
[ https://issues.apache.org/jira/browse/DRILL-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Farkas updated DRILL-6125: -- Priority: Minor (was: Major) > PartitionSenderRootExec can leak memory because close method is not > synchronized > > > Key: DRILL-6125 > URL: https://issues.apache.org/jira/browse/DRILL-6125 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Minor > > PartitionSenderRootExec creates a PartitionerDecorator and saves it in the > *partitioner* field. The creation of the partitioner happens in the > createPartitioner method, which is called by the main fragment > thread. The partitioner field is accessed by the fragment thread during > normal execution, but it can also be accessed by the receivingFragmentFinished > method, which is a callback executed by the event processor thread. Because > multiple threads can access the partitioner field, synchronization is done on > creation and when receivingFragmentFinished is called. However, the close method can > also be called by the event processor thread, and it does not > synchronize before accessing the partitioner field. Without that synchronization, > the event processor thread may hold a stale reference to the > partitioner when a query is cancelled, so the current partitioner may not be > cleared and a memory leak may occur. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6125) PartitionSenderRootExec can leak memory because close method is not synchronized
[ https://issues.apache.org/jira/browse/DRILL-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347395#comment-16347395 ] ASF GitHub Bot commented on DRILL-6125: --- Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1105 @sachouche @arina-ielchiieva > PartitionSenderRootExec can leak memory because close method is not > synchronized > > > Key: DRILL-6125 > URL: https://issues.apache.org/jira/browse/DRILL-6125 > Project: Apache Drill > Issue Type: Bug >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > > PartitionSenderRootExec creates a PartitionerDecorator and saves it in the > *partitioner* field. The creation of the partitioner happens in the > createPartitioner method, which is called by the main fragment > thread. The partitioner field is accessed by the fragment thread during > normal execution, but it can also be accessed by the receivingFragmentFinished > method, which is a callback executed by the event processor thread. Because > multiple threads can access the partitioner field, synchronization is done on > creation and when receivingFragmentFinished is called. However, the close method can > also be called by the event processor thread, and it does not > synchronize before accessing the partitioner field. Without that synchronization, > the event processor thread may hold a stale reference to the > partitioner when a query is cancelled, so the current partitioner may not be > cleared and a memory leak may occur. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6125) PartitionSenderRootExec can leak memory because close method is not synchronized
[ https://issues.apache.org/jira/browse/DRILL-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347393#comment-16347393 ] ASF GitHub Bot commented on DRILL-6125: --- GitHub user ilooner opened a pull request: https://github.com/apache/drill/pull/1105 DRILL-6125: Fix possible memory leak when query is cancelled. A detailed description of the problem and solution can be found here: https://issues.apache.org/jira/browse/DRILL-6125 You can merge this pull request into a Git repository by running: $ git pull https://github.com/ilooner/drill DRILL-6125 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1105.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1105 commit 1d1725a276c058e8c09e456963bac928d1f062ed Author: Timothy Farkas Date: 2018-01-30T23:55:41Z DRILL-6125: Fix possible memory leak when query is cancelled. > PartitionSenderRootExec can leak memory because close method is not > synchronized > > > Key: DRILL-6125 > URL: https://issues.apache.org/jira/browse/DRILL-6125 > Project: Apache Drill > Issue Type: Bug >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > > PartitionSenderRootExec creates a PartitionerDecorator and saves it in the > *partitioner* field. The creation of the partitioner happens in the > createPartitioner method, which is called by the main fragment > thread. The partitioner field is accessed by the fragment thread during > normal execution, but it can also be accessed by the receivingFragmentFinished > method, which is a callback executed by the event processor thread. Because > multiple threads can access the partitioner field, synchronization is done on > creation and when receivingFragmentFinished is called.
However, the close method can > also be called by the event processor thread, and it does not > synchronize before accessing the partitioner field. Without that synchronization, > the event processor thread may hold a stale reference to the > partitioner when a query is cancelled, so the current partitioner may not be > cleared and a memory leak may occur. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347366#comment-16347366 ] ASF GitHub Bot commented on DRILL-6106: --- Github user reudismam commented on the issue: https://github.com/apache/drill/pull/1099 It only passes Travis CI after removing the edits to SSLConfigClient.java > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.12.0 >Reporter: Reudismam Rolim de Sousa >Assignee: Reudismam Rolim de Sousa >Priority: Minor > Labels: ready-to-commit > Fix For: 1.13.0 > > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347292#comment-16347292 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165137630 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -232,9 +251,8 @@ else if (width > 0) { } } - public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; // 16 MiB - private List<ColumnSize> columnSizes = new ArrayList<>(); + private Map<String, ColumnSize> columnSizeMap = CaseInsensitiveMap.newHashMap(); --- End diff -- Thanks for the explanation here and on the dev list @paul-rogers. > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347288#comment-16347288 ] ASF GitHub Bot commented on DRILL-6032: --- Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165136635 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -397,11 +384,9 @@ private void delayedSetup() { } numPartitions = BaseAllocator.nextPowerOfTwo(numPartitions); // in case not a power of 2 -if ( schema == null ) { estValuesBatchSize = estOutgoingAllocSize = estMaxBatchSize = 0; } // incoming was an empty batch --- End diff -- All the unit and functional tests passed without an NPE. The null check was redundant because the code in **doWork** that calls **delayedSetup** sets the schema if it is null. ``` // This would be called only once - first time actual data arrives on incoming if ( schema == null && incoming.getRecordCount() > 0 ) { this.schema = incoming.getSchema(); currentBatchRecordCount = incoming.getRecordCount(); // initialize for first non empty batch // Calculate the number of partitions based on actual incoming data delayedSetup(); } ``` So schema will never be null when delayed setup is called > Use RecordBatchSizer to estimate size of columns in HashAgg > --- > > Key: DRILL-6032 > URL: https://issues.apache.org/jira/browse/DRILL-6032 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > Fix For: 1.13.0 > > > We need to use the RecordBatchSize to estimate the size of columns in the > Partition batches created by HashAgg. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6127) NullPointerException happens when submitting physical plan to the Hive storage plugin
Anton Gozhiy created DRILL-6127: --- Summary: NullPointerException happens when submitting physical plan to the Hive storage plugin Key: DRILL-6127 URL: https://issues.apache.org/jira/browse/DRILL-6127 Project: Apache Drill Issue Type: Bug Affects Versions: 1.13.0 Reporter: Anton Gozhiy *Prerequisites:* *1.* Create some test table in Hive: {code:sql} create external table if not exists hive_storage.test (key string, value string) stored as parquet location '/hive_storage/test'; insert into table test values ("key", "value"); {code} *2.* Hive plugin config: {code:json} { "type": "hive", "enabled": true, "configProps": { "hive.metastore.uris": "thrift://localhost:9083", "fs.default.name": "maprfs:///", "hive.metastore.sasl.enabled": "false" } } {code} *Steps:* *1.* From the Drill web UI, run the following query: {code:sql} explain plan for select * from hive.hive_storage.`test` {code} *2.* Copy the json part of the plan *3.* On the Query page set checkbox to the PHYSICAL *4.* Submit the copied plan *Expected result:* Drill should return normal result: "key", "value" *Actual result:* NPE happens: {noformat} [Error Id: 8b45c27e-bddd-4552-b7ea-e5af6f40866a on node1:31010] org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: NullPointerException [Error Id: 8b45c27e-bddd-4552-b7ea-e5af6f40866a on node1:31010] at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633) ~[drill-common-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT] at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:761) [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT] at org.apache.drill.exec.work.foreman.QueryStateProcessor.checkCommonStates(QueryStateProcessor.java:327) [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT] at org.apache.drill.exec.work.foreman.QueryStateProcessor.planning(QueryStateProcessor.java:223) [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT] at 
org.apache.drill.exec.work.foreman.QueryStateProcessor.moveToState(QueryStateProcessor.java:83) [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT] at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:279) [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_161] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_161] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161] Caused by: org.apache.drill.exec.work.foreman.ForemanSetupException: Failure while parsing physical plan. at org.apache.drill.exec.work.foreman.Foreman.parseAndRunPhysicalPlan(Foreman.java:393) [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT] at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257) [drill-java-exec-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT] ... 3 common frames omitted Caused by: com.fasterxml.jackson.databind.JsonMappingException: Instantiation of [simple type, class org.apache.drill.exec.store.hive.HiveScan] value failed (java.lang.NullPointerException): null at [Source: { "head" : { "version" : 1, "generator" : { "type" : "ExplainHandler", "info" : "" }, "type" : "APACHE_DRILL_PHYSICAL", "options" : [ ], "queue" : 0, "hasResourcePlan" : false, "resultMode" : "EXEC" }, "graph" : [ { "pop" : "hive-scan", "@id" : 2, "userName" : "mapr", "hive-table" : { "table" : { "tableName" : "test", "dbName" : "hive_storage", "owner" : "mapr", "createTime" : 1517417959, "lastAccessTime" : 0, "retention" : 0, "sd" : { "location" : "maprfs:/hive_storage/test", "inputFormat" : "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat", "outputFormat" : "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat", "compressed" : false, "numBuckets" : -1, "serDeInfo" : { "name" : null, "serializationLib" : "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", "parameters" : { "serialization.format" : "1" } }, "sortCols" : [ 
], "parameters" : { } }, "partitionKeys" : [ ], "parameters" : { "totalSize" : "0", "EXTERNAL" : "TRUE", "numRows" : "1", "rawDataSize" : "2", "COLUMN_STATS_ACCURATE" : "true", "numFiles" : "0", "transient_lastDdlTime" : "1517418363" }, "viewOriginalText" : null, "viewExpandedText" : null, "tableType" : "EXTERNAL_TABLE", "columnsCache" : { "keys" : [ [ { "name" : "key", "type" : "string", "comment" : null }, { "name" : "value", "type" : "string", "comment" : null } ] ] } }, "partitions" : null }, "columns" : [ "`key`", "`value`" ], "cost" : 0.0 }, { "pop" : "project", "@id" : 1, "exprs" : [ { "ref" : "`key`", "expr" : "`key`" }, { "ref" : "`value`", "expr" : "`value`" } ], "child" : 2, "outputProj" : true, "initialAllocation" : 100, "maxAllocation" : 100,
[jira] [Commented] (DRILL-6032) Use RecordBatchSizer to estimate size of columns in HashAgg
[ https://issues.apache.org/jira/browse/DRILL-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347279#comment-16347279 ]
ASF GitHub Bot commented on DRILL-6032:
---
Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165135291
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java ---
@@ -215,6 +206,7 @@ public BatchHolder() {
MaterializedField outputField = materializedValueFields[i];
// Create a type-specific ValueVector for this value
vector = TypeHelper.getNewVector(outputField, allocator);
+ int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --
@ppadma I thought estSize represented the estimated column width. For fixed-width vectors we know the exact column width, so why can't we use the exact value? Also, why are there two different mechanisms for measuring column sizes? When do you use RecordBatchSizer, and when do you use TypeHelper?
> Use RecordBatchSizer to estimate size of columns in HashAgg
> ---
>
> Key: DRILL-6032
> URL: https://issues.apache.org/jira/browse/DRILL-6032
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Timothy Farkas
> Assignee: Timothy Farkas
> Priority: Major
> Fix For: 1.13.0
>
> We need to use the RecordBatchSizer to estimate the size of columns in the
> Partition batches created by HashAgg.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
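The sizing rules debated in this thread depend on the column's DataMode: a Required INT takes 4 bytes per value, an Optional INT adds a one-byte "is set" entry (5 total), and a Repeated INT needs the average array cardinality, which only sampling an input batch (what RecordBatchSizer does) can supply. A standalone sketch of those rules, with illustrative names (estimateSize and this DataMode enum are simplifications, not Drill's actual API):

```java
// Simplified, self-contained sketch of the per-DataMode sizing rules
// discussed above; DataMode mirrors Drill's enum, but this compiles alone.
public class ColumnSizeSketch {
    public enum DataMode { REQUIRED, OPTIONAL, REPEATED }

    // Estimated bytes per row for a fixed-width column of the given width.
    // REQUIRED: just the value. OPTIONAL: value plus a 1-byte "is set" entry.
    // REPEATED: average cardinality * width plus a 4-byte offset entry; the
    // cardinality must come from sampling an input batch, which is exactly
    // what RecordBatchSizer does.
    public static int estimateSize(DataMode mode, int valueWidth, double avgCardinality) {
        switch (mode) {
            case REQUIRED: return valueWidth;
            case OPTIONAL: return valueWidth + 1;
            case REPEATED: return (int) Math.ceil(avgCardinality * valueWidth) + 4;
            default: throw new IllegalArgumentException(String.valueOf(mode));
        }
    }

    public static void main(String[] args) {
        System.out.println(estimateSize(DataMode.REQUIRED, 4, 0));   // 4
        System.out.println(estimateSize(DataMode.OPTIONAL, 4, 0));   // 5
        System.out.println(estimateSize(DataMode.REPEATED, 4, 2.5)); // 14
    }
}
```

Variable-width columns (VARCHAR, VARBINARY) have no static width at all, which is why the thread argues for sampling the input instead of a fixed guess like 50.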
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347178#comment-16347178 ] ASF GitHub Bot commented on DRILL-6106: --- Github user reudismam commented on the issue: https://github.com/apache/drill/pull/1099 I have squashed the commits, but I’m getting an error in Travis CI similar to the previous one when I reverted some changes. Column a-offsets of type UInt4Vector: Offset (0) must be 0 but was 1 > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. > > > Key: DRILL-6106 > URL: https://issues.apache.org/jira/browse/DRILL-6106 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.12.0 >Reporter: Reudismam Rolim de Sousa >Assignee: Reudismam Rolim de Sousa >Priority: Minor > Labels: ready-to-commit > Fix For: 1.13.0 > > > Use valueOf method instead of constructor since valueOf has a higher > performance by caching frequently requested values. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
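The caching behavior the patch exploits can be demonstrated directly with Integer, whose valueOf is guaranteed to cache boxed values at least in the -128..127 range, while the constructor always allocates a fresh object:

```java
public class ValueOfDemo {
    public static void main(String[] args) {
        // valueOf returns cached instances for small values (-128..127 for
        // Integer), so repeated boxing of common values allocates nothing new.
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        System.out.println(a == b);      // true: same cached instance

        // The (deprecated) constructor always allocates a distinct object.
        Integer c = new Integer(100);
        Integer d = new Integer(100);
        System.out.println(c == d);      // false: distinct objects
        System.out.println(c.equals(d)); // true: equal values
    }
}
```

This is why the replacement is a pure win for frequently boxed values: same value semantics, fewer allocations.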
[jira] [Commented] (DRILL-5377) Five-digit year dates are displayed incorrectly via jdbc
[ https://issues.apache.org/jira/browse/DRILL-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347010#comment-16347010 ]
ASF GitHub Bot commented on DRILL-5377:
---
Github user vdiravka commented on the issue: https://github.com/apache/drill/pull/916
@arina-ielchiieva You are right. According to the SQL spec, after resolving [CALCITE-2055](https://issues.apache.org/jira/browse/CALCITE-2055) and the Drill-Calcite upgrade, Drill and Calcite don't support five-digit years. Please find more details in the jira description.
> Five-digit year dates are displayed incorrectly via jdbc
>
> Key: DRILL-5377
> URL: https://issues.apache.org/jira/browse/DRILL-5377
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.10.0
> Reporter: Rahul Challapalli
> Assignee: Vitalii Diravka
> Priority: Minor
> Fix For: 1.13.0
>
> git.commit.id.abbrev=38ef562
> The issue is connected to displaying five-digit year dates via jdbc.
> Below is the output I get from the test framework when I disable auto correction
> for date fields:
> {code}
> select l_shipdate from table(cp.`tpch/lineitem.parquet` (type => 'parquet', autoCorrectCorruptDates => false)) order by l_shipdate limit 10;
> ^@356-03-19
> ^@356-03-21
> ^@356-03-21
> ^@356-03-23
> ^@356-03-24
> ^@356-03-24
> ^@356-03-26
> ^@356-03-26
> ^@356-03-26
> ^@356-03-26
> {code}
> Or a simpler case:
> {code}
> 0: jdbc:drill:> select cast('11356-02-16' as date) as FUTURE_DATE from (VALUES(1));
> +--+
> | FUTURE_DATE |
> +--+
> | 356-02-16 |
> +--+
> 1 row selected (0.293 seconds)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5377) Five-digit year dates are displayed incorrectly via jdbc
[ https://issues.apache.org/jira/browse/DRILL-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347011#comment-16347011 ]
ASF GitHub Bot commented on DRILL-5377:
---
Github user vdiravka closed the pull request at: https://github.com/apache/drill/pull/916
> Five-digit year dates are displayed incorrectly via jdbc
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (DRILL-5377) Five-digit year dates are displayed incorrectly via jdbc
[ https://issues.apache.org/jira/browse/DRILL-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalii Diravka resolved DRILL-5377.
Resolution: Not A Problem
[~vvysotskyi] Thank you. So for now, the test cases from the jira description will fail with:
{code}
java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Year out of range: [11356]
{code}
This is an expected exception. Nothing should be fixed.
> Five-digit year dates are displayed incorrectly via jdbc
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6111) NullPointerException with Kafka Storage Plugin
[ https://issues.apache.org/jira/browse/DRILL-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346983#comment-16346983 ] Arina Ielchiieva commented on DRILL-6111: - [~akumarb2010] & [~kam_iitkgp] could you please take a look? > NullPointerException with Kafka Storage Plugin > -- > > Key: DRILL-6111 > URL: https://issues.apache.org/jira/browse/DRILL-6111 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Other >Affects Versions: 1.12.0 >Reporter: Jared Stehler >Priority: Major > > I'm unable to query using the kafka storage plugin; queries are failing with > a NPE which *seems* like a json typo: > {code:java} > org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: > NullPointerException > Fragment 1:2 > [Error Id: 49d5f72f-0187-480b-8b29-6eeeb5adc88f on 10.80.53.16:31820] > at > org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:586) > ~[drill-common-1.12.0.jar:1.12.0] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:298) > [drill-java-exec-1.12.0.jar:1.12.0] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160) > [drill-java-exec-1.12.0.jar:1.12.0] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:267) > [drill-java-exec-1.12.0.jar:1.12.0] > at > org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) > [drill-common-1.12.0.jar:1.12.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [na:1.8.0_131] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_131] > at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131] > Caused by: com.fasterxml.jackson.databind.JsonMappingException: Instantiation > of [simple type, class org.apache.drill.exec.store.kafka.KafkaSubScan] value > failed (java.lang.NullPointerException): null > at [Source: { > "pop" : 
"single-sender", > "@id" : 0, > "receiver-major-fragment" : 0, > "receiver-minor-fragment" : 0, > "child" : { > "pop" : "selection-vector-remover", > "@id" : 1, > "child" : { > "pop" : "limit", > "@id" : 2, > "child" : { > "pop" : "kafka-partition-scan", > "@id" : 3, > "userName" : "", > "columns" : [ "`*`" ], > "partitionSubScanSpecList" : [ { > "topicName" : "ingest-prime", > "partitionId" : 5, > "startOffset" : 8824294, > "endOffset" : 8874172 > }, { > "topicName" : "ingest-prime", > "partitionId" : 1, > "startOffset" : 8826346, > "endOffset" : 8874623 > }, { > "topicName" : "ingest-prime", > "partitionId" : 6, > "startOffset" : 8824744, > "endOffset" : 8874617 > } ], > "initialAllocation" : 100, > "maxAllocation" : 100, > "KafkaStoragePluginConfig" : { > "type" : "kafka", > "kafkaConsumerProps" : { > "key.deserializer" : > "org.apache.kafka.common.serialization.ByteArrayDeserializer", > "auto.offset.reset" : "earliest", > "bootstrap.servers" : > "kafkas.dev3.master.us-west-2.prod.aws.intellify.io:9092", > "enable.auto.commit" : "true", > "group.id" : "drill-query-consumer-1", > "value.deserializer" : > "org.apache.kafka.common.serialization.ByteArrayDeserializer", > "session.timeout.ms" : "3" > }, > "enabled" : true > }, > "cost" : 0.0 > }, > "first" : 0, > "last" : 2, > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : 2.0 > }, > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : 2.0 > }, > "destination" : "CgsxMC44MC41My4xNhDM+AEYzfgBIM74ATIGMS4xMi4wOAA=", > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : 2.0 > }; line: 49, column: 7] (through reference chain: > org.apache.drill.exec.physical.config.SingleSender["child"]->org.apache.drill.exec.physical.config.SelectionVectorRemover["child"]->org.apache.drill.exec.physical.config.Limit["child"]) > at > com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:263) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > 
com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.wrapAsJsonMappingException(StdValueInstantiator.java:453) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.rewrapCtorProblem(StdValueInstantiator.java:472) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromObjectWith(StdValueInstantiator.java:258) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build(PropertyBasedCreator.java:135) > ~[jackson-databind-2.7.9.1.jar:2.7.9.1] > at > com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:444) >
[jira] [Commented] (DRILL-5377) Five-digit year dates are displayed incorrectly via jdbc
[ https://issues.apache.org/jira/browse/DRILL-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346944#comment-16346944 ]
Volodymyr Vysotskyi commented on DRILL-5377:
After the changes made in CALCITE-1690, a date string must strictly match the pattern
{noformat}
[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]
{noformat}
CALCITE-2055 added a check for the ranges of date elements. More details connected with the SQL spec may be found in {{6.1 }}
> Five-digit year dates are displayed incorrectly via jdbc
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
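A minimal sketch of that strict shape check (the regex is taken from the comment above; the class and method names are illustrative, not Calcite's actual code):

```java
import java.util.regex.Pattern;

public class DatePatternCheck {
    // Strict date-literal pattern enforced after CALCITE-1690: exactly four
    // year digits, so a five-digit year such as 11356-02-16 is rejected.
    private static final Pattern DATE =
        Pattern.compile("[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]");

    public static boolean isValidDateLiteral(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidDateLiteral("1992-01-01"));  // true
        System.out.println(isValidDateLiteral("11356-02-16")); // false
    }
}
```

Note this checks only the shape; CALCITE-2055 additionally range-checks each element, which is where the "Year out of range" exception comes from.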
[jira] [Commented] (DRILL-5377) Five-digit year dates are displayed incorrectly via jdbc
[ https://issues.apache.org/jira/browse/DRILL-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346911#comment-16346911 ]
Arina Ielchiieva commented on DRILL-5377:
---
[~vitalii] after the upgrade to Calcite 1.15, a year with more than 4 digits is disallowed according to the SQL standard. [~vvysotskyi] please confirm.
> Five-digit year dates are displayed incorrectly via jdbc
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-5377) Five-digit year dates are displayed incorrectly via jdbc
[ https://issues.apache.org/jira/browse/DRILL-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346914#comment-16346914 ]
ASF GitHub Bot commented on DRILL-5377:
---
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/916
It seems that this PR is not relevant after Calcite upgrade. @vdiravka please confirm and close PR.
> Five-digit year dates are displayed incorrectly via jdbc
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346884#comment-16346884 ]
ASF GitHub Bot commented on DRILL-6118:
---
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1104
@chunhui-shi please review.
> Handle item star columns during project / filter push down and directory pruning
> --
>
> Key: DRILL-6118
> URL: https://issues.apache.org/jira/browse/DRILL-6118
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.12.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
> Priority: Major
> Labels: doc-impacting
> Fix For: 1.13.0
>
> Project push down, filter push down and partition pruning do not work with a
> dynamically expanded column which is represented as a star in the ITEM operator:
> _ITEM($0, 'column_name')_ where $0 is a star.
> This often occurs when a view, sub-select or cte with a star is issued.
> To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which
> will rewrite such ITEM operators before filter push down and directory
> pruning. Project-into-scan push down logic will be handled separately in the
> already existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can
> consider the following queries the same:
> {{select col1 from t}}
> {{select col1 from (select * from t)}}
> *Use cases*
> Since item star columns were not considered during project / filter push
> down and directory pruning, push down and pruning did not happen. This was
> causing Drill to read all columns from a file (when only several are needed) or
> to read all files instead. Views with a star query are the most common example.
> Such behavior significantly degrades performance for item star queries
> compared to queries without item star.
> *EXAMPLES*
> *Data set*
> This will create a table with three files, each in a dedicated sub-folder:
> {noformat}
> use dfs.tmp;
> create table `order_ctas/t1` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-01' and date '1992-01-03';
> create table `order_ctas/t2` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-04' and date '1992-01-06';
> create table `order_ctas/t3` as select cast(o_orderdate as date) as o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date '1992-01-07' and date '1992-01-09';
> {noformat}
> *Filter push down*
> {{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read only one file
> {noformat}
> 00-00 Screen
> 00-01 Project(**=[$0])
> 00-02 Project(T1¦¦**=[$0])
> 00-03 SelectionVectorRemover
> 00-04 Filter(condition=[=($1, 1992-01-01)])
> 00-05 Project(T1¦¦**=[$0], o_orderdate=[$1])
> 00-06 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where o_orderdate = date '1992-01-01'}} will read all three files
> {noformat}
> 00-00 Screen
> 00-01 Project(**=[$0])
> 00-02 SelectionVectorRemover
> 00-03 Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
> 00-04 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Directory pruning*
> {{select * from order_ctas where dir0 = 't1'}} will read data only from one folder
> {noformat}
> 00-00 Screen
> 00-01
Project(**=[$0]) > 00-02Project(**=[$0]) > 00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath > [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmporder_ctas, > numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]]) > {noformat} > {{select * from (select * from order_ctas) where dir0 = 't1'}} will read > content of all three folders > {noformat} > 00-00Screen > 00-01 Project(**=[$0]) > 00-02SelectionVectorRemover > 00-03 Filter(condition=[=(ITEM($0, 'dir0'), 't1')]) > 00-04Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath > [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath > [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath >
[jira] [Commented] (DRILL-6118) Handle item star columns during project / filter push down and directory pruning
[ https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346883#comment-16346883 ]
ASF GitHub Bot commented on DRILL-6118:
---
GitHub user arina-ielchiieva opened a pull request: https://github.com/apache/drill/pull/1104
DRILL-6118: Handle item star columns during project / filter push down and directory pruning
1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.
2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from PrelUtil.class.
3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
4. Added unit tests to check project / filter push down and directory pruning with item star.
Details in [DRILL-6118](https://issues.apache.org/jira/browse/DRILL-6118).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/arina-ielchiieva/drill DRILL-6118
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1104.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1104
commit 4673bfb593ca6422d58fa9e0e6eb281a69f1ed69
Author: Arina Ielchiieva
Date: 2017-12-21T17:31:00Z
DRILL-6118: Handle item star columns during project / filter push down and directory pruning
1. Added DrillFilterItemStarReWriterRule to re-write item star fields to regular field references.
2. Refactored DrillPushProjectIntoScanRule to handle item star fields, factored out helper classes and methods from PrelUtil.class.
3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
4. Added unit tests to check project / filter push down and directory pruning with item star.
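Drill's actual rule operates on Calcite RexNode trees, but the essence of the rewrite can be shown at the string level (a toy sketch for illustration, not DrillFilterItemStarReWriterRule's real implementation):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ItemStarRewriteSketch {
    // Matches ITEM($0, 'name') where $0 stands for the dynamic star.
    private static final Pattern ITEM_STAR =
        Pattern.compile("ITEM\\(\\$0, '([^']+)'\\)");

    // ITEM over a star carries nothing beyond the column name, so it is
    // equivalent to a direct field reference, which is the form that
    // project/filter push-down and directory pruning know how to handle.
    public static String rewrite(String expr) {
        Matcher m = ITEM_STAR.matcher(expr);
        return m.matches() ? "`" + m.group(1) + "`" : expr;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("ITEM($0, 'o_orderdate')")); // `o_orderdate`
        System.out.println(rewrite("`col1`"));                  // unchanged
    }
}
```

After such a rewrite, the filter in the plans above references a plain column instead of ITEM($0, ...), so pruning can reduce the scan to a single file or folder.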
[jira] [Updated] (DRILL-5978) Upgrade drill-hive library version to 2.1 or newer.
[ https://issues.apache.org/jira/browse/DRILL-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arina Ielchiieva updated DRILL-5978:
Labels: doc-impacting (was: )
> Upgrade drill-hive library version to 2.1 or newer.
> ---
>
> Key: DRILL-5978
> URL: https://issues.apache.org/jira/browse/DRILL-5978
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Hive
> Affects Versions: 1.11.0
> Reporter: Vitalii Diravka
> Assignee: Vitalii Diravka
> Priority: Major
> Labels: doc-impacting
> Fix For: 1.13.0
>
> Currently Drill uses [Hive version 1.2.1 libraries|https://github.com/apache/drill/blob/master/pom.xml#L53] to perform
> queries on Hive. This version of the library can be used for Hive 1.x and
> Hive 2.x versions too, but some features of Hive 2.x are broken (for example,
> the use of ORC transactional tables). To fix that, it would be good to update
> the drill-hive library version to 2.1 or newer.
> Tasks which should be done:
> - resolving dependency conflicts;
> - investigating backward compatibility of the newer drill-hive library with older Hive versions (1.x);
> - updating the drill-hive version for the [MapR|https://github.com/apache/drill/blob/master/pom.xml#L1777] profile too.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-4185) UNION ALL involving empty directory on any side of union all results in Failed query
[ https://issues.apache.org/jira/browse/DRILL-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arina Ielchiieva updated DRILL-4185:
Issue Type: Improvement (was: Bug)
> UNION ALL involving empty directory on any side of union all results in Failed query
> --
>
> Key: DRILL-4185
> URL: https://issues.apache.org/jira/browse/DRILL-4185
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Relational Operators
> Affects Versions: 1.4.0
> Reporter: Khurram Faraaz
> Assignee: Vitalii Diravka
> Priority: Major
> Labels: doc-impacting, ready-to-commit
> Fix For: 1.13.0
>
> A UNION ALL query that involves an empty directory on either side of the UNION ALL
> operator results in a FAILED query. We should return the results for the
> non-empty side (input) of the UNION ALL.
> Note that empty_DIR is an empty directory: the directory exists, but it has
> no files in it.
> Drill 1.4 git.commit.id=b9068117
> 4 node cluster on CentOS
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select columns[0] from empty_DIR UNION ALL select cast(columns[0] as int) c1 from `testWindow.csv`;
> Error: VALIDATION ERROR: From line 1, column 24 to line 1, column 32: Table 'empty_DIR' not found
> [Error Id: 5c024786-6703-4107-8a4a-16c96097be08 on centos-01.qa.lab:31010] (state=,code=0)
> 0: jdbc:drill:schema=dfs.tmp> select cast(columns[0] as int) c1 from `testWindow.csv` UNION ALL select columns[0] from empty_DIR;
> Error: VALIDATION ERROR: From line 1, column 90 to line 1, column 98: Table 'empty_DIR' not found
> [Error Id: 58c98bc4-99df-425c-aa07-c8c5faec4748 on centos-01.qa.lab:31010] (state=,code=0)
> {code}
> *Fix overview:*
> After resolving the current issue, Drill can query an empty directory. It is a
> schemaless Drill table for now.
> A user can query an empty directory and use it in queries with any JOIN and UNION
> (UNION ALL) operators.
> An empty directory with parquet metadata cache files is a schemaless Drill table
> as well.
> It works similarly to empty files:
> - The query with star will return an empty result.
> - If some fields are indicated in the select statement, those fields will be
> returned as INT-OPTIONAL types.
> - The empty directory in a query with the UNION operator will not change the
> result, as if the statement with UNION were absent from the query.
> - The query with joins will return an empty result except in the cases of using
> outer join clauses, when the outer table for "right join" or the derived table
> for "left join" has data. In that case the data from the non-empty table is
> returned.
> - The empty directory table can be used in complex queries.
> *Code changes:*
> Internally, an empty directory is interpreted as a DynamicDrillTable with null
> selection. SchemalessScan, SchemalessBatchCreator and SchemalessBatch are
> introduced and used at execution time for interactions with other operators
> and batches.
> If an empty directory contains parquet metadata cache files, the ParquetGroupScan
> for such a table is not valid and SchemalessScan is used instead.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
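The intended UNION ALL semantics from the fix overview can be modeled trivially: the empty-directory side contributes no rows, so the result equals the non-empty side (a toy sketch over lists, not Drill's SchemalessBatch machinery):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyUnionSketch {
    // UNION ALL is row concatenation; an empty (schemaless) input
    // contributes nothing, leaving the non-empty side's rows unchanged.
    public static <T> List<T> unionAll(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> emptyDir = Collections.emptyList();
        List<Integer> rows = Arrays.asList(1, 2, 3);
        System.out.println(unionAll(emptyDir, rows)); // [1, 2, 3]
        System.out.println(unionAll(rows, emptyDir)); // [1, 2, 3]
    }
}
```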
[jira] [Updated] (DRILL-4185) UNION ALL involving empty directory on any side of union all results in Failed query
[ https://issues.apache.org/jira/browse/DRILL-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arina Ielchiieva updated DRILL-4185:
Labels: doc-impacting ready-to-commit (was: doc-impacting)
> UNION ALL involving empty directory on any side of union all results in Failed query
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-4185) UNION ALL involving empty directory on any side of union all results in Failed query
[ https://issues.apache.org/jira/browse/DRILL-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-4185: Fix Version/s: 1.13.0
[jira] [Commented] (DRILL-4185) UNION ALL involving empty directory on any side of union all results in Failed query
[ https://issues.apache.org/jira/browse/DRILL-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346662#comment-16346662 ] ASF GitHub Bot commented on DRILL-4185: --- Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1083 +1, LGTM. Thanks for making the changes.
[jira] [Commented] (DRILL-4185) UNION ALL involving empty directory on any side of union all results in Failed query
[ https://issues.apache.org/jira/browse/DRILL-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346641#comment-16346641 ] ASF GitHub Bot commented on DRILL-4185: --- Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/1083#discussion_r165023581
--- Diff: exec/java-exec/src/test/java/org/apache/drill/TestJoinNullable.java ---
@@ -568,6 +570,22 @@ public void nullMixedComparatorEqualJoinHelper(final String query) throws Except .go(); } + /** InnerJoin with empty dir table on nullable cols, MergeJoin */ + // TODO: the same tests should be added for HashJoin operator, DRILL-6070 + @Test
--- End diff --
The bug was found for NLJ and empty tables. I have resolved that issue. A separate test class was added for empty dir tables and the different join operators. I have also refactored the TestHashJoinAdvanced, TestMergeJoinAdvanced and TestNestedLoopJoin classes.
[jira] [Commented] (DRILL-6124) testCountDownLatch can be null in PartitionerDecorator depending on user's injection controls config
[ https://issues.apache.org/jira/browse/DRILL-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346585#comment-16346585 ] ASF GitHub Bot commented on DRILL-6124: --- Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1103 @ilooner it looks like if the latch is not found, execution controls will return a dummy latch [1]? If I am missing something, please explain. [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/testing/ExecutionControls.java#L206
> testCountDownLatch can be null in PartitionerDecorator depending on user's injection controls config
>
> Key: DRILL-6124
> URL: https://issues.apache.org/jira/browse/DRILL-6124
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.12.0
> Reporter: Timothy Farkas
> Assignee: Timothy Farkas
> Priority: Minor
> Fix For: 1.13.0
>
> In PartitionerDecorator we get a latch from the injector with the following code:
> testCountDownLatch = injector.getLatch(context.getExecutionControls(), "partitioner-sender-latch");
> However, if there is no injection site defined in the user's drill configuration, then testCountDownLatch will be null, so we have to check whether it is null in order to avoid NPEs.
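The null guard discussed in the issue can be sketched as follows. This is a minimal illustration, not the actual Drill patch: the class and method names are hypothetical, and it only assumes (per the issue description) that the injector may hand back a null latch when no injection site is configured.

```java
import java.util.concurrent.CountDownLatch;

public class LatchGuardDemo {
    // Hypothetical helper illustrating the fix: injector.getLatch(...) may
    // return null when the user's config defines no injection site, so the
    // latch must be null-checked before use to avoid an NPE.
    static void awaitIfPresent(CountDownLatch latch) {
        if (latch != null) {
            try {
                latch.await();
            } catch (InterruptedException e) {
                // restore the interrupt flag rather than swallowing it
                Thread.currentThread().interrupt();
            }
        }
        // null latch: nothing to wait on, simply proceed
    }

    public static void main(String[] args) {
        awaitIfPresent(null); // no NPE with the guard in place
        CountDownLatch latch = new CountDownLatch(1);
        latch.countDown();
        awaitIfPresent(latch); // count already zero, returns immediately
    }
}
```

(Note Arina's point above: if ExecutionControls already returns a dummy latch rather than null, the guard may be redundant; the sketch only shows the defensive pattern under discussion.)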
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346581#comment-16346581 ] ASF GitHub Bot commented on DRILL-6106: --- Github user reudismam commented on the issue: https://github.com/apache/drill/pull/1099 Maybe it did not work as expected. It squashed the commits (into the first commit), but since that commit mixes in commits from other people, they come along with it. Perhaps the solution is to create a patch file for the desired commit and apply that patch to a new pull request.
> Use valueOf method instead of constructor since valueOf has higher performance by caching frequently requested values.
>
> Key: DRILL-6106
> URL: https://issues.apache.org/jira/browse/DRILL-6106
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.12.0
> Reporter: Reudismam Rolim de Sousa
> Assignee: Reudismam Rolim de Sousa
> Priority: Minor
> Labels: ready-to-commit
> Fix For: 1.13.0
>
> Use the valueOf method instead of the constructor, since valueOf has higher performance by caching frequently requested values.
[jira] [Commented] (DRILL-6106) Use valueOf method instead of constructor since valueOf has a higher performance by caching frequently requested values.
[ https://issues.apache.org/jira/browse/DRILL-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346587#comment-16346587 ] ASF GitHub Bot commented on DRILL-6106: --- Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1099 Well, you can always force-push to overwrite your previous changes, or even replace your remote branch with a new local one.
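As a quick illustration of the caching behavior behind this change (standard Java semantics, not Drill code): `Integer.valueOf` reuses cached boxed instances for small values, while the constructor always allocates a fresh object.

```java
public class ValueOfDemo {
    public static void main(String[] args) {
        // The JLS requires Integer.valueOf to cache values in [-128, 127],
        // so repeated calls for the same small value return the same object.
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        System.out.println(a == b);   // true: same cached instance

        // The constructor always allocates (and is deprecated since Java 9).
        Integer c = new Integer(100);
        System.out.println(a == c);   // false: distinct instance
    }
}
```

Autoboxing (`Integer x = 100;`) compiles to `Integer.valueOf(100)`, so the same caching applies there.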
[jira] [Updated] (DRILL-6099) Drill does not push limit past project (flatten) if it cannot be pushed into scan
[ https://issues.apache.org/jira/browse/DRILL-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-6099: Reviewer: Chunhui Shi
> Drill does not push limit past project (flatten) if it cannot be pushed into scan
>
> Key: DRILL-6099
> URL: https://issues.apache.org/jira/browse/DRILL-6099
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.12.0
> Reporter: Gautam Kumar Parai
> Assignee: Gautam Kumar Parai
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.13.0
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> It would be useful to have pushdown occur past flatten(project). Here is an example to illustrate the issue:
> {{explain plan without implementation for select name, flatten(categories) as category from dfs.`/tmp/t_json_20` LIMIT 1;}}
> {{DrillScreenRel}}
> {{  DrillLimitRel(fetch=[1])}}
> {{    DrillProjectRel(name=[$0], category=[FLATTEN($1)])}}
> {{      DrillScanRel(table=[[dfs, /tmp/t_json_20]], groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/t_json_20, numFiles=1, columns=[`name`, `categories`], files=[maprfs:///tmp/t_json_20/0_0_0.json]]])}}
> =
> Content of 0_0_0.json
> =
> { "name" : "Eric Goldberg, MD", "categories" : [ "Doctors", "Health & Medical" ] }
> { "name" : "Pine Cone Restaurant", "categories" : [ "Restaurants" ] }
> { "name" : "Deforest Family Restaurant", "categories" : [ "American (Traditional)", "Restaurants" ] }
> { "name" : "Culver's", "categories" : [ "Food", "Ice Cream & Frozen Yogurt", "Fast Food", "Restaurants" ] }
> { "name" : "Chang Jiang Chinese Kitchen", "categories" : [ "Chinese", "Restaurants" ] }
[jira] [Updated] (DRILL-6124) testCountDownLatch can be null in PartitionerDecorator depending on user's injection controls config
[ https://issues.apache.org/jira/browse/DRILL-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-6124: Affects Version/s: 1.12.0
[jira] [Updated] (DRILL-6124) testCountDownLatch can be null in PartitionerDecorator depending on user's injection controls config
[ https://issues.apache.org/jira/browse/DRILL-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-6124: Fix Version/s: 1.13.0