[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668561#comment-16668561
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

vdiravka commented on issue #1170: DRILL-6223: Fixed several Drillbit failures 
due to schema changes
URL: https://github.com/apache/drill/pull/1170#issuecomment-434259969
 
 
   @sachouche Could you please respond to review comments? Should the changes 
be revised? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill fails on Schema changes 
> --
>
> Key: DRILL-6223
> URL: https://issues.apache.org/jira/browse/DRILL-6223
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.10.0, 1.12.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
>
> Drill query fails when selecting all columns from a complex nested data 
> file set (Parquet). There are differences in schema among the files:
>  * The Parquet files exhibit differences both at the first level and within 
> nested data types
>  * A select * will not cause an exception but using a limit clause will
>  * Note also this issue seems to happen only when multiple Drillbit minor 
> fragments are involved (concurrency higher than one)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-04-02 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422069#comment-16422069
 ] 

Parth Chandra commented on DRILL-6223:
--

{quote}Sorry to say, I still disagree with this statement: "This pull request 
adds logic to detect and eliminate dangling columns".
{quote}
Adding some more background -

[~paul-rogers] has this exactly right. At a high level, dropping any columns 
for star queries is not the solution, as a star query explicitly asks for all 
columns.

More importantly, as Paul pointed out, the solution (in some cases) is to 
provide a schema.

For Parquet (or any other source that has schema information), we might solve 
the problem by creating a union of the columns (and also take care of type 
changes by promoting columns to a common type). Parquet currently does not do 
that. Worse, it overrides the schema seen so far with the new schema when a 
change is encountered (I would suggest, once again, a perusal of the Parquet 
metadata cache code). This means in the presence of schema change with Parquet 
files, even the planning gets off on the wrong foot.

The only solution, again, as Paul mentioned, is to provide a composite schema 
by inferring it or asking the user to provide one. The latter is hard because 
sometimes the user does not have the schema or because it is mutating all the 
time (it happens when people move fast and break things). This is exactly when 
they want to use Drill, and, in fact, this was a primary use case for the early 
design of Drill. As a design constraint, we had to assume that we could not 
know the schema until runtime (we may have taken it too far :( ).

We could try to infer the schema by doing a complete scan, but that has the same 
issues as asking the user to specify a schema and, in addition, can take a 
really long time.
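
The idea mentioned above of building a union of the columns and promoting conflicting types to a common type can be sketched as follows. This is only an illustration under stated assumptions: the `ColType` enum, its ordering, and the class and method names are hypothetical and are not Drill's `MinorType` or its actual promotion rules.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: merge per-file schemas into one composite schema. */
final class CompositeSchemaSketch {

  /** Illustrative type ladder only; not Drill's MinorType or its promotion rules. */
  enum ColType { INT, BIGINT, DOUBLE, VARCHAR }

  /** Promote two types to a common one: the later of the two in enum order. */
  static ColType promote(ColType a, ColType b) {
    return a.ordinal() >= b.ordinal() ? a : b;
  }

  /** Union of all columns across files, keeping first-seen column order. */
  static Map<String, ColType> merge(List<Map<String, ColType>> fileSchemas) {
    Map<String, ColType> composite = new LinkedHashMap<>();
    for (Map<String, ColType> schema : fileSchemas) {
      for (Map.Entry<String, ColType> col : schema.entrySet()) {
        // New column: add it. Known column: widen to a common type.
        composite.merge(col.getKey(), col.getValue(), CompositeSchemaSketch::promote);
      }
    }
    return composite;
  }

  public static void main(String[] args) {
    Map<String, ColType> fileA = new LinkedHashMap<>();
    fileA.put("F1", ColType.INT);
    fileA.put("F2", ColType.INT);
    fileA.put("F3", ColType.DOUBLE);
    Map<String, ColType> fileB = new LinkedHashMap<>();
    fileB.put("F1", ColType.INT);
    fileB.put("F2", ColType.VARCHAR);
    // prints {F1=INT, F2=VARCHAR, F3=DOUBLE}
    System.out.println(merge(Arrays.asList(fileA, fileB)));
  }
}
```

Note that the sketch needs every file's schema before the first row is produced, which is exactly the cost of a full scan or of a user-provided schema discussed above.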

 

 

 

 

> Drill fails on Schema changes 
> --
>
> Key: DRILL-6223
> URL: https://issues.apache.org/jira/browse/DRILL-6223
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.10.0, 1.12.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
> Fix For: 1.14.0
>
>
> Drill query fails when selecting all columns from a complex nested data 
> file set (Parquet). There are differences in schema among the files:
>  * The Parquet files exhibit differences both at the first level and within 
> nested data types
>  * A select * will not cause an exception but using a limit clause will
>  * Note also this issue seems to happen only when multiple Drillbit minor 
> fragments are involved (concurrency higher than one)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-04-02 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422042#comment-16422042
 ] 

Parth Chandra commented on DRILL-6223:
--

{quote}To your point about compensation logic in the context of Schema Changes
 * Why do you think it is ok to dynamically include new columns?
 * Yet it is not ok to exclude them?
{quote}

Usually, in real-world data with dynamically changing schemas, new columns are 
added and not removed. 
{quote}
 * Consider a batch of 32k rows
 * A VV with null integer values will require 32kb (bits) + 32kb * 4 = 160kb
 * Each missing column will require that much memory per mini-fragment
{quote}

One of the guarantees provided by value vectors is that elements can be 
accessed by index in constant time (or, in the case of nested elements, in O(m), 
where m is the level of nesting). The representation is based on providing 
this guarantee. It comes at the cost of additional memory usage, which is a 
deliberate tradeoff.
{quote}This is unless (similarly to the implicit columns) we optimize the VV 
storage representation or / and push the column preservation to higher layers 
such as the client or foreman
{quote}
It would be wonderful to improve vectors to use much less memory while 
providing the same guarantees. A proposal would be welcome. 
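
To make the tradeoff above concrete, here is a minimal sketch of a fixed-width nullable integer vector. It is not Drill's value vector implementation: the class name, the use of plain Java arrays, and the one-byte-per-row null indicator are assumptions chosen to match the arithmetic quoted earlier.

```java
/** Hypothetical fixed-width nullable INT vector; not Drill's actual class. */
final class NullableIntVectorSketch {
  private final byte[] bits;  // 1 = value set, 0 = null; one byte per row (assumption)
  private final int[] values; // one 4-byte slot per row, allocated even for nulls

  NullableIntVectorSketch(int rowCount) {
    bits = new byte[rowCount];
    values = new int[rowCount];
  }

  void set(int index, int value) {
    bits[index] = 1;
    values[index] = value;
  }

  /** Constant time: a single array lookup at the row index. */
  boolean isNull(int index) {
    return bits[index] == 0;
  }

  /** Constant time: the slot position depends only on the row index. */
  int get(int index) {
    if (isNull(index)) {
      throw new IllegalStateException("null at row " + index);
    }
    return values[index];
  }

  /** Payload size is the same whether or not any row holds a value. */
  long allocatedBytes() {
    return bits.length + 4L * values.length;
  }

  public static void main(String[] args) {
    NullableIntVectorSketch vector = new NullableIntVectorSketch(32 * 1024);
    vector.set(7, 99);
    System.out.println(vector.get(7) + ", row 8 null: " + vector.isNull(8));
    System.out.println(vector.allocatedBytes() + " bytes allocated");
  }
}
```

Because every row owns a slot, lookup needs no scan of preceding rows. A more compact encoding for all-null columns would have to keep that property to meet the same guarantee.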

 



[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-30 Thread salim achouche (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421106#comment-16421106
 ] 

salim achouche commented on DRILL-6223:
---

Thanks Paul for your feedback!

I wanted to bring to your attention another implementation detail that we also 
need to take a closer look at. The current implementation of a column with all 
nulls is not cheap:
 * Consider a batch of 32k rows
 * A VV with null integer values will require 32kb (bits) + 32kb * 4 = 160kb
 * Each missing column will require that much memory per mini-fragment
 * This is unless (similarly to the implicit columns) we optimize the VV 
storage representation or / and push the column preservation to higher layers 
such as the client or foreman
 * I understand that handling a few missing columns this way is fine, but this 
will not be the case if we are talking about dozens of such columns, especially 
now that we are ramping up our support for operator-based batch sizing



[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-30 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421091#comment-16421091
 ] 

Paul Rogers commented on DRILL-6223:


[~sachouche], thanks for the explanation, very helpful. The tests will help 
clarify the original problem and the fix.

Looking at the code, it does appear we try to prune unused columns (there are 
references to used columns; which I naively assumed meant we are separating the 
used from unused, perhaps I'm wrong.)

If we cannot correctly handle a schema change (according to whatever semantics 
we decide we want), then we need to kill the query rather than produce invalid 
results.

On the dynamically adding columns: a careful reading will show that the 
suggestion is to *preserve* columns, not create them. The discussion was around 
when we can preserve columns (columns appeared in first batch, then 
disappeared) and when we can't (columns appear in second or later batch.)

This PR will be solid if we do three things:

* Avoid memory corruption (the primary goal here, and a good one)
* Add unit tests that verify the fix
* Avoid introducing new semantics (dropping columns) as that just digs us 
deeper into the schema-free mess. Instead, fail the query if we are given 
schemas we can't reconcile.




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-30 Thread salim achouche (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420926#comment-16420926
 ] 

salim achouche commented on DRILL-6223:
---

 * This PR is only to avoid corruption now that memory 
checks are disabled
 * As I said, we all agree that Drill enabled an ill-defined functionality
 ** We have the opportunity to discuss, clarify, and formalize it in a dedicated 
JIRA
 * Meanwhile, what to do with the current bugs?
 ** Let's use the following example (which has nothing to do with Schema 
Changes but instead is a byproduct of this functionality); assume the following 
FS structure
 *** ROOT/my_data/T1/\{column-c1}, \{column-c2}, ..
 *** ROOT/my_data/T2/\{column-c1}, \{column-c2}, ..
 ** Assume you issue the following query:  
 *** SELECT * from dfs.`ROOT/my_data/*`;
 ** The current code will blindly attempt to read the files thinking they are 
originating from the same schema
 *** The chance of dangling columns is extremely high
 *** What do we do?
 **** We can either pretend this is a schema change issue and try to address 
it by inserting compensation logic
 **** Or avoid corruption by either failing the query or removing dangling columns
 **** I chose the latter solution because I don't have clarity on the Schema 
Changes functionality
 **** It occurred to me that we can also disable Schema Change logic for 
SELECT_STAR queries
 ** To your point about compensation logic in the context of Schema Changes
 *** Why do you think it is ok to dynamically include new columns?
 *** Yet it is not ok to exclude them?
 *** Why did we include a mechanism to report schema changes to the JDBC 
client? Maybe we thought the consumer app is in a better position to handle 
such events, in which case any compensation logic is unnecessary (?)
 * With regards to tests
 ** I will add a test that triggers this condition
 ** The test will be deemed successful if there are no runtime failures
 ** Whether we should add missing columns or not is still being debated and 
outside the scope of this JIRA



[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420714#comment-16420714
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1170
  
Over the last year, we've tended to favor including unit tests with each 
PR. There don't seem to be any with this one, yet we are proposing to make a 
fairly complex change. Perhaps tests can be added.

Further, by having good tests, we don't have to debate how Drill will 
handle the scenarios discussed in an earlier comment: we just code 'em up and 
try 'em out, letting Drill speak for itself. We can then decide whether or not 
we like the results, rather than discussing hypotheticals.




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420711#comment-16420711
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1170
  
BTW: thanks for tackling such a difficult, core issue in Drill. Drill 
claims to be a) schema free and b) SQL compliant. SQL is based on operations 
over relations with a fixed number of columns of fixed types. Reconciling these 
two ideas is very difficult. Even the original Drill developers, who built a 
huge amount of code very quickly and had intimate knowledge of the Drill 
internals, did not find a good solution, which is why the problem is 
still open.

There are two obvious approaches: 1) redefine SQL to operate over lists of 
maps (with arbitrary name/value pairs that differ across rows), or 2) define 
translation rules from schema-free input into the schema-full relations that 
SQL requires.

This PR attempts to go down the first route: redefine SQL. To be 
successful, we'd want to rely on research papers, if any, that show how to 
reformulate relational theory on top of lists of maps rather than on relations 
and domains.

The other approach is to define conversion rules: something much more on 
the order of a straightforward implementation project. Can the user provide 
conversion rules (in the form of a schema) when the conversion is ambiguous? 
Would users rather encounter schema change exceptions or provide the conversion 
rules? These are interesting open questions.




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420108#comment-16420108
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1170#discussion_r178222960
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/limit/LimitRecordBatch.java ---
@@ -60,13 +60,7 @@ public LimitRecordBatch(Limit popConfig, FragmentContext context, RecordBatch in
   protected boolean setupNewSchema() throws SchemaChangeException {
     container.zeroVectors();
     transfers.clear();
-
-
-    for(final VectorWrapper<?> v : incoming) {
-      final TransferPair pair = v.getValueVector().makeTransferPair(
-          container.addOrGet(v.getField(), callBack));
-      transfers.add(pair);
-    }
+    container.onSchemaChange(incoming, callBack, transfers);
--- End diff --

`onSchemaChange()` may be the wrong name. It describes why this 
functionality is called in this case, but the actual functionality is closer 
to `setupTransfers()` (assuming the removed code was simply moved into the 
container class...)




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420106#comment-16420106
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1170#discussion_r178225725
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java ---
@@ -136,14 +138,28 @@ public void transferOut(VectorContainer containerOut) {
   public <T extends ValueVector> T addOrGet(final MaterializedField field, final SchemaChangeCallBack callBack) {
     final TypedFieldId id = getValueVectorId(SchemaPath.getSimplePath(field.getName()));
     final ValueVector vector;
-    final Class<?> clazz = TypeHelper.getValueVectorClass(field.getType().getMinorType(), field.getType().getMode());
+
     if (id != null) {
-      vector = getValueAccessorById(id.getFieldIds()).getValueVector();
+      vector= getValueAccessorById(id.getFieldIds()).getValueVector();
+      final Class<?> clazz  = TypeHelper.getValueVectorClass(field.getType().getMinorType(), field.getType().getMode());
+
+      // Check whether incoming field and the current one are compatible; if not then replace previous one with the new one
       if (id.getFieldIds().length == 1 && clazz != null && !clazz.isAssignableFrom(vector.getClass())) {
         final ValueVector newVector = TypeHelper.getNewVector(field, this.getAllocator(), callBack);
         replace(vector, newVector);
         return (T) newVector;
       }
+
+      // At this point, we know incoming and current fields are compatible. Maps can have children,
+      // we need to ensure they have the same structure.
+      if (MinorType.MAP.equals(field.getType().getMinorType())
--- End diff --

This is tricky and probably broken elsewhere. If the incoming type is a 
union or a list, then it can contain a nested map.




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420107#comment-16420107
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1170#discussion_r178225930
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java ---
@@ -136,14 +138,28 @@ public void transferOut(VectorContainer containerOut) {
   public <T extends ValueVector> T addOrGet(final MaterializedField field, final SchemaChangeCallBack callBack) {
     final TypedFieldId id = getValueVectorId(SchemaPath.getSimplePath(field.getName()));
     final ValueVector vector;
-    final Class<?> clazz = TypeHelper.getValueVectorClass(field.getType().getMinorType(), field.getType().getMode());
+
     if (id != null) {
-      vector = getValueAccessorById(id.getFieldIds()).getValueVector();
+      vector= getValueAccessorById(id.getFieldIds()).getValueVector();
+      final Class<?> clazz  = TypeHelper.getValueVectorClass(field.getType().getMinorType(), field.getType().getMode());
+
+      // Check whether incoming field and the current one are compatible; if not then replace previous one with the new one
       if (id.getFieldIds().length == 1 && clazz != null && !clazz.isAssignableFrom(vector.getClass())) {
         final ValueVector newVector = TypeHelper.getNewVector(field, this.getAllocator(), callBack);
         replace(vector, newVector);
         return (T) newVector;
       }
+
+      // At this point, we know incoming and current fields are compatible. Maps can have children,
+      // we need to ensure they have the same structure.
+      if (MinorType.MAP.equals(field.getType().getMinorType())
+          && vector != null
+          && !SchemaUtil.isSameSchemaIncludingOrder(vector.getField().getChildren(), field.getChildren())) {
+
+        final ValueVector newVector = TypeHelper.getNewVector(field, this.getAllocator(), callBack);
+        replace(vector, newVector);
+        return (T) newVector;
--- End diff --

We have two vectors, both maps. We found out their schemas differ. We are 
throwing away the old map and replacing it with a new one with no members. Is 
that what we want to do? Or do we want to recursively merge the maps? Is this 
done elsewhere? If so, how do we remember to do that merge?
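
For reference, the recursive-merge alternative raised above could look roughly like the sketch below. It is not taken from the PR or from Drill: schemas are modeled as plain nested `Map`s (a `String` type name for a leaf, a nested `Map` for a map column), the class name is hypothetical, and conflicting leaf types simply fail rather than being promoted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of recursively merging two nested map schemas. */
final class MapMergeSketch {

  /**
   * Leaf columns map to a type name (String); map columns map to a nested Map.
   * Returns a new schema containing the members of both inputs.
   */
  @SuppressWarnings("unchecked")
  static Map<String, Object> merge(Map<String, Object> current, Map<String, Object> incoming) {
    Map<String, Object> merged = new LinkedHashMap<>(current);
    for (Map.Entry<String, Object> entry : incoming.entrySet()) {
      String name = entry.getKey();
      Object existing = merged.get(name);
      Object added = entry.getValue();
      if (existing instanceof Map && added instanceof Map) {
        // Both sides are maps: keep existing members and merge the children.
        merged.put(name, merge((Map<String, Object>) existing, (Map<String, Object>) added));
      } else if (existing == null) {
        merged.put(name, added); // member seen for the first time
      } else if (!existing.equals(added)) {
        // Irreconcilable change: fail instead of silently replacing.
        throw new IllegalStateException("Conflicting types for column " + name);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> currentChild = new LinkedHashMap<>();
    currentChild.put("x", "INT");
    Map<String, Object> current = new LinkedHashMap<>();
    current.put("a", "INT");
    current.put("m", currentChild);

    Map<String, Object> incomingChild = new LinkedHashMap<>();
    incomingChild.put("y", "VARCHAR");
    Map<String, Object> incoming = new LinkedHashMap<>();
    incoming.put("m", incomingChild);
    incoming.put("b", "DOUBLE");

    // prints {a=INT, m={x=INT, y=VARCHAR}, b=DOUBLE}
    System.out.println(merge(current, incoming));
  }
}
```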




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420103#comment-16420103
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1170
  
Sorry to say, I still disagree with this statement: "This pull request adds 
logic to detect and eliminate dangling columns".

There was a prior discussion that `SELECT *` means "return all columns", 
not "return only those columns which happen to be in common." We discussed how 
removing columns midway through a DAG can produce inconsistent results.

But, let's take this particular case: the Project operator.

What should happen (to be consistent with other parts of Drill) is that 
the operator correctly fills in values for the "dangling" F3 so that the output 
is (F1, F2, F3).

Note that this becomes VERY ambiguous. Suppose Projection sees the 
following from Files A(F1, F2, F3) and B(F1, F2)

* Batch A.1 (with F1, F2, F3)
* Batch B.1 (with F1, F2)

Clearly, the project can remember that F3 was previously seen and fill in 
the missing column. (This is exactly what the new projection logic in the new 
scan framework does, by the way.) This works, however, only if F3 is nullable. 
If not... what (non-null) value can we fill in for F3?

Had we known that F3 would turn up dangling, we could have converted F3 in 
the first batch to become nullable, but Drill can't predict the future.

Let's consider the proposal: we drop dangling columns. But, since the 
dangling column (F3) appeared in the first batch, we didn't know it is 
dangling. Only when we see the second batch (B.1) do we realize that F3 was 
dangling and we should have removed it. Again, this algorithm requires 
effective time travel.

Now, suppose that the order is reversed:

* Batch B.1 (with F1, F2)
* Batch A.1 (with F1, F2, F3)

Here, we can identify F3 as dangling and could remove it, so the proposal 
is sound.

On the other hand, the "fill in F3" trick does not work here because 
Project sends B.1 downstream. Later, it notices that A.1 adds a column. Project 
can't somehow retroactively add the missing column; all it can do is trigger a 
schema change downstream. Again, Drill can't predict the future to know that it 
has to fill in F3 in the first B.1 batch.

We've not yet discussed the case in which F2, which exists in both files, 
has distinct types (INT in A, say, and VARCHAR in B). The dangling column trick 
won't work. The same logic as above applies to the type mismatch.

Perhaps we use either the "remove" or "fill in" depending on whether the 
column appears in the first batch. So, for the following:

* Batch A.1 (with F1, F2, F3)
* Batch B.1 (with F1, F2)

The result would be (F1, F2, F3)

But if the input was:

* Batch B.1 (with F1, F2)
* Batch A.1 (with F1, F2, F3)

The result would be (F1, F2)

Since the user has no control over the order that files are read, the 
result would be random: half the time the user gets one schema, the other half 
the other. It is unlikely that the user will perceive that as a feature.
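
The order dependence described above can be demonstrated with a small sketch. It models a batch as nothing more than a list of column names and applies the hybrid rule ("fill in" columns seen in the first batch, "remove" columns that first appear later). The class and method names are hypothetical and this is not how Drill's Project operator is written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the "first batch decides" rule. */
final class FirstBatchRuleSketch {

  /** Returns the schema sent downstream for batches in arrival order. */
  static List<String> outputSchema(List<List<String>> batchesInArrivalOrder) {
    // The first batch fixes the schema that is sent downstream.
    List<String> schema = new ArrayList<>(batchesInArrivalOrder.get(0));
    for (List<String> batch : batchesInArrivalOrder) {
      for (String column : batch) {
        if (!schema.contains(column)) {
          // First seen after the first batch: "remove" it.
          System.out.println("dropping late column " + column);
        }
      }
      for (String column : schema) {
        if (!batch.contains(column)) {
          // Seen in the first batch but missing now: "fill in" nulls.
          System.out.println("filling " + column + " with nulls");
        }
      }
    }
    return schema;
  }

  public static void main(String[] args) {
    List<String> a1 = Arrays.asList("F1", "F2", "F3");
    List<String> b1 = Arrays.asList("F1", "F2");
    System.out.println(outputSchema(Arrays.asList(a1, b1))); // [F1, F2, F3]
    System.out.println(outputSchema(Arrays.asList(b1, a1))); // [F1, F2]
  }
}
```

The same two files yield two different result schemas, and the only thing that changed is the arrival order, which the user does not control.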

The general conclusion is that there is no way that Project can "smooth" 
the schema in the general case: it would have to predict the future to do so.

Now, let's think about other operators, Sort, say. Suppose we do `SELECT * 
FROM foo ORDER BY x`. In this case, there is no project. The Sort operator will 
see batches with differing schemas, but must sort/merge them together. The 
schemas must match. The Sort tries to do this by using the union type (actually 
works, there is a unit test for it somewhere) if columns have conflicting 
types. I suspect it does not work if a column is missing from some batches. 
(Would need to test.)

And, of course, the situation is worse if the dangling column is the sort 
key!

Overall, while it is very appealing to think that dangling columns are a 
"bug" for which there is a fix, the reality is not so simple. This is an 
inherent ambiguity in the Drill model for which there is no fix that both works 
and is consistent with SQL semantics.

What would work? A schema! Suppose we are told that the schema is (F1: INT, 
F2: VARCHAR, F3: nullable DOUBLE). Now, when Project (or even the scanner) 
notices that F3 is missing, it knows to add in the required column of the 
correct type and, voila! no schema change.

Suppose that B defines F2 as INT. We know we want a VARCHAR, and so can do 
an implicit conversion. Again, voila! no schema change.
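
A minimal sketch of how a provided schema removes the ambiguity, using the example schema (F1: INT, F2: VARCHAR, F3: nullable DOUBLE). Everything here is an assumption made for illustration: rows are plain maps, the `Column` class and `conform` method are hypothetical, and the only implicit conversion shown is to VARCHAR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: conform incoming rows to a user-provided schema. */
final class ProvidedSchemaSketch {

  /** One column of the provided schema (illustrative; not a Drill class). */
  static final class Column {
    final String type;
    final boolean nullable;

    Column(String type, boolean nullable) {
      this.type = type;
      this.nullable = nullable;
    }
  }

  /** The schema from the example: (F1: INT, F2: VARCHAR, F3: nullable DOUBLE). */
  static Map<String, Column> exampleSchema() {
    Map<String, Column> schema = new LinkedHashMap<>();
    schema.put("F1", new Column("INT", false));
    schema.put("F2", new Column("VARCHAR", false));
    schema.put("F3", new Column("DOUBLE", true));
    return schema;
  }

  /** Produces a row with exactly the schema's columns, in schema order. */
  static Map<String, Object> conform(Map<String, Column> schema, Map<String, Object> row) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Column> entry : schema.entrySet()) {
      String name = entry.getKey();
      Column column = entry.getValue();
      Object value = row.get(name);
      if (value == null) {
        if (!column.nullable) {
          throw new IllegalStateException("Missing required column " + name);
        }
        out.put(name, null); // missing nullable column: fill in NULL
      } else if (column.type.equals("VARCHAR")) {
        out.put(name, String.valueOf(value)); // implicit conversion, e.g. INT to VARCHAR
      } else {
        out.put(name, value);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> rowFromB = new LinkedHashMap<>();
    rowFromB.put("F1", 1);
    rowFromB.put("F2", 42); // file B stores F2 as INT
    // prints {F1=1, F2=42, F3=null}; F2 is now a String
    System.out.println(conform(exampleSchema(), rowFromB));
  }
}
```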

In summary, I completely agree that the scenario described is a problem. 
But, I don't believe that removing columns is the fix; instead the only valid 
fix is to allow the user to provide a 

[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-29 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420082#comment-16420082
 ] 

Paul Rogers commented on DRILL-6223:


[~sachouche], agree completely that we should fix bugs that can be fixed. The 
rub is that most of the schema change "bugs" are actually inherent ambiguities 
in the input data. For those that are not ambiguous, then there is a single 
answer, and so it is a bug that can be fixed. But, if file A has an INT and 
file B has a VARCHAR, there is no a priori way to know which is "correct" (or 
if, say, both should really be converted to DECIMAL.)

This is a challenging assignment because of these ambiguities. I wrestled with 
the issues with JSON and in the scan operators; there is a long list of cases 
for which there is no good answer without a schema.

On the other hand, if the user presents a collection of files, all with the 
same schema, then Drill should certainly work as there is no ambiguity.

Looking forward to reviewing the code.

> Drill fails on Schema changes 
> --
>
> Key: DRILL-6223
> URL: https://issues.apache.org/jira/browse/DRILL-6223
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.10.0, 1.12.0
>Reporter: salim achouche
>Assignee: salim achouche
>Priority: Major
> Fix For: 1.14.0
>
>
> Drill Query Failing when selecting all columns from a Complex Nested Data 
> File (Parquet) Set). There are differences in Schema among the files:
>  * The Parquet files exhibit differences both at the first level and within 
> nested data types
>  * A select * will not cause an exception but using a limit clause will
>  * Note also this issue seems to happen only when multiple Drillbit minor 
> fragments are involved (concurrency higher than one)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419711#comment-16419711
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1170
  
@parthchandra and @paul-rogers, I have added a comment within the Jira 
[DRILL-6223](https://issues.apache.org/jira/browse/DRILL-6223); please let me 
know what you think.

Thanks!




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-29 Thread salim achouche (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419709#comment-16419709
 ] 

salim achouche commented on DRILL-6223:
---

* I concur that we need to a) ultimately formalize the schema change behavior 
and b) clarify how it can be useful to the Drill users
 ** I have created a new Jira for this effort DRILL-6297 
 ** This hopefully will allow us to address current bugs and even schedule 
future enhancements
 * Back to this Jira
 ** There are currently Drill users hitting schema change issues
 ** Our JDBC client has logic to detect that a schema change occurred and 
consequently notify the client of this event and reload the metadata
 ** The question now is on whether we should wait till DRILL-6297 is resolved
 ** Or try to address some of these bugs albeit temporarily till the associated 
functionality is refined
 ** If you feel that we should wait, then I suggest that we disable this 
functionality so that it doesn't destabilize Drill, especially since the runtime 
checks are disabled by default
 *** The bugs that I have addressed would have corrupted the Drillbit if it 
weren't for the runtime checks



[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-18 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404137#comment-16404137
 ] 

Paul Rogers commented on DRILL-6223:


Been reflecting on [~parthc]'s comment. The challenge here is that, as we like 
to joke, "if something seems simple in Drill, it is only because we are missing 
something." More seriously, Drill combines relational theory, schema-less 
files, and a distributed in-memory system in ways that lead to high complexity 
and, at times, outright contradictions. Nowhere is that more clear than in 
schema handling.

To Parth's point, consider our example of two files in a directory {{d = [p(a, 
b), q(a)]}}. Suppose I do the following:

{noformat}
SELECT a, b FROM d GROUP BY b
{noformat}

Today, one of two things will happen. If {{b}} happens to be a {{nullable 
INT}}, then the result will come back with all nulls (including all rows from 
{{q}}) in one group, along with other groups for the {{b}} values from {{p}}. 
This is what we expect.

Now, suppose, today, we do:

{noformat}
SELECT * FROM d GROUP BY b
{noformat}

Today, we get the same result as the first case. (Or, I hope we do; I have not 
actually tried this...) The reason is that the wildcard should expand to {{a, 
b}}, resulting in the same query.

Of course, if {{b}} turns out to be a {{VARCHAR}}, then everything fails 
(schema exception), because Drill can't group by variable types. (This is not a 
Drill restriction; relational theory is based on fixed domains for each column; 
no one has defined rules for what to do if a single column comes from multiple 
domains.)

Technically, the current behavior for the wildcard is to produce the *union* of 
columns from input files, causing the result we want above. The change seems to 
propose switching the behavior to the *intersection* of columns. As [~parthc] 
notes, this will break queries.
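To make the difference concrete, here is a small sketch of the two wildcard expansions for the example directory {{d = [p(a, b), q(a)]}}. The function names are hypothetical; this models only the column sets, not Drill's actual expansion code.

```python
# Column lists for the two files in the running example d = [p(a, b), q(a)].
file_columns = {"p": ["a", "b"], "q": ["a"]}


def wildcard_union(files):
    """Every column that appears in at least one file, in first-seen order."""
    seen = []
    for cols in files.values():
        for c in cols:
            if c not in seen:
                seen.append(c)
    return seen


def wildcard_intersection(files):
    """Only the columns that appear in every file."""
    column_lists = list(files.values())
    first, rest = column_lists[0], column_lists[1:]
    return [c for c in first if all(c in cols for cols in rest)]


union_cols = wildcard_union(file_columns)
intersection_cols = wildcard_intersection(file_columns)
group_key = "b"
```

Under the union rule the grouping key {{b}} is still present after expansion; under the intersection rule it is gone, which is why a query that groups by {{b}} would change behavior.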

Run the explicit select query above with the proposed change. The result will 
be the same as today. Run the wildcard query. Now, the behavior is 
unpredictable.

If we do a two-phase grouping, we will first group the rows from {{p}} 
by {{b}}. Then we group the rows from {{q}}, notice that {{b}} does not 
exist, and either fail the query or generate the {{nullable INT}} column and 
group by a set of null values. Once the results are combined, {{b}} exists in 
both input paths and so there is no dangling column. We get the same results 
as today, which is good.

But, suppose we do the grouping *after* merging results. In this case, the 
dangling column rule kicks in, we remove column {{b}}, but then we recreate it 
in the group operator, resulting in all data from column {{b}} displaying as 
nulls.

Overall, I think this change would end up causing a never-ending set of bug 
requests as columns sometimes disappear and sometimes they don't.

Further, it is not clear if the fix even makes a schema change go away. Suppose 
we use the wildcard query and do partial grouping in the fragment with the 
scan, as explained above. Since the grouping must create a missing {{b}}, (and 
the distributed groups can't communicate to negotiate which columns to drop), 
we still get a schema change error if {{p}} has column {{b}} as {{VARCHAR}}, 
but the grouping operator for {{q}} introduces a {{nullable INT}}.

So, overall, the goal of eliminating schema change is a good one; no user wants 
their query to fail. We've spent years trying to work out ad-hoc solutions that 
can be applied to each operator. But, local fixes can never solve a global 
problem (all scanners must agree on column number and type.) Hence, I've slowly 
come to realize that having a global schema is the only general solution.

There is another possible solution that I've been meaning to try. Modify the 
query to include a filter that creates a new column {{b1}} defined as something 
like "if the type of {{b}} is {{nullable VARCHAR}} then use the value of {{b}}, 
else use a {{NULL}} of type {{VARCHAR}}." There are some test queries of this 
type for the {{UNION}} type. The unfortunate bit is that this code must be 
inserted into every query. Or, the user must define a view. The extra 
calculations slow the query. Still, it may be the best solution we have at 
present.

Still, your PR mentions complex columns (I suppose maps and map arrays). As 
mentioned earlier, Drill cannot do the above calculations for nested columns, 
so the only solution would be, as part of the view, to project all nested 
columns up to the top level, which completely destroys the ability to do 
lateral joins or flattens.

This area does, indeed, need some serious design thought.


[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-18 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403951#comment-16403951
 ] 

Parth Chandra commented on DRILL-6223:
--

Once again, without having looked at the actual PR: 

If you have schema changes occurring in the source files and the Parquet 
metadata overwrites the types for a column whose type has mutated, then 
everything else downstream is irrelevant. So please take a look at it.

On a more fundamental note, +1 to [~paul-rogers]'s comment on formalizing the 
schema change specification before fixing issues related to schema change.



[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403809#comment-16403809
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1170
  
The waters here run deep. Please see a detailed comment in 
[DRILL-6223](https://issues.apache.org/jira/browse/DRILL-6223).




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-17 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403808#comment-16403808
 ] 

Paul Rogers commented on DRILL-6223:


[~sachouche], schema changes in Drill are complex and we've all had the fun of 
thinking we understand them, only to find out that there are implicit rules in 
Drill that complicate things.

There are two cases to consider here.

First, a {{SELECT a, b, c}} (explicit) projection list. In this case, the 
*reader* is supposed to project only the requested columns (and fill in nulls 
for missing columns.) This is pretty simple for top-level columns, but gets very 
tricky for nested columns.

It is important to be aware (for any future readers) that Drill does allow us 
to specify nested columns: {{SELECT m.a, m.b}}. But this does not do what you 
might think; it does not return a map {{m}} with only columns {{a}} and {{b}}. 
Instead, it (possibly helpfully) projects the columns to the top level as 
{{EXPR$1}} and {{EXPR$2}}. There is no effective way to project a map as a map 
and choose which of its columns to include. It is all or nothing.

In the explicit case, there is nothing for the Project operator to do, the 
reader will have done all the needed work to produce a schema that matches the 
project list.

Here, we must be aware of the fatal flaw in the "schema-free" idea as Drill 
implements it. We mentioned missing columns. Drill readers assign the type of 
{{Nullable INT}} to those. Suppose the query is {{SELECT a, b FROM `someday`}} 
in which file {{p.parquet}} has both columns (as {{VARCHAR}}) but file 
{{q.parquet}} has only {{a}}. Then, the reader for {{q}} will add a column 
{{b}} of type {{Nullable INT}}. This will cause a schema change exception 
downstream.
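The conflict can be sketched as follows, assuming types are modeled as plain strings and each file gets its own reader. `reader_schema` and `schema_changes` are hypothetical helpers for illustration, not Drill classes.

```python
# Hypothetical per-file schemas for the example; type names are plain strings.
FILES = {
    "p.parquet": {"a": "VARCHAR", "b": "VARCHAR"},
    "q.parquet": {"a": "VARCHAR"},
}


def reader_schema(file_schema, projected):
    """A projected column absent from the file gets the default Nullable INT."""
    return {col: file_schema.get(col, "NULLABLE INT") for col in projected}


def schema_changes(schemas):
    """Return columns whose type differs from the first batch's schema."""
    first, *rest = schemas
    return sorted({c for s in rest for c in s if s[c] != first.get(c)})


batches = [reader_schema(FILES[f], ["a", "b"]) for f in ("p.parquet", "q.parquet")]
conflicts = schema_changes(batches)
```

Each reader is locally correct, yet the two output schemas disagree on {{b}}, which is the schema change that a downstream operator then has to reject or reconcile.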

This is a known, fundamental problem. Would be great if {{q}} could say, "Hey, 
I don't have column {{b}}. So, I'll give you a vector of unspecified type, you 
supply the type later." But, today, Drill does not work that way.

OK, that's the first case. Now onto the second: wildcard: {{SELECT *}}. In this 
case, the reader must return all columns from each file. Back to our two files: 
{{p}} returns {{(a, b)}}, {{q}} returns just {{(a)}}. What does the user expect?

Your change says the user expects only {{(a)}} (the change identifies {{b}} as 
dangling.) But, is that correct?

Historically Drill says that we expect schema evolution. In that environment, 
we want both columns, but we want the "dangling" columns to be filled with 
nulls where they are missing. That is, the proper result is {{(a, b)}}, with 
the {{b}} columns filled with nulls for records that came from {{q}} (the file 
without {{b}}).

The trick, of course, is that {{a}} may have a non-nullable type. (This is the 
issue that [~vitalii] was working on with DRILL-5970 a while back.)

Overall, I don't think we want to remove columns that are "dangling"; the user 
could never query files that have evolved if we do. Instead, we need to find a 
way to include the "dangling" columns, filling in missing values.

Now, why can I write so much about this? Much of the reader-side logic for this 
is implemented in the "batch size" set of changes that is slowly merging into 
master. The question here is touched upon in the 
[write-up|https://github.com/paul-rogers/drill/wiki/BH-Projection-Framework]. 
(This work did not attempt to address downstream behavior; that is an open 
issue. But, it does try to do "schema smoothing" so that, if we read files {{p}} 
and {{q}} in the same reader, the reader will fill in the missing column {{b}} 
for file {{q}} using the type from file {{p}}. Obviously, this trick only works 
if the files are read in the same reader, in the order {{p}}, {{q}}.)
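A toy illustration of why smoothing depends on read order, assuming a reader that simply remembers the column types it has already seen. The `SmoothingReader` class is a hypothetical model and does not reflect the actual projection framework code.

```python
class SmoothingReader:
    """Hypothetical reader that remembers column types across files."""

    def __init__(self):
        self.known_types = {}

    def read(self, file_schema, projected):
        out = {}
        for col in projected:
            if col in file_schema:
                out[col] = file_schema[col]
            else:
                # Reuse a type seen in an earlier file, else the default.
                out[col] = self.known_types.get(col, "NULLABLE INT")
        # Only types that came from real files are remembered.
        self.known_types.update(file_schema)
        return out


P = {"a": "INT", "b": "NULLABLE VARCHAR"}
Q = {"a": "INT"}
COLS = ["a", "b"]

r1 = SmoothingReader()
in_order = [r1.read(P, COLS), r1.read(Q, COLS)]      # p first, then q

r2 = SmoothingReader()
out_of_order = [r2.read(Q, COLS), r2.read(P, COLS)]  # q first, then p
```

Reading {{p}} before {{q}} yields two identical schemas. Reading {{q}} first produces the default type for {{b}} before the real type is known, so the schema still changes between batches.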

Rather than do this as a quiet bug fix, I suggest we write up a design for how 
we actually want Drill to behave. The only real way to get rational behavior is 
to define a schema up front so that all readers can coordinate, or the 
downstream operators know that they need to fill in columns and how to do so. 
That is a big task; one that has been put off for years, but the underlying 
conflict and ambiguities have not gone away.

For Parquet (only), we have a possible solution. The planner checks all the 
files (I believe; I think that is why planning can be slow). Let the planner 
figure out the union of all the file schemas, using information from, say, file 
{{p}} in our example to know what column to fill in for file {{q}}. Pass this 
information into the physical plan. Let the readers use it to "do the right 
thing." The work mentioned earlier anticipates this solution: it provides a 
schema hint mechanism to fill in missing columns. Right now, it only looks at 
the very faint hints obtained from the project list, but is intended to use 
full schema info.



[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403753#comment-16403753
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1170
  
@parthchandra can you also please review this PR? 
Thanks!





[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-16 Thread salim achouche (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402255#comment-16402255
 ] 

salim achouche commented on DRILL-6223:
---

Parth, this PR is not Parquet specific as it deals with downstream operators 
having issues handling schema changes. Most of the time, the end result would 
be downstream operators trying to access stale data.

 



[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402228#comment-16402228
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1170
  
I added a comment in the JIRA - 
[DRILL-6223](https://issues.apache.org/jira/browse/DRILL-6223?focusedCommentId=16402223=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16402223)





[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-16 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402223#comment-16402223
 ] 

Parth Chandra commented on DRILL-6223:
--

Schema change for Parquet files is not supported by the Parquet metadata cache. 
The Parquet metadata cache overwrites the schema if it changes (does not merge) 
and so the last one encountered is the schema selected. New columns added are 
OK, I think, but type changes are not.
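A rough sketch of the difference between overwriting and merging, assuming the cache can be modeled as a dict of column name to type string and that overwriting happens per column. Both functions are hypothetical and are not taken from {{Metadata.java}}.

```python
def overwrite_schema(file_schemas):
    """Last file wins for every column it contains (assumed per-column overwrite)."""
    cached = {}
    for schema in file_schemas:
        cached.update(schema)
    return cached


def merge_schema(file_schemas):
    """Alternative: keep every type seen per column so conflicts stay visible."""
    merged = {}
    for schema in file_schemas:
        for col, col_type in schema.items():
            merged.setdefault(col, set()).add(col_type)
    return merged


files = [
    {"a": "INT", "b": "INT"},
    {"a": "INT", "b": "VARCHAR", "c": "DOUBLE"},  # b changes type, c is new
]

cached = overwrite_schema(files)
merged = merge_schema(files)
```

In the overwrite model the new column {{c}} is picked up harmlessly, but the earlier {{INT}} type of {{b}} is silently lost; the merge model keeps both types and so exposes the conflict.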

See [1].

I haven't looked at the PR, but you might want to test this out with the 
metadata cache enabled.

[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L420



[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401197#comment-16401197
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1170
  
@amansinha100 can you please review this pull request?

Thanks!




[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401193#comment-16401193
 ] 

ASF GitHub Bot commented on DRILL-6223:
---

GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1170

DRILL-6223: Fixed several Drillbit failures due to schema changes

Fixed several Issues due to Schema changes:
1) Changes in complex data types
Drill query fails when selecting all columns from a complex nested data 
file set (Parquet). There are differences in schema among the files:

The Parquet files exhibit differences both at the first level and within 
nested data types
A select * will not cause an exception but using a limit clause will
Note also this issue seems to happen only when multiple Drillbit minor 
fragments are involved (concurrency higher than one)

2) Dangling columns (both simple and complex)
This situation can be easily reproduced for:
- Select STAR queries which involve input data with different schemas
- LIMIT or / and PROJECT operators are used
- The data will be read from more than one minor fragment
- This is because individual readers have logic to handle such use-cases 
but downstream operators do not
- So if reader-1 sends one batch with F1, F2, and F3
- And reader-2 sends a batch with F2 and F3
- Then the LIMIT and PROJECT operators will fail to clean up the dangling 
column F1, which will cause failures when the downstream operators' copy logic 
attempts to copy the stale column F1
- This pull request adds logic to detect and eliminate dangling columns
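The scenario in the list above can be sketched roughly as follows, assuming an operator keeps its per-column state in a dict. The helper names are hypothetical and this is not the code in the pull request; it only illustrates what "dangling" means for batches (F1, F2, F3) followed by (F2, F3).

```python
def dangling_columns(previous_batch_cols, current_batch_cols):
    """Columns held from the previous batch that the current batch lacks."""
    current = set(current_batch_cols)
    return [c for c in previous_batch_cols if c not in current]


def drop_dangling(operator_state, current_batch_cols):
    """Remove stale column state so later copy logic never touches it."""
    for col in dangling_columns(list(operator_state), current_batch_cols):
        del operator_state[col]
    return operator_state


# State left over after the batch from reader-1 (F1, F2, F3).
state = {"F1": [1], "F2": ["x"], "F3": [2.0]}

# The batch from reader-2 carries only F2 and F3.
stale = dangling_columns(list(state), ["F2", "F3"])
cleaned = drop_dangling(dict(state), ["F2", "F3"])
```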

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill DRILL-6223

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1170.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1170


commit d986b6c7588c107bb7e49d2fc8eb3f25a60e1214
Author: Salim Achouche 
Date:   2018-02-21T02:17:14Z

DRILL-6223: Fixed several Drillbit failures due to schema changes






[jira] [Commented] (DRILL-6223) Drill fails on Schema changes

2018-03-13 Thread Pritesh Maker (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398087#comment-16398087
 ] 

Pritesh Maker commented on DRILL-6223:
--

[~sachouche] can you attach the PR as well? 
