[jira] [Commented] (DRILL-5822) The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1 doesn't preserve column order

2018-01-12 Thread Alex (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324095#comment-16324095
 ] 

Alex commented on DRILL-5822:
-

The fix was re-tested on v1.13.0-SNAPSHOT and 1.12.0; status: PASSED.
- Set planner.slice_target=1
- Ran SELECT queries (with ORDER BY and ORDER BY DESC clauses) against different 
data sources (Parquet/JSON/CSV files, database table) 

> The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1 
> doesn't preserve column order
> ---
>
>                 Key: DRILL-5822
>                 URL: https://issues.apache.org/jira/browse/DRILL-5822
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.11.0
>            Reporter: Prasad Nagaraj Subramanya
>            Assignee: Vitalii Diravka
>              Labels: ready-to-commit
>             Fix For: 1.12.0
>
>
> Column ordering is not preserved for a star query with sorting when the query 
> is planned into multiple fragments.
> Repro steps:
> 1) {code}alter session set `planner.slice_target`=1;{code}
> 2) ORDER BY clause in the query.
> Scenarios:
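> {code}
> 0: jdbc:drill:zk=local> alter session reset `planner.slice_target`;
> +-------+--------------------------------+
> |  ok   |            summary             |
> +-------+--------------------------------+
> | true  | planner.slice_target updated.  |
> +-------+--------------------------------+
> 1 row selected (0.082 seconds)
> 0: jdbc:drill:zk=local> select * from cp.`tpch/nation.parquet` order by n_name limit 1;
> +--------------+----------+--------------+------------------------------------------------------+
> | n_nationkey  |  n_name  | n_regionkey  |                      n_comment                       |
> +--------------+----------+--------------+------------------------------------------------------+
> | 0            | ALGERIA  | 0            |  haggle. carefully final deposits detect slyly agai  |
> +--------------+----------+--------------+------------------------------------------------------+
> 1 row selected (0.141 seconds)
> 0: jdbc:drill:zk=local> alter session set `planner.slice_target`=1;
> +-------+--------------------------------+
> |  ok   |            summary             |
> +-------+--------------------------------+
> | true  | planner.slice_target updated.  |
> +-------+--------------------------------+
> 1 row selected (0.091 seconds)
> 0: jdbc:drill:zk=local> select * from cp.`tpch/nation.parquet` order by n_name limit 1;
> +------------------------------------------------------+----------+--------------+--------------+
> |                      n_comment                       |  n_name  | n_nationkey  | n_regionkey  |
> +------------------------------------------------------+----------+--------------+--------------+
> |  haggle. carefully final deposits detect slyly agai  | ALGERIA  | 0            | 0            |
> +------------------------------------------------------+----------+--------------+--------------+
> 1 row selected (0.201 seconds)
> {code}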



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5822) The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1 doesn't preserve column order

2017-11-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249468#comment-16249468
 ] 

ASF GitHub Bot commented on DRILL-5822:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1017




[jira] [Commented] (DRILL-5822) The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1 doesn't preserve column order

2017-11-01 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235100#comment-16235100
 ] 

Paul Rogers commented on DRILL-5822:


[~vitalii], thanks for the explanation. I was hoping for a bit more of a 
conceptual overview: what is our overall approach to handling column order? The 
answer referred to specific implementations. Without a design, we just 
ping-pong back and forth among various code solutions.

In Drill, column order does not matter. At least, that is what I've been told 
by the veterans. That is, (a, b) and (b, a) are the same schema. Code 
generation uses names to find columns, not column indexes as in most DB systems.

Given this, we need a policy. One policy would be that we preserve project list 
ordering created by the planner. That works, except in the case of {{SELECT 
*}}. The standard for SQL is that a {{SELECT *}} query preserves the column 
order in the table. Fair enough.

But, in a distributed system, each table may have a different column order; 
especially in files such as JSON that use key/value pairs. So, there is no 
"right" order. Then what do we do?

We can make up an order (sort columns alphabetically as the prior code did.) 
But, this will be surprising to users if their CSV file, say, has ID, Address, 
Block and we produce an output of Address, Block, ID.

Another rule might be to preserve column order where possible, but when a 
conflict occurs, choose a "first" batch and coerce others to match that. If the 
merging receiver (which sees n batches in no real order) gets batch 3 first, 
then 3 becomes the template and other batches are coerced to match.

If the Sort, say, gets batches sequentially, then the first one is the template 
and others are coerced to match.

Makes sense? Good. Now, what about RecordBatchLoader? By itself, it can't do 
the job. It needs help.

On the first batch, it can save an ordering. On the second batch, it can:

* Pick out columns that match its existing schema.
* If prior columns do not exist, it can fill in nulls (as long as the prior 
column was nullable or an array.)
* If the prior column was required, or a new column appears, a hard schema 
change must occur.

The result is that the batch loader absorbs trivial schema changes. I call this 
"schema smoothing." But, it alerts the surrounding operator to larger issues.
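The smoothing rules above can be sketched roughly as follows. This is only an illustration, not Drill's RecordBatchLoader: `SchemaSmoother`, `Mode`, `Column`, and `HardSchemaChangeException` are hypothetical names, a single value stands in for a whole vector, type mismatches are ignored, and filling a missing array column with an empty list is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of "schema smoothing"; this is not Drill's RecordBatchLoader. */
final class SchemaSmoother {

    /** Hypothetical column modes standing in for required, nullable, and array columns. */
    enum Mode { REQUIRED, NULLABLE, REPEATED }

    /** One column of a batch; a single value stands in for a whole vector. */
    static final class Column {
        final Mode mode;
        final Object value;
        Column(Mode mode, Object value) { this.mode = mode; this.value = value; }
    }

    /** Signals a schema change that smoothing cannot absorb. */
    static final class HardSchemaChangeException extends RuntimeException {
        HardSchemaChangeException(String message) { super(message); }
    }

    // Column name -> mode, in the order saved from the first batch (null until then).
    private Map<String, Mode> saved;

    /** Returns the batch's values re-ordered to the saved schema, filling gaps where allowed. */
    Map<String, Object> load(Map<String, Column> batch) {
        if (saved == null) {
            // First batch: save its ordering as the template.
            saved = new LinkedHashMap<>();
            for (Map.Entry<String, Column> e : batch.entrySet()) {
                saved.put(e.getKey(), e.getValue().mode);
            }
        }
        // A column that the saved schema has never seen is a hard change.
        for (String name : batch.keySet()) {
            if (!saved.containsKey(name)) {
                throw new HardSchemaChangeException("new column: " + name);
            }
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Mode> prior : saved.entrySet()) {
            Column column = batch.get(prior.getKey());
            if (column != null) {
                out.put(prior.getKey(), column.value);             // matching column
            } else if (prior.getValue() == Mode.NULLABLE) {
                out.put(prior.getKey(), null);                     // fill with null
            } else if (prior.getValue() == Mode.REPEATED) {
                out.put(prior.getKey(), Collections.emptyList());  // assumption: empty array
            } else {
                throw new HardSchemaChangeException(
                        "missing required column: " + prior.getKey());
            }
        }
        return out;
    }
}
```

In this sketch, an operator would call `load` once per incoming batch and treat `HardSchemaChangeException` as the signal for a larger issue.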

Now, what about the merging receiver? The algorithm might be this:

* Start with the first batch. This "primes" the batch loader.
* Visit the second batch. The batch loader "smooths" the schema as described 
above.
* Continue with the third, and so on.
* If, for any batch, a hard schema change occurs, we fail the query.

(Note that we could actually handle the schema change as described below, but 
we'd end up with a very large number of very small batches and a very large 
number of schema changes. Until Drill has a design for schema change, there is 
really no point in adding this complexity.)

If we do the above, then we don't need the check in the new code that compares 
all batches up front. Instead, we do the comparison one by one as we convert 
them from wire to in-memory format using the record batch loader.

OK, that's the merging receiver. What about the union receiver? It can just 
send a schema change event each time the schema changes. This could be noisy; 
we might get a schema change on every batch. If we have three senders, X, Y and 
Z, and each has a different schema, then we would get a stream something like 
X1, Y1, Z1, X2, Y2, Z2 and we'd have a schema change with each one. Oh well. 
But, if the batches simply differ in column order, the schema "smoothing" 
described above will kick in and we get no schema changes.
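To make the "noisy" case concrete, here is a small sketch that counts schema-change events over a stream of batches. It is an illustration only: `UnionReceiverSketch` and `countSchemaChanges` are hypothetical names, a schema is reduced to an ordered list of column names, and the `ignoreOrder` flag is a stand-in for the order-only part of smoothing.

```java
import java.util.HashSet;
import java.util.List;

/** Hypothetical sketch; a schema is modeled as an ordered list of column names. */
final class UnionReceiverSketch {

    /**
     * Counts schema-change events after the first batch.
     * With ignoreOrder set, schemas that differ only in column order are
     * treated as the same schema and the earlier ordering is kept.
     */
    static int countSchemaChanges(List<List<String>> stream, boolean ignoreOrder) {
        int changes = 0;
        List<String> current = null;
        for (List<String> schema : stream) {
            if (current == null) {
                current = schema;           // the first batch sets the schema
                continue;
            }
            boolean same = ignoreOrder
                    ? new HashSet<>(current).equals(new HashSet<>(schema))
                    : current.equals(schema);
            if (!same) {
                changes++;                  // downstream sees a schema change
                current = schema;
            }
        }
        return changes;
    }
}
```

For a stream X1, Y1, Z1, X2, Y2, Z2 in which X, Y, and Z carry the same columns in three different orders, the strict comparison reports a change on every batch after the first, while the order-insensitive comparison reports none.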

This same logic can be applied to each operator where we might have a problem.

Note that the schema smoothing algorithm described above is not just a theory. 
It is actually implemented, tested, and working in the "batch size control" 
project. Here we are just reimplementing it in multiple places due to the 
extraordinary delays in getting large code changes approved in Drill.


[jira] [Commented] (DRILL-5822) The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1 doesn't preserve column order

2017-10-31 Thread Vitalii Diravka (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227349#comment-16227349
 ] 

Vitalii Diravka commented on DRILL-5822:


[~Paul.Rogers] 
This issue is about unnecessary sorting of columns for a query that meets the 
following conditions: it uses a wildcard and an ORDER BY clause, and it is 
planned into multiple fragments ("alter session set 
`planner.slice_target`=1;").

The issue traces back to DRILL-847, which added canonicalization of the schemas 
of input batches for the Merging Receiver. That approach is now outdated: when 
RecordBatchLoader loads a new batch that has the same columns (SchemaPaths) as 
the previous batch but in a different order, it treats the new batch as having 
the same schema: 
[All fields from the last batch are kept in a HashMap 
structure|https://github.com/apache/drill/blob/fe79a633a3da8b4f6db50454fde64c30c73233bb/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchLoader.java#L90]
 and [when a new batch appears, its columns are simply removed from the old map 
by 
key|https://github.com/apache/drill/blob/fe79a633a3da8b4f6db50454fde64c30c73233bb/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchLoader.java#L102].
So the schemaChange flag stays false. After that, [the schema is 
built|https://github.com/apache/drill/blob/fe79a633a3da8b4f6db50454fde64c30c73233bb/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchLoader.java#L138].
 

The only remaining problem is that RecordBatchLoader permutes the column order 
in the above case. That is described in DRILL-5828, the Jira ticket you created, 
and can be fixed there.
So my changes fix the current issue but do not fully cover the requirements from 
your comment. Would it be reasonable to make those changes in the context of 
DRILL-5828?
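As a rough illustration of why a pure reordering does not raise the schemaChange flag, here is a simplified sketch of a name-keyed comparison. It is not the RecordBatchLoader code linked above: `SchemaChangeCheck` and `isSchemaChanged` are hypothetical names, and fields are reduced to plain column names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a name-keyed schema comparison; not Drill's actual code. */
final class SchemaChangeCheck {

    /** Returns true only if the two batches differ in their set of field names. */
    static boolean isSchemaChanged(List<String> oldFields, List<String> newFields) {
        // Index the previous batch's fields by name. The stored position is
        // never consulted below, which is why ordering cannot trigger a change.
        Map<String, Integer> old = new HashMap<>();
        for (int i = 0; i < oldFields.size(); i++) {
            old.put(oldFields.get(i), i);
        }
        boolean changed = false;
        for (String field : newFields) {
            if (old.remove(field) == null) {
                changed = true;             // field did not exist in the old batch
            }
        }
        if (!old.isEmpty()) {
            changed = true;                 // some old fields are gone
        }
        return changed;
    }
}
```

With the column lists from the repro (`n_nationkey, n_name, n_regionkey, n_comment` versus `n_comment, n_name, n_nationkey, n_regionkey`) this check reports no change, even though the order differs.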



[jira] [Commented] (DRILL-5822) The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1 doesn't preserve column order

2017-10-31 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227074#comment-16227074
 ] 

Paul Rogers commented on DRILL-5822:


The general rule for the SQL project clause is the following:

* If the list is explicit, `SELECT b, c, a` then columns are returned in that 
order, even if the table defines them in the order (a, b, c).
* If the list is implicit using a wildcard, `SELECT *`, then the column order is 
that defined by the table. In our example above, the order would be `a, b, c`.

Since Drill is distributed and schema-on-read, we run into the issue that two 
tables might have the same columns, but defined in different orders. For 
example, `{"a": 10, "b": 20, "c": 30}` and `{"c": 40, "b": 50, "a": 60}`. In 
this case, there is no "correct" order. Instead, Drill must:

1. Recognize that the above scenario can occur.
2. Define each merging operator to follow some reconciliation rule.

Here a "merging" operator is anything that can see batches from two distinct 
scans. That is, almost all operators, but at least the receivers.

A good reconciliation rule is that the first schema wins, and all other batches 
are projected into that first schema. In our example, `a, b, c` and `c, b, a` 
are both projected into `a, b, c`.
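A minimal sketch of the "first schema wins" rule, assuming rows are modeled as ordered name-to-value maps and that every batch carries the same set of columns. `FirstSchemaWins` and `project` are hypothetical names, not Drill APIs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a "first schema wins" reconciliation rule. */
final class FirstSchemaWins {

    // Column order of the first row seen; null until then.
    private List<String> first;

    /** Projects a row (column name -> value) into the first schema's column order. */
    Map<String, Object> project(Map<String, Object> row) {
        if (first == null) {
            first = new ArrayList<>(row.keySet());   // the first schema becomes the template
        }
        // This sketch only reconciles ordering; differing column sets are rejected.
        if (!new HashSet<>(first).equals(row.keySet())) {
            throw new IllegalArgumentException("column sets differ: " + row.keySet());
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (String name : first) {
            out.put(name, row.get(name));
        }
        return out;
    }
}
```

Feeding it a row ordered `a, b, c` and then a row ordered `c, b, a` yields both rows in `a, b, c` order; had the second row arrived first, `c, b, a` would have been the template instead.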

The PMC has asked that we not discuss design issues in PR reviews. So, can you 
perhaps please explain here the approach that this PR takes to solve the 
problem? Do we agree on the description above? Or, did this PR take a different 
approach?



[jira] [Commented] (DRILL-5822) The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1 doesn't preserve column order

2017-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226962#comment-16226962
 ] 

ASF GitHub Bot commented on DRILL-5822:
---

Github user vdiravka commented on the issue:

https://github.com/apache/drill/pull/1017
  
@paul-rogers Could you please review this PR? You can find a short 
description here, or a more detailed one in the Jira ticket 
[DRILL-5822](https://issues.apache.org/jira/browse/DRILL-5822).




[jira] [Commented] (DRILL-5822) The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1 doesn't preserve column order

2017-10-31 Thread Vitalii Diravka (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226679#comment-16226679
 ] 

Vitalii Diravka commented on DRILL-5822:


This is an old topic which was discussed in DRILL-1499 and DRILL-3101. 
For now there is no need to canonicalize the batch or container, since 
RecordBatchLoader swallows the "schema change" if two batches have different 
column ordering. That's why DRILL-847 is outdated.
