subject:"\[jira\] \[Updated\] \(SPARK\-17891\) SQL\-based three column join loses first column"

[jira] [Updated] (SPARK-17891) SQL-based three column join loses first column

2016-10-18 Thread Eli Miller (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Miller updated SPARK-17891:
---
Issue Type: Bug  (was: Question)

> SQL-based three column join loses first column
> --
>
> Key: SPARK-17891
> URL: https://issues.apache.org/jira/browse/SPARK-17891
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Eli Miller
> Attachments: test.tgz
>
>
> Hi all,
> I hope that this is not a known issue (I haven't had any luck finding 
> anything similar in Jira or the mailing lists but I could be searching with 
> the wrong terms). I just started to experiment with Spark SQL and am seeing 
> what appears to be a bug. When using Spark SQL to join two tables with a 
> three column inner join, the first column join is ignored. The example code 
> that I have starts with two tables *T1*:
> {noformat}
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  1|  2|  3|  4|
> +---+---+---+---+
> {noformat}
> and *T2*:
> {noformat}
> +---+---+---+---+
> |  b|  c|  d|  e|
> +---+---+---+---+
> |  2|  3|  4|  5|
> | -2|  3|  4|  6|
> |  2| -3|  4|  7|
> +---+---+---+---+
> {noformat}
> Joining *T1* to *T2* on *b*, *c* and *d* (in that order):
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
> {code}
> results in the following (note that *T1.b* != *T2.b* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2| -2|  3|  3|  4|  4|  6|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Switching the predicate order to *c*, *b* and *d*:
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d
> {code}
> yields different results (now *T1.c* != *T2.c* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2|  2|  3| -3|  4|  4|  7|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Is this expected?
> I started to research this a bit and one thing that jumped out at me was the 
> ordering of the HashedRelationBroadcastMode concatenation in the plan (this 
> is from the *b*, *c*, *d* ordering):
> {noformat}
> ...
> *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12]
> +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight
>:- *Project [a#0, b#1, c#2, d#3]
>:  +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3))
>: +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: 
> struct
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), 
> 32) | (cast(input[2, int, true] as bigint) & 4294967295
>   +- *Project [b#9, c#10, d#11, e#12]
>  +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11))
> +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: 
> struct]
> {noformat}
> If this concatenated byte array is ever truncated to 64 bits in a comparison, 
> the leading column will be lost and could result in this behavior.
> I will attach my example code and data. Please let me know if I can provide 
> any other details.
> Best regards,
> Eli



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17891) SQL-based three column join loses first column

2016-10-12 Thread Eli Miller (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Miller updated SPARK-17891:
---
Attachment: test.tgz

A simple maven project that contains TripleJoin.java and sample data files 
which illustrate the problem. Requires Java 1.8.

> SQL-based three column join loses first column
> --
>
> Key: SPARK-17891
> URL: https://issues.apache.org/jira/browse/SPARK-17891
> Project: Spark
>  Issue Type: Question
>Affects Versions: 2.0.1
>Reporter: Eli Miller
> Attachments: test.tgz
>
>
> Hi all,
> I hope that this is not a known issue (I haven't had any luck finding 
> anything similar in Jira or the mailing lists but I could be searching with 
> the wrong terms). I just started to experiment with Spark SQL and am seeing 
> what appears to be a bug. When using Spark SQL to join two tables with a 
> three column inner join, the first column join is ignored. The example code 
> that I have starts with two tables *T1*:
> {noformat}
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  1|  2|  3|  4|
> +---+---+---+---+
> {noformat}
> and *T2*:
> {noformat}
> +---+---+---+---+
> |  b|  c|  d|  e|
> +---+---+---+---+
> |  2|  3|  4|  5|
> | -2|  3|  4|  6|
> |  2| -3|  4|  7|
> +---+---+---+---+
> {noformat}
> Joining *T1* to *T2* on *b*, *c* and *d* (in that order):
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
> {code}
> results in the following (note that *T1.b* != *T2.b* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2| -2|  3|  3|  4|  4|  6|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Switching the predicate order to *c*, *b* and *d*:
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d
> {code}
> yields different results (now *T1.c* != *T2.c* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2|  2|  3| -3|  4|  4|  7|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Is this expected?
> I started to research this a bit and one thing that jumped out at me was the 
> ordering of the HashedRelationBroadcastMode concatenation in the plan (this 
> is from the *b*, *c*, *d* ordering):
> {noformat}
> ...
> *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12]
> +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight
>:- *Project [a#0, b#1, c#2, d#3]
>:  +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3))
>: +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: 
> struct
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), 
> 32) | (cast(input[2, int, true] as bigint) & 4294967295
>   +- *Project [b#9, c#10, d#11, e#12]
>  +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11))
> +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: 
> struct]
> {noformat}
> If this concatenated byte array is ever truncated to 64 bits in a comparison, 
> the leading column will be lost and could result in this behavior.
> I will attach my example code and data. Please let me know if I can provide 
> any other details.
> Best regards,
> Eli



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17891) SQL-based three column join loses first column

[jira] [Updated] (SPARK-17891) SQL-based three column join loses first column

2 matches

Site Navigation

Mail list logo

Footer information