[jira] [Updated] (SPARK-24669) Managed table was not cleared of path after drop database cascade
[ https://issues.apache.org/jira/browse/SPARK-24669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Jiang updated SPARK-24669:
-------------------------------
Description:

I can do the following in sequence:
# Create a managed table using path options
# Drop the table via dropping the parent database cascade
# Re-create the database and table with a different path
# The new table shows data from the old path, not the new path

{code}
echo "first" > /tmp/first.csv
echo "second" > /tmp/second.csv
spark-shell

spark.version
res0: String = 2.3.0
spark.sql("create database foo")
spark.sql("create table foo.first (id string) using csv options (path='/tmp/first.csv')")
spark.table("foo.first").show()
+-----+
|   id|
+-----+
|first|
+-----+
spark.sql("drop database foo cascade")
spark.sql("create database foo")
spark.sql("create table foo.first (id string) using csv options (path='/tmp/second.csv')")
// note: the path is different now, pointing to second.csv, but it still shows data from the first file
spark.table("foo.first").show()
+-----+
|   id|
+-----+
|first|
+-----+
// now, if I drop the table explicitly, instead of via drop database cascade, the result is correct
spark.sql("drop table foo.first")
spark.sql("create table foo.first (id string) using csv options (path='/tmp/second.csv')")
spark.table("foo.first").show()
+------+
|    id|
+------+
|second|
+------+
{code}

The same sequence fails in 2.3.1 as well.

> Managed table was not cleared of path after drop database cascade
> -----------------------------------------------------------------
>
>                 Key: SPARK-24669
>                 URL: https://issues.apache.org/jira/browse/SPARK-24669
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Dong Jiang
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
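The symptom above is consistent with a stale entry in the session's cached table metadata surviving DROP DATABASE ... CASCADE, while DROP TABLE invalidates it. The mechanism can be sketched with a toy catalog; this is an illustrative model only, not Spark's actual code, and all names in it are hypothetical:

```python
# Toy model of a catalog with a per-table relation cache. drop_table
# invalidates the cache entry, but the "cascade" path deletes only the
# catalog entries and leaves the cache stale -- mirroring the symptom
# reported above. Illustrative only; not Spark's implementation.

class Catalog:
    def __init__(self):
        self.tables = {}          # qualified name -> path
        self.relation_cache = {}  # qualified name -> resolved path

    def create_table(self, name, path):
        self.tables[name] = path

    def read(self, name):
        # resolve through the cache, as a query planner would
        if name not in self.relation_cache:
            self.relation_cache[name] = self.tables[name]
        return self.relation_cache[name]

    def drop_table(self, name):
        del self.tables[name]
        self.relation_cache.pop(name, None)  # cache invalidated here

    def drop_database_cascade_buggy(self, db):
        # drops the tables but never touches the relation cache
        for name in [t for t in self.tables if t.startswith(db + ".")]:
            del self.tables[name]

cat = Catalog()
cat.create_table("foo.first", "/tmp/first.csv")
cat.read("foo.first")                   # caches /tmp/first.csv
cat.drop_database_cascade_buggy("foo")
cat.create_table("foo.first", "/tmp/second.csv")
stale = cat.read("foo.first")           # still resolves to /tmp/first.csv
```

Dropping the table explicitly before re-creating it clears the cached entry, which matches the workaround shown in the reproduction.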
[jira] [Updated] (SPARK-24669) Managed table was not cleared of path after drop database cascade
[ https://issues.apache.org/jira/browse/SPARK-24669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Jiang updated SPARK-24669:
-------------------------------
    Affects Version/s: 2.3.1

> Managed table was not cleared of path after drop database cascade
[jira] [Updated] (SPARK-24669) Managed table was not cleared of path after drop database cascade
[ https://issues.apache.org/jira/browse/SPARK-24669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Jiang updated SPARK-24669:
-------------------------------
Updated the description (minor wording changes to the reproduction notes).

> Managed table was not cleared of path after drop database cascade
[jira] [Updated] (SPARK-24669) Managed table was not cleared of path after drop database cascade
[ https://issues.apache.org/jira/browse/SPARK-24669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Jiang updated SPARK-24669:
-------------------------------
Updated the description: added that dropping the table explicitly, rather than via drop database cascade, yields the correct result.

> Managed table was not cleared of path after drop database cascade
[jira] [Updated] (SPARK-24669) Managed table was not cleared of path after drop database cascade
[ https://issues.apache.org/jira/browse/SPARK-24669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Jiang updated SPARK-24669:
-------------------------------
Updated the description: removed a stray spark.table("foo.second").show() line from the reproduction.

> Managed table was not cleared of path after drop database cascade
[jira] [Created] (SPARK-24669) Managed table was not cleared of path after drop database cascade
Dong Jiang created SPARK-24669:
----------------------------------

             Summary: Managed table was not cleared of path after drop database cascade
                 Key: SPARK-24669
                 URL: https://issues.apache.org/jira/browse/SPARK-24669
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Dong Jiang

I can do the following in sequence:
# Create a managed table using path options
# Drop the table via dropping the parent database cascade
# Re-create the database and table with a different path
# The new table shows data from the old path, not the new path

{code}
echo "first" > /tmp/first.csv
echo "second" > /tmp/second.csv
spark-shell

spark.version
res0: String = 2.3.0
spark.sql("create database foo")
spark.sql("create table foo.first (id string) using csv options (path='/tmp/first.csv')")
spark.table("foo.first").show()
+-----+
|   id|
+-----+
|first|
+-----+
spark.sql("drop database foo cascade")
spark.sql("create database foo")
spark.sql("create table foo.first (id string) using csv options (path='/tmp/second.csv')")
// note: the path is different now, pointing to second.csv, but still showing data from the first file
spark.table("foo.second").show()
spark.table("foo.first").show()
+-----+
|   id|
+-----+
|first|
+-----+
{code}
[jira] [Created] (SPARK-23866) Extend ALTER TABLE DROP PARTITION syntax to use all comparators
Dong Jiang created SPARK-23866:
----------------------------------

             Summary: Extend ALTER TABLE DROP PARTITION syntax to use all comparators
                 Key: SPARK-23866
                 URL: https://issues.apache.org/jira/browse/SPARK-23866
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Dong Jiang

Please add SQL support for dropping multiple partitions with operators other than =, essentially the equivalent of https://issues.apache.org/jira/browse/HIVE-2908:

"To drop a partition from a Hive table, this works:
ALTER TABLE foo DROP PARTITION(ds = 'date')
...but it should also work to drop all partitions prior to date.
ALTER TABLE foo DROP PARTITION(ds < 'date')
This task is to implement ALTER TABLE DROP PARTITION for all of the comparators, < > <= >= <> = !=, instead of just for =."
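The requested semantics — selecting partitions to drop with an arbitrary comparator rather than only equality — can be sketched over a list of partition specs. This is an illustrative helper, not Spark's implementation; all names are hypothetical:

```python
import operator

# Map the SQL comparators named in the ticket to Python operators;
# '<>' and '!=' are synonyms for inequality.
COMPARATORS = {
    "=": operator.eq, "!=": operator.ne, "<>": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}

def partitions_to_drop(partitions, column, comparator, value):
    """Return partition specs matching e.g. DROP PARTITION (ds < '2018-01-01')."""
    cmp = COMPARATORS[comparator]
    return [p for p in partitions if cmp(p[column], value)]

parts = [{"ds": "2017-12-30"}, {"ds": "2017-12-31"}, {"ds": "2018-01-01"}]
dropped = partitions_to_drop(parts, "ds", "<", "2018-01-01")
# selects the two 2017 partitions; with "=" only the exact match is selected
```

String comparison works here because ISO-formatted dates sort chronologically; a real implementation would compare typed partition values.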
[jira] [Commented] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date
[ https://issues.apache.org/jira/browse/SPARK-23549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391351#comment-16391351 ]

Dong Jiang commented on SPARK-23549:
------------------------------------
[~kiszk], I expect your query to return false, as presto/Athena does. A date in SQL is typically thought of as equivalent to a timestamp at 00:00:00.

> Spark SQL unexpected behavior when comparing timestamp to date
> --------------------------------------------------------------
>
>                 Key: SPARK-23549
>                 URL: https://issues.apache.org/jira/browse/SPARK-23549
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Dong Jiang
>            Priority: Major
>
> {code:java}
> scala> spark.version
> res1: String = 2.2.1
>
> scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as date) and cast('2017-03-01' as date)").show
> |((CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) >= CAST(CAST(2017-02-28 AS DATE) AS STRING)) AND (CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) <= CAST(CAST(2017-03-01 AS DATE) AS STRING)))|
> |false|
> {code}
> As shown above, when a timestamp is compared to a date in Spark SQL, both are downcast to string, leading to an unexpected result. Running the same SQL in presto/Athena gives the expected result:
> {code:java}
> select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as date) and cast('2017-03-01' as date)
> _col0
> 1 true
> {code}
> Is this a bug for Spark or a feature?
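The surprising false comes from comparing the operands as strings: '2017-03-01 00:00:00' sorts after '2017-03-01' lexicographically, so the upper bound of the BETWEEN fails, whereas a temporal comparison treats the date as midnight. A quick check in plain Python, illustrating the semantics rather than Spark internals:

```python
from datetime import datetime

ts = "2017-03-01 00:00:00"
lo, hi = "2017-02-28", "2017-03-01"

# String comparison (what the cast-to-string plan above effectively does):
# the longer timestamp string sorts after the bare date, so BETWEEN is false.
as_strings = lo <= ts <= hi            # False

# Temporal comparison (what presto/Athena do): a date acts as midnight,
# so the timestamp equals the upper bound exactly.
t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
as_timestamps = (datetime.strptime(lo, "%Y-%m-%d") <= t
                 <= datetime.strptime(hi, "%Y-%m-%d"))   # True
```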
[jira] [Commented] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date
[ https://issues.apache.org/jira/browse/SPARK-23549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384226#comment-16384226 ]

Dong Jiang commented on SPARK-23549:
------------------------------------
Tested in Spark 2.3.0, same thing.

> Spark SQL unexpected behavior when comparing timestamp to date
[jira] [Created] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date
Dong Jiang created SPARK-23549:
----------------------------------

             Summary: Spark SQL unexpected behavior when comparing timestamp to date
                 Key: SPARK-23549
                 URL: https://issues.apache.org/jira/browse/SPARK-23549
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Dong Jiang

(The original description matches the text quoted in the comment messages above.)
[jira] [Updated] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date
[ https://issues.apache.org/jira/browse/SPARK-23549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Jiang updated SPARK-23549:
-------------------------------
Updated the description: added the closing question "Is this a bug for Spark or a feature?"

> Spark SQL unexpected behavior when comparing timestamp to date
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305691#comment-16305691 ]

Dong Jiang commented on SPARK-13127:
------------------------------------
[~gaurav24], it looks like you are, like me, waiting for this ticket to be worked on. If you would like, comment on this thread on the developer list to advocate for resolving this issue in the Spark 2.3 release:
http://apache-spark-developers-list.1001551.n3.nabble.com/Timeline-for-Spark-2-3-td22793.html

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
> ----------------------------------------------
>
>                 Key: SPARK-13127
>                 URL: https://issues.apache.org/jira/browse/SPARK-13127
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Justin Pihony
>
> Currently, when you write a sorted DataFrame to Parquet, reading the data back out is not sorted by default. [This is due to a bug in Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in 1.9.
> There is a workaround: read the file back in using a file glob (filepath/*).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295261#comment-16295261 ]

Dong Jiang commented on SPARK-17647:
------------------------------------
Are we sure this issue is resolved? I tested the following on spark-shell 2.2.0:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select '' like '%\\%'").show
+----------+
|\ LIKE %\%|
+----------+
|     false|
+----------+
{code}
Same in spark-sql:
{code}
spark-sql> select '' like '%\\%';
false
Time taken: 2.296 seconds, Fetched 1 row(s)
{code}

> SQL LIKE does not handle backslashes correctly
> ----------------------------------------------
>
>                 Key: SPARK-17647
>                 URL: https://issues.apache.org/jira/browse/SPARK-17647
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>              Labels: correctness
>             Fix For: 2.1.1, 2.2.0
>
> Try the following in SQL shell:
> {code}
> select '' like '%\\%';
> {code}
> It returned false, which is wrong.
> cc: [~yhuai] [~joshrosen]
> A false-negative considered previously:
> {code}
> select '' rlike '.*.*';
> {code}
> It returned true, which is correct if we assume that the pattern is treated as a Java string but not a raw string.
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250512#comment-16250512 ]

Dong Jiang commented on SPARK-13127:
------------------------------------
[~igozali], I think you are referring to this parquet ticket: https://issues.apache.org/jira/browse/PARQUET-686
The parquet ticket indicates the fix is in 1.9.0, so we still need Spark to upgrade parquet to 1.9.0. I have examined a parquet file generated by Spark 2.2; the string column doesn't have the min/max generated in the footer. I believe it is disabled.

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
[jira] [Comment Edited] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250512#comment-16250512 ]

Dong Jiang edited comment on SPARK-13127 at 11/13/17 11:56 PM:
---------------------------------------------------------------

[~igozali], I think you are referring to this Parquet ticket: https://issues.apache.org/jira/browse/PARQUET-686
The Parquet ticket indicates the fix is in 1.9.0, so we still need Spark to upgrade Parquet to 1.9.0.
I have examined a Parquet file generated by Spark 2.2; the string column doesn't have min/max statistics generated in the footer. I believe they are disabled.
Do we have any progress on this issue? Will it be included in Spark 2.3?

was (Author: djiangxu):
[~igozali], I think you are referring to this parquet ticket: https://issues.apache.org/jira/browse/PARQUET-686
The parquet ticket indicated the fix is in 1.9.0, so we still need Spark to upgrade parquet to 1.9.0
I have examined the parquet file generated by Spark 2.2, the string column doesn't have the min/max generated in the footer. I believe it is disabled.

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
> ----------------------------------------------
>
>                 Key: SPARK-13127
>                 URL: https://issues.apache.org/jira/browse/SPARK-13127
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Justin Pihony
>
> Currently, when you write a sorted DataFrame to Parquet, reading the data back out is not sorted by default. [This is due to a bug in Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in 1.9.
> There is a workaround: read the file back in using a file glob (filepath/*).
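For context on the min/max remark above: readers use the per-row-group statistics stored in the Parquet footer to skip row groups during predicate pushdown, and PARQUET-686 concerns binary/string min/max being computed unreliably, which led writers to drop them. A toy sketch of stats-based row-group skipping in Python — illustrative data structures only, not the actual Parquet format or any Parquet API:

```python
# Toy model: each "row group" carries min/max footer statistics.
row_groups = [
    {"min": "apple", "max": "mango", "rows": ["apple", "kiwi", "mango"]},
    {"min": "nectarine", "max": "plum", "rows": ["nectarine", "peach", "plum"]},
]

def scan_eq(groups, needle):
    """Return matching rows, skipping any group whose stats rule it out."""
    hits = []
    for g in groups:
        if g["min"] is not None and not (g["min"] <= needle <= g["max"]):
            continue  # stats prove the value cannot be in this group
        hits.extend(r for r in g["rows"] if r == needle)
    return hits

print(scan_eq(row_groups, "peach"))  # ['peach'] -- the first group is skipped

# With string stats disabled (min/max absent), nothing can be skipped:
no_stats = [{**g, "min": None, "max": None} for g in row_groups]
print(scan_eq(no_stats, "peach"))    # ['peach'], but every group had to be read
```

This is why disabled string statistics matter even when results are correct: every row group must be decoded, so scans lose the pruning that sorted data would otherwise enable.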
[jira] [Updated] (SPARK-16806) from_unixtime function gives wrong answer
[ https://issues.apache.org/jira/browse/SPARK-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Jiang updated SPARK-16806:
-------------------------------
    Description: 
The following is from 2.0: for the same epoch, the function with a format argument generates a different result for the year.

spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd HH:mm:ss');
1969-12-31 19:01:40	1970-12-31 19:01:40

  was:
The following is from 2.0, for the same epoch, the function with format argument generates a different result.

spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd HH:mm:ss');
1969-12-31 19:01:40	1970-12-31 19:01:40

> from_unixtime function gives wrong answer
> -----------------------------------------
>
>                 Key: SPARK-16806
>                 URL: https://issues.apache.org/jira/browse/SPARK-16806
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Dong Jiang
>
> The following is from 2.0: for the same epoch, the function with a format argument generates a different result for the year.
> spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd HH:mm:ss');
> 1969-12-31 19:01:40	1970-12-31 19:01:40
[jira] [Created] (SPARK-16806) from_unixtime function gives wrong answer
Dong Jiang created SPARK-16806:
----------------------------------

             Summary: from_unixtime function gives wrong answer
                 Key: SPARK-16806
                 URL: https://issues.apache.org/jira/browse/SPARK-16806
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.0.0
            Reporter: Dong Jiang

The following is from 2.0: for the same epoch, the function with a format argument generates a different result.

spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd HH:mm:ss');
1969-12-31 19:01:40	1970-12-31 19:01:40
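The year mismatch above is characteristic of a week-based-year pattern (uppercase 'YYYY' in Java's SimpleDateFormat, which Spark 2.x uses for formatting) as opposed to the calendar-year pattern 'yyyy': 1969-12-31 falls in week 1 of week-year 1970, so only the year digits differ. A small Python illustration using the ISO calendar; the UTC-5 session time zone is an assumption inferred from the reported output:

```python
from datetime import datetime, timezone, timedelta

# Epoch second 100 rendered at UTC-5 (assumed session zone), matching
# the first column of the spark-sql output.
eastern = timezone(timedelta(hours=-5))
dt = datetime.fromtimestamp(100, tz=eastern)

print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 1969-12-31 19:01:40 (calendar year)

# The ISO week containing 1969-12-31 is week 1 of 1970, so a
# week-based-year pattern reports 1970 for the very same instant.
print(dt.isocalendar()[0])               # 1970
```

In other words, both values are internally consistent: the unformatted result uses the calendar year, while a 'YYYY' pattern asks for the week year, which is off by one near year boundaries.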