[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-10-28 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1775#comment-1775
 ] 

Wenchen Fan commented on SPARK-14220:
-

You will find it when Spark 2.4.0 is released. If you are eager to try it out, 
please use the staging repository, whose URL is posted in the RC voting emails.
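
For anyone who wants to try the Scala 2.12 build from a staging repository, a minimal 
sbt sketch might look like the following (the repository id below is a placeholder; 
use the exact URL posted in the RC voting email):

{code:scala}
// build.sbt -- sketch only; "orgapachespark-XXXX" is a placeholder staging repository id.
scalaVersion := "2.12.7"

resolvers += "Apache Spark staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-XXXX/"

// Pull the RC artifacts built for Scala 2.12.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
{code}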

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24064) [Spark SQL] Create table using csv does not support binary column Type

2018-10-28 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1765#comment-1765
 ] 

Hyukjin Kwon commented on SPARK-24064:
--

That's trivial. It explicitly throws an exception for now. Please file another 
JIRA that targets listing the supported types. I am resolving this.
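
Until then, one possible workaround is to encode the binary column as text before 
writing to CSV. A sketch (column and DataFrame names are illustrative):

{code:scala}
// Workaround sketch, not a fix: CSV cannot carry raw binary, so encode the binary
// column as base64 text on write and decode it back on read.
import org.apache.spark.sql.functions.{base64, col, unbase64}

val asText = binaryDf.withColumn("num_b64", base64(col("num"))).drop("num")
asText.write.option("header", "true").csv("/tmp/csv_out")

val restored = spark.read.option("header", "true").csv("/tmp/csv_out")
  .withColumn("num", unbase64(col("num_b64"))).drop("num_b64")
{code}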

> [Spark SQL] Create table  using csv does not support binary column Type
> ---
>
> Key: SPARK-24064
> URL: https://issues.apache.org/jira/browse/SPARK-24064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS Type: Suse 11
> Spark Version: 2.3.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>  Labels: test
>
> # Launch spark-sql --master yarn
> # create table csvTable (time timestamp, name string, isright boolean, 
> datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) using 
> CSV options (path "/user/datatmo/customer1.csv");
> Result: Table creation is successful
> # Select * from csvTable;
> Throws the below exception:
> ERROR SparkSQLDriver:91 - Failed in [select * from csvtable]
> java.lang.UnsupportedOperationException: *CSV data source does not support 
> binary data type*.
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.org$apache$spark$sql$execution$datasources$csv$CSVUtils$$verifyType$1(CSVUtils.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
>  
> But Normal table supports binary Data Type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24064) [Spark SQL] Create table using csv does not support binary column Type

2018-10-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24064.
--
Resolution: Won't Fix

> [Spark SQL] Create table  using csv does not support binary column Type
> ---
>
> Key: SPARK-24064
> URL: https://issues.apache.org/jira/browse/SPARK-24064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS Type: Suse 11
> Spark Version: 2.3.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>  Labels: test
>
> # Launch spark-sql --master yarn
> # create table csvTable (time timestamp, name string, isright boolean, 
> datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) using 
> CSV options (path "/user/datatmo/customer1.csv");
> Result: Table creation is successful
> # Select * from csvTable;
> Throws the below exception:
> ERROR SparkSQLDriver:91 - Failed in [select * from csvtable]
> java.lang.UnsupportedOperationException: *CSV data source does not support 
> binary data type*.
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.org$apache$spark$sql$execution$datasources$csv$CSVUtils$$verifyType$1(CSVUtils.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
>  
> But Normal table supports binary Data Type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields

2018-10-28 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1758#comment-1758
 ] 

Hyukjin Kwon commented on SPARK-25545:
--

See the discussion at https://github.com/apache/spark/pull/17293. Eventually 
they shouldn't be implicitly converted, or at the very least the behavior should 
be fixed with a coherent rationale.
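
In the meantime, a possible workaround is to drop the null rows explicitly after 
the load, since the CSV reader treats user-supplied non-nullable fields as 
nullable. A sketch (not a fix for the underlying behavior):

{code:scala}
// Workaround sketch only; it does not change how the reader handles nullability.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-test").master("local").getOrCreate()

val cleaned = spark.read
  .format("csv")
  .schema("col1 INT, col2 INT, col3 INT")        // DDL-style schema; nullability is ignored by the reader
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")
  .na.drop("any", Seq("col1", "col2", "col3"))   // drop rows where any of these columns is null
{code}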

> CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not 
> conform to non-nullable schema fields
> -
>
> Key: SPARK-25545
> URL: https://issues.apache.org/jira/browse/SPARK-25545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Steven Bakhtiari
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> I'm loading a CSV file into a dataframe using Spark. I have defined a Schema 
> and specified one of the fields as non-nullable.
> When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with 
> missing (null) values for those columns to result in the whole row being 
> dropped. At the moment, the CSV loader correctly drops rows that do not 
> conform to the field type, but the nullable property is seemingly ignored.
> Example CSV input:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3
> 1,2,abc
> {code}
> Example Spark job:
> {code:java}
> val spark = SparkSession
>   .builder()
>   .appName("csv-test")
>   .master("local")
>   .getOrCreate()
> spark.read
>   .format("csv")
>   .schema(StructType(
> StructField("col1", IntegerType, nullable = false) ::
>   StructField("col2", IntegerType, nullable = false) ::
>   StructField("col3", IntegerType, nullable = false) :: Nil))
>   .option("header", false)
>   .option("mode", "DROPMALFORMED")
>   .load("path/to/file.csv")
>   .coalesce(1)
>   .write
>   .format("csv")
>   .option("header", false)
>   .save("path/to/output")
> {code}
> The actual output will be:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3{code}
> Note that the row containing non-integer values has been dropped, as 
> expected, but rows containing null values persist, despite the nullable 
> property being set to false in the schema definition.
> My expected output is:
> {code:java}
> 1,2,3{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Set main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:
 * BuiltInDataSourceWriteBenchmark
 * AvroWriteBenchmark 

  was:
Set main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark 
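
A rough sketch of the idea (not the actual patch): save the arguments in a field on 
BenchmarkBase that subclasses can read.

{code:scala}
// Sketch only, not the actual Spark change: keep main()'s args on the base class so
// subclasses such as BuiltInDataSourceWriteBenchmark or AvroWriteBenchmark can read them.
abstract class BenchmarkBase {
  protected var mainArgs: Array[String] = Array.empty

  def runBenchmarkSuite(): Unit

  def main(args: Array[String]): Unit = {
    mainArgs = args        // saved before running, so subclasses can inspect the arguments
    runBenchmarkSuite()
  }
}
{code}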


> Make main args set correctly in BenchmarkBase
> -
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Set main args correctly in BenchmarkBase, to make it accessible for its 
> subclass.
> It will benefit:
>  * BuiltInDataSourceWriteBenchmark
>  * AvroWriteBenchmark 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Set main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark 

  was:
Save main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark 


> Make main args set correctly in BenchmarkBase
> -
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Set main args correctly in BenchmarkBase, to make it accessible for its 
> subclass.
> It will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25864) Make main args set correctly in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Summary: Make main args set correctly in BenchmarkBase  (was: Make mainArgs 
correctly set in BenchmarkBase)

> Make main args set correctly in BenchmarkBase
> -
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Save main args correctly in BenchmarkBase, to make it accessible for its 
> subclass.
> It will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Save main args correctly in BenchmarkBase, to make it accessible for its 
subclass.

It will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark 

  was:
Make mainArgs correctly set in BenchmarkBase, it will benefit:

* BuiltInDataSourceWriteBenchmark

* AvroWriteBenchmark

* Any other case that needs to access main args after inheriting from 
BenchmarkBase class

 


> Make mainArgs correctly set in BenchmarkBase
> 
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Save main args correctly in BenchmarkBase, to make it accessible for its 
> subclass.
> It will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Make mainArgs correctly set in BenchmarkBase, it will benefit:

* BuiltInDataSourceWriteBenchmark

* AvroWriteBenchmark

* Any other case that needs to access main args after inheriting from 
BenchmarkBase class

 

  was:
Make mainArgs correctly set in BenchmarkBase, it will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark

- Any other case that needs to access main args after inheriting from 
BenchmarkBase class.

 


> Make mainArgs correctly set in BenchmarkBase
> 
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Make mainArgs correctly set in BenchmarkBase, it will benefit:
> * BuiltInDataSourceWriteBenchmark
> * AvroWriteBenchmark
> * Any other case that needs to access main args after inheriting from 
> BenchmarkBase class
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Description: 
Make mainArgs correctly set in BenchmarkBase, it will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark

- Any other case that needs to access main args after inheriting from 
BenchmarkBase class.

 

  was:
Make mainArgs correctly set in BenchmarkBase, it will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark


> Make mainArgs correctly set in BenchmarkBase
> 
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Make mainArgs correctly set in BenchmarkBase, it will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark
> - Any other case that needs to access main args after inheriting from 
> BenchmarkBase class.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25864:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-25475

> Make mainArgs correctly set in BenchmarkBase
> 
>
> Key: SPARK-25864
> URL: https://issues.apache.org/jira/browse/SPARK-25864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yucai
>Priority: Major
>
> Make mainArgs correctly set in BenchmarkBase, it will benefit:
> - BuiltInDataSourceWriteBenchmark
> - AvroWriteBenchmark



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25864) Make mainArgs correctly set in BenchmarkBase

2018-10-28 Thread yucai (JIRA)
yucai created SPARK-25864:
-

 Summary: Make mainArgs correctly set in BenchmarkBase
 Key: SPARK-25864
 URL: https://issues.apache.org/jira/browse/SPARK-25864
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: yucai


Make mainArgs correctly set in BenchmarkBase, it will benefit:

- BuiltInDataSourceWriteBenchmark

- AvroWriteBenchmark



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25797:
-

Assignee: Chenxiao Mao

> Views created via 2.1 cannot be read via 2.2+
> -
>
> Key: SPARK-25797
> URL: https://issues.apache.org/jira/browse/SPARK-25797
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.0
>
>
> We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
> simple example to reproduce the issue.
> Create views via Spark 2.1
> {code:sql}
> create view v1 as
> select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
> {code}
> Query views via Spark 2.3
> {code:sql}
> select * from v1;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
> decimal(19,0) as it may truncate
> {code}
> After investigation, we found that this is because when a view is created via 
> Spark 2.1, the expanded text is saved instead of the original text. 
> Unfortunately, the expanded text below is buggy.
> {code:sql}
> spark-sql> desc extended v1;
> c1 decimal(19,0) NULL
> Detailed Table Information
> Database default
> Table v1
> Type VIEW
> View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
> DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
> DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0
> {code}
> We can see that c1 is decimal(19,0), however in the expanded text there is 
> decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 
> 2.2, decimal(20,0) in query is not allowed to cast to view definition column 
> decimal(19,0). ([https://github.com/apache/spark/pull/16561])
> I further tested other decimal calculations. Only add/subtract has this issue.
> Create views via 2.1:
> {code:sql}
> create view v1 as
> select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
> create view v2 as
> select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
> create view v3 as
> select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
> create view v4 as
> select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
> create view v5 as
> select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
> create view v6 as
> select cast(1 as decimal(18,0)) c1
> union
> select cast(1 as decimal(19,0)) c1;
> {code}
> Query views via Spark 2.3
> {code:sql}
> select * from v1;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
> decimal(19,0) as it may truncate
> select * from v2;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
> decimal(19,0) as it may truncate
> select * from v3;
> 1
> select * from v4;
> 1
> select * from v5;
> 0
> select * from v6;
> 1
> {code}
> Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does 
> not generate expanded text for view 
> (https://issues.apache.org/jira/browse/SPARK-18209).
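
A quick way to see the decimal widening that breaks the up-cast, as a spark-shell 
sketch:

{code:scala}
// Adding two DECIMAL(19,0) values widens the result type to DECIMAL(20,0), which is
// why the expanded view text above can no longer be cast back to the view column type
// in Spark 2.2+.
val df = spark.sql(
  "SELECT CAST(1 AS DECIMAL(19,0)) + CAST(1 AS DECIMAL(19,0)) AS c1")
df.printSchema()   // c1: decimal(20,0) (nullable = true)
{code}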



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25797:
--
Fix Version/s: 2.4.0

> Views created via 2.1 cannot be read via 2.2+
> -
>
> Key: SPARK-25797
> URL: https://issues.apache.org/jira/browse/SPARK-25797
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.0
>
>
> We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
> simple example to reproduce the issue.
> Create views via Spark 2.1
> {code:sql}
> create view v1 as
> select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
> {code}
> Query views via Spark 2.3
> {code:sql}
> select * from v1;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
> decimal(19,0) as it may truncate
> {code}
> After investigation, we found that this is because when a view is created via 
> Spark 2.1, the expanded text is saved instead of the original text. 
> Unfortunately, the expanded text below is buggy.
> {code:sql}
> spark-sql> desc extended v1;
> c1 decimal(19,0) NULL
> Detailed Table Information
> Database default
> Table v1
> Type VIEW
> View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
> DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
> DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0
> {code}
> We can see that c1 is decimal(19,0), however in the expanded text there is 
> decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 
> 2.2, decimal(20,0) in query is not allowed to cast to view definition column 
> decimal(19,0). ([https://github.com/apache/spark/pull/16561])
> I further tested other decimal calculations. Only add/subtract has this issue.
> Create views via 2.1:
> {code:sql}
> create view v1 as
> select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
> create view v2 as
> select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
> create view v3 as
> select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
> create view v4 as
> select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
> create view v5 as
> select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
> create view v6 as
> select cast(1 as decimal(18,0)) c1
> union
> select cast(1 as decimal(19,0)) c1;
> {code}
> Query views via Spark 2.3
> {code:sql}
> select * from v1;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
> decimal(19,0) as it may truncate
> select * from v2;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
> decimal(19,0) as it may truncate
> select * from v3;
> 1
> select * from v4;
> 1
> select * from v5;
> 0
> select * from v6;
> 1
> {code}
> Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does 
> not generate expanded text for view 
> (https://issues.apache.org/jira/browse/SPARK-18209).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25797.
---
   Resolution: Fixed
Fix Version/s: 2.3.3
   2.2.3

Issue resolved by pull request 22851
[https://github.com/apache/spark/pull/22851]

> Views created via 2.1 cannot be read via 2.2+
> -
>
> Key: SPARK-25797
> URL: https://issues.apache.org/jira/browse/SPARK-25797
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.2.3, 2.3.3
>
>
> We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
> simple example to reproduce the issue.
> Create views via Spark 2.1
> {code:sql}
> create view v1 as
> select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
> {code}
> Query views via Spark 2.3
> {code:sql}
> select * from v1;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
> decimal(19,0) as it may truncate
> {code}
> After investigation, we found that this is because when a view is created via 
> Spark 2.1, the expanded text is saved instead of the original text. 
> Unfortunately, the expanded text below is buggy.
> {code:sql}
> spark-sql> desc extended v1;
> c1 decimal(19,0) NULL
> Detailed Table Information
> Database default
> Table v1
> Type VIEW
> View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
> DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
> DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0
> {code}
> We can see that c1 is decimal(19,0), however in the expanded text there is 
> decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 
> 2.2, decimal(20,0) in query is not allowed to cast to view definition column 
> decimal(19,0). ([https://github.com/apache/spark/pull/16561])
> I further tested other decimal calculations. Only add/subtract has this issue.
> Create views via 2.1:
> {code:sql}
> create view v1 as
> select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
> create view v2 as
> select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
> create view v3 as
> select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
> create view v4 as
> select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
> create view v5 as
> select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
> create view v6 as
> select cast(1 as decimal(18,0)) c1
> union
> select cast(1 as decimal(19,0)) c1;
> {code}
> Query views via Spark 2.3
> {code:sql}
> select * from v1;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
> decimal(19,0) as it may truncate
> select * from v2;
> Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
> decimal(19,0) as it may truncate
> select * from v3;
> 1
> select * from v4;
> 1
> select * from v5;
> 0
> select * from v6;
> 1
> {code}
> Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does 
> not generate expanded text for view 
> (https://issues.apache.org/jira/browse/SPARK-18209).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala

2018-10-28 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1736#comment-1736
 ] 

Ruslan Dautkhanov commented on SPARK-25863:
---

It seems the error happens here:

[https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1475]

but this is as far as I can go... Any ideas why this happens? Thanks!
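
For what it's worth, the exception itself is easy to reproduce outside Spark 
(sketch only, not the eventual fix):

{code:scala}
// Seq.max on an empty collection throws UnsupportedOperationException("empty.max"),
// which matches the stack trace below. A guarded alternative is shown for comparison.
val codeSizes: Seq[Int] = Seq.empty

// codeSizes.max                               // java.lang.UnsupportedOperationException: empty.max
val safeMax = if (codeSizes.nonEmpty) codeSizes.max else 0
{code}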

 

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at 

[jira] [Updated] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1

2018-10-28 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-25863:
--
Affects Version/s: 2.3.1

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> 

[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala

2018-10-28 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1726#comment-1726
 ] 

Ruslan Dautkhanov commented on SPARK-25863:
---

This happens only on one of our heaviest Spark jobs.

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> 

[jira] [Created] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1

2018-10-28 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-25863:
-

 Summary: java.lang.UnsupportedOperationException: empty.max at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
 Key: SPARK-25863
 URL: https://issues.apache.org/jira/browse/SPARK-25863
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, Spark Core
Affects Versions: 2.3.2
Reporter: Ruslan Dautkhanov


Failing task : 
{noformat}
An error occurred while calling o2875.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in 
stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
21413.0 (TID 4057314, pc1udatahad117, executor 431): 
java.lang.UnsupportedOperationException: empty.max
at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
at scala.collection.AbstractTraversable.max(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
at 
org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
at 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
at 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

{noformat}
 

Driver stack trace:
{noformat}
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1609)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1597)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1596)
at 

[jira] [Resolved] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-10-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25293.
--
Resolution: Invalid

Yea, the answer looks correct.

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2, 2.1.3
>Reporter: omkar puttagunta
>Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple Test; reading pipe delimited file and writing data to csv. Commands 
> below are executed in spark-shell with master-url set
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the cmds above in spark-shell with master url set.
> {quote}In {{worker1}} -> Each part file is created 
> in\{{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under outputDirectory specified during write.
> {quote}
> *Same thing happens with coalesce(100) or without specifying 
> repartition/coalesce!!! Tried with Java also!*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like {{worker2}}? Why is the {{_temporary}} directory 
> created, and why do the {{part-xxx-xx}} files reside in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-10-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25179:


Assignee: Hyukjin Kwon  (was: Bryan Cutler)

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Hyukjin Kwon
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-10-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1683#comment-1683
 ] 

Apache Spark commented on SPARK-25179:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22871

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-10-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25179:


Assignee: Apache Spark  (was: Bryan Cutler)

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-10-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1682#comment-1682
 ] 

Apache Spark commented on SPARK-25179:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22871

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-10-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25179:


Assignee: Bryan Cutler  (was: Apache Spark)

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-10-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25179:
-
Issue Type: Sub-task  (was: Task)
Parent: SPARK-21187

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25787) [K8S] Spark can't use data locality information

2018-10-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25787.
--
Resolution: Cannot Reproduce

I can't reproduce this given the information here. I am leaving this resolved 
for now.

> [K8S] Spark can't use data locality information
> ---
>
> Key: SPARK-25787
> URL: https://issues.apache.org/jira/browse/SPARK-25787
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Maciej Bryński
>Priority: Major
>
> I started experimenting with Spark based on this presentation:
> https://www.slideshare.net/databricks/hdfs-on-kuberneteslessons-learned-with-kimoon-kim
> I'm using the excellent https://github.com/apache-spark-on-k8s/kubernetes-HDFS
> charts to deploy HDFS.
> Unfortunately, reading from HDFS gives ANY locality for every task.
> Is data locality working on a Kubernetes cluster?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25674) If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25674:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> If the records are incremented by more than 1 at a time, the number of bytes 
> might rarely ever get updated
> -
>
> Key: SPARK-25674
> URL: https://issues.apache.org/jira/browse/SPARK-25674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.3.3, 2.4.0
>
>
> If the records are incremented by more than 1 at a time, the number of bytes 
> might rarely ever get updated in `FileScanRDD.scala`, because it might skip 
> over the count that is an exact multiple of 
> UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
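
For context, a minimal Scala sketch of the failure mode described above and one way to avoid it (illustrative only; the names and interval value below are assumptions, not the actual FileScanRDD code):

{code:java}
// Sketch of the metric-update problem: updating bytes only when the record count
// is an exact multiple of the interval can be skipped when the counter advances
// by more than 1 per call. Tracking the count at the last update avoids that.
object MetricsUpdateSketch {
  val UpdateIntervalRecords = 1000L   // assumed interval, for illustration

  var recordsRead = 0L
  var recordsAtLastBytesUpdate = 0L

  def incRecords(n: Long)(updateBytesRead: () => Unit): Unit = {
    recordsRead += n
    // Fragile check: recordsRead % UpdateIntervalRecords == 0 misses the boundary
    // whenever n > 1; comparing against the last update point does not.
    if (recordsRead - recordsAtLastBytesUpdate >= UpdateIntervalRecords) {
      recordsAtLastBytesUpdate = recordsRead
      updateBytesRead()
    }
  }
}
{code}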



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25636:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> spark-submit swallows the failure reason when there is an error connecting to 
> master
> 
>
> Key: SPARK-25636
> URL: https://issues.apache.org/jira/browse/SPARK-25636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 2.4.0
>
>
> {code:xml}
> [apache-spark]$ ./bin/spark-submit --verbose --master spark://
> 
> Error: Exception thrown in awaitResult:
> Run with --help for usage help or --verbose for debug output
> {code}
> When spark-submit cannot connect to the master, no error is shown. I think it 
> should display the cause of the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25677) Configuring zstd compression in JDBC throwing IllegalArgumentException Exception

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25677:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> Configuring zstd compression in JDBC throwing IllegalArgumentException 
> Exception
> 
>
> Key: SPARK-25677
> URL: https://issues.apache.org/jira/browse/SPARK-25677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Shivu Sondur
>Priority: Major
> Fix For: 2.4.0
>
>
> To check the event log compression size with the different compression 
> techniques mentioned in the Spark documentation:
> Set the below parameters in spark-defaults.conf for JDBC and Job History
>  1. spark.eventLog.compress=true
>  2. Enable spark.io.compression.codec = 
> org.apache.spark.io.ZstdCompressionCodec
>  3. Restart the JDBC and Job History Services
>  4. Check the JDBC and Job History Logs
>  Exception thrown:
>  java.lang.IllegalArgumentException: No short name for codec 
> org.apache.spark.io.ZstdCompressionCodec.
>  at 
> org.apache.spark.io.CompressionCodec$$anonfun$getShortName$2.apply(CompressionCodec.scala:94)
>  at 
> org.apache.spark.io.CompressionCodec$$anonfun$getShortName$2.apply(CompressionCodec.scala:94)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.io.CompressionCodec$.getShortName(CompressionCodec.scala:94)
>  at org.apache.spark.SparkContext$$anonfun$9.apply(SparkContext.scala:414)
>  at org.apache.spark.SparkContext$$anonfun$9.apply(SparkContext.scala:414)
>  at scala.Option.map(Option.scala:146)
>  at org.apache.spark.SparkContext.(SparkContext.scala:414)
>  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2507)
>  at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:939)
>  at
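
The error above comes from the short-name lookup. A minimal sketch of a workaround, assuming the running Spark version registers a "zstd" short name (illustrative only, not a confirmed fix for this ticket):

{code:java}
// Configure the codec by its short name rather than the fully-qualified class name.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.compress", "true")
  .set("spark.io.compression.codec", "zstd")  // instead of org.apache.spark.io.ZstdCompressionCodec
{code}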



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25639) Add documentation on foreachBatch, and multiple watermark policy

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25639:

Fix Version/s: (was: 2.4.1)
   2.4.0

> Add documentation on foreachBatch, and multiple watermark policy
> 
>
> Key: SPARK-25639
> URL: https://issues.apache.org/jira/browse/SPARK-25639
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Things to add
> - Python foreach
> - Scala, Java and Python foreachBatch
> - Multiple watermark policy
> - The semantics of what changes are allowed to the streaming between restarts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24787:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 2.4.0
>
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files, allowing the history server to stay responsive.
> However, we have a production job that has 6 tasks per stage, and because 
> hsync is slow it starts dropping events, so the history server shows wrong 
> stats due to the dropped events.
> A viable solution is to not sync very frequently, or to make it 
> configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25805) Flaky test: DataFrameSuite.SPARK-25159 unittest failure

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25805:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> Flaky test: DataFrameSuite.SPARK-25159 unittest failure
> ---
>
> Key: SPARK-25805
> URL: https://issues.apache.org/jira/browse/SPARK-25805
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.4.0
>
>
> I've seen this test fail on internal builds:
> {noformat}
> Error Message: 0 did not equal 1
> Stacktrace:
> org.scalatest.exceptions.TestFailedException: 0 did not equal 1
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2552)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2534)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempPath(SQLTestUtils.scala:179)
>   at 
> org.apache.spark.sql.DataFrameSuite.withTempPath(DataFrameSuite.scala:46)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply$mcV$sp(DataFrameSuite.scala:2534)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.sql.DataFrameSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(DataFrameSuite.scala:46)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
>   at org.apache.spark.sql.DataFrameSuite.runTest(DataFrameSuite.scala:46)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
>   at 
> 

[jira] [Resolved] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25816.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.3

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Brian Zhang
>Assignee: Peter Toth
>Priority: Critical
> Fix For: 2.3.3, 2.4.0
>
> Attachments: final_allDatatypes_Spark.avro, source.snappy.parquet
>
>
> When there is a duplicate column name between the current Dataframe and the 
> original Dataframe it was selected from, Spark 2.3.0 and 2.3.1 do not resolve 
> the column correctly when it is used in an expression, hence causing a 
> casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)))))
> // v00's 3rd column is binary and the 16th is a map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> 2#1593[map_keys(2#1561)[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25795) Fix CSV SparkR SQL Example

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25795:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> Fix CSV SparkR SQL Example
> --
>
> Key: SPARK-25795
> URL: https://issues.apache.org/jira/browse/SPARK-25795
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, R
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.3, 2.4.0
>
> Attachments: 
> 0001-SPARK-25795-R-EXAMPLE-Fix-CSV-SparkR-SQL-Example.patch
>
>
> This issue aims to fix the following SparkR example in Spark 2.3.0 ~ 2.4.0.
> {code}
> > df <- read.df("examples/src/main/resources/people.csv", "csv")
> > namesAndAges <- select(df, "name", "age")
> ...
> Caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`name`' 
> given input columns: [_c0];;
> 'Project ['name, 'age]
> +- AnalysisBarrier
>   +- Relation[_c0#97] csv
> {code}
>  
> - 
> https://github.com/apache/spark/blob/master/examples/src/main/r/RSparkSQLExample.R
> - 
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/_site/sql-programming-guide.html#manually-specifying-options
> - 
> http://spark.apache.org/docs/2.3.2/sql-programming-guide.html#manually-specifying-options
> - 
> http://spark.apache.org/docs/2.3.1/sql-programming-guide.html#manually-specifying-options
> - 
> http://spark.apache.org/docs/2.3.0/sql-programming-guide.html#manually-specifying-options



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25803) The -n option to docker-image-tool.sh causes other options to be ignored

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25803:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> The -n option to docker-image-tool.sh causes other options to be ignored
> 
>
> Key: SPARK-25803
> URL: https://issues.apache.org/jira/browse/SPARK-25803
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: * OS X 10.14
>  * iTerm2
>  * bash3
>  * Docker 2.0.0.0-beta1-mac75 (27117)
> (NB: I don't believe the above has a bearing; I imagine this issue is present 
> also on linux and can confirm if needed.)
>Reporter: Steve Larkin
>Assignee: Steve Larkin
>Priority: Minor
> Fix For: 2.4.0
>
>
> To reproduce:-
> 1. Build spark
>  $ ./build/mvn -Pkubernetes -DskipTests clean package
> 2. Create a Dockerfile (a simple one, just for demonstration)
>  $ cat > hello-world.dockerfile <<EOF
>  > FROM hello-world
>  > EOF
> 3. Build container images with our Dockerfile
>  $ ./bin/docker-image-tool.sh -R hello-world.dockerfile -r docker.io/myrepo 
> -t myversion build
> The result is that the -R option is honoured and the hello-world image is 
> built for spark-r, as expected.
> 4. Build container images with our Dockerfile and the -n option
>  $ ./bin/docker-image-tool.sh -n -R hello-world.dockerfile -r 
> docker.io/myrepo -t myversion build
> The result is that the -R option is ignored and the default container image 
> for R is built.
> docker-image-tool.sh uses 
> [getopts|http://pubs.opengroup.org/onlinepubs/9699919799/utilities/getopts.html],
>  in which a colon, ':', signifies that an option takes an argument. Since -n 
> does not take an argument, it should not have a colon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25697) When zstd compression enabled in progress application is throwing Error in UI

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25697:

Fix Version/s: (was: 2.4.1)
   2.4.0

> When zstd compression enabled in progress application is throwing Error in UI
> -
>
> Key: SPARK-25697
> URL: https://issues.apache.org/jira/browse/SPARK-25697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Screenshot from 2018-10-10 12-45-20.png
>
>
> # In spark-defaults.conf of Job History, enable the below parameters
> spark.eventLog.compress=true
> spark.io.compression.codec = org.apache.spark.io.ZStdCompressionCodec
>  #  Restart Job History Services
>  # Submit beeline jobs
>  # Open Yarn Resource Page
>  # Check for the running application in the Yarn Resource Page; it will list 
> the application.
>  # Open Job History Page 
>  # Go and click Incomplete Application Link and click on the application
> *Actual Result:*
> UI display "*Read error or truncated source*" Error
> *Expected Result:*
> Job History should list the Jobs of the application on clicking the 
> application ID.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25816:
---

Assignee: Peter Toth

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Brian Zhang
>Assignee: Peter Toth
>Priority: Critical
> Attachments: final_allDatatypes_Spark.avro, source.snappy.parquet
>
>
> When there is a duplicate column name between the current Dataframe and the 
> original Dataframe it was selected from, Spark 2.3.0 and 2.3.1 do not resolve 
> the column correctly when it is used in an expression, hence causing a 
> casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)))))
> // v00's 3rd column is binary and the 16th is a map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> 2#1593[map_keys(2#1561)[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25271) Creating parquet table with all the column null throws exception

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25271:
--
Priority: Critical  (was: Major)

> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Shivu Sondur
>Priority: Critical
> Attachments: image-2018-09-07-09-12-34-944.png, 
> image-2018-09-07-09-29-33-370.png, image-2018-09-07-09-29-52-899.png, 
> image-2018-09-07-09-32-43-892.png, image-2018-09-07-09-33-03-095.png
>
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result:* Throws an exception (works fine with Spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
>   ... 21 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (SPARK-23907) Support regr_* functions

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23907:
--
Target Version/s:   (was: 2.4.0)

> Support regr_* functions
> 
>
> Key: SPARK-23907
> URL: https://issues.apache.org/jira/browse/SPARK-23907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-15978
> {noformat}
> Support the standard regr_* functions, regr_slope, regr_intercept, regr_r2, 
> regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. SQL reference 
> section 10.9
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23907) Support regr_* functions

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-23907.
---
Resolution: Won't Do

> Support regr_* functions
> 
>
> Key: SPARK-23907
> URL: https://issues.apache.org/jira/browse/SPARK-23907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-15978
> {noformat}
> Support the standard regr_* functions, regr_slope, regr_intercept, regr_r2, 
> regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. SQL reference 
> section 10.9
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23907) Support regr_* functions

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-23907:
---

> Support regr_* functions
> 
>
> Key: SPARK-23907
> URL: https://issues.apache.org/jira/browse/SPARK-23907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-15978
> {noformat}
> Support the standard regr_* functions, regr_slope, regr_intercept, regr_r2, 
> regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. SQL reference 
> section 10.9
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23907) Support regr_* functions

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23907:
--
Fix Version/s: (was: 2.4.0)

> Support regr_* functions
> 
>
> Key: SPARK-23907
> URL: https://issues.apache.org/jira/browse/SPARK-23907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-15978
> {noformat}
> Support the standard regr_* functions, regr_slope, regr_intercept, regr_r2, 
> regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. SQL reference 
> section 10.9
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat

2018-10-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25806.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22802
[https://github.com/apache/spark/pull/22802]

>  The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat
> --
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Trivial
> Fix For: 3.0.0
>
>
> The instance of FileSplit is redundant in the ParquetFileFormat and 
> {{hive\orc\OrcFileFormat}} classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat

2018-10-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25806:
-

Assignee: liuxian

>  The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat
> --
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Trivial
> Fix For: 3.0.0
>
>
> The instance of FileSplit is redundant in the ParquetFileFormat and 
> {{hive\orc\OrcFileFormat}} classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-10-28 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1572#comment-1572
 ] 

t oo commented on SPARK-24437:
--

can this be merged?

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png
>
>
> There seems to memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation
> We have a long running instance of STS.
> With each query execution requiring Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference of 
> UnsafeHashedRelation is being held at some other Collection and not becoming 
> eligible for GC and because of this ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache

2018-10-28 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1573#comment-1573
 ] 

t oo commented on SPARK-22575:
--

can this be merged?

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25862) Remove rangeBetween APIs introduced in SPARK-21608

2018-10-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1568#comment-1568
 ] 

Apache Spark commented on SPARK-25862:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22870

> Remove rangeBetween APIs introduced in SPARK-21608
> --
>
> Key: SPARK-25862
> URL: https://issues.apache.org/jira/browse/SPARK-25862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> As a follow up to https://issues.apache.org/jira/browse/SPARK-25842, removing 
> the API so we can introduce a new one.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21608) Window rangeBetween() API should allow literal boundary

2018-10-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1567#comment-1567
 ] 

Apache Spark commented on SPARK-21608:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22870

> Window rangeBetween() API should allow literal boundary
> ---
>
> Key: SPARK-21608
> URL: https://issues.apache.org/jira/browse/SPARK-21608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 2.3.0
>
>
> The Window rangeBetween() API should allow literal boundaries; that means the 
> window range frame can be computed over double/date/timestamp values.
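
A minimal sketch of the literal-boundary usage this change enables, assuming the Spark 2.3 rangeBetween(Column, Column) overload (the data and boundary values are purely illustrative):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("range-frame-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("a", 5.0), ("a", 20.0)).toDF("category", "price")

// Count the rows whose price is within +/- 10.0 of the current row's price, per category.
val byPrice = Window.partitionBy($"category").orderBy($"price")
  .rangeBetween(lit(-10.0), lit(10.0))

df.withColumn("neighbours", count(lit(1)).over(byPrice)).show()
{code}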



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25862) Remove rangeBetween APIs introduced in SPARK-21608

2018-10-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25862:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove rangeBetween APIs introduced in SPARK-21608
> --
>
> Key: SPARK-25862
> URL: https://issues.apache.org/jira/browse/SPARK-25862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> As a follow up to https://issues.apache.org/jira/browse/SPARK-25842, removing 
> the API so we can introduce a new one.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25862) Remove rangeBetween APIs introduced in SPARK-21608

2018-10-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1566#comment-1566
 ] 

Apache Spark commented on SPARK-25862:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22870

> Remove rangeBetween APIs introduced in SPARK-21608
> --
>
> Key: SPARK-25862
> URL: https://issues.apache.org/jira/browse/SPARK-25862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> As a follow up to https://issues.apache.org/jira/browse/SPARK-25842, removing 
> the API so we can introduce a new one.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25862) Remove rangeBetween APIs introduced in SPARK-21608

2018-10-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25862:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove rangeBetween APIs introduced in SPARK-21608
> --
>
> Key: SPARK-25862
> URL: https://issues.apache.org/jira/browse/SPARK-25862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> As a follow up to https://issues.apache.org/jira/browse/SPARK-25842, removing 
> the API so we can introduce a new one.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25841) Redesign window function rangeBetween API

2018-10-28 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1559#comment-1559
 ] 

Reynold Xin commented on SPARK-25841:
-

I posted api proposal sketches in 
https://issues.apache.org/jira/browse/SPARK-25843

> Redesign window function rangeBetween API
> -
>
> Key: SPARK-25841
> URL: https://issues.apache.org/jira/browse/SPARK-25841
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> As I was reviewing the Spark API changes for 2.4, I found that through 
> organic, ad-hoc evolution the current API for window functions in Scala is 
> pretty bad.
>   
>  To illustrate the problem, we have two rangeBetween functions in Window 
> class:
>   
> {code:java}
> class Window {
>  def unboundedPreceding: Long
>  ...
>  def rangeBetween(start: Long, end: Long): WindowSpec
>  def rangeBetween(start: Column, end: Column): WindowSpec
> }{code}
>  
>  The Column version of rangeBetween was added in Spark 2.3 because the 
> previous version (Long) could only support integral values and not time 
> intervals. Now in order to support specifying unboundedPreceding in the 
> rangeBetween(Column, Column) API, we added an unboundedPreceding that returns 
> a Column in functions.scala.
>   
>  There are a few issues I have with the API:
>   
>  1. To the end user, this can be just super confusing. Why are there two 
> unboundedPreceding functions, in different classes, that are named the same 
> but return different types?
>   
>  2. Using Column as the parameter signature implies this can be an actual 
> Column, but in practice rangeBetween can only accept literal values.
>   
>  3. We added the new APIs to support intervals, but they don't actually work, 
> because in the implementation we try to validate the start is less than the 
> end, but calendar interval types are not comparable, and as a result we throw 
> a type mismatch exception at runtime: scala.MatchError: CalendarIntervalType 
> (of class org.apache.spark.sql.types.CalendarIntervalType$)
>   
>  4. In order to make interval work, users need to create an interval using 
> CalendarInterval, which is an internal class that has no documentation and no 
> stable API.
>   
>   
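
To make point 1 above concrete, a sketch of the two same-named boundaries as they existed in the 2.3/2.4-era API (signatures are taken from the description above and should be treated as assumptions):

{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{currentRow, unboundedPreceding}

// Long-based frame: Window.unboundedPreceding is a Long constant.
val longFrame = Window.orderBy("ts")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

// Column-based frame: functions.unboundedPreceding() returns a Column with the same name.
val columnFrame = Window.orderBy("ts")
  .rangeBetween(unboundedPreceding(), currentRow())
{code}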



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25843) Redesign rangeBetween API

2018-10-28 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-25843:

Description: 
See parent ticket for more information. Two proposals with sketches:

 
Proposal 1. Create a version of rangeBetween that accepts Strings, i.e. 
rangeBetween(String, String). This is obviously very flexible, but less type 
safe.
 
Proposal 2. Create a new type called WindowFrameBoundary:
 
 
{code:java}
trait WindowFrameBoundary
 
object WindowFrameBoundary {
  def unboundedPreceding: WindowFrameBoundary
  def unboundedFollowing: WindowFrameBoundary
  def currentRow: WindowFrameBoundary
  def at(value: Long): WindowFrameBoundary
  def interval(interval: String): WindowFrameBoundary
}{code}
 
And create a new rangeBetween that accepts WindowFrameBoundary values, i.e.
 
 
{code:java}
def rangeBetween(start: WindowFrameBoundary, end: WindowFrameBoundary)  {code}
 
This is also very flexible and type safe at the same time.
 
 
Note the two are not mutually exclusive, and we can also deprecate the existing 
confusing APIs.
 
 

  was:
See parent ticket for more information. I have a rough design that I will post 
later.

 


> Redesign rangeBetween API
> -
>
> Key: SPARK-25843
> URL: https://issues.apache.org/jira/browse/SPARK-25843
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> See parent ticket for more information. Two proposals with sketches:
>  
> Proposal 1. Create a version of rangeBetween that accepts Strings, i.e. 
> rangeBetween(String, String). This is obviously very flexible, but less type 
> safe.
>  
> Proposal 2. Create a new type called WindowFrameBoundary:
>  
>  
> {code:java}
> trait WindowFrameBoundary
>  
> object WindowFrameBoundary {
>   def unboundedPreceding: WindowFrameBoundary
>   def unboundedFollowing: WindowFrameBoundary
>   def currentRow: WindowFrameBoundary
>   def at(value: Long): WindowFrameBoundary
>   def interval(interval: String): WindowFrameBoundary
> }{code}
>  
> And create a new rangeBetween that accepts WindowFrameBoundary values, i.e.
>  
>  
> {code:java}
> def rangeBetween(start: WindowFrameBoundary, end: WindowFrameBoundary)  {code}
>  
> This is also very flexible and type safe at the same time.
>  
>  
> Note the two are not mutually exclusive, and we can also deprecate the 
> existing confusing APIs.
>  
>  
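
A purely illustrative sketch of how Proposal 2 might read if adopted; none of this exists in Spark, and the stub below only mirrors the shape of the proposal quoted above:

{code:java}
sealed trait WindowFrameBoundary

object WindowFrameBoundary {
  case object UnboundedPreceding extends WindowFrameBoundary
  case object UnboundedFollowing extends WindowFrameBoundary
  case object CurrentRow extends WindowFrameBoundary
  final case class At(value: Long) extends WindowFrameBoundary
  final case class Interval(interval: String) extends WindowFrameBoundary

  def unboundedPreceding: WindowFrameBoundary = UnboundedPreceding
  def unboundedFollowing: WindowFrameBoundary = UnboundedFollowing
  def currentRow: WindowFrameBoundary = CurrentRow
  def at(value: Long): WindowFrameBoundary = At(value)
  def interval(interval: String): WindowFrameBoundary = Interval(interval)
}

// A hypothetical call site (the rangeBetween overload itself is part of the proposal):
//   Window.partitionBy("user").orderBy("event_time")
//     .rangeBetween(WindowFrameBoundary.interval("-7 days"), WindowFrameBoundary.currentRow)
{code}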



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25862) Remove rangeBetween APIs introduced in SPARK-21608

2018-10-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-25862:
---

 Summary: Remove rangeBetween APIs introduced in SPARK-21608
 Key: SPARK-25862
 URL: https://issues.apache.org/jira/browse/SPARK-25862
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin


As a follow up to https://issues.apache.org/jira/browse/SPARK-25842, removing 
the API so we can introduce a new one.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-28 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1523#comment-1523
 ] 

Dongjoon Hyun commented on SPARK-25816:
---

Thank you, [~bzhang] and [~petertoth]. I also confirmed that this is a bug in 
2.3.x and 2.4.0 RC4 (and master). Thanks to [~petertoth], it looks like we can 
have the fix for 2.3.3 and 2.4.0 RC5.

cc [~cloud_fan]

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Brian Zhang
>Priority: Critical
> Attachments: final_allDatatypes_Spark.avro, source.snappy.parquet
>
>
> When there is a duplicate column name between the current Dataframe and the 
> original Dataframe it was selected from, Spark 2.3.0 and 2.3.1 do not resolve 
> the column correctly when it is used in an expression, hence causing a 
> casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)))))
> // v00's 3rd column is binary and the 16th is a map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> 2#1593[map_keys(2#1561)[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25816:
--
Affects Version/s: 2.4.0

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Brian Zhang
>Priority: Critical
> Attachments: final_allDatatypes_Spark.avro, source.snappy.parquet
>
>
> When there is a duplicate column name between the current Dataframe and the 
> original Dataframe it was selected from, Spark 2.3.0 and 2.3.1 do not resolve 
> the column correctly when it is used in an expression, hence causing a 
> casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)))))
> // v00's 3rd column is binary and the 16th is a map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> 2#1593[map_keys(2#1561)[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25816:
--
Affects Version/s: 2.3.2

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Brian Zhang
>Priority: Critical
> Attachments: final_allDatatypes_Spark.avro, source.snappy.parquet
>
>
> When there is a duplicate column name between the current Dataframe and the 
> original Dataframe it was selected from, Spark 2.3.0 and 2.3.1 do not resolve 
> the column correctly when it is used in an expression, hence causing a 
> casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)))))
> // v00's 3rd column is binary and the 16th is a map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> 2#1593[map_keys(2#1561)[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25758:
--
Labels: 3.0.0  (was: )

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
>  Labels: 3.0.0
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> now have a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, the deprecation targets the method for 
> removal in 3.0.
> I think we should deprecate the {{computeCost}} method on BisectingKMeans for 
> the same reasons.
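
For illustration only (not part of the original issue), a minimal sketch of the 
replacement the deprecation points to, assuming a DataFrame {{dataset}} with a 
vector column named "features":

{code:scala}
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Fit a bisecting k-means model on the assumed "features" column.
val model = new BisectingKMeans().setK(3).setFeaturesCol("features").fit(dataset)
val predictions = model.transform(dataset)

// Older metric slated for deprecation: sum of squared distances to the nearest centroid.
val cost = model.computeCost(dataset)

// Preferred evaluation: silhouette score via ClusteringEvaluator.
val silhouette = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .evaluate(predictions)
{code}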



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25845) Fix MatchError for calendar interval type in rangeBetween

2018-10-28 Thread Xingbo Jiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-25845.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark/pull/22853

> Fix MatchError for calendar interval type in rangeBetween
> -
>
> Key: SPARK-25845
> URL: https://issues.apache.org/jira/browse/SPARK-25845
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> WindowSpecDefinition checks that the frame's start bound is less than its end 
> bound (start < end), but CalendarIntervalType is not comparable, so it throws 
> the following exception at runtime:
>  
>  
> {noformat}
>  scala.MatchError: CalendarIntervalType (of class 
> org.apache.spark.sql.types.CalendarIntervalType$)  at 
> org.apache.spark.sql.catalyst.util.TypeUtils$.getInterpretedOrdering(TypeUtils.scala:58)
>  at 
> org.apache.spark.sql.catalyst.expressions.BinaryComparison.ordering$lzycompute(predicates.scala:592)
>  at 
> org.apache.spark.sql.catalyst.expressions.BinaryComparison.ordering(predicates.scala:592)
>  at 
> org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:797)
>  at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:496)
>  at 
> org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame.isGreaterThan(windowExpressions.scala:245)
>  at 
> org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame.checkInputDataTypes(windowExpressions.scala:216)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:171)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:171)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
>  at scala.collection.mutable.ArrayBuffer.forall(ArrayBuffer.scala:48) at 
> org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:183)
>  at 
> org.apache.spark.sql.catalyst.expressions.WindowSpecDefinition.resolved$lzycompute(windowExpressions.scala:48)
>  at 
> org.apache.spark.sql.catalyst.expressions.WindowSpecDefinition.resolved(windowExpressions.scala:48)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183)
>  at 
> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83) 
>{noformat}
>  
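
For context, a hedged sketch (not from this ticket) of the kind of query that 
exercises interval-typed range frame bounds; the table {{t}}, timestamp column 
{{ts}}, and value column {{v}} are illustrative:

{code:scala}
// A window whose RANGE frame bounds are calendar interval literals over a
// timestamp ordering. Before this fix, analyzing such a frame could hit the
// MatchError above when the two bounds are compared.
val windowed = spark.sql("""
  SELECT ts, v,
         sum(v) OVER (ORDER BY ts
                      RANGE BETWEEN INTERVAL 1 DAY PRECEDING
                                AND INTERVAL 1 DAY FOLLOWING) AS v_window
  FROM t
""")
{code}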



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1343#comment-1343
 ] 

Apache Spark commented on SPARK-25758:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22869

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> now have a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, the deprecation targets the method for 
> removal in 3.0.
> I think we should deprecate the {{computeCost}} method on BisectingKMeans for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25758:


Assignee: Marco Gaido  (was: Apache Spark)

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> now have a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, the deprecation targets the method for 
> removal in 3.0.
> I think we should deprecate the {{computeCost}} method on BisectingKMeans for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25758:


Assignee: Apache Spark  (was: Marco Gaido)

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Minor
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we 
> now have a better way to evaluate a clustering algorithm (the 
> {{ClusteringEvaluator}}). Moreover, the deprecation targets the method for 
> removal in 3.0.
> I think we should deprecate the {{computeCost}} method on BisectingKMeans for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat

2018-10-28 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25806:

Summary:  The instanceof FileSplit is redundant for ParquetFileFormat and 
OrcFileFormat  (was:  The instanceof FileSplit is redundant for 
ParquetFileFormat)

>  The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat
> --
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Trivial
>
> The instance of FileSplit is redundant in the {{ParquetFileFormat}} and 
> {{hive\orc\OrcFileFormat}} classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat

2018-10-28 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25806:

Description: The instance of FileSplit is redundant in the {{ParquetFileFormat}} and 
{{hive\orc\OrcFileFormat}} classes.  (was: The instance of FileSplit is redundant 
for buildReaderWithPartitionValues in the ParquetFileFormat class.)

>  The instanceof FileSplit is redundant for ParquetFileFormat
> 
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Trivial
>
> The instance of FileSplit is redundant in the {{ParquetFileFormat}} and 
> {{hive\orc\OrcFileFormat}} classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25833:


Assignee: Apache Spark

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Major
>
> A simple example to reproduce this issue.
>  Create a view via the Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> Query that view via Spark:
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> The _c0 in the above view definition is automatically generated by Hive and is 
> not recognized by Spark.
>  See the [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}
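
As a hedged workaround sketch (not from the ticket; the view and column names are 
illustrative), the quoted manual implies that supplying explicit column names 
avoids the generated _c0 aliases entirely:

{code:scala}
// Name the view's columns explicitly so no _c0-style aliases are generated.
// The same column list can be supplied when creating the view from the Hive CLI.
spark.sql("CREATE VIEW v2 (c1) AS SELECT * FROM (SELECT 1) t1")
spark.sql("SELECT * FROM v2").show()
{code}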



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25833:


Assignee: (was: Apache Spark)

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> A simple example to reproduce this issue.
>  Create a view via the Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> Query that view via Spark:
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> The _c0 in the above view definition is automatically generated by Hive and is 
> not recognized by Spark.
>  See the [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1302#comment-1302
 ] 

Apache Spark commented on SPARK-25833:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22868

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> A simple example to reproduce this issue.
>  Create a view via the Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> Query that view via Spark:
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> The _c0 in the above view definition is automatically generated by Hive and is 
> not recognized by Spark.
>  See the [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1303#comment-1303
 ] 

Apache Spark commented on SPARK-25833:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22868

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> A simple example to reproduce this issue.
>  Create a view via the Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> Query that view via Spark:
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> The _c0 in the above view definition is automatically generated by Hive and is 
> not recognized by Spark.
>  See the [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org