[jira] [Created] (SPARK-30723) Executing the example on https://spark.apache.org/docs/latest/running-on-yarn.html fails

2020-02-03 Thread Reinhard Eilmsteiner (Jira)
Reinhard Eilmsteiner created SPARK-30723:


 Summary: Executing the example on 
https://spark.apache.org/docs/latest/running-on-yarn.html fails
 Key: SPARK-30723
 URL: https://issues.apache.org/jira/browse/SPARK-30723
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.4.4
 Environment: jdk1.8.0_241

hadoop-3.1.0

spark-2.4.4-bin-without-hadoop
Reporter: Reinhard Eilmsteiner


running

{code:bash}
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10
{code}

results in

{code:java}
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
 at java.lang.Class.getDeclaredMethods0(Native Method)
 at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
 at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
 at java.lang.Class.getMethod0(Class.java:3018)
 at java.lang.Class.getMethod(Class.java:1784)
 at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
 at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
 at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
 ... 7 more
{code}

Should I just install slf4j, or is there a way around this?
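For reference, the {{spark-2.4.4-bin-without-hadoop}} ("Hadoop free") build does not bundle the Hadoop client jars or slf4j, so they have to be provided on Spark's classpath. A minimal sketch of the usual setup, assuming the {{hadoop}} command on the box points at the hadoop-3.1.0 install:

{code:bash}
# conf/spark-env.sh -- make the "Hadoop free" Spark build pick up the jars of an
# existing Hadoop installation (which also brings in slf4j), per the Spark docs
# on using Spark's "Hadoop Free" builds.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
{code}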






[jira] [Commented] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-03 Thread Guram Savinov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029619#comment-17029619
 ] 

Guram Savinov commented on SPARK-30701:
---

Ok, let's go to Hadoop project: 
https://issues.apache.org/jira/browse/HADOOP-16837

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4 / Hadoop 2.6.5
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Priority: Major
>  Labels: WIndows, hive, unit-test
> Attachments: HadoopGroupTest.java
>
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
> Seems like the problem is here: 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210
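For context, a toy illustration of why a Windows domain group such as {{TEST\Domain users}} trips this sort of validation. The pattern below is only an approximation of a chgrp-style argument check, not the exact Hadoop regex:

{code:java}
// Illustration only: an "allowed characters" check on the group name rejects
// Windows domain groups because of the backslash and the space.
val allowedGroup = "[-_./@a-zA-Z0-9]+".r   // hypothetical pattern, for illustration

def looksLikeValidGroup(g: String): Boolean =
  allowedGroup.pattern.matcher(g).matches()

println(looksLikeValidGroup("hadoop"))              // true
println(looksLikeValidGroup("TEST\\Domain users"))  // false -> "does not match expected pattern for group"
{code}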






[jira] [Commented] (SPARK-30706) TimeZone in writing pure date type in CSV output

2020-02-03 Thread Waldemar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029604#comment-17029604
 ] 

Waldemar commented on SPARK-30706:
--

Yes, please. I have attached a zip with these Spark CSV files and a JSON copy of my 
notebook.

To see the problem "on the west" (of Greenwich), please run the paragraphs in turn.

> TimeZone in writing pure date type in CSV output
> 
>
> Key: SPARK-30706
> URL: https://issues.apache.org/jira/browse/SPARK-30706
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.3
>Reporter: Waldemar
>Priority: Minor
> Attachments: DateZoneBug.zip
>
>
> If I read a string date from a CSV file, cast it to the date type, and write it to
> a CSV file again while running west of Greenwich, it writes the date of one day
> earlier. Repeating this operation in a loop, we can unwillingly end up with a date
> in the past.
> If the spark-shell runs east of Greenwich, all is OK.
> Writing to parquet is also OK.
> Example of code:
> {code:java}
> //
> val test_5_load = "hdfs://192.168.44.161:8020/db/wbiernacki/test_5_load.csv"
> val test_5_save = "hdfs://192.168.44.161:8020/db/wbiernacki/test_5_save.csv"
> val test_5 = spark.read.format("csv")
>   .option("header","true")
>   .load( test_5_load )
>   .withColumn("begin",to_date(col("begin" ),"yyyy-MM-dd"))
>   .withColumn("end"  ,to_date(col("end"   ),"yyyy-MM-dd"))
> test_5.show()
> test_5
>   .write.mode("overwrite")
>   .format("csv")
>   .option("header","true")
>   .save( test_5_save )
> {code}
>  Please perform this a few times. The test_5_load.csv file looks like:
> {code:java}
> //
> +--------+----------+----------+----+
> | patient|     begin|       end| new|
> +--------+----------+----------+----+
> |waldemar|2015-09-22|2015-09-23|old1|
> +--------+----------+----------+----+{code}
>  
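One quick diagnostic (a sketch, not a confirmed fix): pin the SQL session time zone before writing and see whether the dates stop shifting. {{spark.sql.session.timeZone}} and the CSV {{dateFormat}} option are standard settings; the rest reuses the names from the snippet above.

{code:java}
// Diagnostic sketch: force the session time zone to UTC and write again.
// If the written dates no longer drift by a day, the shift comes from the
// machine's (west-of-Greenwich) time zone being applied when the DATE values
// are rendered back to text for CSV.
spark.conf.set("spark.sql.session.timeZone", "UTC")

test_5
  .write.mode("overwrite")
  .format("csv")
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd")   // make the output date pattern explicit
  .save(test_5_save)
{code}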






[jira] [Updated] (SPARK-30706) TimeZone in writing pure date type in CSV output

2020-02-03 Thread Waldemar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Waldemar updated SPARK-30706:
-
Attachment: DateZoneBug.zip

> TimeZone in writing pure date type in CSV output
> 
>
> Key: SPARK-30706
> URL: https://issues.apache.org/jira/browse/SPARK-30706
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.3
>Reporter: Waldemar
>Priority: Minor
> Attachments: DateZoneBug.zip
>
>
> If I read a string date from a CSV file, cast it to the date type, and write it to
> a CSV file again while running west of Greenwich, it writes the date of one day
> earlier. Repeating this operation in a loop, we can unwillingly end up with a date
> in the past.
> If the spark-shell runs east of Greenwich, all is OK.
> Writing to parquet is also OK.
> Example of code:
> {code:java}
> //
> val test_5_load = "hdfs://192.168.44.161:8020/db/wbiernacki/test_5_load.csv"
> val test_5_save = "hdfs://192.168.44.161:8020/db/wbiernacki/test_5_save.csv"
> val test_5 = spark.read.format("csv")
>   .option("header","true")
>   .load( test_5_load )
>   .withColumn("begin",to_date(col("begin" ),"yyyy-MM-dd"))
>   .withColumn("end"  ,to_date(col("end"   ),"yyyy-MM-dd"))
> test_5.show()
> test_5
>   .write.mode("overwrite")
>   .format("csv")
>   .option("header","true")
>   .save( test_5_save )
> {code}
>  Please perform this a few times. The test_5_load.csv file looks like:
> {code:java}
> //
> +--------+----------+----------+----+
> | patient|     begin|       end| new|
> +--------+----------+----------+----+
> |waldemar|2015-09-22|2015-09-23|old1|
> +--------+----------+----------+----+{code}
>  






[jira] [Created] (SPARK-30722) Document type hints in pandas UDF

2020-02-03 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30722:


 Summary: Document type hints in pandas UDF
 Key: SPARK-30722
 URL: https://issues.apache.org/jira/browse/SPARK-30722
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


We should document the new type hints for pandas UDF introduced at SPARK-28264.






[jira] [Resolved] (SPARK-30717) AQE subquery map should cache `SubqueryExec` instead of `ExecSubqueryExpression`

2020-02-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30717.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27446
[https://github.com/apache/spark/pull/27446]

> AQE subquery map should cache `SubqueryExec` instead of 
> `ExecSubqueryExpression`
> 
>
> Key: SPARK-30717
> URL: https://issues.apache.org/jira/browse/SPARK-30717
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-30717) AQE subquery map should cache `SubqueryExec` instead of `ExecSubqueryExpression`

2020-02-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30717:
---

Assignee: Wei Xue

> AQE subquery map should cache `SubqueryExec` instead of 
> `ExecSubqueryExpression`
> 
>
> Key: SPARK-30717
> URL: https://issues.apache.org/jira/browse/SPARK-30717
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
>







[jira] [Updated] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30688:
-
Affects Version/s: (was: 3.0.0)

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20201, yyyyww)|
> +-----------------------------+
> |                         null|
> +-----------------------------+
>  
> scala> spark.sql("select unix_timestamp('20202', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20202, yyyyww)|
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems
> that Spark is using java.text.SimpleDateFormat and tries to parse the
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> The parse fails: SimpleDateFormat throws an Unparseable exception, but Spark
> handles it silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy
> java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
> expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse






[jira] [Comment Edited] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029576#comment-17029576
 ] 

Hyukjin Kwon edited comment on SPARK-30688 at 2/4/20 4:45 AM:
--

[~rakson] can you clarify if this issue is fixed in the master or not?

{quote}
Spark-3.0: I did some tests where Spark no longer uses the legacy
java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
expect a valid date with a valid format
{quote}



was (Author: hyukjin.kwon):
Ah, okay. I misread this:
{quote}
Spark-3.0: I did some tests where Spark no longer uses the legacy
java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
expect a valid date with a valid format
{quote}

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 3.0.0
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20201, yyyyww)|
> +-----------------------------+
> |                         null|
> +-----------------------------+
>  
> scala> spark.sql("select unix_timestamp('20202', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20202, yyyyww)|
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems
> that Spark is using java.text.SimpleDateFormat and tries to parse the
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> The parse fails: SimpleDateFormat throws an Unparseable exception, but Spark
> handles it silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy
> java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
> expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse






[jira] [Updated] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30688:
-
Affects Version/s: 3.0.0

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 3.0.0
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20201, yyyyww)|
> +-----------------------------+
> |                         null|
> +-----------------------------+
>  
> scala> spark.sql("select unix_timestamp('20202', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20202, yyyyww)|
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems
> that Spark is using java.text.SimpleDateFormat and tries to parse the
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> The parse fails: SimpleDateFormat throws an Unparseable exception, but Spark
> handles it silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy
> java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
> expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse






[jira] [Reopened] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-30688:
--

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20201, yyyyww)|
> +-----------------------------+
> |                         null|
> +-----------------------------+
>  
> scala> spark.sql("select unix_timestamp('20202', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20202, yyyyww)|
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems
> that Spark is using java.text.SimpleDateFormat and tries to parse the
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> The parse fails: SimpleDateFormat throws an Unparseable exception, but Spark
> handles it silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy
> java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
> expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse






[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029576#comment-17029576
 ] 

Hyukjin Kwon commented on SPARK-30688:
--

Ah, okay. I misread this:
{quote}
Spark-3.0: I did some tests where Spark no longer uses the legacy
java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
expect a valid date with a valid format
{quote}

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20201, yyyyww)|
> +-----------------------------+
> |                         null|
> +-----------------------------+
>  
> scala> spark.sql("select unix_timestamp('20202', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20202, yyyyww)|
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems
> that Spark is using java.text.SimpleDateFormat and tries to parse the
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> The parse fails: SimpleDateFormat throws an Unparseable exception, but Spark
> handles it silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy
> java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
> expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse






[jira] [Commented] (SPARK-30677) Spark Streaming Job stuck when Kinesis Shard is increased when the job is running

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029574#comment-17029574
 ] 

Hyukjin Kwon commented on SPARK-30677:
--

Are you able to provide reproducible steps and code? This seems difficult to 
reproduce as described.

> Spark Streaming Job stuck when Kinesis Shard is increased when the job is 
> running
> -
>
> Key: SPARK-30677
> URL: https://issues.apache.org/jira/browse/SPARK-30677
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Mullaivendhan Ariaputhri
>Priority: Major
> Attachments: Cluster-Config-P1.JPG, Cluster-Config-P2.JPG, 
> Instance-Config-P1.JPG, Instance-Config-P2.JPG
>
>
> The Spark job stopped processing when the number of shards was increased while the 
> job was already running.
> We have observed the exceptions below.
>  
> 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - 
> Failed to write to write ahead log
>  2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - 
> Failed to write to write ahead log
>  2020-01-27 06:42:29 ERROR FileBasedWriteAheadLog_ReceivedBlockTracker:70 - 
> Failed to write to write ahead log after 3 failures
>  2020-01-27 06:42:29 WARN BatchedWriteAheadLog:87 - BatchedWriteAheadLog 
> Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 
> lim=1845 cap=1845],1580107349095,Future()))
>  java.io.IOException: Not supported
>  at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSystem.java:588)
>  at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181)
>  at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295)
>  at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35)
>  at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32)
>  at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32)
>  at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35)
>  at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229)
>  at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94)
>  at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50)
>  at 
> org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175)
>  at 
> org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142)
>  at java.lang.Thread.run(Thread.java:748)
>  2020-01-27 06:42:29 WARN ReceivedBlockTracker:87 - Exception thrown while 
> writing record: 
> BlockAdditionEvent(ReceivedBlockInfo(0,Some(36),Some(SequenceNumberRanges(SequenceNumberRange(XXX,shardId-0006,49603657998853972269624727295162770770442241924489281634,49603657998853972269624727295206292099948368574778703970,36))),WriteAheadLogBasedStoreResult(input-0-1580106915391,Some(36),FileBasedWriteAheadLogSegment(s3://XXX/spark/checkpoint/XX/XXX/receivedData/0/log-1580107349000-1580107409000,0,31769
>  to the WriteAheadLog.
>  org.apache.spark.SparkException: Exception thrown in awaitResult: 
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>  at 
> org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84)
>  at 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242)
>  at 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89)
>  at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347)
>  at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
>  at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.io.IOException: Not supported
>  at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSystem.java:588)
>  at 
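Not a fix for the shard-increase hang itself, but the trace above fails in {{S3NativeFileSystem.append}}: the write-ahead log is trying to append on S3, which S3 does not support. Spark Streaming has documented settings for write-ahead logs on S3-like file systems; a hedged sketch of the relevant flags (to be added to the existing spark-submit command):

{code:bash}
# Write-ahead logs on S3: close the log file after each write instead of
# relying on append/flush, which S3NativeFileSystem cannot provide.
spark-submit \
  --conf spark.streaming.driver.writeAheadLog.closeFileAfterWrite=true \
  --conf spark.streaming.receiver.writeAheadLog.closeFileAfterWrite=true \
  ... # rest of the existing submit command
{code}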

[jira] [Commented] (SPARK-30675) Spark Streaming Job stopped reading events from Queue upon Deregister Exception

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029575#comment-17029575
 ] 

Hyukjin Kwon commented on SPARK-30675:
--

Are you able to provide a minimised reproducer? It seems impossible to reproduce as described.

> Spark Streaming Job stopped reading events from Queue upon Deregister 
> Exception
> ---
>
> Key: SPARK-30675
> URL: https://issues.apache.org/jira/browse/SPARK-30675
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.4.3
>Reporter: Mullaivendhan Ariaputhri
>Priority: Major
> Attachments: Cluster-Config-P1.JPG, Cluster-Config-P2.JPG, 
> Instance-Config-P1.JPG, Instance-Config-P2.JPG
>
>
>  
> *+Stream+*
> We have observed a discrepancy in the Kinesis stream: the stream has
> continuous incoming records, but GetRecords.Records is not available.
>  
> Upon analysis, we understood that no GetRecords calls were made by the
> Spark job during that time, which is why the GetRecords count is not available;
> so there should be no issue with the stream itself, as the messages were being
> received.
> *+Spark/EMR+*
> From the driver logs, it has been found that the driver de-registered the 
> receiver for the stream
> +*_Driver Logs_*+
> 2020-01-03 11:11:40 ERROR ReceiverTracker:70 - *{color:#de350b}Deregistered 
> receiver for stream 0: Error while storing block into Spark - 
> java.util.concurrent.TimeoutException: Futures timed out after [30 
> seconds]{color}*
>     at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>     at 
> org.apache.spark.streaming.receiver.{color:#de350b}*WriteAheadLogBasedBlockHandler.storeBlock*{color}(ReceivedBlockHandler.scala:210)
>     at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
>     at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
>     at 
> org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
>     at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:293)
>     at 
> org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:344)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)
>     ...
> *Till this point, there is no receiver being started/registered. From the 
> executor logs (below), it has been observed that one of the executors was 
> running on the container.*
>  
> +*_Executor Logs_*+
> 2020-01-03 11:11:30 INFO  BlockManager:54 - Removing RDD 2851002
> 2020-01-03 11:11:31 INFO  ReceiverSupervisorImpl:54 - 
> {color:#de350b}*Stopping receiver with message: Error while storing block 
> into Spark: java.util.concurrent.TimeoutException: Futures timed out after 
> [30 seconds]*{color}
> 2020-01-03 11:11:31 INFO  Worker:593 - Worker shutdown requested.
> 2020-01-03 11:11:31 INFO  LeaseCoordinator:298 - Worker 
> ip-10-61-71-29.ap-southeast-2.compute.internal:a7567f14-16be-4aca-8f64-401b0b29aea2
>  has successfully stopped lease-tracking threads
> 2020-01-03 11:11:31 INFO  KinesisRecordProcessor:54 - Shutdown:  Shutting 
> down workerId 
> ip-10-61-71-29.ap-southeast-2.compute.internal:a7567f14-16be-4aca-8f64-401b0b29aea2
>  with reason ZOMBIE
> 2020-01-03 11:11:32 INFO  MemoryStore:54 - Block input-0-1575374565339 stored 
> as bytes in memory (estimated size /7.3 KB, free 3.4 GB)
> 2020-01-03 11:11:33 INFO  Worker:634 - All record processors have been shut 
> down successfully.
>  
> *After this point, the Kinesis KCL worker that was reading the queue appears to 
> have been terminated, which is why we see the gap in GetRecords.*  
>  
> +*Mitigation*+
> Increased the timeouts:
>  * 'spark.streaming.receiver.blockStoreTimeout' to 59 seconds (from the default of 
> 30 seconds) 
>  * 'spark.streaming.driver.writeAheadLog.batchingTimeout' to 30 seconds (from the 
> default of 5 seconds)
>  
> Note: 
>  1. Write-ahead logs and checkpoints are maintained in an AWS S3 bucket
> 2. Spark submit Configuration as below:
> spark-submit --deploy-mode cluster --executor-memory 4608M --driver-memory 
> 4608M 
>  --conf 

[jira] [Commented] (SPARK-30687) When reading from a file with pre-defined schema and encountering a single value that is not the same type as that of its column , Spark nullifies the entire row

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029572#comment-17029572
 ] 

Hyukjin Kwon commented on SPARK-30687:
--

[~bnguye1010], Spark 2.3.x is EOL. Can you test and see if the issue exists in 
2.4.x?

> When reading from a file with pre-defined schema and encountering a single 
> value that is not the same type as that of its column , Spark nullifies the 
> entire row
> -
>
> Key: SPARK-30687
> URL: https://issues.apache.org/jira/browse/SPARK-30687
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bao Nguyen
>Priority: Major
>
> When reading from a file with a pre-defined schema and encountering a single 
> value that is not of the same type as its column, Spark nullifies the 
> entire row instead of setting only that cell to null.
>  
> {code:java}
> case class TestModel(
>   num: Double, test: String, mac: String, value: Double
> )
> val schema = 
> ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType]
> //here's the content of the file test.data
> //1~test~mac1~2
> //1.0~testdatarow2~mac2~non-numeric
> //2~test1~mac1~3
> val ds = spark
>   .read
>   .schema(schema)
>   .option("delimiter", "~")
>   .csv("/test-data/test.data")
> ds.show();
> //the content of data frame. second row is all null. 
> //  ++-++-+
> //  | num| test| mac|value|
> //  ++-++-+
> //  | 1.0| test|mac1|  2.0|
> //  |null| null|null| null|
> //  | 2.0|test1|mac1|  3.0|
> //  ++-++-+
> //should be
> // ++--++-+ 
> // | num| test | mac|value| 
> // ++--++-+ 
> // | 1.0| test |mac1| 2.0 | 
> // |1.0 |testdatarow2  |mac2| null| 
> // | 2.0|test1 |mac1| 3.0 | 
> // ++--++-+{code}
>  
>  
>  
>  
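As a possible user-side workaround (a sketch, assuming only the {{value}} column is fragile; not a statement about what Spark should do): read the column as a string and cast it afterwards, so a bad token only nulls that cell instead of the whole row.

{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Read `value` as StringType first, then cast to DoubleType; a non-numeric token
// becomes a null cell after the cast rather than nullifying the entire row.
val looseSchema = StructType(Seq(
  StructField("num", DoubleType),
  StructField("test", StringType),
  StructField("mac", StringType),
  StructField("value", StringType)    // DoubleType in the original schema
))

val ds = spark
  .read
  .schema(looseSchema)
  .option("delimiter", "~")
  .csv("/test-data/test.data")
  .withColumn("value", col("value").cast(DoubleType))

ds.show()
{code}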






[jira] [Resolved] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30688.
--
Resolution: Incomplete

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20201, yyyyww)|
> +-----------------------------+
> |                         null|
> +-----------------------------+
>  
> scala> spark.sql("select unix_timestamp('20202', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20202, yyyyww)|
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems
> that Spark is using java.text.SimpleDateFormat and tries to parse the
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> The parse fails: SimpleDateFormat throws an Unparseable exception, but Spark
> handles it silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy
> java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
> expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse






[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029570#comment-17029570
 ] 

Hyukjin Kwon commented on SPARK-30688:
--

So switching to the new java.time APIs fixed this. I guess it's SPARK-26651, but 
that cannot be backported.
Also, 2.3.x is EOL, so no backports or releases will be done on the 2.3.x line.
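For anyone reproducing the 2.x behaviour outside Spark, here is a minimal sketch of the failure mode described in this ticket (assuming the week-based pattern was {{yyyyww}}; this is plain SimpleDateFormat, not Spark's actual UnixTime code path):

{code:java}
import java.text.{ParseException, SimpleDateFormat}
import java.util.TimeZone

// Sketch of the legacy path: parse with SimpleDateFormat and map a parse
// failure to None, which is what surfaces as NULL in the SQL result.
val fmt = new SimpleDateFormat("yyyyww")            // assumed pattern
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
fmt.setLenient(false)                               // non-lenient: inconsistent week/year combinations fail

def unixTimestamp(s: String): Option[Long] =
  try Some(fmt.parse(s).getTime / 1000L)
  catch { case _: ParseException => None }          // swallowed -> NULL

println(unixTimestamp("20201"))   // may be None (the reported NULL)
println(unixTimestamp("20202"))   // may be Some(1578182400), i.e. 2020-01-05 00:00:00 UTC
{code}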

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20201, yyyyww)|
> +-----------------------------+
> |                         null|
> +-----------------------------+
>  
> scala> spark.sql("select unix_timestamp('20202', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20202, yyyyww)|
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems
> that Spark is using java.text.SimpleDateFormat and tries to parse the
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> The parse fails: SimpleDateFormat throws an Unparseable exception, but Spark
> handles it silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy
> java.text.SimpleDateFormat but the java date/time API; the date/time API seems to
> expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse






[jira] [Resolved] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30701.
--
Resolution: Not A Problem

I am resolving this as it's a Hadoop side problem.

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4 / Hadoop 2.6.5
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Priority: Major
>  Labels: WIndows, hive, unit-test
> Attachments: HadoopGroupTest.java
>
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
> Seems like the problem is here: 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210






[jira] [Assigned] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30701:


Assignee: (was: Felix Cheung)

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4 / Hadoop 2.6.5
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Priority: Major
>  Labels: WIndows, hive, unit-test
> Attachments: HadoopGroupTest.java
>
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
> Seems like the problem is here: 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210






[jira] [Commented] (SPARK-30706) TimeZone in writing pure date type in CSV output

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029566#comment-17029566
 ] 

Hyukjin Kwon commented on SPARK-30706:
--

Can you show your csv files?

> TimeZone in writing pure date type in CSV output
> 
>
> Key: SPARK-30706
> URL: https://issues.apache.org/jira/browse/SPARK-30706
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.3
>Reporter: Waldemar
>Priority: Minor
>
> If I read a string date from a CSV file, cast it to the date type, and write it to
> a CSV file again while running west of Greenwich, it writes the date of one day
> earlier. Repeating this operation in a loop, we can unwillingly end up with a date
> in the past.
> If the spark-shell runs east of Greenwich, all is OK.
> Writing to parquet is also OK.
> Example of code:
> {code:java}
> //
> val test_5_load = "hdfs://192.168.44.161:8020/db/wbiernacki/test_5_load.csv"
> val test_5_save = "hdfs://192.168.44.161:8020/db/wbiernacki/test_5_save.csv"
> val test_5 = spark.read.format("csv")
>   .option("header","true")
>   .load( test_5_load )
>   .withColumn("begin",to_date(col("begin" ),"yyyy-MM-dd"))
>   .withColumn("end"  ,to_date(col("end"   ),"yyyy-MM-dd"))
> test_5.show()
> test_5
>   .write.mode("overwrite")
>   .format("csv")
>   .option("header","true")
>   .save( test_5_save )
> {code}
>  Please perform this a few times. The test_5_load.csv file looks like:
> {code:java}
> //
> +--------+----------+----------+----+
> | patient|     begin|       end| new|
> +--------+----------+----------+----+
> |waldemar|2015-09-22|2015-09-23|old1|
> +--------+----------+----------+----+{code}
>  






[jira] [Resolved] (SPARK-30709) Spark 2.3 to Spark 2.4 Upgrade. Problems reading HIVE partitioned tables.

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30709.
--
Resolution: Invalid

> Spark 2.3 to Spark 2.4 Upgrade. Problems reading HIVE partitioned tables.
> -
>
> Key: SPARK-30709
> URL: https://issues.apache.org/jira/browse/SPARK-30709
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: PRE- Production
>Reporter: Carlos Mario
>Priority: Major
>  Labels: SQL, Spark
>
> Hello
> We recently updated our preproduction environment from Spark 2.3 to Spark 
> 2.4.0
> Over time we have created a large number of tables in the Hive Metastore, 
> partitioned by two fields, one of them String and the other one BigInt.
> We were reading these tables with Spark 2.3 with no problem, but after 
> upgrading to Spark 2.4 we get the following log every time we run our software:
> 
> log_filterBIGINT.out:
>  Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string) Caused by: MetaException(message:Filtering is supported 
> only on partition keys of type string) Caused by: 
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
>  
> hadoop-cmf-hive-HIVEMETASTORE-isblcsmsttc0001.scisb.isban.corp.log.out.1:
>  
> 2020-01-10 09:36:05,781 ERROR 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-138]: 
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> 2020-01-10 11:19:19,208 ERROR 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-187]: 
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> 2020-01-10 11:19:54,780 ERROR 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-167]: 
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
>  
>  
> We know the best practice from the Spark point of view is to use the 'STRING' type 
> for partition columns, but we need to explore a solution that we can 
> deploy with ease, given the large number of tables already created with a bigint 
> partition column.
>  
> As a first solution we tried setting the 
> spark.sql.hive.manageFilesourcePartitions parameter to false in the spark-submit, 
> but after rerunning the software the error was still there.
>  
> Is there anyone in the community who has experienced the same problem? What was 
> the solution for it? 
>  
> Kind Regards and thanks in advance.
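One more thing that is sometimes worth trying for non-string partition columns, sketched below with placeholder table and column names (hedged: it disables metastore-side partition filter pushdown, so Spark lists all partitions and filters them itself, which can be slow on tables with very many partitions):

{code:java}
// Placeholder names; the point is the conf, not the query.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "false")
spark.sql("SELECT * FROM some_db.some_table WHERE part_bigint_col = 20200110").show()
{code}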






[jira] [Commented] (SPARK-30709) Spark 2.3 to Spark 2.4 Upgrade. Problems reading HIVE partitioned tables.

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029562#comment-17029562
 ] 

Hyukjin Kwon commented on SPARK-30709:
--

Please ask questions into mailing list (https://spark.apache.org/community.html)

> Spark 2.3 to Spark 2.4 Upgrade. Problems reading HIVE partitioned tables.
> -
>
> Key: SPARK-30709
> URL: https://issues.apache.org/jira/browse/SPARK-30709
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: PRE- Production
>Reporter: Carlos Mario
>Priority: Major
>  Labels: SQL, Spark
>
> Hello
> We recently updated our preproduction environment from Spark 2.3 to Spark 
> 2.4.0
> Over time we have created a large number of tables in the Hive Metastore, 
> partitioned by two fields, one of them String and the other one BigInt.
> We were reading these tables with Spark 2.3 with no problem, but after 
> upgrading to Spark 2.4 we get the following log every time we run our software:
> 
> log_filterBIGINT.out:
>  Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string) Caused by: MetaException(message:Filtering is supported 
> only on partition keys of type string) Caused by: 
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
>  
> hadoop-cmf-hive-HIVEMETASTORE-isblcsmsttc0001.scisb.isban.corp.log.out.1:
>  
> 2020-01-10 09:36:05,781 ERROR 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-138]: 
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> 2020-01-10 11:19:19,208 ERROR 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-187]: 
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> 2020-01-10 11:19:54,780 ERROR 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-167]: 
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
>  
>  
> We know the best practice from the Spark point of view is to use the 'STRING' type 
> for partition columns, but we need to explore a solution that we can 
> deploy with ease, given the large number of tables already created with a bigint 
> partition column.
>  
> As a first solution we tried setting the 
> spark.sql.hive.manageFilesourcePartitions parameter to false in the spark-submit, 
> but after rerunning the software the error was still there.
>  
> Is there anyone in the community who has experienced the same problem? What was 
> the solution for it? 
>  
> Kind Regards and thanks in advance.






[jira] [Updated] (SPARK-30710) SPARK 2.4.4 - DROP TABLE and drop HDFS

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30710:
-
Fix Version/s: (was: 2.4.4)

> SPARK 2.4.4 - DROP TABLE and drop HDFS
> --
>
> Key: SPARK-30710
> URL: https://issues.apache.org/jira/browse/SPARK-30710
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.4.4
>Reporter: Nguyen Nhanduc
>Priority: Major
>  Labels: drop, hive, spark, table
>
> Hi all,
> I need to DROP a Hive table and clear its data in HDFS. But I can't convert it 
> from external to managed (I have to create external tables for my business).
> On Spark 2.0.2, I could run ALTER TABLE  SET TBLPROPERTIES 
> ('EXTERNAL' = 'FALSE'); and then DROP TABLE ; the metadata 
> and data were deleted together.
> But on Spark 2.4.4 this does not work. I need a solution for that (drop an 
> external table and drop its data in HDFS, or convert an external table to a 
> managed table and drop it).
> Many thanks.
>  
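Not an answer on the ALTER TABLE behaviour change itself, but a hedged sketch of another way to get the same end result from Spark: drop the external table's metadata and then delete its location explicitly (table name and path below are placeholders):

{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}

// DROP TABLE on an external table only removes the metadata, so remove the
// data directory ourselves afterwards.
spark.sql("DROP TABLE IF EXISTS my_db.my_external_table")

val location = new Path("hdfs:///warehouse/my_db.db/my_external_table")  // placeholder
val fs: FileSystem = location.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.delete(location, true)  // recursive
{code}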






[jira] [Resolved] (SPARK-30710) SPARK 2.4.4 - DROP TABLE and drop HDFS

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30710.
--
Resolution: Invalid

> SPARK 2.4.4 - DROP TABLE and drop HDFS
> --
>
> Key: SPARK-30710
> URL: https://issues.apache.org/jira/browse/SPARK-30710
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.4.4
>Reporter: Nguyen Nhanduc
>Priority: Major
>  Labels: drop, hive, spark, table
>
> Hi all,
> I need to DROP a Hive table and clear its data in HDFS. But I can't convert it 
> from external to managed (I have to create external tables for my business).
> On Spark 2.0.2, I could run ALTER TABLE  SET TBLPROPERTIES 
> ('EXTERNAL' = 'FALSE'); and then DROP TABLE ; the metadata 
> and data were deleted together.
> But on Spark 2.4.4 this does not work. I need a solution for that (drop an 
> external table and drop its data in HDFS, or convert an external table to a 
> managed table and drop it).
> Many thanks.
>  






[jira] [Commented] (SPARK-30710) SPARK 2.4.4 - DROP TABLE and drop HDFS

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029561#comment-17029561
 ] 

Hyukjin Kwon commented on SPARK-30710:
--

Please ask questions into mailing list 
(https://spark.apache.org/community.html).

> SPARK 2.4.4 - DROP TABLE and drop HDFS
> --
>
> Key: SPARK-30710
> URL: https://issues.apache.org/jira/browse/SPARK-30710
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.4.4
>Reporter: Nguyen Nhanduc
>Priority: Major
>  Labels: drop, hive, spark, table
> Fix For: 2.4.4
>
>
> Hi all,
> I need to DROP a Hive table and clear its data in HDFS. But I can't convert it 
> from external to managed (I have to create external tables for my business).
> On Spark 2.0.2, I could run ALTER TABLE  SET TBLPROPERTIES 
> ('EXTERNAL' = 'FALSE'); and then DROP TABLE ; the metadata 
> and data were deleted together.
> But on Spark 2.4.4 this does not work. I need a solution for that (drop an 
> external table and drop its data in HDFS, or convert an external table to a 
> managed table and drop it).
> Many thanks.
>  






[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029560#comment-17029560
 ] 

Hyukjin Kwon commented on SPARK-30711:
--

It seems to pass fine on master. [~schreiber], can you test it out against the 
Spark 3.0 preview?
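In the meantime, a common way to get a single offending query past the 64KB generated-method limit on 2.4.x is to turn off whole-stage code generation for that job (sketch; {{spark.sql.codegen.wholeStage}} is an existing SQL conf, at the cost of losing the codegen speed-up):

{code:java}
// Fall back to the non-whole-stage execution path so no single generated
// processNext() method has to contain the entire pipeline.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}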

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 

[jira] [Updated] (SPARK-30710) SPARK 2.4.4 - DROP TABLE and drop HDFS

2020-02-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30710:
-
Target Version/s:   (was: 2.4.4)

> SPARK 2.4.4 - DROP TABLE and drop HDFS
> --
>
> Key: SPARK-30710
> URL: https://issues.apache.org/jira/browse/SPARK-30710
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.4.4
>Reporter: Nguyen Nhanduc
>Priority: Major
>  Labels: drop, hive, spark, table
> Fix For: 2.4.4
>
>
> Hi all,
> I need to DROP a Hive table and clear its data in HDFS, but I can't convert it 
> from external to managed (I have to create external tables for my business).
> On Spark 2.0.2 I could run ALTER TABLE  SET TBLPROPERTIES 
> ('EXTERNAL' = 'FALSE'); and then DROP TABLE ; the metadata 
> and the data were deleted together.
> But on Spark 2.4.4 this does not work. I need a solution for that (drop an 
> external table and its data in HDFS, or convert an external table to a managed 
> table and drop it).
> Many thanks.
>  
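
A hedged sketch of one possible workaround on 2.4, assuming a spark-shell SparkSession 
and a hypothetical table name (not an official recommendation): drop the external 
table's metadata, then delete its HDFS location explicitly.
{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

val tableName = "mydb.my_external_table"  // hypothetical
// DESCRIBE FORMATTED reports the table location in the data_type column of the "Location" row
val location = spark.sql(s"DESCRIBE FORMATTED $tableName")
  .where("col_name = 'Location'")
  .head().getString(1)

spark.sql(s"DROP TABLE $tableName")       // for an EXTERNAL table this removes metadata only
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(location), true)       // delete the data directory recursively
{code}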



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029558#comment-17029558
 ] 

Hyukjin Kwon commented on SPARK-30712:
--

Do you have some work already done and/or performance numbers to show?

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.
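
For illustration only (my own sketch, not the proposed change): Parquet exposes 
per-row-group row counts in the file footer, so they can be read without scanning 
data pages, e.g. via parquet-hadoop; the row-width estimate is an assumption.
{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

def totalRowCount(paths: Seq[Path], conf: Configuration): Long =
  paths.map { p =>
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(p, conf))
    try reader.getFooter.getBlocks.asScala.map(_.getRowCount).sum  // rows per row group
    finally reader.close()
  }.sum

// sizeInBytes could then be approximated as totalRowCount(...) * estimatedRowWidthInBytes,
// where the row width might be derived from the schema's default sizes (assumption).
{code}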



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029557#comment-17029557
 ] 

Hyukjin Kwon commented on SPARK-30712:
--

To do that, it would have to actually read the files, and potentially every file. I 
think that adds some more overhead, so I am not yet so positive about this.
This JIRA relates to SPARK-24914

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30714) DSV2: Vectorized datasource does not have handling for ProlepticCalendar

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029555#comment-17029555
 ] 

Hyukjin Kwon commented on SPARK-30714:
--

Spark 2.3.x is EOL, so there won't be any more backports or releases in the 2.3.x 
line.
https://issues.apache.org/jira/browse/SPARK-26651 is mostly fixed in Spark 3, 
and it's a pretty breaking change, so it wasn't backported to 2.4 either.


> DSV2: Vectorized datasource does not have handling for ProlepticCalendar
> 
>
> Key: SPARK-30714
> URL: https://issues.apache.org/jira/browse/SPARK-30714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Shubham Chaurasia
>Priority: Major
>
> Consider the following scenarios - 
> 1)
> {code:scala}
> scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
> "1580736255261").load.show(false)
> MyDataSourceReader.createDataReaderFactories: ts_millis:  1580736255261
> Using LocalDateTime: 2020-02-03T13:24:15
> +---+
> |my_ts  |
> +---+
> |2020-02-03 13:24:15.261|
> +---+
> {code}
> In the above output, we can see that both timestamps (the one logged and the 
> one in the dataframe) are equal (up to seconds).
> 2)
> {code:scala}
> scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
> "-6213559680").load.show(false)
> MyDataSourceReader.createDataReaderFactories: ts_millis:  -6213559680
> Using LocalDateTime: 0001-01-01T00:00
> +---+
> |my_ts  |
> +---+
> |0001-01-03 00:00:00|
> +---+
> {code}
> Here in 2), we can see that the timestamp coming from the DataReader is not converted 
> properly according to the proleptic calendar, and hence the two timestamps are 
> different. 
> Code to Repro
> DataSourceReader 
> {code:java}
> public class MyDataSourceReader implements DataSourceReader, SupportsScanColumnarBatch {
>   private StructType schema;
>   private long timestampToProduce = -6213559680L;
> 
>   public MyDataSourceReader(Map<String, String> options) {
>     initOptions(options);
>   }
> 
>   private void initOptions(Map<String, String> options) {
>     String ts = options.get("ts_millis");
>     if (ts != null) {
>       timestampToProduce = Long.parseLong(ts);
>     }
>   }
> 
>   @Override public StructType readSchema() {
>     StructField[] fields = new StructField[1];
>     fields[0] = new StructField("my_ts", DataTypes.TimestampType, true, Metadata.empty());
>     schema = new StructType(fields);
>     return schema;
>   }
> 
>   // produces a single vectorized reader factory for the configured timestamp
>   @Override public List<DataReaderFactory<ColumnarBatch>> createBatchDataReaderFactories() {
>     System.out.println("MyDataSourceReader.createDataReaderFactories: ts_millis:  " + timestampToProduce);
>     System.out.println("Using LocalDateTime: " + LocalDateTime.ofEpochSecond(timestampToProduce / 1000, 0, ZoneOffset.UTC));
>     List<DataReaderFactory<ColumnarBatch>> dataReaderFactories = new ArrayList<>();
>     dataReaderFactories.add(new MyVectorizedTSProducerFactory(schema, timestampToProduce));
>     return dataReaderFactories;
>   }
> }
> {code}
> DataReaderFactory & DataReader
> {code:java}
> public class MyVectorizedTSProducerFactory implements DataReaderFactory<ColumnarBatch> {
>   private final StructType schema;
>   private long timestampValueToProduce;
> 
>   public MyVectorizedTSProducerFactory(StructType schema, long timestampValueToProduce) {
>     this.schema = schema;
>     this.timestampValueToProduce = timestampValueToProduce;
>   }
> 
>   @Override public DataReader<ColumnarBatch> createDataReader() {
>     return new MyVectorizedProducerReader(schema, timestampValueToProduce);
>   }
> 
>   public static class MyVectorizedProducerReader implements DataReader<ColumnarBatch> {
>     private final StructType schema;
>     private long timestampValueToProduce;
>     private ColumnarBatch columnarBatch;
>     // return just one batch for now
>     private boolean batchRemaining = true;
> 
>     public MyVectorizedProducerReader(StructType schema, long timestampValueToProduce) {
>       this.schema = schema;
>       this.timestampValueToProduce = timestampValueToProduce;
>     }
> 
>     @Override public boolean next() {
>       return batchRemaining;
>     }
> 
>     @Override public ColumnarBatch get() {
>       batchRemaining = false;
>       OnHeapColumnVector[] onHeapColumnVectors = OnHeapColumnVector.allocateColumns(1, schema);
>       for (OnHeapColumnVector vector : onHeapColumnVectors) {
>         // convert millis to micros
>         vector.putLong(0, timestampValueToProduce * 1000);
>       }
>       columnarBatch = new ColumnarBatch(onHeapColumnVectors);
>       columnarBatch.setNumRows(1);
>       return columnarBatch;
>     }
> 
>     @Override public void close() {
>       if (columnarBatch != null) {
>         columnarBatch.close();
>       }
>     }
>   }
> }
> {code}

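For reference, a self-contained sketch of the calendar mismatch described above (my 
own illustration; the full epoch value -62135596800000 ms, i.e. 0001-01-01T00:00:00 
UTC in the proleptic Gregorian calendar, is an assumption):
{code:scala}
import java.sql.Timestamp
import java.time.{LocalDateTime, ZoneOffset}
import java.util.TimeZone

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))  // so Timestamp.toString is comparable

val millis = -62135596800000L
// java.time uses the proleptic Gregorian calendar:
println(LocalDateTime.ofEpochSecond(millis / 1000, 0, ZoneOffset.UTC))  // 0001-01-01T00:00
// java.sql.Timestamp renders through the hybrid Julian/Gregorian calendar, two days apart here:
println(new Timestamp(millis))                                          // 0001-01-03 00:00:00.0
{code}
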
[jira] [Resolved] (SPARK-30718) Exclude jdk.tools dependency from hadoop-yarn-api

2020-02-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30718.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27445
[https://github.com/apache/spark/pull/27445]

> Exclude jdk.tools dependency from hadoop-yarn-api
> -
>
> Key: SPARK-30718
> URL: https://issues.apache.org/jira/browse/SPARK-30718
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to remove `jdk.tools:jdk.tools` dependency from 
> hadoop-yarn-api.
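
For readers managing their own builds, a hedged sbt-style equivalent of the exclusion 
(the actual change is in Spark's Maven poms; the Hadoop version below is only an example):
{code:scala}
// build.sbt sketch: pull hadoop-yarn-api without the jdk.tools:jdk.tools system dependency
libraryDependencies += ("org.apache.hadoop" % "hadoop-yarn-api" % "3.2.0")
  .exclude("jdk.tools", "jdk.tools")
{code}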



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30718) Exclude jdk.tools dependency from hadoop-yarn-api

2020-02-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30718:
-

Assignee: Dongjoon Hyun

> Exclude jdk.tools dependency from hadoop-yarn-api
> -
>
> Key: SPARK-30718
> URL: https://issues.apache.org/jira/browse/SPARK-30718
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to remove `jdk.tools:jdk.tools` dependency from 
> hadoop-yarn-api.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30721) Turning off WSCG did not take effect in AQE query planning

2020-02-03 Thread Wei Xue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029527#comment-17029527
 ] 

Wei Xue commented on SPARK-30721:
-

cc [~cloud_fan], [~Jk_Self]

> Turning off WSCG did not take effect in AQE query planning
> --
>
> Key: SPARK-30721
> URL: https://issues.apache.org/jira/browse/SPARK-30721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Priority: Major
>
> This is a follow up for 
> [https://github.com/apache/spark/pull/26813#discussion_r373044512].
> We need to fix test DataFrameAggregateSuite with AQE on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30721) Turning off WSCG did not take effect in AQE query planning

2020-02-03 Thread Wei Xue (Jira)
Wei Xue created SPARK-30721:
---

 Summary: Turning off WSCG did not take effect in AQE query planning
 Key: SPARK-30721
 URL: https://issues.apache.org/jira/browse/SPARK-30721
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue


This is a follow up for 
[https://github.com/apache/spark/pull/26813#discussion_r373044512].

We need to fix test DataFrameAggregateSuite with AQE on.
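
For context, a minimal sketch of the combination this is about (standard SQL configs; 
the DataFrame is hypothetical, and whether WSCG is actually disabled has to be checked 
in the re-optimized plan):
{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.codegen.wholeStage", "false")  // expected to hold under AQE re-planning too
df.groupBy("key").count().collect()                      // hypothetical aggregation; inspect the executed plan
{code}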



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30719) AQE should not issue a "not supported" warning for queries being by-passed

2020-02-03 Thread Wei Xue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029525#comment-17029525
 ] 

Wei Xue commented on SPARK-30719:
-

cc [~cloud_fan], [~Jk_Self]

> AQE should not issue a "not supported" warning for queries being by-passed
> --
>
> Key: SPARK-30719
> URL: https://issues.apache.org/jira/browse/SPARK-30719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Priority: Minor
>
> This is a follow up for [https://github.com/apache/spark/pull/26813].
> AQE bypasses queries that don't have exchanges or subqueries. This is not a 
> limitation and it is different from queries that are not supported in AQE. 
> Issuing a warning in this case can be confusing and annoying.
> It would also be good to add an internal conf for this bypassing behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched by systemd with the command
{code:java}
 /bin/bash -ce "exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService"
{code}
Sometimes the Spark executor hits a connection timeout when it tries to connect to 
the external shuffle service. When that happens, the executor logs
{noformat}
ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...{noformat}
If the connection timeout happens 4 more times, the executor fails with
{noformat}
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103..:7337
{noformat}
After this error the Spark application just hangs. On the Mesos UI it goes to the 
inactive frameworks, and on the Spark Driver UI I can see a few failed tasks while 
the application appears to do nothing.
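
Not from the ticket, a hedged sketch of the registration knobs that control how long 
an executor keeps retrying before self-exiting (values are examples only):
{code:scala}
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.registration.timeout", "30000")    // ms per registration attempt
  .set("spark.shuffle.registration.maxAttempts", "10")   // attempts before the executor gives up
{code}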

 

External Shuffle service throws an exception 
{code:java}
ERROR TransportRequestHandler: Error sending result 
RpcResponse{requestId=4941243310586976766, 
body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to 
/10.103.*.*:49482; closing connection{code}
 

Full spark executor log is 

 

ERROR BlockManager: Failed to connect to external shuffle server, will retry 1 
more times after waiting 5 seconds...
 java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
 Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103.**.**:7337
 org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 

[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched by systemd with the command
{code:java}
 /bin/bash -ce "exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService"
{code}
Sometimes the Spark executor hits a connection timeout when it tries to connect to 
the external shuffle service. When that happens, the executor logs
{noformat}
ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...{noformat}
If the connection timeout happens 4 more times, the executor fails with
{noformat}
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103..:7337
{noformat}
After this error the Spark application just hangs. On the Mesos UI it goes to the 
inactive frameworks, and on the Spark Driver UI I can see a few failed tasks while 
the application appears to do nothing.

 

External Shuffle service throws an exception 
{code:java}
ERROR TransportRequestHandler: Error sending result 
RpcResponse{requestId=4941243310586976766, 
body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to 
/10.103.*.*:49482; closing connection{code}
 

Full spark executor log is 

 

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds...
 java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
 Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
 20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Unable to register with external 
shuffle server due to : Failed to connect to our-host.com/10.103.**.**:7337
 org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at 

[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched by systemd with the command
{code:java}
 /bin/bash -ce "exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService"
{code}
Sometimes the Spark executor hits a connection timeout when it tries to connect to 
the external shuffle service. When that happens, the executor logs
{noformat}
ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...{noformat}
If the connection timeout happens 4 more times, the executor fails with
{noformat}
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103..:7337
{noformat}
After this error the Spark application just hangs. On the Mesos UI it goes to the 
inactive frameworks, and on the Spark Driver UI I can see a few failed tasks while 
the application appears to do nothing.

 

External Shuffle service throws an exception 
{code:java}
ERROR TransportRequestHandler: Error sending result 
RpcResponse{requestId=4941243310586976766, 
body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to 
/10.103.*.*:49482; closing connection{code}
 

Full spark executor log is 

 

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds...
 java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
 Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
 20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Unable to register with external 
shuffle server due to : Failed to connect to our-host.com/10.103.**.**:7337
 org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at 

[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched by systemd with the command
{code:java}
 /bin/bash -ce "exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService"
{code}
Sometimes the Spark executor hits a connection timeout when it tries to connect to 
the external shuffle service. When that happens, the executor logs
{noformat}
ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...{noformat}
If the connection timeout happens 4 more times, the executor fails with
{noformat}
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103..:7337
{noformat}
After this error the Spark application just hangs. On the Mesos UI it goes to the 
inactive frameworks, and on the Spark Driver UI I can see a few failed tasks while 
the application appears to do nothing.

 

External Shuffle service throws an exception 
{code:java}
ERROR TransportRequestHandler: Error sending result 
RpcResponse{requestId=4941243310586976766, 
body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to 
/10.103.*.*:49482; closing connection{code}
 

Full spark executor log is 

 

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds...
 java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
 Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
 20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Unable to register with external 
shuffle server due to : Failed to connect to our-host.com/10.103.*.*:7337
 org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at 

[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched by systemd with the command
{code:java}
 /bin/bash -ce "exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService"
{code}
Sometimes the Spark executor hits a connection timeout when it tries to connect to 
the external shuffle service. When that happens, the executor logs
{noformat}
ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...{noformat}
If the connection timeout happens 4 more times, the executor fails with
{noformat}
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103..:7337
{noformat}
After this error the Spark application just hangs. On the Mesos UI it goes to the 
inactive frameworks, and on the Spark Driver UI I can see a few failed tasks while 
the application appears to do nothing.

 

Full spark executor log is 

 

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due 
to : Unable to create executor due to Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at 

[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched by systemd with the command
{code:java}
 /bin/bash -ce "exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService"
{code}
Sometimes the Spark executor hits a connection timeout when it tries to connect to 
the external shuffle service. When that happens, the executor logs
{noformat}
ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...{noformat}
If the connection timeout happens 4 more times, the executor fails with
{noformat}
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103..:7337
{noformat}
After this error the Spark application just hangs. On the Mesos UI it goes to the 
inactive frameworks, and on the Spark Driver UI I can see a few failed tasks while 
the application appears to do nothing.

 

Full spark executor log is 
{noformat}
20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds... java.io.IOException: 
Failed to connect to our-host.com/10.103..:7337 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265) at 
org.apache.spark.executor.Executor.(Executor.scala:118) at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117) at 
org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at 
org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102) at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) Caused by: 
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed 
out: our-host.com/10.103..:7337 at 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more Caused by: java.net.ConnectException: Connection timed out ... 11 
more 20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor 
self-exiting due to : Unable to create executor due to Unable to register with 
external shuffle server due to : Failed to connect to 
our-host.com/10.103..:7337 org.apache.spark.SparkException: Unable to register 
with external shuffle server due to : Failed to connect to 
our-host.com/10.103..:7337 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265) at 
org.apache.spark.executor.Executor.(Executor.scala:118) at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117) at 
org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) 

[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched by systemd with the command
{code:java}
 /bin/bash -ce "exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService"
{code}
Sometimes the Spark executor hits a connection timeout when it tries to connect to 
the external shuffle service. When that happens, the executor logs

 
{noformat}
ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...{noformat}
 

If the connection timeout happens 4 more times, the executor fails with

`ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103.*.*:7337`

 

After this error the Spark application just hangs. On the Mesos UI it goes to the 
inactive frameworks, and on the Spark Driver UI I can see a few failed tasks while 
the application appears to do nothing.

 

Full spark executor log is 

```

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds...
 java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
 Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
 20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Unable to register with external 
shuffle server due to : Failed to connect to our-host.com/10.103.*.*:7337
 org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at 

[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using spark 2.4.3 with mesos and with external shuffle service. External 
shuffle service is launched using systemd by command
{code:java}
exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService
{code}
Sometimes spark executor has connection timeout when it tries to connect to 
external shuffle service. When it happens spark executor throws an exception 

`ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...`

If connection timeout happens 4 more times spark executor throws an error

`ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103.*.*:7337`

 

After this error Spark application just hangs. On Mesos UI it goes to inactive 
frameworks and on Spark Driver UI I can see few failed tasks and looks like it 
does nothing.

 

Full spark executor log is 

```

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds...
 java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
 Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
 20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Unable to register with external 
shuffle server due to : Failed to connect to our-host.com/10.103.*.*:7337
 org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at 

[jira] [Created] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)
Andrei Stankevich created SPARK-30720:
-

 Summary: Spark framework hangs and becomes inactive on Mesos UI if 
executor can not connect to shuffle external service.
 Key: SPARK-30720
 URL: https://issues.apache.org/jira/browse/SPARK-30720
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Andrei Stankevich


We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched by systemd with the command `

exec /*/spark/bin/spark-class 
org.apache.spark.deploy.mesos.MesosExternalShuffleService

` 

Sometimes the Spark executor hits a connection timeout when it tries to connect to 
the external shuffle service. When that happens, the executor logs

`ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...`

If the connection timeout happens 4 more times, the executor fails with

`ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103.*.*:7337`

 

After this error the Spark application just hangs. It moves to the inactive 
frameworks list on the Mesos UI, and on the Spark driver UI I can see a few failed 
tasks; the application appears to do nothing.
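
For reference, the registration retries above are driven by a handful of external 
shuffle service settings; a minimal sketch with the standard Spark 2.4 property names 
(the values shown are illustrative, not our actual configuration):

{code:scala}
import org.apache.spark.SparkConf

// Minimal sketch: the knobs involved when an executor registers with the
// external shuffle service.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")        // executors register with the external service
  .set("spark.shuffle.service.port", "7337")           // the port the executor cannot reach above
  .set("spark.shuffle.registration.timeout", "5000")   // per-attempt registration timeout (ms)
  .set("spark.shuffle.registration.maxAttempts", "5")  // attempts before the executor gives up
{code}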

 

The full Spark executor log is:

```

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due 
to : Unable to create executor due to Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 

[jira] [Updated] (SPARK-30720) Spark framework hangs and becomes inactive on Mesos UI if executor can not connect to shuffle external service.

2020-02-03 Thread Andrei Stankevich (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Stankevich updated SPARK-30720:
--
Description: 
We are using Spark 2.4.3 with Mesos and the external shuffle service. The external 
shuffle service is launched via systemd with the command `exec 
/*/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosExternalShuffleService`.

Sometimes the Spark executor hits a connection timeout while trying to connect to the 
external shuffle service. When that happens, the executor logs the exception

`ERROR BlockManager: Failed to connect to external shuffle server, will retry 4 
more times after waiting 5 seconds...`

If the connection times out 4 more times, the executor fails with the error

`ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to 
create executor due to Unable to register with external shuffle server due to : 
Failed to connect to our-host.com/10.103.*.*:7337`

 

After this error the Spark application just hangs. It moves to the inactive 
frameworks list on the Mesos UI, and on the Spark driver UI I can see a few failed 
tasks; the application appears to do nothing.

 

The full Spark executor log is:

```

20/01/31 16:27:09 ERROR BlockManager: Failed to connect to external shuffle 
server, will retry 1 more times after waiting 5 seconds...
 java.io.IOException: Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
 at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:295)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection timed out: our-host.com/10.103.*.*:7337
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 ... 1 more
 Caused by: java.net.ConnectException: Connection timed out
 ... 11 more
 20/01/31 16:29:25 ERROR CoarseGrainedExecutorBackend: Executor self-exiting 
due to : Unable to create executor due to Unable to register with external 
shuffle server due to : Failed to connect to our-host.com/10.103.*.*:7337
 org.apache.spark.SparkException: Unable to register with external shuffle 
server due to : Failed to connect to our-host.com/10.103.*.*:7337
 at 
org.apache.spark.storage.BlockManager.$anonfun$registerWithExternalShuffleServer$3(BlockManager.scala:304)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
 at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
 at org.apache.spark.executor.Executor.(Executor.scala:118)
 at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at 

[jira] [Created] (SPARK-30719) AQE should not issue a "not supported" warning for queries being by-passed

2020-02-03 Thread Wei Xue (Jira)
Wei Xue created SPARK-30719:
---

 Summary: AQE should not issue a "not supported" warning for 
queries being by-passed
 Key: SPARK-30719
 URL: https://issues.apache.org/jira/browse/SPARK-30719
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue


This is a follow up for [https://github.com/apache/spark/pull/26813].

AQE bypasses queries that have no exchanges or subqueries. That is not a limitation, 
and it is different from queries that are not supported by AQE, so issuing a "not 
supported" warning in this case is confusing and annoying.

It would also be good to add an internal conf for this bypassing behavior.
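
A minimal sketch of the intended behaviour (illustrative only; the helper names below 
are assumptions, not actual Spark internals):

{code:scala}
// Sketch: only warn for genuinely unsupported plans; stay silent when AQE is
// simply bypassed because the plan has no exchanges or subqueries.
def aqeWarning(hasExchangeOrSubquery: Boolean,
               supported: Boolean,
               bypassEnabled: Boolean = true): Option[String] = {
  if (bypassEnabled && !hasExchangeOrSubquery) None   // bypassed, not "unsupported": no warning
  else if (!supported) Some("Adaptive Query Execution is not supported for this query")
  else None
}

// A plan without exchanges/subqueries produces no warning:
assert(aqeWarning(hasExchangeOrSubquery = false, supported = true).isEmpty)
{code}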



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30718) Exclude jdk.tools dependency from hadoop-yarn-api

2020-02-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30718:
--
Target Version/s: 3.0.0

> Exclude jdk.tools dependency from hadoop-yarn-api
> -
>
> Key: SPARK-30718
> URL: https://issues.apache.org/jira/browse/SPARK-30718
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to remove `jdk.tools:jdk.tools` dependency from 
> hadoop-yarn-api.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30718) Exclude jdk.tools dependency from hadoop-yarn-api

2020-02-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30718:
--
Parent: SPARK-29194
Issue Type: Sub-task  (was: Bug)

> Exclude jdk.tools dependency from hadoop-yarn-api
> -
>
> Key: SPARK-30718
> URL: https://issues.apache.org/jira/browse/SPARK-30718
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to remove `jdk.tools:jdk.tools` dependency from 
> hadoop-yarn-api.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30718) Exclude jdk.tools dependency from hadoop-yarn-api

2020-02-03 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30718:
-

 Summary: Exclude jdk.tools dependency from hadoop-yarn-api
 Key: SPARK-30718
 URL: https://issues.apache.org/jira/browse/SPARK-30718
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue aims to remove the `jdk.tools:jdk.tools` dependency from hadoop-yarn-api.
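
Spark's build is Maven-based, so the following is only an sbt-style sketch of the 
kind of exclusion being proposed (the artifact version is illustrative):

{code:scala}
// Exclude the transitive jdk.tools artifact pulled in via hadoop-yarn-api.
libraryDependencies += ("org.apache.hadoop" % "hadoop-yarn-api" % "3.2.0")
  .exclude("jdk.tools", "jdk.tools")
{code}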



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30717) AQE subquery map should cache `SubqueryExec` instead of `ExecSubqueryExpression`

2020-02-03 Thread Wei Xue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Xue updated SPARK-30717:

Summary: AQE subquery map should cache `SubqueryExec` instead of 
`ExecSubqueryExpression`  (was: AQE subquery map should cache 
`BaseSubqueryExec` instead of `ExecSubqueryExpression`)

> AQE subquery map should cache `SubqueryExec` instead of 
> `ExecSubqueryExpression`
> 
>
> Key: SPARK-30717
> URL: https://issues.apache.org/jira/browse/SPARK-30717
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30717) AQE subquery map should cache `BaseSubqueryExec` instead of `ExecSubqueryExpression`

2020-02-03 Thread Wei Xue (Jira)
Wei Xue created SPARK-30717:
---

 Summary: AQE subquery map should cache `BaseSubqueryExec` instead 
of `ExecSubqueryExpression`
 Key: SPARK-30717
 URL: https://issues.apache.org/jira/browse/SPARK-30717
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-03 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029139#comment-17029139
 ] 

Dongjoon Hyun commented on SPARK-30711:
---

Thank you for reporting, [~schreiber]. Could you check the behavior on older 
versions (e.g. 2.4.3 or 2.3.4), too?
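
As a side note, a commonly suggested mitigation while a generated `processNext()` 
method exceeds the limit is to disable whole-stage code generation for the affected 
job; a minimal sketch, assuming a `SparkSession` named `spark`:

{code:scala}
// Fall back to the non-whole-stage execution path, which avoids generating
// one huge processNext() method (at some performance cost).
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// Splitting a very wide/deep transformation into stages (e.g. writing out an
// intermediate result) can also keep each generated method under the 64KB limit.
{code}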

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 

[jira] [Updated] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30711:
--
Summary: 64KB JVM bytecode limit - janino.InternalCompilerException  (was: 
64KB JBM bytecode limit - janino.InternalCompilerException)

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 

[jira] [Commented] (SPARK-30715) Upgrade fabric8 to 4.7.1 to support K8s 1.17

2020-02-03 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029136#comment-17029136
 ] 

Dongjoon Hyun commented on SPARK-30715:
---

For now, I converted this to an `Improvement` JIRA for supporting `K8s 1.17`, 
according to the release notes:
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0

> Upgrade fabric8 to 4.7.1 to support K8s 1.17
> 
>
> Key: SPARK-30715
> URL: https://issues.apache.org/jira/browse/SPARK-30715
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Onur Satici
>Priority: Major
>
> Fabric8 kubernetes-client introduced a regression in building Quantity values 
> in 4.7.0.
> More info: [https://github.com/fabric8io/kubernetes-client/issues/1953]
> As part of this upgrade, creation of quantity objects should be changed in 
> order to keep correctly parsing quantities with units.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30715) Upgrade fabric8 to 4.7.1 to support K8s 1.17

2020-02-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30715:
--
Summary: Upgrade fabric8 to 4.7.1 to support K8s 1.17  (was: Upgrade 
fabric8 to 4.7.1)

> Upgrade fabric8 to 4.7.1 to support K8s 1.17
> 
>
> Key: SPARK-30715
> URL: https://issues.apache.org/jira/browse/SPARK-30715
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Onur Satici
>Priority: Major
>
> Fabric8 kubernetes-client introduced a regression in building Quantity values 
> in 4.7.0.
> More info: [https://github.com/fabric8io/kubernetes-client/issues/1953]
> As part of this upgrade, creation of quantity objects should be changed in 
> order to keep correctly parsing quantities with units.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30715) Upgrade fabric8 to 4.7.1

2020-02-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30715:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Upgrade fabric8 to 4.7.1
> 
>
> Key: SPARK-30715
> URL: https://issues.apache.org/jira/browse/SPARK-30715
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Onur Satici
>Priority: Major
>
> Fabric8 kubernetes-client introduced a regression in building Quantity values 
> in 4.7.0.
> More info: [https://github.com/fabric8io/kubernetes-client/issues/1953]
> As part of this upgrade, creation of quantity objects should be changed in 
> order to keep correctly parsing quantities with units.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30715) Upgrade fabric8 to 4.7.1

2020-02-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30715:
--
Issue Type: Improvement  (was: Dependency upgrade)

> Upgrade fabric8 to 4.7.1
> 
>
> Key: SPARK-30715
> URL: https://issues.apache.org/jira/browse/SPARK-30715
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Onur Satici
>Priority: Major
>
> Fabric8 kubernetes-client introduced a regression in building Quantity values 
> in 4.7.0.
> More info: [https://github.com/fabric8io/kubernetes-client/issues/1953]
> As part of this upgrade, creation of quantity objects should be changed in 
> order to keep correctly parsing quantities with units.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30715) Upgrade fabric8 to 4.7.1

2020-02-03 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029132#comment-17029132
 ] 

Dongjoon Hyun commented on SPARK-30715:
---

Hi, [~onursatici]. We still use `4.6.4`, so the bug in 4.7.0 cannot be a bug in 
Apache Spark.

> Upgrade fabric8 to 4.7.1
> 
>
> Key: SPARK-30715
> URL: https://issues.apache.org/jira/browse/SPARK-30715
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Onur Satici
>Priority: Major
>
> Fabric8 kubernetes-client introduced a regression in building Quantity values 
> in 4.7.0.
> More info: [https://github.com/fabric8io/kubernetes-client/issues/1953]
> As part of this upgrade, creation of quantity objects should be changed in 
> order to keep correctly parsing quantities with units.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30525) HiveTableScanExec do not need to prune partitions again after pushing down to hive metastore

2020-02-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30525:

Fix Version/s: (was: 3.0.0)
   3.1.0

> HiveTableScanExec do not need to prune partitions again after pushing down to 
> hive metastore
> 
>
> Key: SPARK-30525
> URL: https://issues.apache.org/jira/browse/SPARK-30525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Assignee: Hu Fuwang
>Priority: Major
> Fix For: 3.1.0
>
>
> In HiveTableScanExec, it will push down to hive metastore for partition 
> pruning if _spark.sql.hive.metastorePartitionPruning_ is true, and then it 
> will prune the returned partitions again using partition filters, because 
> some predicates, eg. "b like 'xyz'", are not supported in hive metastore. But 
> now this problem is already fixed in 
> HiveExternalCatalog.listPartitionsByFilter, the 
> HiveExternalCatalog.listPartitionsByFilter can return exactly what we want 
> now. So it is not necessary any more to double prune in HiveTableScanExec.
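
A small sketch of the setting and the kind of predicate being discussed (table and 
column names are made up):

{code:scala}
// With metastore-side pruning enabled, a LIKE predicate on a partition column used
// to be pruned a second time on the Spark side; per this issue that second pass is
// now redundant.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
spark.sql("SELECT * FROM logs WHERE b LIKE 'xyz'")   // b is a partition column (hypothetical)
{code}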



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30525) HiveTableScanExec do not need to prune partitions again after pushing down to hive metastore

2020-02-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30525.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27232
[https://github.com/apache/spark/pull/27232]

> HiveTableScanExec do not need to prune partitions again after pushing down to 
> hive metastore
> 
>
> Key: SPARK-30525
> URL: https://issues.apache.org/jira/browse/SPARK-30525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Assignee: Hu Fuwang
>Priority: Major
> Fix For: 3.0.0
>
>
> In HiveTableScanExec, it will push down to hive metastore for partition 
> pruning if _spark.sql.hive.metastorePartitionPruning_ is true, and then it 
> will prune the returned partitions again using partition filters, because 
> some predicates, eg. "b like 'xyz'", are not supported in hive metastore. But 
> now this problem is already fixed in 
> HiveExternalCatalog.listPartitionsByFilter, the 
> HiveExternalCatalog.listPartitionsByFilter can return exactly what we want 
> now. So it is not necessary any more to double prune in HiveTableScanExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30525) HiveTableScanExec do not need to prune partitions again after pushing down to hive metastore

2020-02-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30525:
---

Assignee: Hu Fuwang

> HiveTableScanExec do not need to prune partitions again after pushing down to 
> hive metastore
> 
>
> Key: SPARK-30525
> URL: https://issues.apache.org/jira/browse/SPARK-30525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Assignee: Hu Fuwang
>Priority: Major
>
> In HiveTableScanExec, it will push down to hive metastore for partition 
> pruning if _spark.sql.hive.metastorePartitionPruning_ is true, and then it 
> will prune the returned partitions again using partition filters, because 
> some predicates, eg. "b like 'xyz'", are not supported in hive metastore. But 
> now this problem is already fixed in 
> HiveExternalCatalog.listPartitionsByFilter, the 
> HiveExternalCatalog.listPartitionsByFilter can return exactly what we want 
> now. So it is not necessary any more to double prune in HiveTableScanExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30716) Change `SkewedPartitionReaderExec` into `UnaryExecNode` and replace the direct link with dependent stages with a reuse link

2020-02-03 Thread Wei Xue (Jira)
Wei Xue created SPARK-30716:
---

 Summary: Change `SkewedPartitionReaderExec` into `UnaryExecNode` 
and replace the direct link with dependent stages with a reuse link
 Key: SPARK-30716
 URL: https://issues.apache.org/jira/browse/SPARK-30716
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue


Otherwise it breaks two assumptions in AQE:
 # The only aqe leaf nodes are `AdaptiveSparkPlanExec`, `QueryStageExec`.
 # The plan is strictly a tree.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time

2020-02-03 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029068#comment-17029068
 ] 

Wenchen Fan commented on SPARK-30614:
-

Yea we can deduce the column type so nothing will be broken.
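
For illustration (hypothetical table and column names), the restriction means 
changing one property per statement:

{code:scala}
// One property per ALTER COLUMN statement, per the SQL standard:
spark.sql("ALTER TABLE t ALTER COLUMN c TYPE bigint")
spark.sql("ALTER TABLE t ALTER COLUMN c COMMENT 'updated comment'")
// rather than changing the type, comment and position in a single statement.
{code}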

> The native ALTER COLUMN syntax should change one thing at a time
> 
>
> Key: SPARK-30614
> URL: https://issues.apache.org/jira/browse/SPARK-30614
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the 
> SQL standard.
> {code}
> ALTER TABLE table=multipartIdentifier
>   (ALTER | CHANGE) COLUMN? column=multipartIdentifier
>   (TYPE dataType)?
>   (COMMENT comment=STRING)?
>   colPosition?   
> {code}
> The SQL standard (section 11.12) only allows changing one property at a time. 
> This is also true on other recent SQL systems like 
> snowflake(https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html)
>  and 
> redshift(https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html)
> The snowflake has an extension that it allows changing multiple columns at a 
> time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the 
> SQL standard, I think this syntax is better. 
> For now, let's be conservative and only allow changing one property at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30715) Upgrade fabric8 to 4.7.1

2020-02-03 Thread Onur Satici (Jira)
Onur Satici created SPARK-30715:
---

 Summary: Upgrade fabric8 to 4.7.1
 Key: SPARK-30715
 URL: https://issues.apache.org/jira/browse/SPARK-30715
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Onur Satici


Fabric8 kubernetes-client introduced a regression in building Quantity values 
in 4.7.0.

More info: [https://github.com/fabric8io/kubernetes-client/issues/1953]

As part of this upgrade, the creation of Quantity objects should be changed so that 
quantities with units keep being parsed correctly.
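
A sketch of the change being described, based on the fabric8 `Quantity` constructors 
(the exact call sites in Spark's Kubernetes module may differ):

{code:scala}
import io.fabric8.kubernetes.api.model.Quantity

// Single-string form: affected by the 4.7.0 regression when the value carries a unit.
val memoryOld = new Quantity("1500m")

// Alternative: pass the numeric amount and the unit (format) separately, so
// quantities with units keep round-tripping correctly.
val memoryNew = new Quantity("1500", "m")
{code}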

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30714) DSV2: Vectorized datasource does not have handling for ProlepticCalendar

2020-02-03 Thread Shubham Chaurasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028975#comment-17028975
 ] 

Shubham Chaurasia commented on SPARK-30714:
---

Oh looks like https://issues.apache.org/jira/browse/SPARK-26651 handles this, 
will that be applicable for spark 2.3.x versions as well ?

> DSV2: Vectorized datasource does not have handling for ProlepticCalendar
> 
>
> Key: SPARK-30714
> URL: https://issues.apache.org/jira/browse/SPARK-30714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Shubham Chaurasia
>Priority: Major
>
> Consider the following scenarios - 
> 1)
> {code:scala}
> scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
> "1580736255261").load.show(false)
> MyDataSourceReader.createDataReaderFactories: ts_millis:  1580736255261
> Using LocalDateTime: 2020-02-03T13:24:15
> +---+
> |my_ts  |
> +---+
> |2020-02-03 13:24:15.261|
> +---+
> {code}
> In above output, we can see that both the timestamps (the one logged and the 
> one in dataframe) are equal(upto seconds)
> 2)
> {code:scala}
> scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
> "-6213559680").load.show(false)
> MyDataSourceReader.createDataReaderFactories: ts_millis:  -6213559680
> Using LocalDateTime: 0001-01-01T00:00
> +---+
> |my_ts  |
> +---+
> |0001-01-03 00:00:00|
> +---+
> {code}
> Here in 2), we can see that timestamp coming from DataReader is not converted 
> properly according to proleptic calendar and hence the two timestamps are 
> different. 
> Code to Repro
> DataSourceReader 
> {code:java}
> public class MyDataSourceReader implements DataSourceReader, 
> SupportsScanColumnarBatch {
>   private StructType schema;
>   private long timestampToProduce = -6213559680L;
>   public MyDataSourceReader(Map options) {
> initOptions(options);
>   }
>   private void initOptions(Map options) {
> String ts = options.get("ts_millis");
> if (ts != null) {
>   timestampToProduce = Long.parseLong(ts);
> }
>   }
>   @Override public StructType readSchema() {
> StructField[] fields = new StructField[1];
> fields[0] = new StructField("my_ts", DataTypes.TimestampType, true, 
> Metadata.empty());
> schema = new StructType(fields);
> return schema;
>   }
>   @Override public List> 
> createBatchDataReaderFactories() {
> System.out.println("MyDataSourceReader.createDataReaderFactories: 
> ts_millis:  " + timestampToProduce);
> System.out.println("Using LocalDateTime: " +  
> LocalDateTime.ofEpochSecond(timestampToProduce/1000, 0, ZoneOffset.UTC));
> List> dataReaderFactories = new 
> ArrayList<>();
> dataReaderFactories.add(new MyVectorizedTSProducerFactory(schema, 
> timestampToProduce));
> return dataReaderFactories;
>   }
> {code}
> DataReaderFactory & DataReader
> {code:java}
> public class MyVectorizedTSProducerFactory implements 
> DataReaderFactory {
>   private final StructType schema;
>   private long timestampValueToProduce;
>   public MyVectorizedTSProducerFactory(StructType schema, long 
> timestampValueToProduce) {
> this.schema = schema;
> this.timestampValueToProduce = timestampValueToProduce;
>   }
>   @Override public DataReader createDataReader() {
> return new MyVectorizedProducerReader(schema, timestampValueToProduce);
>   }
>   public static class MyVectorizedProducerReader implements 
> DataReader {
> private final StructType schema;
> private long timestampValueToProduce;
> private ColumnarBatch columnarBatch;
> // return just one batch for now
> private boolean batchRemaining = true;
> public MyVectorizedProducerReader(StructType schema, long 
> timestampValueToProduce) {
>   this.schema = schema;
>   this.timestampValueToProduce = timestampValueToProduce;
> }
> @Override public boolean next() {
>   return batchRemaining;
> }
> @Override public ColumnarBatch get() {
>   batchRemaining = false;
>   OnHeapColumnVector[] onHeapColumnVectors = 
> OnHeapColumnVector.allocateColumns(1, schema);
>   for (OnHeapColumnVector vector : onHeapColumnVectors) {
> // convert millis to micros
> vector.putLong(0, timestampValueToProduce * 1000);
>   }
>   columnarBatch = new ColumnarBatch(onHeapColumnVectors);
>   columnarBatch.setNumRows(1);
>   return columnarBatch;
> }
> @Override public void close() {
>   if (columnarBatch != null) {
> columnarBatch.close();
>   }
> }
>   }
> }
> {code}
> Any workarounds/solutions for this? 



--
This message was sent by 

[jira] [Updated] (SPARK-30714) DSV2: Vectorized datasource does not have handling for ProlepticCalendar

2020-02-03 Thread Shubham Chaurasia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Chaurasia updated SPARK-30714:
--
Description: 
Consider the following scenarios - 
1)
{code:scala}
scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
"1580736255261").load.show(false)
MyDataSourceReader.createDataReaderFactories: ts_millis:  1580736255261
Using LocalDateTime: 2020-02-03T13:24:15
+---+
|my_ts  |
+---+
|2020-02-03 13:24:15.261|
+---+
{code}

In the above output, we can see that the two timestamps (the one logged and the one 
in the DataFrame) are equal (up to seconds).

2)
{code:scala}
scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
"-6213559680").load.show(false)
MyDataSourceReader.createDataReaderFactories: ts_millis:  -6213559680
Using LocalDateTime: 0001-01-01T00:00
+---+
|my_ts  |
+---+
|0001-01-03 00:00:00|
+---+
{code}

Here in 2), we can see that the timestamp coming from the DataReader is not converted 
properly according to the proleptic calendar, and hence the two timestamps differ.

Code to Repro
DataSourceReader 
{code:java}
public class MyDataSourceReader implements DataSourceReader, 
SupportsScanColumnarBatch {

  private StructType schema;

  private long timestampToProduce = -6213559680L;

  public MyDataSourceReader(Map<String, String> options) {
initOptions(options);
  }

  private void initOptions(Map<String, String> options) {
String ts = options.get("ts_millis");
if (ts != null) {
  timestampToProduce = Long.parseLong(ts);
}
  }

  @Override public StructType readSchema() {
StructField[] fields = new StructField[1];
fields[0] = new StructField("my_ts", DataTypes.TimestampType, true, 
Metadata.empty());
schema = new StructType(fields);
return schema;
  }

  @Override public List<DataReaderFactory<ColumnarBatch>> createBatchDataReaderFactories() {
System.out.println("MyDataSourceReader.createDataReaderFactories: 
ts_millis:  " + timestampToProduce);
System.out.println("Using LocalDateTime: " +  
LocalDateTime.ofEpochSecond(timestampToProduce/1000, 0, ZoneOffset.UTC));
    List<DataReaderFactory<ColumnarBatch>> dataReaderFactories = new ArrayList<>();
dataReaderFactories.add(new MyVectorizedTSProducerFactory(schema, 
timestampToProduce));
return dataReaderFactories;
  }
{code}

DataReaderFactory & DataReader
{code:java}
public class MyVectorizedTSProducerFactory implements DataReaderFactory<ColumnarBatch> {

  private final StructType schema;
  private long timestampValueToProduce;

  public MyVectorizedTSProducerFactory(StructType schema, long 
timestampValueToProduce) {
this.schema = schema;
this.timestampValueToProduce = timestampValueToProduce;
  }

  @Override public DataReader<ColumnarBatch> createDataReader() {
return new MyVectorizedProducerReader(schema, timestampValueToProduce);
  }

  public static class MyVectorizedProducerReader implements DataReader<ColumnarBatch> {

private final StructType schema;
private long timestampValueToProduce;

private ColumnarBatch columnarBatch;

// return just one batch for now
private boolean batchRemaining = true;

public MyVectorizedProducerReader(StructType schema, long 
timestampValueToProduce) {
  this.schema = schema;
  this.timestampValueToProduce = timestampValueToProduce;
}

@Override public boolean next() {
  return batchRemaining;
}

@Override public ColumnarBatch get() {
  batchRemaining = false;
  OnHeapColumnVector[] onHeapColumnVectors = 
OnHeapColumnVector.allocateColumns(1, schema);
  for (OnHeapColumnVector vector : onHeapColumnVectors) {
// convert millis to micros
vector.putLong(0, timestampValueToProduce * 1000);
  }

  columnarBatch = new ColumnarBatch(onHeapColumnVectors);
  columnarBatch.setNumRows(1);
  return columnarBatch;
}

@Override public void close() {
  if (columnarBatch != null) {
columnarBatch.close();
  }
}
  }
}
{code}

Any workarounds/solutions for this? 



  was:
Consider the following scenarios - 
1)
{code:scala}
scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
"1580736255261").load.show(false)
MyDataSourceReader.createDataReaderFactories: ts_millis:  1580736255261
Using LocalDateTime: 2020-02-03T13:24:15
+---+
|my_ts  |
+---+
|2020-02-03 13:24:15.261|
+---+
{code}

In above output, we can see that both the timestamps (the one logged and the 
one in dataframe).

2)
{code:scala}
scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
"-6213559680").load.show(false)
MyDataSourceReader.createDataReaderFactories: ts_millis:  -6213559680
Using LocalDateTime: 0001-01-01T00:00
+---+
|my_ts  |
+---+
|0001-01-03 00:00:00|

[jira] [Created] (SPARK-30714) DSV2: Vectorized datasource does not have handling for ProlepticCalendar

2020-02-03 Thread Shubham Chaurasia (Jira)
Shubham Chaurasia created SPARK-30714:
-

 Summary: DSV2: Vectorized datasource does not have handling for 
ProlepticCalendar
 Key: SPARK-30714
 URL: https://issues.apache.org/jira/browse/SPARK-30714
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: Shubham Chaurasia


Consider the following scenarios - 
1)
{code:scala}
scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
"1580736255261").load.show(false)
MyDataSourceReader.createDataReaderFactories: ts_millis:  1580736255261
Using LocalDateTime: 2020-02-03T13:24:15
+---+
|my_ts  |
+---+
|2020-02-03 13:24:15.261|
+---+
{code}

In the above output, we can see that the two timestamps (the one logged and the one 
in the DataFrame) are equal (up to seconds).

2)
{code:scala}
scala> spark.read.format("com.shubham.MyDataSource").option("ts_millis", 
"-6213559680").load.show(false)
MyDataSourceReader.createDataReaderFactories: ts_millis:  -6213559680
Using LocalDateTime: 0001-01-01T00:00
+---+
|my_ts  |
+---+
|0001-01-03 00:00:00|
+---+
{code}

Here in 2), we can see that the timestamp coming from the DataReader is not converted 
properly according to the proleptic calendar, and hence the two timestamps differ.

Code to Repro
DataSourceReader 
{code:java}
public class MyDataSourceReader implements DataSourceReader, 
SupportsScanColumnarBatch {

  private StructType schema;

  private long timestampToProduce = -6213559680L;

  public MyDataSourceReader(Map<String, String> options) {
initOptions(options);
  }

  private void initOptions(Map<String, String> options) {
String ts = options.get("ts_millis");
if (ts != null) {
  timestampToProduce = Long.parseLong(ts);
}
  }

  @Override public StructType readSchema() {
StructField[] fields = new StructField[1];
fields[0] = new StructField("my_ts", DataTypes.TimestampType, true, 
Metadata.empty());
schema = new StructType(fields);
return schema;
  }

  @Override public List<DataReaderFactory<ColumnarBatch>> createBatchDataReaderFactories() {
System.out.println("MyDataSourceReader.createDataReaderFactories: 
ts_millis:  " + timestampToProduce);
System.out.println("Using LocalDateTime: " +  
LocalDateTime.ofEpochSecond(timestampToProduce/1000, 0, ZoneOffset.UTC));
    List<DataReaderFactory<ColumnarBatch>> dataReaderFactories = new ArrayList<>();
dataReaderFactories.add(new MyVectorizedTSProducerFactory(schema, 
timestampToProduce));
return dataReaderFactories;
  }
{code}

DataReaderFactory & DataReader
{code:java}
public class MyVectorizedTSProducerFactory implements DataReaderFactory<ColumnarBatch> {

  private final StructType schema;
  private long timestampValueToProduce;

  public MyVectorizedTSProducerFactory(StructType schema, long 
timestampValueToProduce) {
this.schema = schema;
this.timestampValueToProduce = timestampValueToProduce;
  }

  @Override public DataReader<ColumnarBatch> createDataReader() {
return new MyVectorizedProducerReader(schema, timestampValueToProduce);
  }

  public static class MyVectorizedProducerReader implements DataReader<ColumnarBatch> {

private final StructType schema;
private long timestampValueToProduce;

private ColumnarBatch columnarBatch;

// return just one batch for now
private boolean batchRemaining = true;

public MyVectorizedProducerReader(StructType schema, long 
timestampValueToProduce) {
  this.schema = schema;
  this.timestampValueToProduce = timestampValueToProduce;
}

@Override public boolean next() {
  return batchRemaining;
}

@Override public ColumnarBatch get() {
  batchRemaining = false;
  OnHeapColumnVector[] onHeapColumnVectors = 
OnHeapColumnVector.allocateColumns(1, schema);
  for (OnHeapColumnVector vector : onHeapColumnVectors) {
// convert millis to micros
vector.putLong(0, timestampValueToProduce * 1000);
  }

  columnarBatch = new ColumnarBatch(onHeapColumnVectors);
  columnarBatch.setNumRows(1);
  return columnarBatch;
}

@Override public void close() {
  if (columnarBatch != null) {
columnarBatch.close();
  }
}
  }
}
{code}

Any workarounds/solutions for this? 
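
As a side note, the two-day shift seen above is the gap between the proleptic 
Gregorian calendar used by java.time and the hybrid Julian/Gregorian calendar used by 
the legacy java.util/java.sql classes; a small, self-contained illustration (the 
epoch value is illustrative):

{code:scala}
import java.time.{LocalDateTime, ZoneOffset}
import java.util.{Calendar, Date, GregorianCalendar, TimeZone}

// Epoch seconds of 0001-01-01T00:00:00 in the proleptic Gregorian calendar.
val epochSecond = LocalDateTime.of(1, 1, 1, 0, 0).toEpochSecond(ZoneOffset.UTC)

// java.time interprets the instant with the proleptic Gregorian calendar:
println(LocalDateTime.ofEpochSecond(epochSecond, 0, ZoneOffset.UTC))   // 0001-01-01T00:00

// The legacy classes use the hybrid Julian/Gregorian calendar, so the very same
// instant renders two days later for dates this far before the 1582 cutover:
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.setTime(new Date(epochSecond * 1000L))
printf("%04d-%02d-%02d%n", cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1,
  cal.get(Calendar.DAY_OF_MONTH))                                      // 0001-01-03
{code}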





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30713) Respect mapOutputSize in memory in adaptive execution

2020-02-03 Thread liupengcheng (Jira)
liupengcheng created SPARK-30713:


 Summary: Respect mapOutputSize in memory in adaptive execution
 Key: SPARK-30713
 URL: https://issues.apache.org/jira/browse/SPARK-30713
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: liupengcheng


Currently, Spark adaptive execution uses the MapOutputStatistics information to 
adjust the plan dynamically, but this map output size does not account for the 
compression factor. So there are cases where the original SparkPlan is a 
`SortMergeJoin` but the plan after adaptive adjustment is changed to a 
`BroadcastHashJoin`, and this `BroadcastHashJoin` may cause OOMs due to the 
inaccurate estimate.

Also, if the shuffle implementation is local shuffle (the Intel adaptive execution 
implementation), then in some cases it will cause a `Too large frame` exception.

So I propose to respect the compression factor in adaptive execution, or to use the 
`dataSize` metric of `ShuffleExchangeExec` instead.
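
A rough sketch of the proposed estimation (the names and the factor below are 
illustrative, not Spark internals):

{code:scala}
// Illustrative only: scale the compressed map-output size by an assumed
// compression factor before comparing against the broadcast threshold.
def estimatedInMemorySize(mapOutputBytes: Long, compressionFactor: Double): Long =
  math.ceil(mapOutputBytes * compressionFactor).toLong

def safeToBroadcast(mapOutputBytes: Long,
                    autoBroadcastJoinThreshold: Long,
                    compressionFactor: Double = 3.0): Boolean =
  estimatedInMemorySize(mapOutputBytes, compressionFactor) <= autoBroadcastJoinThreshold

// e.g. 8 MB of compressed shuffle output may decompress well past a 10 MB threshold:
safeToBroadcast(8L * 1024 * 1024, 10L * 1024 * 1024)   // false with the assumed factor of 3
{code}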



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30711) 64KB JBM bytecode limit - janino.InternalCompilerException

2020-02-03 Thread Frederik Schreiber (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederik Schreiber updated SPARK-30711:
---
Environment: 
Windows 10

Spark 2.4.4

scalaVersion 2.11.12

JVM Oracle 1.8.0_221-b11

  was:
Windows 10

Spark 2.4.4

scalaVersion 2.11.12


> 64KB JBM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 

[jira] [Created] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-03 Thread liupengcheng (Jira)
liupengcheng created SPARK-30712:


 Summary: Estimate sizeInBytes from file metadata for parquet files
 Key: SPARK-30712
 URL: https://issues.apache.org/jira/browse/SPARK-30712
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: liupengcheng


Currently, Spark uses a compressionFactor when calculating `sizeInBytes` for 
`HadoopFsRelation`, but this is not accurate, and it is hard to choose the best 
`compressionFactor`. Sometimes this can cause OOMs due to an improper 
BroadcastHashJoin.

So I propose to use the rowCount in the BlockMetadata to estimate the size in 
memory, which can be more accurate.
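
Until such an estimate exists, the only knob available today is the global compression 
factor itself. A minimal sketch of tuning it per session, assuming a plain spark-shell; 
the factor value and the table name are illustrative assumptions only:
{code:scala}
// Hedged sketch: spark.sql.sources.fileCompressionFactor scales the on-disk file size
// when Spark estimates sizeInBytes for a HadoopFsRelation. Raising it makes an
// accidental BroadcastHashJoin over heavily compressed parquet data less likely.
// The value 4.0 and the table name are assumptions, not recommendations.
spark.conf.set("spark.sql.sources.fileCompressionFactor", "4.0")
spark.sql("EXPLAIN COST SELECT * FROM some_parquet_table").show(false)
{code}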



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2020-02-03 Thread Frederik Schreiber (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028902#comment-17028902
 ] 

Frederik Schreiber commented on SPARK-22510:


I have extracted my code into an example and pasted it here: 
https://issues.apache.org/jira/browse/SPARK-30711

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: bulk-closed, releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2020-02-03 Thread Frederik Schreiber (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028902#comment-17028902
 ] 

Frederik Schreiber edited comment on SPARK-22510 at 2/3/20 12:35 PM:
-

I have extracted my code into an example and pasted it into SPARK-30711.


was (Author: schreiber):
extract my code to an example and paste it into: 
https://issues.apache.org/jira/browse/SPARK-30711

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: bulk-closed, releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-03 Thread Frederik Schreiber (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederik Schreiber updated SPARK-30711:
---
Description: 
Exception
{code:java}
ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
 grows beyond 64 KBERROR CodeGenerator: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
 grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
"GeneratedClass": Code of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
 grows beyond 64 KB at 
org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
 at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
 at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
 at 
org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
 at 
org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) 
at 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
 at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) at 
org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:2783) at 
de.sparkbug.janino.SparkJaninoBug$$anonfun$1.apply(SparkJaninoBug.scala:105) at 
de.sparkbug.janino.SparkJaninoBug$$anonfun$1.apply(SparkJaninoBug.scala:12) at 
org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at 
org.scalatest.Transformer.apply(Transformer.scala:22) at 
org.scalatest.Transformer.apply(Transformer.scala:20) at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at 
org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) at 
org.scalatest.FunSuite.withFixture(FunSuite.scala:1560) at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) at 

[jira] [Created] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-03 Thread Frederik Schreiber (Jira)
Frederik Schreiber created SPARK-30711:
--

 Summary: 64KB JVM bytecode limit - janino.InternalCompilerException
 Key: SPARK-30711
 URL: https://issues.apache.org/jira/browse/SPARK-30711
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
 Environment: Windows 10

Spark 2.4.4

scalaVersion 2.11.12
Reporter: Frederik Schreiber


{code:java}
ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
 grows beyond 64 KBERROR CodeGenerator: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
 grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
"GeneratedClass": Code of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
 grows beyond 64 KB at 
org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
 at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
 at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
 at 
org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
 at 
org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) 
at 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
 at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) at 
org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:2783) at 
de.sparkbug.janino.SparkJaninoBug$$anonfun$1.apply(SparkJaninoBug.scala:105) at 
de.sparkbug.janino.SparkJaninoBug$$anonfun$1.apply(SparkJaninoBug.scala:12) at 
org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at 
org.scalatest.Transformer.apply(Transformer.scala:22) at 
org.scalatest.Transformer.apply(Transformer.scala:20) at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at 

[jira] [Created] (SPARK-30710) SPARK 2.4.4 - DROP TABLE and drop HDFS

2020-02-03 Thread Nguyen Nhanduc (Jira)
Nguyen Nhanduc created SPARK-30710:
--

 Summary: SPARK 2.4.4 - DROP TABLE and drop HDFS
 Key: SPARK-30710
 URL: https://issues.apache.org/jira/browse/SPARK-30710
 Project: Spark
  Issue Type: Question
  Components: Deploy
Affects Versions: 2.4.4
Reporter: Nguyen Nhanduc
 Fix For: 2.4.4


Hi all,

I need to DROP a Hive table and clear its data in HDFS, but I can't convert it from 
External to Managed (I have to create external tables for my business).

On Spark 2.0.2 I could run ALTER TABLE  SET TBLPROPERTIES ('EXTERNAL' = 'FALSE'); 
and then DROP TABLE ; metadata and data were deleted together.

On Spark 2.4.4 this no longer works. I need a solution for this (drop an external 
table together with its data in HDFS, or convert an external table to a managed table 
and drop it).
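
A sketch of one possible workaround (not verified on 2.4.4): drop the external table's 
metadata and then remove its directory through the Hadoop FileSystem API. The table 
name and location below are hypothetical placeholders.
{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical table name and data location -- substitute your own values.
val tableName = "mydb.my_external_table"
val location  = new Path("hdfs:///warehouse/mydb.db/my_external_table")

// For an EXTERNAL table this removes only the metastore entry ...
spark.sql(s"DROP TABLE IF EXISTS $tableName")

// ... so the data directory has to be deleted explicitly.
val fs = location.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.delete(location, /* recursive = */ true)
{code}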

Many thanks.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource

2020-02-03 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028856#comment-17028856
 ] 

Jorge Machado commented on SPARK-27990:
---

[~nchammas]: Just pass these options when reading: 
{code:java}
.option("pathGlobFilter", ".*\\.png|.*\\.jpg|.*\\.jpeg|.*\\.PNG|.*\\.JPG|.*\\.JPEG")
.option("recursiveFileLookup", "true")
{code}
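
For reference, a fuller sketch of how the two options combine on a Spark 3.0 read; the 
binaryFile source, the glob and the path are assumptions and not part of the original 
comment:
{code:scala}
// Hedged sketch: recursiveFileLookup descends into sub-directories, while pathGlobFilter
// keeps only the file names matching the glob. Source format, glob and path are assumed.
val images = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.png")
  .option("recursiveFileLookup", "true")
  .load("hdfs:///data/images")
images.select("path", "length").show(false)
{code}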

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource

2020-02-03 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028854#comment-17028854
 ] 

Jorge Machado commented on SPARK-27990:
---

Can we backport this to 2.4.4 ?

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28413) sizeInByte is Not updated for parquet datasource on Next Insert.

2020-02-03 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-28413.
-
Resolution: Fixed

The issue fixed by 
https://github.com/apache/spark/commit/17881a467a1ac4224a50247458107f8b141850d2:
{noformat}

scala> spark.sql("create table tab2(id bigint) using parquet")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from tab2").show(false)
== Optimized Logical Plan ==
Relation[id#37L] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet default.tab2[id#37L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-27176/spark-3.0.0-SNAPSHOT-bin-2.7.4/spark-ware..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct


scala> spark.sql("insert into tab2 select 1")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from tab2").show(false)
== Optimized Logical Plan ==
Relation[id#51L] parquet, Statistics(sizeInBytes=457.0 B)

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet default.tab2[id#51L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-27176/spark-3.0.0-SNAPSHOT-bin-2.7.4/spark-ware..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct


{noformat}

> sizeInByte is Not updated for parquet datasource on Next Insert.
> 
>
> Key: SPARK-28413
> URL: https://issues.apache.org/jira/browse/SPARK-28413
> Project: Spark
>  Issue Type: Bug

[jira] [Created] (SPARK-30709) Spark 2.3 to Spark 2.4 Upgrade. Problems reading HIVE partitioned tables.

2020-02-03 Thread Carlos Mario (Jira)
Carlos Mario created SPARK-30709:


 Summary: Spark 2.3 to Spark 2.4 Upgrade. Problems reading HIVE 
partitioned tables.
 Key: SPARK-30709
 URL: https://issues.apache.org/jira/browse/SPARK-30709
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.0
 Environment: PRE- Production
Reporter: Carlos Mario


Hello

We recently updated our preproduction environment from Spark 2.3 to Spark 2.4.0

Over time we have created a large number of tables in the Hive Metastore, partitioned 
by two fields, one of type String and the other of type BigInt.

We were reading these tables with Spark 2.3 without problems, but after upgrading 
to Spark 2.4 we get the following log every time we run our software:



log_filterBIGINT.out:

 Caused by: MetaException(message:Filtering is supported only on partition keys 
of type string) Caused by: MetaException(message:Filtering is supported only on 
partition keys of type string) Caused by: MetaException(message:Filtering is 
supported only on partition keys of type string)

 

hadoop-cmf-hive-HIVEMETASTORE-isblcsmsttc0001.scisb.isban.corp.log.out.1:

 

2020-01-10 09:36:05,781 ERROR 
org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-138]: 
MetaException(message:Filtering is supported only on partition keys of type 
string)

2020-01-10 11:19:19,208 ERROR 
org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-187]: 
MetaException(message:Filtering is supported only on partition keys of type 
string)

2020-01-10 11:19:54,780 ERROR 
org.apache.hadoop.hive.metastore.RetryingHMSHandler: [pool-5-thread-167]: 
MetaException(message:Filtering is supported only on partition keys of type 
string)

 

 

We know the best practice from Spark's point of view is to use the 'STRING' type for 
partition columns, but we need a solution we can deploy with ease, given the large 
number of tables already created with a BIGINT-typed partition column.

 

As a first attempt we set the spark.sql.hive.manageFilesourcePartitions parameter to 
false in the spark-submit, but after rerunning the software the error persisted.
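
For anyone hitting the same MetaException, one knob that is sometimes suggested is 
spark.sql.hive.metastorePartitionPruning. A minimal, unverified sketch; the app name, 
table and filter below are hypothetical:
{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch: with metastore-side partition pruning disabled, Spark lists all
// partitions and filters them client-side, so the metastore never receives a filter
// on the BIGINT partition key. This trades listing cost for compatibility.
val spark = SparkSession.builder()
  .appName("bigint-partition-workaround")                         // hypothetical
  .config("spark.sql.hive.metastorePartitionPruning", "false")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT * FROM mydb.partitioned_table WHERE part_id = 20200110").show()  // hypothetical
{code}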

 

Is there anyone in the community who experienced the same problem? What was the 
solution for it? 

 

Kind Regards and thanks in advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient

2020-02-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028804#comment-17028804
 ] 

Hyukjin Kwon commented on SPARK-29245:
--

I sent an email to Hive dev for Hive 2.3.7 release 
([https://www.mail-archive.com/dev@hive.apache.org/msg138085.html])

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> From `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> ++
> |databaseName|
> ++
> |  .  |
> ...
> {code}
> With 2.3.7-SNAPSHOT, the following basic tests are tested.
> - SHOW DATABASES / TABLES
> - DESC DATABASE / TABLE
> - CREATE / DROP / USE DATABASE
> - CREATE / DROP / INSERT / LOAD / SELECT TABLE



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2020-02-03 Thread Frederik Schreiber (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028791#comment-17028791
 ] 

Frederik Schreiber commented on SPARK-22510:


Thank you for the answer. I will try to extract a minimal example and open a new ticket.

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: bulk-closed, releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2020-02-03 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028784#comment-17028784
 ] 

Kazuaki Ishizaki commented on SPARK-22510:
--

[~schreiber] Thank you for reporting the problem. Could you please share a program 
that reproduces it? We would first like to know which operators trigger this issue.

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: bulk-closed, releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25094) processNext() failed to compile size is over 64kb

2020-02-03 Thread Frederik Schreiber (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028783#comment-17028783
 ] 

Frederik Schreiber commented on SPARK-25094:


Should this issue be linked to SPARK-22510?

> processNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2020-02-03 Thread Frederik Schreiber (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028763#comment-17028763
 ] 

Frederik Schreiber edited comment on SPARK-22510 at 2/3/20 8:29 AM:


Hi [~smilegator], [~kiszk]

we are using Spark 2.4.0 and are currently having trouble with the 64KB exception. 
Our DataFrame has about 42 columns, and we are puzzled because all of these bugs are 
closed. Are there still known bugs that lead to this exception? Can this exception 
occur on complex queries/dataframes by design?

spark.sql.codegen.maxFields is set to 100.

 

Are there any suggestions for avoiding this error?

 

We also tried Spark version 2.4.4, with the same results.
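
For completeness, a minimal sketch of the codegen-related settings that are commonly 
tried for this error; the values are illustrative, they only change codegen behaviour 
and do not address the root cause:
{code:scala}
// Hedged sketch: knobs commonly tried when a generated processNext() method grows
// beyond 64 KB. Values are assumptions for illustration only.
spark.conf.set("spark.sql.codegen.wholeStage", "false")       // disable whole-stage codegen for the session
spark.conf.set("spark.sql.codegen.hugeMethodLimit", "8000")   // deactivate whole-stage codegen earlier for plans with large generated methods
{code}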


was (Author: schreiber):
Hi [~smilegator], [~kiszk]

we are using Spark 2.4.0 and currently having trouble with 64KB Exception. Our 
Dataframe has about 42 Columns, so we are wondering because all of these bugs 
are closed. Are there still known bugs which leads to that exception? Can this 
exception appear on complex queries/dataframes by design?

spark.sql.codegen.maxFields is set to 100

 

Are there same suggestion to avoid that error?

 

We although tried with Spark version 2.4.4 with same results.

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: bulk-closed, releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2020-02-03 Thread Frederik Schreiber (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028763#comment-17028763
 ] 

Frederik Schreiber commented on SPARK-22510:


Hi [~smilegator], [~kiszk]

we are using Spark 2.4.0 and are currently having trouble with the 64KB exception. 
Our DataFrame has about 42 columns, and we are puzzled because all of these bugs are 
closed. Are there still known bugs that lead to this exception? Can this exception 
occur on complex queries/dataframes by design?

spark.sql.codegen.maxFields is set to 100.

 

Are there any suggestions for avoiding this error?

 

We also tried Spark version 2.4.4, with the same results.

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: bulk-closed, releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30708) first_value/last_value window function throws ParseException

2020-02-03 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028752#comment-17028752
 ] 

jiaan.geng commented on SPARK-30708:


I'm working on this.

> first_value/last_value window function throws ParseException
> 
>
> Key: SPARK-30708
> URL: https://issues.apache.org/jira/browse/SPARK-30708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> first_value/last_value throws ParseException
>  
> {code:java}
> SELECT first_value(unique1) over w,
> last_value(unique1) over w, unique1, four
> FROM tenk1 WHERE unique1 < 10
> WINDOW w AS (order by four range between current row and unbounded following)
>  
> org.apache.spark.sql.catalyst.parser.ParseException
>  
> no viable alternative at input 'first_value'(line 1, pos 7)
>  
> == SQL ==
> SELECT first_value(unique1) over w,
> ---^^^
> last_value(unique1) over w, unique1, four
> FROM tenk1 WHERE unique1 < 10
> WINDOW w AS (order by four range between current row and unbounded following)
> {code}
>  
> Maybe we need to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30707) Lead/Lag window function throws AnalysisException without ORDER BY clause

2020-02-03 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30707:
---
Description: 
 Lead/Lag window function throws AnalysisException without ORDER BY clause:
{code:java}
SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four
FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s
org.apache.spark.sql.AnalysisException
Window function lead(ten#x, (four#x + 1), null) requires window to be ordered, 
please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + 1), 
null)(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) 
from table;
{code}
 

Maybe we need to fix this issue.
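
Until that is decided, the query runs only when the window itself is ordered, as the 
error message suggests. A hedged sketch against the same tenk1 data; the constant 
offset and the "ten" ordering column are assumptions made to keep the example simple:
{code:scala}
// Hedged sketch: adding an ORDER BY inside the OVER clause satisfies the current
// requirement that lead()/lag() operate over an ordered window.
spark.sql("""
  SELECT lead(ten, 1) OVER (PARTITION BY four ORDER BY ten) AS next_ten, ten, four
  FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten) s
""").show()
{code}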

  was:
 
{code:java}
Lead/Lag window function throws AnalysisException without ORDER BY clause:
SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four
FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s
org.apache.spark.sql.AnalysisException
Window function lead(ten#x, (four#x + 1), null) requires window to be ordered, 
please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + 1), 
null)(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) 
from table;
{code}
 

Maybe we need fix this issue.


> Lead/Lag window function throws AnalysisException without ORDER BY clause
> -
>
> Key: SPARK-30707
> URL: https://issues.apache.org/jira/browse/SPARK-30707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
>  Lead/Lag window function throws AnalysisException without ORDER BY clause:
> {code:java}
> SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four
> FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s
> org.apache.spark.sql.AnalysisException
> Window function lead(ten#x, (four#x + 1), null) requires window to be 
> ordered, please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + 
> 1), null)(value_expr) OVER (PARTITION BY window_partition ORDER BY 
> window_ordering) from table;
> {code}
>  
> Maybe we need to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30708) first_value/last_value window function throws ParseException

2020-02-03 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30708:
---
Summary: first_value/last_value window function throws ParseException  
(was: first_value/last_value throws ParseException)

> first_value/last_value window function throws ParseException
> 
>
> Key: SPARK-30708
> URL: https://issues.apache.org/jira/browse/SPARK-30708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> first_value/last_value throws ParseException
>  
> {code:java}
> SELECT first_value(unique1) over w,
> last_value(unique1) over w, unique1, four
> FROM tenk1 WHERE unique1 < 10
> WINDOW w AS (order by four range between current row and unbounded following)
>  
> org.apache.spark.sql.catalyst.parser.ParseException
>  
> no viable alternative at input 'first_value'(line 1, pos 7)
>  
> == SQL ==
> SELECT first_value(unique1) over w,
> ---^^^
> last_value(unique1) over w, unique1, four
> FROM tenk1 WHERE unique1 < 10
> WINDOW w AS (order by four range between current row and unbounded following)
> {code}
>  
> Maybe we need to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30708) first_value/last_value throws ParseException

2020-02-03 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30708:
--

 Summary: first_value/last_value throws ParseException
 Key: SPARK-30708
 URL: https://issues.apache.org/jira/browse/SPARK-30708
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


first_value/last_value throws ParseException

 
{code:java}
SELECT first_value(unique1) over w,
last_value(unique1) over w, unique1, four
FROM tenk1 WHERE unique1 < 10
WINDOW w AS (order by four range between current row and unbounded following)
 
org.apache.spark.sql.catalyst.parser.ParseException
 
no viable alternative at input 'first_value'(line 1, pos 7)
 
== SQL ==
SELECT first_value(unique1) over w,
---^^^
last_value(unique1) over w, unique1, four
FROM tenk1 WHERE unique1 < 10
WINDOW w AS (order by four range between current row and unbounded following)
{code}
 

Maybe we need to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org