[jira] [Created] (SPARK-25960) Support subpath mounting with Kubernetes

2018-11-06 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-25960:


 Summary: Support subpath mounting with Kubernetes
 Key: SPARK-25960
 URL: https://issues.apache.org/jira/browse/SPARK-25960
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.5.0
Reporter: Timothy Chen


Currently we support mounting volumes into the executor and driver pods, but there 
is no option to provide a subpath to be mounted from the volume. 
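
For illustration, a minimal sketch of how such an option could sit next to the 
existing Kubernetes volume-mount configuration. The {{.mount.subPath}} key below 
is hypothetical and only shows the shape of the requested feature, not an 
existing Spark option:

{code:scala}
import org.apache.spark.SparkConf

// Existing Spark-on-Kubernetes volume options (claim name and mount path),
// plus a hypothetical ".mount.subPath" key mirroring Kubernetes' own subPath.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "shared-pvc")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/data")
  // Hypothetical new option requested by this ticket:
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.subPath", "app-logs")
{code}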






[jira] [Updated] (SPARK-18466) Missing withFilter method causes errors when using for comprehensions in Scala 2.12

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18466:
--
Affects Version/s: (was: 2.3.0)
   (was: 2.0.2)
   (was: 2.0.1)
   (was: 1.6.3)
   (was: 1.6.2)
   (was: 1.6.1)
   (was: 1.5.2)
   (was: 1.5.1)
   (was: 1.6.0)
   (was: 1.4.1)
   (was: 1.5.0)
   (was: 2.0.0)
   (was: 1.4.0)

> Missing withFilter method causes errors when using for comprehensions in 
> Scala 2.12
> ---
>
> Key: SPARK-18466
> URL: https://issues.apache.org/jira/browse/SPARK-18466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Richard W. Eggert II
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fact that the RDD class has a {{filter}} method but not a {{withFilter}} 
> method results in compiler warnings when using RDDs in {{for}} 
> comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no 
> longer supported, so {{for}} comprehensions that use filters will no longer 
> compile. Semantically, the only difference between {{withFilter}} and 
> {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, 
> one can simply be aliased to the other.
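
Purely as an illustration of the aliasing mentioned in the description above (a 
sketch, not the actual proposed patch), {{withFilter}} could be forwarded to 
{{filter}} through an enrichment class:

{code:scala}
import org.apache.spark.rdd.RDD

// Sketch only: gives every RDD a withFilter that delegates to filter.
// Since filter on RDDs is already lazy, the two are semantically equivalent,
// which is what makes the aliasing possible.
object RddWithFilter {
  implicit class FilterableRDD[T](val rdd: RDD[T]) extends AnyVal {
    def withFilter(p: T => Boolean): RDD[T] = rdd.filter(p)
  }
}
{code}

With such an implicit in scope, a comprehension like {{for (n <- rdd if n > 2) yield n}} 
desugars to a {{withFilter}} call and compiles under Scala 2.12.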






[jira] [Updated] (SPARK-18466) Missing withFilter method causes errors when using for comprehensions

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18466:
--
Summary: Missing withFilter method causes errors when using for 
comprehensions  (was: Missing withFilter method causes warnings when using for 
comprehensions)

> Missing withFilter method causes errors when using for comprehensions
> -
>
> Key: SPARK-18466
> URL: https://issues.apache.org/jira/browse/SPARK-18466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0
>Reporter: Richard W. Eggert II
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fact that the RDD class has a {{filter}} method but not a {{withFilter}} 
> method results in compiler warnings when using RDDs in {{for}} 
> comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no 
> longer supported, so {{for}} comprehensions that use filters will no longer 
> compile. Semantically, the only difference between {{withFilter}} and 
> {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, 
> one can simply be aliased to the other.






[jira] [Updated] (SPARK-18466) Missing withFilter method causes errors when using for comprehensions in Scala 2.12

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18466:
--
Summary: Missing withFilter method causes errors when using for 
comprehensions in Scala 2.12  (was: Missing withFilter method causes errors 
when using for comprehensions)

> Missing withFilter method causes errors when using for comprehensions in 
> Scala 2.12
> ---
>
> Key: SPARK-18466
> URL: https://issues.apache.org/jira/browse/SPARK-18466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0
>Reporter: Richard W. Eggert II
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fact that the RDD class has a {{filter}} method but not a {{withFilter}} 
> method results in compiler warnings when using RDDs in {{for}} 
> comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no 
> longer supported, so {{for}} comprehensions that use filters will no longer 
> compile. Semantically, the only difference between {{withFilter}} and 
> {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, 
> one can simply be aliased to the other.






[jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-06 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677752#comment-16677752
 ] 

shahid commented on SPARK-25959:


Thanks. I will analyze the issue. 

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
>
> I tried to implement GBT and found that the feature importance computed while 
> the model was fit is different once the same model is saved to storage and 
> loaded back. 
>  
> I also found that once the persisted model is loaded, saved back again, and 
> loaded once more, the feature importance remains the same. 
>  
> Not sure if it's a bug while storing and reading the model the first time, or 
> whether I am missing some parameter that needs to be set before saving the 
> model (so the model picks up some defaults, causing the feature importance to 
> change).
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */






[jira] [Closed] (SPARK-18466) Missing withFilter method causes warnings when using for comprehensions

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-18466.
-

> Missing withFilter method causes warnings when using for comprehensions
> ---
>
> Key: SPARK-18466
> URL: https://issues.apache.org/jira/browse/SPARK-18466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0
>Reporter: Richard W. Eggert II
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fact that the RDD class has a {{filter}} method but not a {{withFilter}} 
> method results in compiler warnings when using RDDs in {{for}} 
> comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no 
> longer supported, so {{for}} comprehensions that use filters will no longer 
> compile. Semantically, the only difference between {{withFilter}} and 
> {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, 
> one can simply be aliased to the other.






[jira] [Resolved] (SPARK-18466) Missing withFilter method causes warnings when using for comprehensions

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-18466.
---
Resolution: Won't Fix

Thank you for the report and the contribution, [~reggert1980]. We know that this 
has been a long-pending PR.

Although this issue causes a regression with Scala 2.12 in Spark 2.4, the Apache 
Spark community will not add this feature due to the risk. Please see the 
discussion on the PR.

{code}
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = 
local-1541571276105).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> (for (n <- sc.parallelize(Seq(1,2,3)) if n > 2) yield n).toDebugString
<console>:25: error: value withFilter is not a member of 
org.apache.spark.rdd.RDD[Int]
   (for (n <- sc.parallelize(Seq(1,2,3)) if n > 2) yield n).toDebugString
{code}

> Missing withFilter method causes warnings when using for comprehensions
> ---
>
> Key: SPARK-18466
> URL: https://issues.apache.org/jira/browse/SPARK-18466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0
>Reporter: Richard W. Eggert II
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fact that the RDD class has a {{filter}} method but not a {{withFilter}} 
> method results in compiler warnings when using RDDs in {{for}} 
> comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no 
> longer supported, so {{for}} comprehensions that use filters will no longer 
> compile. Semantically, the only difference between {{withFilter}} and 
> {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, 
> one can simply be aliased to the other.






[jira] [Updated] (SPARK-18466) Missing withFilter method causes warnings when using for comprehensions

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18466:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-24417

> Missing withFilter method causes warnings when using for comprehensions
> ---
>
> Key: SPARK-18466
> URL: https://issues.apache.org/jira/browse/SPARK-18466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0
>Reporter: Richard W. Eggert II
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fact that the RDD class has a {{filter}} method but not a {{withFilter}} 
> method results in compiler warnings when using RDDs in {{for}} 
> comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no 
> longer supported, so {{for}} comprehensions that use filters will no longer 
> compile. Semantically, the only difference between {{withFilter}} and 
> {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, 
> one can simply be aliased to the other.






[jira] [Updated] (SPARK-18466) Missing withFilter method causes warnings when using for comprehensions

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18466:
--
Affects Version/s: 2.4.0
   2.3.0

> Missing withFilter method causes warnings when using for comprehensions
> ---
>
> Key: SPARK-18466
> URL: https://issues.apache.org/jira/browse/SPARK-18466
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0
>Reporter: Richard W. Eggert II
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fact that the RDD class has a {{filter}} method but not a {{withFilter}} 
> method results in compiler warnings when using RDDs in {{for}} 
> comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no 
> longer supported, so {{for}} comprehensions that use filters will no longer 
> compile. Semantically, the only difference between {{withFilter}} and 
> {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, 
> one can simply be aliased to the other.






[jira] [Resolved] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character(s)

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25098.
---
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/22943

> ‘Cast’ will return NULL when input string starts/ends with special 
> character(s)
> ---
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with special 
> characters (e.g. a blank), but Hive does not.
> For example, we get the hour from a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK 
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 '); //ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 '); //ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```
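
As a point of comparison, a small hedged example of the workaround and of the 
behavior that the follow-up retitle of this issue ("Trim the string when cast 
stringToTimestamp and stringToDate") automates: trimming the input before 
casting already gives the Hive-compatible answer:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cast-trim").master("local[*]").getOrCreate()

// Explicitly trimming the leading/trailing blank avoids the NULL result.
spark.sql("SELECT CAST(trim(' 2018-08-13') AS DATE)").show()   // 2018-08-13
spark.sql("SELECT HOUR(trim('2018-08-13 17:20:07 '))").show()  // 17
{code}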






[jira] [Updated] (SPARK-25098) Trim the string when cast stringToTimestamp and stringToDate

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25098:
--
Summary: Trim the string when cast stringToTimestamp and stringToDate  
(was: ‘Cast’ will return NULL when input string starts/ends with special 
character(s))

> Trim the string when cast stringToTimestamp and stringToDate
> 
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with special 
> characters (e.g. a blank), but Hive does not.
> For example, we get the hour from a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK 
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 '); //ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 '); //ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```






[jira] [Assigned] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord

2018-11-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25950:


Assignee: Maxim Gekk

> from_csv should respect to spark.sql.columnNameOfCorruptRecord
> --
>
> Key: SPARK-25950
> URL: https://issues.apache.org/jira/browse/SPARK-25950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> The from_csv() function should respect the SQL config 
> *spark.sql.columnNameOfCorruptRecord*, as from_json() does. Currently it only 
> takes the CSV option *columnNameOfCorruptRecord* into account.
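
For context, a hedged sketch of the expected behavior (using the Spark 3.0.0 API 
noted in Fix Version/s); the column name {{_broken}} and the sample data are 
made up for illustration:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("from-csv-corrupt").master("local[*]").getOrCreate()
import spark.implicits._

// Route malformed records into the column named by the SQL config,
// which from_csv is expected to honor just like from_json.
spark.conf.set("spark.sql.columnNameOfCorruptRecord", "_broken")

val schema = StructType.fromDDL("id INT, name STRING, _broken STRING")
val parsed = Seq("1,a", "oops").toDF("value")
  .select(from_csv($"value", schema, Map("mode" -> "PERMISSIVE")).as("csv"))
parsed.select("csv.*").show()
{code}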






[jira] [Resolved] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord

2018-11-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25950.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22956
[https://github.com/apache/spark/pull/22956]

> from_csv should respect to spark.sql.columnNameOfCorruptRecord
> --
>
> Key: SPARK-25950
> URL: https://issues.apache.org/jira/browse/SPARK-25950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> The from_csv() function should respect the SQL config 
> *spark.sql.columnNameOfCorruptRecord*, as from_json() does. Currently it only 
> takes the CSV option *columnNameOfCorruptRecord* into account.






[jira] [Created] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-06 Thread Suraj Nayak (JIRA)
Suraj Nayak created SPARK-25959:
---

 Summary: Difference in featureImportances results on computed vs 
saved models
 Key: SPARK-25959
 URL: https://issues.apache.org/jira/browse/SPARK-25959
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 2.2.0
Reporter: Suraj Nayak


I tried to implement GBT and found that the feature importance computed while 
the model was fit is different once the same model is saved to storage and 
loaded back.

I also found that once the persisted model is loaded, saved back again, and 
loaded once more, the feature importance remains the same.

Not sure if it's a bug while storing and reading the model the first time, or 
whether I am missing some parameter that needs to be set before saving the 
model (so the model picks up some defaults, causing the feature importance to 
change).

 

*Below is the test code:*

val testDF = Seq(
(1, 3, 2, 1, 1),
(3, 2, 1, 2, 0),
(2, 2, 1, 1, 0),
(3, 4, 2, 2, 0),
(2, 2, 1, 3, 1)
).toDF("a", "b", "c", "d", "e")


val featureColumns = testDF.columns.filter(_ != "e")
// Assemble the features into a vector
val assembler = new 
VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
// Transform the data to get the feature data set
val featureDF = assembler.transform(testDF)

// Train a GBT model.
val gbt = new GBTClassifier()
.setLabelCol("e")
.setFeaturesCol("features")
.setMaxDepth(2)
.setMaxBins(5)
.setMaxIter(10)
.setSeed(10)
.fit(featureDF)

gbt.transform(featureDF).show(false)

// Write out the model

featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
/* Prints

(d,0.5931875075767403)
(a,0.3747184548362353)
(b,0.03209403758702444)
(c,0.0)

*/
gbt.write.overwrite().save("file:///tmp/test123")

println("Reading model again")
val gbtload = GBTClassificationModel.load("file:///tmp/test123")

featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)

/*

Prints

(d,0.6455841215290767)
(a,0.3316126797964181)
(b,0.022803198674505094)
(c,0.0)

*/


gbtload.write.overwrite().save("file:///tmp/test123_rewrite")

val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")

featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)

/* prints
(d,0.6455841215290767)
(a,0.3316126797964181)
(b,0.022803198674505094)
(c,0.0)

*/






[jira] [Assigned] (SPARK-25947) Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25947:


Assignee: Apache Spark

> Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns
> -
>
> Key: SPARK-25947
> URL: https://issues.apache.org/jira/browse/SPARK-25947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Shuheng Dai
>Assignee: Apache Spark
>Priority: Major
>
> When sorting rows, ShuffleExchangeExec uses the entire row instead of just 
> the columns referenced in SortOrder to create the RangePartitioner. This 
> causes the RangePartitioner to sample entire rows to create rangeBounds and 
> can cause OOM issues on the driver when rows contain large fields.
> The proposal is to create a projection and use only the columns involved in 
> the SortOrder for the RangePartitioner.
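
A rough, hedged sketch of the idea (simplified; the real change works on 
InternalRow and SortOrder inside ShuffleExchangeExec, and the helper below is 
made up for illustration):

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.RDD

// Build the RangePartitioner from a projected sort key instead of the full
// row, so the driver-side sampling used for rangeBounds only ever sees the
// (small) keys, never the wide rows themselves.
def rangePartitionerOnSortKey[R, K: Ordering: ClassTag](
    rows: RDD[R],
    sortKey: R => K,
    numPartitions: Int): RangePartitioner[K, R] = {
  val keyed = rows.map(r => (sortKey(r), r)) // wide row stays in the value slot
  new RangePartitioner(numPartitions, keyed) // samples keys, not whole rows
}
{code}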






[jira] [Commented] (SPARK-25947) Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677513#comment-16677513
 ] 

Apache Spark commented on SPARK-25947:
--

User 'mu5358271' has created a pull request for this issue:
https://github.com/apache/spark/pull/22961

> Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns
> -
>
> Key: SPARK-25947
> URL: https://issues.apache.org/jira/browse/SPARK-25947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Shuheng Dai
>Priority: Major
>
> When sorting rows, ShuffleExchangeExec uses the entire row instead of just 
> the columns referenced in SortOrder to create the RangePartitioner. This 
> causes the RangePartitioner to sample entire rows to create rangeBounds and 
> can cause OOM issues on the driver when rows contain large fields.
> The proposal is to create a projection and use only the columns involved in 
> the SortOrder for the RangePartitioner.






[jira] [Commented] (SPARK-25947) Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677514#comment-16677514
 ] 

Apache Spark commented on SPARK-25947:
--

User 'mu5358271' has created a pull request for this issue:
https://github.com/apache/spark/pull/22961

> Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns
> -
>
> Key: SPARK-25947
> URL: https://issues.apache.org/jira/browse/SPARK-25947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Shuheng Dai
>Priority: Major
>
> When sorting rows, ShuffleExchangeExec uses the entire row instead of just 
> the columns referenced in SortOrder to create the RangePartitioner. This 
> causes the RangePartitioner to sample entire rows to create rangeBounds and 
> can cause OOM issues on the driver when rows contain large fields.
> The proposal is to create a projection and use only the columns involved in 
> the SortOrder for the RangePartitioner.






[jira] [Assigned] (SPARK-25947) Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25947:


Assignee: (was: Apache Spark)

> Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns
> -
>
> Key: SPARK-25947
> URL: https://issues.apache.org/jira/browse/SPARK-25947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Shuheng Dai
>Priority: Major
>
> When sorting rows, ShuffleExchangeExec uses the entire row instead of just 
> the columns referenced in SortOrder to create the RangePartitioner. This 
> causes the RangePartitioner to sample entire rows to create rangeBounds and 
> can cause OOM issues on the driver when rows contain large fields.
> The proposal is to create a projection and use only the columns involved in 
> the SortOrder for the RangePartitioner.






[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources

2018-11-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677486#comment-16677486
 ] 

Hyukjin Kwon commented on SPARK-17967:
--

It's a rough idea, but I was also thinking of allowing binary and sending a CSV 
settings object directly ({{CsvWriterSettings}}) from Scala and Java. The 
current Univocity parser allows too many options, and it's kind of troublesome 
to judge which ones should be added or not 
(https://github.com/apache/spark/pull/22590).

> Support for list or other types as an option for datasources
> 
>
> Key: SPARK-17967
> URL: https://issues.apache.org/jira/browse/SPARK-17967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This was discussed in SPARK-17878.
> For other datasources, string/long/boolean/double values seem okay as options, 
> but that does not seem to be enough for a datasource such as CSV. As this is 
> an interface for other external datasources, I guess it'd affect several ones 
> out there.
> I took a first look, but it seems it'd be difficult to support this (it needs 
> a lot of changes).
> One suggestion is to support this as a JSON array.
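
A hedged illustration of the JSON-array suggestion: today {{option}} only 
accepts string/long/boolean/double values, so a multi-valued setting would have 
to be passed as an encoded string. The {{nullValues}} option name below is made 
up purely to show the shape of the idea:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("list-options").master("local[*]").getOrCreate()

// Today: a single string value per option.
val single = spark.read.option("nullValue", "NA").csv("/tmp/data.csv")

// Hypothetical shape of the proposal: a list encoded as a JSON array string.
val multi = spark.read.option("nullValues", """["NA", "null", "-"]""").csv("/tmp/data.csv")
{code}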






[jira] [Created] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-06 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-25958:
-

 Summary: error: [Errno 97] Address family not supported by 
protocol in dataframe.take()
 Key: SPARK-25958
 URL: https://issues.apache.org/jira/browse/SPARK-25958
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 2.3.2, 2.3.1
Reporter: Ruslan Dautkhanov


The following error happens in a heavy Spark job after 4 hours of runtime:

{code:python}
2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 
97] Address family not supported by protocol
Traceback (most recent call last):
  File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
item.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
line 53, in create_persistent_data
single_obj.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", 
line 21, in create_persistent_data
main_df = self.generate_dataframe_main()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", 
line 98, in generate_dataframe_main
raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
  File 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
 line 16, in get_raw_data_with_metadata_and_aggregation
main_df = 
self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
  File 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
 line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
return_df = self.get_dataframe_from_binary_value_iteration(input_df)
  File 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
 line 136, in get_dataframe_from_binary_value_iteration
combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
binary_value=count)
  File 
"/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
 line 154, in get_dataframe_from_binary_value
if len(results_of_filter_df.take(1)) == 0:
  File 
"/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 
504, in take
return self.limit(num).collect()
  File 
"/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 
467, in collect
return list(_load_from_socket(sock_info, 
BatchedSerializer(PickleSerializer(
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
148, in _load_from_socket
sock = socket.socket(af, socktype, proto)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
__init__
_sock = _realsocket(family, type, proto)
error: [Errno 97] Address family not supported by protocol
{code}

Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148: 

{code:python}

def _load_from_socket(sock_info, serializer):
    port, auth_secret = sock_info
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = socket.socket(af, socktype, proto)
        try:
            sock.settimeout(15)
            sock.connect(sa)
        except socket.error:
            sock.close()
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    # The RDD materialization time is unpredicable, if we set a timeout for socket reading
    # operation, it will very possibly fail. See SPARK-18281.
    sock.settimeout(None)

    sockfile = sock.makefile("rwb", 65536)
    do_server_auth(sockfile, auth_secret)

    # The socket will be automatically closed when garbage-collected.
    return serializer.load_stream(sockfile)
{code}

The culprit is in the line 

{code:python}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}

So the error "error: [Errno 97] *Address family* not supported by protocol" 
seems to be caused by the socket.AF_UNSPEC third argument to the 
socket.getaddrinfo() call.

I tried a similar socket.getaddrinfo() call locally outside of PySpark and it 
worked fine.

RHEL 7.5.







[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources

2018-11-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677478#comment-16677478
 ] 

Hyukjin Kwon commented on SPARK-17967:
--

For CSV itself, yeah, there are workarounds and I agree - for CSV it would just 
be good to have. However, other cases, for instance specifying a binary format 
(https://github.com/apache/spark/pull/21192), ideally need this. It is also 
needed to specify multiple delimiters or date formats (there are already some 
JIRAs open).

> Support for list or other types as an option for datasources
> 
>
> Key: SPARK-17967
> URL: https://issues.apache.org/jira/browse/SPARK-17967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This was discussed in SPARK-17878.
> For other datasources, string/long/boolean/double values seem okay as options, 
> but that does not seem to be enough for a datasource such as CSV. As this is 
> an interface for other external datasources, I guess it'd affect several ones 
> out there.
> I took a first look, but it seems it'd be difficult to support this (it needs 
> a lot of changes).
> One suggestion is to support this as a JSON array.






[jira] [Resolved] (SPARK-17166) CTAS lost table properties after conversion to data source tables.

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-17166.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.4.0

It was decided to support Spark-related Parquet/ORC properties, and this was 
implemented in Spark 2.4.0.
Other properties that are unknown to Spark are intentionally ignored.

> CTAS lost table properties after conversion to data source tables.
> --
>
> Key: SPARK-17166
> URL: https://issues.apache.org/jira/browse/SPARK-17166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> CTAS lost table properties after conversion to data source tables. For 
> example, 
> {noformat}
> CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 as a, 
> 1 as b
> {noformat}
> The output of `DESC FORMATTED t` does not have the related properties. 
> {noformat}
> |Table Parameters:           |                                                               |
> |  rawDataSize               |-1                                                             |
> |  numFiles                  |1                                                              |
> |  transient_lastDdlTime     |1471670983                                                     |
> |  totalSize                 |496                                                            |
> |  spark.sql.sources.provider|parquet                                                        |
> |  EXTERNAL                  |FALSE                                                          |
> |  COLUMN_STATS_ACCURATE     |false                                                          |
> |  numRows                   |-1                                                             |
> |                            |                                                               |
> |# Storage Information       |                                                               |
> |SerDe Library:              |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |
> |InputFormat:                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
> |OutputFormat:               |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
> |Compressed:                 |No                                                             |
> |Storage Desc Parameters:    |                                                               |
> |  serialization.format      |1                                                              |
> |  path                      |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-f3aa2927-6464-4a35-a715-1300dde6c614/t|
> {noformat}






[jira] [Updated] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8.2

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24771:
--
Summary: Upgrade AVRO version from 1.7.7 to 1.8.2  (was: Upgrade AVRO 
version from 1.7.7 to 1.8)

> Upgrade AVRO version from 1.7.7 to 1.8.2
> 
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>







[jira] [Updated] (SPARK-24422) Add JDK11 in our Jenkins' build servers

2018-11-06 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24422:

Summary: Add JDK11 in our Jenkins' build servers  (was: Add JDK9+ in our 
Jenkins' build servers)

> Add JDK11 in our Jenkins' build servers
> ---
>
> Key: SPARK-24422
> URL: https://issues.apache.org/jira/browse/SPARK-24422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>







[jira] [Updated] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-06 Thread Nagaram Prasad Addepally (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nagaram Prasad Addepally updated SPARK-25957:
-
Description: 
[docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
 script by default tries to build spark-r image. We may not always build spark 
distribution with R support. It would be good to skip building and publishing 
spark-r images if R support is not available in the spark distribution.  (was: 
[docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
 script by default tries to build spark-r image by default. We may not always 
build spark distribution with R support. It would be good to skip building and 
publishing spark-r images if R support is not available in the spark 
distribution.)

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script by default tries to build spark-r image. We may not always build 
> spark distribution with R support. It would be good to skip building and 
> publishing spark-r images if R support is not available in the spark 
> distribution.






[jira] [Created] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-06 Thread Nagaram Prasad Addepally (JIRA)
Nagaram Prasad Addepally created SPARK-25957:


 Summary: Skip building spark-r docker image if spark distribution 
does not have R support
 Key: SPARK-25957
 URL: https://issues.apache.org/jira/browse/SPARK-25957
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Nagaram Prasad Addepally


[docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
 script by default tries to build spark-r image by default. We may not always 
build spark distribution with R support. It would be good to skip building and 
publishing spark-r images if R support is not available in the spark 
distribution.






[jira] [Resolved] (SPARK-25676) Refactor BenchmarkWideTable to use main method

2018-11-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25676.
---
   Resolution: Fixed
 Assignee: yucai
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/22823 .

> Refactor BenchmarkWideTable to use main method
> --
>
> Key: SPARK-25676
> URL: https://issues.apache.org/jira/browse/SPARK-25676
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: yucai
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources

2018-11-06 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677418#comment-16677418
 ] 

Reynold Xin commented on SPARK-17967:
-

BTW how important is this? Seems like for CSV people can just replace the null 
values with null themselves using the programmatic API.

 

> Support for list or other types as an option for datasources
> 
>
> Key: SPARK-17967
> URL: https://issues.apache.org/jira/browse/SPARK-17967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This was discussed in SPARK-17878.
> For other datasources, string/long/boolean/double values seem okay as options, 
> but that does not seem to be enough for a datasource such as CSV. As this is 
> an interface for other external datasources, I guess it'd affect several ones 
> out there.
> I took a first look, but it seems it'd be difficult to support this (it needs 
> a lot of changes).
> One suggestion is to support this as a JSON array.






[jira] [Created] (SPARK-25956) Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread DB Tsai (JIRA)
DB Tsai created SPARK-25956:
---

 Summary: Make Scala 2.12 as default Scala version in Spark 3.0
 Key: SPARK-25956
 URL: https://issues.apache.org/jira/browse/SPARK-25956
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 2.4.0
Reporter: DB Tsai


Scala 2.11 is unlikely to support Java 11 (see 
https://github.com/scala/scala-dev/issues/559#issuecomment-436160166); hence, we 
will make Scala 2.12 the default Scala version in Spark 3.0.






[jira] [Commented] (SPARK-24421) sun.misc.Unsafe in JDK9+

2018-11-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677280#comment-16677280
 ] 

Sean Owen commented on SPARK-24421:
---

BTW thanks for the tip [~akorzhuev] ; I checked and unfortunately 
DirectByteBuffer in Java 9 holds an instance of jdk.internal.ref.Cleaner, not 
java.lang.ref.Cleaner. The latter isn't available in Java 8 anyway. So we'd 
have to hack this with reflection regardless.
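
For context, a hedged sketch of the kind of reflection hack being discussed 
(illustrative only - access rules and warnings differ across JDKs, and this is 
not Spark's actual code):

{code:scala}
import java.nio.ByteBuffer

// Free a direct buffer without referencing the JDK-internal Cleaner types at
// compile time, so the same code loads on both Java 8 and Java 9+.
def freeDirectBuffer(buf: ByteBuffer): Unit = {
  if (buf.isDirect) {
    val cleanerMethod = buf.getClass.getMethod("cleaner")
    cleanerMethod.setAccessible(true)
    val cleaner = cleanerMethod.invoke(buf)
    if (cleaner != null) {
      // sun.misc.Cleaner (Java 8) and jdk.internal.ref.Cleaner (9+) both expose clean().
      cleaner.getClass.getMethod("clean").invoke(cleaner)
    }
  }
}
{code}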

> sun.misc.Unsafe in JDK9+
> 
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as Unsafe are encapsulated in JDK 9+; see 
> http://openjdk.java.net/jeps/260 for details.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}






[jira] [Commented] (SPARK-25922) [K8] Spark Driver/Executor "spark-app-selector" label mismatch

2018-11-06 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677276#comment-16677276
 ] 

Yinan Li commented on SPARK-25922:
--

The application ID used to set the {{spark-app-selector}} label for the driver 
pod comes from this line: 
[https://github.com/apache/spark/blob/3404a73f4cf7be37e574026d08ad5cf82cfac871/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L217]
The application ID used to set the {{spark-app-selector}} label for the 
executor pods comes from this line: 
[https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L87]
Agreed that it's problematic that two different application IDs end up in the 
same label.

> [K8] Spark Driver/Executor "spark-app-selector" label mismatch
> --
>
> Key: SPARK-25922
> URL: https://issues.apache.org/jira/browse/SPARK-25922
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 RC4
>Reporter: Anmol Khurana
>Priority: Major
>
> Hi,
> I have been testing Spark 2.4.0 RC4 on Kubernetes to run Python Spark 
> applications and am running into an issue where the AppId label on the driver 
> and executors mismatches. I am using 
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] to run these 
> applications.
> I see a spark.app.id of the form spark-* as the "spark-app-selector" label on 
> the driver, as well as in the K8s config-map which gets created for the driver 
> via spark-submit. My guess is this is coming from 
> [https://github.com/apache/spark/blob/f6cc354d83c2c9a757f9b507aadd4dbdc5825cca/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L211]
> But when the driver actually comes up and brings up executors, I see that the 
> "spark-app-selector" label on the executors, as well as the spark.app.id 
> config within the user code on the driver, is something of the form 
> spark-application-* (probably from 
> [https://github.com/apache/spark/blob/b19a28dea098c7d6188f8540429c50f42952d678/core/src/main/scala/org/apache/spark/SparkContext.scala#L511]
> and 
> [https://github.com/apache/spark/blob/bfb74394a5513134ea1da9fcf4a1783b77dd64e4/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala#L26])
> We were consuming this "spark-app-selector" label on the driver pod to get the 
> app ID and use it to look up the app in the Spark History Server (among other 
> use cases), but due to this mismatch that logic no longer works. This was 
> working fine in the Spark 2.2 fork for Kubernetes which I was using earlier. 
> Is this expected behavior, and if yes, what's the correct way to fetch the 
> applicationId from outside the application?
> Let me know if I can provide any more details or if I am doing something 
> wrong. Here is an example run with different *spark-app-selector* labels on 
> the driver/executor:
>  
> {code:java}
> Name: pyfiles-driver
> Namespace: default
> Priority: 0
> PriorityClassName: 
> Start Time: Thu, 01 Nov 2018 18:19:46 -0700
> Labels: spark-app-selector=spark-b78bb10feebf4e2d98c11d7b6320e18f
>  spark-role=driver
>  sparkoperator.k8s.io/app-name=pyfiles
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  version=2.4.0
> Status: Running
> Name: pyfiles-1541121585642-exec-1
> Namespace: default
> Priority: 0
> PriorityClassName: 
> Start Time: Thu, 01 Nov 2018 18:24:02 -0700
> Labels: spark-app-selector=spark-application-1541121829445
>  spark-exec-id=1
>  spark-role=executor
>  sparkoperator.k8s.io/app-name=pyfiles
>  sparkoperator.k8s.io/launched-by-spark-operator=true
>  version=2.4.0
> Status: Pending
> {code}
>  
>  
>  






[jira] [Assigned] (SPARK-25955) Porting JSON test for CSV functions

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25955:


Assignee: (was: Apache Spark)

> Porting JSON test for CSV functions
> ---
>
> Key: SPARK-25955
> URL: https://issues.apache.org/jira/browse/SPARK-25955
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> JsonFunctionsSuite contains tests that are applicable and useful for the CSV 
> functions - from_csv, to_csv and schema_of_csv:
> * uses DDL strings for defining a schema - java
> * roundtrip to_csv -> from_csv
> * roundtrip from_csv -> to_csv
> * infers the schema of a CSV string and passes it to from_csv
> * Support to_csv in SQL
> * Support from_csv in SQL






[jira] [Commented] (SPARK-25955) Porting JSON test for CSV functions

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677272#comment-16677272
 ] 

Apache Spark commented on SPARK-25955:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22960

> Porting JSON test for CSV functions
> ---
>
> Key: SPARK-25955
> URL: https://issues.apache.org/jira/browse/SPARK-25955
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> JsonFunctionsSuite contains tests that are applicable and useful for the CSV 
> functions - from_csv, to_csv and schema_of_csv:
> * uses DDL strings for defining a schema - java
> * roundtrip to_csv -> from_csv
> * roundtrip from_csv -> to_csv
> * infers the schema of a CSV string and passes it to from_csv
> * Support to_csv in SQL
> * Support from_csv in SQL






[jira] [Assigned] (SPARK-25955) Porting JSON test for CSV functions

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25955:


Assignee: Apache Spark

> Porting JSON test for CSV functions
> ---
>
> Key: SPARK-25955
> URL: https://issues.apache.org/jira/browse/SPARK-25955
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> JsonFunctionsSuite contains tests that are applicable and useful for the CSV 
> functions - from_csv, to_csv and schema_of_csv:
> * uses DDL strings for defining a schema - java
> * roundtrip to_csv -> from_csv
> * roundtrip from_csv -> to_csv
> * infers the schema of a CSV string and passes it to from_csv
> * Support to_csv in SQL
> * Support from_csv in SQL






[jira] [Created] (SPARK-25955) Porting JSON test for CSV functions

2018-11-06 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25955:
--

 Summary: Porting JSON test for CSV functions
 Key: SPARK-25955
 URL: https://issues.apache.org/jira/browse/SPARK-25955
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


JsonFunctionsSuite contains tests that are applicable and useful for the CSV 
functions - from_csv, to_csv and schema_of_csv:
* uses DDL strings for defining a schema - java
* roundtrip to_csv -> from_csv
* roundtrip from_csv -> to_csv
* infers the schema of a CSV string and passes it to from_csv
* Support to_csv in SQL
* Support from_csv in SQL
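
A hedged sketch of the kind of roundtrip being ported, assuming the Spark 3.0 
{{from_csv}}/{{to_csv}}/{{schema_of_csv}} functions; the local data is 
illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_csv, to_csv, schema_of_csv, lit}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("csv-roundtrip").master("local[*]").getOrCreate()
import spark.implicits._

// Roundtrip from_csv -> to_csv using a DDL string for the schema.
val schema = StructType.fromDDL("id INT, name STRING")
val df = Seq("1,Alice").toDF("value")
df.select(to_csv(from_csv($"value", schema, Map.empty[String, String])).as("csv")).show()

// Infer the schema of a CSV string (schema_of_csv needs a literal).
df.select(schema_of_csv(lit("1,Alice"))).show(false)

// SQL form, matching the "Support from_csv in SQL" item above.
spark.sql("SELECT from_csv('1,Alice', 'id INT, name STRING')").show()
{code}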






[jira] [Updated] (SPARK-25222) Spark on Kubernetes Pod Watcher dumps raw container status

2018-11-06 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-25222:
---
Fix Version/s: 3.0.0

> Spark on Kubernetes Pod Watcher dumps raw container status
> --
>
> Key: SPARK-25222
> URL: https://issues.apache.org/jira/browse/SPARK-25222
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark on Kubernetes provides logging of the pod/container status as a way to 
> monitor job progress. However, the logger just dumps the raw container status 
> object, leading to fairly unreadable output like this:
> {noformat}
> 18/08/24 09:03:27 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-groupby-1535101393784-driver
>namespace: default
>labels: spark-app-selector -> spark-47f7248122b9444b8d5fd3701028a1e8, 
> spark-role -> driver
>pod uid: 88de6467-a77c-11e8-b9da-a4bf0128b75b
>creation time: 2018-08-24T09:03:14Z
>service account name: spark
>volumes: spark-local-dir-1, spark-conf-volume, spark-token-kjxkv
>node name: tab-cmp4
>start time: 2018-08-24T09:03:14Z
>container images: rvesse/spark:latest
>phase: Running
>status: 
> [ContainerStatus(containerID=docker://23ae58571f59505e837dca40455d0347fb90e9b88e2a2b145a38e2919fceb447,
>  image=rvesse/spark:latest, 
> imageID=docker-pullable://rvesse/spark@sha256:92abf0b718743d0f5a26068fc94ec42233db0493c55a8570dc8c851c62a4bc0a,
>  lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=true, 
> restartCount=0, 
> state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2018-08-24T09:03:26Z,
>  additionalProperties={}), additionalProperties={}), terminated=null, 
> waiting=null, additionalProperties={}), additionalProperties={})]
> {noformat}
> The {{LoggingPodStatusWatcher}} actually already includes code to nicely 
> format this information but only invokes it at the end of the job:
> {noformat}
> 18/08/24 09:04:07 INFO LoggingPodStatusWatcherImpl: Container final statuses:
>  Container name: spark-kubernetes-driver
>  Container image: rvesse/spark:latest
>  Container state: Terminated
>  Exit code: 0
> {noformat}
> It would be nice if we used the nicer formatting consistently throughout the 
> logging.
> We have already patched this on our internal fork and will upstream a fix 
> shortly.
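
As an illustration only (not the internal patch mentioned above), here is a 
minimal sketch of the kind of formatting helper the watcher could apply on every 
state change, assuming the fabric8 Kubernetes client's ContainerStatus model:

{code:scala}
import io.fabric8.kubernetes.api.model.ContainerStatus

// Hypothetical helper (illustrative, not the upstreamed fix): render a container
// status in the compact form already used for the final status report.
def formatContainerStatus(status: ContainerStatus): String = {
  val state = status.getState
  val stateStr =
    if (state.getRunning != null) s"Running (started at ${state.getRunning.getStartedAt})"
    else if (state.getTerminated != null) s"Terminated (exit code ${state.getTerminated.getExitCode})"
    else if (state.getWaiting != null) s"Waiting (${state.getWaiting.getReason})"
    else "Unknown"
  s"""Container name: ${status.getName}
     |Container image: ${status.getImage}
     |Container state: $stateStr""".stripMargin
}
{code}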



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25222) Spark on Kubernetes Pod Watcher dumps raw container status

2018-11-06 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25222:
--

Assignee: Rob Vesse

> Spark on Kubernetes Pod Watcher dumps raw container status
> --
>
> Key: SPARK-25222
> URL: https://issues.apache.org/jira/browse/SPARK-25222
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark on Kubernetes provides logging of the pod/container status as a way to 
> monitor job progress. However, the logger just dumps the raw container status 
> object, leading to fairly unreadable output like this:
> {noformat}
> 18/08/24 09:03:27 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-groupby-1535101393784-driver
>namespace: default
>labels: spark-app-selector -> spark-47f7248122b9444b8d5fd3701028a1e8, 
> spark-role -> driver
>pod uid: 88de6467-a77c-11e8-b9da-a4bf0128b75b
>creation time: 2018-08-24T09:03:14Z
>service account name: spark
>volumes: spark-local-dir-1, spark-conf-volume, spark-token-kjxkv
>node name: tab-cmp4
>start time: 2018-08-24T09:03:14Z
>container images: rvesse/spark:latest
>phase: Running
>status: 
> [ContainerStatus(containerID=docker://23ae58571f59505e837dca40455d0347fb90e9b88e2a2b145a38e2919fceb447,
>  image=rvesse/spark:latest, 
> imageID=docker-pullable://rvesse/spark@sha256:92abf0b718743d0f5a26068fc94ec42233db0493c55a8570dc8c851c62a4bc0a,
>  lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=true, 
> restartCount=0, 
> state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2018-08-24T09:03:26Z,
>  additionalProperties={}), additionalProperties={}), terminated=null, 
> waiting=null, additionalProperties={}), additionalProperties={})]
> {noformat}
> The {{LoggingPodStatusWatcher}} actually already includes code to nicely 
> format this information but only invokes it at the end of the job:
> {noformat}
> 18/08/24 09:04:07 INFO LoggingPodStatusWatcherImpl: Container final statuses:
>  Container name: spark-kubernetes-driver
>  Container image: rvesse/spark:latest
>  Container state: Terminated
>  Exit code: 0
> {noformat}
> It would be nice if we used the nicer formatting consistently throughout the 
> logging.
> We have already patched this on our internal fork and will upstream a fix 
> shortly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25871) Streaming WAL should not use hdfs erasure coding, regardless of FS defaults

2018-11-06 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25871.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22882
[https://github.com/apache/spark/pull/22882]

> Streaming WAL should not use hdfs erasure coding, regardless of FS defaults
> ---
>
> Key: SPARK-25871
> URL: https://issues.apache.org/jira/browse/SPARK-25871
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 3.0.0
>
>
> The {{FileBasedWriteAheadLogWriter}} expects the output stream for the WAL to 
> support {{hflush()}}, but HDFS erasure-coded files do not support that; see the 
> limitations listed at 
> https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations
> If the WAL ends up on an erasure-coded file, you get exceptions like:
> {noformat}
> 17/10/17 17:31:34 ERROR executor.Executor: Exception in task 0.2 in stage 6.0 
> (TID 85)
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://quasar-yxckyb-1.vpc.cloudera.com:8020/tmp/__spark__a10be3a3-85ec-4d4f-8782-a4760df4cc4c/88657/checkpoints/receivedData/0/log-1508286672978-1508286732978,1321921,189000)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.EOFException: Cannot seek after EOF
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.seek(DFSStripedInputStream.java:331)
>   at 
> org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:120)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:142)
>   ... 18 more
> {noformat}
> HDFS allows you to force a file to be replicated, regardless of the FS 
> defaults -- we should do that for the WAL.
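
A minimal sketch of one way to force replication when creating the WAL stream, 
assuming Hadoop 3.x's HdfsDataOutputStreamBuilder and its replicate() option 
(illustrative only, not necessarily the merged change):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

// Hypothetical sketch (assumes the Hadoop 3.x builder API): create the WAL file
// with replication enforced so hflush()/hsync() keep working even when the
// parent directory carries an erasure coding policy.
def createReplicatedWalStream(path: Path, conf: Configuration): FSDataOutputStream = {
  path.getFileSystem(conf) match {
    case dfs: DistributedFileSystem =>
      dfs.createFile(path).replicate().build()   // opt this file out of erasure coding
    case fs =>
      fs.create(path)                            // non-HDFS file systems: plain create
  }
}
{code}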



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25871) Streaming WAL should not use hdfs erasure coding, regardless of FS defaults

2018-11-06 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25871:
--

Assignee: Imran Rashid

> Streaming WAL should not use hdfs erasure coding, regardless of FS defaults
> ---
>
> Key: SPARK-25871
> URL: https://issues.apache.org/jira/browse/SPARK-25871
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 3.0.0
>
>
> The {{FileBasedWriteAheadLogWriter}} expects the output stream for the WAL to 
> support {{hflush()}}, but HDFS erasure-coded files do not support that; see the 
> limitations listed at 
> https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations
> If the WAL ends up on an erasure-coded file, you get exceptions like:
> {noformat}
> 17/10/17 17:31:34 ERROR executor.Executor: Exception in task 0.2 in stage 6.0 
> (TID 85)
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://quasar-yxckyb-1.vpc.cloudera.com:8020/tmp/__spark__a10be3a3-85ec-4d4f-8782-a4760df4cc4c/88657/checkpoints/receivedData/0/log-1508286672978-1508286732978,1321921,189000)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.EOFException: Cannot seek after EOF
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.seek(DFSStripedInputStream.java:331)
>   at 
> org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:120)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:142)
>   ... 18 more
> {noformat}
> HDFS allows you to force a file to be replicated, regardless of the FS 
> defaults -- we should do that for the WAL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677163#comment-16677163
 ] 

Dongjoon Hyun commented on SPARK-25954:
---

Thanks. Of course, it's for Spark 3.0 next year.

> Upgrade to Kafka 2.1.0
> --
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.0 RC0 is out. Since it includes official JDK 11 support (KAFKA-7264), 
> we should upgrade to it.
> - 
> https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-06 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677152#comment-16677152
 ] 

Ted Yu commented on SPARK-25954:


Looking at the Kafka thread, a message from Satish indicated there may be another 
RC coming.

> Upgrade to Kafka 2.1.0
> --
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.0 RC0 is out. Since it includes official JDK 11 support (KAFKA-7264), 
> we should upgrade to it.
> - 
> https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25876) Simplify configuration types in k8s backend

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677138#comment-16677138
 ] 

Apache Spark commented on SPARK-25876:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22959

> Simplify configuration types in k8s backend
> ---
>
> Key: SPARK-25876
> URL: https://issues.apache.org/jira/browse/SPARK-25876
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a child of SPARK-25874 to deal with the current issues with the 
> different configuration objects used in the k8s backend. Please refer to the 
> parent for further discussion of what this means.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25876) Simplify configuration types in k8s backend

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25876:


Assignee: (was: Apache Spark)

> Simplify configuration types in k8s backend
> ---
>
> Key: SPARK-25876
> URL: https://issues.apache.org/jira/browse/SPARK-25876
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a child of SPARK-25874 to deal with the current issues with the 
> different configuration objects used in the k8s backend. Please refer to the 
> parent for further discussion of what this means.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25876) Simplify configuration types in k8s backend

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25876:


Assignee: Apache Spark

> Simplify configuration types in k8s backend
> ---
>
> Key: SPARK-25876
> URL: https://issues.apache.org/jira/browse/SPARK-25876
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> This is a child of SPARK-25874 to deal with the current issues with the 
> different configuration objects used in the k8s backend. Please refer to the 
> parent for further discussion of what this means.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25876) Simplify configuration types in k8s backend

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677135#comment-16677135
 ] 

Apache Spark commented on SPARK-25876:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22959

> Simplify configuration types in k8s backend
> ---
>
> Key: SPARK-25876
> URL: https://issues.apache.org/jira/browse/SPARK-25876
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a child of SPARK-25874 to deal with the current issues with the 
> different configuration objects used in the k8s backend. Please refer to the 
> parent for further discussion of what this means.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677107#comment-16677107
 ] 

Dongjoon Hyun commented on SPARK-25954:
---

Ping, [~te...@apache.org].
Are you interested in this since you did SPARK-18057?

> Upgrade to Kafka 2.1.0
> --
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.0 RC0 is out. Since it includes official JDK 11 support (KAFKA-7264), 
> we should upgrade to it.
> - 
> https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-06 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25954:
-

 Summary: Upgrade to Kafka 2.1.0
 Key: SPARK-25954
 URL: https://issues.apache.org/jira/browse/SPARK-25954
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Kafka 2.1.0 RC0 is out. Since it includes official JDK 11 support (KAFKA-7264), 
we should upgrade to it.
- 
https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25953) install jdk11 on jenkins workers

2018-11-06 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-25953:

Description: once we pin down exactly what we want installed on the jenkins 
workers, I will add it to our ansible and deploy.

> install jdk11 on jenkins workers
> 
>
> Key: SPARK-25953
> URL: https://issues.apache.org/jira/browse/SPARK-25953
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Critical
>
> once we pin down exactly what we want installed on the jenkins workers, I will 
> add it to our ansible and deploy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677073#comment-16677073
 ] 

Apache Spark commented on SPARK-25952:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22958

> from_json returns wrong result if corrupt record column is in the middle of 
> schema
> --
>
> Key: SPARK-25952
> URL: https://issues.apache.org/jira/browse/SPARK-25952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> If a user specifies a corrupt record column via 
> spark.sql.columnNameOfCorruptRecord or the JSON option 
> columnNameOfCorruptRecord, the schema with that column is propagated to the 
> Jackson parser. This breaks an assumption inside FailureSafeParser that a row 
> returned from the Jackson parser contains only actual data. As a consequence, 
> FailureSafeParser writes the bad record in the wrong position.
> For example:
> {code:scala}
> val schema = new StructType()
>   .add("a", IntegerType)
>   .add("_unparsed", StringType)
>   .add("b", IntegerType)
> val badRec = """{"a" 1, "b": 11}"""
> val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS()
> {code}
> the collect() action below
> {code:scala}
> df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> 
> "_unparsed"))).collect()
> {code}
> loses 12:
> {code}
> Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null)))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25952:


Assignee: Apache Spark

> from_json returns wrong result if corrupt record column is in the middle of 
> schema
> --
>
> Key: SPARK-25952
> URL: https://issues.apache.org/jira/browse/SPARK-25952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> If a user specifies a corrupt record column via 
> spark.sql.columnNameOfCorruptRecord or the JSON option 
> columnNameOfCorruptRecord, the schema with that column is propagated to the 
> Jackson parser. This breaks an assumption inside FailureSafeParser that a row 
> returned from the Jackson parser contains only actual data. As a consequence, 
> FailureSafeParser writes the bad record in the wrong position.
> For example:
> {code:scala}
> val schema = new StructType()
>   .add("a", IntegerType)
>   .add("_unparsed", StringType)
>   .add("b", IntegerType)
> val badRec = """{"a" 1, "b": 11}"""
> val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS()
> {code}
> the collect() action below
> {code:scala}
> df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> 
> "_unparsed"))).collect()
> {code}
> loses 12:
> {code}
> Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null)))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25952:


Assignee: (was: Apache Spark)

> from_json returns wrong result if corrupt record column is in the middle of 
> schema
> --
>
> Key: SPARK-25952
> URL: https://issues.apache.org/jira/browse/SPARK-25952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> If a user specifies a corrupt record column via 
> spark.sql.columnNameOfCorruptRecord or the JSON option 
> columnNameOfCorruptRecord, the schema with that column is propagated to the 
> Jackson parser. This breaks an assumption inside FailureSafeParser that a row 
> returned from the Jackson parser contains only actual data. As a consequence, 
> FailureSafeParser writes the bad record in the wrong position.
> For example:
> {code:scala}
> val schema = new StructType()
>   .add("a", IntegerType)
>   .add("_unparsed", StringType)
>   .add("b", IntegerType)
> val badRec = """{"a" 1, "b": 11}"""
> val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS()
> {code}
> the collect() action below
> {code:scala}
> df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> 
> "_unparsed"))).collect()
> {code}
> loses 12:
> {code}
> Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null)))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25953) install jdk11 on jenkins workers

2018-11-06 Thread shane knapp (JIRA)
shane knapp created SPARK-25953:
---

 Summary: install jdk11 on jenkins workers
 Key: SPARK-25953
 URL: https://issues.apache.org/jira/browse/SPARK-25953
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.0.0
Reporter: shane knapp
Assignee: shane knapp






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema

2018-11-06 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25952:
--

 Summary: from_json returns wrong result if corrupt record column 
is in the middle of schema
 Key: SPARK-25952
 URL: https://issues.apache.org/jira/browse/SPARK-25952
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


If a user specifies a corrupt record column via 
spark.sql.columnNameOfCorruptRecord or the JSON option columnNameOfCorruptRecord, 
the schema with that column is propagated to the Jackson parser. This breaks an 
assumption inside FailureSafeParser that a row returned from the Jackson parser 
contains only actual data. As a consequence, FailureSafeParser writes the bad 
record in the wrong position.

For example:
{code:scala}
val schema = new StructType()
  .add("a", IntegerType)
  .add("_unparsed", StringType)
  .add("b", IntegerType)
val badRec = """{"a" 1, "b": 11}"""
val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS()
{code}
the collect() action below
{code:scala}
df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> 
"_unparsed"))).collect()
{code}
loses 12:
{code}
Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null)))
{code}
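For contrast, the presumably intended result (my inference from the schema, not 
stated explicitly in this ticket) keeps 12 for the good record and routes the bad 
record into the corrupt column, i.e. something like:
{code}
Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, 12)))
{code}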



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11

2018-11-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677040#comment-16677040
 ] 

Dongjoon Hyun commented on SPARK-24417:
---

+1 for moving to Scala 2.12 as the default Scala version.

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK 11.
> As JDK 8 is reaching EOL and JDK 9 and 10 are already end-of-life, per 
> community discussion we will skip JDK 9 and 10 and support JDK 11 directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25951) Redundant shuffle if column is renamed

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25951:


Assignee: (was: Apache Spark)

> Redundant shuffle if column is renamed
> --
>
> Key: SPARK-25951
> URL: https://issues.apache.org/jira/browse/SPARK-25951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We've noticed that sometimes a column rename causes an extra shuffle:
> {code:java}
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
> val t2 = spark.range(N).selectExpr("floor(id/4) as key2")
> import org.apache.spark.sql.functions._
> t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
> .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
>  "key3"),
> col("key1")===col("key3"))
> .explain()
> {code}
> results in:
> {noformat}
> == Physical Plan ==
> *(6) SortMergeJoin [key1#6L], [key3#22L], Inner
> :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
> :  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], 
> output=[key1#6L, cnt1#14L])
> : +- Exchange hashpartitioning(key1#6L, 2)
> :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
> output=[key1#6L, count#39L])
> :   +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
> :  +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
> : +- *(1) Range (0, 4096, step=1, splits=1)
> +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(key3#22L, 2)
>   +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], 
> output=[key3#22L, cnt2#19L])
>  +- Exchange hashpartitioning(key2#10L, 2)
> +- *(3) HashAggregate(keys=[key2#10L], 
> functions=[partial_count(1)], output=[key2#10L, count#41L])
>+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS 
> key2#10L]
>   +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 
> 4.0)))
>  +- *(3) Range (0, 4096, step=1, splits=1)
> {noformat}
> I was able to track it down to this code in class HashPartitioning:
> {code:java}
> case h: HashClusteredDistribution =>
>   expressions.length == h.expressions.length && 
> expressions.zip(h.expressions).forall {
>   case (l, r) => l.semanticEquals(r)
>  }
> {code}
> semanticEquals returns false because it compares key2 and key3, even though 
> key3 is just a rename of key2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25951) Redundant shuffle if column is renamed

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25951:


Assignee: Apache Spark

> Redundant shuffle if column is renamed
> --
>
> Key: SPARK-25951
> URL: https://issues.apache.org/jira/browse/SPARK-25951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Assignee: Apache Spark
>Priority: Minor
>
> We've noticed that sometimes a column rename causes an extra shuffle:
> {code:java}
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
> val t2 = spark.range(N).selectExpr("floor(id/4) as key2")
> import org.apache.spark.sql.functions._
> t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
> .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
>  "key3"),
> col("key1")===col("key3"))
> .explain()
> {code}
> results in:
> {noformat}
> == Physical Plan ==
> *(6) SortMergeJoin [key1#6L], [key3#22L], Inner
> :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
> :  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], 
> output=[key1#6L, cnt1#14L])
> : +- Exchange hashpartitioning(key1#6L, 2)
> :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
> output=[key1#6L, count#39L])
> :   +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
> :  +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
> : +- *(1) Range (0, 4096, step=1, splits=1)
> +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(key3#22L, 2)
>   +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], 
> output=[key3#22L, cnt2#19L])
>  +- Exchange hashpartitioning(key2#10L, 2)
> +- *(3) HashAggregate(keys=[key2#10L], 
> functions=[partial_count(1)], output=[key2#10L, count#41L])
>+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS 
> key2#10L]
>   +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 
> 4.0)))
>  +- *(3) Range (0, 4096, step=1, splits=1)
> {noformat}
> I was able to track it down to this code in class HashPartitioning:
> {code:java}
> case h: HashClusteredDistribution =>
>   expressions.length == h.expressions.length && 
> expressions.zip(h.expressions).forall {
>   case (l, r) => l.semanticEquals(r)
>  }
> {code}
> semanticEquals returns false because it compares key2 and key3, even though 
> key3 is just a rename of key2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25951) Redundant shuffle if column is renamed

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676941#comment-16676941
 ] 

Apache Spark commented on SPARK-25951:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22957

> Redundant shuffle if column is renamed
> --
>
> Key: SPARK-25951
> URL: https://issues.apache.org/jira/browse/SPARK-25951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We've noticed that sometimes a column rename causes an extra shuffle:
> {code:java}
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
> val t2 = spark.range(N).selectExpr("floor(id/4) as key2")
> import org.apache.spark.sql.functions._
> t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
> .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
>  "key3"),
> col("key1")===col("key3"))
> .explain()
> {code}
> results in:
> {noformat}
> == Physical Plan ==
> *(6) SortMergeJoin [key1#6L], [key3#22L], Inner
> :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
> :  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], 
> output=[key1#6L, cnt1#14L])
> : +- Exchange hashpartitioning(key1#6L, 2)
> :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
> output=[key1#6L, count#39L])
> :   +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
> :  +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
> : +- *(1) Range (0, 4096, step=1, splits=1)
> +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(key3#22L, 2)
>   +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], 
> output=[key3#22L, cnt2#19L])
>  +- Exchange hashpartitioning(key2#10L, 2)
> +- *(3) HashAggregate(keys=[key2#10L], 
> functions=[partial_count(1)], output=[key2#10L, count#41L])
>+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS 
> key2#10L]
>   +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 
> 4.0)))
>  +- *(3) Range (0, 4096, step=1, splits=1)
> {noformat}
> I was able to track it down to this code in class HashPartitioning:
> {code:java}
> case h: HashClusteredDistribution =>
>   expressions.length == h.expressions.length && 
> expressions.zip(h.expressions).forall {
>   case (l, r) => l.semanticEquals(r)
>  }
> {code}
> semanticEquals returns false because it compares key2 and key3, even though 
> key3 is just a rename of key2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25951) Redundant shuffle if column is renamed

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676942#comment-16676942
 ] 

Apache Spark commented on SPARK-25951:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22957

> Redundant shuffle if column is renamed
> --
>
> Key: SPARK-25951
> URL: https://issues.apache.org/jira/browse/SPARK-25951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We've noticed that sometimes a column rename causes an extra shuffle:
> {code:java}
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
> val t2 = spark.range(N).selectExpr("floor(id/4) as key2")
> import org.apache.spark.sql.functions._
> t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
> .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
>  "key3"),
> col("key1")===col("key3"))
> .explain()
> {code}
> results in:
> {noformat}
> == Physical Plan ==
> *(6) SortMergeJoin [key1#6L], [key3#22L], Inner
> :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
> :  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], 
> output=[key1#6L, cnt1#14L])
> : +- Exchange hashpartitioning(key1#6L, 2)
> :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
> output=[key1#6L, count#39L])
> :   +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
> :  +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
> : +- *(1) Range (0, 4096, step=1, splits=1)
> +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(key3#22L, 2)
>   +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], 
> output=[key3#22L, cnt2#19L])
>  +- Exchange hashpartitioning(key2#10L, 2)
> +- *(3) HashAggregate(keys=[key2#10L], 
> functions=[partial_count(1)], output=[key2#10L, count#41L])
>+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS 
> key2#10L]
>   +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 
> 4.0)))
>  +- *(3) Range (0, 4096, step=1, splits=1)
> {noformat}
> I was able to track it down to this code in class HashPartitioning:
> {code:java}
> case h: HashClusteredDistribution =>
>   expressions.length == h.expressions.length && 
> expressions.zip(h.expressions).forall {
>   case (l, r) => l.semanticEquals(r)
>  }
> {code}
> semanticEquals returns false because it compares key2 and key3, even though 
> key3 is just a rename of key2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25951) Redundant shuffle if column is renamed

2018-11-06 Thread Ohad Raviv (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad Raviv updated SPARK-25951:
---
Description: 
We've noticed that sometimes a column rename causes an extra shuffle:
{code:java}
val N = 1 << 12

spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")

val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")

import org.apache.spark.sql.functions._
t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
.join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
 "key3"),
col("key1")===col("key3"))
.explain()
{code}
results in:
{noformat}
== Physical Plan ==
*(6) SortMergeJoin [key1#6L], [key3#22L], Inner
:- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
:  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, 
cnt1#14L])
: +- Exchange hashpartitioning(key1#6L, 2)
:+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
output=[key1#6L, count#39L])
:   +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
:  +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
: +- *(1) Range (0, 4096, step=1, splits=1)
+- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(key3#22L, 2)
  +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], 
output=[key3#22L, cnt2#19L])
 +- Exchange hashpartitioning(key2#10L, 2)
+- *(3) HashAggregate(keys=[key2#10L], 
functions=[partial_count(1)], output=[key2#10L, count#41L])
   +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS 
key2#10L]
  +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0)))
 +- *(3) Range (0, 4096, step=1, splits=1)
{noformat}
I was able to track it down to this code in class HashPartitioning:
{code:java}
case h: HashClusteredDistribution =>
  expressions.length == h.expressions.length && 
expressions.zip(h.expressions).forall {
  case (l, r) => l.semanticEquals(r)
 }
{code}
semanticEquals returns false because it compares key2 and key3, even though key3 
is just a rename of key2.

  was:
We've noticed that sometimes a column rename causes an extra shuffle:
{code}
val N = 1 << 12

spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")

val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")

t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
.join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
 "key3"),
col("key1")===col("key3"))
.explain()
{code}

results in:

{noformat}
== Physical Plan ==
*(6) SortMergeJoin [key1#6L], [key3#22L], Inner
:- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
:  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, 
cnt1#14L])
: +- Exchange hashpartitioning(key1#6L, 2)
:+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
output=[key1#6L, count#39L])
:   +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
:  +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
: +- *(1) Range (0, 4096, step=1, splits=1)
+- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(key3#22L, 2)
  +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], 
output=[key3#22L, cnt2#19L])
 +- Exchange hashpartitioning(key2#10L, 2)
+- *(3) HashAggregate(keys=[key2#10L], 
functions=[partial_count(1)], output=[key2#10L, count#41L])
   +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS 
key2#10L]
  +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0)))
 +- *(3) Range (0, 4096, step=1, splits=1)
{noformat}
I was able to track it down to this code in class HashPartitioning:
{code}
case h: HashClusteredDistribution =>
  expressions.length == h.expressions.length && 
expressions.zip(h.expressions).forall {
  case (l, r) => l.semanticEquals(r)
 }
{code}
semanticEquals returns false because it compares key2 and key3, even though key3 
is just a rename of key2.


> Redundant shuffle if column is renamed
> --
>
> Key: SPARK-25951
> URL: https://issues.apache.org/jira/browse/SPARK-25951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We've noticed that sometimes a column rename causes an extra shuffle:
> {code:java}
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
> val t2 = spark.range(N).selectExpr("floor(id/4) as key2")
> import org.apache.spark.sql.functions._

[jira] [Resolved] (SPARK-25866) Update KMeans formatVersion

2018-11-06 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25866.
-
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

Issue resolved by pull request 22873
[https://github.com/apache/spark/pull/22873]

> Update KMeans formatVersion
> ---
>
> Key: SPARK-25866
> URL: https://issues.apache.org/jira/browse/SPARK-25866
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> KMeans's {{formatVersion}} was not updated to 2.0 when the distanceMeasure 
> parameter was added. Although this causes no issue, as {{formatVersion}} is not 
> used anywhere, the information returned is wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25866) Update KMeans formatVersion

2018-11-06 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25866:
---

Assignee: Marco Gaido

> Update KMeans formatVersion
> ---
>
> Key: SPARK-25866
> URL: https://issues.apache.org/jira/browse/SPARK-25866
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> KMeans's {{formatVersion}} was not updated to 2.0 when the distanceMeasure 
> parameter was added. Although this causes no issue, as {{formatVersion}} is not 
> used anywhere, the information returned is wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25951) Redundant shuffle if column is renamed

2018-11-06 Thread Ohad Raviv (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad Raviv updated SPARK-25951:
---
Description: 
We've noticed that sometimes a column rename causes an extra shuffle:
{code}
val N = 1 << 12

spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")

val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")

t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
.join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
 "key3"),
col("key1")===col("key3"))
.explain()
{code}

results in:

{noformat}
== Physical Plan ==
*(6) SortMergeJoin [key1#6L], [key3#22L], Inner
:- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
:  +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, 
cnt1#14L])
: +- Exchange hashpartitioning(key1#6L, 2)
:+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
output=[key1#6L, count#39L])
:   +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
:  +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
: +- *(1) Range (0, 4096, step=1, splits=1)
+- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(key3#22L, 2)
  +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], 
output=[key3#22L, cnt2#19L])
 +- Exchange hashpartitioning(key2#10L, 2)
+- *(3) HashAggregate(keys=[key2#10L], 
functions=[partial_count(1)], output=[key2#10L, count#41L])
   +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS 
key2#10L]
  +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0)))
 +- *(3) Range (0, 4096, step=1, splits=1)
{noformat}
I was able to track it down to this code in class HashPartitioning:
{code}
case h: HashClusteredDistribution =>
  expressions.length == h.expressions.length && 
expressions.zip(h.expressions).forall {
  case (l, r) => l.semanticEquals(r)
 }
{code}
semanticEquals returns false because it compares key2 and key3, even though key3 
is just a rename of key2.

  was:
We've noticed that sometimes a column rename causes an extra shuffle:
{code}
val N = 1 << 12

spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")

val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")

t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
.join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
 "key3"),
col("key1")===col("key3"))
.explain(true)
{code}

results in:

{code}
== Physical Plan ==
*(6) SortMergeJoin [key1#6L], [key3#22L], Inner
:- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
: +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, 
cnt1#14L])
: +- Exchange hashpartitioning(key1#6L, 2)
: +- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
output=[key1#6L, count#39L])
: +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
: +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
: +- *(1) Range (0, 4096, step=1, splits=1)
+- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(key3#22L, 2)
+- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, 
cnt2#19L])
+- Exchange hashpartitioning(key2#10L, 2)
+- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], 
output=[key2#10L, count#41L])
+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L]
+- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0)))
+- *(3) Range (0, 4096, step=1, splits=1)
{code}
I was able to track it down to this code in class HashPartitioning:
{code}
case h: HashClusteredDistribution =>
  expressions.length == h.expressions.length && 
expressions.zip(h.expressions).forall {
  case (l, r) => l.semanticEquals(r)
 }
{code}
semanticEquals returns false because it compares key2 and key3, even though key3 
is just a rename of key2.


> Redundant shuffle if column is renamed
> --
>
> Key: SPARK-25951
> URL: https://issues.apache.org/jira/browse/SPARK-25951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We've noticed that sometimes a column rename causes an extra shuffle:
> {code}
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
> val t2 = spark.range(N).selectExpr("floor(id/4) as key2")
> t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
> .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
>  "key3"),
> col("key1")===col("key3"))
> .explain()
> {code}
> results in:
> {noformat}
> == 

[jira] [Created] (SPARK-25951) Redundant shuffle if column is renamed

2018-11-06 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-25951:
--

 Summary: Redundant shuffle if column is renamed
 Key: SPARK-25951
 URL: https://issues.apache.org/jira/browse/SPARK-25951
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ohad Raviv


We've noticed that sometimes a column rename causes an extra shuffle:
{code}
val N = 1 << 12

spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")

val t1 = spark.range(N).selectExpr("floor(id/4) as key1")
val t2 = spark.range(N).selectExpr("floor(id/4) as key2")

t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
.join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2",
 "key3"),
col("key1")===col("key3"))
.explain(true)
{code}

results in:

{code}
== Physical Plan ==
*(6) SortMergeJoin [key1#6L], [key3#22L], Inner
:- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0
: +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, 
cnt1#14L])
: +- Exchange hashpartitioning(key1#6L, 2)
: +- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], 
output=[key1#6L, count#39L])
: +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L]
: +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0)))
: +- *(1) Range (0, 4096, step=1, splits=1)
+- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(key3#22L, 2)
+- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, 
cnt2#19L])
+- Exchange hashpartitioning(key2#10L, 2)
+- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], 
output=[key2#10L, count#41L])
+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L]
+- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0)))
+- *(3) Range (0, 4096, step=1, splits=1)
{code}
I was able to track it down to this code in class HashPartitioning:
{code}
case h: HashClusteredDistribution =>
  expressions.length == h.expressions.length && 
expressions.zip(h.expressions).forall {
  case (l, r) => l.semanticEquals(r)
 }
{code}
semanticEquals returns false because it compares key2 and key3, even though key3 
is just a rename of key2.
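
As a purely user-level illustration (an assumption on my part, not the fix 
tracked by this ticket), one way to dodge the extra Exchange today is to do the 
rename before the aggregation, so the aggregation's hash partitioning is already 
on the join column; a sketch reusing t1/t2 from the example above:

{code:scala}
import org.apache.spark.sql.functions._

// Workaround sketch (not the proposed Spark-side fix): rename key2 to key3
// *before* groupBy, so the shuffle introduced by the aggregation is keyed on
// key3 and the join on key1 === key3 needs no additional Exchange.
val right = t2.withColumnRenamed("key2", "key3")
  .groupBy("key3")
  .agg(count(lit("1")).as("cnt2"))

t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
  .join(right, col("key1") === col("key3"))
  .explain()
{code}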



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2018-11-06 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-22148.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
>Assignee: Dhruve Ashar
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors from the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 
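
For reference, the configuration combination described above is roughly the following (a 
sketch with the default values quoted in the description; illustrative only, not a 
recommendation):
{code}
// Sketch only: standard Spark configs with the defaults mentioned above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.blacklist.enabled", "true")
  .config("spark.network.timeout", "120s")                      // heartbeat timeout
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s") // idle executors released
  .config("spark.speculation", "false")
  .getOrCreate()
{code}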



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2018-11-06 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-22148:
-

Assignee: Dhruve Ashar

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
>Assignee: Dhruve Ashar
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors from the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2018-11-06 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-15815.
---
Resolution: Duplicate

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable executor blacklisting with a blacklist timeout larger than 120s, and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left running on Executor A, and all other executors have 
> timed out and been released as idle.
> 2. The task fails, so it will not be scheduled on Executor A again because of the 
> blacklist timeout.
> 3. The ExecutorAllocationManager keeps requesting targetNumExecutors = 1. Because 
> Executor A is still registered, oldTargetNumExecutors == targetNumExecutors = 1, so no 
> additional executors are ever added, even after Executor A itself times out. It ends up 
> endlessly requesting a delta of 0 executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25950:


Assignee: (was: Apache Spark)

> from_csv should respect to spark.sql.columnNameOfCorruptRecord
> --
>
> Key: SPARK-25950
> URL: https://issues.apache.org/jira/browse/SPARK-25950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The from_csv() function should respect the SQL config 
> *spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes 
> into account only the CSV option *columnNameOfCorruptRecord*.
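
For illustration, a minimal sketch of the current vs. proposed behavior (assuming df is a 
DataFrame with a string column "value"; the exact API shape is assumed, not taken from the 
patch):
{code}
import org.apache.spark.sql.functions.{col, from_csv}
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)

// Today: the corrupt-record column must be named via the CSV option.
df.select(from_csv(col("value"), schema, Map("columnNameOfCorruptRecord" -> "_corrupt_record")))

// Proposed: fall back to the session config, as from_json() already does.
spark.conf.set("spark.sql.columnNameOfCorruptRecord", "_corrupt_record")
df.select(from_csv(col("value"), schema, Map.empty[String, String]))
{code}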



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676751#comment-16676751
 ] 

Apache Spark commented on SPARK-25950:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22956

> from_csv should respect to spark.sql.columnNameOfCorruptRecord
> --
>
> Key: SPARK-25950
> URL: https://issues.apache.org/jira/browse/SPARK-25950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The from_csv() function should respect the SQL config 
> *spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes 
> into account only the CSV option *columnNameOfCorruptRecord*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676752#comment-16676752
 ] 

Apache Spark commented on SPARK-25950:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22956

> from_csv should respect to spark.sql.columnNameOfCorruptRecord
> --
>
> Key: SPARK-25950
> URL: https://issues.apache.org/jira/browse/SPARK-25950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The from_csv() function should respect the SQL config 
> *spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes 
> into account only the CSV option *columnNameOfCorruptRecord*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25950:


Assignee: Apache Spark

> from_csv should respect to spark.sql.columnNameOfCorruptRecord
> --
>
> Key: SPARK-25950
> URL: https://issues.apache.org/jira/browse/SPARK-25950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The from_csv() function should respect the SQL config 
> *spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes 
> into account only the CSV option *columnNameOfCorruptRecord*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord

2018-11-06 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25950:
--

 Summary: from_csv should respect to 
spark.sql.columnNameOfCorruptRecord
 Key: SPARK-25950
 URL: https://issues.apache.org/jira/browse/SPARK-25950
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The from_csv() function should respect the SQL config 
*spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes 
into account only the CSV option *columnNameOfCorruptRecord*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25949:


Assignee: Apache Spark

> Add test for optimizer rule PullOutPythonUDFInJoinCondition
> ---
>
> Key: SPARK-25949
> URL: https://issues.apache.org/jira/browse/SPARK-25949
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> As commented in the [PR that introduced 
> PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], 
> the newly added optimizer rule is currently only tested end-to-end on the Python side; 
> we need to add suites under org.apache.spark.sql.catalyst.optimizer like the other 
> optimizer rules.
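
A rough skeleton of such a suite (assuming the usual Catalyst test helpers PlanTest and 
RuleExecutor, as used by the other optimizer rule suites; the actual test cases would 
build joins whose condition contains a PythonUDF and compare the optimized plans):
{code}
package org.apache.spark.sql.catalyst.optimizer

import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.RuleExecutor

class PullOutPythonUDFInJoinConditionSuite extends PlanTest {

  // Run only the rule under test, once.
  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches =
      Batch("Extract PythonUDF From JoinCondition", Once,
        PullOutPythonUDFInJoinCondition) :: Nil
  }

  // test("pull out PythonUDF from inner join condition") {
  //   val query = ...                       // a join with a PythonUDF in its condition
  //   val optimized = Optimize.execute(query.analyze)
  //   comparePlans(optimized, expectedPlan) // UDF expected to move above the join
  // }
}
{code}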



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676703#comment-16676703
 ] 

Apache Spark commented on SPARK-25949:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/22955

> Add test for optimizer rule PullOutPythonUDFInJoinCondition
> ---
>
> Key: SPARK-25949
> URL: https://issues.apache.org/jira/browse/SPARK-25949
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Priority: Major
>
> As commented in the [PR that introduced 
> PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], 
> the newly added optimizer rule is currently only tested end-to-end on the Python side; 
> we need to add suites under org.apache.spark.sql.catalyst.optimizer like the other 
> optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition

2018-11-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25949:


Assignee: (was: Apache Spark)

> Add test for optimizer rule PullOutPythonUDFInJoinCondition
> ---
>
> Key: SPARK-25949
> URL: https://issues.apache.org/jira/browse/SPARK-25949
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Priority: Major
>
> As commented in the [PR that introduced 
> PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], 
> the newly added optimizer rule is currently only tested end-to-end on the Python side; 
> we need to add suites under org.apache.spark.sql.catalyst.optimizer like the other 
> optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition

2018-11-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676702#comment-16676702
 ] 

Apache Spark commented on SPARK-25949:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/22955

> Add test for optimizer rule PullOutPythonUDFInJoinCondition
> ---
>
> Key: SPARK-25949
> URL: https://issues.apache.org/jira/browse/SPARK-25949
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Priority: Major
>
> As commented in the [PR that introduced 
> PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], 
> the newly added optimizer rule is currently only tested end-to-end on the Python side; 
> we need to add suites under org.apache.spark.sql.catalyst.optimizer like the other 
> optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition

2018-11-06 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-25949:

Summary: Add test for optimizer rule PullOutPythonUDFInJoinCondition  (was: 
Add test for optimize rule PullOutPythonUDFInJoinCondition)

> Add test for optimizer rule PullOutPythonUDFInJoinCondition
> ---
>
> Key: SPARK-25949
> URL: https://issues.apache.org/jira/browse/SPARK-25949
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Priority: Major
>
> As commented in the [PR that introduced 
> PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], 
> the newly added optimizer rule is currently only tested end-to-end on the Python side; 
> we need to add suites under org.apache.spark.sql.catalyst.optimizer like the other 
> optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25949) Add test for optimize rule PullOutPythonUDFInJoinCondition

2018-11-06 Thread Yuanjian Li (JIRA)
Yuanjian Li created SPARK-25949:
---

 Summary: Add test for optimize rule PullOutPythonUDFInJoinCondition
 Key: SPARK-25949
 URL: https://issues.apache.org/jira/browse/SPARK-25949
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuanjian Li


As commented in the [PR that introduced 
PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], 
the newly added optimizer rule is currently only tested end-to-end on the Python side; we 
need to add suites under org.apache.spark.sql.catalyst.optimizer like the other optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20389) Upgrade kryo to fix NegativeArraySizeException

2018-11-06 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-20389.
-
Resolution: Duplicate

Fixed by https://github.com/apache/spark/pull/22179

> Upgrade kryo to fix NegativeArraySizeException
> --
>
> Key: SPARK-20389
> URL: https://issues.apache.org/jira/browse/SPARK-20389
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
> Environment: Linux, Centos7, jdk8
>Reporter: Georg Heiler
>Priority: Major
>
> I am experiencing an issue with Kryo when writing Parquet files. Similar to 
> https://github.com/broadinstitute/gatk/issues/1524, a 
> NegativeArraySizeException occurs. Apparently this is fixed in a more recent 
> Kryo version, but Spark is still using the very old Kryo 3.3. 
> Could you please upgrade to a fixed Kryo version?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25937) Support user-defined schema in Kafka Source & Sink

2018-11-06 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676374#comment-16676374
 ] 

Jackey Lee commented on SPARK-25937:


Not exactly the same thing.
 The current implementation receives the format and schema, and then parses or merges 
the user data. The following is a simple example.
{code}
val format = "json"
val schema: StructType = new StructType().add("name", StringType).add("age", IntegerType)
df.withFormat(format, schema).parseValue(df.value).select("name", "age")
{code}
For other, currently unsupported formats such as "protobuf" or "avro", users could 
redefine parseValue/combineValue and use the same code to parse the data.
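
For comparison, the manual conversion that this proposal aims to avoid looks roughly like 
this today (a sketch using the existing Kafka source and from_json; the broker address 
and topic name are made up):
{code}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema = new StructType().add("name", StringType).add("age", IntegerType)

// Status quo: read the fixed Kafka schema, then parse the value column by hand.
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // made-up address
  .option("subscribe", "events")                    // made-up topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.name", "data.age")
{code}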

> Support user-defined schema in Kafka Source & Sink
> --
>
> Key: SPARK-25937
> URL: https://issues.apache.org/jira/browse/SPARK-25937
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jackey Lee
>Priority: Major
>
>     The Kafka Source & Sink is widely used in Spark and is among the most common 
> sources in streaming production environments. But at present, both the Kafka Source 
> and Sink use a fixed schema, which forces users to do data conversion when 
> reading and writing Kafka. So why not use a FileFormat-like mechanism for this, just 
> as Hive does?
>     Flink has implemented JSON/CSV/Avro extensions of its Kafka Source & Sink; we 
> can also support this in Spark.
> *Main Goals:*
> 1. Provide a Source and Sink that support a user-defined schema. Users can read 
> and write Kafka directly in their programs without additional data conversion.
> 2. Provide a read-write mechanism based on FileFormat. Since the user's data conversion 
> is similar to FileFormat's read and write process, we can provide a FileFormat-like 
> mechanism that offers common read-write format conversion. It should also allow users 
> to customize the format conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org