[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575778#comment-16575778
 ] 

Saisai Shao commented on SPARK-25084:
-

I see. Unfortunately I've already cut RC4; if it is worth including in 2.3.2, I will
cut a new RC.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575777#comment-16575777
 ] 

Yuming Wang commented on SPARK-25084:
-

It's a regression.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575774#comment-16575774
 ] 

Saisai Shao commented on SPARK-25084:
-

Is this a regression, or just a bug that existed in older versions?

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Updated] (SPARK-25067) Active tasks exceed total cores of an executor in WebUI

2018-08-09 Thread StanZhai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-25067:
-
Attachment: (was: 1533128203469_2.png)

> Active tasks exceed total cores of an executor in WebUI
> ---
>
> Key: SPARK-25067
> URL: https://issues.apache.org/jira/browse/SPARK-25067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Major
> Attachments: WechatIMG1.jpeg
>
>







[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575772#comment-16575772
 ] 

Saisai Shao commented on SPARK-25084:
-

I'm already preparing a new RC4. If this is not a severe issue, I would not block
the RC4 release.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575771#comment-16575771
 ] 

Yuming Wang commented on SPARK-25084:
-

[~smilegator], [~jerryshao] I think it should target 2.3.2.

 

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Assigned] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25084:


Assignee: Apache Spark

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Assignee: Apache Spark
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575767#comment-16575767
 ] 

Apache Spark commented on SPARK-25084:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22066

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Assigned] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25084:


Assignee: (was: Apache Spark)

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Created] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread yucai (JIRA)
yucai created SPARK-25084:
-

 Summary: "distribute by" on multiple columns may lead to codegen 
issue
 Key: SPARK-25084
 URL: https://issues.apache.org/jira/browse/SPARK-25084
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: yucai


Test Query:
{code:java}
select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
ss_net_profit) limit 1000;{code}
Wrong Codegen:
{code:java}
/* 146 */ private int computeHashForStruct_0(InternalRow mutableStateArray[0], int value1) {
/* 147 */
/* 148 */
/* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
/* 150 */
/* 151 */ final int element = mutableStateArray[0].getInt(0);
/* 152 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
/* 153 */
/* 154 */ }{code}
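Note that the generated method declares its parameter as `InternalRow mutableStateArray[0]`, which is not a legal Java parameter declaration. As a rough illustration only (not taken from this ticket), the same multi-column hash-partitioning path can be exercised from the DataFrame API; the sketch below assumes a running SparkSession `spark` with the TPC-DS `store_sales` table registered:
{code:scala}
// Hypothetical reproduction sketch: repartitioning by the same expressions goes
// through the same Murmur3 hash codegen that "distribute by" uses.
import org.apache.spark.sql.functions.col

val cols = Seq("ss_sold_time_sk", "ss_item_sk", "ss_customer_sk", "ss_cdemo_sk",
  "ss_addr_sk", "ss_promo_sk", "ss_ext_list_price", "ss_net_profit").map(col)

spark.table("store_sales")
  .repartition(cols: _*)   // roughly the DataFrame counterpart of DISTRIBUTE BY
  .limit(1000)
  .collect()
{code}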
 






[jira] [Commented] (SPARK-23992) ShuffleDependency does not need to be deserialized every time

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575747#comment-16575747
 ] 

Apache Spark commented on SPARK-23992:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/22065

> ShuffleDependency does not need to be deserialized every time
> -
>
> Key: SPARK-23992
> URL: https://issues.apache.org/jira/browse/SPARK-23992
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> Within the same stage, 'ShuffleDependency' does not need to be deserialized
> each time.






[jira] [Closed] (SPARK-25052) Is there any possibility that spark structured streaming generate duplicates in the output?

2018-08-09 Thread bharath kumar avusherla (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla closed SPARK-25052.
---

> Is there any possibility that spark structured streaming generate duplicates 
> in the output?
> ---
>
> Key: SPARK-25052
> URL: https://issues.apache.org/jira/browse/SPARK-25052
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Minor
>
> We recently observed that Spark Structured Streaming generated duplicates
> in the output when reading from a Kafka topic and storing the output to S3
> (and checkpointing in S3). We ran into this issue twice, and it is not
> reproducible. Has anyone ever faced this kind of issue before? Is
> this because of S3 eventual consistency?






[jira] [Updated] (SPARK-23243) Shuffle+Repartition on an RDD could lead to incorrect answers

2018-08-09 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23243:

Labels: correctness  (was: )

> Shuffle+Repartition on an RDD could lead to incorrect answers
> -
>
> Key: SPARK-23243
> URL: https://issues.apache.org/jira/browse/SPARK-23243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> RDD repartition also uses round-robin to distribute data, so it can cause
> incorrect answers on RDD workloads in the same way as
> https://issues.apache.org/jira/browse/SPARK-23207
> The approach that fixes DataFrame.repartition() doesn't apply to the RDD
> repartition issue, as discussed in
> https://github.com/apache/spark/pull/20393#issuecomment-360912451
> We track alternative solutions for this issue in this task.






[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575672#comment-16575672
 ] 

Apache Spark commented on SPARK-25044:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22063

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.
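A standalone sketch (not from this ticket) of the unboxing behaviour behind this: when a lambda is invoked through its erased `Object`-typed apply, Scala unboxes `null` to the primitive default, so an `Int` parameter sees 0:
{code:scala}
// Illustration only; this is plain Scala, not Spark code.
object NullUnboxingSketch {
  def main(args: Array[String]): Unit = {
    val add: (Int, Int) => Int = (x, y) => x + y
    // Calling through the generic (Any, Any) => Any view forces boxing/unboxing.
    val erased = add.asInstanceOf[(Any, Any) => Any]
    println(erased(null, 10))                             // prints 10: null unboxed to 0
    println(scala.runtime.BoxesRunTime.unboxToInt(null))  // prints 0
  }
}
{code}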






[jira] [Assigned] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25044:


Assignee: (was: Apache Spark)

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.






[jira] [Assigned] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25044:


Assignee: Apache Spark

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Major
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.






[jira] [Created] (SPARK-25083) remove the type erasure hack in data source scan

2018-08-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25083:
---

 Summary: remove the type erasure hack in data source scan
 Key: SPARK-25083
 URL: https://issues.apache.org/jira/browse/SPARK-25083
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan


It's hacky to pretend an `RDD[ColumnarBatch]` is an `RDD[InternalRow]`. We
should make the type explicit.
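As a generic illustration (not Spark code) of why such a cast is fragile: erasure means the cast itself always succeeds, and the failure only shows up later, wherever an element is actually used as the pretended type.
{code:scala}
// Toy example: the unchecked cast "works" until an element is touched.
object ErasureHackSketch {
  def main(args: Array[String]): Unit = {
    val ints: List[Int] = List(1, 2, 3)
    val pretended = ints.asInstanceOf[List[String]] // no error here, thanks to erasure
    println(pretended.size)                         // fine: no element is inspected
    println(pretended.head.length)                  // ClassCastException only at the use site
  }
}
{code}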






[jira] [Updated] (SPARK-25076) SQLConf should not be retrieved from a stopped SparkSession

2018-08-09 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25076:

Fix Version/s: 2.3.2

> SQLConf should not be retrieved from a stopped SparkSession
> ---
>
> Key: SPARK-25076
> URL: https://issues.apache.org/jira/browse/SPARK-25076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>







[jira] [Commented] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-08-09 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575642#comment-16575642
 ] 

Wenchen Fan commented on SPARK-24502:
-

This is enough. The places you need to pay attention to are where you call
`SparkSession.getActiveSession`; make sure you can handle the case where the
session is stopped.
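For example, a defensive pattern along these lines (a sketch only, not the actual change in the PR) avoids picking up a stopped session:
{code:scala}
// Sketch: ignore the active session when its SparkContext has been stopped,
// so callers can fall back to a default configuration instead.
import org.apache.spark.sql.SparkSession

def usableActiveSession(): Option[SparkSession] =
  SparkSession.getActiveSession.filterNot(_.sparkContext.isStopped)
{code}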

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
> Fix For: 2.3.2, 2.4.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}






[jira] [Resolved] (SPARK-24886) Increase Jenkins build time

2018-08-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24886.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21845
[https://github.com/apache/spark/pull/21845]

> Increase Jenkins build time
> ---
>
> Key: SPARK-24886
> URL: https://issues.apache.org/jira/browse/SPARK-24886
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, it looks like we hit the time limit from time to time. It seems better
> to increase the limit a bit.
> For instance, please see https://github.com/apache/spark/pull/21822






[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180

2018-08-09 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575629#comment-16575629
 ] 

Steve Loughran commented on SPARK-22236:


I wouldn't recommend changing multiline=true by default precisely because the
change is so traumatic: you'd never be able to partition a CSV file across >1
worker.

> CSV I/O: does not respect RFC 4180
> --
>
> Key: SPARK-22236
> URL: https://issues.apache.org/jira/browse/SPARK-22236
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with 
> a backslash by default. However, the appropriate behaviour as set out by RFC 
> 4180 (and adhered to by many software packages) is to escape using a second 
> double quote.
> This piece of Python code demonstrates the issue
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
> cw = csv.writer(f)
> cw.writerow(['a 2.5" drive', 'another column'])
> cw.writerow(['a "quoted" string', '"quoted"'])
> cw.writerow([1,2])
> with open('testfile.csv') as f:
> print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may 
> result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file 
> correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-csv') as f:
> cr = csv.reader(f)
> print(next(cr))
> print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in 
> [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
>  where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would 
> be useful if Spark had sensible defaults that conform to the above-mentioned 
> RFC (as well as W3C recommendations). I realise this would be a breaking 
> change and thus if accepted, it would probably need to result in a warning 
> first, before moving to a new default.
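Until the defaults change, the practical workaround is to set the escape option explicitly on both read and write; a short sketch, assuming a SparkSession `spark`:
{code:scala}
// Workaround sketch: use a double quote as the escape character so Spark's
// CSV I/O follows RFC 4180 quoting (the default quote character is already ").
val df = spark.read
  .option("escape", "\"")
  .csv("testfile.csv")

df.write
  .option("escape", "\"")
  .csv("testout")
{code}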






[jira] [Created] (SPARK-25082) Documentation for Spark Function expm1 is incomplete

2018-08-09 Thread Alexander Belov (JIRA)
Alexander Belov created SPARK-25082:
---

 Summary: Documentation for Spark Function expm1 is incomplete
 Key: SPARK-25082
 URL: https://issues.apache.org/jira/browse/SPARK-25082
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Affects Versions: 2.3.1, 2.0.0
Reporter: Alexander Belov


The documentation for the function expm1 that takes in a string

public static [Column|https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/Column.html] expm1(String columnName)
([https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/functions.html#expm1-java.lang.String-])

states that it "Computes the exponential of the given column." without
mentioning that 1 is subtracted from the result.

 

The documentation for the function expm1 that takes in a column has the correct 
documentation:

https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/functions.html#expm1-org.apache.spark.sql.Column-

 

"Computes the exponential of the given value minus one."

 

 






[jira] [Updated] (SPARK-25081) Nested spill in ShuffleExternalSorter may access a released memory page

2018-08-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25081:
-
Labels: correctness  (was: )

> Nested spill in ShuffleExternalSorter may access a released memory page 
> 
>
> Key: SPARK-25081
> URL: https://issues.apache.org/jira/browse/SPARK-25081
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: correctness
>
> This issue is pretty similar to SPARK-21907. 
> "allocateArray" in 
> [ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
>  may trigger a spill and cause ShuffleInMemorySorter to access the released 
> `array`. Another task may get the same memory page from the pool. This will 
> cause two tasks to access the same memory page. When a task reads memory written 
> by another task, many types of failures may happen. Here are some examples I 
> have seen:
> - JVM crash. (This is easy to reproduce in a unit test as we fill newly 
> allocated and deallocated memory with 0xa5 and 0x5a bytes which usually 
> points to an invalid memory address)
> - java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> - java.lang.NullPointerException at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:384)
> - java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 
> -536870912 because the size after growing exceeds size limitation 2147483632
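The hazard can be pictured with a toy sketch (hypothetical code, not Spark's): a buffer is released and the re-allocation triggers a "spill" that still walks the released buffer.
{code:scala}
// Toy illustration of the nested-spill hazard; the NullPointerException here
// stands in for reading a page that has been handed to another task.
object NestedSpillSketch {
  private var array: Array[Long] = new Array[Long](4)

  private def spill(): Unit = {
    // The spill path iterates the in-memory sorter's current array; in the
    // buggy ordering that array has already been released.
    println(s"spilling ${array.length} entries")
  }

  private def allocateUnderPressure(n: Int): Array[Long] = {
    spill()              // allocation under memory pressure forces a nested spill
    new Array[Long](n)
  }

  def reset(): Unit = {
    array = null                        // analogous to freeing the page back to the pool
    array = allocateUnderPressure(8)    // the nested spill observes the released buffer
  }

  def main(args: Array[String]): Unit = reset()
}
{code}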






[jira] [Assigned] (SPARK-25081) Nested spill in ShuffleExternalSorter may access a released memory page

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25081:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Nested spill in ShuffleExternalSorter may access a released memory page 
> 
>
> Key: SPARK-25081
> URL: https://issues.apache.org/jira/browse/SPARK-25081
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> This issue is pretty similar to SPARK-21907. 
> "allocateArray" in 
> [ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
>  may trigger a spill and cause ShuffleInMemorySorter to access the released 
> `array`. Another task may get the same memory page from the pool. This will 
> cause two tasks to access the same memory page. When a task reads memory written 
> by another task, many types of failures may happen. Here are some examples I 
> have seen:
> - JVM crash. (This is easy to reproduce in a unit test as we fill newly 
> allocated and deallocated memory with 0xa5 and 0x5a bytes which usually 
> points to an invalid memory address)
> - java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> - java.lang.NullPointerException at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:384)
> - java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 
> -536870912 because the size after growing exceeds size limitation 2147483632






[jira] [Assigned] (SPARK-25081) Nested spill in ShuffleExternalSorter may access a released memory page

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25081:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Nested spill in ShuffleExternalSorter may access a released memory page 
> 
>
> Key: SPARK-25081
> URL: https://issues.apache.org/jira/browse/SPARK-25081
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> This issue is pretty similar to SPARK-21907. 
> "allocateArray" in 
> [ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
>  may trigger a spill and cause ShuffleInMemorySorter to access the released 
> `array`. Another task may get the same memory page from the pool. This will 
> cause two tasks to access the same memory page. When a task reads memory written 
> by another task, many types of failures may happen. Here are some examples I 
> have seen:
> - JVM crash. (This is easy to reproduce in a unit test as we fill newly 
> allocated and deallocated memory with 0xa5 and 0x5a bytes which usually 
> points to an invalid memory address)
> - java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> - java.lang.NullPointerException at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:384)
> - java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 
> -536870912 because the size after growing exceeds size limitation 2147483632






[jira] [Commented] (SPARK-25081) Nested spill in ShuffleExternalSorter may access a released memory page

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575598#comment-16575598
 ] 

Apache Spark commented on SPARK-25081:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22062

> Nested spill in ShuffleExternalSorter may access a released memory page 
> 
>
> Key: SPARK-25081
> URL: https://issues.apache.org/jira/browse/SPARK-25081
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> This issue is pretty similar to SPARK-21907. 
> "allocateArray" in 
> [ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
>  may trigger a spill and cause ShuffleInMemorySorter to access the released 
> `array`. Another task may get the same memory page from the pool. This will 
> cause two tasks to access the same memory page. When a task reads memory written 
> by another task, many types of failures may happen. Here are some examples I 
> have seen:
> - JVM crash. (This is easy to reproduce in a unit test as we fill newly 
> allocated and deallocated memory with 0xa5 and 0x5a bytes which usually 
> points to an invalid memory address)
> - java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> - java.lang.NullPointerException at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:384)
> - java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 
> -536870912 because the size after growing exceeds size limitation 2147483632






[jira] [Updated] (SPARK-25081) Nested spill in ShuffleExternalSorter may access a released memory page

2018-08-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25081:
-
Description: 
This issue is pretty similar to SPARK-21907. 
"allocateArray" in 
[ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
 may trigger a spill and cause ShuffleInMemorySorter to access the released 
`array`. Another task may get the same memory page from the pool. This will 
cause two tasks to access the same memory page. When a task reads memory written 
by another task, many types of failures may happen. Here are some examples I 
have seen:

- JVM crash. (This is easy to reproduce in a unit test as we fill newly 
allocated and deallocated memory with 0xa5 and 0x5a bytes which usually points 
to an invalid memory address)
- java.lang.IllegalArgumentException: Comparison method violates its general 
contract!
- java.lang.NullPointerException at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:384)
- java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 
-536870912 because the size after growing exceeds size limitation 2147483632

  was:
This issue is pretty similar to SPARK-21907. 
"allocateArray" in 
[ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
 may trigger a spill and cause ShuffleInMemorySorter to access the released 
`array`.



> Nested spill in ShuffleExternalSorter may access a released memory page 
> 
>
> Key: SPARK-25081
> URL: https://issues.apache.org/jira/browse/SPARK-25081
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> This issue is pretty similar to SPARK-21907. 
> "allocateArray" in 
> [ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
>  may trigger a spill and cause ShuffleInMemorySorter to access the released 
> `array`. Another task may get the same memory page from the pool. This will 
> cause two tasks to access the same memory page. When a task reads memory written 
> by another task, many types of failures may happen. Here are some examples I 
> have seen:
> - JVM crash. (This is easy to reproduce in a unit test as we fill newly 
> allocated and deallocated memory with 0xa5 and 0x5a bytes which usually 
> points to an invalid memory address)
> - java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> - java.lang.NullPointerException at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:384)
> - java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 
> -536870912 because the size after growing exceeds size limitation 2147483632






[jira] [Updated] (SPARK-25081) Nested spill in ShuffleExternalSorter may access a released memory page

2018-08-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25081:
-
Description: 
This issue is pretty similar to SPARK-21907. 
"allocateArray" in 
[ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
 may trigger a spill and cause 


> Nested spill in ShuffleExternalSorter may access a released memory page 
> 
>
> Key: SPARK-25081
> URL: https://issues.apache.org/jira/browse/SPARK-25081
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> This issue is pretty similar to SPARK-21907. 
> "allocateArray" in 
> [ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
>  may trigger a spill and cause 






[jira] [Updated] (SPARK-25081) Nested spill in ShuffleExternalSorter may access a released memory page

2018-08-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25081:
-
Description: 
This issue is pretty similar to SPARK-21907. 
"allocateArray" in 
[ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
 may trigger a spill and cause ShuffleInMemorySorter to access the released 
`array`.


  was:
This issue is pretty similar to SPARK-21907. 
"allocateArray" in 
[ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
 may trigger a spill and cause 



> Nested spill in ShuffleExternalSorter may access a released memory page 
> 
>
> Key: SPARK-25081
> URL: https://issues.apache.org/jira/browse/SPARK-25081
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> This issue is pretty similar to SPARK-21907. 
> "allocateArray" in 
> [ShuffleInMemorySorter.reset|https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99]
>  may trigger a spill and cause ShuffleInMemorySorter to access the released 
> `array`.






[jira] [Created] (SPARK-25081) Nested spill in ShuffleExternalSorter may access a released memory page

2018-08-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25081:


 Summary: Nested spill in ShuffleExternalSorter may access a 
released memory page 
 Key: SPARK-25081
 URL: https://issues.apache.org/jira/browse/SPARK-25081
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu









[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-08-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575554#comment-16575554
 ] 

shane knapp commented on SPARK-25079:
-

i've confirmed, for the master branch at least, that a symlink from 3.4 to 3.5 
works.

next up:  spot-checking older branches.  which should be...  interesting.

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink






[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-08-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575521#comment-16575521
 ] 

shane knapp commented on SPARK-24950:
-

do we care about 2.0 and 1.6?

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Assignee: Chris Martin
>Priority: Major
> Fix For: 2.1.4, 2.2.3, 2.3.2, 2.4.0
>
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.






[jira] [Resolved] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-08-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24950.
---
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.2.3
   2.1.4

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Assignee: Chris Martin
>Priority: Major
> Fix For: 2.1.4, 2.2.3, 2.3.2, 2.4.0
>
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-08-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575504#comment-16575504
 ] 

shane knapp commented on SPARK-24950:
-

booyah!  i love watching the build queue pile up.  :)

thanks [~srowen]!

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Assignee: Chris Martin
>Priority: Major
> Fix For: 2.4.0
>
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-08-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575493#comment-16575493
 ] 

shane knapp commented on SPARK-24950:
-

word.  let's leave this open for a bit longer as i continue to test.

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Assignee: Chris Martin
>Priority: Major
> Fix For: 2.4.0
>
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-08-09 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575488#comment-16575488
 ] 

Sean Owen commented on SPARK-24950:
---

Roger that, will back-port as far back as 2.1 as best I can.

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Assignee: Chris Martin
>Priority: Major
> Fix For: 2.4.0
>
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-08-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575479#comment-16575479
 ] 

shane knapp commented on SPARK-24950:
-

[~srowen] [~d80tb7] i'm thinking that we actually need to backport this change 
to previous branches.

:(

[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/spark-branch-2.1-test-maven-hadoop-2.7-ubuntu-testing/2/]
{noformat}
- daysToMillis and millisToDays *** FAILED ***
  9131 did not equal 9130 Round trip of 9130 did not work in tz 
sun.util.calendar.ZoneInfo[id="Pacific/Enderbury",offset=4680,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
 (DateTimeUtilsSuite.scala:554){noformat}
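
For reference, a minimal sketch of what this round trip does (illustrative only; it 
assumes the TimeZone-taking overloads of daysToMillis/millisToDays that DateTimeUtils 
has had since the 2.2 line):
{code:scala}
import java.util.TimeZone
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Convert a day count to the epoch millis of local midnight in a zone and back
// again; the result should be the original day. Whether it is depends on the
// JDK's bundled tzdata, which apparently changed for zones such as
// Pacific/Enderbury between 8u171 and 8u181.
val day = 9130
for (id <- TimeZone.getAvailableIDs) {
  val tz = TimeZone.getTimeZone(id)
  val back = DateTimeUtils.millisToDays(DateTimeUtils.daysToMillis(day, tz), tz)
  if (back != day) {
    println(s"Round trip of $day did not work in tz $id (got $back)")
  }
}
{code}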

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Assignee: Chris Martin
>Priority: Major
> Fix For: 2.4.0
>
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-08-09 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-24950:

Affects Version/s: 2.1.3

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Assignee: Chris Martin
>Priority: Major
> Fix For: 2.4.0
>
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-08-09 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp reopened SPARK-24950:
-

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Assignee: Chris Martin
>Priority: Major
> Fix For: 2.4.0
>
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25068) High-order function: exists(array, function) → boolean

2018-08-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25068.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> High-order function: exists(array, function) → boolean
> -
>
> Key: SPARK-25068
> URL: https://issues.apache.org/jira/browse/SPARK-25068
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Tests whether an array has at least one element for which the function returns true.
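
A quick, hedged usage sketch (assuming the SQL lambda syntax used by the other 
higher-order functions added in 2.4.0):
{code:scala}
// exists(array, function): true if the predicate holds for at least one
// element; here 2 is even, so this should return true.
spark.sql("SELECT exists(array(1, 2, 3), x -> x % 2 == 0)").show()
{code}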



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25076) SQLConf should not be retrieved from a stopped SparkSession

2018-08-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25076.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> SQLConf should not be retrieved from a stopped SparkSession
> ---
>
> Key: SPARK-25076
> URL: https://issues.apache.org/jira/browse/SPARK-25076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-08-09 Thread Steve Bairos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575373#comment-16575373
 ] 

Steve Bairos commented on SPARK-18057:
--

Hey, long shot, but is there any chance this change could get backported to 
branch-2.3? My company is currently on the 2.3 branch and we're dying to get off 
kafka-client 0.10 because there are a few issues with 0.10 and TLS.

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Ted Yu
>Priority: Major
> Fix For: 2.4.0
>
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25077) Delete unused variable in WindowExec

2018-08-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25077.
-
   Resolution: Fixed
 Assignee: Li Yuanjian
Fix Version/s: 2.4.0

> Delete unused variable in WindowExec
> 
>
> Key: SPARK-25077
> URL: https://issues.apache.org/jira/browse/SPARK-25077
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Assignee: Li Yuanjian
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Delete the unused variable `inputFields` in WindowExec to avoid confusing others 
> reading the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-09 Thread Mateusz Jukiewicz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mateusz Jukiewicz updated SPARK-23298:
--
Labels: Correctness CorrectnessBug correctness  (was: CorrectnessBug 
correctness)

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: Correctness, CorrectnessBug, correctness
>
> This is what happens:
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> */
> val dataset = spark.read.textFile("/text_dataset.out")
> dataset.distinct.count
> // res0: Long = 24025868
> dataset.distinct.count
> // res1: Long = 24014227{code}
> The _text_dataset.out_ file is a dataset with one string per line. The string 
> has alphanumeric characters as well as colons and spaces. The line length 
> does not exceed 1200. I don't think that's important though, as the issue 
> appeared on various other datasets, I just tried to narrow it down to the 
> simplest possible case.
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> read.textFile).
>  * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count).
>  * Not a single container has failed in those multiple invalid executions.
>  * YARN doesn't show any warnings or errors in those invalid executions.
>  * The execution plan determined for both valid and invalid executions was 
> always the same (it's shown in the _SQL_ tab of the UI).
>  * The number returned in the invalid executions was always greater than the 
> correct number (24 014 227).
>  * This occurs even though the input is already completely deduplicated (i.e. 
> _distinct.count_ shouldn't change anything).
>  * The input isn't replicated (i.e. there's only one copy of each file block 
> on the HDFS).
>  * The problem is probably not related to reading from HDFS. Spark was always 
> able to correctly read all input records (which was shown in the UI), and 
> that number got malformed after the exchange phase:
>  ** correct execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24014227 _(second stage)_
>  ** incorrect execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24020150 _(second stage)_
>  * The problem might be related to the way Encoders hash data internally. 
> The reason might be:
>  ** in a simple `distinct.count` invocation, there are in total three 
> hash-related stages (called `HashAggregate`),
>  ** excerpt from scaladoc for `distinct` method says:
> {code:java}
>* @note Equality checking is performed directly on the encoded 
> representation of the data
>* and thus is not affected by a custom `equals` function defined on 
> `T`.{code}
>  * One of my suspicions was the number of partitions we're using (2154). This 
> is greater than 2000, which means that a different data structure (i.e. 
> _HighlyCompressedMapStatus_ instead of _CompressedMapStatus_) will be used for 
> book-keeping during the shuffle. Unfortunately after decreasing the number 
> below this threshold the problem still occurs.
>  * It's easier to reproduce the issue with a large number of partitions.
>  * Another of my suspicions was 

[jira] [Updated] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-09 Thread Mateusz Jukiewicz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mateusz Jukiewicz updated SPARK-23298:
--
Labels: CorrectnessBug correctness  (was: )

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: CorrectnessBug, correctness
>
> This is what happens:
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> */
> val dataset = spark.read.textFile("/text_dataset.out")
> dataset.distinct.count
> // res0: Long = 24025868
> dataset.distinct.count
> // res1: Long = 24014227{code}
> The _text_dataset.out_ file is a dataset with one string per line. The string 
> has alphanumeric characters as well as colons and spaces. The line length 
> does not exceed 1200. I don't think that's important though, as the issue 
> appeared on various other datasets, I just tried to narrow it down to the 
> simplest possible case.
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> read.textFile).
>  * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count).
>  * Not a single container has failed in those multiple invalid executions.
>  * YARN doesn't show any warnings or errors in those invalid executions.
>  * The execution plan determined for both valid and invalid executions was 
> always the same (it's shown in the _SQL_ tab of the UI).
>  * The number returned in the invalid executions was always greater than the 
> correct number (24 014 227).
>  * This occurs even though the input is already completely deduplicated (i.e. 
> _distinct.count_ shouldn't change anything).
>  * The input isn't replicated (i.e. there's only one copy of each file block 
> on the HDFS).
>  * The problem is probably not related to reading from HDFS. Spark was always 
> able to correctly read all input records (which was shown in the UI), and 
> that number got malformed after the exchange phase:
>  ** correct execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24014227 _(second stage)_
>  ** incorrect execution:
>  Input Size / Records: 3.9 GB / 24014227 _(first stage)_
>  Shuffle Write: 3.3 GB / 24014227 _(first stage)_
>  Shuffle Read: 3.3 GB / 24020150 _(second stage)_
>  * The problem might be related to the way Encoders hash data internally. 
> The reason might be:
>  ** in a simple `distinct.count` invocation, there are in total three 
> hash-related stages (called `HashAggregate`),
>  ** excerpt from scaladoc for `distinct` method says:
> {code:java}
>* @note Equality checking is performed directly on the encoded 
> representation of the data
>* and thus is not affected by a custom `equals` function defined on 
> `T`.{code}
>  * One of my suspicions was the number of partitions we're using (2154). This 
> is greater than 2000, which means that a different data structure (i.e. 
> _HighlyCompressedMapStatus_ instead of _CompressedMapStatus_) will be used for 
> book-keeping during the shuffle. Unfortunately after decreasing the number 
> below this threshold the problem still occurs.
>  * It's easier to reproduce the issue with a large number of partitions.
>  * Another of my suspicions was that it's somehow related to the number 
> of blocks 

[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180

2018-08-09 Thread Joe Pallas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575305#comment-16575305
 ] 

Joe Pallas commented on SPARK-22236:


If this can't change before 3.0, how about a note in the documentation 
explaining that compatibility with RFC4180 requires setting {{escape}} to {{"}} 
and {{multiLine}} to {{true}}?
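
Such a note could include a short snippet along these lines (a sketch of the 
workaround already shown in the description, here via the Scala API):
{code:scala}
// Read a CSV file that uses RFC 4180 quoting, where a double quote inside a
// quoted field is escaped by doubling it. Spark's default escape character is
// a backslash, so both options below need to be overridden.
val df = spark.read
  .option("escape", "\"")      // treat "" as an escaped quote
  .option("multiLine", true)   // allow quoted fields that span lines
  .csv("testfile.csv")         // file name taken from the example in the description
{code}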

> CSV I/O: does not respect RFC 4180
> --
>
> Key: SPARK-22236
> URL: https://issues.apache.org/jira/browse/SPARK-22236
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with 
> a backslash by default. However, the appropriate behaviour as set out by RFC 
> 4180 (and adhered to by many software packages) is to escape using a second 
> double quote.
> This piece of Python code demonstrates the issue
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
> cw = csv.writer(f)
> cw.writerow(['a 2.5" drive', 'another column'])
> cw.writerow(['a "quoted" string', '"quoted"'])
> cw.writerow([1,2])
> with open('testfile.csv') as f:
> print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may 
> result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file 
> correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-csv') as f:
> cr = csv.reader(f)
> print(next(cr))
> print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in 
> [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
>  where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would 
> be useful if Spark had sensible defaults that conform to the above-mentioned 
> RFC (as well as W3C recommendations). I realise this would be a breaking 
> change and thus if accepted, it would probably need to result in a warning 
> first, before moving to a new default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25059) Exception while executing an action on DataFrame that read Json

2018-08-09 Thread Kunal Goswami (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575288#comment-16575288
 ] 

Kunal Goswami commented on SPARK-25059:
---

Thank you so much for the prompt response, let me try using spark 2.3 then. 

> Exception while executing an action on DataFrame that read Json
> ---
>
> Key: SPARK-25059
> URL: https://issues.apache.org/jira/browse/SPARK-25059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
> Environment: AWS EMR 5.8.0 
> Spark 2.2.0 
>  
>Reporter: Kunal Goswami
>Priority: Major
>  Labels: Spark-SQL
>
> When I try to read ~9600 Json files using
> {noformat}
> val test = spark.read.option("header", true).option("inferSchema", 
> true).json(paths: _*) {noformat}
>  
> Any action on the DataFrame created above results in: 
> {noformat}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply2_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:839)
>   at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:11081)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4546)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370)
>   at 

[jira] [Created] (SPARK-25080) NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)

2018-08-09 Thread Andrew K Long (JIRA)
Andrew K Long created SPARK-25080:
-

 Summary: NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)
 Key: SPARK-25080
 URL: https://issues.apache.org/jira/browse/SPARK-25080
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.3.1
 Environment: AWS EMR
Reporter: Andrew K Long


NPE while reading hive table.

 

```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 1190 in stage 392.0 failed 4 times, most recent failure: Lost task 1190.3 
in stage 392.0 (TID 122055, ip-172-31-32-196.ec2.internal, executor 487): 
java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:217)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:294)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:265)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
 
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1753)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1741)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1740)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1740)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1974)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1923)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1912)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:682)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
... 67 more
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 

[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported

2018-08-09 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575267#comment-16575267
 ] 

Thomas Graves commented on SPARK-25024:
---

ok, I'm hardly familiar with mesos at all, so I apologize if some of these 
questions have obvious answers; from reading the mesos docs, a few points are not 
clear to me.  Note I'm going to go through the yarn docs and try to clarify very 
similar things there.

I see some updates have been made on master; I was originally looking at the 
2.3.1 docs 
(https://github.com/apache/spark/blob/master/docs/running-on-mesos.md)
 * for cluster mode does MesosClusterDispatcher support authentication and can 
zookeeper be secured?
 * Does it support accessing secure HDFS?  Does it require keytabs be shipped?
 * does the Mesos Shuffle Service support authentication?  I assume so, since I 
would expect it to use spark RPC, so I assume the spark confs used when starting it 
need spark.authenticate=true and a secret specified?  So it's not really 
multi-tenant, but perhaps mesos handles the multi-tenancy, or does each user start 
their own shuffle service?
 * spark.mesos.principal and spark.mesos.secret, assume mesos handles 
multi-tenancy based on registry?
 * for the spark.mesos.driver.secret* configs, I assume whether these are actually 
secure would vary by setup.  For instance, if I specify an env variable or config, 
can other users see it?  Also, does that secret need to match the shuffle service? 
That might depend on the question above, i.e. whether there is only one per cluster 
or one set up per user.  Maybe too many variations to talk about?

> Update mesos documentation to be clear about security supported
> ---
>
> Key: SPARK-25024
> URL: https://issues.apache.org/jira/browse/SPARK-25024
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our mesos deployment docs and security docs and its not 
> clear at all what type of security and how to set it up for mesos.  I think 
> we should clarify this and have something about exactly what is supported and 
> what is not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-09 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575247#comment-16575247
 ] 

Sean Owen commented on SPARK-25044:
---

Next thought: use ScalaUDF's inputTypes field to determine which args are 
primitive. However, I find this is only set when performing type coercion, and 
can't be relied on, it seems. We could change the whole code base to always set 
this, but I wonder if we can force user code that might reference ScalaUDF to 
do so. Hm.
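
For reference, a tiny standalone illustration of the behavior in question (plain 
Scala, not Spark code; the cast merely mimics a closure whose parameters are seen 
as Object):
{code:scala}
// Viewing an (Int, Int) => Int closure through its generic, boxed apply:
// unboxing a null Integer yields 0, so a null argument silently behaves like 0
// instead of propagating as null the way HandleNullInputsForUDF expects.
val add = (x: Int, y: Int) => x + y
val boxedAdd = add.asInstanceOf[(Any, Any) => Any]
println(boxedAdd(1, null))  // prints 1, because null is unboxed to 0
{code}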

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-08-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575197#comment-16575197
 ] 

shane knapp commented on SPARK-25079:
-

SO.  MANY.  MOVING.  PARTS.

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25079:


Assignee: Apache Spark  (was: shane knapp)

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: Apache Spark
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575186#comment-16575186
 ] 

Apache Spark commented on SPARK-25079:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/22061

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25079:


Assignee: shane knapp  (was: Apache Spark)

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575170#comment-16575170
 ] 

shane knapp commented on SPARK-23874:
-

this issue depends on this:  https://issues.apache.org/jira/browse/SPARK-25079

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Clean up pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-08-09 Thread shane knapp (JIRA)
shane knapp created SPARK-25079:
---

 Summary: [PYTHON] upgrade python 3.4 -> 3.5
 Key: SPARK-25079
 URL: https://issues.apache.org/jira/browse/SPARK-25079
 Project: Spark
  Issue Type: Improvement
  Components: Build, PySpark
Affects Versions: 2.3.1
Reporter: shane knapp
Assignee: shane knapp


for the impending arrow upgrade 
(https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 3.4 
-> 3.5.

i have been testing this here:  
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]

my methodology:

1) upgrade python + arrow to 3.5 and 0.10.0

2) run python tests

3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
upgrade centos workers to python3.5

4) simultaneously do the following: 

  - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that points 
to python3.5 (this is currently being tested here:  
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]

  - push a change to python/run-tests.py replacing 3.4 with 3.5

5) once the python3.5 change to run-tests.py is merged, we will need to 
back-port this to all existing branches

6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25036:


Assignee: Apache Spark  (was: Kazuaki Ishizaki)

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are -two- three types:
> 1. {{ExprValue.isNull}} is compared with unexpected type.
> 2. {{match may not be exhaustive}} is detected at {{match}}
> 3. discarding unmoored doc comment
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code:java}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}
> {code:java}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410:
>  discarding unmoored doc comment
> [error] [warn] /**
> [error] [warn] 
> [error] [warn] 
> 

[jira] [Assigned] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25036:


Assignee: Kazuaki Ishizaki  (was: Apache Spark)

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are -two- three types:
> 1. {{ExprValue.isNull}} is compared with unexpected type.
> 2. {{match may not be exhaustive}} is detected at {{match}}
> 3. discarding unmoored doc comment
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code:java}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}
> {code:java}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410:
>  discarding unmoored doc comment
> [error] [warn] /**
> [error] [warn] 
> [error] [warn] 
> 

[jira] [Commented] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575158#comment-16575158
 ] 

Apache Spark commented on SPARK-25036:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22059

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are -two- three types:
> 1. {{ExprValue.isNull}} is compared with unexpected type.
> 2. {{match may not be exhaustive}} is detected at {{match}}
> 3. discarding unmoored doc comment
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code:java}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}
> {code:java}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410:
>  discarding unmoored doc comment
> [error] [warn] /**
> [error] [warn] 
> [error] [warn] 
> 

[jira] [Updated] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25036:
-
Description: 
When compiling with sbt, the following errors occur:

There are -two- three types:
1. {{ExprValue.isNull}} is compared with unexpected type.
2. {{match may not be exhaustive}} is detected at {{match}}
3. discarding unmoored doc comment

The first one is more serious since it may also generate incorrect code in 
Spark 2.3.
{code:java}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
Boolean = (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn] (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
 match may not be exhaustive.
[error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
ArrayData()), (_, _)
[error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
 match may not be exhaustive.
[error] It would fail on the following inputs: NewFunctionSpec(_, None, 
Some(_)), NewFunctionSpec(_, Some(_), None)
[error] [warn] newFunction match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely always compare unequal
[error] [warn] if (eval.isNull != "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]  if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn] if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
 match may not be exhaustive.
[error] It would fail on the following input: Schema((x: 
org.apache.spark.sql.types.DataType forSome x not in 
org.apache.spark.sql.types.StructType), _)
[error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
[error] [warn] 
{code}
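
The ExprValue-vs-String warnings above come down to a plain Scala pitfall: {{==}} 
between unrelated types compiles but can never be true. A minimal standalone 
sketch, using a hypothetical stand-in class rather than Spark's actual codegen 
type:

{code}
// Hypothetical stand-in for the codegen ExprValue; not Spark source code.
final case class FakeExprValue(code: String)

object IsNullCheckSketch {
  def main(args: Array[String]): Unit = {
    val isNull = FakeExprValue("true")
    // Compiles, but scalac warns about comparing unrelated types, and at
    // runtime the comparison is always false -- so a null check written this
    // way is silently skipped.
    println(isNull == "true")   // prints: false
  }
}
{code}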


{code:java}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:441:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
...
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:440:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
{code}

  was:
When compiling with sbt, the following errors occur:

There are two types:
1. {{ExprValue.isNull}} is compared with unexpected type.
1. {{match may not be exhaustive}} is detected at {{match}}

The first one is more serious since it may also generate incorrect code in 
Spark 2.3.

{code}
[error] [warn] 

[jira] [Comment Edited] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575137#comment-16575137
 ] 

Kazuaki Ishizaki edited comment on SPARK-25036 at 8/9/18 5:05 PM:
--

Another type of compilation error is found. Added the log to the description


was (Author: kiszk):
Another type of compilation error is found

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with unexpected type.
> 1. {{match may not be exhaustive}} is detected at {{match}}
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reopened SPARK-25036:
--

Another type of compilation error is found

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with unexpected type.
> 1. {{match may not be exhaustive}} is detected at {{match}}
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported

2018-08-09 Thread Arthur Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575134#comment-16575134
 ] 

Arthur Rand commented on SPARK-25024:
-

Just to chime in here. Unless a lot has changed, all of the Spark security 
features when running on Mesos are available in "vanilla Mesos", as long as you 
have the required plug-ins. The problem is that a given user's suite of security 
plug-ins is impossible to predict, so the Spark docs only tell you how to 
_configure Spark_. Some of the questions you bring up, [~tgraves], depend on the 
specific setup, for example authentication when submitting jobs. However, I 
think it's safe to say that if you have a _secure Mesos_ cluster (meaning you 
have some form of plug-ins in place), then it'll work with Spark. 

> Update mesos documentation to be clear about security supported
> ---
>
> Key: SPARK-25024
> URL: https://issues.apache.org/jira/browse/SPARK-25024
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our mesos deployment docs and security docs and its not 
> clear at all what type of security and how to set it up for mesos.  I think 
> we should clarify this and have something about exactly what is supported and 
> what is not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25059) Exception while executing an action on DataFrame that read Json

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575129#comment-16575129
 ] 

Kazuaki Ishizaki commented on SPARK-25059:
--

Thank you for reporting the issue. Could you please try this with Spark 2.3?
The community extensively investigated and fixed these issues in Spark 2.3.

> Exception while executing an action on DataFrame that read Json
> ---
>
> Key: SPARK-25059
> URL: https://issues.apache.org/jira/browse/SPARK-25059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
> Environment: AWS EMR 5.8.0 
> Spark 2.2.0 
>  
>Reporter: Kunal Goswami
>Priority: Major
>  Labels: Spark-SQL
>
> When I try to read ~9600 Json files using
> {noformat}
> val test = spark.read.option("header", true).option("inferSchema", 
> true).json(paths: _*) {noformat}
>  
> Any action on the above created data frame results in: 
> {noformat}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply2_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class "org.apache.spark.sql.catalyst.expressions.Generat[73/1850]
> pecificUnsafeProjection" grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:839)
>   at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:11081)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4546)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> 

[jira] [Commented] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575116#comment-16575116
 ] 

Apache Spark commented on SPARK-25036:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22058

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with unexpected type.
> 1. {{match may not be exhaustive}} is detected at {{match}}
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-08-09 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575097#comment-16575097
 ] 

Imran Rashid commented on SPARK-23207:
--

Yeah, I agree with Tom: silent data loss is a major bug. I don't actually think 
the chance of hitting this is that small.

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning, so the generated 
> result is nondeterministic, since the order of the input rows is not 
> determined.
> The bug can be triggered when a repartition call follows a shuffle (which 
> leads to non-deterministic row ordering), as in the pattern below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition 
> stage will be retried and generate an inconsistent ordering, and some tasks 
> of the result stage will be retried, generating different data.
> The following code returns 931532 instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
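
As a side note, a minimal sketch of a mitigation (not the fix tracked by this 
ticket): hash-partitioning by an explicit column is deterministic with respect 
to row content, so retried tasks reproduce the same partitioning. The column 
name and partition count are taken from the example above:

{code}
import org.apache.spark.sql.functions.col

// Unlike round-robin repartition(), hash-partitioning by "id" sends a given
// row to the same partition regardless of input row order, so task retries
// do not reshuffle rows inconsistently.
val res = spark.range(0, 1000 * 1000, 1).toDF("id")
  .repartition(200, col("id"))
res.distinct().count()   // 1000000
{code}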



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency

2018-08-09 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575066#comment-16575066
 ] 

Sean Owen commented on SPARK-22634:
---

That text is part of Netty's NOTICE file, which must be reproduced, but it 
isn't actually pulled in by Netty according to mvn. It says only jets3t uses 
it, and you say jets3t isn't used directly here. I'd say this is resolved once 
SPARK-23654 is resolved, then.

There's another reason to remove this if it's not necessary. Because it's 
crypto software, I think we need to update an ECCN for Spark if it's 
distributed. I'm investigating that separately. But all the better to remove it 
if not needed.

> Update Bouncy castle dependency
> ---
>
> Key: SPARK-22634
> URL: https://issues.apache.org/jira/browse/SPARK-22634
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Lior Regev
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>
> Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka 
> streaming, uses Bouncy Castle version 1.51.
> This is an outdated version, as the latest one is 1.58.
> This, in turn, renders packages such as 
> [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds]
>  unusable, since these require 1.58 and Spark's distributions come along with 
> 1.51.
> My own attempt was to run on EMR, and since I automatically get all of 
> Spark's dependencies (Bouncy Castle 1.51 being one of them) on the 
> classpath, using the library to parse blockchain data failed due to missing 
> functionality.
> I have also opened an 
> [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency]
>  with jets3t to update their dependency as well, but along with that Spark 
> would have to update its own or at least be packaged with a newer version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25078) Standalone does not work with spark.authenticate.secret and deploy-mode=cluster

2018-08-09 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-25078:
-
Summary: Standalone does not work with spark.authenticate.secret and 
deploy-mode=cluster  (was: Standalone cluster mode does not work with 
spark.authenticate.secret)

> Standalone does not work with spark.authenticate.secret and 
> deploy-mode=cluster
> ---
>
> Key: SPARK-25078
> URL: https://issues.apache.org/jira/browse/SPARK-25078
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> When running a spark standalone cluster with spark.authenticate.secret setup, 
> you cannot submit a program in cluster mode, even with the right secret.  The 
> driver fails with:
> {noformat}
> 18/08/09 08:17:21 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users  with view permissions: Set(systest); groups 
> with view permissions: Set(); users  with modify permissions: Set(systest); 
> groups with modify permissions: Set()
> 18/08/09 08:17:21 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: requirement failed: A secret key must be 
> specified via the spark.authenticate.secret config.
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.SecurityManager.initializeAuth(SecurityManager.scala:361)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:238)
> at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
> at 
> org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
> at org.apache.spark.SparkContext.(SparkContext.scala:424)
> ...
> {noformat}
> but it's actually doing the wrong check in 
> {{SecurityManager.initializeAuth()}}.  The secret is there, it's just in an 
> environment variable {{_SPARK_AUTH_SECRET}} (so it's not visible to another 
> process).
> *Workaround*: In your program, you can pass in a dummy secret to your Spark 
> conf.  It doesn't matter what it is; it will be ignored, and when 
> establishing connections the secret from the env variable will be used, e.g.:
> {noformat}
> val conf = new SparkConf()
> conf.setIfMissing("spark.authenticate.secret", "doesn't matter")
> val sc = new SparkContext(conf)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25078) Standalone cluster mode does not work with spark.authenticate.secret

2018-08-09 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-25078:


 Summary: Standalone cluster mode does not work with 
spark.authenticate.secret
 Key: SPARK-25078
 URL: https://issues.apache.org/jira/browse/SPARK-25078
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.4.0
Reporter: Imran Rashid


When running a spark standalone cluster with spark.authenticate.secret setup, 
you cannot submit a program in cluster mode, even with the right secret.  The 
driver fails with:

{noformat}
18/08/09 08:17:21 INFO SecurityManager: SecurityManager: authentication 
enabled; ui acls disabled; users  with view permissions: Set(systest); groups 
with view permissions: Set(); users  with modify permissions: Set(systest); 
groups with modify permissions: Set()
18/08/09 08:17:21 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: A secret key must be 
specified via the spark.authenticate.secret config.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.SecurityManager.initializeAuth(SecurityManager.scala:361)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:238)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
at org.apache.spark.SparkContext.(SparkContext.scala:424)
...
{noformat}

but it's actually doing the wrong check in {{SecurityManager.initializeAuth()}}. 
The secret is there, it's just in an environment variable 
{{_SPARK_AUTH_SECRET}} (so it's not visible to another process).

*Workaround*: In your program, you can pass in a dummy secret to your Spark 
conf.  It doesn't matter what it is; it will be ignored, and when establishing 
connections the secret from the env variable will be used, e.g.:

{noformat}
val conf = new SparkConf()
conf.setIfMissing("spark.authenticate.secret", "doesn't matter")
val sc = new SparkContext(conf)
{noformat}
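
Purely as an illustration (this snippet is not part of the original report), one 
can confirm on the driver that the secret really did arrive through the 
environment; the variable name is the one mentioned above:

{noformat}
// Sketch: check whether the auth secret is present in the driver's environment.
sys.env.get("_SPARK_AUTH_SECRET") match {
  case Some(_) => println("_SPARK_AUTH_SECRET is set in the driver environment")
  case None    => println("_SPARK_AUTH_SECRET is NOT set in the driver environment")
}
{noformat}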



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25077) Delete unused variable in WindowExec

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25077:


Assignee: Apache Spark

> Delete unused variable in WindowExec
> 
>
> Key: SPARK-25077
> URL: https://issues.apache.org/jira/browse/SPARK-25077
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Assignee: Apache Spark
>Priority: Trivial
>
> Delete the unused variable `inputFields` in WindowExec, to avoid confusing 
> others reading the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25077) Delete unused variable in WindowExec

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575053#comment-16575053
 ] 

Apache Spark commented on SPARK-25077:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/22057

> Delete unused variable in WindowExec
> 
>
> Key: SPARK-25077
> URL: https://issues.apache.org/jira/browse/SPARK-25077
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Priority: Trivial
>
> Delete the unused variable `inputFields` in WindowExec, to avoid confusing 
> others reading the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25077) Delete unused variable in WindowExec

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25077:


Assignee: (was: Apache Spark)

> Delete unused variable in WindowExec
> 
>
> Key: SPARK-25077
> URL: https://issues.apache.org/jira/browse/SPARK-25077
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Priority: Trivial
>
> Delete the unused variable `inputFields` in WindowExec, to avoid confusing 
> others reading the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25077) Delete unused variable in WindowExec

2018-08-09 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-25077:
---

 Summary: Delete unused variable in WindowExec
 Key: SPARK-25077
 URL: https://issues.apache.org/jira/browse/SPARK-25077
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Li Yuanjian


Delete the unused variable `inputFields` in WindowExec, to avoid confusing 
others reading the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2018-08-09 Thread Kyle Prifogle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575044#comment-16575044
 ] 

Kyle Prifogle commented on SPARK-12449:
---

[~oae]  as far as I can tell the issue lives on here:  
https://issues.apache.org/jira/browse/SPARK-22386

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
>Priority: Major
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 
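For context, a minimal sketch of the existing {{PrunedFilteredScan}} interface that the description contrasts with the proposed {{CatalystSource}} (the relation, schema and filter handling below are illustrative assumptions, not code from any real data source):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// A toy in-memory relation that prunes columns and evaluates pushed-down filters itself.
class DemoRelation(override val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  private val data = Seq(Row(1, "a"), Row(2, "b"), Row(3, "c"))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Evaluate the pushed-down filters we understand; Spark re-applies the rest on top.
    val kept = data.filter { row =>
      filters.forall {
        case GreaterThan("id", v: Int) => row.getInt(0) > v
        case _                         => true
      }
    }
    // Project only the requested columns, in the requested order.
    val indices = requiredColumns.map(schema.fieldIndex)
    sqlContext.sparkContext.parallelize(kept.map(r => Row.fromSeq(indices.map(r.get))))
  }
}
{code}

Only column pruning and simple filters can be deferred through this interface, which is the limitation the CatalystSource proposal aims to remove.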



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-08-09 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575026#comment-16575026
 ] 

Paul Praet commented on SPARK-24502:


We are only creating SparkSessions with a SparkSessionBuilder.getOrCreate() and 
then calling sparkSession.close() when we are done.

Can you confirm that this is not enough, then?

 

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
> Fix For: 2.3.2, 2.4.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25063) Rename class KnowNotNull to KnownNotNull

2018-08-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25063.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 2.4.0

> Rename class KnowNotNull to KnownNotNull
> 
>
> Key: SPARK-25063
> URL: https://issues.apache.org/jira/browse/SPARK-25063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Trivial
> Fix For: 2.4.0
>
>
> It's a class name typo checked in through SPARK-24891



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-08-09 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574956#comment-16574956
 ] 

Thomas Graves commented on SPARK-23207:
---

OK, I guess I disagree with that. Any correctness bug is very bad in my 
opinion; corrupt or lost data is much worse than taking a performance hit, as 
corrupt or lost data could easily result in lost revenue or errors in 
business-critical data.

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning, the generated 
> result is nondeterministic since the sequence of input rows are not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicate a shuffle)
> When one of the executors process goes down, some tasks on the repartition 
> stage will be retried and generate inconsistent ordering, and some tasks of 
> the result stage will be retried generating different data.
> The following code returns 931532, instead of 100:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
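As an aside for readers, a tiny plain-Scala model of why round-robin assignment is order-sensitive (a toy illustration of the behaviour described above, not Spark's partitioner code):

{code:scala}
// Assign rows to partitions by arrival order, the way a round-robin scheme does.
def roundRobin[T](rows: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  rows.zipWithIndex
    .groupBy { case (_, i) => i % numPartitions }
    .map { case (p, rs) => p -> rs.map(_._1) }

// The same four rows, arriving in a different order after a task retry,
// land in different partitions, which is the inconsistency described above.
val firstAttempt = roundRobin(Seq("a", "b", "c", "d"), 2)  // Map(0 -> Seq(a, c), 1 -> Seq(b, d))
val retry        = roundRobin(Seq("b", "a", "d", "c"), 2)  // Map(0 -> Seq(b, d), 1 -> Seq(a, c))
{code}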



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25076) SQLConf should not be retrieved from a stopped SparkSession

2018-08-09 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574915#comment-16574915
 ] 

Apache Spark commented on SPARK-25076:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22056

> SQLConf should not be retrieved from a stopped SparkSession
> ---
>
> Key: SPARK-25076
> URL: https://issues.apache.org/jira/browse/SPARK-25076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25076) SQLConf should not be retrieved from a stopped SparkSession

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25076:


Assignee: Wenchen Fan  (was: Apache Spark)

> SQLConf should not be retrieved from a stopped SparkSession
> ---
>
> Key: SPARK-25076
> URL: https://issues.apache.org/jira/browse/SPARK-25076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25076) SQLConf should not be retrieved from a stopped SparkSession

2018-08-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25076:


Assignee: Apache Spark  (was: Wenchen Fan)

> SQLConf should not be retrieved from a stopped SparkSession
> ---
>
> Key: SPARK-25076
> URL: https://issues.apache.org/jira/browse/SPARK-25076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24502) flaky test: UnsafeRowSerializerSuite

2018-08-09 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574908#comment-16574908
 ] 

Wenchen Fan commented on SPARK-24502:
-

There is no resource leakage. Users need to manage the active and default 
SparkSession manually, by calling `get/set/clearActiveSession` and 
`get/set/clearDefaultSession`. This is not very user-friendly, but it's what it 
is.

Unfortunately, our test framework had a bug: it didn't clear the active/default 
session when a Spark session was stopped. This caused a problem because we use 
`SQLConf.get` a lot in the test code. My PR fixed it.

It's totally fine to create and close multiple Spark sessions in production 
code; there is no resource leak. But you need to pay attention if you retrieve 
the active/default session. It's the same in Spark 2.2.

I'm adding a safeguard for SQLConf.get: 
https://issues.apache.org/jira/browse/SPARK-25076. Hopefully this problem can 
be eased.
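For anyone managing sessions in their own code, a minimal sketch of the bookkeeping described above (user-side code under the assumption that sessions are created and stopped manually; this is not the test-framework fix itself):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
try {
  spark.range(10).count()
} finally {
  spark.stop()
  // Clear the references so that later code (or SQLConf.get) cannot pick up the stopped session.
  SparkSession.clearActiveSession()
  SparkSession.clearDefaultSession()
}
{code}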

> flaky test: UnsafeRowSerializerSuite
> 
>
> Key: SPARK-24502
> URL: https://issues.apache.org/jira/browse/SPARK-24502
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: flaky-test
> Fix For: 2.3.2, 2.4.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4193/testReport/org.apache.spark.sql.execution/UnsafeRowSerializerSuite/toUnsafeRow___test_helper_method/
> {code}
> sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is 
> stopped.
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
>   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
>   at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
>   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
>   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-08-09 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574903#comment-16574903
 ] 

Jiang Xingbo commented on SPARK-23207:
--

This affects 2.2 and lower versions. The reason we didn't backport the patch is 
that it can cause a huge performance regression for the `repartition()` 
operation, and the chance of hitting this correctness bug is small. cc 
[~smilegator][~sameerag]

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning, the generated 
> result is nondeterministic since the sequence of input rows are not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicate a shuffle)
> When one of the executors process goes down, some tasks on the repartition 
> stage will be retried and generate inconsistent ordering, and some tasks of 
> the result stage will be retried generating different data.
> The following code returns 931532, instead of 100:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25076) SQLConf should not be retrieved from a stopped SparkSession

2018-08-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25076:
---

 Summary: SQLConf should not be retrieved from a stopped 
SparkSession
 Key: SPARK-25076
 URL: https://issues.apache.org/jira/browse/SPARK-25076
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-08-09 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574886#comment-16574886
 ] 

Thomas Graves commented on SPARK-23207:
---

[~jiangxb1987] ^

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning, the generated 
> result is nondeterministic since the sequence of input rows are not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicate a shuffle)
> When one of the executors process goes down, some tasks on the repartition 
> stage will be retried and generate inconsistent ordering, and some tasks of 
> the result stage will be retried generating different data.
> The following code returns 931532, instead of 100:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25073) Spark-submit on Yarn Task : When the yarn.nodemanager.resource.memory-mb and/or yarn.scheduler.maximum-allocation-mb is insufficient, Spark always reports an error req

2018-08-09 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574864#comment-16574864
 ] 

Sujith commented on SPARK-25073:


It seems you are right; the message is a bit misleading to the user. As per my 
understanding, there is also a dependency on the 
yarn.nodemanager.resource.memory-mb parameter.

*_yarn.nodemanager.resource.memory-mb:_*

Amount of physical memory, in MB, that can be allocated for containers. It is 
the amount of memory YARN can utilize on this node, so this property should be 
lower than the total memory of that machine.

*_yarn.scheduler.maximum-allocation-mb_*

It defines the maximum memory allocation available for a container, in MB. The 
RM can only allocate memory to containers in increments of 
{{"yarn.scheduler.minimum-allocation-mb"}} without exceeding 
{{"yarn.scheduler.maximum-allocation-mb"}}, and it should not be more than the 
total memory allocated to the node.
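To make the relationship concrete, a small illustrative calculation (plain Scala for illustration, not Spark or YARN code) of why both properties matter for the reported scenarios:

{code:scala}
// The memory a single container can actually obtain is bounded by both settings,
// so an error message that only mentions one of them can be misleading.
def effectiveContainerCapMb(schedulerMaxAllocationMb: Long, nodeManagerResourceMb: Long): Long =
  math.min(schedulerMaxAllocationMb, nodeManagerResourceMb)

// Scenario 2 from this issue: maximum-allocation-mb = 15g, resource.memory-mb = 8g.
// The node, not the scheduler, is the binding limit here.
val capMb = effectiveContainerCapMb(15L * 1024, 8L * 1024)  // 8192
{code}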

 

I will analyze this further and will raise a PR if it requires a fix. 
Thanks.

 

 

> Spark-submit on Yarn Task : When the yarn.nodemanager.resource.memory-mb 
> and/or yarn.scheduler.maximum-allocation-mb is insufficient, Spark always 
> reports an error request to adjust yarn.scheduler.maximum-allocation-mb
> --
>
> Key: SPARK-25073
> URL: https://issues.apache.org/jira/browse/SPARK-25073
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0, 2.3.1
>Reporter: vivek kumar
>Priority: Minor
>
> When the yarn.nodemanager.resource.memory-mb and/or 
> yarn.scheduler.maximum-allocation-mb is insufficient, Spark *always* reports 
> an error request to adjust Yarn.scheduler.maximum-allocation-mb. Expecting 
> the error request to be  more around yarn.scheduler.maximum-allocation-mb' 
> and/or 'yarn.nodemanager.resource.memory-mb'.
>  
> Scenario 1. yarn.scheduler.maximum-allocation-mb =4g and 
> yarn.nodemanager.resource.memory-mb =8G
>  # Launch shell on Yarn with am.memory less than nodemanager.resource memory 
> but greater than yarn.scheduler.maximum-allocation-mb
> eg; spark-shell --master yarn --conf spark.yarn.am.memory 5g
>  Error: java.lang.IllegalArgumentException: Required AM memory (5120+512 MB) 
> is above the max threshold (4096 MB) of this cluster! Please increase the 
> value of 'yarn.scheduler.maximum-allocation-mb'.
> at 
> org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)
>  
> *Scenario 2*. yarn.scheduler.maximum-allocation-mb =15g and 
> yarn.nodemanager.resource.memory-mb =8g
> a. Launch shell on Yarn with am.memory greater than nodemanager.resource 
> memory but less than yarn.scheduler.maximum-allocation-mb
> eg; *spark-shell --master yarn --conf spark.yarn.am.memory=10g*
>  Error :
> java.lang.IllegalArgumentException: Required AM memory (10240+1024 MB) is 
> above the max threshold (*8096 MB*) of this cluster! *Please increase the 
> value of 'yarn.scheduler.maximum-allocation-mb'.*
> at 
> org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)
>  
> b. Launch shell on Yarn with am.memory greater than nodemanager.resource 
> memory and yarn.scheduler.maximum-allocation-mb
> eg; *spark-shell --master yarn --conf spark.yarn.am.memory=17g*
>  Error:
> java.lang.IllegalArgumentException: Required AM memory (17408+1740 MB) is 
> above the max threshold (*8096 MB*) of this cluster! *Please increase the 
> value of 'yarn.scheduler.maximum-allocation-mb'.*
> at 
> org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)
>  
> *Expected* : Error request for scenario2 should be more around 
> yarn.scheduler.maximum-allocation-mb' and/or 
> 'yarn.nodemanager.resource.memory-mb'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-09 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574841#comment-16574841
 ] 

Sean Owen commented on SPARK-25044:
---

I tried this (it's not hard), but the implementation method signature in this 
case still uses Object, not ints or longs.

Actually, this functionality seems to be used only in the SQL Analyzer, and 
only to figure out whether the args are primitive, and even then only to decide 
whether it's necessary to handle null values of that argument. I tried simply 
changing the Analyzer to ignore whether the arg is primitive, and not to skip 
the null-handling check for primitives. That causes some tests to pass, but not 
all of them.

I might next investigate whether it's feasible to fix this by not analyzing the 
primitive-ness of arguments. [~smilegator]
{code:java}
- SPARK-11725: correctly handle null inputs for ScalaUDF *** FAILED ***
== FAIL: Plans do not match ===
!Project [if (isnull(a#0)) null else UDF(knownotnull(a#0)) AS #0] Project 
[UDF(a#0) AS #0]
+- LocalRelation , [a#0, b#0, c#0, d#0, e#0] +- LocalRelation , 
[a#0, b#0, c#0, d#0, e#0] (PlanTest.scala:119){code}
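A quick way to see what a reflection-based primitive check has to work with is to print the parameter types of the closure's generated apply methods (a small experiment, not the Analyzer code itself; the exact set of methods printed differs between 2.11 and 2.12):

{code:scala}
// List the apply methods of a closure's generated class and whether their parameter
// types are primitive. Per the description above, the LMF-generated class in 2.12
// reports Object parameters where the 2.11 anonymous class reported ints.
val f: (Int, Int) => Int = (x: Int, y: Int) => x + y

f.getClass.getDeclaredMethods
  .filter(_.getName.startsWith("apply"))
  .foreach { m =>
    val params = m.getParameterTypes.map(t => s"${t.getName}(primitive=${t.isPrimitive})")
    println(s"${m.getName}(${params.mkString(", ")})")
  }
{code}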

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2018-08-09 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574839#comment-16574839
 ] 

Guillaume Massé commented on SPARK-25075:
-

Scala 2.13 is currently in the milestone phase (we are at 2.13.0-M4 at the time 
of writing). Since Spark builds with 2.12, we can start the migration to 
2.13.0-M4 to find any incompatibilities. It's a good time to get anything 
needed into 2.13.x before the collection API is finalized.

To ease the migration to 2.13, the Scala Center and the Scala team created 
automatic migration rules and a compatibility library, available at 
[https://github.com/scala/scala-collection-compat].
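As an illustration of the kind of source change those migration rules automate (an assumed typical example, not taken from this issue):

{code:scala}
// Scala 2.12 spells collection conversions as seq.to[Vector]; 2.13 uses seq.to(Vector).
// With the scala-collection-compat dependency on the classpath, the 2.13 spelling also
// compiles on 2.12, so a cross-built project can use one form everywhere.
import scala.collection.compat._

val v: Vector[Int] = List(1, 2, 3).to(Vector)
{code}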

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25075) Build and test Spark against Scala 2.13

2018-08-09 Thread JIRA
Guillaume Massé created SPARK-25075:
---

 Summary: Build and test Spark against Scala 2.13
 Key: SPARK-25075
 URL: https://issues.apache.org/jira/browse/SPARK-25075
 Project: Spark
  Issue Type: Umbrella
  Components: Build, Project Infra
Affects Versions: 2.1.0
Reporter: Guillaume Massé


This umbrella JIRA tracks the requirements for building and testing Spark 
against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25047) Can't assign SerializedLambda to scala.Function1 in deserialization of BucketedRandomProjectionLSHModel

2018-08-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25047.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22032
[https://github.com/apache/spark/pull/22032]

> Can't assign SerializedLambda to scala.Function1 in deserialization of 
> BucketedRandomProjectionLSHModel
> ---
>
> Key: SPARK-25047
> URL: https://issues.apache.org/jira/browse/SPARK-25047
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.4.0
>
>
> Another distinct test failure:
> {code:java}
> - BucketedRandomProjectionLSH: streaming transform *** FAILED ***
>   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 7f34fb07-a718-4488-b644-d27cfd29ff6c, runId = 
> 0bbc0ba2-2952-4504-85d6-8aba877ba01b] terminated with exception: Job aborted 
> due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 16.0 (TID 16, localhost, executor driver): 
> java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction of 
> type scala.Function1 in instance of 
> org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel
> ...
>   Cause: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction of 
> type scala.Function1 in instance of 
> org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
>   at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1405)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2284)
> ...{code}
> Here the different nature of a Java 8 LMF closure trips of Java 
> serialization/deserialization. I think this can be patched by manually 
> implementing the Java serialization here, and don't see other instances (yet).
> Also wondering if this "val" can be a "def".
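To illustrate the last point, a rough sketch of the "val vs. def" idea (the class and hash function below are made up for illustration and are not the BucketedRandomProjectionLSHModel code):

{code:scala}
// The lambda stored in a val field is what fails to deserialize in the stack trace
// above. Exposing the hash as a def, recomputed from plain serializable fields,
// avoids storing a function object in the serialized instance at all.
class ModelSketch(val randUnitVectors: Array[Array[Double]]) extends Serializable {
  def hashFunction(v: Array[Double]): Array[Int] =
    randUnitVectors.map { u =>
      math.floor(u.zip(v).map { case (a, b) => a * b }.sum).toInt
    }
}
{code}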



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25047) Can't assign SerializedLambda to scala.Function1 in deserialization of BucketedRandomProjectionLSHModel

2018-08-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25047:
-

Assignee: Sean Owen

> Can't assign SerializedLambda to scala.Function1 in deserialization of 
> BucketedRandomProjectionLSHModel
> ---
>
> Key: SPARK-25047
> URL: https://issues.apache.org/jira/browse/SPARK-25047
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> Another distinct test failure:
> {code:java}
> - BucketedRandomProjectionLSH: streaming transform *** FAILED ***
>   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 7f34fb07-a718-4488-b644-d27cfd29ff6c, runId = 
> 0bbc0ba2-2952-4504-85d6-8aba877ba01b] terminated with exception: Job aborted 
> due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 16.0 (TID 16, localhost, executor driver): 
> java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction of 
> type scala.Function1 in instance of 
> org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel
> ...
>   Cause: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction of 
> type scala.Function1 in instance of 
> org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
>   at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1405)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2284)
> ...{code}
> Here the different nature of a Java 8 LMF closure trips of Java 
> serialization/deserialization. I think this can be patched by manually 
> implementing the Java serialization here, and don't see other instances (yet).
> Also wondering if this "val" can be a "def".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25073) Spark-submit on Yarn Task : When the yarn.nodemanager.resource.memory-mb and/or yarn.scheduler.maximum-allocation-mb is insufficient, Spark always reports an error reque

2018-08-09 Thread vivek kumar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vivek kumar updated SPARK-25073:

Description: 
When the yarn.nodemanager.resource.memory-mb and/or 
yarn.scheduler.maximum-allocation-mb is insufficient, Spark *always* reports an 
error request to adjust Yarn.scheduler.maximum-allocation-mb. Expecting the 
error request to be  more around yarn.scheduler.maximum-allocation-mb' and/or 
'yarn.nodemanager.resource.memory-mb'.

 

Scenario 1. yarn.scheduler.maximum-allocation-mb =4g and 
yarn.nodemanager.resource.memory-mb =8G
 # Launch shell on Yarn with am.memory less than nodemanager.resource memory 
but greater than yarn.scheduler.maximum-allocation-mb

eg; spark-shell --master yarn --conf spark.yarn.am.memory 5g

 Error: java.lang.IllegalArgumentException: Required AM memory (5120+512 MB) is 
above the max threshold (4096 MB) of this cluster! Please increase the value of 
'yarn.scheduler.maximum-allocation-mb'.

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

 

*Scenario 2*. yarn.scheduler.maximum-allocation-mb =15g and 
yarn.nodemanager.resource.memory-mb =8g

a. Launch shell on Yarn with am.memory greater than nodemanager.resource memory 
but less than yarn.scheduler.maximum-allocation-mb

eg; *spark-shell --master yarn --conf spark.yarn.am.memory=10g*

 Error :

java.lang.IllegalArgumentException: Required AM memory (10240+1024 MB) is above 
the max threshold (*8096 MB*) of this cluster! *Please increase the value of 
'yarn.scheduler.maximum-allocation-mb'.*

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

 

b. Launch shell on Yarn with am.memory greater than nodemanager.resource memory 
and yarn.scheduler.maximum-allocation-mb

eg; *spark-shell --master yarn --conf spark.yarn.am.memory=17g*

 Error:

java.lang.IllegalArgumentException: Required AM memory (17408+1740 MB) is above 
the max threshold (*8096 MB*) of this cluster! *Please increase the value of 
'yarn.scheduler.maximum-allocation-mb'.*

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

 

*Expected* : Error request for scenario2 should be more around 
yarn.scheduler.maximum-allocation-mb' and/or 
'yarn.nodemanager.resource.memory-mb'.

  was:
When the yarn.nodemanager.resource.memory-mb and/or 
yarn.scheduler.maximum-allocation-mb is insufficient, Spark always reports an 
error request to adjust Yarn.scheduler.maximum-allocation-mb. Expecting the 
error request to be  more around yarn.scheduler.maximum-allocation-mb' and/or 
'yarn.nodemanager.resource.memory-mb'.

 

Scenario 1. yarn.scheduler.maximum-allocation-mb =4g and 
yarn.nodemanager.resource.memory-mb =8G
 # Launch shell on Yarn with am.memory less than nodemanager.resource memory 
but greater than yarn.scheduler.maximum-allocation-mb

eg; spark-shell --master yarn --conf spark.yarn.am.memory 5g

 Error:

java.lang.IllegalArgumentException: Required AM memory (5120+512 MB) is above 
the max threshold (4096 MB) of this cluster! Please increase the value of 
'yarn.scheduler.maximum-allocation-mb'.

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

 

Scenario 2. yarn.scheduler.maximum-allocation-mb =15g and 
yarn.nodemanager.resource.memory-mb =8g

a.Launch shell on Yarn with am.memory greater than nodemanager.resource memory 
but less than yarn.scheduler.maximum-allocation-mb

eg; spark-shell --master yarn --conf spark.yarn.am.memory=10g

 

Error :

java.lang.IllegalArgumentException: Required AM memory (10240+1024 MB) is above 
the max threshold (*8096 MB*) of this cluster! Please increase the value of 
*'yarn.scheduler.maximum-allocation-mb'*.

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

b.Launch shell on Yarn with am.memory greater than nodemanager.resource memory 
and yarn.scheduler.maximum-allocation-mb

eg; spark-shell --master yarn --conf spark.yarn.am.memory=17g

 Error:

java.lang.IllegalArgumentException: Required AM memory (17408+1740 MB) is above 
the max threshold (*8096 MB*) of this cluster! Please increase the value of 
*'yarn.scheduler.maximum-allocation-mb'*.

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

 

Expected : Error request for scenario2 should be more around 
yarn.scheduler.maximum-allocation-mb' and/or 
'yarn.nodemanager.resource.memory-mb'.


> Spark-submit on Yarn Task : When the yarn.nodemanager.resource.memory-mb 
> and/or yarn.scheduler.maximum-allocation-mb is insufficient, Spark always 
> reports an error request to adjust yarn.scheduler.maximum-allocation-mb
> --
>
> Key: SPARK-25073
> URL: 

[jira] [Created] (SPARK-25074) Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend

2018-08-09 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-25074:


 Summary: Implement maxNumConcurrentTasks() in 
MesosFineGrainedSchedulerBackend
 Key: SPARK-25074
 URL: https://issues.apache.org/jira/browse/SPARK-25074
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


We added a new method `maxNumConcurrentTasks()` to `SchedulerBackend` to get 
the maximum number of tasks that can currently be launched concurrently. 
However, the method is not implemented in `MesosFineGrainedSchedulerBackend`, 
so submitting a job containing a barrier stage will always fail fast when the 
`MesosFineGrainedSchedulerBackend` resource manager is used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-08-09 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23415:
---

Assignee: Kazuaki Ishizaki

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> The test suite fails due to 60-second timeout sometimes.
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4759/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/412/]
>  (June 15th)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-08-09 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23415.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20636
[https://github.com/apache/spark/pull/20636]

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> The test suite fails due to 60-second timeout sometimes.
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4759/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/412/]
>  (June 15th)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2018-08-09 Thread Johannes Zillmann (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574770#comment-16574770
 ] 

Johannes Zillmann commented on SPARK-12449:
---

I'm a bit confused. Reading 
https://www.snowflake.com/snowflake-spark-part-2-pushing-query-processing/ and 
https://github.com/snowflakedb/spark-snowflake/pull/8/files, it looks like what 
the ticket describes has already been realized?

Can somebody shed light on this?

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
>Priority: Major
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25032) Create table is failing, after dropping the database . It is not falling back to default database

2018-08-09 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574754#comment-16574754
 ] 

sandeep katta commented on SPARK-25032:
---

I will be looking into this.

 

Solution:

1) Don't allow dropping the current database (a rough sketch of this option 
follows below), or

2) Fall back to the default database once the current database is dropped
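A rough sketch of option 1 (names and placement are illustrative assumptions, not the actual Spark implementation):

{code:scala}
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Refuse to drop the database the session is currently using, so the current
// database can never point at something that no longer exists.
def safeDropDatabase(catalog: SessionCatalog, db: String, cascade: Boolean): Unit = {
  require(!catalog.getCurrentDatabase.equalsIgnoreCase(db),
    s"Cannot drop the current database '$db'")
  catalog.dropDatabase(db, ignoreIfNotExists = false, cascade = cascade)
}
{code}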

 

> Create table is failing, after dropping the database . It is not falling back 
> to default database
> -
>
> Key: SPARK-25032
> URL: https://issues.apache.org/jira/browse/SPARK-25032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0, 2.3.1
> Environment: Spark 2.3.1 
> Hadoop 2.7.3
>  
>Reporter: Ayush Anubhava
>Priority: Minor
>
> *Launch spark-beeline for both the scenarios*
> *Scenario 1*
> create database cbo1;
> use cbo1;
> create table test2 ( a int, b string , c int) stored as parquet;
> drop database cbo1 cascade;
> create table test1 ( a int, b string , c int) stored as parquet;
> {color:#ff}Output : Exception is thrown at this point {color}
> {color:#ff}Error: 
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 
> 'cbo1' not found; (state=,code=0){color}
> *Scenario 2:*
> create database cbo1;
> use cbo1;
> create table test2 ( a int, b string , c int) stored as parquet;
> drop database cbo1 cascade;
> create database cbo1;
> create table test1 ( a int, b string , c int) stored as parquet;
> {color:#ff}Output : The table gets created in the database "*cbo1*", even 
> without selecting that database with USE. It should have been created in the 
> default db.{color}
>  
> In the beeline session, after dropping the database, it does not fall back to 
> the default db
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25073) Spark-submit on Yarn Task : When the yarn.nodemanager.resource.memory-mb and/or yarn.scheduler.maximum-allocation-mb is insufficient, Spark always reports an error reque

2018-08-09 Thread vivek kumar (JIRA)
vivek kumar created SPARK-25073:
---

 Summary: Spark-submit on Yarn Task : When the 
yarn.nodemanager.resource.memory-mb and/or yarn.scheduler.maximum-allocation-mb 
is insufficient, Spark always reports an error request to adjust 
yarn.scheduler.maximum-allocation-mb
 Key: SPARK-25073
 URL: https://issues.apache.org/jira/browse/SPARK-25073
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.1, 2.3.0
Reporter: vivek kumar


When the yarn.nodemanager.resource.memory-mb and/or 
yarn.scheduler.maximum-allocation-mb is insufficient, Spark always reports an 
error request to adjust Yarn.scheduler.maximum-allocation-mb. Expecting the 
error request to be  more around yarn.scheduler.maximum-allocation-mb' and/or 
'yarn.nodemanager.resource.memory-mb'.

 

Scenario 1. yarn.scheduler.maximum-allocation-mb =4g and 
yarn.nodemanager.resource.memory-mb =8G
 # Launch shell on Yarn with am.memory less than nodemanager.resource memory 
but greater than yarn.scheduler.maximum-allocation-mb

eg; spark-shell --master yarn --conf spark.yarn.am.memory 5g

 Error:

java.lang.IllegalArgumentException: Required AM memory (5120+512 MB) is above 
the max threshold (4096 MB) of this cluster! Please increase the value of 
'yarn.scheduler.maximum-allocation-mb'.

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

 

Scenario 2. yarn.scheduler.maximum-allocation-mb =15g and 
yarn.nodemanager.resource.memory-mb =8g

a.Launch shell on Yarn with am.memory greater than nodemanager.resource memory 
but less than yarn.scheduler.maximum-allocation-mb

eg; spark-shell --master yarn --conf spark.yarn.am.memory=10g

 

Error :

java.lang.IllegalArgumentException: Required AM memory (10240+1024 MB) is above 
the max threshold (*8096 MB*) of this cluster! Please increase the value of 
*'yarn.scheduler.maximum-allocation-mb'*.

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

b.Launch shell on Yarn with am.memory greater than nodemanager.resource memory 
and yarn.scheduler.maximum-allocation-mb

eg; spark-shell --master yarn --conf spark.yarn.am.memory=17g

 Error:

java.lang.IllegalArgumentException: Required AM memory (17408+1740 MB) is above 
the max threshold (*8096 MB*) of this cluster! Please increase the value of 
*'yarn.scheduler.maximum-allocation-mb'*.

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:325)

 

Expected : Error request for scenario2 should be more around 
yarn.scheduler.maximum-allocation-mb' and/or 
'yarn.nodemanager.resource.memory-mb'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25032) Create table is failing, after dropping the database . It is not falling back to default database

2018-08-09 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574754#comment-16574754
 ] 

sandeep katta edited comment on SPARK-25032 at 8/9/18 12:16 PM:


I will be looking into this.

 

Solution:

1) Don't allow dropping the current database, or

2) Fall back to the default database once the current database is dropped

 


was (Author: sandeep.katta2007):
I will be looking into this.

 

Solution:

1)Don't allow to delete the current database

2)Fall back to default once the database is deleted

 

> Create table is failing, after dropping the database . It is not falling back 
> to default database
> -
>
> Key: SPARK-25032
> URL: https://issues.apache.org/jira/browse/SPARK-25032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0, 2.3.1
> Environment: Spark 2.3.1 
> Hadoop 2.7.3
>  
>Reporter: Ayush Anubhava
>Priority: Minor
>
> *Launch spark-beeline for both the scenarios*
> *Scenario 1*
> create database cbo1;
> use cbo1;
> create table test2 ( a int, b string , c int) stored as parquet;
> drop database cbo1 cascade;
> create table test1 ( a int, b string , c int) stored as parquet;
> {color:#ff}Output : Exception is thrown at this point {color}
> {color:#ff}Error: 
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 
> 'cbo1' not found; (state=,code=0){color}
> *Scenario 2:*
> create database cbo1;
> use cbo1;
> create table test2 ( a int, b string , c int) stored as parquet;
> drop database cbo1 cascade;
> create database cbo1;
> create table test1 ( a int, b string , c int) stored as parquet;
> {color:#ff}Output : The table gets created in the database "*cbo1*", even 
> without selecting that database with USE. It should have been created in the 
> default db.{color}
>  
> In the beeline session, after dropping the database, it does not fall back to 
> the default db
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25072) PySpark custom Row class can be given extra parameters

2018-08-09 Thread Jan-Willem van der Sijp (JIRA)
Jan-Willem van der Sijp created SPARK-25072:
---

 Summary: PySpark custom Row class can be given extra parameters
 Key: SPARK-25072
 URL: https://issues.apache.org/jira/browse/SPARK-25072
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: {noformat}
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 3.4.5 (default, Dec 11 2017, 16:57:19)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
18/08/01 04:49:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
18/08/01 04:49:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
18/08/01 04:49:27 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 3.4.5 (default, Dec 11 2017 16:57:19)
SparkSession available as 'spark'.
{noformat}

{{CentOS release 6.9 (Final)}}
{{Linux sandbox-hdp.hortonworks.com 4.14.0-1.el7.elrepo.x86_64 #1 SMP Sun Nov 
12 20:21:04 EST 2017 x86_64 x86_64 x86_64 GNU/Linux}}
{noformat}openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode){noformat}
Reporter: Jan-Willem van der Sijp


When a custom Row class is made in PySpark, it is possible to provide the 
constructor of this class with more parameters than there are columns. These 
extra parameters affect the value of the Row, but are not part of the {{repr}} 
or {{str}} output, making it hard to debug errors due to these "invisible" 
values. The hidden values can be accessed through integer-based indexing though.

Some examples:

{code:python}
In [69]: RowClass = Row("column1", "column2")

In [70]: RowClass(1, 2) == RowClass(1, 2)
Out[70]: True

In [71]: RowClass(1, 2) == RowClass(1, 2, 3)
Out[71]: False

In [75]: RowClass(1, 2, 3)
Out[75]: Row(column1=1, column2=2)

In [76]: RowClass(1, 2)
Out[76]: Row(column1=1, column2=2)

In [77]: RowClass(1, 2, 3).asDict()
Out[77]: {'column1': 1, 'column2': 2}

In [78]: RowClass(1, 2, 3)[2]
Out[78]: 3

In [79]: repr(RowClass(1, 2, 3))
Out[79]: 'Row(column1=1, column2=2)'

In [80]: str(RowClass(1, 2, 3))
Out[80]: 'Row(column1=1, column2=2)'
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25071) BuildSide is coming not as expected with join queries

2018-08-09 Thread Ayush Anubhava (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Anubhava updated SPARK-25071:
---
Attachment: (was: SPARK-25071_IMG2.PNG)

> BuildSide is coming not as expected with join queries
> -
>
> Key: SPARK-25071
> URL: https://issues.apache.org/jira/browse/SPARK-25071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 
> Hadoop 2.7.3
>Reporter: Ayush Anubhava
>Priority: Major
>
> *BuildSide is not coming as expected.*
> Pre-requisites:
> *CBO is set as true &  spark.sql.cbo.joinReorder.enabled= true.*
> *import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec*
> *Steps:*
> *Scenario 1:*
> spark.sql("CREATE TABLE small3 (c1 bigint) TBLPROPERTIES ('numRows'='2', 
> 'rawDataSize'='600','totalSize'='800')")
>  spark.sql("CREATE TABLE big3 (c1 bigint) TBLPROPERTIES ('numRows'='2', 
> 'rawDataSize'='6000', 'totalSize'='800')")
>  val plan = spark.sql("select * from small3 t1 join big3 t2 on (t1.c1 = 
> t2.c1)").queryExecution.executedPlan
>  val buildSide = 
> plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
>  println(buildSide)
>  
> *Result 1:*
> scala> val plan = spark.sql("select * from small3 t1 join big3 t2 on (t1.c1 = 
> t2.c1)").queryExecution.executedPlan
>  plan: org.apache.spark.sql.execution.SparkPlan =
>  *(2) BroadcastHashJoin [c1#0L|#0L], [c1#1L|#1L], Inner, BuildRight
>  :- *(2) Filter isnotnull(c1#0L)
>  : +- HiveTableScan [c1#0L|#0L], HiveTableRelation `default`.`small3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#0L|#0L]
>  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]))
>  +- *(1) Filter isnotnull(c1#1L)
>  +- HiveTableScan [c1#1L|#1L], HiveTableRelation `default`.`big3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1L|#1L]
> scala> val buildSide = 
> plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
>  buildSide: org.apache.spark.sql.execution.joins.BuildSide = BuildRight
> scala> println(buildSide)
>  *BuildRight*
>  
> *Scenario 2:*
> spark.sql("CREATE TABLE small4 (c1 bigint) TBLPROPERTIES ('numRows'='2', 
> 'rawDataSize'='600','totalSize'='80')")
>  spark.sql("CREATE TABLE big4 (c1 bigint) TBLPROPERTIES ('numRows'='2', 
> 'rawDataSize'='6000', 'totalSize'='800')")
>  val plan = spark.sql("select * from small4 t1 join big4 t2 on (t1.c1 = 
> t2.c1)").queryExecution.executedPlan
>  val buildSide = 
> plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
>  println(buildSide)
> *Result 2:*
> scala> val plan = spark.sql("select * from small4 t1 join big4 t2 on (t1.c1 = 
> t2.c1)").queryExecution.executedPlan
>  plan: org.apache.spark.sql.execution.SparkPlan =
>  *(2) BroadcastHashJoin [c1#4L|#4L], [c1#5L|#5L], Inner, BuildRight
>  :- *(2) Filter isnotnull(c1#4L)
>  : +- HiveTableScan [c1#4L|#4L], HiveTableRelation `default`.`small4`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#4L|#4L]
>  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]))
>  +- *(1) Filter isnotnull(c1#5L)
>  +- HiveTableScan [c1#5L|#5L], HiveTableRelation `default`.`big4`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5L|#5L]
> scala> val buildSide = 
> plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
>  buildSide: org.apache.spark.sql.execution.joins.BuildSide = *BuildRight*
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


