[jira] [Commented] (SPARK-28672) [UDF] Duplicate function creation should not allow

2019-08-19 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911007#comment-16911007
 ] 

Liang-Chi Hsieh commented on SPARK-28672:
-

Is there any rule in Hive regarding this, such as disallowing duplicate 
permanent/temporary function names, or resolving the temporary/permanent 
function first when the names collide?

> [UDF] Duplicate function creation should not allow 
> ---
>
> Key: SPARK-28672
> URL: https://issues.apache.org/jira/browse/SPARK-28672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create function addm_3  AS 
> 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.084 seconds)
> {code}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create temporary function addm_3  
> AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> INFO  : converting to local hdfs://hacluster/user/Multiply.jar
> INFO  : Added 
> [/tmp/8a396308-41f8-4335-9de4-8268ce5c70fe_resources/Multiply.jar] to class 
> path
> INFO  : Added resources: [hdfs://hacluster/user/Multiply.jar]
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.134 seconds)
> {code}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> show functions like addm_3;
> +-+--+
> |function |
> +-+--+
> | addm_3  |
> | default.addm_3  |
> +-+--+
> 2 rows selected (0.047 seconds)
> {code}
> When SHOW FUNCTIONS is executed it lists both functions, but which database 
> does the permanent function belong to when the user has not specified one?
> A duplicate should not be allowed if the user creates a temporary function 
> with the same name.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28426) Metadata Handling in Thrift Server

2019-08-19 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28426.
-
Resolution: Fixed

> Metadata Handling in Thrift Server
> --
>
> Key: SPARK-28426
> URL: https://issues.apache.org/jira/browse/SPARK-28426
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
>
> Currently, only `executeStatement` is handled for SQL commands, while metadata 
> operations such as `getTables`, `getSchemas`, `getColumns`, and so on fall 
> back to an empty in-memory Derby catalog. As a result, some BI tools cannot 
> show the correct object information.
>  
> This umbrella Jira tracks the related Thrift server improvements.
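
For context, a minimal sketch of the JDBC metadata calls mentioned above that BI tools issue; the connection URL and credentials are placeholders, and the Hive JDBC driver is assumed to be on the classpath. These calls are served by the Thrift server's metadata operations rather than by executeStatement:

{code:scala}
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:hive2://localhost:10000/default", "user", "")
val meta = conn.getMetaData

// getSchemas / getTables / getColumns back the catalog trees shown by BI tools.
val tables = meta.getTables(null, "default", "%", Array("TABLE"))
while (tables.next()) {
  println(tables.getString("TABLE_SCHEM") + "." + tables.getString("TABLE_NAME"))
}
conn.close()
{code}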



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28672) [UDF] Duplicate function creation should not allow

2019-08-19 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911003#comment-16911003
 ] 

pavithra ramachandran commented on SPARK-28672:
---

[~maropu] [~viirya] 

The intention of this Jira is to disallow creating a temporary function when a 
permanent function with the same name already exists.

Could you confirm whether it is OK to handle this case?
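
For reference, a minimal sketch of the behavior under discussion. This is an illustration only; it assumes a Hive-enabled SparkSession and uses the stock Hive UDF class org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper purely as a placeholder implementation:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Permanent function, registered in the current database (default).
spark.sql(
  "CREATE FUNCTION addm_3 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'")

// Today this second statement also succeeds (as in the beeline output in the
// description); the proposal in this thread is that it should be rejected
// because a permanent function named addm_3 already exists.
spark.sql(
  "CREATE TEMPORARY FUNCTION addm_3 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'")
{code}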

> [UDF] Duplicate function creation should not allow 
> ---
>
> Key: SPARK-28672
> URL: https://issues.apache.org/jira/browse/SPARK-28672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create function addm_3  AS 
> 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.084 seconds)
> {code}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create temporary function addm_3  
> AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> INFO  : converting to local hdfs://hacluster/user/Multiply.jar
> INFO  : Added 
> [/tmp/8a396308-41f8-4335-9de4-8268ce5c70fe_resources/Multiply.jar] to class 
> path
> INFO  : Added resources: [hdfs://hacluster/user/Multiply.jar]
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.134 seconds)
> {code}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> show functions like addm_3;
> +-+--+
> |function |
> +-+--+
> | addm_3  |
> | default.addm_3  |
> +-+--+
> 2 rows selected (0.047 seconds)
> {code}
> When SHOW FUNCTIONS is executed it lists both functions, but which database 
> does the permanent function belong to when the user has not specified one?
> A duplicate should not be allowed if the user creates a temporary function 
> with the same name.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28672) [UDF] Duplicate function creation should not allow

2019-08-19 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911002#comment-16911002
 ] 

pavithra ramachandran commented on SPARK-28672:
---

[~abhishek.akg] - When we execute SHOW FUNCTIONS, it displays both the 
temporary and the permanent function that were created. Since no database was 
mentioned while creating the permanent function, it is stored as 
default.addm_3. Temporary functions are not specific to any database, so they 
are displayed without a db name. I don't think that is an issue.
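
A small sketch of how the session catalog reports this, assuming the two functions from the report above exist in the current session:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// listFunctions exposes the database and the isTemporary flag, which is why
// SHOW FUNCTIONS prints default.addm_3 for the permanent function and a bare
// addm_3 for the temporary one.
spark.catalog.listFunctions()
  .filter(_.name.contains("addm_3"))
  .select("name", "database", "isTemporary")
  .show(false)
{code}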

> [UDF] Duplicate function creation should not allow 
> ---
>
> Key: SPARK-28672
> URL: https://issues.apache.org/jira/browse/SPARK-28672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create function addm_3  AS 
> 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.084 seconds)
> {code}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create temporary function addm_3  
> AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> INFO  : converting to local hdfs://hacluster/user/Multiply.jar
> INFO  : Added 
> [/tmp/8a396308-41f8-4335-9de4-8268ce5c70fe_resources/Multiply.jar] to class 
> path
> INFO  : Added resources: [hdfs://hacluster/user/Multiply.jar]
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.134 seconds)
> {code}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> show functions like addm_3;
> +-+--+
> |function |
> +-+--+
> | addm_3  |
> | default.addm_3  |
> +-+--+
> 2 rows selected (0.047 seconds)
> {code}
> When SHOW FUNCTIONS is executed it lists both functions, but which database 
> does the permanent function belong to when the user has not specified one?
> A duplicate should not be allowed if the user creates a temporary function 
> with the same name.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28774) ReusedExchangeExec cannot be columnar

2019-08-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910999#comment-16910999
 ] 

Hyukjin Kwon commented on SPARK-28774:
--

Please avoid setting the target version, which is usually reserved for committers.

> ReusedExchangeExec cannot be columnar
> -
>
> Key: SPARK-28774
> URL: https://issues.apache.org/jira/browse/SPARK-28774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> If a ShuffleExchangeExec is replaced with a columnar version and deduped to a 
> ReusedExchangeExec, it will fail because ReusedExchangeExec does not implement 
> any of the columnar APIs.
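
A minimal sketch (local SparkSession assumed, with the default spark.sql.exchange.reuse=true) of how identical exchanges get deduplicated into the ReusedExchange node described above; if the original exchange had been planned as columnar, the reused node would need the columnar APIs too:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// The same aggregate, and hence the same shuffle exchange, appears twice in
// the plan; with exchange reuse enabled, the physical plan should show one
// Exchange and one ReusedExchange.
val agg = spark.range(0, 1000).groupBy(($"id" % 10).as("k")).count()
agg.union(agg).explain()
{code}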



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28774) ReusedExchangeExec cannot be columnar

2019-08-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28774:
-
Target Version/s:   (was: 3.0.0)

> ReusedExchangeExec cannot be columnar
> -
>
> Key: SPARK-28774
> URL: https://issues.apache.org/jira/browse/SPARK-28774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> If a ShuffleExchangeExec is replaced with a columnar version and deduped to a 
> ReusedExchangeExec, it will fail because ReusedExchangeExec does not implement 
> any of the columnar APIs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910994#comment-16910994
 ] 

Dongjoon Hyun commented on SPARK-28699:
---

Thank you for the update, [~XuanYuan]!

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-28699:

Description: 
It's another case for the indeterminate stage/RDD rerun while stage rerun 
happened.

We can reproduce this by the following code, thanks to Tyson for reporting this!
  
{code:scala}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
 if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
TaskContext.get.stageAttemptNumber == 0) {
 throw new Exception("pkill -f -n java".!!)
 }
 x
}

val r2 = df.distinct.count()
{code}

  was:
It's another case for the indeterminate stage/RDD rerun while stage rerun 
happened. In the CachedRDDBuilder.

We can reproduce this by the following code, thanks to Tyson for reporting this!
  
{code:scala}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
 if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
TaskContext.get.stageAttemptNumber == 0) {
 throw new Exception("pkill -f -n java".!!)
 }
 x
}

val r2 = df.distinct.count()
{code}


> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28699:
--
Affects Version/s: 2.3.3
   2.4.3

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Yuanjian Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910990#comment-16910990
 ] 

Yuanjian Li commented on SPARK-28699:
-

[~dongjoon] Sure, the affected versions are Spark 2.1 after 2.1.4, Spark 2.2 
after 2.2.3, Spark 2.3, and Spark 2.4; the Jira fields have been updated.

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-28699:

Affects Version/s: 2.3.3
   2.4.3

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28699:
--
 Target Version/s: 2.3.4, 2.4.4
Affects Version/s: (was: 2.4.3)
   (was: 2.3.3)

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910987#comment-16910987
 ] 

Dongjoon Hyun commented on SPARK-28699:
---

:)

BTW, I updated this to `Blocker` according to [~smilegator]'s advice.

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28699:
--
Priority: Blocker  (was: Major)

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28699:
--
Description: 
It's another case for the indeterminate stage/RDD rerun while stage rerun 
happened. In the CachedRDDBuilder.

We can reproduce this by the following code, thanks to Tyson for reporting this!
  
{code:scala}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
 if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
TaskContext.get.stageAttemptNumber == 0) {
 throw new Exception("pkill -f -n java".!!)
 }
 x
}

val r2 = df.distinct.count()
{code}

  was:
Related with SPARK-23207 SPARK-23243

It's another case for the indeterminate stage/RDD rerun while stage rerun 
happened. In the CachedRDDBuilder.

We can reproduce this by the following code, thanks to Tyson for reporting this!
  
{code:scala}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
 if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
TaskContext.get.stageAttemptNumber == 0) {
 throw new Exception("pkill -f -n java".!!)
 }
 x
}

val r2 = df.distinct.count()
{code}


> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28699:
--
Description: 
Related with SPARK-23207 SPARK-23243

It's another case for the indeterminate stage/RDD rerun while stage rerun 
happened. In the CachedRDDBuilder.

We can reproduce this by the following code, thanks to Tyson for reporting this!
  
{code:scala}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
 if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
TaskContext.get.stageAttemptNumber == 0) {
 throw new Exception("pkill -f -n java".!!)
 }
 x
}

val r2 = df.distinct.count()
{code}

  was:
Related with SPARK-23207 SPARK-23243

It's another case for the indeterminate stage/RDD rerun while stage rerun 
happened. In the CachedRDDBuilder.

We can reproduce this by the following code, thanks to Tyson for reporting this!
  
{code:scala}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
 if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
TaskContext.get.stageAttemptNumber == 0) {
 throw new Exception("pkill -f -n java".!!)
 }
 x
}

val r2 = df.distinct.count()
{code}


> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related with SPARK-23207 SPARK-23243
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910983#comment-16910983
 ] 

Kazuaki Ishizaki commented on SPARK-28699:
--

[~dongjoon] Thank you for pointing out my typo. You are right; I should have 
said {2.3.4-rc1}.

Actually, although I was in the middle of the following, it has not been 
reflected in the repository yet. After fixing this, let me restart the release 
process for {2.3.4-rc1}.

{code}
Release details:
BRANCH: branch-2.3
VERSION: 2.3.4
TAG: v2.3.4-rc1
{code}

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related with SPARK-23207 SPARK-23243
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Yuanjian Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905250#comment-16905250
 ] 

Yuanjian Li edited comment on SPARK-28699 at 8/20/19 4:09 AM:
--

-The current [approach|https://github.com/apache/spark/pull/25420] is just a 
band-aid fix for returning the wrong answer.-

After further investigation, we found that this bug has nothing to do with the 
cache operation. So we focused on the sort + shuffle itself and finally found 
that the root cause is incorrect usage of radix sort.

In the original logic, we enable radix sort based only on the config and use it 
for the binary data comparison. That may be OK when the dataset has only one 
numeric column, but in this case the binary format produced by the 
“map\{ x => (x%1000, x)}” transformation cannot be sorted by radix sort.

After the fix in [https://github.com/apache/spark/pull/25491], all tests passed 
with the right answer.

Also, a corner case in DAGScheduler found during testing is fixed separately in 
[https://github.com/apache/spark/pull/25491].

After we finish the work on indeterminate stage rerun (SPARK-25341), we can fix 
this by unpersisting the original RDD and rerunning the cached indeterminate 
stage. A preview codebase is available 
[here|https://github.com/xuanyuanking/spark/tree/SPARK-28699-RERUN].
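
A sketch of one way to sanity-check the radix-sort hypothesis above: rerun the repro from the description with radix sort turned off and compare the counts. The config name below (spark.sql.sort.enableRadixSort) is an internal SQL conf, so treat it as an assumption for experimentation rather than a supported switch.

{code:scala}
// Disable radix sort for this session, then rerun the repro from the
// description and compare df.distinct.count() with the expected value.
spark.conf.set("spark.sql.sort.enableRadixSort", "false")
{code}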


was (Author: xuanyuan):
The current [approach|https://github.com/apache/spark/pull/25420] just a 
bandage fix for returning the wrong answer.

After we finish the work of indeterminate stage rerunning(SPARK-25341), we can 
fix this by unpersisting the original RDD and rerunning the cached 
indeterminate stage. Gives a preview codebase 
[here|https://github.com/xuanyuanking/spark/tree/SPARK-28699-RERUN].

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related with SPARK-23207 SPARK-23243
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder, we miss tracking the `isOrderSensitive` 
> characteristic to the newly created MapPartitionsRDD.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-28699:

Description: 
Related with SPARK-23207 SPARK-23243

It's another case for the indeterminate stage/RDD rerun while stage rerun 
happened. In the CachedRDDBuilder.

We can reproduce this by the following code, thanks to Tyson for reporting this!
  
{code:scala}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
 if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
TaskContext.get.stageAttemptNumber == 0) {
 throw new Exception("pkill -f -n java".!!)
 }
 x
}

val r2 = df.distinct.count()
{code}

  was:
Related with SPARK-23207 SPARK-23243

It's another case for the indeterminate stage/RDD rerun while stage rerun 
happened. In the CachedRDDBuilder, we miss tracking the `isOrderSensitive` 
characteristic to the newly created MapPartitionsRDD.

We can reproduce this by the following code, thanks to Tyson for reporting this!
 
{code:scala}
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
 if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
TaskContext.get.stageAttemptNumber == 0) {
 throw new Exception("pkill -f -n java".!!)
 }
 x
}

val r2 = df.distinct.count()
{code}



> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related with SPARK-23207 SPARK-23243
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>   
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910974#comment-16910974
 ] 

Dongjoon Hyun commented on SPARK-28699:
---

? [~kiszk], `2.4.4-rc1` is on `branch-2.4` and is mine; you should not remove that.
I guess you meant to say `2.3.4-rc1`, and `2.3.4-rc1` has not been created yet.

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related with SPARK-23207 SPARK-23243
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder, we miss tracking the `isOrderSensitive` 
> characteristic to the newly created MapPartitionsRDD.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28777) Pyspark sql function "format_string" has the wrong parameters in doc string

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28777.
---
Fix Version/s: 2.4.4
   2.3.4
   3.0.0
   Resolution: Fixed

Issue resolved by pull request 25506
[https://github.com/apache/spark/pull/25506]

> Pyspark sql function "format_string" has the wrong parameters in doc string
> ---
>
> Key: SPARK-28777
> URL: https://issues.apache.org/jira/browse/SPARK-28777
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Darren Tirto
>Assignee: Darren Tirto
>Priority: Minor
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> The pyspark sql function "format_string" has the function declaration 
> "def format_string(format, *cols):". However, the function's doc string 
> describes the parameters as:
>  
> {code:java}
> :param col: the column name of the numeric value to be formatted
> :param d: the N decimal places
> {code}
>  
> We want to update the doc string to accurately describe the parameters.
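
For reference, a minimal usage sketch using the Scala API (org.apache.spark.sql.functions.format_string, which mirrors the PySpark signature): the parameters are a printf-style format string plus the columns to substitute, not a single column and a decimal-places count.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, format_string}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

Seq(("hello", 42)).toDF("word", "n")
  .select(format_string("%s has value %d", col("word"), col("n")).as("formatted"))
  .show(false)
// Expected output value: "hello has value 42"
{code}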



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28777) Pyspark sql function "format_string" has the wrong parameters in doc string

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28777:
-

Assignee: Darren Tirto

> Pyspark sql function "format_string" has the wrong parameters in doc string
> ---
>
> Key: SPARK-28777
> URL: https://issues.apache.org/jira/browse/SPARK-28777
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Darren Tirto
>Assignee: Darren Tirto
>Priority: Minor
>
> The pyspark sql function "format_string" has the function declaration 
> "def format_string(format, *cols):". However, the function's doc string 
> describes the parameters as:
>  
> {code:java}
> :param col: the column name of the numeric value to be formatted
> :param d: the N decimal places
> {code}
>  
> We want to update the doc string to accurately describe the parameters.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-28712) spark structured stream with kafka don't really delete temp files in spark standalone cluster

2019-08-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-28712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

凭落 closed SPARK-28712.
--

solved by SPARK-28025 

> spark structured stream with kafka don't really delete temp files in spark 
> standalone cluster
> -
>
> Key: SPARK-28712
> URL: https://issues.apache.org/jira/browse/SPARK-28712
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: redhat 7
> jdk 1.8
> scala 2.11.12
>  spark standalone cluster 2.4.3
>  kafka 0.10.2.1
>  
>Reporter: 凭落
>Priority: Major
>
> the folder on the Driver
> {noformat}
> /tmp/temporary-{noformat}
>  takes up all the space in /tmp after running a spark structured streaming job 
> for a long time.
> It is mainly under the offsets and commits folders, but when I check it with the 
> command
> {noformat}
> du -sh offsets du -sh commits{noformat}
> it reports more than 600M, whereas the command
> {noformat}
> ll -h offsets   ll -h commits{noformat}
> reports 400K.
> I think it is because the files are deleted while they are still in use by the 
> job, so the space is not released until the job is stopped.
> How can I solve it?
> We use 
> {code}
> df.writeStream.trigger(ProcessingTime("1 seconds"))
> {code}
> not
> {code}
> df.writeStream.trigger(Continuous("1 seconds"))
> {code}
> Is there something wrong here?
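
Related to the report above, a sketch (local session and placeholder paths assumed) of pointing the checkpoint at an explicit location instead of the default /tmp/temporary-* directory, which keeps the driver's /tmp from filling up:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[2]").getOrCreate()

// Stand-in source; in the report the stream comes from Kafka.
val df = spark.readStream.format("rate").load()

val query = df.writeStream
  .trigger(Trigger.ProcessingTime("1 second"))
  // Without this option Spark creates a temporary checkpoint dir under /tmp.
  .option("checkpointLocation", "hdfs:///checkpoints/my-streaming-job")
  .format("console")
  .start()
{code}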



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28712) spark structured stream with kafka don't really delete temp files in spark standalone cluster

2019-08-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-28712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910957#comment-16910957
 ] 

凭落 commented on SPARK-28712:


[~kabhwan] thanks a lot! It really helps!

> spark structured stream with kafka don't really delete temp files in spark 
> standalone cluster
> -
>
> Key: SPARK-28712
> URL: https://issues.apache.org/jira/browse/SPARK-28712
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: redhat 7
> jdk 1.8
> scala 2.11.12
>  spark standalone cluster 2.4.3
>  kafka 0.10.2.1
>  
>Reporter: 凭落
>Priority: Major
>
> the folder on the Driver
> {noformat}
> /tmp/temporary-{noformat}
>  takes up all the space in /tmp after running a spark structured streaming job 
> for a long time.
> It is mainly under the offsets and commits folders, but when I check it with the 
> command
> {noformat}
> du -sh offsets du -sh commits{noformat}
> it reports more than 600M, whereas the command
> {noformat}
> ll -h offsets   ll -h commits{noformat}
> reports 400K.
> I think it is because the files are deleted while they are still in use by the 
> job, so the space is not released until the job is stopped.
> How can I solve it?
> We use 
> {code}
> df.writeStream.trigger(ProcessingTime("1 seconds"))
> {code}
> not
> {code}
> df.writeStream.trigger(Continuous("1 seconds"))
> {code}
> Is there something wrong here?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28712) spark structured stream with kafka don't really delete temp files in spark standalone cluster

2019-08-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-28712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910956#comment-16910956
 ] 

凭落 commented on SPARK-28712:


[~hyukjin.kwon] I'm sorry about that; next time I'll use the mailing list first.

> spark structured stream with kafka don't really delete temp files in spark 
> standalone cluster
> -
>
> Key: SPARK-28712
> URL: https://issues.apache.org/jira/browse/SPARK-28712
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: redhat 7
> jdk 1.8
> scala 2.11.12
>  spark standalone cluster 2.4.3
>  kafka 0.10.2.1
>  
>Reporter: 凭落
>Priority: Major
>
> the folder on the Driver
> {noformat}
> /tmp/temporary-{noformat}
>  takes up all the space in /tmp after running a spark structured streaming job 
> for a long time.
> It is mainly under the offsets and commits folders, but when I check it with the 
> command
> {noformat}
> du -sh offsets du -sh commits{noformat}
> it reports more than 600M, whereas the command
> {noformat}
> ll -h offsets   ll -h commits{noformat}
> reports 400K.
> I think it is because the files are deleted while they are still in use by the 
> job, so the space is not released until the job is stopped.
> How can I solve it?
> We use 
> {code}
> df.writeStream.trigger(ProcessingTime("1 seconds"))
> {code}
> not
> {code}
> df.writeStream.trigger(Continuous("1 seconds"))
> {code}
> Is there something wrong here?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910907#comment-16910907
 ] 

Kazuaki Ishizaki commented on SPARK-28699:
--

[~smilegator] Thank you for the cc. I will wait for this to be fixed.

I was in the middle of releasing RC1, so there is already a {{2.4.4-rc1}} tag 
in [branch-2.3|https://github.com/apache/spark/tree/branch-2.3]. Should I 
remove this tag and release rc1, or should I leave this tag and release rc2 
first?


> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related with SPARK-23207 SPARK-23243
> It's another case for the indeterminate stage/RDD rerun while stage rerun 
> happened. In the CachedRDDBuilder, we miss tracking the `isOrderSensitive` 
> characteristic to the newly created MapPartitionsRDD.
> We can reproduce this by the following code, thanks to Tyson for reporting 
> this!
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map\{ x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28777) Pyspark sql function "format_string" has the wrong parameters in doc string

2019-08-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910893#comment-16910893
 ] 

Dongjoon Hyun commented on SPARK-28777:
---

Welcome, [~darrentirto]. Thank you for filing a new JIRA for this.
In the Apache Spark community, we assign the issue to you after merging your 
PR.
When you are assigned an issue, you will be added to the Apache Spark 
contributor group.

> Pyspark sql function "format_string" has the wrong parameters in doc string
> ---
>
> Key: SPARK-28777
> URL: https://issues.apache.org/jira/browse/SPARK-28777
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Darren Tirto
>Priority: Minor
>
> The pyspark sql function "format_string" has the function declaration 
> "def format_string(format, *cols):". However, the function's doc string 
> describes the parameters as:
>  
> {code:java}
> :param col: the column name of the numeric value to be formatted
> :param d: the N decimal places
> {code}
>  
> We want to update the doc string to accurately describe the parameters.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28775) DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone database

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28775.
---
Fix Version/s: 2.4.4
   2.3.4
   3.0.0
   Resolution: Fixed

Issue resolved by pull request 25504
[https://github.com/apache/spark/pull/25504]

> DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone 
> database
> --
>
> Key: SPARK-28775
> URL: https://issues.apache.org/jira/browse/SPARK-28775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Herman van Hovell
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> The org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite 'daysToMillis and 
> millisToDays' test case fails because of an update in the timezone library: 
> tzdata2018h. This retroactively changes the value of a missing day for the 
> Kwajalein Atoll. For more information see: 
> https://bugs.openjdk.java.net/browse/JDK-8215981
> Let's fix this by excluding both dates.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28224) Check overflow in decimal Sum aggregate

2019-08-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28224.
--
Fix Version/s: 3.0.0
 Assignee: Mick Jermsurawong
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/25033]

> Check overflow in decimal Sum aggregate
> ---
>
> Key: SPARK-28224
> URL: https://issues.apache.org/jira/browse/SPARK-28224
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mick Jermsurawong
>Assignee: Mick Jermsurawong
>Priority: Major
> Fix For: 3.0.0
>
>
> To reproduce:
> {code:scala}
> import spark.implicits._
> import org.apache.spark.sql.functions.sum
> val ds = spark
>   .createDataset(Seq(BigDecimal("1" * 20), BigDecimal("9" * 20)))
>   .agg(sum("value"))
>   .as[BigDecimal]
> ds.collect shouldEqual Seq(null){code}
> Even with the option to throw an exception on overflow turned on, sum 
> aggregation of overflowing BigDecimals still returns null. {{DecimalAggregates}} 
> is only invoked when the sum expression (not the elements being aggregated) 
> has sufficiently small precision. The fix seems to belong in the Sum 
> expression itself.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28777) Pyspark sql function "format_string" has the wrong parameters in doc string

2019-08-19 Thread Darren Tirto (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910865#comment-16910865
 ] 

Darren Tirto commented on SPARK-28777:
--

Sorry, I'm a little new to this Jira board. I created a pull request for this 
here: [https://github.com/apache/spark/pull/25506]

However, I'm not sure how to officially link the PR to this issue or how to add 
myself as an assignee. Thanks.

> Pyspark sql function "format_string" has the wrong parameters in doc string
> ---
>
> Key: SPARK-28777
> URL: https://issues.apache.org/jira/browse/SPARK-28777
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Darren Tirto
>Priority: Minor
>
> The pyspark sql function "format_string" has the function declaration 
> "def format_string(format, *cols):". However, the function's doc string 
> describes the parameters as:
>  
> {code:java}
> :param col: the column name of the numeric value to be formatted
> :param d: the N decimal places
> {code}
>  
> We want to update the doc string to accurately describe the parameters.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-08-19 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910864#comment-16910864
 ] 

Jungtaek Lim commented on SPARK-27648:
--

It would be even better if you could reproduce this with the local filesystem 
or HDFS, as the original reporter doesn't seem to use S3. If you're dealing 
with a non-Apache distribution, please stick with the Apache distribution while 
reproducing.

Thanks in advance!

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png, 
> image-2019-06-02-19-43-21-652.png
>
>
> *Spark Program Code Business:*
>  Read a topic from Kafka, aggregate the streaming data sources, and then output 
> it to another Kafka topic.
> *Problem Description:*
> *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory 
> overflow problems often occur (because of too many versions of state stored 
> in memory; this bug has been fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occur when running with these submit resources under 
> Spark 2.2, and the normal running time does not exceed one day.
> The solution is to set the executor memory larger than before: 
> {code:java}
> My spark-submit script is as follows:
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...
> {code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade information of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded Spark to 2.4 under CDH, tried to run the Spark program, 
> and found that memory usage was reduced.
>  But a problem arises: as the running time increases, the storage memory of the 
> executors keeps growing (see Executors -> Storage Memory in the Spark on YARN 
> Resource Manager UI).
>  This program has been running for 14 days (under Spark 2.2, running with 
> these submit resources, the normal running time was no more than one day 
> before executor memory problems occurred).
>  The script submitted by the program under spark2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I counted the size of executor memory as time went by during 
> the running of the spark program:
> |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
> |23.5H|41.6MB/1.5GB|1.770212766|
> |108.4H|460.2MB/1.5GB|4.245387454|
> |131.7H|559.1MB/1.5GB|4.245254366|
> |135.4H|575MB/1.5GB|4.246676514|
> |153.6H|641.2MB/1.5GB|4.174479167|
> |219H|888.1MB/1.5GB|4.055251142|
> |263H|1126.4MB/1.5GB|4.282889734|
> |309H|1228.8MB/1.5GB|3.976699029|



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28777) Pyspark sql function "format_string" has the wrong parameters in doc string

2019-08-19 Thread Darren Tirto (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darren Tirto updated SPARK-28777:
-
Shepherd:   (was: Darren Tirto)

> Pyspark sql function "format_string" has the wrong parameters in doc string
> ---
>
> Key: SPARK-28777
> URL: https://issues.apache.org/jira/browse/SPARK-28777
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Darren Tirto
>Priority: Minor
>
> The PySpark SQL function "format_string" has the declaration {{def 
> format_string(format, *cols)}}. However, the function's docstring 
> describes the parameters as:
>  
> {code:java}
> :param col: the column name of the numeric value to be formatted
> :param d: the N decimal places
> {code}
>  
> We want to update the doc string to accurately describe the parameters.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28777) Pyspark sql function "format_string" has the wrong parameters in doc string

2019-08-19 Thread Darren Tirto (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darren Tirto updated SPARK-28777:
-
Shepherd: Darren Tirto

> Pyspark sql function "format_string" has the wrong parameters in doc string
> ---
>
> Key: SPARK-28777
> URL: https://issues.apache.org/jira/browse/SPARK-28777
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Darren Tirto
>Priority: Minor
>
> The PySpark SQL function "format_string" has the declaration {{def 
> format_string(format, *cols)}}. However, the function's docstring 
> describes the parameters as:
>  
> {code:java}
> :param col: the column name of the numeric value to be formatted
> :param d: the N decimal places
> {code}
>  
> We want to update the doc string to accurately describe the parameters.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-08-19 Thread Puneet Loya (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910861#comment-16910861
 ] 

Puneet Loya commented on SPARK-27648:
-

The Cassandra sink is nothing but a Cassandra foreachBatch (supported by default in 
Spark 2.4). I also used S3 checkpointing.

I had used Hadoop 3.1 because the Hadoop 2.7.3 aws-sdk does not have the retry logic 
if it disconnects from S3. 

But I will try to repeat it with 2.7.3.

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png, 
> image-2019-06-02-19-43-21-652.png
>
>
> *Spark Program Code Business:*
>  Read a topic from Kafka, aggregate the streaming data sources, and then output 
> it to another Kafka topic.
> *Problem Description:*
> *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory 
> overflow problems often occur (because of too many versions of state stored 
> in memory; this bug has been fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occur when running with these submit resources under 
> Spark 2.2, and the normal running time does not exceed one day.
> The solution is to set the executor memory larger than before: 
> {code:java}
> My spark-submit script is as follows:
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...
> {code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade information of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded Spark to 2.4 under CDH, tried to run the Spark program, 
> and found that memory usage was reduced.
>  But a problem arises: as the running time increases, the storage memory of the 
> executors keeps growing (see Executors -> Storage Memory in the Spark on YARN 
> Resource Manager UI).
>  This program has been running for 14 days (under Spark 2.2, running with 
> these submit resources, the normal running time was no more than one day 
> before executor memory problems occurred).
>  The script submitted by the program under spark2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I counted the size of executor memory as time went by during 
> the running of the spark program:
> |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
> |23.5H|41.6MB/1.5GB|1.770212766|
> |108.4H|460.2MB/1.5GB|4.245387454|
> |131.7H|559.1MB/1.5GB|4.245254366|
> |135.4H|575MB/1.5GB|4.246676514|
> |153.6H|641.2MB/1.5GB|4.174479167|
> |219H|888.1MB/1.5GB|4.055251142|
> |263H|1126.4MB/1.5GB|4.282889734|
> |309H|1228.8MB/1.5GB|3.976699029|



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22390) Aggregate push down

2019-08-19 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910860#comment-16910860
 ] 

Huaxin Gao commented on SPARK-22390:


I haven't looked at the Datasource V2 implementation for a while. I will take a 
look and see how to fit my work in there. [~arunkhetarpal87]

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910859#comment-16910859
 ] 

Xiao Li commented on SPARK-28699:
-

Also cc [~kiszk]. Let us wait for this before starting RC1 for 2.3.

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related to SPARK-23207 and SPARK-23243.
> It's another case of an indeterminate stage/RDD producing incorrect results when a stage 
> rerun happens. In CachedRDDBuilder, we miss propagating the `isOrderSensitive` 
> characteristic to the newly created MapPartitionsRDD.
> We can reproduce this with the following code; thanks to Tyson for reporting 
> this!
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map { x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28749) Fix PySpark tests not to require kafka-0-8 in branch-2.4

2019-08-19 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28749:
-

Assignee: Matt Foley

> Fix PySpark tests not to require kafka-0-8 in branch-2.4
> 
>
> Key: SPARK-28749
> URL: https://issues.apache.org/jira/browse/SPARK-28749
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.4.3
>Reporter: Matt Foley
>Assignee: Matt Foley
>Priority: Minor
>
> As noted in SPARK-27550 we want to encourage testing of Spark 2.4.x with 
> Scala-2.12, and kafka-0-8 does not support Scala-2.12.
> Currently, the PySpark tests invoked by `python/run-tests` demand the 
> presence of kafka-0-8 libraries. If not present, this failure message will be 
> generated:
>  {code}
> Traceback (most recent call last):
>  File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
>  "__main__", fname, loader, pkg_name)
>  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
>  exec code in run_globals
>  File "spark/python/pyspark/streaming/tests.py", line 1579, in 
>  kafka_assembly_jar = search_kafka_assembly_jar()
>  File "spark/python/pyspark/streaming/tests.py", line 1524, in 
> search_kafka_assembly_jar
>  "You need to build Spark with "
>  Exception: Failed to find Spark Streaming kafka assembly jar in 
> spark/external/kafka-0-8-assembly. You need to build Spark with 'build/sbt 
> -Pkafka-0-8 assembly/package streaming-kafka-0-8-assembly/assembly' or 
> 'build/mvn -DskipTests -Pkafka-0-8 package' before running this test.
> Had test failures in pyspark.streaming.tests with 
> spark/py_virtenv/bin/python; see logs.
>  Process exited with code 255
> {code}
> This change is only targeted at branch-2.4, as most kafka-0-8 related 
> materials have been removed in master and this problem no longer occurs there.
> PROPOSED SOLUTION
> The proposed solution is to make the kafka-0-8 stream testing optional for 
> pyspark testing, exactly the same as the Kinesis stream testing currently is, 
> in file `python/pyspark/streaming/tests.py`. This is only a few lines of 
> change.
> Ideally it would be limited to when SPARK_SCALA_VERSION >= 2.12, but it turns 
> out to be somewhat onerous to reliably obtain that value from within the 
> python test env, and no other python test code currently does so. So my 
> proposed solution simply makes the use of the kafka-0-8 profile optional, and 
> leaves it to the tester to include it for Scala-2.11 test builds and exclude 
> it for Scala-2.12 test builds.
> PR will be available in a day or so.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28749) Fix PySpark tests not to require kafka-0-8 in branch-2.4

2019-08-19 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28749.
---
Fix Version/s: 2.4.5
   Resolution: Fixed

Issue resolved by pull request 25482
[https://github.com/apache/spark/pull/25482]

> Fix PySpark tests not to require kafka-0-8 in branch-2.4
> 
>
> Key: SPARK-28749
> URL: https://issues.apache.org/jira/browse/SPARK-28749
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.4.3
>Reporter: Matt Foley
>Assignee: Matt Foley
>Priority: Minor
> Fix For: 2.4.5
>
>
> As noted in SPARK-27550 we want to encourage testing of Spark 2.4.x with 
> Scala-2.12, and kafka-0-8 does not support Scala-2.12.
> Currently, the PySpark tests invoked by `python/run-tests` demand the 
> presence of kafka-0-8 libraries. If not present, this failure message will be 
> generated:
>  {code}
> Traceback (most recent call last):
>  File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
>  "__main__", fname, loader, pkg_name)
>  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
>  exec code in run_globals
>  File "spark/python/pyspark/streaming/tests.py", line 1579, in 
>  kafka_assembly_jar = search_kafka_assembly_jar()
>  File "spark/python/pyspark/streaming/tests.py", line 1524, in 
> search_kafka_assembly_jar
>  "You need to build Spark with "
>  Exception: Failed to find Spark Streaming kafka assembly jar in 
> spark/external/kafka-0-8-assembly. You need to build Spark with 'build/sbt 
> -Pkafka-0-8 assembly/package streaming-kafka-0-8-assembly/assembly' or 
> 'build/mvn -DskipTests -Pkafka-0-8 package' before running this test.
> Had test failures in pyspark.streaming.tests with 
> spark/py_virtenv/bin/python; see logs.
>  Process exited with code 255
> {code}
> This change is only targeted at branch-2.4, as most kafka-0-8 related 
> materials have been removed in master and this problem no longer occurs there.
> PROPOSED SOLUTION
> The proposed solution is to make the kafka-0-8 stream testing optional for 
> pyspark testing, exactly the same as the Kinesis stream testing currently is, 
> in file `python/pyspark/streaming/tests.py`. This is only a few lines of 
> change.
> Ideally it would be limited to when SPARK_SCALA_VERSION >= 2.12, but it turns 
> out to be somewhat onerous to reliably obtain that value from within the 
> python test env, and no other python test code currently does so. So my 
> proposed solution simply makes the use of the kafka-0-8 profile optional, and 
> leaves it to the tester to include it for Scala-2.11 test builds and exclude 
> it for Scala-2.12 test builds.
> PR will be available in a day or so.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28775) DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone database

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28775:
--
Issue Type: Bug  (was: Improvement)

> DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone 
> database
> --
>
> Key: SPARK-28775
> URL: https://issues.apache.org/jira/browse/SPARK-28775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Herman van Hovell
>Assignee: Sean Owen
>Priority: Major
>
> The org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite 'daysToMillis and 
> millisToDays' test case fails because of an update in the timezone library: 
> tzdata2018h. This retroactively changes the value of a missing day for the 
> Kwajalein Atoll. For more information, see 
> https://bugs.openjdk.java.net/browse/JDK-8215981
> Let's fix this by excluding both dates.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28775) DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone database

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28775:
--
Component/s: Tests

> DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone 
> database
> --
>
> Key: SPARK-28775
> URL: https://issues.apache.org/jira/browse/SPARK-28775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Herman van Hovell
>Assignee: Sean Owen
>Priority: Major
>
> The org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite 'daysToMillis and 
> millisToDays' test case fails because of an update in the timezone library: 
> tzdata2018h. This retroactively changes the value of a missing day for the 
> Kwajalein Atoll. For more information, see 
> https://bugs.openjdk.java.net/browse/JDK-8215981
> Let's fix this by excluding both dates.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28779) CSV writer doesn't handle older Mac line endings

2019-08-19 Thread nicolas paris (Jira)
nicolas paris created SPARK-28779:
-

 Summary: CSV writer doesn't handle older Mac line endings
 Key: SPARK-28779
 URL: https://issues.apache.org/jira/browse/SPARK-28779
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.0
Reporter: nicolas paris


The Spark CSV writer does not consider "\r" as a newline in string-type 
columns. As a result, the affected CSV fields are not quoted, and the output gets corrupted.

All of \n, \r\n and \r should be treated as newlines to allow robust CSV 
serialization.
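A minimal reproduction sketch (the output path is only an example): a value containing a bare carriage return is not treated as a line separator by the writer, so the field is emitted unquoted and the resulting file breaks.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-cr-demo").getOrCreate()
import spark.implicits._

val df = Seq(("id-1", "line one\rline two")).toDF("id", "text")
// Robust output would quote the field: id-1,"line one<CR>line two"
// Reported behaviour: the field is written without quotes because only \n and \r\n
// are treated as newlines.
df.coalesce(1).write.mode("overwrite").csv("/tmp/csv-cr-demo")
{code}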



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28778) Shuffle jobs fail due to incorrect advertised address when running in virtual network

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28778:
--
Summary: Shuffle jobs fail due to incorrect advertised address when running 
in virtual network  (was: [MESOS] Shuffle jobs fail due to incorrect advertised 
address when running in virtual network)

> Shuffle jobs fail due to incorrect advertised address when running in virtual 
> network
> -
>
> Key: SPARK-28778
> URL: https://issues.apache.org/jira/browse/SPARK-28778
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.3, 2.3.0, 2.4.3
>Reporter: Anton Kirillov
>Priority: Major
>  Labels: Mesos
>
> When shuffle jobs are launched by Mesos in a virtual network, Mesos scheduler 
> sets executor {{--hostname}} parameter to {{0.0.0.0}} in the case when 
> {{spark.mesos.network.name}} is provided. This makes executors use 
> {{0.0.0.0}} as their advertised address and, in the presence of shuffle, 
> executors fail to fetch shuffle blocks from each other using {{0.0.0.0}} as 
> the origin. When a virtual network is used, the hostname or IP address is not 
> known upfront and is assigned to the container at start time, so the executor 
> process needs to advertise the correct dynamically assigned address to be 
> reachable by other executors.
> The bug described above prevents Mesos users from running any jobs that 
> involve shuffle, because executors cannot fetch shuffle blocks when the 
> advertised address is incorrect in a virtual network.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28778) [MESOS] Shuffle jobs fail due to incorrect advertised address when running in virtual network

2019-08-19 Thread Anton Kirillov (Jira)
Anton Kirillov created SPARK-28778:
--

 Summary: [MESOS] Shuffle jobs fail due to incorrect advertised 
address when running in virtual network
 Key: SPARK-28778
 URL: https://issues.apache.org/jira/browse/SPARK-28778
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.4.3, 2.3.0, 2.2.3
Reporter: Anton Kirillov


When shuffle jobs are launched by Mesos in a virtual network, Mesos scheduler 
sets executor {{--hostname}} parameter to {{0.0.0.0}} in the case when 
{{spark.mesos.network.name}} is provided. This makes executors use {{0.0.0.0}} 
as their advertised address and, in the presence of shuffle, executors fail to 
fetch shuffle blocks from each other using {{0.0.0.0}} as the origin. When a 
virtual network is used, the hostname or IP address is not known upfront and 
is assigned to the container at start time, so the executor process needs to 
advertise the correct dynamically assigned address to be reachable by other 
executors.

The bug described above prevents Mesos users from running any jobs that 
involve shuffle, because executors cannot fetch shuffle blocks when the 
advertised address is incorrect in a virtual network.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910795#comment-16910795
 ] 

Dongjoon Hyun commented on SPARK-28699:
---

Hi, [~XuanYuan].
Could you check old Spark versions and update `Affects Version/s:` of this JIRA 
issue?

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related to SPARK-23207 and SPARK-23243.
> It's another case of an indeterminate stage/RDD producing incorrect results when a stage 
> rerun happens. In CachedRDDBuilder, we miss propagating the `isOrderSensitive` 
> characteristic to the newly created MapPartitionsRDD.
> We can reproduce this with the following code; thanks to Tyson for reporting 
> this!
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map { x => (x % 1000, x)}
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && 
> TaskContext.get.stageAttemptNumber == 0) {
>  throw new Exception("pkill -f -n java".!!)
>  }
>  x
> }
> val r2 = df.distinct.count()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28466) FileSystem closed error when to call Hive.moveFile

2019-08-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910790#comment-16910790
 ] 

Dongjoon Hyun commented on SPARK-28466:
---

Hi, [~angerszhuuu]. For an `Improvement` JIRA issue, 3.0.0 is the proper 
affected version because we don't allow backporting new features.

>  FileSystem closed error when to call Hive.moveFile
> ---
>
> Key: SPARK-28466
> URL: https://issues.apache.org/jira/browse/SPARK-28466
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2019-07-22-09-58-19-023.png, 
> image-2019-07-22-09-58-55-107.png
>
>
> When we close an STS session that has run some INSERT SQL, and another 
> session then runs CTAS/INSERT and triggers Hive.moveFile, 
> DFSClient will do checkOpen and throw java.io.IOException: Filesystem closed.
> **Root cause**:
> When we first execute SQL like CTAS/INSERT, it calls Hive.moveFile. During 
> this method, the field SessionState.hdfsEncryptionShim is initialized, 
> and initializing this field initializes a FileSystem.
> But this FileSystem belongs to the current HiveSessionImpleWithUgi.sessionUgi, so when we 
> close this session, `FileSystem.closeForUgi()` is called and that FileSystem 
> is closed. Then, when another session executes SQL like CTAS/INSERT, this 
> error happens because the FileSystem has been closed.
> Some may be confused about why HiveServer2 doesn't show this problem:
> - In HiveServer2, each session has its own SessionState, so closing the current 
> session's FS is fine.
> - In SparkThriftServer, all sessions interact with Hive through one 
> HiveClientImpl, which has only one SessionState; when we call a method on 
> HiveClientImpl, it calls **withHiveState** first to set HiveClientImpl's 
> sessionState on the current thread.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28466) FileSystem closed error when to call Hive.moveFile

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28466:
--
Affects Version/s: (was: 2.4.0)
   (was: 2.3.0)

>  FileSystem closed error when to call Hive.moveFile
> ---
>
> Key: SPARK-28466
> URL: https://issues.apache.org/jira/browse/SPARK-28466
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2019-07-22-09-58-19-023.png, 
> image-2019-07-22-09-58-55-107.png
>
>
> When we close an STS session that has run some INSERT SQL, and another 
> session then runs CTAS/INSERT and triggers Hive.moveFile, 
> DFSClient will do checkOpen and throw java.io.IOException: Filesystem closed.
> **Root cause**:
> When we first execute SQL like CTAS/INSERT, it calls Hive.moveFile. During 
> this method, the field SessionState.hdfsEncryptionShim is initialized, 
> and initializing this field initializes a FileSystem.
> But this FileSystem belongs to the current HiveSessionImpleWithUgi.sessionUgi, so when we 
> close this session, `FileSystem.closeForUgi()` is called and that FileSystem 
> is closed. Then, when another session executes SQL like CTAS/INSERT, this 
> error happens because the FileSystem has been closed.
> Some may be confused about why HiveServer2 doesn't show this problem:
> - In HiveServer2, each session has its own SessionState, so closing the current 
> session's FS is fine.
> - In SparkThriftServer, all sessions interact with Hive through one 
> HiveClientImpl, which has only one SessionState; when we call a method on 
> HiveClientImpl, it calls **withHiveState** first to set HiveClientImpl's 
> sessionState on the current thread.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28590) Add sort_stats Setter for Custom Profiler

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28590:
--
Affects Version/s: (was: 2.4.0)
   3.0.0

> Add sort_stats Setter for Custom Profiler
> -
>
> Key: SPARK-28590
> URL: https://issues.apache.org/jira/browse/SPARK-28590
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Albertus Kelvin
>Priority: Minor
>
> When I want to use BasicProfiler with different sorters in sort_stats, I 
> sometimes need to create a custom profiler and implement the show() method 
> only to replace the following line: stats.sort_stats("time", 
> "cumulative").print_stats().
> I think it'd be better if the users are able to specify the sorters without 
> creating a custom profiler.
> I implemented the changes in PySpark only.
> To apply the setter and getter methods, one can use the following way:
> {code:python}
> from pyspark import SparkConf, SparkContext
> from pyspark.profiler import BasicProfiler
> 
> conf = SparkConf().set("spark.python.profile", "true")
> # use BasicProfiler
> sc = SparkContext('local', 'test', conf=conf)
> sc.profiler_collector.profiler_cls.set_sort_stats_sorters(BasicProfiler, 
> ['ncalls', 'tottime', 'name'])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28594:
--
Affects Version/s: (was: 2.4.0)
   (was: 2.2.1)
   (was: 2.3.0)
   (was: 2.1.0)
   3.0.0

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Minor
>
> In all current Spark releases, when event logging is enabled for Spark Streaming, 
> the event logs grow massively. The files continue to grow until the 
> application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files, 
> but not event log files of applications that are still running.
> Could we identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28590) Add sort_stats Setter for Custom Profiler

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28590:
--
Target Version/s:   (was: 2.4.0)

> Add sort_stats Setter for Custom Profiler
> -
>
> Key: SPARK-28590
> URL: https://issues.apache.org/jira/browse/SPARK-28590
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Albertus Kelvin
>Priority: Minor
>
> When I want to use BasicProfiler with different sorters in sort_stats, I 
> sometimes need to create a custom profiler and implement the show() method 
> only to replace the following line: stats.sort_stats("time", 
> "cumulative").print_stats().
> I think it'd be better if the users are able to specify the sorters without 
> creating a custom profiler.
> I implemented the changes in PySpark only.
> To apply the setter and getter methods, one can use the following way:
> {code:python}
> from pyspark import SparkConf, SparkContext
> from pyspark.profiler import BasicProfiler
> 
> conf = SparkConf().set("spark.python.profile", "true")
> # use BasicProfiler
> sc = SparkContext('local', 'test', conf=conf)
> sc.profiler_collector.profiler_cls.set_sort_stats_sorters(BasicProfiler, 
> ['ncalls', 'tottime', 'name'])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28597) Spark streaming terminated when close meta data log error

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28597:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Spark streaming terminated when close meta data log error 
> --
>
> Key: SPARK-28597
> URL: https://issues.apache.org/jira/browse/SPARK-28597
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Priority: Major
>
> Now, HDFSMetaDataLog is the only implementation of MetaDataLog. If 
> HDFSMetaDataLog.writeBatchToFile() fails, the whole application is 
> terminated.
> The metadata log is very important for Spark, so we should retry a few times 
> on exception and add a config to control the number of retries.
> For example, HDFS has a bug where close() fails if there are not enough 
> replicas; see HDFS-11486 for details.
>  
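A hypothetical retry helper sketching the proposal (the helper name, the retry count, and the wrapped call are illustrative, not an existing Spark API):

{code:scala}
import scala.util.control.NonFatal

// Retry a metadata-log write a bounded number of times before letting the
// failure propagate and terminate the query.
def withRetry[T](attemptsLeft: Int)(op: => T): T =
  try op catch {
    case NonFatal(e) if attemptsLeft > 1 => withRetry(attemptsLeft - 1)(op)
  }

// e.g. withRetry(3) { metadataLog.add(batchId, metadata) }
{code}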



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28547:
--
Affects Version/s: (was: 2.4.4)
   (was: 2.4.3)
   3.0.0

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super slow for all wide data (when there are >15k columns and >15k 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20k and the number of samples as well. The very popular GTEx 
> dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on wide 
> tables (even a simple "describe" applied to all the gene columns) 
> either takes hours or freezes (because of lost executors) irrespective of 
> memory and number of cores, while the same operations run fast (minutes) 
> and well with pure pandas (without any Spark involved).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28560:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Optimize shuffle reader to local shuffle reader when smj converted to bhj in 
> adaptive execution
> ---
>
> Key: SPARK-28560
> URL: https://issues.apache.org/jira/browse/SPARK-28560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> Implement a rule in the new adaptive execution framework introduced in 
> SPARK-23128. This rule optimizes the shuffle reader into a local 
> shuffle reader when a sort-merge join (SMJ) is converted to a broadcast hash 
> join (BHJ) in adaptive execution.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28715) Introduce collectInPlanAndSubqueries and subqueriesAll in QueryPlan

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28715:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Introduce collectInPlanAndSubqueries and subqueriesAll in QueryPlan
> ---
>
> Key: SPARK-28715
> URL: https://issues.apache.org/jira/browse/SPARK-28715
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> Introduces the {{collectInPlanAndSubqueries}} and {{subqueriesAll}} methods in 
> QueryPlan that consider all the plans in the query plan, including the ones 
> in nested subqueries.
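A hypothetical sketch of the idea (not the actual patch; it assumes Catalyst's existing {{TreeNode.collect}}/{{flatMap}} and {{QueryPlan.subqueries}} helpers): gather the plans hosted by subquery expressions recursively, then run the collect over the plan and all of them.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// All plans hosted by subquery expressions anywhere in `plan`, recursively.
def subqueriesAll(plan: LogicalPlan): Seq[LogicalPlan] =
  plan.flatMap(node => node.subqueries.flatMap(sub => sub +: subqueriesAll(sub)))

// Collect over the plan itself and over every nested subquery plan.
def collectInPlanAndSubqueries[B](plan: LogicalPlan)(pf: PartialFunction[LogicalPlan, B]): Seq[B] =
  (plan +: subqueriesAll(plan)).flatMap(_.collect(pf))
{code}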



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28552) The URL prefix lowercase of MySQL is not necessary, but it is necessary in spark

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28552:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> The URL prefix lowercase of MySQL is not necessary, but it is necessary in 
> spark
> 
>
> Key: SPARK-28552
> URL: https://issues.apache.org/jira/browse/SPARK-28552
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: teeyog
>Priority: Major
>
> MySQL does not require the JDBC URL prefix to be lowercase, but Spark does, 
> so Spark's JDBC dialect should not be case-sensitive about it.
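A minimal sketch of the proposed behaviour (illustrative only, not the current MySQLDialect code): make the URL prefix check case-insensitive.

{code:scala}
import java.util.Locale

object CaseInsensitiveMySQLCheck {
  // Accept jdbc:mysql, JDBC:MYSQL, Jdbc:MySql, ... instead of only the lowercase prefix.
  def canHandle(url: String): Boolean =
    url.toLowerCase(Locale.ROOT).startsWith("jdbc:mysql")
}
{code}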



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28631) Update Kinesis dependencies to the Apache version licensed versions

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28631:
--
Affects Version/s: (was: 2.4.3)

> Update Kinesis dependencies to the Apache version licensed versions
> ---
>
> Key: SPARK-28631
> URL: https://issues.apache.org/jira/browse/SPARK-28631
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Major
>
> Amazon has released new versions of its Kinesis libraries (KCL and KPL) under 
> the Apache license. It is probably a good idea to update the Kinesis modules 
> to use them. This may also allow Spark to distribute the Spark Kinesis jars as 
> part of the normal distribution without any legal issues for the ASF.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28678) Specify that start index is 1-based in docstring of pyspark.sql.functions.slice

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28678:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Specify that start index is 1-based in docstring of 
> pyspark.sql.functions.slice
> ---
>
> Key: SPARK-28678
> URL: https://issues.apache.org/jira/browse/SPARK-28678
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Sivam Pasupathipillai
>Priority: Trivial
>
> The start index parameter in pyspark.sql.functions.slice must be 1-based; 
> otherwise the method fails with an exception.
> This is not documented anywhere. Fix the docstring.
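For illustration, the Scala API (which the PySpark function wraps) shows the 1-based start index:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, lit, slice}

val spark = SparkSession.builder().master("local[*]").appName("slice-demo").getOrCreate()

val df = spark.range(1).select(array(lit("a"), lit("b"), lit("c")).as("xs"))
// start = 1 means "from the first element": this returns [a, b]
df.select(slice(col("xs"), 1, 2).as("firstTwo")).show()
{code}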



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28746) Add repartitionby hint to support RepartitionByExpression

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28746:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Add repartitionby hint to support RepartitionByExpression
> -
>
> Key: SPARK-28746
> URL: https://issues.apache.org/jira/browse/SPARK-28746
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Priority: Minor
>
> Now, `RepartitionByExpression` is available through the Dataset method 
> `Dataset.repartition()`, but Spark SQL has no equivalent 
> functionality. 
> In Hive, we can use `distribute by`, so it's worth adding a hint to support 
> this functionality.
> Similar JIRA: https://issues.apache.org/jira/browse/SPARK-24940
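What the request targets, sketched with today's APIs (a self-contained example with a made-up table {{t}}): the Dataset API already plans a {{RepartitionByExpression}}, while SQL users currently fall back to Hive-style {{DISTRIBUTE BY}} instead of a hint.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("repartition-demo").getOrCreate()
import spark.implicits._

Seq((1, "a"), (2, "b")).toDF("key", "v").createOrReplaceTempView("t")

// Dataset API: plans a RepartitionByExpression on `key`
val byDataset = spark.table("t").repartition(col("key"))

// SQL today: Hive-style syntax rather than a hint
val bySql = spark.sql("SELECT * FROM t DISTRIBUTE BY key")
{code}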



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28596) Use Java 8 time API in date_trunc

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28596:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Use Java 8 time API in date_trunc
> -
>
> Key: SPARK-28596
> URL: https://issues.apache.org/jira/browse/SPARK-28596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The DateTimeUtils.truncTimestamp function can be fully implemented on top of 
> the Java 8 time API by adjusting a ZonedDateTime to year, month, ..., seconds.
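A sketch of the idea (not the actual DateTimeUtils code), shown for the MONTH level: truncate by adjusting a ZonedDateTime instead of doing raw millisecond arithmetic.

{code:scala}
import java.time.{Instant, ZoneId, ZonedDateTime}
import java.time.temporal.ChronoUnit

def truncToMonth(instant: Instant, zone: ZoneId): Instant =
  ZonedDateTime.ofInstant(instant, zone)
    .withDayOfMonth(1)                // back to the first day of the month
    .truncatedTo(ChronoUnit.DAYS)     // and to midnight in that time zone
    .toInstant

// truncToMonth(Instant.parse("2019-08-19T10:15:30Z"), ZoneId.of("UTC"))
//   => 2019-08-01T00:00:00Z
{code}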



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28771) Join partitioned dataframes on superset of partitioning columns without shuffle

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28771:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Join partitioned dataframes on superset of partitioning columns without 
> shuffle
> ---
>
> Key: SPARK-28771
> URL: https://issues.apache.org/jira/browse/SPARK-28771
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
> Environment:  
> {code:java}
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
> conf = SparkConf()
> conf.setAll({
>   'spark.master': 'local[*]',
>   'spark.sql.execution.arrow.enabled': 'true',
>   'spark.sql.autoBroadcastJoinThreshold': '-1',
>   'spark.sql.shuffle.partitions': '10'
> }.items());
> spark = SparkSession.builder.config(conf=conf).getOrCreate()
> {code}
>  
>  
>Reporter: Artem Bergkamp
>Priority: Minor
>
> Hi Spark developers.
> A few months ago I asked this question at 
> [stackoverflow|https://stackoverflow.com/questions/55229290/question-about-joining-dataframes-in-spark]
>  but could not get a usable solution.
>  The only valid suggestion was to implement it via Catalyst optimizer 
> extensions, but this is not something that an ordinary user can do.
>  I decided to raise an improvement request since I think such functionality should 
> be available out of the box.
> Suppose I have two partitioned dataframes:
> {code:java}
> df1 = spark.createDataFrame(
> [(1,1,1), (2,2,2)], ['key1', 'key2', 'time']
> ).repartition(3, 'key1', 'key2')df2 = spark.createDataFrame(
> [(1,1,1)], ['key1', 'key2', 'time']
> ).repartition(3, 'key1', 'key2')
> {code}
> *(scenario 1)* If I join them by [key1, key2], the join operation is performed 
> within each partition without a shuffle (the number of partitions in the result 
> dataframe is the same):
> {code:java}
> x = df1.join(df2, on=['key1', 'key2'], how='left')
> assert x.rdd.getNumPartitions() == 3
> {code}
> *(scenario 2)* But if I join them by [key1, key2, time], a shuffle operation 
> takes place (the number of partitions in the result dataframe is 10, which is driven 
> by the spark.sql.shuffle.partitions option):
> {code:java}
> x = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
> assert x.rdd.getNumPartitions() == 10
> {code}
> *(scenario 3)* Join them by [key1, key2, time] via another version of join 
> method:
> {code:java}
> x = df1.join(df2, [
> df1['key1'] == df2['key1'],
> df1['key2'] == df2['key2'],
> df1['time'] == df2['time']
> ], how='left').drop(df2['key1']).drop(df2['key2']).drop(df2['time'])
> assert x.rdd.getNumPartitions() == 10
> {code}
> *(scenario 4)* Join them by [key1, key2, time] via another version of the join 
> method, with the equality condition replaced by an equivalent one. Surprisingly, *it 
> uses partitioning*:
> {code:java}
> x = df1.join(df2, [
> df1['key1'] == df2['key1'],
> df1['key2'] == df2['key2'],
> (df1['time'] <= df2['time']) & (df1['time'] >= df2['time'])
> ], how='left').drop(df2['key1']).drop(df2['key2']).drop(df2['time'])
> assert x.rdd.getNumPartitions() == 3
> {code}
> *I expect all four described join scenarios to use partitioning and avoid a 
> shuffle.*
> At the same time, groupBy and window operations by [key1, key2, time] preserve 
> the number of partitions and are done without a shuffle.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28577) Ensure executorMemoryHead requested value not less than MEMORY_OFFHEAP_SIZE when MEMORY_OFFHEAP_ENABLED is true

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28577:
--
Affects Version/s: (was: 2.4.3)
   (was: 2.3.3)
   (was: 2.2.3)
   3.0.0

> Ensure executorMemoryHead requested value not less than MEMORY_OFFHEAP_SIZE 
> when MEMORY_OFFHEAP_ENABLED is true
> ---
>
> Key: SPARK-28577
> URL: https://issues.apache.org/jira/browse/SPARK-28577
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Major
>
> If MEMORY_OFFHEAP_ENABLED is true, we should ensure executorOverheadMemory is 
> not less than MEMORY_OFFHEAP_SIZE; otherwise the memory requested 
> for the executor may not be enough.
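A hypothetical validation sketch (the method and parameter names are made up; the real change would live in the YARN resource request path): fail fast when off-heap memory is enabled but the requested overhead does not cover spark.memory.offHeap.size.

{code:scala}
def checkOverhead(offHeapEnabled: Boolean, offHeapSizeMb: Long, overheadMb: Long): Unit = {
  if (offHeapEnabled) {
    require(overheadMb >= offHeapSizeMb,
      s"Executor memory overhead ($overheadMb MB) must be at least " +
        s"spark.memory.offHeap.size ($offHeapSizeMb MB) when off-heap memory is enabled")
  }
}
{code}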



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28727) Request for partial least square (PLS) regression model

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28727:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Request for partial least square (PLS) regression model
> ---
>
> Key: SPARK-28727
> URL: https://issues.apache.org/jira/browse/SPARK-28727
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 3.0.0
> Environment: I am using Windows 10, Spark v2.3.2
>Reporter: Nikunj
>Priority: Major
>  Labels: PLS, least, partial, regression, square
>
> Hi.
> Is there any development going on with regards to a PLS model? Or is there a 
> plan for it in the near future? The application I am developing needs a PLS 
> model as it is mandatory in that particular industry. I am using sparklyr, 
> and have started a bit of the implementation, but was wondering if something 
> is already in the pipeline.
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28415:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter in KafkaUtils.createDirectStream(...) in the new 
> Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON payloads 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later (see the sketch after this list). This means the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  
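A sketch of the situation with the kafka-0-10 direct stream (it assumes an existing StreamingContext {{ssc}}, plus {{topics}}, {{kafkaParams}} and a user-defined {{parse}} function): mapping immediately after {{createDirectStream}} does the early projection, but the resulting DStream no longer exposes {{HasOffsetRanges}}.

{code:scala}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))

// Project each record down to the needed fields as early as possible...
val projected = stream.map(record => parse(record.value()))
// ...but `projected` is a plain DStream, so offset ranges can only be read from `stream` itself.
{code}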



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28716) Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28716:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Add id to Exchange and Subquery's stringArgs method for easier identifying 
> their reuses in query plans
> --
>
> Key: SPARK-28716
> URL: https://issues.apache.org/jira/browse/SPARK-28716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> Add an id to Exchange and Subquery's stringArgs method to make it easier to identify 
> their reuses in query plans, for example:
> {{ReusedExchange [d_date_sk#827], BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) 
> [id=#2710]}}
> where {{2710}} is the id of the reused exchange.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28762) Read JAR main class if JAR is not located in local file system

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28762:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Read JAR main class if JAR is not located in local file system
> --
>
> Key: SPARK-28762
> URL: https://issues.apache.org/jira/browse/SPARK-28762
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core, Spark Submit
>Affects Versions: 3.0.0
>Reporter: Ivan Gozali
>Priority: Minor
>
> Currently, {{spark-submit}} doesn't attempt to read the main class from a 
> Spark app JAR file if the scheme of the primary resource URI is not {{file}}. 
> In other words, if the JAR is not in the local file system, it will barf.
> It would be useful to have this feature if I deploy my Spark app JARs in S3 
> or HDFS.
> If it makes sense to maintainers, I can take a stab at this - I think I know 
> which files to look at.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28573) Convert InsertIntoTable(HiveTableRelation) to Datasource inserting for partitioned table

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28573:
--
Affects Version/s: (was: 2.4.3)
   (was: 2.3.3)
   3.0.0

> Convert InsertIntoTable(HiveTableRelation) to Datasource inserting for 
> partitioned table
> 
>
> Key: SPARK-28573
> URL: https://issues.apache.org/jira/browse/SPARK-28573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Currently we don't translate InsertInto(HiveTableRelation) to a DataSource 
> insertion when a partitioned table is involved. The reason is, quoting from 
> the comments:
> {quote}// Inserting into partitioned table is not supported in Parquet/Orc 
> data source (yet).
> {quote}
>  
> which doesn't hold any more, since datasource table dynamic partition insert 
> now supports 
> dynamic mode (SPARK-20236). I think it's worthwhile to translate 
> InsertIntoTable(HiveTableRelation) to a datasource insertion.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28751) Improve java serializer deserialization performance

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28751:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Improve java serializer deserialization performance
> ---
>
> Key: SPARK-28751
> URL: https://issues.apache.org/jira/browse/SPARK-28751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xianyang Liu
>Priority: Major
>
> Improve the performance of Java serializer deserialization by caching the 
> resolved class. The Java serializer is used in many places: closures, RPC and 
> others. This change could improve deserialization performance by 1.3x ~ 1.5x, 
> especially for Java primitive instances.
> It also adds new unit tests and benchmark cases.
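> A minimal sketch of the caching idea (illustrative only, not Spark's actual 
> deserialization stream; a real implementation would also need to handle primitive 
> type names, which Class.forName does not resolve):
> {code:java}
> import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}
> import scala.collection.mutable
> 
> // Cache resolved classes per stream so repeated class descriptors skip Class.forName.
> class CachingObjectInputStream(in: InputStream, loader: ClassLoader)
>   extends ObjectInputStream(in) {
> 
>   private val cache = new mutable.HashMap[String, Class[_]]
> 
>   override protected def resolveClass(desc: ObjectStreamClass): Class[_] =
>     cache.getOrElseUpdate(desc.getName,
>       Class.forName(desc.getName, false, loader))
> }
> {code}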



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28655) Support splitting the event log, and fix the history server being too slow when the event log is too large.

2019-08-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28655:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Support splitting the event log, and fix the history server being too slow 
> when the event log is too large.
> 
>
> Key: SPARK-28655
> URL: https://issues.apache.org/jira/browse/SPARK-28655
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28434) Decision Tree model isn't equal after save and load

2019-08-19 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28434.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25485
[https://github.com/apache/spark/pull/25485]

> Decision Tree model isn't equal after save and load
> ---
>
> Key: SPARK-28434
> URL: https://issues.apache.org/jira/browse/SPARK-28434
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.3
> Environment: spark from master
>Reporter: Ievgen Prokhorenko
>Assignee: Ievgen Prokhorenko
>Priority: Minor
> Fix For: 3.0.0
>
>
> The file 
> `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` 
> has a TODO on line 628 saying:
>  
> {code:java}
> // TODO: Check other fields besides the information gain.
> {code}
> If, in addition to the existing check of the InformationGainStats gain value, 
> I add another check (for instance, impurity), the test fails because the 
> values differ between the saved model and the one restored from disk.
>  
> See PR with an example.
>  
> The tests are executed with this command:
>  
> {code:java}
> build/mvn -e -Dtest=none 
> -DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code}
>  
> Excerpts from the output of the command above:
> {code:java}
> ...
> - model save/load *** FAILED ***
> checkEqual failed since the two trees were not identical.
> TREE A:
> DecisionTreeModel classifier of depth 2 with 5 nodes
> If (feature 0 <= 0.5)
> Predict: 0.0
> Else (feature 0 > 0.5)
> If (feature 1 in {0.0,1.0})
> Predict: 0.0
> Else (feature 1 not in {0.0,1.0})
> Predict: 0.0
> TREE B:
> DecisionTreeModel classifier of depth 2 with 5 nodes
> If (feature 0 <= 0.5)
> Predict: 0.0
> Else (feature 0 > 0.5)
> If (feature 1 in {0.0,1.0})
> Predict: 0.0
> Else (feature 1 not in {0.0,1.0})
> Predict: 0.0 (DecisionTreeSuite.scala:610)
> ...{code}
> If I add a little debug info in the `DecisionTreeSuite.checkEqual`:
>  
> {code:java}
> val aStats = a.stats
> val bStats = b.stats
> println(s"id ${a.id} ${b.id}")
> println(s"impurity ${aStats.get.impurity} ${bStats.get.impurity}")
> println(s"leftImpurity ${aStats.get.leftImpurity} ${bStats.get.leftImpurity}")
> println(s"rightImpurity ${aStats.get.rightImpurity} 
> ${bStats.get.rightImpurity}")
> println(s"leftPredict ${aStats.get.leftPredict} ${bStats.get.leftPredict}")
> println(s"rightPredict ${aStats.get.rightPredict} ${bStats.get.rightPredict}")
> println(s"gain ${aStats.get.gain} ${bStats.get.gain}")
> {code}
>  
> Then, in the output of the test command we can see that only values of `gain` 
> are equal:
>  
> {code:java}
> id 1 1
> impurity 0.2 0.5
> leftImpurity 0.3 0.5
> rightImpurity 0.4 0.5
> leftPredict 1.0 (prob = 0.4) 0.0 (prob = 1.0)
> rightPredict 0.0 (prob = 0.6) 0.0 (prob = 1.0)
> gain 0.1 0.1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28775) DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone database

2019-08-19 Thread Herman van Hovell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-28775:
-

Assignee: Sean Owen  (was: Herman van Hovell)

> DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone 
> database
> --
>
> Key: SPARK-28775
> URL: https://issues.apache.org/jira/browse/SPARK-28775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hovell
>Assignee: Sean Owen
>Priority: Major
>
> The 'daysToMillis and millisToDays' test case in 
> org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite fails because of an 
> update in the timezone library: tzdata2018h. This retroactively changes the 
> skipped day for the Kwajalein atoll. For more information, see 
> https://bugs.openjdk.java.net/browse/JDK-8215981
> Let's fix this by excluding both dates.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28777) Pyspark sql function "format_string" has the wrong parameters in doc string

2019-08-19 Thread Darren Tirto (Jira)
Darren Tirto created SPARK-28777:


 Summary: Pyspark sql function "format_string" has the wrong 
parameters in doc string
 Key: SPARK-28777
 URL: https://issues.apache.org/jira/browse/SPARK-28777
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 2.4.3
Reporter: Darren Tirto


The PySpark SQL function "format_string" has the function declaration 
{{def format_string(format, *cols)}}. However, the function's docstring 
describes the parameters as:

 
{code:java}
:param col: the column name of the numeric value to be formatted
:param d: the N decimal places
{code}
 

We want to update the doc string to accurately describe the parameters.
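For reference, the Scala counterpart in {{org.apache.spark.sql.functions}} takes a 
printf-style format string followed by the columns to substitute, which is the shape 
the docstring should describe. A minimal usage sketch (column names made up):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.format_string

val spark = SparkSession.builder().master("local[*]").appName("format-string-demo").getOrCreate()
import spark.implicits._

// format_string(format, cols*): one column per placeholder in the format string.
val df = Seq(("Alice", 5), ("Bob", 7)).toDF("name", "age")
df.select(format_string("%s is %d years old", $"name", $"age").alias("greeting")).show()
{code}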



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28775) DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone database

2019-08-19 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28775:
--
Summary: DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer 
timezone database  (was: DateTimeUtilsSuite fails for JDKs using the 
tzdata2018h or newer timezone database)

> DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone 
> database
> --
>
> Key: SPARK-28775
> URL: https://issues.apache.org/jira/browse/SPARK-28775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>
> The 'daysToMillis and millisToDays' test case in 
> org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite fails because of an 
> update in the timezone library: tzdata2018h. This retroactively changes the 
> skipped day for the Kwajalein atoll. For more information, see 
> https://bugs.openjdk.java.net/browse/JDK-8215981
> Let's fix this by excluding both dates.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28776) SparkML MLWriter gets hadoop conf from spark context instead of session

2019-08-19 Thread Helen Yu (Jira)
Helen Yu created SPARK-28776:


 Summary: SparkML MLWriter gets hadoop conf from spark context 
instead of session
 Key: SPARK-28776
 URL: https://issues.apache.org/jira/browse/SPARK-28776
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.4.3
Reporter: Helen Yu


In handleOverwrite of MLWriter, the Hadoop configuration of the Spark context 
is used, whereas the Hadoop configuration from the Spark session's session 
state should be used instead. 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L677]
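A minimal sketch of the distinction (assuming an active SparkSession named {{spark}}; 
{{sessionState}} is a developer API): the session-state configuration picks up 
session-scoped overrides that the context-level configuration does not.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("hadoop-conf-demo").getOrCreate()

// Hadoop conf from the shared SparkContext: does not see session-scoped settings.
val contextConf = spark.sparkContext.hadoopConfiguration

// Hadoop conf built from the session state: includes per-session runtime overrides.
val sessionConf = spark.sessionState.newHadoopConf()
{code}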



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28775) DateTimeUtilsSuite fails for JDKs using the tzdata2018h or newer timezone database

2019-08-19 Thread Herman van Hovell (Jira)
Herman van Hovell created SPARK-28775:
-

 Summary: DateTimeUtilsSuite fails for JDKs using the tzdata2018h 
or newer timezone database
 Key: SPARK-28775
 URL: https://issues.apache.org/jira/browse/SPARK-28775
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


The 'daysToMillis and millisToDays' test case in 
org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite fails because of an 
update in the timezone library: tzdata2018h. This retroactively changes the 
skipped day for the Kwajalein atoll. For more information, see 
https://bugs.openjdk.java.net/browse/JDK-8215981

Let's fix this by excluding both dates.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25603) Generalize Nested Column Pruning

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910738#comment-16910738
 ] 

Nicholas Chammas commented on SPARK-25603:
--

[~dbtsai] - Just watched [your Spark Summit talk on this 
issue|https://www.youtube.com/watch?v=qHIA5YbZ8_4]. Thanks for all the work you 
and others have done here.

Will there be any spillover benefits from this effort for other issues 
affecting how Spark handles nested fields, like SPARK-16483 or SPARK-18084?

> Generalize Nested Column Pruning
> 
>
> Key: SPARK-25603
> URL: https://issues.apache.org/jira/browse/SPARK-25603
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910655#comment-16910655
 ] 

Nicholas Chammas edited comment on SPARK-4502 at 8/19/19 7:55 PM:
--

Thanks for your notes [~Bartalos]. Just FYI, nested schema pruning is set to be 
enabled by default as part of SPARK-27644.

-With regards to aggregates breaking pruning, have you reported that somewhere? 
If not, I recommend reporting it and linking to the new issue from here.-

Looks like the problem with aggregates breaking schema pruning is already being 
tracked in SPARK-27217.


was (Author: nchammas):
Thanks for your notes [~Bartalos]. Just FYI, nested schema pruning is set to be 
enabled by default as part of SPARK-27644.

With regards to aggregates breaking pruning, have you reported that somewhere? 
If not, I recommend reporting it and linking to the new issue from here.

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Assignee: Michael Allman
>Priority: Critical
> Fix For: 2.4.0
>
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade the performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910691#comment-16910691
 ] 

Nicholas Chammas edited comment on SPARK-25150 at 8/19/19 7:39 PM:
---

I haven't been able to boil down the reproduction further, but I'm updating 
this issue to confirm that it is still present as of Spark 2.4.3 and, 
particularly in the case where cross joins are enabled, it appears to be a 
correctness issue.

My original attachments still capture the problem. These are the inputs:
 * [^persons.csv]
 * [^states.csv]
 * [^zombie-analysis.py]

And here are the outputs:
 * [^expected-output.txt]
 * [^output-without-implicit-cross-join.txt]
 * [^output-with-implicit-cross-join.txt]


was (Author: nchammas):
I haven't been able to boil down the reproduction further, but I'm updating 
this issue to confirm that it is still present as of Spark 2.4.3 and, 
particularly in the case where cross joins are enabled, it appears to be a 
correctness issue.

My original attachments still capture the problem. These are the inputs:
 * !persons.csv|width=7,height=7,align=absmiddle!
 * !states.csv|width=7,height=7,align=absmiddle!
 * [^zombie-analysis.py] !zombie-analysis.py|width=7,height=7,align=absmiddle!

And here are the outputs:
 * [^expected-output.txt]
 * [^output-without-implicit-cross-join.txt]
 * [^output-with-implicit-cross-join.txt]

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-08-19 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910692#comment-16910692
 ] 

Jungtaek Lim commented on SPARK-27648:
--

[~ploya]

Looks like you're running a load test; can the code used for the test be 
shared? It might require non-trivial effort to reproduce if we don't have a 
reproducer.

And could you refrain from applying multiple changes (like the Cassandra sink 
or Hadoop 3.1) while inspecting the storage memory issue? It would be best if 
we can find the smallest reproducer.

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png, 
> image-2019-06-02-19-43-21-652.png
>
>
> *Spark Program Code Business:*
>  Read a topic from Kafka, aggregate the streaming data, and then output it to 
> another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory 
> overflow problems often occur (because too many versions of state are stored 
> in memory; this bug has been fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occur when running with this submit resource under 
> SPARK 2.2, and the normal running time does not exceed one day.
> The solution is to set the executor memory larger than before. 
> My spark-submit script is as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...{code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade information of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
> and found that the use of memory was reduced.
>  But a problem arises: as the running time increases, the executor storage 
> memory keeps growing (see Executors -> Storage Memory in the Spark on YARN 
> Resource Manager UI).
>  This program has been running for 14 days (under SPARK 2.2, running with 
> this submit resource, the normal running time is not more than one day, 
> Executor memory abnormalities will occur).
>  The script submitted by the program under spark2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I counted the size of executor memory as time went by during 
> the running of the spark program:
> |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
> |23.5H|41.6MB/1.5GB|1.770212766|
> |108.4H|460.2MB/1.5GB|4.245387454|
> |131.7H|559.1MB/1.5GB|4.245254366|
> |135.4H|575MB/1.5GB|4.246676514|
> |153.6H|641.2MB/1.5GB|4.174479167|
> |219H|888.1MB/1.5GB|4.055251142|
> |263H|1126.4MB/1.5GB|4.282889734|
> |309H|1228.8MB/1.5GB|3.976699029|



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2019-08-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-25150:
-
Affects Version/s: 2.4.3
   Labels: correctness  (was: )

I haven't been able to boil down the reproduction further, but I'm updating 
this issue to confirm that it is still present as of Spark 2.4.3 and, 
particularly in the case where cross joins are enabled, it appears to be a 
correctness issue.

My original attachments still capture the problem. These are the inputs:
 * !persons.csv|width=7,height=7,align=absmiddle!
 * !states.csv|width=7,height=7,align=absmiddle!
 * [^zombie-analysis.py] !zombie-analysis.py|width=7,height=7,align=absmiddle!

And here are the outputs:
 * [^expected-output.txt]
 * [^output-without-implicit-cross-join.txt]
 * [^output-with-implicit-cross-join.txt]

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-08-19 Thread Puneet Loya (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910680#comment-16910680
 ] 

Puneet Loya commented on SPARK-27648:
-

I had posted about a high storage memory issue on the mailing list; it seems 
related to this one. 

[http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-4-3-Structured-Streaming-high-on-Storage-Memory-td35644.html#a35645]


Any advice would be helpful

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png, 
> image-2019-06-02-19-43-21-652.png
>
>
> *Spark Program Code Business:*
>  Read a topic from Kafka, aggregate the streaming data, and then output it to 
> another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory 
> overflow problems often occur (because too many versions of state are stored 
> in memory; this bug has been fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occur when running with this submit resource under 
> SPARK 2.2, and the normal running time does not exceed one day.
> The solution is to set the executor memory larger than before. 
> My spark-submit script is as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...{code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade information of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
> and found that the use of memory was reduced.
>  But a problem arises: as the running time increases, the executor storage 
> memory keeps growing (see Executors -> Storage Memory in the Spark on YARN 
> Resource Manager UI).
>  This program has been running for 14 days (under SPARK 2.2, running with 
> this submit resource, the normal running time is not more than one day, 
> Executor memory abnormalities will occur).
>  The script submitted by the program under spark2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I counted the size of executor memory as time went by during 
> the running of the spark program:
> |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
> |23.5H|41.6MB/1.5GB|1.770212766|
> |108.4H|460.2MB/1.5GB|4.245387454|
> |131.7H|559.1MB/1.5GB|4.245254366|
> |135.4H|575MB/1.5GB|4.246676514|
> |153.6H|641.2MB/1.5GB|4.174479167|
> |219H|888.1MB/1.5GB|4.055251142|
> |263H|1126.4MB/1.5GB|4.282889734|
> |309H|1228.8MB/1.5GB|3.976699029|



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-08-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-19248:
-
Labels: correctness  (was: )

Tagging this as a correctness issue since Spark 2+'s output differs from both 
Python's and Spark 1.6's.

Python 3.7.4 + Spark 2.4.3:
{code:java}
>>> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
>>> df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
[Row(col='5')]
>>> df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
[Row(col='')]  <-- This differs from Python's output as well as Spark 1.6's 
output.
>>> import re
>>> re.sub(pattern='( |\.)*', repl='', string='..   5.')
'5'{code}
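For reference, checking the same pattern directly against Java's regex engine (via 
{{String.replaceAll}} from Scala, shown here as an illustrative sketch) also yields 
{{5}}, matching Python and Spark 1.6:
{code:java}
// Java's regex engine agrees with Python here: the dots and spaces are removed, the 5 remains.
val cleaned = "..   5.".replaceAll("( |\\.)*", "")
// cleaned == "5"
{code}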

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked both regexes in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I am not 
> currently able to confirm this on 2.1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2019-08-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18084:
-
Affects Version/s: 2.4.3

Retested and confirmed that this issue is still present in Spark 2.4.3.

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py",

[jira] [Updated] (SPARK-10892) Join with Data Frame returns wrong results

2019-08-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-10892:
-
Affects Version/s: 2.4.0
   Labels: correctness  (was: )

Updating affected version and label per the comments.

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 2.4.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
>  Labels: correctness
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  

[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2019-08-19 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910655#comment-16910655
 ] 

Nicholas Chammas commented on SPARK-4502:
-

Thanks for your notes [~Bartalos]. Just FYI, nested schema pruning is set to be 
enabled by default as part of SPARK-27644.

With regards to aggregates breaking pruning, have you reported that somewhere? 
If not, I recommend reporting it and linking to the new issue from here.

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Assignee: Michael Allman
>Priority: Critical
> Fix For: 2.4.0
>
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade the performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28634) Failed to start SparkSession with Keytab file

2019-08-19 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-28634:
--

Assignee: Marcelo Vanzin

> Failed to start SparkSession with Keytab file 
> --
>
> Key: SPARK-28634
> URL: https://issues.apache.org/jira/browse/SPARK-28634
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> {noformat}
> [user-etl@hermesdevour002-700165 spark-3.0.0-SNAPSHOT-bin-2.7.4]$ 
> bin/spark-sql --master yarn --conf 
> spark.yarn.keytab=/apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab --conf 
> spark.yarn.principal=user-...@prod.example.com
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1564558112805_1794 failed 2 times due to AM Container for 
> appattempt_1564558112805_1794_02 exited with  exitCode: 1
> For more detailed output, check the application tracking page: 
> https://0.0.0.0:8190/applicationhistory/app/application_1564558112805_1794 
> Then click on links to logs of each attempt.
> Diagnostics: Exception from container-launch.
> Container id: container_e1987_1564558112805_1794_02_01
> Exit code: 1
> Shell output: main : command provided 1
> main : run as user is user-etl
> main : requested yarn user is user-etl
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /hadoop/2/yarn/local/nmPrivate/application_1564558112805_1794/container_e1987_1564558112805_1794_02_01/container_e1987_1564558112805_1794_02_01.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> Container exited with a non-zero exit code 1. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> Last 4096 bytes of stderr :
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/hadoop/2/yarn/local/usercache/user-etl/filecache/58/__spark_libs__4358879230136591830.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hbase-1.1.2.2.6.4.1/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> /apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab does not exist
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.loginUserFromKeytab(SparkHadoopUtil.scala:131)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:846)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:889)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> Failing this attempt. Failing the application.
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:95)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:185)
>   at org.apache.spark.SparkContext.(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2466)
>   at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:948)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:939)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:315)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   

[jira] [Resolved] (SPARK-28634) Failed to start SparkSession with Keytab file

2019-08-19 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28634.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25467
[https://github.com/apache/spark/pull/25467]

> Failed to start SparkSession with Keytab file 
> --
>
> Key: SPARK-28634
> URL: https://issues.apache.org/jira/browse/SPARK-28634
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> {noformat}
> [user-etl@hermesdevour002-700165 spark-3.0.0-SNAPSHOT-bin-2.7.4]$ 
> bin/spark-sql --master yarn --conf 
> spark.yarn.keytab=/apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab --conf 
> spark.yarn.principal=user-...@prod.example.com
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1564558112805_1794 failed 2 times due to AM Container for 
> appattempt_1564558112805_1794_02 exited with  exitCode: 1
> For more detailed output, check the application tracking page: 
> https://0.0.0.0:8190/applicationhistory/app/application_1564558112805_1794 
> Then click on links to logs of each attempt.
> Diagnostics: Exception from container-launch.
> Container id: container_e1987_1564558112805_1794_02_01
> Exit code: 1
> Shell output: main : command provided 1
> main : run as user is user-etl
> main : requested yarn user is user-etl
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /hadoop/2/yarn/local/nmPrivate/application_1564558112805_1794/container_e1987_1564558112805_1794_02_01/container_e1987_1564558112805_1794_02_01.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> Container exited with a non-zero exit code 1. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> Last 4096 bytes of stderr :
> log4j:WARN No such property [maxFileSize] in 
> org.apache.log4j.rolling.RollingFileAppender.
> log4j:WARN No such property [maxBackupIndex] in 
> org.apache.log4j.rolling.RollingFileAppender.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/hadoop/2/yarn/local/usercache/user-etl/filecache/58/__spark_libs__4358879230136591830.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hbase-1.1.2.2.6.4.1/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> /apache/spark-2.3.0-bin-2.7.3/conf/user-etl.keytab does not exist
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.loginUserFromKeytab(SparkHadoopUtil.scala:131)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:846)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:889)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> Failing this attempt. Failing the application.
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:95)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:185)
>   at org.apache.spark.SparkContext.(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2466)
>   at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:948)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:939)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:315)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>   at 
> 

[jira] [Resolved] (SPARK-25262) Support tmpfs for local dirs in k8s

2019-08-19 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25262.

Fix Version/s: 3.0.0
 Assignee: Rob Vesse
   Resolution: Fixed

> Support tmpfs for local dirs in k8s
> ---
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
> Fix For: 3.0.0
>
>
> As discussed during review of the design document for SPARK-24434, while 
> providing pod templates will allow more in-depth customisation of Spark on 
> Kubernetes, there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25262) Support tmpfs for local dirs in k8s

2019-08-19 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-25262:
---
Summary: Support tmpfs for local dirs in k8s  (was: Make Spark local dir 
volumes configurable with Spark on Kubernetes)

> Support tmpfs for local dirs in k8s
> ---
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while 
> providing pod templates will allow more in-depth customisation of Spark on 
> Kubernetes, there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2019-08-19 Thread Marcelo Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910616#comment-16910616
 ] 

Marcelo Vanzin commented on SPARK-25262:


Full configurability was actually added in SPARK-28042. There is still a commit 
related to this one particular bug (da6fa38), so I won't dupe this, and will 
fix the title to reflect that feature instead.

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while pod 
> templates will provide more in-depth customisation for Spark on Kubernetes, 
> there are some things that cannot be modified because Spark code generates 
> pod specs in very specific ways.
> The particular issue identified relates to the handling of 
> {{spark.local.dirs}}, which is done by {{LocalDirsFeatureStep.scala}}. For 
> each directory specified, or a single default if none is specified 
> explicitly, it creates a Kubernetes {{emptyDir}} volume. As noted in the 
> Kubernetes documentation, this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some 
> compute environments this may be extremely undesirable. For example, with 
> diskless compute resources the node storage will likely be a non-performant 
> remote-mounted disk, often with limited capacity. For such environments it 
> would likely be better to set {{medium: Memory}} on the volume, per the K8S 
> documentation, so that a {{tmpfs}} volume is used instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories, and there is currently no way to 
> do that.
> Pod templates will not really solve either of these issues, because Spark 
> always attempts to generate a new volume for each local directory and always 
> declares those volumes as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}}-backed {{emptyDir}} 
> volumes
> * Modify the logic to check whether a volume with the same name is already 
> defined and, if so, skip generating a volume definition for it



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28774) ReusedExchangeExec cannot be columnar

2019-08-19 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-28774:
---

 Summary: ReusedExchangeExec cannot be columnar
 Key: SPARK-28774
 URL: https://issues.apache.org/jira/browse/SPARK-28774
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Robert Joseph Evans


If a ShuffleExchangeExec is replaced with a columnar version and then deduplicated 
into a ReusedExchangeExec, the query will fail because ReusedExchangeExec does not 
implement any of the columnar APIs.
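
A plausible shape for a fix is to have the reused node forward the columnar contract to the exchange it wraps (supportsColumnar and the columnar execute path) rather than inheriting the default "unsupported" behaviour. The toy model below is not Spark source, and the real method signatures may differ; it only demonstrates the delegation idea.

{code:scala}
// Toy stand-ins for plan nodes; real Spark types and signatures may differ.
trait ToyPlan {
  def supportsColumnar: Boolean = false
  def executeColumnar(): Iterator[Array[Any]] =
    throw new UnsupportedOperationException(s"${getClass.getSimpleName} is not columnar")
}

// Analogue of a shuffle exchange that was replaced with a columnar version.
final case class ToyColumnarExchange(batches: Seq[Array[Any]]) extends ToyPlan {
  override def supportsColumnar: Boolean = true
  override def executeColumnar(): Iterator[Array[Any]] = batches.iterator
}

// Analogue of ReusedExchangeExec: without these two overrides it falls back to the
// default trait behaviour and fails in the same way as described above.
final case class ToyReusedExchange(child: ToyPlan) extends ToyPlan {
  override def supportsColumnar: Boolean = child.supportsColumnar
  override def executeColumnar(): Iterator[Array[Any]] = child.executeColumnar()
}

// ToyReusedExchange(ToyColumnarExchange(Seq(Array(1, 2)))).executeColumnar()
// now delegates to the wrapped exchange instead of throwing.
{code}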



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28773) NULL Handling

2019-08-19 Thread Xiao Li (Jira)
Xiao Li created SPARK-28773:
---

 Summary: NULL Handling
 Key: SPARK-28773
 URL: https://issues.apache.org/jira/browse/SPARK-28773
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28734) Create a table of content in the left hand side bar for SQL doc.

2019-08-19 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28734.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Create a table of content in the left hand side bar for SQL doc.
> 
>
> Key: SPARK-28734
> URL: https://issues.apache.org/jira/browse/SPARK-28734
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28734) Create a table of content in the left hand side bar for SQL doc.

2019-08-19 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-28734:
---

Assignee: Dilip Biswal

> Create a table of content in the left hand side bar for SQL doc.
> 
>
> Key: SPARK-28734
> URL: https://issues.apache.org/jira/browse/SPARK-28734
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28772) Upgrade breeze to 1.0

2019-08-19 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-28772:
---

 Summary: Upgrade breeze to 1.0
 Key: SPARK-28772
 URL: https://issues.apache.org/jira/browse/SPARK-28772
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.0.0
Reporter: Yuming Wang


The latest release is 1.0, which is cross-built against Scala 2.11, 2.12, and 
2.13.

[https://github.com/scalanlp/breeze/releases/tag/releases%2Fv1.0]

[https://mvnrepository.com/artifact/org.scalanlp/breeze_2.13/1.0]
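
For reference, the target artifact coordinates look like the following; this is shown in sbt syntax purely to illustrate the artifact and version, not as an edit to Spark's actual build files.

{code:scala}
// build.sbt fragment: Breeze 1.0, cross-built for Scala 2.11, 2.12 and 2.13
// per the release notes linked above.
libraryDependencies += "org.scalanlp" %% "breeze" % "1.0"
{code}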

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28699) Cache an indeterminate RDD could lead to incorrect result while stage rerun

2019-08-19 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-28699:
---
Labels: correctness  (was: )

> Cache an indeterminate RDD could lead to incorrect result while stage rerun
> ---
>
> Key: SPARK-28699
> URL: https://issues.apache.org/jira/browse/SPARK-28699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> Related to SPARK-23207 and SPARK-23243.
> It's another case of an indeterminate stage/RDD producing incorrect results 
> when a stage is rerun. In CachedRDDBuilder, we fail to propagate the 
> `isOrderSensitive` characteristic to the newly created MapPartitionsRDD.
> We can reproduce this with the following code; thanks to Tyson for reporting 
> it!
>  
> {code:scala}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1 * 1, 1).map { x => (x % 1000, x) }
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 &&
>       TaskContext.get.stageAttemptNumber == 0) {
>     throw new Exception("pkill -f -n java".!!)
>   }
>   x
> }
> val r2 = df.distinct.count()
> {code}
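
To make the order sensitivity concrete: a round-robin style repartition assigns records by their position in the input, so recomputing it over reordered input places records in different partitions. The snippet below is plain Scala, not Spark code, and is only meant to illustrate why partially rerunning such a stage can drop or duplicate records.

{code:scala}
// Position-based (round-robin style) assignment: the result depends on input order.
def roundRobin[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  records.zipWithIndex
    .groupBy { case (_, idx) => idx % numPartitions }
    .map { case (part, recs) => part -> recs.map(_._1) }

val firstAttempt = roundRobin(Seq("a", "b", "c", "d"), 2)
// partition 0 -> (a, c), partition 1 -> (b, d)
val retryWithReorderedInput = roundRobin(Seq("b", "a", "d", "c"), 2)
// partition 0 -> (b, d), partition 1 -> (a, c)
// If only some output partitions are recomputed from the reordered input, records can
// be lost or duplicated, which is the class of correctness bug tracked here.
{code}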



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org