[jira] [Assigned] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs

2018-07-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24624:


Assignee: Li Jin

> Can not mix vectorized and non-vectorized UDFs
> --
>
> Key: SPARK-24624
> URL: https://issues.apache.org/jira/browse/SPARK-24624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> In the current implementation we have a limitation: users are unable to mix 
> vectorized and non-vectorized UDFs in the same Project. This becomes worse 
> because our optimizer can combine consecutive Projects into a single one. For 
> example, 
> {code}
> applied_df = df.withColumn('regular', my_regular_udf('total', 
> 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty'))
> {code}
> Running it returns the following error: 
> {code}
> IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
> java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized 
> UDFs
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750)
>  ...
> {code}
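
The repro above does not define the two UDFs it uses. A minimal sketch of 
definitions that reproduce the mix is shown below; the names and bodies are 
hypothetical and assume 'total' and 'qty' are numeric columns.
{code}
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

# Hypothetical row-at-a-time (non-vectorized) Python UDF.
my_regular_udf = udf(lambda total, qty: float(total) / float(qty), DoubleType())

# Hypothetical vectorized (pandas) UDF operating on whole pandas Series.
@pandas_udf(DoubleType())
def my_pandas_udf(total, qty):
    return total / qty
{code}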



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24624) Can not mix vectorized and non-vectorized UDFs

2018-07-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24624.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21650
[https://github.com/apache/spark/pull/21650]

> Can not mix vectorized and non-vectorized UDFs
> --
>
> Key: SPARK-24624
> URL: https://issues.apache.org/jira/browse/SPARK-24624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> In the current implementation we have a limitation: users are unable to mix 
> vectorized and non-vectorized UDFs in the same Project. This becomes worse 
> because our optimizer can combine consecutive Projects into a single one. For 
> example, 
> {code}
> applied_df = df.withColumn('regular', my_regular_udf('total', 
> 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty'))
> {code}
> Running it returns the following error: 
> {code}
> IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
> java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized 
> UDFs
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750)
>  ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560624#comment-16560624
 ] 

Apache Spark commented on SPARK-24924:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/21906

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following:
>  # Like the `com.databricks.spark.csv` mapping, map `com.databricks.spark.avro` 
> to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.
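
As a usage-level illustration of item 1, a minimal sketch (hypothetical; assumes 
a `spark` session, the Avro module on the classpath, and placeholder paths):
{code}
# With the mapping in place, the old external package name and the built-in
# short name should both resolve to the built-in Avro data source.
df_old_name = spark.read.format("com.databricks.spark.avro").load("/tmp/events.avro")
df_built_in = spark.read.format("avro").load("/tmp/events.avro")
{code}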



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-27 Thread David Vogelbacher (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Vogelbacher updated SPARK-24957:
--
Description: 
I noticed a bug when doing arithmetic on a dataframe containing decimal values 
with codegen enabled.
I tried to narrow it down to a small repro and got this (executed in 
spark-shell):
{noformat}
scala> val df = Seq(
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("11.88"))
 | ).toDF("text", "number")
df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]

scala> val df_grouped_1 = 
df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_1.collect()
res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_grouped_2 = 
df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_2.collect()
res1: Array[org.apache.spark.sql.Row] = 
Array([a,11948571.4285714285714285714286])

scala> val df_total_sum = 
df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]

scala> df_total_sum.collect()
res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
{noformat}

The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
result of {{df_grouped_2}} is clearly incorrect (it is the value of the correct 
result times {{10^6}}).

When codegen is disabled all results are correct. 
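
For anyone verifying the last point, a minimal sketch in PySpark (assuming 
"codegen" here means whole-stage codegen; the spark-shell call is analogous):
{code}
# Assumption: disabling whole-stage codegen before re-running the repro above
# is what makes all three results come out correct.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}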

  was:
I noticed a bug when doing arithmetic on a dataframe containing decimal values 
with codegen enabled.
I tried to narrow it down to a small repro and got this (executed in 
spark-shell):
{noformat}
scala> val df = Seq(
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("11.88"))
 | ).toDF("text", "number")
df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]

scala> val df_grouped_1 = 
df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_1.collect()
res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_grouped_2 = 
df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_2.collect()
res1: Array[org.apache.spark.sql.Row] = 
Array([a,11948571.4285714285714285714286])

scala> val df_total_sum = 
df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]

scala> df_total_sum.collect()
res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
{noformat}

The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
result of {{df_grouped_2}} is clearly incorrect (it is the value of the correct 
result times {{10^6}}).


> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: 

[jira] [Created] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-27 Thread David Vogelbacher (JIRA)
David Vogelbacher created SPARK-24957:
-

 Summary: Decimal arithmetic can lead to wrong values using codegen
 Key: SPARK-24957
 URL: https://issues.apache.org/jira/browse/SPARK-24957
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: David Vogelbacher


I noticed a bug when doing arithmetic on a dataframe containing decimal values 
with codegen enabled.
I tried to narrow it down to a small repro and got this (executed in 
spark-shell):
{noformat}
scala> val df = Seq(
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("12.0")),
 | ("a", BigDecimal("11.88")),
 | ("a", BigDecimal("11.88"))
 | ).toDF("text", "number")
df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]

scala> val df_grouped_1 = 
df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_1.collect()
res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])

scala> val df_grouped_2 = 
df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
decimal(38,22)]

scala> df_grouped_2.collect()
res1: Array[org.apache.spark.sql.Row] = 
Array([a,11948571.4285714285714285714286])

scala> val df_total_sum = 
df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]

scala> df_total_sum.collect()
res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
{noformat}

The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
result of {{df_grouped_2}} is clearly incorrect (it is the value of the correct 
result times {{10^6}}).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560600#comment-16560600
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

[~ericfchang] Thank you very much for your suggestion. As the first step, I 
created [a PR|https://github.com/apache/spark/pull/21905] to upgrade maven.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache 
> Maven repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24956:


Assignee: (was: Apache Spark)

> Upgrade maven from 3.3.9 to 3.5.4
> -
>
> Key: SPARK-24956
> URL: https://issues.apache.org/jira/browse/SPARK-24956
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Maven 3.3.9 looks pretty old. It would be good to upgrade this to the latest.
> As suggested in SPARK-24895, the current Maven can hit problems with some 
> plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24956:


Assignee: Apache Spark

> Upgrade maven from 3.3.9 to 3.5.4
> -
>
> Key: SPARK-24956
> URL: https://issues.apache.org/jira/browse/SPARK-24956
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
>
> Maven 3.3.9 looks pretty old. It would be good to upgrade this to the latest.
> As suggested in SPARK-24895, the current Maven can hit problems with some 
> plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560595#comment-16560595
 ] 

Apache Spark commented on SPARK-24956:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21905

> Upgrade maven from 3.3.9 to 3.5.4
> -
>
> Key: SPARK-24956
> URL: https://issues.apache.org/jira/browse/SPARK-24956
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Maven 3.3.9 looks pretty old. It would be good to upgrade this to the latest.
> As suggested in SPARK-24895, the current Maven can hit problems with some 
> plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-27 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24956:


 Summary: Upgrade maven from 3.3.9 to 3.5.4
 Key: SPARK-24956
 URL: https://issues.apache.org/jira/browse/SPARK-24956
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


Maven 3.3.9 looks pretty old. It would be good to upgrade this to the latest.

As suggested in SPARK-24895, the current Maven can hit problems with some 
plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-07-27 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-24954:
-
Priority: Blocker  (was: Major)

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" as a Non-Goal in the design doc, we should fail fast on 
> job submit when running a barrier stage with dynamic resource allocation 
> enabled, to avoid confusing behaviors (refer to SPARK-24942 for some 
> examples).
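
A minimal sketch of the combination that should now be rejected at job submit 
(hypothetical PySpark snippet; assumes a SparkContext `sc` started with 
spark.dynamicAllocation.enabled=true):
{code}
# Submitting a barrier stage while dynamic allocation is enabled is the case
# that should fail fast at job submit.
rdd = sc.parallelize(range(8), 4)
rdd.barrier().mapPartitions(lambda it: it).collect()
{code}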



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24809:

Labels: correctness  (was: )

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
>  Labels: correctness
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is a long or int in a broadcast join, Spark uses 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). 
> However, if the broadcast value is abnormally big, the executor will 
> serialize it to disk, and data is lost during serialization.
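
For reference, a minimal sketch of the join shape being described (hypothetical 
sizes; assumes a `spark` session; per the description, a broadcast join on a 
long-typed key is what builds a LongHashedRelation):
{code}
from pyspark.sql import functions as F

# "id" is a long-typed key, and the explicit broadcast hint makes the small
# side a LongHashedRelation. The reported data loss needs that relation to be
# big enough to spill to disk on the executor.
small = spark.range(0, 1000)
large = spark.range(0, 10000000)
joined = large.join(F.broadcast(small), "id")
{code}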



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560544#comment-16560544
 ] 

shane knapp commented on SPARK-24950:
-

the spark-master-test-sbt build failed on ubuntu, but the important bit is 
that the DateTimeUtilsSuite tests passed!

back to unraveling R...  :\

thanks for the quick patch, [~d80tb7]!

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-27 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560537#comment-16560537
 ] 

Wenchen Fan commented on SPARK-24721:
-

good catch! I think we should filter out python UDFs when picking partition 
predicates in `FileSourceStrategy`

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
>     if label == 1.0:
>         return random.uniform(0.5, 1.0)
>     else:
>         return random.uniform(0.0, 0.4999)
> 
> def randomize_label(ratio):
>     if random.random() >= ratio:
>         return 1.0
>     else:
>         return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> 

[jira] [Updated] (SPARK-24955) spark continuing to execute on a task despite not reading all data from a downed machine

2018-07-27 Thread San Tung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

San Tung updated SPARK-24955:
-
Description: 
We've recently run into a few instances where a downed node has led to 
incomplete data, causing correctness issues, which we can reproduce some of the 
time.

*Setup:*
 - we're currently on spark 2.3.0
 - we allow retries on failed tasks and stages
 - we use PySpark to perform these operations

*Stages:*

Simplistically, the job does the following:
 - Stage 1/2: computes a number of `(sha256 hash, 0, 1)` partitioned into 65536 
partitions
 - Stage 3/4: computes a number of `(sha256 hash, 1, 0)` partitioned into 6408 
partitions (one hash may exist in multiple partitions)
 - Stage 5:
 - repartitions stage 2 and stage 4 by the first 2 bytes of each hash, and find 
which ones are not in common (stage 2 hashes - stage 4 hashes).
 - store this partition into a persistent data source.

*Failure Scenario:*
 - We take out one of the machines (do a forced shutdown, for example)
 - For some tasks, stage 5 will die immediately with one of the following:
 - `ExecutorLostFailure (executor 24 exited caused by one of the running tasks) 
Reason: worker lost`
 - `FetchFailed(BlockManagerId(24, [redacted], 36829, None), shuffleId=2, 
mapId=14377, reduceId=48402, message=`
 - these tasks are reused to recompute the stage 1-2 and 3-4 data that was 
missing on the downed node, and Spark recomputes it correctly.
 - However, some tasks still continue executing from Stage 5, seemingly missing 
stage 4 data, dumping incorrect data to the stage 5 data source. We noticed the 
subtract operation taking ~1-2 minutes after the machine goes down, and stores 
a lot more data than usual (which on inspection is wrong).
 - we've seen this happen with slightly different execution plans too which 
don't involve or-ing, but end up being some variant of missing some stage 4 
data.

However, we cannot reproduce this consistently - sometimes all tasks fail 
gracefully. When the downed node is handled correctly, all these tasks fail and 
re-work stages 1-2/3-4. Note that this solution produces the correct results if machines 
stay alive!

We were wondering if a machine going down can result in a state where a task 
could keep executing even though not all data has been fetched which gives us 
incorrect results (or if there is setting that allows this - we tried scanning 
spark configs up and down). This seems similar to 
https://issues.apache.org/jira/browse/SPARK-24160 (maybe we get an empty 
packet?), but it doesn't look like that was to explicitly resolve any known bug.

  was:
We've recently run into a few instances where a downed node has led to 
incomplete data, causing correctness issues, which we can reproduce some of the 
time.

Setup:
 - we're currently on spark 2.3.0
 - we allow retries on failed tasks and stages
 - we use PySpark to perform these operations

Stages:

Simplistically, the job does the following:
 - Stage 1/2: computes a number of `(sha256 hash, 0, 1)` partitioned into 65536 
partitions
 - Stage 3/4: computes a number of `(sha256 hash, 1, 0)` partitioned into 6408 
partitions (one hash may exist in multiple partitions)
 - Stage 5:
 - repartitions stage 2 and stage 4 by the first 2 bytes of each hash, and find 
which ones are not in common (stage 2 hashes - stage 4 hashes).
 - store this partition into a persistent data source.

Failure Scenario:
 - We take out one of the machines (do a forced shutdown, for example)
 - For some tasks, stage 5 will die immediately with one of the following:
 - `ExecutorLostFailure (executor 24 exited caused by one of the running tasks) 
Reason: worker lost`
 - `FetchFailed(BlockManagerId(24, [redacted], 36829, None), shuffleId=2, 
mapId=14377, reduceId=48402, message=`
 - these tasks are reused to recompute the stage 1-2 and 3-4 data that was 
missing on the downed node, and Spark recomputes it correctly.
 - However, some tasks still continue executing from Stage 5, seemingly missing 
stage 4 data, dumping incorrect data to the stage 5 data source. We noticed the 
subtract operation taking ~1-2 minutes after the machine goes down, and stores 
a lot more data than usual (which on inspection is wrong).
 - we've seen this happen with slightly different execution plans too which 
don't involve or-ing, but end up being some variant of missing some stage 4 
data.

However, we cannot reproduce this consistently - sometimes all tasks fail 
gracefully. When the downed node is handled correctly, all these tasks fail and 
re-work stages 1-2/3-4. Note that this solution produces the correct results if machines 
stay alive!

We were wondering if a machine going down can result in a state where a task 
could keep executing even though not all data has been fetched which gives us 
incorrect results (or if there is setting that allows this - we tried scanning 
spark configs up and down). This seems similar to 

[jira] [Created] (SPARK-24955) spark continuing to execute on a task despite not reading all data from a downed machine

2018-07-27 Thread San Tung (JIRA)
San Tung created SPARK-24955:


 Summary: spark continuing to execute on a task despite not reading 
all data from a downed machine
 Key: SPARK-24955
 URL: https://issues.apache.org/jira/browse/SPARK-24955
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Shuffle
Affects Versions: 2.3.0
Reporter: San Tung


We've recently run into a few instances where a downed node has led to 
incomplete data, causing correctness issues, which we can reproduce some of the 
time.

Setup:
 - we're currently on spark 2.3.0
 - we allow retries on failed tasks and stages
 - we use PySpark to perform these operations

Stages:

Simplistically, the job does the following:
 - Stage 1/2: computes a number of `(sha256 hash, 0, 1)` partitioned into 65536 
partitions
 - Stage 3/4: computes a number of `(sha256 hash, 1, 0)` partitioned into 6408 
partitions (one hash may exist in multiple partitions)
 - Stage 5:
 - repartitions stage 2 and stage 4 by the first 2 bytes of each hash, and find 
which ones are not in common (stage 2 hashes - stage 4 hashes).
 - store this partition into a persistent data source.

Failure Scenario:
 - We take out one of the machines (do a forced shutdown, for example)
 - For some tasks, stage 5 will die immediately with one of the following:
 - `ExecutorLostFailure (executor 24 exited caused by one of the running tasks) 
Reason: worker lost`
 - `FetchFailed(BlockManagerId(24, [redacted], 36829, None), shuffleId=2, 
mapId=14377, reduceId=48402, message=`
 - these tasks are reused to recompute the stage 1-2 and 3-4 data that was 
missing on the downed node, and Spark recomputes it correctly.
 - However, some tasks still continue executing from Stage 5, seemingly missing 
stage 4 data, dumping incorrect data to the stage 5 data source. We noticed the 
subtract operation taking ~1-2 minutes after the machine goes down, and stores 
a lot more data than usual (which on inspection is wrong).
 - we've seen this happen with slightly different execution plans too which 
don't involve or-ing, but end up being some variant of missing some stage 4 
data.

However, we cannot reproduce this consistently - sometimes all tasks fail 
gracefully. When the downed node is handled correctly, all these tasks fail and 
re-work stages 1-2/3-4. Note that this solution produces the correct results if machines 
stay alive!

We were wondering if a machine going down can result in a state where a task 
could keep executing even though not all data has been fetched which gives us 
incorrect results (or if there is setting that allows this - we tried scanning 
spark configs up and down). This seems similar to 
https://issues.apache.org/jira/browse/SPARK-24160 (maybe we get an empty 
packet?), but it doesn't look like that was to explicitly resolve any known bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-07-27 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24954:


 Summary: Fail fast on job submit if run a barrier stage with 
dynamic resource allocation enabled
 Key: SPARK-24954
 URL: https://issues.apache.org/jira/browse/SPARK-24954
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


Since we explicitly listed "Support running barrier stage with dynamic resource 
allocation" a Non-Goal in the design doc, we shall fail fast on job submit if 
running a barrier stage with dynamic resource allocation enabled, to avoid some 
confusing behaviors (can refer to SPARK-24942 for some examples).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24953) Prune a branch in `CaseWhen` if previously seen

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560498#comment-16560498
 ] 

Apache Spark commented on SPARK-24953:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21904

> Prune a branch in `CaseWhen` if previously seen
> ---
>
> Key: SPARK-24953
> URL: https://issues.apache.org/jira/browse/SPARK-24953
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> If a condition in a branch is previously seen, this branch can be pruned.
> If the outputs of two adjacent branches are the same, two branches can be 
> combined.
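
A minimal sketch of the two patterns (hypothetical expressions; assumes a 
`spark` session; the second WHEN in the first expression repeats an earlier 
condition, and the two branches in the second expression share the same output):
{code}
from pyspark.sql import functions as F

df = spark.range(10).withColumn("x", F.col("id") - 5)

# Branch pruning: the second WHEN repeats the first condition, so it can never
# be selected and could be dropped by the optimizer.
pruned = df.select(
    F.when(F.col("x") > 0, "pos")
     .when(F.col("x") > 0, "dup")       # duplicate condition -> prunable
     .otherwise("other").alias("label"))

# Branch combining: two adjacent branches return the same output, so their
# conditions could be OR-ed into one branch.
combined = df.select(
    F.when(F.col("x") > 0, "same")
     .when(F.col("x") < -3, "same")     # same output as the branch above
     .otherwise("other").alias("label"))
{code}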



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24953) Prune a branch in `CaseWhen` if previously seen

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24953:


Assignee: (was: Apache Spark)

> Prune a branch in `CaseWhen` if previously seen
> ---
>
> Key: SPARK-24953
> URL: https://issues.apache.org/jira/browse/SPARK-24953
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> If a condition in a branch is previously seen, this branch can be pruned.
> If the outputs of two adjacent branches are the same, two branches can be 
> combined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24953) Prune a branch in `CaseWhen` if previously seen

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24953:


Assignee: Apache Spark

> Prune a branch in `CaseWhen` if previously seen
> ---
>
> Key: SPARK-24953
> URL: https://issues.apache.org/jira/browse/SPARK-24953
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> If a condition in a branch is previously seen, this branch can be pruned.
> If the outputs of two adjacent branches are the same, two branches can be 
> combined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24953) Prune a branch in `CaseWhen` if previously seen

2018-07-27 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24953:
---

 Summary: Prune a branch in `CaseWhen` if previously seen
 Key: SPARK-24953
 URL: https://issues.apache.org/jira/browse/SPARK-24953
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai


If a condition in a branch is previously seen, this branch can be pruned.

If the outputs of two adjacent branches are the same, two branches can be 
combined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes

2018-07-27 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560436#comment-16560436
 ] 

Yinan Li commented on SPARK-24724:
--

Sorry, I haven't got a chance to look into this. What pieces of info and what 
kind of access do we need to provide? I saw some comments on the similar JIRA 
for YARN, particularly the one quoted below:

"The main problem is how to provide necessary information for barrier tasks to 
start MPI job in a password-less manner".

Is the main problem the same for Kubernetes?

> Discuss necessary info and access in barrier mode + Kubernetes
> --
>
> Key: SPARK-24724
> URL: https://issues.apache.org/jira/browse/SPARK-24724
> Project: Spark
>  Issue Type: Story
>  Components: Kubernetes, ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Yinan Li
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Kubernetes. There were 
> some past and ongoing attempts from the Kubernetes community. So we should 
> find someone with good knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24950:
--
Affects Version/s: 2.2.2

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.2.2, 2.3.1, 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24950:
--
Affects Version/s: 2.3.1

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.3.1, 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560411#comment-16560411
 ] 

Apache Spark commented on SPARK-24950:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/21903

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24952) Support LZMA2 compression by Avro datasource

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24952:


Assignee: Apache Spark

> Support LZMA2 compression by Avro datasource
> 
>
> Key: SPARK-24952
> URL: https://issues.apache.org/jira/browse/SPARK-24952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> LZMA2 (XZ) has a much better compression ratio than the currently supported 
> snappy and deflate codecs. The underlying Avro library already supports this 
> compression codec. We need to set parameters for the codec and allow users to 
> specify "xz" compression via AvroOptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24952) Support LZMA2 compression by Avro datasource

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560357#comment-16560357
 ] 

Apache Spark commented on SPARK-24952:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21902

> Support LZMA2 compression by Avro datasource
> 
>
> Key: SPARK-24952
> URL: https://issues.apache.org/jira/browse/SPARK-24952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> LZMA2 (XZ) has a much better compression ratio than the currently supported 
> snappy and deflate codecs. The underlying Avro library already supports this 
> compression codec. We need to set parameters for the codec and allow users to 
> specify "xz" compression via AvroOptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24952) Support LZMA2 compression by Avro datasource

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24952:


Assignee: (was: Apache Spark)

> Support LZMA2 compression by Avro datasource
> 
>
> Key: SPARK-24952
> URL: https://issues.apache.org/jira/browse/SPARK-24952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> LZMA2 (XZ) has a much better compression ratio than the currently supported 
> snappy and deflate codecs. The underlying Avro library already supports this 
> compression codec. We need to set parameters for the codec and allow users to 
> specify "xz" compression via AvroOptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24952) Support LZMA2 compression by Avro datasource

2018-07-27 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24952:
--

 Summary: Support LZMA2 compression by Avro datasource
 Key: SPARK-24952
 URL: https://issues.apache.org/jira/browse/SPARK-24952
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


LZMA2 (XZ) has a much better compression ratio than the currently supported 
snappy and deflate codecs. The underlying Avro library already supports this 
compression codec. We need to set parameters for the codec and allow users to 
specify "xz" compression via AvroOptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-27 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560337#comment-16560337
 ] 

Li Jin edited comment on SPARK-24721 at 7/27/18 9:18 PM:
-

I think the issue is that the UDF is being pushed down to the PartitionFilters of 
the FileScan physical node, and the ExtractPythonUDFs rule then throws the 
exception (this is the Spark plan before executing the ExtractPythonUDFs rule):
{code:java}
Project [_c0#17, (1) AS v1#20]
+- Filter ((1) = 0)
   +- FileScan csv [_c0#17] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/tmp/tab3], PartitionFilters: [((1) = 0)], 
PushedFilters: [], ReadSchema: struct<_c0:string>
{code}
I am not familiar with how PartitionFilters pushdown is supposed to work. 
[~smilegator] and [~cloud_fan], could you maybe point me in the right 
direction? Should we not push down the filter to FileScan? Or should we 
ignore it in the ExtractPythonUDFs rule?

 


was (Author: icexelloss):
I think the issue is that the UDF is being pushed down to the PartitionFilters of 
the FileScan physical node (this is the Spark plan before executing the 
ExtractPythonUDFs rule):

 
{code:java}
Project [_c0#17, (1) AS v1#20]
+- Filter ((1) = 0)
   +- FileScan csv [_c0#17] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/tmp/tab3], PartitionFilters: [((1) = 0)], 
PushedFilters: [], ReadSchema: struct<_c0:string>
{code}
I am not familiar with how PartitionFilters pushdown is supposed to work. 
[~smilegator] and [~cloud_fan], could you maybe point me in the right 
direction? Should we not push down the filter to FileScan? Or should we 
ignore it in the ExtractPythonUDFs rule?

 

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> 

[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-27 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560337#comment-16560337
 ] 

Li Jin commented on SPARK-24721:


I think the issue is that the UDF is being pushed down to the PartitionFilters of 
the FileScan physical node (this is the Spark plan before executing the 
ExtractPythonUDFs rule):

 
{code:java}
Project [_c0#17, (1) AS v1#20]
+- Filter ((1) = 0)
   +- FileScan csv [_c0#17] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/tmp/tab3], PartitionFilters: [((1) = 0)], 
PushedFilters: [], ReadSchema: struct<_c0:string>
{code}
I am not familiar with how PartitionFilters pushdown is supposed to work. 
[~smilegator] and [~cloud_fan], could you maybe point me in the right 
direction? Should we not push down the filter to FileScan? Or should we 
ignore it in the ExtractPythonUDFs rule?

 

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> 

[jira] [Updated] (SPARK-23243) Shuffle+Repartition on an RDD could lead to incorrect answers

2018-07-27 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-23243:
--
Priority: Blocker  (was: Major)

> Shuffle+Repartition on an RDD could lead to incorrect answers
> -
>
> Key: SPARK-23243
> URL: https://issues.apache.org/jira/browse/SPARK-23243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>
> RDD repartition also uses a round-robin way to distribute data, which can 
> cause incorrect answers on RDD workloads in a similar way to 
> https://issues.apache.org/jira/browse/SPARK-23207
> The approach that fixes DataFrame.repartition() doesn't apply to the RDD 
> repartition issue, as discussed in 
> https://github.com/apache/spark/pull/20393#issuecomment-360912451
> We track alternative solutions for this issue in this task.
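For readers unfamiliar with the pattern, a minimal sketch of the kind of RDD workload at risk follows; it is illustrative only and does not deterministically reproduce the bug.

{code:python}
# Hedged sketch of the at-risk shape: a shuffle whose output order is not deterministic,
# followed by a round-robin repartition. If a task is retried after a fetch failure,
# rows can be routed to different partitions across attempts, which is how records can
# be lost or duplicated.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
counts = (sc.parallelize(range(1000000), 100)
            .map(lambda x: (x % 1000, 1))
            .reduceByKey(lambda a, b: a + b))  # shuffle output order is not fixed
repartitioned = counts.repartition(50)         # round-robin placement depends on that order
print(repartitioned.count())
{code}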



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException

2018-07-27 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-24951:
---

 Summary: Table valued functions should throw AnalysisException 
instead of IllegalArgumentException
 Key: SPARK-24951
 URL: https://issues.apache.org/jira/browse/SPARK-24951
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 2.4.0


When arguments don't match, TVFs currently throw IllegalArgumentException, 
inconsistent with other functions, which throw AnalysisException.
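For illustration, a small PySpark sketch of one way to hit the argument-mismatch path using the built-in range table-valued function; the exact arity that fails and how Py4J surfaces the current error are assumptions, not part of this ticket.

{code:python}
# Hedged sketch: call the built-in `range` TVF with an argument count it does not accept.
# Today the mismatch is reported as an IllegalArgumentException (wrapped by Py4J);
# after this change it should surface as an AnalysisException instead.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
try:
    spark.sql("SELECT * FROM range(1, 2, 3, 4, 5)").show()
except AnalysisException as e:
    print("AnalysisException (desired):", e)
except Exception as e:
    print("Current behaviour:", type(e).__name__)
{code}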



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-24950:

Comment: was deleted

(was: 
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/877/]

now we wait...)

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560318#comment-16560318
 ] 

shane knapp commented on SPARK-24950:
-

testing this manually:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/878/

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Chris Martin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560312#comment-16560312
 ] 

Chris Martin commented on SPARK-24950:
--

Hi,

 

Just to say that I looked at this and came to the same conclusion as Shane.  
I've submitted a PR which excludes both New Year's Eve and New Year's Day from 
the test, which should mean it will work on both old and new JVMs.

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Chris Martin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560312#comment-16560312
 ] 

Chris Martin edited comment on SPARK-24950 at 7/27/18 8:48 PM:
---

Hi,

 

Just to say that I looked at this and came to the same conclusion as Sean.  
I've submitted a PR which excludes both New Year's Eve and New Year's Day from 
the test, which should mean it will work on both old and new JVMs.


was (Author: d80tb7):
Hi,

 

Just to say that I looked at this and came to the same conclusion as Shane.  
I've submitted a PR which excludes both New Year's Eve and New Year's Day from 
the test, which should mean it will work on both old and new JVMs.

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24950:


Assignee: (was: Apache Spark)

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560310#comment-16560310
 ] 

shane knapp commented on SPARK-24950:
-

[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/877/]

now we wait...

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24950:


Assignee: Apache Spark

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Assignee: Apache Spark
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560308#comment-16560308
 ] 

Apache Spark commented on SPARK-24950:
--

User 'd80tb7' has created a pull request for this issue:
https://github.com/apache/spark/pull/21901

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560306#comment-16560306
 ] 

shane knapp commented on SPARK-24950:
-

sgtm

i also dug through the java release notes WRT timezone changes and didn't find 
anything (which i forgot to mention).  sorry about that!  :)

i'll start by just commenting out the failing TZ (Pacific/Enderbury) and see if 
that works.

 

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-07-27 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560283#comment-16560283
 ] 

Li Jin commented on SPARK-24721:


{code:java}
from pyspark.sql.functions import udf, lit, col

spark.range(1).write.mode("overwrite").format('csv').save("/tmp/tab3")
df = spark.read.csv('/tmp/tab3')
df2 = df.withColumn('v1', udf(lambda x: x, 'int')(lit(1)))
df2 = df2.filter(df2['v1'] == 0)

df2.explain()
{code}
This is a simpler way to reproduce it.
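Not a fix, but a hedged workaround sketch against the reproduction above: materialize the column produced by the Python UDF before filtering on it, so the predicate is less likely to be pushed into the scan. Whether this actually sidesteps the ExtractPythonUDFs failure is an assumption, not verified here.

{code:python}
# Hedged workaround sketch only: cache the DataFrame carrying the Python UDF column
# before filtering, hoping the filter is then evaluated against the cached relation
# rather than pushed into the FileScan's PartitionFilters.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, lit

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('/tmp/tab3')
df2 = df.withColumn('v1', udf(lambda x: x, 'int')(lit(1))).cache()
df2.filter(df2['v1'] == 0).explain()
{code}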

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> 

[jira] [Commented] (SPARK-23146) Support client mode for Kubernetes cluster backend

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560279#comment-16560279
 ] 

Apache Spark commented on SPARK-23146:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/21900

> Support client mode for Kubernetes cluster backend
> --
>
> Key: SPARK-23146
> URL: https://issues.apache.org/jira/browse/SPARK-23146
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
> Fix For: 2.4.0
>
>
> This issue tracks client mode support within Spark when running in the 
> Kubernetes cluster backend.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560249#comment-16560249
 ] 

Sean Owen commented on SPARK-24950:
---

It's pretty clear this is down to differences in how time zones are defined, as 
they change over time and the JDK incorporates updated versions of the standard 
definitions in each release. 

It looks like the difference between _171 and _181 is the difference between 
2018c and 2018e in this table: 
http://www.oracle.com/technetwork/java/javase/tzdata-versions-138805.html

Nothing obviously relevant from Oracle's release notes. But I found this in the 
notes for 2018d:

[http://mm.icann.org/pipermail/tz-announce/2018-March/49.html]

"Enderbury and Kiritimati skipped New Year's Eve 1994, not New Year's Day 1995. 
 (Thanks to Kerry Shetline.)"

So the answer is probably that the test has to be updated to reflect the fix to 
the timezone definition.

 

Of course, if the test changes, it also starts failing on older Java 8 
versions! Probably not worth it.

I'd suggest we resolve it by commenting this out with a note. There's no 
evidence this is a problem in Spark itself.
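For anyone who wants to poke at this outside the Scala suite, here is a rough PySpark sketch of the same day-to-millis-and-back round trip, pinned to the affected zones and dates. It assumes date/timestamp casts go through the session time zone; it is an illustration, not the actual DateTimeUtilsSuite test.

{code:python}
# Hedged sketch: round-trip a date through a timestamp in the affected time zones.
# A row where d != d2 would point at the tzdata change for 1994-12-31 / 1995-01-01.
from datetime import date
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()
for tz in ["Pacific/Enderbury", "Pacific/Kiritimati"]:
    spark.conf.set("spark.sql.session.timeZone", tz)
    df = spark.createDataFrame([(date(1994, 12, 31),), (date(1995, 1, 1),)], ["d"])
    df.select("d", to_date(col("d").cast("timestamp")).alias("d2")).show()
{code}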

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24922) Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread Dinesh Dharme (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Dharme updated SPARK-24922:
--
Description: 
I am trying to do a few (union + reduceByKey) operations on a hierarchical 
dataset in an iterative fashion in RDD. The first few loops run fine, but on the 
subsequent loops the operations end up using the whole scratch space provided 
to them.

I have set the scratch directory, i.e. SPARK_LOCAL_DIRS, to one with *100 
GB* of space.

The hierarchical dataset, whose size is small (< 400 kB), remains constant 
throughout the iterations.

I have tried the worker cleanup flag, i.e. "spark.worker.cleanup.enabled=true", 
but it has no effect.

 

Error : 

 
{noformat}
Caused by: java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply$mcVJ$sp(IndexShuffleBlockResolver.scala:151)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:149)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:149)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply$mcV$sp(IndexShuffleBlockResolver.scala:149)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply(IndexShuffleBlockResolver.scala:145)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply(IndexShuffleBlockResolver.scala:145)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:153)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
 

*What I am trying to do (High Level)*:

I have a dataset of 5 different CSVs (Parent, Child1, Child2, Child21, Child22) 
which are related in a hierarchical fashion as shown below. 

Parent-> Child1 -> Child2  -> Child21 

Parent-> Child1 -> Child2  -> Child22 

Each element in the tree has 14 columns (elementid, parentelement_id, cat1, 
cat2, num1, num2, ..., num10).

I am trying to aggregate the values of one column of Child21 into Child1 (i.e. 
2 levels up). I am doing the same for another column value of Child22 into 
Child1. Then I am merging these aggregated values at the same Child1 level.

This is present in the code at location : 

spark.rddexample.dummyrdd.tree.child1.events.Function1

 

 

*Code which replicates the issue*: 

1] [https://github.com/dineshdharme/SparkRddShuffleIssue]

 

*Steps to reproduce the issue:* 

1] Clone the above repository.

2] Put the CSVs from the "issue-data" folder in the above repository at the 
Hadoop location "hdfs:///tree/dummy/data/"

3] Set the spark scratch directory (SPARK_LOCAL_DIRS) to a folder which has 
large space. (> *100 GB*)

4] Run "sbt assembly"

5] Run the following command at the project location : 

/path/to/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
 --class spark.rddexample.dummyrdd.FunctionExecutor \
 --master local[2] \
 --deploy-mode client \
 --executor-memory 2G \
 --driver-memory 2G \
 target/scala-2.11/rdd-shuffle-assembly-0.1.0.jar \
 20 \
 hdfs:///tree/dummy/data/ \
 hdfs:///tree/dummy/results/   
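For reference, a minimal sketch of the iterative (union + reduceByKey) shape described above; the names, sizes and loop count are illustrative only, the real logic lives in the linked repository.

{code:python}
# Hedged sketch of the loop shape: each iteration unions new data into the running RDD
# and reduces by key, adding another shuffle whose files stay on disk while the lineage
# is still referenced, which is consistent with the scratch directory filling up.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
agg = sc.parallelize([], 4)
for i in range(20):
    batch = sc.parallelize([(k, 1) for k in range(1000)], 4)
    agg = agg.union(batch).reduceByKey(lambda a, b: a + b)
    agg.count()  # materialize each round; earlier shuffle files are not reclaimed yet
{code}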

  was:
I am trying to do a few (union + reduceByKey) operations on a hierarchical 
dataset in an iterative fashion in RDD. The first few loops run fine, but on the 
subsequent loops the operations end up using the whole scratch space provided 
to them.

I have set the scratch directory, i.e. SPARK_LOCAL_DIRS, to one with *100 
GB* of space.

The hierarchical dataset, whose size is small (< 400 kB), remains constant 
throughout the iterations.

 I have tried the worker cleanup flag but it has no effect i.e. 

[jira] [Updated] (SPARK-24702) Unable to cast to calendar interval in spark sql.

2018-07-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24702:

Target Version/s: 3.0.0

> Unable to cast to calendar interval in spark sql.
> -
>
> Key: SPARK-24702
> URL: https://issues.apache.org/jira/browse/SPARK-24702
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Priyanka Garg
>Priority: Major
>
> When I try to cast a string type to the calendar interval type, I get the 
> following error:
> spark.sql("select cast(cast(interval '1' day as string) as 
> calendarinterval)").show()
> ^^^
>  
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitPrimitiveDataType$1.apply(AstBuilder.scala:1673)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitPrimitiveDataType$1.apply(AstBuilder.scala:1651)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:1651)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:49)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$PrimitiveDataTypeContext.accept(SqlBaseParser.java:13779)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:55)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.org$apache$spark$sql$catalyst$parser$AstBuilder$$visitSparkDataType(AstBuilde
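
For illustration, a hedged spark-shell sketch of the behavior described above (assumed 
Spark 2.3.x): interval literals parse fine, but calendarinterval is not recognized as a 
primitive type name in a cast, which is where visitPrimitiveDataType raises the error.

{code:scala}
// spark-shell sketch; `spark` is the session provided by the shell.
spark.sql("select interval '1' day").show()   // an interval literal works

// The cast below is what fails: the parser does not accept `calendarinterval`
// as a primitive data type name.
spark.sql("select cast(cast(interval '1' day as string) as calendarinterval)").show()
{code}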



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24912) Broadcast join OutOfMemory stack trace obscures actual cause of OOM

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24912:


Assignee: Apache Spark

> Broadcast join OutOfMemory stack trace obscures actual cause of OOM
> ---
>
> Key: SPARK-24912
> URL: https://issues.apache.org/jira/browse/SPARK-24912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> When the Spark driver suffers an OutOfMemoryError while attempting to 
> broadcast a table for a broadcast join, the resulting stack trace obscures 
> the actual cause of the OOM. For example:
> {noformat}
> [GC (Allocation Failure)  585453K->585453K(928768K), 0.0060025 secs]
> [Full GC (Allocation Failure)  585453K->582524K(928768K), 0.4019639 secs]
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid12446.hprof ...
> Heap dump file created [632701033 bytes in 1.016 secs]
> Exception in thread "main" java.lang.OutOfMemoryError: Not enough memory to 
> build and broadcast the table to all worker nodes. As a workaround, you can 
> either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to 
> -1 or increase the spark driver memory by setting spark.driver.memory to a 
> higher value
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 30
> 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 35
> {noformat}
> The above stack trace blames BroadcastExchangeExec. However, the given line 
> is actually where the original OutOfMemoryError was caught and a new one was 
> created and wrapped by a SparkException. The actual location where the OOM 
> occurred was in LongToUnsafeRowMap#grow, at this line:
> {noformat}
> val newPage = new Array[Long](newNumWords.toInt)
> {noformat}
> Sometimes it is helpful to know the actual location from which an OOM is 
> thrown. In the above case, the location indicated that Spark underestimated 
> the size of a large-ish table and ran out of memory trying to load it into 
> memory.
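
A sketch of one way to keep the original throw site visible (this is not the actual 
patch in the linked pull request; the method name below is a stand-in): attach the 
caught OutOfMemoryError as the cause of the wrapping error, so the printed trace still 
shows the real location, e.g. LongToUnsafeRowMap#grow.

{code:scala}
// Self-contained sketch: a stand-in method plays the role of the code that runs out
// of memory; the point is only that initCause() preserves the original stack trace.
object OomWrappingSketch {
  private def buildBroadcastRelation(): Unit =
    throw new OutOfMemoryError("Java heap space") // hypothetical stand-in

  def main(args: Array[String]): Unit = {
    try {
      buildBroadcastRelation()
    } catch {
      case oe: OutOfMemoryError =>
        val wrapped = new OutOfMemoryError(
          "Not enough memory to build and broadcast the table to all worker nodes.")
        wrapped.initCause(oe) // the printed trace now also shows the original location
        throw wrapped
    }
  }
}
{code}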



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24912) Broadcast join OutOfMemory stack trace obscures actual cause of OOM

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24912:


Assignee: (was: Apache Spark)

> Broadcast join OutOfMemory stack trace obscures actual cause of OOM
> ---
>
> Key: SPARK-24912
> URL: https://issues.apache.org/jira/browse/SPARK-24912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When the Spark driver suffers an OutOfMemoryError while attempting to 
> broadcast a table for a broadcast join, the resulting stack trace obscures 
> the actual cause of the OOM. For example:
> {noformat}
> [GC (Allocation Failure)  585453K->585453K(928768K), 0.0060025 secs]
> [Full GC (Allocation Failure)  585453K->582524K(928768K), 0.4019639 secs]
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid12446.hprof ...
> Heap dump file created [632701033 bytes in 1.016 secs]
> Exception in thread "main" java.lang.OutOfMemoryError: Not enough memory to 
> build and broadcast the table to all worker nodes. As a workaround, you can 
> either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to 
> -1 or increase the spark driver memory by setting spark.driver.memory to a 
> higher value
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 30
> 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 35
> {noformat}
> The above stack trace blames BroadcastExchangeExec. However, the given line 
> is actually where the original OutOfMemoryError was caught and a new one was 
> created and wrapped by a SparkException. The actual location where the OOM 
> occurred was in LongToUnsafeRowMap#grow, at this line:
> {noformat}
> val newPage = new Array[Long](newNumWords.toInt)
> {noformat}
> Sometimes it is helpful to know the actual location from which an OOM is 
> thrown. In the above case, the location indicated that Spark underestimated 
> the size of a large-ish table and ran out of memory trying to load it into 
> memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24912) Broadcast join OutOfMemory stack trace obscures actual cause of OOM

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560155#comment-16560155
 ] 

Apache Spark commented on SPARK-24912:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/21899

> Broadcast join OutOfMemory stack trace obscures actual cause of OOM
> ---
>
> Key: SPARK-24912
> URL: https://issues.apache.org/jira/browse/SPARK-24912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When the Spark driver suffers an OutOfMemoryError while attempting to 
> broadcast a table for a broadcast join, the resulting stack trace obscures 
> the actual cause of the OOM. For example:
> {noformat}
> [GC (Allocation Failure)  585453K->585453K(928768K), 0.0060025 secs]
> [Full GC (Allocation Failure)  585453K->582524K(928768K), 0.4019639 secs]
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid12446.hprof ...
> Heap dump file created [632701033 bytes in 1.016 secs]
> Exception in thread "main" java.lang.OutOfMemoryError: Not enough memory to 
> build and broadcast the table to all worker nodes. As a workaround, you can 
> either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to 
> -1 or increase the spark driver memory by setting spark.driver.memory to a 
> higher value
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 30
> 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 35
> {noformat}
> The above stack trace blames BroadcastExchangeExec. However, the given line 
> is actually where the original OutOfMemoryError was caught and a new one was 
> created and wrapped by a SparkException. The actual location where the OOM 
> occurred was in LongToUnsafeRowMap#grow, at this line:
> {noformat}
> val newPage = new Array[Long](newNumWords.toInt)
> {noformat}
> Sometimes it is helpful to know the actual location from which an OOM is 
> thrown. In the above case, the location indicated that Spark underestimated 
> the size of a large-ish table and ran out of memory trying to load it into 
> memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560126#comment-16560126
 ] 

shane knapp commented on SPARK-24950:
-

one solution, of course, is to pin the java version on the upcoming ubuntu 
workers to one that passes this test, but things like this make build engineers 
like me die a little bit inside.

> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this exact build on the other ubuntu workers, it passes.  
> the systems are set up (for the most part) identically except for the java 
> version:
> sknapp@amp-jenkins-staging-worker-02:~$ java -version
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> there are some minor kernel and other package differences on these ubuntu 
> workers, but nothing that (in my opinion) would affect this test.  i am 
> willing to help investigate this, however.
> the test also passes on the centos 6.9 workers, which have the following java 
> version installed:
> [sknapp@amp-jenkins-worker-05 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> my guess is that either:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
> or
> sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
> is doing something wrong.  i am not a scala expert by any means, so i'd 
> really like some help in trying to un-block the project to port the builds to 
> ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-24950:

Description: 
during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
encountered a strange and apparently java version-specific failure on *one* 
specific unit test.

the failure is here:

[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]

the java version on this worker is:

sknapp@ubuntu-testing:~$ java -version
 java version "1.8.0_181"
 Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

however, when i run this exact build on the other ubuntu workers, it passes.  
the systems are set up (for the most part) identically except for the java 
version:

sknapp@amp-jenkins-staging-worker-02:~$ java -version
 java version "1.8.0_171"
 Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
 Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

there are some minor kernel and other package differences on these ubuntu 
workers, but nothing that (in my opinion) would affect this test.  i am willing 
to help investigate this, however.

the test also passes on the centos 6.9 workers, which have the following java 
version installed:

[sknapp@amp-jenkins-worker-05 ~]$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

my guess is that either:

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala

or

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala

is doing something wrong.  i am not a scala expert by any means, so i'd really 
like some help in trying to un-block the project to port the builds to ubuntu.

  was:
during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
encountered a strange and apparently java version-specific failure on *one* 
specific unit test.

the failure is here:

[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]

the java version on this worker is:

sknapp@ubuntu-testing:~$ java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

however, when i run this exact build on the other ubuntu workers, it passes.  
the systems are set up (for the most part) identically except for the java 
version:

sknapp@amp-jenkins-staging-worker-02:~$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

there are some minor kernel and other package differences on these ubuntu 
workers, but nothing that (in my opinion) would affect this test.  i am willing 
to help investigate this, however.

the test also passes on the centos 6.9 workers, which have the following java 
version installed:

[sknapp@amp-jenkins-worker-05 ~]$ java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

my guess is that either:

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala

or

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala

is doing something wrong.  i am not a scala expert by any means, so i'd really 
like some help in trying to un-block the project to port the builds to ubuntu.


> scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
> -
>
> Key: SPARK-24950
> URL: https://issues.apache.org/jira/browse/SPARK-24950
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Priority: Major
>
> during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
> encountered a strange and apparently java version-specific failure on *one* 
> specific unit test.
> the failure is here:
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]
> the java version on this worker is:
> sknapp@ubuntu-testing:~$ java -version
>  java version "1.8.0_181"
>  Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> however, when i run this 

[jira] [Created] (SPARK-24950) scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

2018-07-27 Thread shane knapp (JIRA)
shane knapp created SPARK-24950:
---

 Summary: scala DateTimeUtilsSuite daysToMillis and millisToDays 
fails w/java 8 181-b13
 Key: SPARK-24950
 URL: https://issues.apache.org/jira/browse/SPARK-24950
 Project: Spark
  Issue Type: Bug
  Components: Build, Tests
Affects Versions: 2.4.0
Reporter: shane knapp


during my travails to port the spark builds to run on ubuntu 16.04LTS, i have 
encountered a strange and apparently java version-specific failure on *one* 
specific unit test.

the failure is here:

[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/868/testReport/junit/org.apache.spark.sql.catalyst.util/DateTimeUtilsSuite/daysToMillis_and_millisToDays/]

the java version on this worker is:

sknapp@ubuntu-testing:~$ java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

however, when i run this exact build on the other ubuntu workers, it passes.  
the systems are set up (for the most part) identically except for the java 
version:

sknapp@amp-jenkins-staging-worker-02:~$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

there are some minor kernel and other package differences on these ubuntu 
workers, but nothing that (in my opinion) would affect this test.  i am willing 
to help investigate this, however.

the test also passes on the centos 6.9 workers, which have the following java 
version installed:

[sknapp@amp-jenkins-worker-05 ~]$ java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

my guess is that either:

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala

or

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala

is doing something wrong.  i am not a scala expert by any means, so i'd really 
like some help in trying to un-block the project to port the builds to ubuntu.
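
The failing test is essentially a days -> millis -> days round trip over many 
historical dates. The rough, self-contained sketch below uses assumed semantics, not 
the actual DateTimeUtils code, and shows how such a round trip can break in time zones 
whose historical offsets changed between the tzdata releases bundled with different 
JDK builds:

{code:scala}
import java.util.TimeZone

object DayMillisRoundTripSketch {
  private val MillisPerDay = 24L * 60 * 60 * 1000

  // Rough stand-ins for the helpers under test: a "day" is days since 1970-01-01,
  // interpreted as local midnight in the given time zone (assumed semantics).
  def daysToMillis(days: Int, tz: TimeZone): Long = {
    val utcMidnight = days * MillisPerDay
    utcMidnight - tz.getOffset(utcMidnight)
  }

  def millisToDays(millis: Long, tz: TimeZone): Int =
    Math.floorDiv(millis + tz.getOffset(millis), MillisPerDay).toInt

  def main(args: Array[String]): Unit = {
    // A zone with unusual historical transitions; results depend on the JDK's tzdata.
    val tz = TimeZone.getTimeZone("Pacific/Kiritimati")
    val broken = (-200000 to 0).filter(d => millisToDays(daysToMillis(d, tz), tz) != d)
    println(s"days that do not round-trip in ${tz.getID}: ${broken.take(10)} ...")
  }
}
{code}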



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24817:


Assignee: Apache Spark

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.
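
A sketch of the intended user-facing shape, based on the description above (the final 
API details may differ):

{code:scala}
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

object BarrierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("barrier-sketch").getOrCreate()
    val sc = spark.sparkContext

    val result = sc.parallelize(1 to 8, numSlices = 4)
      .barrier()                         // mark the stage as a barrier stage
      .mapPartitions { iter =>
        val ctx = BarrierTaskContext.get()
        // ... per-task setup would go here ...
        ctx.barrier()                    // global sync: blocks until all tasks arrive
        iter
      }
      .collect()

    println(result.mkString(","))
    spark.stop()
  }
}
{code}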



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560096#comment-16560096
 ] 

Apache Spark commented on SPARK-24817:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/21898

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24817:


Assignee: (was: Apache Spark)

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18381) Wrong date conversion between spark and python for dates before 1583

2018-07-27 Thread Stephen Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560068#comment-16560068
 ] 

Stephen Brennan commented on SPARK-18381:
-

Just encountered this issue myself and came here to report it. The root cause, 
as far as I can tell, is that PySpark uses Python day ordinals to translate 
between the internal and external forms ([see code in PySpark 
docs|https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/types.html#DateType]).
 Python's day ordinals are relative to day 1: January 1st, year 1, on the 
[proleptic Gregorian 
calendar|https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar], which is 
an idealized calendar that assumes that the Gregorian calendar extends 
indefinitely into the past and future.

In reality, the Gregorian calendar switch began in October of 1582 - 10 days 
were skipped - they went from October 4th directly to October 15th.

It seems that whatever internal date library Spark is using does not use this 
proleptic Gregorian calendar, and actually switches to the Julian calendar, 
which was in use before the Gregorian.  The dates are correct, but in different 
calendars! The following PySpark code demonstrates this:
{code:java}
>>> from datetime import date
>>> df = spark.createDataFrame([
{'date': date(1, 1, 1)},
{'date': date(1582, 10, 3)},
{'date': date(1582, 10, 4)},
{'date': date(1582, 10, 5)},
{'date': date(1582, 10, 6)},
{'date': date(1582, 10, 7)},
{'date': date(1582, 10, 8)},
{'date': date(1582, 10, 9)},
{'date': date(1582, 10, 10)},
{'date': date(1582, 10, 11)},
{'date': date(1582, 10, 12)},
{'date': date(1582, 10, 13)},
{'date': date(1582, 10, 14)},
{'date': date(1582, 10, 15)},
{'date': date(2016, 6, 6)},
])

>>> df.show()
+----------+
|      date|
+----------+
|0001-01-03|
|1582-09-23|
|1582-09-24|
|1582-09-25|
|1582-09-26|
|1582-09-27|
|1582-09-28|
|1582-09-29|
|1582-09-30|
|1582-10-01|
|1582-10-02|
|1582-10-03|
|1582-10-04|
|1582-10-15|
|2016-06-06|
+----------+{code}
Perhaps the correct fix is simply to include a note in the documentation that Spark 
uses the Julian calendar before October 15, 1582, in contrast to Python's idealized 
proleptic Gregorian calendar?

> Wrong date conversion between spark and python for dates before 1583
> 
>
> Key: SPARK-18381
> URL: https://issues.apache.org/jira/browse/SPARK-18381
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Luca Caniparoli
>Priority: Major
>
> Dates before 1583 (julian/gregorian calendar transition) are processed 
> incorrectly. 
> * With python udf (datetime.strptime), .show() returns wrong dates but 
> .collect() returns correct dates
> * With pyspark.sql.functions.to_date, .show() shows correct dates but 
> .collect() returns wrong dates. Additionally, collecting '0001-01-01' returns an 
> error when collecting the dataframe. 
> {code:none}
> from pyspark.sql.types import DateType
> from pyspark.sql.functions import to_date, udf
> from datetime import datetime
> strToDate =  udf (lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
> l = [('0002-01-01', 1), ('1581-01-01', 2), ('1582-01-01', 3), ('1583-01-01', 
> 4), ('1584-01-01', 5), ('2012-01-21', 6)]
> l_older = [('0001-01-01', 1)]
> test_df = spark.createDataFrame(l, ["date_string", "number"])
> test_df_older = spark.createDataFrame(l_older, ["date_string", "number"])
> test_df_strptime = test_df.withColumn( "date_cast", 
> strToDate(test_df["date_string"]))
> test_df_todate = test_df.withColumn( "date_cast", 
> to_date(test_df["date_string"]))
> test_df_older_todate = test_df_older.withColumn( "date_cast", 
> to_date(test_df_older["date_string"]))
> test_df_strptime.show()
> test_df_todate.show()
> print test_df_strptime.collect()
> print test_df_todate.collect()
> print test_df_older_todate.collect()
> {code}
> {noformat}
> +-----------+------+----------+
> |date_string|number| date_cast|
> +-----------+------+----------+
> | 0002-01-01|     1|0002-01-03|
> | 1581-01-01|     2|1580-12-22|
> | 1582-01-01|     3|1581-12-22|
> | 1583-01-01|     4|1583-01-01|
> | 1584-01-01|     5|1584-01-01|
> | 2012-01-21|     6|2012-01-21|
> +-----------+------+----------+
> +-----------+------+----------+
> |date_string|number| date_cast|
> +-----------+------+----------+
> | 0002-01-01|     1|0002-01-01|
> | 1581-01-01|     2|1581-01-01|
> | 1582-01-01|     3|1582-01-01|
> | 1583-01-01|     4|1583-01-01|
> | 1584-01-01|     5|1584-01-01|
> | 2012-01-21|     6|2012-01-21|
> +-----------+------+----------+
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(2, 1, 1)), 
> Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 1, 
> 1)), 

[jira] [Commented] (SPARK-24865) Remove AnalysisBarrier

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560052#comment-16560052
 ] 

Apache Spark commented on SPARK-24865:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21896

> Remove AnalysisBarrier
> --
>
> Key: SPARK-24865
> URL: https://issues.apache.org/jira/browse/SPARK-24865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed 
> (don't re-analyze nodes that have already been analyzed).
> Before AnalysisBarrier, we already had some infrastructure in place, with 
> analysis specific functions (resolveOperators and resolveExpressions). These 
> functions do not recursively traverse down subplans that are already analyzed 
> (with a mutable boolean flag _analyzed). The issue with the old system was 
> that developers started using transformDown, which does a top-down traversal 
> of the plan tree, because there was no top-down resolution function, and as 
> a result analyzer performance became pretty bad.
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
> special node and for this special node, transform/transformUp/transformDown 
> don't traverse down. However, the introduction of this special node caused a 
> lot more troubles than it solves. This implicit node breaks assumptions and 
> code in a few places, and it's hard to know when analysis barrier would 
> exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR 
> discussions demonstrates it is a source of bugs and additional complexity.
> Instead, I think a much simpler fix to the original issue is to introduce 
> resolveOperatorsDown, and change all places that call transformDown in the 
> analyzer to use that. We can also ban accidental uses of the various 
> transform* methods by using a linter (which can only lint specific packages), 
> or in test mode inspect the stack trace and fail explicitly if transform* are 
> called in the analyzer. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13343) speculative tasks that didn't commit shouldn't be marked as success

2018-07-27 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-13343.
---
   Resolution: Fixed
 Assignee: Hieu Tri Huynh
Fix Version/s: 2.4.0

> speculative tasks that didn't commit shouldn't be marked as success
> ---
>
> Key: SPARK-13343
> URL: https://issues.apache.org/jira/browse/SPARK-13343
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Thomas Graves
>Assignee: Hieu Tri Huynh
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Screen Shot 2018-07-08 at 3.49.52 PM.png, image.png, 
> image.png
>
>
> Currently, speculative tasks that didn't commit can show up as success 
> (depending on the timing of the commit). This is a bit confusing because such a 
> task didn't really succeed, in the sense that it didn't write anything.
> I think these tasks should be marked as KILLED, or as something that makes it more 
> obvious to the user exactly what happened. If a task happens to hit the timing 
> where it gets a commit-denied exception, it shows up as failed and counts 
> against your task failures. It shouldn't count against task failures, since 
> that failure really doesn't matter.
> MapReduce handles this situation, so perhaps we can look there for a model.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic

2018-07-27 Thread Satyajit varma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560036#comment-16560036
 ] 

Satyajit varma commented on SPARK-20597:


[~jlaskowski] Will do, I will submit a PR today.

> KafkaSourceProvider falls back on path as synonym for topic
> ---
>
> Key: SPARK-20597
> URL: https://issues.apache.org/jira/browse/SPARK-20597
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic 
> to save a DataFrame's rows to
> # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka 
> topics for writing
> What seems like a quite interesting option is to support {{start(path: String)}} 
> as the lowest-precedence option, in which {{path}} would designate the default 
> topic when no other options are used.
> {code}
> df.writeStream.format("kafka").start("topic")
> {code}
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html
>  for discussion



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21216) Streaming DataFrames fail to join with Hive tables

2018-07-27 Thread Russell Spitzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560034#comment-16560034
 ] 

Russell Spitzer commented on SPARK-21216:
-

For anyone else searching, this also fixes custom Spark Strategies added via 
spark.sql.extensions not being applied in a Structured Streaming Context.

> Streaming DataFrames fail to join with Hive tables
> --
>
> Key: SPARK-21216
> URL: https://issues.apache.org/jira/browse/SPARK-21216
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 2.3.0
>
>
> The following code will throw a cryptic exception:
> {code}
> import org.apache.spark.sql.execution.streaming.MemoryStream
> import testImplicits._
> implicit val _sqlContext = spark.sqlContext
> Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", 
> "word").createOrReplaceTempView("t1")
> // Make a table and ensure it will be broadcast.
> sql("""CREATE TABLE smallTable(word string, number int)
>   |ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>   |STORED AS TEXTFILE
> """.stripMargin)
> sql(
>   """INSERT INTO smallTable
> |SELECT word, number from t1
>   """.stripMargin)
> val inputData = MemoryStream[Int]
> val joined = inputData.toDS().toDF()
>   .join(spark.table("smallTable"), $"value" === $"number")
> val sq = joined.writeStream
>   .format("memory")
>   .queryName("t2")
>   .start()
> try {
>   inputData.addData(1, 2)
>   sq.processAllAvailable()
> } finally {
>   sq.stop()
> }
> {code}
> If someone creates a HiveSession, the planner in `IncrementalExecution` 
> doesn't take into account the Hive scan strategies



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances

2018-07-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21960:
-

Assignee: Karthik Palaniappan

> Spark Streaming Dynamic Allocation should respect spark.executor.instances
> --
>
> Key: SPARK-21960
> URL: https://issues.apache.org/jira/browse/SPARK-21960
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Karthik Palaniappan
>Assignee: Karthik Palaniappan
>Priority: Minor
> Fix For: 2.4.0
>
>
> This check enforces that spark.executor.instances (aka --num-executors) is 
> either unset or explicitly set to 0. 
> https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207
> If spark.executor.instances is unset, the check is fine, and the property 
> defaults to 2. Spark requests the cluster manager for 2 executors to start 
> with, then adds/removes executors appropriately.
> However, if you explicitly set it to 0, the check also succeeds, but Spark 
> never asks the cluster manager for any executors. When running on YARN, I 
> repeatedly saw:
> {code:java}
> 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> {code}
> I noticed that at least Google Dataproc and Ambari explicitly set 
> spark.executor.instances to a positive number, meaning that to use dynamic 
> allocation, you would have to edit spark-defaults.conf to remove the 
> property. That's obnoxious.
> In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value 
> for --num-executors or --conf spark.executor.instances: 
> https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9
> It is much more reasonable for Streaming DRA to use spark.executor.instances, 
> just like Core DRA. I'll open a pull request to remove the check if there are 
> no objections.
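
For reference, a minimal sketch of how streaming dynamic allocation is enabled today; 
the spark.streaming.dynamicAllocation.* keys below are assumptions about the streaming 
ExecutorAllocationManager configuration, not taken from the ticket text:

{code:scala}
// spark-shell / driver-setup snippet (sketch only).
import org.apache.spark.SparkConf

// The check described above means spark.executor.instances must be left unset
// (or set to 0, which then requests no executors at all and produces the
// "Initial job has not accepted any resources" warnings shown in the description).
val conf = new SparkConf()
  .setAppName("streaming-dra-sketch")
  .set("spark.streaming.dynamicAllocation.enabled", "true")   // assumed key
  .set("spark.streaming.dynamicAllocation.minExecutors", "1") // assumed key
  .set("spark.streaming.dynamicAllocation.maxExecutors", "10") // assumed key
{code}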



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances

2018-07-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21960.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19183
[https://github.com/apache/spark/pull/19183]

> Spark Streaming Dynamic Allocation should respect spark.executor.instances
> --
>
> Key: SPARK-21960
> URL: https://issues.apache.org/jira/browse/SPARK-21960
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Karthik Palaniappan
>Priority: Minor
> Fix For: 2.4.0
>
>
> This check enforces that spark.executor.instances (aka --num-executors) is 
> either unset or explicitly set to 0. 
> https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207
> If spark.executor.instances is unset, the check is fine, and the property 
> defaults to 2. Spark requests the cluster manager for 2 executors to start 
> with, then adds/removes executors appropriately.
> However, if you explicitly set it to 0, the check also succeeds, but Spark 
> never asks the cluster manager for any executors. When running on YARN, I 
> repeatedly saw:
> {code:java}
> 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> {code}
> I noticed that at least Google Dataproc and Ambari explicitly set 
> spark.executor.instances to a positive number, meaning that to use dynamic 
> allocation, you would have to edit spark-defaults.conf to remove the 
> property. That's obnoxious.
> In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value 
> for --num-executors or --conf spark.executor.instances: 
> https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9
> It is much more reasonable for Streaming DRA to use spark.executor.instances, 
> just like Core DRA. I'll open a pull request to remove the check if there are 
> no objections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17984) Add support for numa aware feature

2018-07-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17984.
---
Resolution: Won't Fix

See pull requests.

> Add support for numa aware feature
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
>Priority: Major
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA targets adding support for a NUMA-aware feature, which can help improve 
> performance by making cores access local memory rather than remote memory. 
>  A patch is being developed, see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks, which will be developed iteratively:
> Numa aware support for Yarn based deployment mode
> Numa aware support for Mesos based deployment mode
> Numa aware support for Standalone based deployment mode



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24949) pyspark.sql.Column breaks the iterable contract

2018-07-27 Thread Daniel Shields (JIRA)
Daniel Shields created SPARK-24949:
--

 Summary: pyspark.sql.Column breaks the iterable contract
 Key: SPARK-24949
 URL: https://issues.apache.org/jira/browse/SPARK-24949
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Daniel Shields


pyspark.sql.Column implements __iter__ just to raise a TypeError:
{code:java}
def __iter__(self):
raise TypeError("Column is not iterable")
{code}
This makes Column look iterable even though it isn't:
{code:java}
isinstance(mycolumn, collections.Iterable) # Evaluates to True{code}
This function should be removed from Column completely so it behaves like every 
other non-iterable class.

For further motivation of why this should be fixed, consider the below example, 
which currently requires listing Column explicitly:
{code:java}
import collections
import pyspark.sql

def listlike(value):
# Column unfortunately implements __iter__ just to raise a TypeError.
# This breaks the iterable contract and should be fixed in Spark proper.
return isinstance(value, collections.Iterable) and not isinstance(value, 
(str, bytes, pyspark.sql.Column))
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Eric Chang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559996#comment-16559996
 ] 

Eric Chang edited comment on SPARK-24895 at 7/27/18 5:00 PM:
-

[~kiszk] for maven, you may need 3.5.2 which includes this fix: 
https://issues.apache.org/jira/browse/MNG-6240.  It seems others have mentioned 
upgrading Maven alone doesn't always work, so I suspect you may need to upgrade 
spotbugs too, as [~yhuai] suggested.

I think the spotbugs error 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] mentioned some 
errors on the aether resolver.

I think maven install will give you an error that looks like the first error in 
the bug description.  Regarding verification of apache repo artifacts, you can 
check the maven-metadata.xml file by going to a link like this one (this is for 
spark-core):

[https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml]


was (Author: ericfchang):
[~kiszk] for maven, you may need 3.5.2 which includes this fix: 
https://issues.apache.org/jira/browse/MNG-6240

I think the spotbugs error 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] mentioned some 
errors on the aether resolver.

I think maven install will give you an error that looks like the first error in 
the bug description.  Regarding verification of apache repo artifacts, you can 
check the maven-metadata.xml file by going to a link like this one (this is for 
spark-core):

https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Eric Chang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559996#comment-16559996
 ] 

Eric Chang commented on SPARK-24895:


[~kiszk] for maven, you may need 3.5.2 which includes this fix: 
https://issues.apache.org/jira/browse/MNG-6240

I think the spotbugs error 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] mentioned some 
errors on the aether resolver.

I think maven install will give you an error that looks like the first error in 
the bug description.  Regarding verification of apache repo artifacts, you can 
check the maven-metadata.xml file by going to a link like this one (this is for 
spark-core):

https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/2.4.0-SNAPSHOT/maven-metadata.xml

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559987#comment-16559987
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

I see. Thank you very much. First, I will try to make a PR to upgrade Maven.

BTW, I have no idea how to make sure the Maven central repo works well for now.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559977#comment-16559977
 ] 

Yin Huai commented on SPARK-24895:
--

[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] has some info on 
it. I am wondering if it requires upgrading both the plugin and Maven. We 
probably need to set up a test Jenkins job to make sure everything works 
before checking in changes.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559974#comment-16559974
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

[~yhuai] Thank you.

BTW, how can I re-enable spotbugs without this problem? Do you have any 
suggestions? cc: [~hyukjin.kwon]

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559972#comment-16559972
 ] 

Kazuaki Ishizaki commented on SPARK-24925:
--

Do we need a new test case, or does an existing test case already cover this PR?

> input bytesRead metrics fluctuate from time to time
> ---
>
> Key: SPARK-24925
> URL: https://issues.apache.org/jira/browse/SPARK-24925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
> Attachments: bytesRead.gif
>
>
> The input bytesRead metric fluctuates from time to time; it is worse when 
> pushdown is enabled.
> Query
> {code:java}
> CREATE TABLE dev AS
> SELECT
> ...
> FROM lstg_item cold, lstg_item_vrtn v
> WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> ...
> {code}
> Issue
> See the attached bytesRead.gif: input bytesRead shows 48GB, 52GB, 51GB, 50GB, 
> 54GB, 53GB ... 
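If a regression test ends up being needed, one way to observe the metric outside the 
UI is a listener that sums bytesRead per finished task. This is only a rough sketch of 
where the metric comes from, not the test used in any PR:

{code:scala}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates inputMetrics.bytesRead across finished tasks so the total can be
// asserted on after an action completes, instead of eyeballing the UI.
class BytesReadListener extends SparkListener {
  val totalBytesRead = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    Option(taskEnd.taskMetrics).foreach { metrics =>
      totalBytesRead.addAndGet(metrics.inputMetrics.bytesRead)
    }
  }
}

// Usage sketch: spark.sparkContext.addSparkListener(new BytesReadListener)
{code}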



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24948:


Assignee: Apache Spark

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, other permissions). For instance, if ACLs 
> are enabled, they are ignored in this check; moreover, each filesystem may 
> have different policies (e.g. they can consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.
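One possible direction, shown only as a sketch and not as the fix in the linked PR, is 
to delegate the decision to the filesystem itself via Hadoop's FileSystem.access, 
which honors ACLs and per-filesystem policies instead of inspecting permission bits:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsAction
import org.apache.hadoop.security.AccessControlException

// Returns true if the current user can read the event log, letting the filesystem
// apply ACLs and its own policies rather than checking the rwx bits manually.
def canReadEventLog(fs: FileSystem, logPath: Path): Boolean = {
  try {
    fs.access(logPath, FsAction.READ)
    true
  } catch {
    case _: AccessControlException => false
  }
}
{code}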



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24881) New options - compression and compressionLevel

2018-07-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24881.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21837
[https://github.com/apache/spark/pull/21837]

> New options - compression and compressionLevel
> --
>
> Key: SPARK-24881
> URL: https://issues.apache.org/jira/browse/SPARK-24881
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently the Avro datasource takes the compression codec name from a SQL config 
> (the config key is hard coded in AvroFileFormat): 
> https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125
>  . The obvious con is that modifying the global config can impact multiple 
> writes.
> The purpose of this ticket is to add a new Avro option, "compression", the same as 
> we already have for other datasources like JSON, CSV, etc. If the new option is 
> not set by the user, we take the setting from the SQL config 
> spark.sql.avro.compression.codec. If that is not set either, the default 
> compression codec will be snappy (this is the current behavior in master).
> Besides the compression option, we need to add another option, 
> compressionLevel, which should reflect another SQL config in Avro: 
> https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122
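A usage sketch of the proposed options (the option names follow the description above; 
they are a proposal at this point and the final names/values depend on the merged PR, 
and df is assumed to be an existing DataFrame):

{code:scala}
// Write one DataFrame with deflate at a specific level; other writes in the same
// session keep using the session-wide default, since nothing global is modified.
df.write
  .format("avro")
  .option("compression", "deflate")
  .option("compressionLevel", "5")
  .save("/tmp/events_deflate")
{code}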



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24881) New options - compression and compressionLevel

2018-07-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24881:


Assignee: Maxim Gekk

> New options - compression and compressionLevel
> --
>
> Key: SPARK-24881
> URL: https://issues.apache.org/jira/browse/SPARK-24881
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently the Avro datasource takes the compression codec name from a SQL config 
> (the config key is hard coded in AvroFileFormat): 
> https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125
>  . The obvious con is that modifying the global config can impact multiple 
> writes.
> The purpose of this ticket is to add a new Avro option, "compression", the same as 
> we already have for other datasources like JSON, CSV, etc. If the new option is 
> not set by the user, we take the setting from the SQL config 
> spark.sql.avro.compression.codec. If that is not set either, the default 
> compression codec will be snappy (this is the current behavior in master).
> Besides the compression option, we need to add another option, 
> compressionLevel, which should reflect another SQL config in Avro: 
> https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559951#comment-16559951
 ] 

Apache Spark commented on SPARK-24948:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21895

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, other permissions). For instance, if ACLs 
> are enabled, they are ignored in this check; moreover, each filesystem may 
> have different policies (e.g. they can consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24948:


Assignee: (was: Apache Spark)

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, other permissions). For instance, if ACLs 
> are enabled, they are ignored in this check; moreover, each filesystem may 
> have different policies (e.g. they can consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24921) SparkStreaming steadily increasing job generation delay due to apparent URLClassLoader contention

2018-07-27 Thread Tommy S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommy S resolved SPARK-24921.
-
Resolution: Not A Bug

> SparkStreaming steadily increasing job generation delay due to apparent 
> URLClassLoader contention
> -
>
> Key: SPARK-24921
> URL: https://issues.apache.org/jira/browse/SPARK-24921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.1
>Reporter: Tommy S
>Priority: Major
>
> I'm seeing an issue where the job generation time of my Spark Streaming job 
> is steadily increasing after some time.
> Looking at the thread dumps, I see that the JobGenerator thread is BLOCKED 
> waiting on the synchronized URLClassPath.getLoader method:
> {noformat}
> "JobGenerator" #153 daemon prio=5 os_prio=0 tid=0x02dad800 nid=0x253c 
> waiting for monitor entry [0x7f4b311c2000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:469)
>     - waiting to lock <0x7f4be023f940> (a sun.misc.URLClassPath)
>     at sun.misc.URLClassPath.findResource(URLClassPath.java:214)
>     at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>     at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>     at java.lang.ClassLoader.getResource(ClassLoader.java:1096)
>     at java.lang.ClassLoader.getResource(ClassLoader.java:1091)
>     at 
> java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>     at java.lang.Class.getResourceAsStream(Class.java:2223)
>     at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:40)
>     at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:84)
>     at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:224)
>     at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
>     at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:89)
>     at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:77)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
>     at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$1.apply(PairRDDFunctions.scala:119)
>     at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$1.apply(PairRDDFunctions.scala:119)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKey(PairRDDFunctions.scala:117)
>     at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:42)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>     at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
>     at scala.Option.orElse(Option.scala:289)
>     at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
>     at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:36)
>     at 
> 

[jira] [Commented] (SPARK-24921) SparkStreaming steadily increasing job generation delay due to apparent URLClassLoader contention

2018-07-27 Thread Tommy S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559943#comment-16559943
 ] 

Tommy S commented on SPARK-24921:
-

Fair point. I'll close this ticket and reopen if I confirm that it is indeed a 
bug. Thanks for the reply [~hyukjin.kwon]

> SparkStreaming steadily increasing job generation delay due to apparent 
> URLClassLoader contention
> -
>
> Key: SPARK-24921
> URL: https://issues.apache.org/jira/browse/SPARK-24921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.1
>Reporter: Tommy S
>Priority: Major
>
> I'm seeing an issue where the job generation time of my Spark Streaming job 
> is steadily increasing after some time.
> Looking at the thread dumps, I see that the JobGenerator thread is BLOCKED 
> waiting on the synchronized URLClassPath.getLoader method:
> {noformat}
> "JobGenerator" #153 daemon prio=5 os_prio=0 tid=0x02dad800 nid=0x253c 
> waiting for monitor entry [0x7f4b311c2000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:469)
>     - waiting to lock <0x7f4be023f940> (a sun.misc.URLClassPath)
>     at sun.misc.URLClassPath.findResource(URLClassPath.java:214)
>     at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>     at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>     at java.lang.ClassLoader.getResource(ClassLoader.java:1096)
>     at java.lang.ClassLoader.getResource(ClassLoader.java:1091)
>     at 
> java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>     at java.lang.Class.getResourceAsStream(Class.java:2223)
>     at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:40)
>     at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:84)
>     at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:224)
>     at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
>     at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:89)
>     at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:77)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
>     at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$1.apply(PairRDDFunctions.scala:119)
>     at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$1.apply(PairRDDFunctions.scala:119)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at 
> org.apache.spark.rdd.PairRDDFunctions.combineByKey(PairRDDFunctions.scala:117)
>     at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:42)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>     at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
>     at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
>     at scala.Option.orElse(Option.scala:289)
>     at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
>     at 
> 

[jira] [Created] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-07-27 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24948:
---

 Summary: SHS filters wrongly some applications due to permission 
check
 Key: SPARK-24948
 URL: https://issues.apache.org/jira/browse/SPARK-24948
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1
Reporter: Marco Gaido


SHS filters out the event logs it doesn't have permission to read. Unfortunately, 
this check is quite naive, as it takes into account only the base permissions 
(i.e. user, group, other permissions). For instance, if ACLs are enabled, they 
are ignored in this check; moreover, each filesystem may have different 
policies (e.g. they can consider spark a superuser who can access everything).

This results in some applications not being displayed in the SHS, even though the 
Spark user (or whatever user the SHS is started with) can actually read their 
event logs.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24927:

Fix Version/s: 2.4.0
   2.2.3

> The hadoop-provided profile doesn't play well with Snappy-compressed Parquet 
> files
> --
>
> Key: SPARK-24927
> URL: https://issues.apache.org/jira/browse/SPARK-24927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> Reproduction:
> {noformat}
> wget 
> https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
> wget 
> https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xzf spark-2.3.1-bin-without-hadoop.tgz
> tar xzf hadoop-2.7.3.tar.gz
> export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
> ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
> ...
> scala> 
> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
> {noformat}
> Exception:
> {noformat}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>   ... 69 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>   at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>   at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
>   at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
>   at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
>   at 
> 
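A quick way to tell whether the failure is simply a missing or stale snappy-java on 
the hadoop-provided classpath is to force the native library to load outside Spark. 
This is only a probe sketch, assuming the org.xerial.snappy classes resolve at compile 
time:

{code:scala}
import org.xerial.snappy.Snappy

// Forces snappy-java to load its native library; the same UnsatisfiedLinkError as in
// the stack trace above will surface here if the classpath only provides an old or
// incomplete snappy.
object SnappyProbe {
  def main(args: Array[String]): Unit = {
    println(s"snappy-java native version: ${Snappy.getNativeLibraryVersion}")
    val compressed = Snappy.compress("hello snappy".getBytes("UTF-8"))
    println(s"round trip: ${new String(Snappy.uncompress(compressed), "UTF-8")}")
  }
}
{code}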

[jira] [Resolved] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24927.
-
   Resolution: Fixed
Fix Version/s: 2.3.2

> The hadoop-provided profile doesn't play well with Snappy-compressed Parquet 
> files
> --
>
> Key: SPARK-24927
> URL: https://issues.apache.org/jira/browse/SPARK-24927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
> Fix For: 2.3.2
>
>
> Reproduction:
> {noformat}
> wget 
> https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
> wget 
> https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xzf spark-2.3.1-bin-without-hadoop.tgz
> tar xzf hadoop-2.7.3.tar.gz
> export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
> ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
> ...
> scala> 
> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
> {noformat}
> Exception:
> {noformat}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>   ... 69 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>   at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>   at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
>   at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
>   at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
>   at 
> 

[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

2018-07-27 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559918#comment-16559918
 ] 

Wenchen Fan commented on SPARK-24882:
-

Hi [~rdblue], I like your naming changes and will update them in the doc. I 
also like your idea of merging `DataSourceReader` and `ReadSupport` to reduce the 
number of interfaces. I think we can also apply it to the write API.

About the builder pattern, I agree it's good to make the API immutable, but I'd 
say it's hard to do so. I've spent a lot of time thinking about the builder 
pattern and have had no luck.

One problem is that Spark needs feedback from the data source when an operator is 
pushed. For example, when Spark pushes a Filter to a data source, Spark needs to 
know whether all the filters were pushed, so that it can keep pushing the next 
operator. Spark can't blindly push all operators to the data source one by one; it 
needs to ask the data source whether it can accept the next operator before 
pushing it. And that is no longer a builder pattern.
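To make the feedback point concrete, here is an illustrative pair of shapes (not the 
actual DSv2 interfaces, and with plain strings standing in for the real Filter type):

{code:scala}
// Builder style: Spark hands everything over and gets no feedback until build(),
// so it cannot know which filters still need to be evaluated after the scan.
trait BuilderStyleScan {
  def withFilters(filters: Seq[String]): BuilderStyleScan
  def build(): AnyRef
}

// Push-with-feedback style: the source returns the filters it could NOT handle,
// so Spark knows what to evaluate itself and whether to keep pushing operators.
trait PushStyleScan {
  def pushFilters(filters: Seq[String]): Seq[String] // returns post-scan filters
}
{code}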

> separate responsibilities of the data source v2 read API
> 
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 is out for a while, see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. Details please see the attached google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24940) Coalesce Hint for SQL Queries

2018-07-27 Thread John Zhuge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559909#comment-16559909
 ] 

John Zhuge commented on SPARK-24940:


Thx [~hyukjin.kwon]

> Coalesce Hint for SQL Queries
> -
>
> Key: SPARK-24940
> URL: https://issues.apache.org/jira/browse/SPARK-24940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: John Zhuge
>Priority: Major
>
> Many Spark SQL users in my company have asked for a way to control the number 
> of output files in Spark SQL. The users prefer not to use the functions 
> repartition\(n\) or coalesce(n, shuffle), which require them to write and 
> deploy Scala/Java/Python code.
>   
>  There are use cases for either reducing or increasing the number.
>   
>  The DataFrame API has had repartition/coalesce for a long time. However, we do 
> not have equivalent functionality in SQL queries. We propose adding the 
> following Hive-style Coalesce hint to Spark SQL.
> {noformat}
> /*+ COALESCE(n, shuffle) */
> /*+ REPARTITION(n) */
> {noformat}
> REPARTITION\(n\) is equal to COALESCE(n, shuffle=true).
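A usage sketch of the proposed hints from SQL (the syntax follows this proposal; the 
final accepted form may differ, and spark is assumed to be an existing SparkSession 
with a registered table src):

{code:scala}
// Reduce the number of output files without a shuffle.
val coalesced = spark.sql("SELECT /*+ COALESCE(10) */ * FROM src")

// Increase parallelism (and output files) with a shuffle.
val repartitioned = spark.sql("SELECT /*+ REPARTITION(200) */ * FROM src")

coalesced.write.parquet("/tmp/out_coalesced")
{code}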



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24947) aggregateAsync and foldAsync for RDD

2018-07-27 Thread Cody Allen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Allen updated SPARK-24947:
---
Description: 
{{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, 
{{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing 
datasets asynchronously. If I want to aggregate some statistics on a large 
dataset and it's going to take an hour, I shouldn't need to completely block a 
thread for the hour to wait for the result.

 

I propose the following methods be added to {{AsyncRDDActions}}:

 

{{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): 
FutureAction[U]}}

{{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}}

 

Locally I have a version of {{aggregateAsync}} implemented based on 
{{submitJob}} (similar to how {{countAsync}} is implemented), and a 
{{foldAsync}} implementation that simply delegates through to 
{{aggregateAsync}}. I haven't yet written unit tests for these, but I can do so 
if this is a contribution that would be accepted. Please let me know.

  was:
{{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, 
{{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing 
datasets asynchronously. If I want to aggregate some statistics on a large 
dataset and it's going to take an hour, I shouldn't need to completely block a 
thread for the hour to wait for the result.

 

I propose the following methods be added to {{AsyncRDDActions}}:

 

{{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): 
FutureAction[U]}}

{{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}}

 

Locally I have a version of {{aggregateAsync}} implemented based on 
{{submitJob}} (similar to how {{countAsync}} is implemented), and a 
{{foldAsync}} implementation that simply delegates through to 
{{aggregateAsync}}. I haven't yet written unit tests for these, but I can do so 
if this is a contribution that would be accepted. Please let me know.


> aggregateAsync and foldAsync for RDD
> 
>
> Key: SPARK-24947
> URL: https://issues.apache.org/jira/browse/SPARK-24947
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Cody Allen
>Priority: Minor
>
> {{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, 
> {{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing 
> datasets asynchronously. If I want to aggregate some statistics on a large 
> dataset and it's going to take an hour, I shouldn't need to completely block 
> a thread for the hour to wait for the result.
>  
> I propose the following methods be added to {{AsyncRDDActions}}:
>  
> {{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => 
> U): FutureAction[U]}}
> {{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}}
>  
> Locally I have a version of {{aggregateAsync}} implemented based on 
> {{submitJob}} (similar to how {{countAsync}} is implemented), and a 
> {{foldAsync}} implementation that simply delegates through to 
> {{aggregateAsync}}. I haven't yet written unit tests for these, but I can do 
> so if this is a contribution that would be accepted. Please let me know.
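For reference, a rough sketch of how such an aggregateAsync could be layered on 
SparkContext.submitJob, similar to how countAsync is built. This is only an 
illustration of the shape described above, not the proposed patch, and it assumes 
combOp is associative and commutative since partition results arrive in completion 
order:

{code:scala}
import org.apache.spark.FutureAction
import org.apache.spark.rdd.RDD

def aggregateAsync[T, U](rdd: RDD[T], zeroValue: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): FutureAction[U] = {
  // Partition results are merged on the driver as partitions finish.
  var acc = zeroValue
  rdd.context.submitJob(
    rdd,
    (iter: Iterator[T]) => iter.foldLeft(zeroValue)(seqOp), // per-partition seqOp
    rdd.partitions.indices,                                 // run on every partition
    (_: Int, partResult: U) => acc = combOp(acc, partResult),
    acc                                                     // produced when the job completes
  )
}
{code}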



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24947) aggregateAsync and foldAsync for RDD

2018-07-27 Thread Cody Allen (JIRA)
Cody Allen created SPARK-24947:
--

 Summary: aggregateAsync and foldAsync for RDD
 Key: SPARK-24947
 URL: https://issues.apache.org/jira/browse/SPARK-24947
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Cody Allen


{{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, 
{{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing 
datasets asynchronously. If I want to aggregate some statistics on a large 
dataset and it's going to take an hour, I shouldn't need to completely block a 
thread for the hour to wait for the result.

 

I propose the following methods be added to {{AsyncRDDActions}}:

 

{{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): 
FutureAction[U]}}

{{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}}

 

Locally I have a version of {{aggregateAsync}} implemented based on 
{{submitJob}} (similar to how {{countAsync}} is implemented), and a 
{{foldAsync}} implementation that simply delegates through to 
{{aggregateAsync}}. I haven't yet written unit tests for these, but I can do so 
if this is a contribution that would be accepted. Please let me know.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2018-07-27 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559828#comment-16559828
 ] 

Wenchen Fan commented on SPARK-24935:
-

It seems like a Hive UDAF can reject partial aggregation?

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> A user of the sketches library (https://github.com/DataSketches/sketches-hive) 
> reported an issue with the HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark's Hive UDAF 
> support provides partial aggregation and has removed the functionality 
> that supported complete-mode aggregation (refer to 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of the 
> expected update method being called, the merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)], 
> which throws the exception described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage

2018-07-27 Thread Jiang Xingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-24942:
-
Target Version/s: 3.0.0

> Improve cluster resource management with jobs containing barrier stage
> --
>
> Key: SPARK-24942
> URL: https://issues.apache.org/jira/browse/SPARK-24942
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r205652317
> We shall improve cluster resource management to address the following issues:
> - With dynamic resource allocation enabled, it may happen that we acquire 
> some executors (but not enough to launch all the tasks in a barrier stage) 
> and later release them due to executor idle time expire, and then acquire 
> again.
> - There can be deadlock with two concurrent applications. Each application 
> may acquire some resources, but not enough to launch all the tasks in a 
> barrier stage. And after hitting the idle timeout and releasing them, they 
> may acquire resources again, but just continually trade resources between 
> each other.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24941) Add RDDBarrier.coalesce() function

2018-07-27 Thread Jiang Xingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-24941:
-
Target Version/s: 3.0.0

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, eg. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the HDFS input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage, maybe something like RDDBarrier.coalesce(numPartitions: Int).
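A sketch of how the proposed method would be used. RDDBarrier.coalesce does not exist 
at the time of this discussion, so the proposed call is shown commented out; sc is 
assumed to be an active SparkContext on a build that already has RDD.barrier():

{code:scala}
val rdd = sc.textFile("hdfs:///data/large_input")   // partition count follows HDFS splits
val barrierRdd = rdd.barrier()                      // org.apache.spark.rdd.RDDBarrier

// Proposed: cap the number of barrier tasks regardless of the number of input splits.
// val result = barrierRdd.coalesce(32).mapPartitions(iter => iter).collect()

// Without the proposal, the workaround is to coalesce before entering the barrier stage:
val result = rdd.coalesce(32).barrier().mapPartitions(iter => iter).collect()
{code}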



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile

2018-07-27 Thread Paul Westenthanner (JIRA)
Paul Westenthanner created SPARK-24946:
--

 Summary: PySpark - Allow np.Arrays and pd.Series in 
df.approxQuantile
 Key: SPARK-24946
 URL: https://issues.apache.org/jira/browse/SPARK-24946
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Paul Westenthanner


As a Python user, it is convenient to pass a numpy array or pandas Series to 
`{{approxQuantile}}(_col_, _probabilities_, _relativeError_)` for the 
probabilities parameter. 

 

Especially for creating cumulative plots (say in 1% steps) it is handy to use 
`approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23928) High-order function: shuffle(x) → array

2018-07-27 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23928.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21802
https://github.com/apache/spark/pull/21802

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: H Lu
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.
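A quick usage sketch of the function once this lands in 2.4.0, assuming a 
SparkSession named spark:

{code:scala}
// shuffle() returns a random permutation of the input array, so the ordering of the
// result differs from run to run.
spark.sql("SELECT shuffle(array(1, 20, 3, 5)) AS shuffled").show(false)

// Also available through the DataFrame API:
import org.apache.spark.sql.functions.{array, lit, shuffle}
spark.range(1).select(shuffle(array(lit(1), lit(20), lit(3), lit(5)))).show(false)
{code}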



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23928) High-order function: shuffle(x) → array

2018-07-27 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23928:
-

Assignee: H Lu

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: H Lu
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24944) SparkUi build problem

2018-07-27 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559790#comment-16559790
 ] 

Marco Gaido commented on SPARK-24944:
-

This seems to be more of a problem with your project and its dependencies than an 
issue in Spark. Rather than a JIRA, this should have been a question sent to the 
mailing list.

> SparkUi build problem
> -
>
> Key: SPARK-24944
> URL: https://issues.apache.org/jira/browse/SPARK-24944
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0, 2.3.1
> Environment: scala 2.11.8
> java version "1.8.0_181" 
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13) 
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> 
> Gradle 4.5.1
> 
> Build time: 2018-02-05 13:22:49 UTC
> Revision: 37007e1c012001ff09973e0bd095139239ecd3b3
> Groovy: 2.4.12
> Ant: Apache Ant(TM) version 1.9.9 compiled on February 2 2017
> JVM: 1.8.0_181 (Oracle Corporation 25.181-b13)
> OS: Windows 7 6.1 amd64
>  
> build.gradle:
> group 'it.build-test.spark'
> version '1.0-SNAPSHOT'
> apply plugin: 'java'
> apply plugin: 'scala'
> sourceCompatibility = 1.8
> repositories {
>  mavenCentral()
> }
> dependencies {
>  compile 'org.apache.spark:spark-core_2.11:2.3.1'
>  compile 'org.scala-lang:scala-library:2.11.8'
> }
> tasks.withType(ScalaCompile) {
>  scalaCompileOptions.additionalParameters = ["-Ylog-classpath"]
> }
>Reporter: Fabio
>Priority: Major
>  Labels: UI, WebUI, build
> Attachments: build-test.zip
>
>
> Hi. I'm trying to customize SparkUi with my business logic. Trying to access 
> the ui, I have a build problem. It is enough to create this class:
> _package org.apache.spark_
> _import org.apache.spark.ui.SparkUI_
> _case class SparkContextUtils(sc: SparkContext) {_
>  _def ui: Option[SparkUI] = sc.ui_
> _}_
>  
> to get this error:
>  
> _missing or invalid dependency detected while loading class file 
> 'WebUI.class'._
> _Could not access term eclipse in package org,_
> _because it (or its dependencies) are missing. Check your build definition 
> for_
> _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)_
> _A full rebuild may help if 'WebUI.class' was compiled against an 
> incompatible version of org._
> _missing or invalid dependency detected while loading class file 
> 'WebUI.class'._
> _Could not access term jetty in value org.eclipse,_
> _because it (or its dependencies) are missing. Check your build definition 
> for_
> _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)_
> _A full rebuild may help if 'WebUI.class' was compiled against an 
> incompatible version of org.eclipse._
> _two errors found_
> _:compileScala FAILED_
> _FAILURE: Build failed with an exception._
> _* What went wrong:_
> _Execution failed for task ':compileScala'._
> _> Compilation failed_
> _* Try:_
> _Run with --stacktrace option to get the stack trace. Run with --info or 
> --debug option to get more log output. Run with --scan to get full insights._
> _* Get more help at https://help.gradle.org_
> _BUILD FAILED in 26s_
> _1 actionable task: 1 executed_
> _Compilation failed_
>  
> The option "-Ylog-classpath" doesn't provide any useful information
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24945) Switch to uniVocity 2.7.2

2018-07-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559783#comment-16559783
 ] 

Apache Spark commented on SPARK-24945:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21892

> Switch to uniVocity 2.7.2
> -
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The recent version 2.7.2 of the uniVocity parser includes the fix for 
> https://github.com/uniVocity/univocity-parsers/issues/250 , so we no longer need 
> the fix https://github.com/apache/spark/pull/21631



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24945) Switch to uniVocity 2.7.2

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24945:


Assignee: Apache Spark

> Switch to uniVocity 2.7.2
> -
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The recent version 2.7.2 of the uniVocity parser includes the fix for 
> https://github.com/uniVocity/univocity-parsers/issues/250 , so we no longer need 
> the fix https://github.com/apache/spark/pull/21631



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24945) Switch to uniVocity 2.7.2

2018-07-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24945:


Assignee: (was: Apache Spark)

> Switch to uniVocity 2.7.2
> -
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The recent version 2.7.2 of the uniVocity parser includes the fix for 
> https://github.com/uniVocity/univocity-parsers/issues/250 , so we no longer need 
> the fix https://github.com/apache/spark/pull/21631



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24945) Switch to uniVocity 2.7.2

2018-07-27 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24945:
---
Summary: Switch to uniVocity 2.7.2  (was: Switch to unoVocity 2.7.2)

> Switch to uniVocity 2.7.2
> -
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The recent version 2.7.2 of the uniVocity parser includes the fix for 
> https://github.com/uniVocity/univocity-parsers/issues/250 , so we no longer need 
> the fix https://github.com/apache/spark/pull/21631



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24945) Switch to unoVocity 2.7.2

2018-07-27 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24945:
--

 Summary: Switch to unoVocity 2.7.2
 Key: SPARK-24945
 URL: https://issues.apache.org/jira/browse/SPARK-24945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


The recent version 2.7.2 of the uniVocity parser includes the fix for 
https://github.com/uniVocity/univocity-parsers/issues/250 , so we no longer need 
the fix https://github.com/apache/spark/pull/21631



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24944) SparkUi build problem

2018-07-27 Thread Fabio (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabio updated SPARK-24944:
--
Attachment: build-test.zip

> SparkUi build problem
> -
>
> Key: SPARK-24944
> URL: https://issues.apache.org/jira/browse/SPARK-24944
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0, 2.3.1
> Environment: scala 2.11.8
> java version "1.8.0_181" 
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13) 
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> 
> Gradle 4.5.1
> 
> Build time: 2018-02-05 13:22:49 UTC
> Revision: 37007e1c012001ff09973e0bd095139239ecd3b3
> Groovy: 2.4.12
> Ant: Apache Ant(TM) version 1.9.9 compiled on February 2 2017
> JVM: 1.8.0_181 (Oracle Corporation 25.181-b13)
> OS: Windows 7 6.1 amd64
>  
> build.gradle:
> group 'it.build-test.spark'
> version '1.0-SNAPSHOT'
> apply plugin: 'java'
> apply plugin: 'scala'
> sourceCompatibility = 1.8
> repositories {
>  mavenCentral()
> }
> dependencies {
>  compile 'org.apache.spark:spark-core_2.11:2.3.1'
>  compile 'org.scala-lang:scala-library:2.11.8'
> }
> tasks.withType(ScalaCompile) {
>  scalaCompileOptions.additionalParameters = ["-Ylog-classpath"]
> }
>Reporter: Fabio
>Priority: Major
>  Labels: UI, WebUI, build
> Attachments: build-test.zip
>
>
> Hi. I'm trying to customize SparkUi with my business logic. Trying to access 
> the ui, I have a build problem. It is enough to create this class:
> _package org.apache.spark_
> _import org.apache.spark.ui.SparkUI_
> _case class SparkContextUtils(sc: SparkContext) {_
>  _def ui: Option[SparkUI] = sc.ui_
> _}_
>  
> to get this error:
>  
> _missing or invalid dependency detected while loading class file 
> 'WebUI.class'._
> _Could not access term eclipse in package org,_
> _because it (or its dependencies) are missing. Check your build definition 
> for_
> _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)_
> _A full rebuild may help if 'WebUI.class' was compiled against an 
> incompatible version of org._
> _missing or invalid dependency detected while loading class file 
> 'WebUI.class'._
> _Could not access term jetty in value org.eclipse,_
> _because it (or its dependencies) are missing. Check your build definition 
> for_
> _missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)_
> _A full rebuild may help if 'WebUI.class' was compiled against an 
> incompatible version of org.eclipse._
> _two errors found_
> _:compileScala FAILED_
> _FAILURE: Build failed with an exception._
> _* What went wrong:_
> _Execution failed for task ':compileScala'._
> _> Compilation failed_
> _* Try:_
> _Run with --stacktrace option to get the stack trace. Run with --info or 
> --debug option to get more log output. Run with --scan to get full insights._
> _* Get more help at https://help.gradle.org_
> _BUILD FAILED in 26s_
> _1 actionable task: 1 executed_
> _Compilation failed_
>  
> The option "-Ylog-classpath" doesn't provide any useful information
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24944) SparkUi build problem

2018-07-27 Thread Fabio (JIRA)
Fabio created SPARK-24944:
-

 Summary: SparkUi build problem
 Key: SPARK-24944
 URL: https://issues.apache.org/jira/browse/SPARK-24944
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.1, 2.3.0
 Environment: scala 2.11.8


java version "1.8.0_181" 
Java(TM) SE Runtime Environment (build 1.8.0_181-b13) 
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)


Gradle 4.5.1


Build time: 2018-02-05 13:22:49 UTC
Revision: 37007e1c012001ff09973e0bd095139239ecd3b3

Groovy: 2.4.12
Ant: Apache Ant(TM) version 1.9.9 compiled on February 2 2017
JVM: 1.8.0_181 (Oracle Corporation 25.181-b13)
OS: Windows 7 6.1 amd64

 

build.gradle:

group 'it.build-test.spark'
version '1.0-SNAPSHOT'

apply plugin: 'java'
apply plugin: 'scala'

sourceCompatibility = 1.8

repositories {
 mavenCentral()
}

dependencies {
 compile 'org.apache.spark:spark-core_2.11:2.3.1'
 compile 'org.scala-lang:scala-library:2.11.8'
}

tasks.withType(ScalaCompile) {
 scalaCompileOptions.additionalParameters = ["-Ylog-classpath"]
}
Reporter: Fabio


Hi. I'm trying to customize SparkUi with my business logic. Trying to access the 
ui, I have a build problem. It is enough to create this class:



_package org.apache.spark_

_import org.apache.spark.ui.SparkUI_

_case class SparkContextUtils(sc: SparkContext) {_
 _def ui: Option[SparkUI] = sc.ui_
_}_

 

to get this error:

 

_missing or invalid dependency detected while loading class file 'WebUI.class'._
_Could not access term eclipse in package org,_
_because it (or its dependencies) are missing. Check your build definition for_
_missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the 
problematic classpath.)_
_A full rebuild may help if 'WebUI.class' was compiled against an incompatible 
version of org._
_missing or invalid dependency detected while loading class file 'WebUI.class'._
_Could not access term jetty in value org.eclipse,_
_because it (or its dependencies) are missing. Check your build definition for_
_missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the 
problematic classpath.)_
_A full rebuild may help if 'WebUI.class' was compiled against an incompatible 
version of org.eclipse._
_two errors found_
_:compileScala FAILED_

_FAILURE: Build failed with an exception._

_* What went wrong:_
_Execution failed for task ':compileScala'._
_> Compilation failed_

_* Try:_
_Run with --stacktrace option to get the stack trace. Run with --info or 
--debug option to get more log output. Run with --scan to get full insights._

_* Get more help at https://help.gradle.org_

_BUILD FAILED in 26s_
_1 actionable task: 1 executed_
_Compilation failed_

 

The option "-Ylog-classpath" doesn't provide any useful information

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A

2018-07-27 Thread Dmitry Bugaychenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559560#comment-16559560
 ] 

Dmitry Bugaychenko commented on SPARK-24492:


We have seen the same problem with Spark 2.3.1 on YARN with the shuffle service enabled:
 * 15K maps completed, reduces started
 * One node goes down and a few reduces fail with fetch failures
 * The failed reduce tasks are retried
 * In the end the driver refuses the commit from those retried tasks (see the sketch below)
 * The whole job eventually fails
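
For context, a simplified sketch of the output-commit-coordination idea behind TaskCommitDenied (illustrative only, not Spark's actual OutputCommitCoordinator code): the driver authorizes at most one task attempt per partition to commit, and any other attempt that asks is denied until the authorized attempt is reported as failed. If that bookkeeping is not reset correctly across a stage retry, the retried attempts keep getting denied, which would match the behaviour described above. All names below are hypothetical.

{code}
// Simplified, hypothetical model of output-commit coordination (illustrative only).
import scala.collection.mutable

object CommitCoordinatorSketch {
  // partition -> attempt number currently allowed to commit
  private val authorized = mutable.Map.empty[Int, Int]

  /** Called by a task before committing its output. */
  def canCommit(partition: Int, attemptNumber: Int): Boolean = synchronized {
    authorized.get(partition) match {
      case None =>
        authorized(partition) = attemptNumber   // first asker wins
        true
      case Some(winner) =>
        winner == attemptNumber                 // every other attempt is denied
    }
  }

  /** Should be called when the authorized attempt fails, so a retry can commit. */
  def attemptFailed(partition: Int, attemptNumber: Int): Unit = synchronized {
    if (authorized.get(partition).contains(attemptNumber)) authorized -= partition
  }
}

object Demo extends App {
  println(CommitCoordinatorSketch.canCommit(partition = 0, attemptNumber = 0)) // true
  println(CommitCoordinatorSketch.canCommit(partition = 0, attemptNumber = 1)) // false -> denied
  // If the failure of attempt 0 is never propagated (for example across a stage retry),
  // attempt 1 keeps getting denied and keeps being retried.
  CommitCoordinatorSketch.attemptFailed(partition = 0, attemptNumber = 0)
  println(CommitCoordinatorSketch.canCommit(partition = 0, attemptNumber = 1)) // true
}
{code}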

> Endless attempted task when TaskCommitDenied exception writing to S3A
> -
>
> Key: SPARK-24492
> URL: https://issues.apache.org/jira/browse/SPARK-24492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yu-Jhe Li
>Priority: Critical
> Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 
> 2018-05-16 上午11.10.57.png
>
>
> Hi, when we run a Spark application under spark-2.2.0 on AWS spot instances and 
> write output files to S3, some tasks retry endlessly and all of them fail with a 
> TaskCommitDenied exception. This happens when we run the Spark application on 
> instances with network issues. (It runs well on healthy spot instances.)
> Sorry, I can't find an easy way to reproduce this issue; here is all I can 
> provide.
> The Spark UI (see attachments) shows that one task of stage 112 failed due to a 
> FetchFailedException (it is a network issue) and a new attempt of stage 
> 112 (retry 1) was started. But in stage 112 (retry 1), all tasks failed due to the 
> TaskCommitDenied exception and keep retrying (they never succeed and cause lots 
> of S3 requests).
> On the other side, the driver logs show:
>  # task 123.0 in stage 112.0 failed due to a FetchFailedException (the network 
> issue caused a corrupted file)
>  # a warning message from the OutputCommitCoordinator
>  # task 92.0 in stage 112.1 failed when writing rows
>  # the failed tasks keep being retried, but never succeed
> {noformat}
> 2018-05-16 02:38:055 WARN  TaskSetManager:66 - Lost task 123.0 in stage 112.0 
> (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, 
> 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message=
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>
