[jira] [Updated] (SPARK-24896) Uuid expression should produce different values in each execution under streaming query

2018-07-23 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-24896:

Component/s: Structured Streaming

> Uuid expression should produce different values in each execution under 
> streaming query
> ---
>
> Key: SPARK-24896
> URL: https://issues.apache.org/jira/browse/SPARK-24896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Uuid's results depend on the random seed given during analysis. Thus, under a 
> streaming query, we will get the same UUIDs in each execution.






[jira] [Assigned] (SPARK-24896) Uuid expression should produce different values in each execution under streaming query

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24896:


Assignee: Apache Spark

> Uuid expression should produce different values in each execution under 
> streaming query
> ---
>
> Key: SPARK-24896
> URL: https://issues.apache.org/jira/browse/SPARK-24896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Uuid's results depend on the random seed given during analysis. Thus, under a 
> streaming query, we will get the same UUIDs in each execution.






[jira] [Assigned] (SPARK-24896) Uuid expression should produce different values in each execution under streaming query

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24896:


Assignee: (was: Apache Spark)

> Uuid expression should produce different values in each execution under 
> streaming query
> ---
>
> Key: SPARK-24896
> URL: https://issues.apache.org/jira/browse/SPARK-24896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Uuid's results depend on the random seed given during analysis. Thus, under a 
> streaming query, we will get the same UUIDs in each execution.






[jira] [Commented] (SPARK-24896) Uuid expression should produce different values in each execution under streaming query

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553803#comment-16553803
 ] 

Apache Spark commented on SPARK-24896:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21854

> Uuid expression should produce different values in each execution under 
> streaming query
> ---
>
> Key: SPARK-24896
> URL: https://issues.apache.org/jira/browse/SPARK-24896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Uuid's results depend on the random seed given during analysis. Thus, under a 
> streaming query, we will get the same UUIDs in each execution.






[jira] [Created] (SPARK-24896) Uuid expression should produce different values in each execution under streaming query

2018-07-23 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-24896:
---

 Summary: Uuid expression should produce different values in each 
execution under streaming query
 Key: SPARK-24896
 URL: https://issues.apache.org/jira/browse/SPARK-24896
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


Uuid's results depend on the random seed given during analysis. Thus, under a 
streaming query, we will get the same UUIDs in each execution.
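
For illustration, a minimal sketch of how this shows up, assuming a Spark 2.3/2.4 shell with the built-in rate source (the column name uuid_col is made up here):

{code:scala}
// Hypothetical illustration, not taken from the issue: uuid() gets its random seed
// once, when the plan is analyzed, so every micro-batch re-evaluates the expression
// with the same seed and emits the same sequence of UUIDs.
import org.apache.spark.sql.functions.expr

val stream = spark.readStream
  .format("rate")                         // built-in test source emitting (timestamp, value)
  .load()
  .withColumn("uuid_col", expr("uuid()"))

val query = stream.writeStream
  .format("console")
  .start()
// Symptom described above: rows in different micro-batches show identical uuid_col values.
{code}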






[jira] [Resolved] (SPARK-24857) Required sample code to test the Spark Streaming job in Kubernetes and write the data to a remote HDFS file system

2018-07-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24857.
--
Resolution: Cannot Reproduce

Please reopen if you happen to have a fix or can provide reproducible steps. From 
reading the description in the JIRA, it looks unclear to me.

> required the sample code test the spark steaming job in kubernates and write 
> the data in remote hdfs file system
> 
>
> Key: SPARK-24857
> URL: https://issues.apache.org/jira/browse/SPARK-24857
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.3.1
>Reporter: kumpatla murali krishna
>Priority: Major
>
> ./bin/spark-submit --master k8s://https://api.kubernates.aws.phenom.local 
> --deploy-mode cluster --name spark-pi --class  
> com.phenom.analytics.executor.SummarizationJobExecutor --conf 
> spark.executor.instances=5 --conf 
> spark.kubernetes.container.image=phenommurali/spark_new  --jars  
> hdfs://test-dev.com:8020/user/spark/jobs/Test_jar_without_jars.jar
> error 
> Normal SuccessfulMountVolume 2m kubelet, ip-x.ec2.internal 
> MountVolume.SetUp succeeded for volume "download-files-volume" Warning 
> FailedMount 2m kubelet, ip-.ec2.internal MountVolume.SetUp failed for 
> volume "spark-init-properties" : configmaps 
> "spark-pi-b5be4308783c3c479c6bf2f9da9b49dc-init-config" not found






[jira] [Updated] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames

2018-07-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24895:
-
Description: 
Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
repo have mismatched filenames:

{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
{code}
 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}

<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20180723.232411</timestamp>
      <buildNumber>177</buildNumber>
    </snapshot>
    <lastUpdated>20180723232411</lastUpdated>
    <snapshotVersions>
      <snapshotVersion>
        <extension>jar</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <extension>pom</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>tests</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>test-sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
{code}
 
 This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.

  was:
Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
repo have mismatched filenames:

[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]

 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}

<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20180723.232411</timestamp>
      <buildNumber>177</buildNumber>
    </snapshot>
    <lastUpdated>20180723232411</lastUpdated>
    <snapshotVersions>
      <snapshotVersion>
        <extension>jar</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <extension>pom</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>tests</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>test-sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
{code}
 
 This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.


> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  

[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-23 Thread James (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553768#comment-16553768
 ] 

James commented on SPARK-21097:
---

Hi [~bradkaiser], what's the progress on this project or problem now? I'd like to 
join this project.

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.
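
A rough sketch of how the opt-in could look from the user's side, assuming a notebook session; the spark.dynamicAllocation.preserveCachedData key is invented here purely for illustration, since the proposal does not name the config:

{code:scala}
// Hypothetical sketch: "spark.dynamicAllocation.preserveCachedData" is a made-up name
// for the proposed opt-in switch; the dynamic-allocation settings around it are real.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("notebook-session")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .config("spark.dynamicAllocation.preserveCachedData", "true") // hypothetical flag
  .getOrCreate()
// With the proposed behavior, an idle executor would first replicate its cached RDD
// blocks to other executors and only then be removed.
{code}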






[jira] [Commented] (SPARK-24760) Pandas UDF does not handle NaN correctly

2018-07-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553756#comment-16553756
 ] 

Hyukjin Kwon commented on SPARK-24760:
--

+1 for a "Not a Problem" resolution for now.

> Pandas UDF does not handle NaN correctly
> 
>
> Key: SPARK-24760
> URL: https://issues.apache.org/jira/browse/SPARK-24760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Pandas 0.23.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> I noticed that having `NaN` values when using the new Pandas UDF feature 
> triggers a JVM exception. Not sure if this is an issue with PySpark or 
> PyArrow. Here is a somewhat contrived example to showcase the problem.
> {code}
> In [1]: import pandas as pd
>...: from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
> In [2]: d = [{'key': 'a', 'value': 1},
>  {'key': 'a', 'value': 2},
>  {'key': 'b', 'value': 3},
>  {'key': 'b', 'value': -2}]
> df = spark.createDataFrame(d, "key: string, value: int")
> df.show()
> +---+-----+
> |key|value|
> +---+-----+
> |  a|    1|
> |  a|    2|
> |  b|    3|
> |  b|   -2|
> +---+-----+
> In [3]: df_tmp = df.withColumn('new', lit(1.0))  # add a DoubleType column
> df_tmp.printSchema()
> root
>  |-- key: string (nullable = true)
>  |-- value: integer (nullable = true)
>  |-- new: double (nullable = false)
> {code}
> And the Pandas UDF simply creates a new column where negative values 
> are set to a particular float, in this case INF, and it works fine:
> {code}
> In [4]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('inf'))
>...: return pdf
> In [5]: df.groupby('key').apply(func).show()
> +---+-----+--------+
> |key|value|     new|
> +---+-----+--------+
> |  b|    3|     3.0|
> |  b|   -2|Infinity|
> |  a|    1|     1.0|
> |  a|    2|     2.0|
> +---+-----+--------+
> {code}
> However if we set this value to NaN then it triggers an exception:
> {code}
> In [6]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
>...: return pdf
>...:
>...: df.groupby('key').apply(func).show()
> [Stage 23:==> (73 + 2) / 
> 75]2018-07-07 16:26:27 ERROR Executor:91 - Exception in task 36.0 in stage 
> 23.0 (TID 414)
> java.lang.IllegalStateException: Value at index is null
>   at org.apache.arrow.vector.Float8Vector.get(Float8Vector.java:98)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$DoubleAccessor.getDouble(ArrowColumnVector.java:344)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getDouble(ArrowColumnVector.java:99)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> 

[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-07-23 Thread Carson Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553700#comment-16553700
 ] 

Carson Wang commented on SPARK-23128:
-

[~tgraves], we have split the code into 7 PRs based on Spark 2.3, and they are 
all available at [https://github.com/Intel-bigdata/spark-adaptive/pulls] 
(AE2.3-01 to AE2.3-07). Many companies have evaluated this feature, including 
Baidu, Alibaba, JD.com, and eBay, and the feedback has been very good. Many of 
them have also merged the feature into their own Spark version. However, there 
has been little progress in contributing this upstream. Any suggestions for 
moving this forward?

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges, but currently two separate ExchangeCoordinators will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-23 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553679#comment-16553679
 ] 

Saisai Shao commented on SPARK-24615:
-

[~Tagar] yes!

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that a CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-15694) Implement ScriptTransformation in sql/core

2018-07-23 Thread Abdulrahman Alfozan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553677#comment-16553677
 ] 

Abdulrahman Alfozan commented on SPARK-15694:
-

I'm currently working on this.

> Implement ScriptTransformation in sql/core
> --
>
> Key: SPARK-15694
> URL: https://issues.apache.org/jira/browse/SPARK-15694
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we 
> can implement a native ScriptTransformation in sql/core module to remove the 
> extra Hive dependency here.
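
For context, a small sketch of what ScriptTransformation does today; it assumes a Hive-enabled session and an existing table named src(key, value), both of which are assumptions made here for illustration:

{code:scala}
// TRANSFORM pipes each input row through an external script (here just `cat`); this
// is the functionality the ticket proposes to reimplement natively in sql/core.
val transformed = spark.sql(
  """
    |SELECT TRANSFORM (key, value)
    |USING 'cat'
    |AS (key, value)
    |FROM src
  """.stripMargin)

transformed.show()
{code}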






[jira] [Resolved] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-07-23 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24594.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21635
[https://github.com/apache/spark/pull/21635]

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> Within SPARK-16630 it came up to introduce metrics for YARN executor 
> allocation problems.






[jira] [Assigned] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-07-23 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24594:
---

Assignee: Attila Zsolt Piros

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> Within SPARK-16630 it came up to introduce metrics for YARN executor 
> allocation problems.






[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames

2018-07-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553656#comment-16553656
 ] 

Hyukjin Kwon commented on SPARK-24895:
--

FWIW, I hit this too. Thanks for the analysis.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
> org.apache.spark
> spark-mllib-local_2.11
> 2.4.0-SNAPSHOT
> 
> 
> 20180723.232411
> 177
> 
> 20180723232411
> 
> 
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> 
> 
> {code}
>  
>  This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.






[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames

2018-07-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553654#comment-16553654
 ] 

Hyukjin Kwon commented on SPARK-24895:
--

Yes please! I don't mind at all.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
> org.apache.spark
> spark-mllib-local_2.11
> 2.4.0-SNAPSHOT
> 
> 
> 20180723.232411
> 177
> 
> 20180723232411
> 
> 
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> 
> 
> {code}
>  
>  This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.






[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames

2018-07-23 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553610#comment-16553610
 ] 

Yin Huai commented on SPARK-24895:
--

[~kiszk] [~hyukjin.kwon] since this is pretty tricky to actually test out, do 
you mind if I remove the spotbugs plugin and test our nightly snapshot build? 
If this plugin is not the cause, I will add it back. If it is indeed the cause, 
we can figure out how to fix it. Thanks!

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
> org.apache.spark
> spark-mllib-local_2.11
> 2.4.0-SNAPSHOT
> 
> 
> 20180723.232411
> 177
> 
> 20180723232411
> 
> 
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> 
> 
> {code}
>  
>  This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.






[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames

2018-07-23 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553568#comment-16553568
 ] 

Yin Huai commented on SPARK-24895:
--

 

[~kiszk] and [~hyukjin.kwon], we hit this issue today. Per 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21], it may be 
related to the spot-bugs plugin. We are trying to verify it now.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
> org.apache.spark
> spark-mllib-local_2.11
> 2.4.0-SNAPSHOT
> 
> 
> 20180723.232411
> 177
> 
> 20180723232411
> 
> 
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> 
> 
> {code}
>  
>  This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.






[jira] [Updated] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames

2018-07-23 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-24895:
-
Target Version/s: 2.4.0

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
> org.apache.spark
> spark-mllib-local_2.11
> 2.4.0-SNAPSHOT
> 
> 
> 20180723.232411
> 177
> 
> 20180723232411
> 
> 
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> 
> 
> {code}
>  
>  This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.






[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown

2018-07-23 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553566#comment-16553566
 ] 

Xiao Li commented on SPARK-24288:
-

[~TomaszGaweda] Based on my understanding, when the predicates fail to use any 
pre-defined index (e.g., OR expressions with non-equality comparison 
expressions), predicate push-down could perform slower than the regular full 
JDBC table scan + Spark SQL filtering. Is my understanding right?

> Enable preventing predicate pushdown
> 
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tomasz Gawęda
>Priority: Major
> Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with the JDBC datasource I saw that many "or" clauses with 
> non-equality operators cause huge performance degradation of the SQL query 
> sent to the database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in real application whose predicates were pushed 
> many many lines below, many ANDs and ORs 
> If I use cache() before the where, there is no predicate pushdown of this 
> "where" clause. However, in a production system caching many sources is a 
> waste of memory (especially if the pipeline is long and I must cache many 
> times). There are also a few more workarounds, but it would be great if Spark 
> supported letting the user prevent predicate pushdown.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note, that this should not be a global configuration option. If I read 2 
> DataFrames, df1 and df2, I would like to specify that df1 should not have 
> some predicates pushed down, but some may be, but df2 should have all 
> predicates pushed down, even if target query joins df1 and df2. As far as I 
> understand Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly in logical plan, then predicates won't be 
> pushed down on this particular DataFrames and PP will be still possible on 
> the second one.
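
A cleaned-up sketch of the reported scenario; the JDBC URL, table, partitioning options, and predicate values below are placeholders, not taken from the report:

{code:scala}
// With a JDBC source, the filter below is pushed down into the SQL sent to the
// database; the reporter observed that such OR / non-equality predicates can run much
// slower there than a plain scan followed by filtering in Spark.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://dbhost:50000/SAMPLE")  // placeholder connection string
  .option("dbtable", "MYSCHEMA.MYTABLE")            // placeholder table
  .option("numPartitions", "8")                     // options to parallelize the load
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "1000000")
  .load()

df.where("(date1 > '2018-01-01' and (date1 < '2018-06-30' or date1 is null)) or x > 100")
  .show()
{code}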






[jira] [Created] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames

2018-07-23 Thread Eric Chang (JIRA)
Eric Chang created SPARK-24895:
--

 Summary: Spark 2.4.0 Snapshot artifacts have broken metadata due to 
mismatched filenames
 Key: SPARK-24895
 URL: https://issues.apache.org/jira/browse/SPARK-24895
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Eric Chang


Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
repo have mismatched filenames:

[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]

 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}

<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20180723.232411</timestamp>
      <buildNumber>177</buildNumber>
    </snapshot>
    <lastUpdated>20180723232411</lastUpdated>
    <snapshotVersions>
      <snapshotVersion>
        <extension>jar</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <extension>pom</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>tests</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>test-sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
{code}
 
 This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.






[jira] [Created] (SPARK-24894) Invalid DNS name due to hostname truncation

2018-07-23 Thread Dharmesh Kakadia (JIRA)
Dharmesh Kakadia created SPARK-24894:


 Summary: Invalid DNS name due to hostname truncation 
 Key: SPARK-24894
 URL: https://issues.apache.org/jira/browse/SPARK-24894
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.1
Reporter: Dharmesh Kakadia


The hostname truncation happening here 
[https://github.com/apache/spark/blob/5ff1b9ba1983d5601add62aef64a3e87d07050eb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L77]
is problematic and can lead to DNS names starting with "-". 

Originally filed here: 
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/229
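
A hedged sketch of one possible fix follows (not the actual Spark patch); it assumes, as the error below suggests, that the current code keeps only the tail of the pod name:

{code:scala}
// A DNS-1123 label is at most 63 characters and must start and end with an
// alphanumeric character. Keeping only the last 63 characters of the pod name can
// leave a leading '-', so drop any leading non-alphanumeric characters afterwards.
def toDnsLabel(podName: String): String = {
  val truncated = podName.takeRight(63)          // keep at most 63 chars, as today
  truncated.dropWhile(c => !c.isLetterOrDigit)   // never start with '-'
}

// toDnsLabel("user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9")
// no longer starts with '-'.
{code}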

```
{{2018-07-23 21:21:42 ERROR Utils:91 - Uncaught exception in thread 
kubernetes-pod-allocator 
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://kubernetes.default.svc/api/v1/namespaces/default/pods. Message: Pod 
"user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9" is 
invalid: spec.hostname: Invalid value: 
"-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 
label must consist of lower case alphanumeric characters or '-', and must start 
and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex 
used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'). Received status: 
Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=spec.hostname, message=Invalid 
value: "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a 
DNS-1123 label must consist of lower case alphanumeric characters or '-', and 
must start and end with an alphanumeric character (e.g. 'my-name', or 
'123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), 
reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, 
name=user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9, 
retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
message=Pod 
"user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9" is 
invalid: spec.hostname: Invalid value: 
"-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 
label must consist of lower case alphanumeric characters or '-', and must start 
and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex 
used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), 
metadata=ListMeta(resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=Invalid, status=Failure, 
additionalProperties={}). at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:409)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:226)
 at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:769)
 at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:356)
 at 
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3$$anonfun$apply$3.apply(KubernetesClusterSchedulerBackend.scala:140)
 at 
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3$$anonfun$apply$3.apply(KubernetesClusterSchedulerBackend.scala:140)
 at org.apache.spark.util.Utils$.tryLog(Utils.scala:1922) at 
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3.apply(KubernetesClusterSchedulerBackend.scala:139)
 at 
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3.apply(KubernetesClusterSchedulerBackend.scala:138)
 at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
 at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
 at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) 
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) 
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) 
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) 
at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 

[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API

2018-07-23 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553460#comment-16553460
 ] 

Ryan Blue commented on SPARK-24882:
---

[~cloud_fan], just to confirm: this is *not intended to go into 2.4.0* right?

> separate responsibilities of the data source v2 read API
> 
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Data source V2 is out for a while, see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 read API. Details please see the attached google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Commented] (SPARK-23957) Sorts in subqueries are redundant and can be removed

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553434#comment-16553434
 ] 

Apache Spark commented on SPARK-23957:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/21853

> Sorts in subqueries are redundant and can be removed
> 
>
> Key: SPARK-23957
> URL: https://issues.apache.org/jira/browse/SPARK-23957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Priority: Major
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned 
> and optimized subqueries should have any sort operators (since the result of 
> the subquery is an unordered collection of tuples). 
> For example:
> {{SELECT count(1) FROM (select id FROM dft ORDER by id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(2) Project
>  +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered 
> subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is 
> unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a 
> subquery. SPARK-23375 is related, but removes sorts that are functionally 
> redundant because they perform the same ordering.
>   
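A minimal Scala sketch of what such an optimizer rule could look like at the Catalyst level, covering only a Sort that feeds directly into an Aggregate; the object name and the exact pattern are illustrative, not the actual rule that would land in Spark:

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// A Sort whose output goes straight into an Aggregate cannot affect the
// aggregate's result, so the Sort node can simply be dropped from the plan.
object RemoveRedundantSortInSubquery extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg @ Aggregate(_, _, Sort(_, _, sortChild)) =>
      agg.copy(child = sortChild)
  }
}
{code}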



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24814) Relationship between catalog and datasources

2018-07-23 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-24814:
--
Description: 
This is somewhat related, though not identical to, [~rdblue]'s SPIP on 
datasources and catalogs.

Here are the requirements (IMO) for fully implementing V2 datasources and their 
relationships to catalogs:
 # The global catalog should be configurable (the default can be HMS, but it 
should be overridable).
 # The default catalog (or an explicitly specified catalog in a query, once 
multiple catalogs are supported) can determine the V2 datasource to use for 
reading and writing the data.
 # Once multiple catalogs are supported, a user should be able to specify a 
catalog on spark.read and df.write operations. As specified above, the catalog 
would determine the datasource to use for the read or write operation.


Old #3:
-Conversely, a V2 datasource can determine which catalog to use for resolution 
(e.g., if the user issues {{spark.read.format("acmex").table("mytable")}}, the 
acmex datasource would decide which catalog to use for resolving “mytable”).-

  was:
This is somewhat related, though not identical to, [~rdblue]'s SPIP on 
datasources and catalogs.

Here are the requirements (IMO) for fully implementing V2 datasources and their 
relationships to catalogs:
 # The global catalog should be configurable (the default can be HMS, but it 
should be overridable).
 # The default catalog (or an explicitly specified catalog in a query, once 
multiple catalogs are supported) can determine the V2 datasource to use for 
reading and writing the data.
 # Conversely, a V2 datasource can determine which catalog to use for 
resolution (e.g., if the user issues 
{{spark.read.format("acmex").table("mytable")}}, the acmex datasource would 
decide which catalog to use for resolving “mytable”).


> Relationship between catalog and datasources
> 
>
> Key: SPARK-24814
> URL: https://issues.apache.org/jira/browse/SPARK-24814
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This is somewhat related, though not identical to, [~rdblue]'s SPIP on 
> datasources and catalogs.
> Here are the requirements (IMO) for fully implementing V2 datasources and 
> their relationships to catalogs:
>  # The global catalog should be configurable (the default can be HMS, but it 
> should be overridable).
>  # The default catalog (or an explicitly specified catalog in a query, once 
> multiple catalogs are supported) can determine the V2 datasource to use for 
> reading and writing the data.
>  # Once multiple catalogs are supported, a user should be able to specify a 
> catalog on spark.read and df.write operations. As specified above, the 
> catalog would determine the datasource to use for the read or write operation.
> Old #3:
> -Conversely, a V2 datasource can determine which catalog to use for 
> resolution (e.g., if the user issues 
> {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would 
> decide which catalog to use for resolving “mytable”).-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24339) spark sql can not prune column in transform/map/reduce query

2018-07-23 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24339.
-
   Resolution: Fixed
 Assignee: Li Yuanjian
Fix Version/s: 2.4.0

> spark sql can not prune column in transform/map/reduce query
> 
>
> Key: SPARK-24339
> URL: https://issues.apache.org/jira/browse/SPARK-24339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: xdcjie
>Assignee: Li Yuanjian
>Priority: Minor
>  Labels: map, reduce, sql, transform
> Fix For: 2.4.0
>
>
> I was using {{TRANSFORM USING}} with branch-2.1/2.2, and noticed that it will 
> scan all columns of the data. The query looks like:
> {code:java}
> SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM 
> test_table;{code}
> Its physical plan looks like:
> {code:java}
> == Physical Plan ==
> ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, 
> u2#786, c2#787], 
> HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,
> )),List((field.delim,   
> )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
> +- FileScan parquet 
> [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,...
>  367 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct {code}
> In our scenario, the parquet data has 400 columns, so this query takes much more time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24699) Watermark / Append mode should work with Trigger.Once

2018-07-23 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24699.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21746
[https://github.com/apache/spark/pull/21746]

> Watermark / Append mode should work with Trigger.Once
> -
>
> Key: SPARK-24699
> URL: https://issues.apache.org/jira/browse/SPARK-24699
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Chris Horn
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: watermark-once.scala, watermark-stream.scala
>
>
> I have a use case where I would like to trigger a structured streaming job 
> from an external scheduler (once every 15 minutes or so) and have it write 
> window aggregates to Kafka.
> I am able to get my code to work when running with `Trigger.ProcessingTime` 
> but when I switch to `Trigger.Once` the watermarking feature of structured 
> streams does not persist to (or is not recollected from) the checkpoint state.
> This causes the stream to never generate output because the watermark is 
> perpetually stuck at `1970-01-01T00:00:00Z`.
> I have created a failing test case in the `EventTimeWatermarkSuite`; I will 
> create a [WIP] pull request on GitHub and link it here.
>  
> It seems that even if it generated the watermark, and given the current 
> streaming behavior, I would have to trigger the job twice to generate any 
> output.
>  
> The microbatcher only calculates the watermark off of the previous batch's 
> input and emits new aggs based off of that timestamp.
> This state is not available to a newly started `MicroBatchExecution` stream.
> Would it be an appropriate strategy to create a new checkpoint file with the 
> most up-to-date watermark, or the watermark plus query stats?
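For reference, a minimal Scala sketch of the kind of query described above (the broker, topic, and checkpoint path are placeholders, and a spark-shell session is assumed):

{code}
import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.Trigger

// Windowed counts over a Kafka source, written back to Kafka once per
// invocation of the external scheduler via Trigger.Once.
val counts = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
  .option("subscribe", "events")                      // placeholder topic
  .load()
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "15 minutes"))
  .count()

counts.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "event_counts")
  .option("checkpointLocation", "/tmp/checkpoint")    // placeholder path
  .outputMode("append")
  .trigger(Trigger.Once())
  .start()
{code}

With the behavior described in this ticket, the watermark recovered on the next Trigger.Once run stays at the epoch, so the append-mode aggregation never emits results.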



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24699) Watermark / Append mode should work with Trigger.Once

2018-07-23 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-24699:
-

Assignee: Tathagata Das

> Watermark / Append mode should work with Trigger.Once
> -
>
> Key: SPARK-24699
> URL: https://issues.apache.org/jira/browse/SPARK-24699
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Chris Horn
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: watermark-once.scala, watermark-stream.scala
>
>
> I have a use case where I would like to trigger a structured streaming job 
> from an external scheduler (once every 15 minutes or so) and have it write 
> window aggregates to Kafka.
> I am able to get my code to work when running with `Trigger.ProcessingTime` 
> but when I switch to `Trigger.Once` the watermarking feature of structured 
> streams does not persist to (or is not recollected from) the checkpoint state.
> This causes the stream to never generate output because the watermark is 
> perpetually stuck at `1970-01-01T00:00:00Z`.
> I have created a failing test case in the `EventTimeWatermarkSuite`; I will 
> create a [WIP] pull request on GitHub and link it here.
>  
> It seems that even if it generated the watermark, and given the current 
> streaming behavior, I would have to trigger the job twice to generate any 
> output.
>  
> The microbatcher only calculates the watermark off of the previous batch's 
> input and emits new aggs based off of that timestamp.
> This state is not available to a newly started `MicroBatchExecution` stream.
> Would it be an appropriate strategy to create a new checkpoint file with the 
> most up-to-date watermark, or the watermark plus query stats?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24814) Relationship between catalog and datasources

2018-07-23 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553336#comment-16553336
 ] 

Bruce Robbins commented on SPARK-24814:
---

[~rdblue] Your parquet example is a compelling one.

If #2 holds, and the user can specify a catalog on spark.read/df.write 
statements, then my use cases are covered.

 

> Relationship between catalog and datasources
> 
>
> Key: SPARK-24814
> URL: https://issues.apache.org/jira/browse/SPARK-24814
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This is somewhat related, though not identical to, [~rdblue]'s SPIP on 
> datasources and catalogs.
> Here are the requirements (IMO) for fully implementing V2 datasources and 
> their relationships to catalogs:
>  # The global catalog should be configurable (the default can be HMS, but it 
> should be overridable).
>  # The default catalog (or an explicitly specified catalog in a query, once 
> multiple catalogs are supported) can determine the V2 datasource to use for 
> reading and writing the data.
>  # Conversely, a V2 datasource can determine which catalog to use for 
> resolution (e.g., if the user issues 
> {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would 
> decide which catalog to use for resolving “mytable”).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6459) Warn when Column API is constructing trivially true equality

2018-07-23 Thread Michael Armbrust (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553330#comment-16553330
 ] 

Michael Armbrust commented on SPARK-6459:
-

[~tenstriker] this will never happen from a SQL query.  This only happens when 
you take already resolved attributes from different parts of a DataFrame and 
manually construct an equality that can't be differentiated.
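A minimal Scala illustration of that situation (assuming a spark-shell session and a DataFrame {{df}} with an {{id}} column; the names are illustrative):

{code}
// A "self join": `filtered` is derived from `df`, so df("id") and
// filtered("id") resolve to the very same attribute.
val filtered = df.filter(df("id") > 0)

// The condition compares an attribute with itself, so it is trivially true
// and the join degenerates into a cross join; this is the case the warning targets.
val joined = df.join(filtered, df("id") === filtered("id"))
{code}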

> Warn when Column API is constructing trivially true equality
> 
>
> Key: SPARK-6459
> URL: https://issues.apache.org/jira/browse/SPARK-6459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>
> Right now it's pretty confusing when a user constructs an equality predicate 
> that is going to be used in a self join, where the optimizer cannot 
> distinguish between the attributes in question (e.g., [SPARK-6231]). Since 
> there is really no good reason to do this, let's print a warning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24814) Relationship between catalog and datasources

2018-07-23 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553324#comment-16553324
 ] 

Ryan Blue commented on SPARK-24814:
---

I've been implementing more logical plans (AppendData, DeleteFrom, CTAS, and 
RTAS) on top of my PR to add the proposed table catalog API. After thinking 
about this more, I don't think that we need #3. I think we should always go 
from a catalog to a table implementation (data source v2) instead of from a 
data source to a catalog.

For example, think about the "parquet" data source. Once we have multiple table 
catalogs, what table catalog should Parquet return? We could make it simply the 
"default", but then that restricts Spark from creating Parquet tables through 
other sources on some write paths. I think it makes no sense for a user to 
specify a CTAS for a Parquet table without also specifying a catalog in the 
table name (via name triple, {{catalog.db.table}}). TableIdentifier triples are 
supported through saveAsTable, insertIntoTable, and all SQL statements, so it 
is easy to specify the catalog nearly everywhere. The one write path that is 
left out is `df.write.save`, but that could require a `catalog` option like the 
`table` and `database` options.
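To make that concrete, a hedged Scala sketch of what catalog-qualified writes might look like under this proposal; none of this syntax exists yet, and the catalog name, table name, and the `catalog` option are purely illustrative:

{code}
// Hypothetical: a three-part identifier routes the write through the named
// catalog, which in turn chooses the table implementation (data source v2).
df.write.saveAsTable("prod_catalog.db.events")

// Hypothetical: the df.write.save path could accept a `catalog` option,
// analogous to source-specific `table`/`database` style options.
df.write
  .option("catalog", "prod_catalog")
  .option("table", "db.events")
  .save()
{code}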

> Relationship between catalog and datasources
> 
>
> Key: SPARK-24814
> URL: https://issues.apache.org/jira/browse/SPARK-24814
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This is somewhat related, though not identical to, [~rdblue]'s SPIP on 
> datasources and catalogs.
> Here are the requirements (IMO) for fully implementing V2 datasources and 
> their relationships to catalogs:
>  # The global catalog should be configurable (the default can be HMS, but it 
> should be overridable).
>  # The default catalog (or an explicitly specified catalog in a query, once 
> multiple catalogs are supported) can determine the V2 datasource to use for 
> reading and writing the data.
>  # Conversely, a V2 datasource can determine which catalog to use for 
> resolution (e.g., if the user issues 
> {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would 
> decide which catalog to use for resolving “mytable”).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24760) Pandas UDF does not handle NaN correctly

2018-07-23 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553315#comment-16553315
 ] 

Wes McKinney commented on SPARK-24760:
--

If data comes to Spark from pandas, any "NaN" values should be treated as 
"null". Any other behavior is going to cause users significant problems

> Pandas UDF does not handle NaN correctly
> 
>
> Key: SPARK-24760
> URL: https://issues.apache.org/jira/browse/SPARK-24760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Pandas 0.23.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> I noticed that having `NaN` values when using the new Pandas UDF feature 
> triggers a JVM exception. Not sure if this is an issue with PySpark or 
> PyArrow. Here is a somewhat contrived example to showcase the problem.
> {code}
> In [1]: import pandas as pd
>...: from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
> In [2]: d = [{'key': 'a', 'value': 1},
>  {'key': 'a', 'value': 2},
>  {'key': 'b', 'value': 3},
>  {'key': 'b', 'value': -2}]
> df = spark.createDataFrame(d, "key: string, value: int")
> df.show()
> +---+-+
> |key|value|
> +---+-+
> |  a|1|
> |  a|2|
> |  b|3|
> |  b|   -2|
> +---+-+
> In [3]: df_tmp = df.withColumn('new', lit(1.0))  # add a DoubleType column
> df_tmp.printSchema()
> root
>  |-- key: string (nullable = true)
>  |-- value: integer (nullable = true)
>  |-- new: double (nullable = false)
> {code}
> And the Pandas UDF simply creates a new column where negative values are set 
> to a particular float, in this case INF, and it works fine:
> {code}
> In [4]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('inf'))
>...: return pdf
> In [5]: df.groupby('key').apply(func).show()
> +---+-+--+
> |key|value|new|
> +---+-+--+
> |  b|3|   3.0|
> |  b|   -2|  Infinity|
> |  a|1|   1.0|
> |  a|2|   2.0|
> +---+-+--+
> {code}
> However if we set this value to NaN then it triggers an exception:
> {code}
> In [6]: @pandas_udf(df_tmp.schema, PandasUDFType.GROUPED_MAP)
>...: def func(pdf):
>...: pdf['new'] = pdf['value'].where(pdf['value'] > 0, float('nan'))
>...: return pdf
>...:
>...: df.groupby('key').apply(func).show()
> [Stage 23:==> (73 + 2) / 
> 75]2018-07-07 16:26:27 ERROR Executor:91 - Exception in task 36.0 in stage 
> 23.0 (TID 414)
> java.lang.IllegalStateException: Value at index is null
>   at org.apache.arrow.vector.Float8Vector.get(Float8Vector.java:98)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$DoubleAccessor.getDouble(ArrowColumnVector.java:344)
>   at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getDouble(ArrowColumnVector.java:99)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at 

[jira] [Assigned] (SPARK-24893) Remove the entire CaseWhen if all the outputs are semantic equivalence

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24893:


Assignee: Apache Spark  (was: DB Tsai)

> Remove the entire CaseWhen if all the outputs are semantic equivalence
> --
>
> Key: SPARK-24893
> URL: https://issues.apache.org/jira/browse/SPARK-24893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> Similar to [SPARK-24890], if all the outputs of `CaseWhen` are semantically 
> equivalent, `CaseWhen` can be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24893) Remove the entire CaseWhen if all the outputs are semantic equivalence

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24893:


Assignee: DB Tsai  (was: Apache Spark)

> Remove the entire CaseWhen if all the outputs are semantic equivalence
> --
>
> Key: SPARK-24893
> URL: https://issues.apache.org/jira/browse/SPARK-24893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Similar to [SPARK-24890], if all the outputs of `CaseWhen` are semantically 
> equivalent, `CaseWhen` can be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24893) Remove the entire CaseWhen if all the outputs are semantic equivalence

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553286#comment-16553286
 ] 

Apache Spark commented on SPARK-24893:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21852

> Remove the entire CaseWhen if all the outputs are semantic equivalence
> --
>
> Key: SPARK-24893
> URL: https://issues.apache.org/jira/browse/SPARK-24893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Similar to [SPARK-24890], if all the outputs of `CaseWhen` are semantically 
> equivalent, `CaseWhen` can be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24893) Remove the entire CaseWhen if all the outputs are semantic equivalence

2018-07-23 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24893:
---

 Summary: Remove the entire CaseWhen if all the outputs are 
semantic equivalence
 Key: SPARK-24893
 URL: https://issues.apache.org/jira/browse/SPARK-24893
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai
Assignee: DB Tsai


Similar to [SPARK-24890], if all the outputs of `CaseWhen` are semantically 
equivalent, `CaseWhen` can be removed.
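A minimal Scala sketch of the idea at the expression level (the name and the exact guard are illustrative, not the actual rule added for this ticket):

{code}
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression}

// If every branch value and the else value are semantically equal, the whole
// CaseWhen can be replaced by that common value.
val removeConstantCaseWhen: PartialFunction[Expression, Expression] = {
  case CaseWhen(branches, Some(elseValue))
      if branches.forall { case (_, value) => value.semanticEquals(elseValue) } =>
    elseValue
}
{code}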



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24891) Fix HandleNullInputsForUDF rule

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24891:


Assignee: Apache Spark

> Fix HandleNullInputsForUDF rule
> ---
>
> Key: SPARK-24891
> URL: https://issues.apache.org/jira/browse/SPARK-24891
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Major
>
> The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely, thus 
> causing problems such as missed matches against the SQL cache. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24891) Fix HandleNullInputsForUDF rule

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24891:


Assignee: (was: Apache Spark)

> Fix HandleNullInputsForUDF rule
> ---
>
> Key: SPARK-24891
> URL: https://issues.apache.org/jira/browse/SPARK-24891
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Major
>
> The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely, thus 
> causing problems such as missed matches against the SQL cache. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24891) Fix HandleNullInputsForUDF rule

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553271#comment-16553271
 ] 

Apache Spark commented on SPARK-24891:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21851

> Fix HandleNullInputsForUDF rule
> ---
>
> Key: SPARK-24891
> URL: https://issues.apache.org/jira/browse/SPARK-24891
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Major
>
> The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely, thus 
> causing problems such as missed matches against the SQL cache. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24891) Fix HandleNullInputsForUDF rule

2018-07-23 Thread Maryann Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated SPARK-24891:

Summary: Fix HandleNullInputsForUDF rule  (was: Fix HandleNullInputsForUDF 
rule.)

> Fix HandleNullInputsForUDF rule
> ---
>
> Key: SPARK-24891
> URL: https://issues.apache.org/jira/browse/SPARK-24891
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Major
>
> The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely, thus 
> causing problems such as missed matches against the SQL cache. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24890) Short circuiting the `if` condition when `trueValue` and `falseValue` are the same

2018-07-23 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24890:
---

Assignee: DB Tsai

> Short circuiting the `if` condition when `trueValue` and `falseValue` are the 
> same
> --
>
> Key: SPARK-24890
> URL: https://issues.apache.org/jira/browse/SPARK-24890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> When `trueValue` and `falseValue` are semantically equivalent, the condition 
> expression in `if` can be removed to avoid extra computation at runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24892) Simplify `CaseWhen` to `If` when there is only one branch

2018-07-23 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24892:
---

Assignee: DB Tsai

> Simplify `CaseWhen` to `If` when there is only one branch
> -
>
> Key: SPARK-24892
> URL: https://issues.apache.org/jira/browse/SPARK-24892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> After the rule that removes unreachable branches runs, there may be only one 
> branch left. In this situation, `CaseWhen` can be converted to `If` to enable 
> further optimization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24892) Simplify `CaseWhen` to `If` when there is only one branch

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24892:


Assignee: Apache Spark

> Simplify `CaseWhen` to `If` when there is only one branch
> -
>
> Key: SPARK-24892
> URL: https://issues.apache.org/jira/browse/SPARK-24892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> After the rule that removes unreachable branches runs, there may be only one 
> branch left. In this situation, `CaseWhen` can be converted to `If` to enable 
> further optimization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24892) Simplify `CaseWhen` to `If` when there is only one branch

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24892:


Assignee: (was: Apache Spark)

> Simplify `CaseWhen` to `If` when there is only one branch
> -
>
> Key: SPARK-24892
> URL: https://issues.apache.org/jira/browse/SPARK-24892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> After the rule that removes unreachable branches runs, there may be only one 
> branch left. In this situation, `CaseWhen` can be converted to `If` to enable 
> further optimization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24892) Simplify `CaseWhen` to `If` when there is only one branch

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553245#comment-16553245
 ] 

Apache Spark commented on SPARK-24892:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21850

> Simplify `CaseWhen` to `If` when there is only one branch
> -
>
> Key: SPARK-24892
> URL: https://issues.apache.org/jira/browse/SPARK-24892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> After the rule that removes unreachable branches runs, there may be only one 
> branch left. In this situation, `CaseWhen` can be converted to `If` to enable 
> further optimization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24892) Simplify `CaseWhen` to `If` when there is only one branch

2018-07-23 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24892:
---

 Summary: Simplify `CaseWhen` to `If` when there is only one branch
 Key: SPARK-24892
 URL: https://issues.apache.org/jira/browse/SPARK-24892
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai


After the rule that removes unreachable branches runs, there may be only one 
branch left. In this situation, `CaseWhen` can be converted to `If` to enable 
further optimization.
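A minimal Scala sketch of the rewrite at the expression level (the name and the null handling shown are illustrative, not the actual rule):

{code}
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression, If, Literal}

// A CaseWhen with a single branch is equivalent to a plain If; a missing
// else value becomes a NULL literal of the branch value's type.
val singleBranchCaseWhenToIf: PartialFunction[Expression, Expression] = {
  case CaseWhen(Seq((cond, value)), elseValue) =>
    If(cond, value, elseValue.getOrElse(Literal(null, value.dataType)))
}
{code}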



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24891) Fix HandleNullInputsForUDF rule.

2018-07-23 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-24891:
---

 Summary: Fix HandleNullInputsForUDF rule.
 Key: SPARK-24891
 URL: https://issues.apache.org/jira/browse/SPARK-24891
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maryann Xue


The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely, thus 
causing problems such as missed matches against the SQL cache. 
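A minimal Scala sketch of the kind of query that exercises this rule (assuming a spark-shell session and a DataFrame {{df}} with an integer column {{a}}; the repro is illustrative, not taken from the actual regression test):

{code}
import org.apache.spark.sql.functions.udf

// A Scala UDF over a primitive (non-nullable) argument is the case that
// HandleNullInputsForUDF wraps in a null-checking If expression.
val plusOne = udf((x: Int) => x + 1)
val result = df.withColumn("b", plusOne(df("a")))
result.cache()

// With the bug, each additional analysis pass can add yet another If wrapper,
// so a semantically identical query may no longer match the cached plan.
df.withColumn("b", plusOne(df("a"))).count()
{code}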



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24243:


Assignee: Apache Spark

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Apache Spark
>Priority: Major
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread; any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object to the application rather 
> than logging it and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} / control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.
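A hedged Scala sketch of how an application might consume such an exception; {{getError}} below is a stand-in for whatever accessor gets added and is not an existing API at the time of writing (class and job names are illustrative):

{code}
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle, SparkLauncher}

val handle: SparkAppHandle = new InProcessLauncher()
  .setAppResource(SparkLauncher.NO_RESOURCE)
  .setMainClass("com.example.MyJob")   // hypothetical application class
  .startApplication()

// Hypothetical accessor proposed here: once the handle reports FAILED, fetch
// the underlying Throwable instead of only seeing it in the logs.
// if (handle.getState == SparkAppHandle.State.FAILED) {
//   handle.getError().ifPresent(t => println(Throwables.getRootCause(t).getMessage))
// }
{code}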



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24243:


Assignee: (was: Apache Spark)

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread; any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object to the application rather 
> than logging it and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} / control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24243:


Assignee: Apache Spark

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Apache Spark
>Priority: Major
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread; any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object to the application rather 
> than logging it and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} / control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24243:


Assignee: (was: Apache Spark)

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread; any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object to the application rather 
> than logging it and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} / control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553155#comment-16553155
 ] 

Apache Spark commented on SPARK-24243:
--

User 'sahilTakiar' has created a pull request for this issue:
https://github.com/apache/spark/pull/21849

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread; any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object to the application rather 
> than logging it and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} / control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24890) Short circuiting the `if` condition when `trueValue` and `falseValue` are the same

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24890:


Assignee: (was: Apache Spark)

> Short circuiting the `if` condition when `trueValue` and `falseValue` are the 
> same
> --
>
> Key: SPARK-24890
> URL: https://issues.apache.org/jira/browse/SPARK-24890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> When `trueValue` and `falseValue` are semantically equivalent, the condition 
> expression in `if` can be removed to avoid extra computation at runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24890) Short circuiting the `if` condition when `trueValue` and `falseValue` are the same

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553151#comment-16553151
 ] 

Apache Spark commented on SPARK-24890:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21848

> Short circuiting the `if` condition when `trueValue` and `falseValue` are the 
> same
> --
>
> Key: SPARK-24890
> URL: https://issues.apache.org/jira/browse/SPARK-24890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> When `trueValue` and `falseValue` are semantically equivalent, the condition 
> expression in `if` can be removed to avoid extra computation at runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24890) Short circuiting the `if` condition when `trueValue` and `falseValue` are the same

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24890:


Assignee: Apache Spark

> Short circuiting the `if` condition when `trueValue` and `falseValue` are the 
> same
> --
>
> Key: SPARK-24890
> URL: https://issues.apache.org/jira/browse/SPARK-24890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> When `trueValue` and `falseValue` are semantically equivalent, the condition 
> expression in `if` can be removed to avoid extra computation at runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24890) Short circuiting the `if` condition when `trueValue` and `falseValue` are the same

2018-07-23 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24890:
---

 Summary: Short circuiting the `if` condition when `trueValue` and 
`falseValue` are the same
 Key: SPARK-24890
 URL: https://issues.apache.org/jira/browse/SPARK-24890
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai


When `trueValue` and `falseValue` are semantically equivalent, the condition 
expression in `if` can be removed to avoid extra computation at runtime.
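A minimal Scala sketch of the simplification at the expression level (the name is illustrative, not the actual rule):

{code}
import org.apache.spark.sql.catalyst.expressions.{Expression, If}

// If both branches are semantically equal, the predicate can never change the
// result, so the If collapses to either branch and the condition is skipped.
val shortCircuitIf: PartialFunction[Expression, Expression] = {
  case If(_, trueValue, falseValue) if trueValue.semanticEquals(falseValue) =>
    trueValue
}
{code}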



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24835) col function ignores drop

2018-07-23 Thread Michael Souder (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553136#comment-16553136
 ] 

Michael Souder commented on SPARK-24835:


Liang-Chi, your first block actually fails for me with an AnalysisException:
{code:python}
df2 = df.drop('c')
df2.where(df['c'] < 6).show()

Py4JJavaError: An error occurred while calling o11586.filter.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) c#87962L 
missing from a#87960L,b#87961L in operator !Filter (c#87962L < cast(6 as 
bigint)).;;
{code}
But I was more interested in the difference between resolving a column with the 
bracket interface df['c'] and with F.col('c') and why one can see a previously 
dropped column and the other can't.  You mention that drop adds a projection on 
top of the original dataset.  So accessing the column with brackets applies 
after the projection, but accessing with F.col() can see before the projection?

> col function ignores drop
> -
>
> Key: SPARK-24835
> URL: https://issues.apache.org/jira/browse/SPARK-24835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5.3
>Reporter: Michael Souder
>Priority: Minor
>
> Not sure if this is a bug or user error, but I've noticed that accessing 
> columns with the col function ignores a previous call to drop.
> {code}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(1,3,5), (2, None, 7), (0, 3, 2)], ['a', 'b', 
> 'c'])
> df.show()
> +---++---+
> |  a|   b|  c|
> +---++---+
> |  1|   3|  5|
> |  2|null|  7|
> |  0|   3|  2|
> +---++---+
> df = df.drop('c')
> # the col function is able to see the 'c' column even though it has been 
> dropped
> df.where(F.col('c') < 6).show()
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  3|
> |  0|  3|
> +---+---+
> # trying the same with brackets on the data frame fails with the expected 
> error
> df.where(df['c'] < 6).show()
> Py4JJavaError: An error occurred while calling o36909.apply.
> : org.apache.spark.sql.AnalysisException: Cannot resolve column name "c" 
> among (a, b);{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-23 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553121#comment-16553121
 ] 

Ruslan Dautkhanov commented on SPARK-24615:
---

Is this work related to Spark's Project Hydrogen API?

[https://www.datanami.com/2018/06/05/project-hydrogen-unites-apache-spark-with-dl-frameworks/]
{quote}We also want to make Spark aware of accelerators so you can actually 
comfortably use FPGA or GPUs in your latest clusters.
{quote}

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that CPUs are equipped in each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets

2018-07-23 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24850.
-
   Resolution: Fixed
 Assignee: Onur Satici
Fix Version/s: 2.4.0

> Query plan string representation grows exponentially on queries with 
> recursive cached datasets
> --
>
> Key: SPARK-24850
> URL: https://issues.apache.org/jira/browse/SPARK-24850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Onur Satici
>Assignee: Onur Satici
>Priority: Major
> Fix For: 2.4.0
>
>
> As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes 
> its cacheBuilder when logging query plans. This CachedRDDBuilder includes the 
> cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan 
> in the cacheBuilder.
> Given the sample dataset:
> {code:java}
> $ cat test.csv
> A,B
> 0,0{code}
> If the query plan has multiple cached datasets that depend on each other:
> {code:java}
> var df_cached = spark.read.format("csv").option("header", 
> "true").load("test.csv").cache()
> 0 to 1 foreach { _ =>
> df_cached = df_cached.join(spark.read.format("csv").option("header", 
> "true").load("test.csv"), "A").cache()
> }
> df_cached.explain
> {code}
> results in:
> {code:java}
> == Physical Plan ==
> InMemoryTableScan [A#10, B#11, B#35, B#87]
> +- InMemoryRelation [A#10, B#11, B#35, B#87], 
> CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
> replicas),*(2) Project [A#10, B#11, B#35, B#87]
> +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight
> :- *(2) Filter isnotnull(A#10)
> : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)]
> : +- InMemoryRelation [A#10, B#11, B#35], 
> CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
> replicas),*(2) Project [A#10, B#11, B#35]
> +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
> :- *(2) Filter isnotnull(A#10)
> : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
> : +- InMemoryRelation [A#10, B#11], 
> CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
> replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, 
> Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> ,None)
> : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
> InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> false]))
> +- *(1) Filter isnotnull(A#34)
> +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
> +- InMemoryRelation [A#34, B#35], 
> CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
> replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, 
> Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> ,None)
> +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
> InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> ,None)
> : +- *(2) Project [A#10, B#11, B#35]
> : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
> : :- *(2) Filter isnotnull(A#10)
> : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
> : : +- InMemoryRelation [A#10, B#11], 
> CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
> replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, 
> Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> ,None)
> : : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
> InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> false]))
> : +- *(1) Filter isnotnull(A#34)
> : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
> : +- InMemoryRelation [A#34, B#35], 
> CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
> replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, 
> Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> ,None)
> : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: 
> InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> false]))
> +- *(1) Filter isnotnull(A#86)
> +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)]
> +- InMemoryRelation [A#86, B#87], 
> CachedRDDBuilder(true,1,StorageLevel(disk, 

[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-07-23 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553077#comment-16553077
 ] 

Xiao Li commented on SPARK-23874:
-

[~bryanc] Thank you for driving this. 

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24855) Built-in AVRO support should support specified schema on write

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24855:


Assignee: Brian Lindblom  (was: Apache Spark)

> Built-in AVRO support should support specified schema on write
> --
>
> Key: SPARK-24855
> URL: https://issues.apache.org/jira/browse/SPARK-24855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Brian Lindblom
>Assignee: Brian Lindblom
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> spark-avro appears to have been brought in from an upstream project, 
> [https://github.com/databricks/spark-avro.]  I opened a PR a while ago to 
> enable support for 'forceSchema', which allows us to specify an AVRO schema 
> with which to write our records to handle some use cases we have.  I didn't 
> get this code merged but would like to add this feature to the AVRO 
> reader/writer code that was brought in.  The PR is here and I will follow up 
> with a more formal PR/Patch rebased on spark master branch: 
> https://github.com/databricks/spark-avro/pull/222
>  
> This change allows us to specify a schema, which should be compatible with 
> the schema generated by spark-avro from the dataset definition. It lets a user 
> specify default values, change union ordering, or, in the case where an AVRO 
> data set is read in, cleansed in-line, and written back out, preserve the 
> original schema in the output container files. I've had several use cases 
> where this behavior was desired, and there were several other asks for it in 
> the spark-avro project.
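
For illustration, here is a minimal sketch of what writing with a user-supplied 
schema could look like. The "forceSchema" option name follows the linked PR and is 
not a released API, and the schema string is invented for the example.

{code:java}
// Hypothetical usage sketch: the "forceSchema" option name follows the proposed
// PR and is not a released API; the schema string below is made up.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// An explicit AVRO schema, e.g. with a default value and a chosen union ordering.
val avroSchema =
  """{
    |  "type": "record",
    |  "name": "Person",
    |  "fields": [
    |    {"name": "name", "type": "string"},
    |    {"name": "age", "type": ["int", "null"], "default": 0}
    |  ]
    |}""".stripMargin

val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "age")

df.write
  .format("avro")                      // built-in AVRO reader/writer
  .option("forceSchema", avroSchema)   // hypothetical option from the proposed PR
  .mode(SaveMode.Overwrite)
  .save("/tmp/people_avro")
{code}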



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24855) Built-in AVRO support should support specified schema on write

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24855:


Assignee: Apache Spark  (was: Brian Lindblom)

> Built-in AVRO support should support specified schema on write
> --
>
> Key: SPARK-24855
> URL: https://issues.apache.org/jira/browse/SPARK-24855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Brian Lindblom
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> spark-avro appears to have been brought in from an upstream project, 
> [https://github.com/databricks/spark-avro.]  I opened a PR a while ago to 
> enable support for 'forceSchema', which allows us to specify an AVRO schema 
> with which to write our records to handle some use cases we have.  I didn't 
> get this code merged but would like to add this feature to the AVRO 
> reader/writer code that was brought in.  The PR is here and I will follow up 
> with a more formal PR/Patch rebased on spark master branch: 
> https://github.com/databricks/spark-avro/pull/222
>  
> This change allows us to specify a schema, which should be compatible with 
> the schema generated by spark-avro from the dataset definition. It lets a user 
> specify default values, change union ordering, or, in the case where an AVRO 
> data set is read in, cleansed in-line, and written back out, preserve the 
> original schema in the output container files. I've had several use cases 
> where this behavior was desired, and there were several other asks for it in 
> the spark-avro project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24855) Built-in AVRO support should support specified schema on write

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553040#comment-16553040
 ] 

Apache Spark commented on SPARK-24855:
--

User 'lindblombr' has created a pull request for this issue:
https://github.com/apache/spark/pull/21847

> Built-in AVRO support should support specified schema on write
> --
>
> Key: SPARK-24855
> URL: https://issues.apache.org/jira/browse/SPARK-24855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Brian Lindblom
>Assignee: Brian Lindblom
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> spark-avro appears to have been brought in from an upstream project, 
> [https://github.com/databricks/spark-avro.]  I opened a PR a while ago to 
> enable support for 'forceSchema', which allows us to specify an AVRO schema 
> with which to write our records to handle some use cases we have.  I didn't 
> get this code merged but would like to add this feature to the AVRO 
> reader/writer code that was brought in.  The PR is here and I will follow up 
> with a more formal PR/Patch rebased on spark master branch: 
> https://github.com/databricks/spark-avro/pull/222
>  
> This change allows us to specify a schema, which should be compatible with 
> the schema generated by spark-avro from the dataset definition. It lets a user 
> specify default values, change union ordering, or, in the case where an AVRO 
> data set is read in, cleansed in-line, and written back out, preserve the 
> original schema in the output container files. I've had several use cases 
> where this behavior was desired, and there were several other asks for it in 
> the spark-avro project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24887) Use SerializableConfiguration in Spark util

2018-07-23 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24887.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.4.0

> Use SerializableConfiguration in Spark util
> ---
>
> Key: SPARK-24887
> URL: https://issues.apache.org/jira/browse/SPARK-24887
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> To implement the method `buildReader` in `FileFormat`, we need to serialize 
> the Hadoop configuration so it can be shipped to executors.
> Previously, spark-avro used its own `SerializableConfiguration` class for this. 
> Now that spark-avro is part of Spark, we can use the `SerializableConfiguration` 
> in Spark util and deduplicate the code.
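
A minimal sketch of the intended usage, assuming the code lives inside Spark itself 
(the class is `private[spark]` at this point), e.g. in a `FileFormat` implementation:

{code:java}
// Minimal sketch; SerializableConfiguration is private[spark], so this is only
// usable from code inside the Spark project.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.SerializableConfiguration

val spark = SparkSession.builder().getOrCreate()
val hadoopConf: Configuration = spark.sparkContext.hadoopConfiguration

// Wrap the non-serializable Hadoop Configuration so it can be shipped to executors.
val serializableConf = new SerializableConfiguration(hadoopConf)
val broadcastConf = spark.sparkContext.broadcast(serializableConf)

// Later, inside a task running on an executor:
// val conf: Configuration = broadcastConf.value.value
{code}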



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24802) Optimization Rule Exclusion

2018-07-23 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24802.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 2.4.0

> Optimization Rule Exclusion
> ---
>
> Key: SPARK-24802
> URL: https://issues.apache.org/jira/browse/SPARK-24802
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 2.4.0
>
>
> Since Spark has provided fairly clear interfaces for adding user-defined 
> optimization rules, it would be nice to have an easy-to-use interface for 
> excluding an optimization rule from the Spark query optimizer as well.
> This would make customizing the Spark optimizer easier and could sometimes help 
> with debugging issues too (see the usage sketch after this list).
>  # Add a new config {{spark.sql.optimizer.excludedRules}}, whose value is a 
> comma-separated list of rule names.
>  # Modify the current {{batches}} method to remove the excluded rules from 
> the default batches. Log the rules that have been excluded.
>  # Split the existing default batches into "post-analysis batches" and 
> "optimization batches" so that only rules in the "optimization batches" can 
> be excluded.
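
A minimal usage sketch of the proposed config; the rule name is just an example of 
an existing optimizer rule:

{code:java}
// Usage sketch of the proposed config; the rule name below is an existing
// optimizer rule and is shown only as an example.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Exclude one or more optimizer rules by fully qualified name, comma separated.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
{code}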



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data

2018-07-23 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552965#comment-16552965
 ] 

Takeshi Yamamuro commented on SPARK-24869:
--

Yeah, I made the branch just for this test purpose (sorry for the confusion).
 Actually, I think this query does use the cache internally when saving output data.
 For example, to check this, I added a test-only metric for counting cache 
hits, and the test below passed:
 [https://github.com/apache/spark/compare/master...maropu:SPARK-24869-2]

`SaveIntoDataSourceCommand` has a logical plan (`query`) to save. It 
internally wraps this plan in a `Dataset` and passes that inner dataset into 
`DataSource.createRelation` at runtime (in `RunnableCommand.run`);
 
[https://github.com/apache/spark/blob/ab18b02e66fd04bc8f1a4fb7b6a7f2773902a494/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala#L46]

The inner dataset replaces the logical plan with a cached plan, then 
`DataSource.createRelation` saves the output of the cached plan.
 In the case of the JDBC data source, `DataSource.createRelation` calls 
`JdbcUtils.saveTable`.
 Since `saveTable` directly references the RDD of the inner dataset, 
`QueryExecutionListener` cannot observe the execution of the logical plan 
with the cached data.
 
[https://github.com/apache/spark/blob/ab18b02e66fd04bc8f1a4fb7b6a7f2773902a494/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L834]

Moreover, in the test in the description, `QueryExecution.withCachedData` is 
called on the save command (SaveIntoDataSourceCommand). But, IIUC, this operation 
is effectively a no-op because `CacheManager` doesn't replace an inner logical plan 
(`innerChildren`) with a cached one. So, `SaveIntoDataSourceCommand.query` is 
always just an analyzed logical plan.

> SaveIntoDataSourceCommand's input Dataset does not use Cached Data
> --
>
> Key: SPARK-24869
> URL: https://issues.apache.org/jira/browse/SPARK-24869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> withTable("t") {
>   withTempPath { path =>
>     var numTotalCachedHit = 0
>     val listener = new QueryExecutionListener {
>       override def onFailure(f: String, qe: QueryExecution, e: Exception): Unit = {}
>       override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = {
>         qe.withCachedData match {
>           case c: SaveIntoDataSourceCommand
>               if c.query.isInstanceOf[InMemoryRelation] =>
>             numTotalCachedHit += 1
>           case _ =>
>             println(qe.withCachedData)
>         }
>       }
>     }
>     spark.listenerManager.register(listener)
>     val udf1 = udf({ (x: Int, y: Int) => x + y })
>     val df = spark.range(0, 3).toDF("a")
>       .withColumn("b", udf1(col("a"), lit(10)))
>     df.cache()
>     df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", properties)
>     assert(numTotalCachedHit == 1, "expected to be cached in jdbc")
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats

2018-07-23 Thread Yuri Bogomolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuri Bogomolov updated SPARK-24889:
---
Description: 
Steps to reproduce:

1) Start a Spark cluster, and check the storage memory value from the Spark Web 
UI "Executors" tab (it should be equal to zero if you just started)

2) Run:
{code:java}
val df = spark.sqlContext.range(1, 10)
df.cache()
df.count()
df.unpersist(true){code}
3) Check the storage memory value again, now it's equal to 1GB

 

Looks like the memory is actually released, but stats aren't updated. This 
issue makes cluster management more complicated.

!image-2018-07-23-10-53-58-474.png!

  was:
Steps to reproduce:

1) Start a Spark cluster, and check the storage memory value from the Spark Web 
UI "Executors" tab (it should be equal to zero if you just started)

2) Run:
{code:java}
val df = spark.sqlContext.range(1, 10)
df.cache()
df.count()
df.unpersist(true){code}
3) Check the storage memory value again, now it's equal to 1GB

 

Looks like the memory is actually released, but stats aren't updated. This 
issue makes cluster management more complicated.

!image-2018-07-23-10-51-31-140.png!


> dataset.unpersist() doesn't update storage memory stats
> ---
>
> Key: SPARK-24889
> URL: https://issues.apache.org/jira/browse/SPARK-24889
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuri Bogomolov
>Priority: Major
> Attachments: image-2018-07-23-10-53-58-474.png
>
>
> Steps to reproduce:
> 1) Start a Spark cluster, and check the storage memory value from the Spark 
> Web UI "Executors" tab (it should be equal to zero if you just started)
> 2) Run:
> {code:java}
> val df = spark.sqlContext.range(1, 10)
> df.cache()
> df.count()
> df.unpersist(true){code}
> 3) Check the storage memory value again, now it's equal to 1GB
>  
> Looks like the memory is actually released, but stats aren't updated. This 
> issue makes cluster management more complicated.
> !image-2018-07-23-10-53-58-474.png!
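
Side note: one rough way to cross-check the numbers from the driver rather than the 
Web UI (a sketch; these figures come from the block manager master, not from the UI 
status store that appears stale here):

{code:java}
// Sketch: cross-check storage memory from the driver instead of the Web UI.
// getExecutorMemoryStatus returns, per executor, (max memory available for
// caching, remaining memory available for caching).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.range(1, 10).toDF()
df.cache()
df.count()
println(spark.sparkContext.getExecutorMemoryStatus)  // remaining memory drops while cached

df.unpersist(true)
println(spark.sparkContext.getExecutorMemoryStatus)  // remaining memory should recover
{code}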



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats

2018-07-23 Thread Yuri Bogomolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuri Bogomolov updated SPARK-24889:
---
Attachment: image-2018-07-23-10-53-58-474.png

> dataset.unpersist() doesn't update storage memory stats
> ---
>
> Key: SPARK-24889
> URL: https://issues.apache.org/jira/browse/SPARK-24889
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuri Bogomolov
>Priority: Major
> Attachments: image-2018-07-23-10-53-58-474.png
>
>
> Steps to reproduce:
> 1) Start a Spark cluster, and check the storage memory value from the Spark 
> Web UI "Executors" tab (it should be equal to zero if you just started)
> 2) Run:
> {code:java}
> val df = spark.sqlContext.range(1, 10)
> df.cache()
> df.count()
> df.unpersist(true){code}
> 3) Check the storage memory value again, now it's equal to 1GB
>  
> Looks like the memory is actually released, but stats aren't updated. This 
> issue makes cluster management more complicated.
> !image-2018-07-23-10-51-31-140.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats

2018-07-23 Thread Yuri Bogomolov (JIRA)
Yuri Bogomolov created SPARK-24889:
--

 Summary: dataset.unpersist() doesn't update storage memory stats
 Key: SPARK-24889
 URL: https://issues.apache.org/jira/browse/SPARK-24889
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Yuri Bogomolov


Steps to reproduce:

1) Start a Spark cluster, and check the storage memory value from the Spark Web 
UI "Executors" tab (it should be equal to zero if you just started)

2) Run:
{code:java}
val df = spark.sqlContext.range(1, 10)
df.cache()
df.count()
df.unpersist(true){code}
3) Check the storage memory value again, now it's equal to 1GB

 

Looks like the memory is actually released, but stats aren't updated. This 
issue makes cluster management more complicated.

!image-2018-07-23-10-51-31-140.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-23 Thread Brad (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552847#comment-16552847
 ] 

Brad commented on SPARK-21097:
--

No, I did not consider cpu utilization when redistributing the cache. It might 
be a good idea, but I'm not sure how you would implement it.

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.
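
For context, a sketch of the settings that exist today around this trade-off (the 
proposed replicate-before-deallocation switch has no released config name, so it is 
not shown):

{code:java}
// Sketch of the settings that exist today around this trade-off; the proposed
// "replicate cached blocks before deallocation" switch has no released config
// name yet, so it is not shown here.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  // Idle timeout for executors holding cached blocks; defaults to infinity,
  // i.e. executors with cached data are never deallocated.
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30min")
  // Idle timeout for executors without cached blocks.
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
{code}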



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work

2018-07-23 Thread srinivasan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

srinivasan updated SPARK-24888:
---
Description: 
spark-submit --master spark://host:port --status driver-id

does not return anything. The command terminates without any error or output.

Behaviour is the same on Linux and Windows.

  was:
spark-submit --master spark://host:port --status driver-id

does not return anything. The command terminates without any error or output.

The output is the same from linux and windows


> spark-submit --master spark://host:port --status driver-id does not work 
> -
>
> Key: SPARK-24888
> URL: https://issues.apache.org/jira/browse/SPARK-24888
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: srinivasan
>Priority: Major
>
> spark-submit --master spark://host:port --status driver-id
> does not return anything. The command terminates without any error or output.
> Behaviour is the same on Linux and Windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work

2018-07-23 Thread srinivasan (JIRA)
srinivasan created SPARK-24888:
--

 Summary: spark-submit --master spark://host:port --status 
driver-id does not work 
 Key: SPARK-24888
 URL: https://issues.apache.org/jira/browse/SPARK-24888
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.1
Reporter: srinivasan


spark-submit --master spark://host:port --status driver-id

does not return anything. The command terminates without any error or output.

The output is the same on Linux and Windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24887) Use SerializableConfiguration in Spark util

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552558#comment-16552558
 ] 

Apache Spark commented on SPARK-24887:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21846

> Use SerializableConfiguration in Spark util
> ---
>
> Key: SPARK-24887
> URL: https://issues.apache.org/jira/browse/SPARK-24887
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> To implement the method `buildReader` in `FileFormat`, we need to serialize 
> the Hadoop configuration so it can be shipped to executors.
> Previously, spark-avro used its own `SerializableConfiguration` class for this. 
> Now that spark-avro is part of Spark, we can use the `SerializableConfiguration` 
> in Spark util and deduplicate the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24887) Use SerializableConfiguration in Spark util

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24887:


Assignee: (was: Apache Spark)

> Use SerializableConfiguration in Spark util
> ---
>
> Key: SPARK-24887
> URL: https://issues.apache.org/jira/browse/SPARK-24887
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> To implement the method `buildReader` in `FileFormat`, we need to serialize 
> the Hadoop configuration so it can be shipped to executors.
> Previously, spark-avro used its own `SerializableConfiguration` class for this. 
> Now that spark-avro is part of Spark, we can use the `SerializableConfiguration` 
> in Spark util and deduplicate the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24887) Use SerializableConfiguration in Spark util

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24887:


Assignee: Apache Spark

> Use SerializableConfiguration in Spark util
> ---
>
> Key: SPARK-24887
> URL: https://issues.apache.org/jira/browse/SPARK-24887
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> To implement the method `buildReader` in `FileFormat`, we need to serialize 
> the Hadoop configuration so it can be shipped to executors.
> Previously, spark-avro used its own `SerializableConfiguration` class for this. 
> Now that spark-avro is part of Spark, we can use the `SerializableConfiguration` 
> in Spark util and deduplicate the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24887) Use SerializableConfiguration in Spark util

2018-07-23 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-24887:
--

 Summary: Use SerializableConfiguration in Spark util
 Key: SPARK-24887
 URL: https://issues.apache.org/jira/browse/SPARK-24887
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


To implement the method `buildReader` in `FileFormat`, we need to serialize the 
Hadoop configuration so it can be shipped to executors.

Previously, spark-avro used its own `SerializableConfiguration` class for this. Now 
that spark-avro is part of Spark, we can use the `SerializableConfiguration` in 
Spark util and deduplicate the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24883) Remove implicit class AvroDataFrameWriter/AvroDataFrameReader

2018-07-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24883.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21841
[https://github.com/apache/spark/pull/21841]

> Remove implicit class AvroDataFrameWriter/AvroDataFrameReader
> -
>
> Key: SPARK-24883
> URL: https://issues.apache.org/jira/browse/SPARK-24883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> As per Reynold's comment: 
> [https://github.com/apache/spark/pull/21742#discussion_r203496489]
> It makes sense to remove the implicit class 
> AvroDataFrameWriter/AvroDataFrameReader, since the Avro package is an external 
> module.
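
For reference, a sketch of the replacement usage once the implicits are removed, 
assuming the external Avro module is on the classpath:

{code:java}
// Sketch: with the implicit AvroDataFrameReader/AvroDataFrameWriter removed,
// use the generic format API (requires the external Avro module on the classpath).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Instead of spark.read.avro("/path/in.avro") from the removed implicit:
val df = spark.read.format("avro").load("/path/in.avro")

// Instead of df.write.avro("/path/out.avro"):
df.write.format("avro").save("/path/out.avro")
{code}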



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24883) Remove implicit class AvroDataFrameWriter/AvroDataFrameReader

2018-07-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24883:


Assignee: Gengliang Wang

> Remove implicit class AvroDataFrameWriter/AvroDataFrameReader
> -
>
> Key: SPARK-24883
> URL: https://issues.apache.org/jira/browse/SPARK-24883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> As per Reynold's comment: 
> [https://github.com/apache/spark/pull/21742#discussion_r203496489]
> It makes sense to remove the implicit class 
> AvroDataFrameWriter/AvroDataFrameReader, since the Avro package is an external 
> module.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24883) Remove implicit class AvroDataFrameWriter/AvroDataFrameReader

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24883:


Assignee: (was: Apache Spark)

> Remove implicit class AvroDataFrameWriter/AvroDataFrameReader
> -
>
> Key: SPARK-24883
> URL: https://issues.apache.org/jira/browse/SPARK-24883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> As per Reynold's comment: 
> [https://github.com/apache/spark/pull/21742#discussion_r203496489]
> It makes sense to remove the implicit class 
> AvroDataFrameWriter/AvroDataFrameReader, since the Avro package is an external 
> module.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24883) Remove implicit class AvroDataFrameWriter/AvroDataFrameReader

2018-07-23 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552410#comment-16552410
 ] 

Apache Spark commented on SPARK-24883:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21841

> Remove implicit class AvroDataFrameWriter/AvroDataFrameReader
> -
>
> Key: SPARK-24883
> URL: https://issues.apache.org/jira/browse/SPARK-24883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> As per Reynold's comment: 
> [https://github.com/apache/spark/pull/21742#discussion_r203496489]
> It makes sense to remove the implicit class 
> AvroDataFrameWriter/AvroDataFrameReader, since the Avro package is an external 
> module.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24883) Remove implicit class AvroDataFrameWriter/AvroDataFrameReader

2018-07-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24883:


Assignee: Apache Spark

> Remove implicit class AvroDataFrameWriter/AvroDataFrameReader
> -
>
> Key: SPARK-24883
> URL: https://issues.apache.org/jira/browse/SPARK-24883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> As per Reynold's comment: 
> [https://github.com/apache/spark/pull/21742#discussion_r203496489]
> It makes sense to remove the implicit class 
> AvroDataFrameWriter/AvroDataFrameReader, since the Avro package is an external 
> module.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org