[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-01-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310872#comment-16310872
 ] 

Felix Cheung commented on SPARK-16693:
--

I thought we did, but I couldn't find any record.
I suppose we keep this until 2.4.0.

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> Methods deprecated in Spark 2.0.0 should be removed in 2.1.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22771) SQL concat for binary

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310847#comment-16310847
 ] 

Apache Spark commented on SPARK-22771:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20149

> SQL concat for binary 
> --
>
> Key: SPARK-22771
> URL: https://issues.apache.org/jira/browse/SPARK-22771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Fernando Pereira
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> The spark.sql {{concat}} function automatically casts its arguments to StringType 
> and returns a String.
> This might be the behavior of traditional databases; however, Spark has 
> Binary as a standard type, and concatenating binary values seems reasonable if the 
> result is another binary sequence.
> Taking Python as an example, where both {{bytes}} and {{unicode}} 
> represent text, concatenating two values of the same type yields that same 
> type, and in case they are intermixed (str + unicode) the most generic 
> type is returned (unicode).
> Following the same principle, I believe that concatenating binary values should 
> return a binary.
> In terms of Spark behavior, this would affect only the case where all arguments 
> are binary; all other cases should remain unchanged.
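> A rough sketch of the proposed behavior using the DataFrame API (column names are 
> made up; this is not the actual patch):
> {code}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.concat
> 
> val spark = SparkSession.builder().master("local[*]").appName("binary-concat").getOrCreate()
> import spark.implicits._
> 
> // Two binary columns built from byte arrays.
> val df = Seq((Array[Byte](1, 2), Array[Byte](3, 4))).toDF("a", "b")
> 
> // Today the result column is StringType because the arguments are cast to string;
> // under this proposal an all-binary argument list would yield BinaryType instead.
> df.select(concat($"a", $"b")).printSchema()
> {code}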



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22953:


Assignee: (was: Apache Spark)

> Duplicated secret volumes in Spark pods when init-containers are used
> -
>
> Key: SPARK-22953
> URL: https://issues.apache.org/jira/browse/SPARK-22953
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
> Fix For: 2.3.0
>
>
> User-specified secrets are mounted into both the main container and the 
> init-container (when one is used) in a Spark driver/executor pod, using 
> {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the 
> secret volumes to the pod, the same secret volumes get added twice: once when 
> mounting the secrets into the main container, and again when mounting the 
> secrets into the init-container. See 
> https://github.com/apache-spark-on-k8s/spark/issues/594.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310835#comment-16310835
 ] 

Apache Spark commented on SPARK-22953:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/20148

> Duplicated secret volumes in Spark pods when init-containers are used
> -
>
> Key: SPARK-22953
> URL: https://issues.apache.org/jira/browse/SPARK-22953
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
> Fix For: 2.3.0
>
>
> User-specified secrets are mounted into both the main container and the 
> init-container (when one is used) in a Spark driver/executor pod, using 
> {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the 
> secret volumes to the pod, the same secret volumes get added twice: once when 
> mounting the secrets into the main container, and again when mounting the 
> secrets into the init-container. See 
> https://github.com/apache-spark-on-k8s/spark/issues/594.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22953:


Assignee: Apache Spark

> Duplicated secret volumes in Spark pods when init-containers are used
> -
>
> Key: SPARK-22953
> URL: https://issues.apache.org/jira/browse/SPARK-22953
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Apache Spark
> Fix For: 2.3.0
>
>
> User-specified secrets are mounted into both the main container and the 
> init-container (when one is used) in a Spark driver/executor pod, using 
> {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the 
> secret volumes to the pod, the same secret volumes get added twice: once when 
> mounting the secrets into the main container, and again when mounting the 
> secrets into the init-container. See 
> https://github.com/apache-spark-on-k8s/spark/issues/594.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22933) R Structured Streaming API for withWatermark, trigger, partitionBy

2018-01-03 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-22933.
--
  Resolution: Fixed
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> R Structured Streaming API for withWatermark, trigger, partitionBy
> --
>
> Key: SPARK-22933
> URL: https://issues.apache.org/jira/browse/SPARK-22933
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2018-01-03 Thread Sandeep Kumar Choudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310776#comment-16310776
 ] 

Sandeep Kumar Choudhary commented on SPARK-18844:
-

I am working on this JIRA.

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.
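> As a rough illustration of what the missing metrics require today (not the 
> proposed API), the confusion counts at a single threshold can be recomputed from 
> the raw (score, label) pairs; the names below are made up:
> {code}
> import org.apache.spark.rdd.RDD
> 
> // Confusion counts (tp, fp, tn, fn) at one threshold, recomputed from scratch
> // because BinaryClassificationMetrics keeps its internal counts private.
> def confusionAt(scoreAndLabels: RDD[(Double, Double)], threshold: Double): (Long, Long, Long, Long) = {
>   val flags = scoreAndLabels.map { case (score, label) => (score >= threshold, label > 0.5) }.cache()
>   val tp = flags.filter { case (p, l) => p && l }.count()
>   val fp = flags.filter { case (p, l) => p && !l }.count()
>   val tn = flags.filter { case (p, l) => !p && !l }.count()
>   val fn = flags.filter { case (p, l) => !p && l }.count()
>   (tp, fp, tn, fn)
> }
> 
> // Examples of the desired metrics derived from those counts.
> def fpr(fp: Long, tn: Long): Double = fp.toDouble / (fp + tn)
> def specificity(fp: Long, tn: Long): Double = tn.toDouble / (tn + fp)
> {code}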



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22944) improve FoldablePropagation

2018-01-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22944.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> improve FoldablePropagation
> ---
>
> Key: SPARK-22944
> URL: https://issues.apache.org/jira/browse/SPARK-22944
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used

2018-01-03 Thread Yinan Li (JIRA)
Yinan Li created SPARK-22953:


 Summary: Duplicated secret volumes in Spark pods when 
init-containers are used
 Key: SPARK-22953
 URL: https://issues.apache.org/jira/browse/SPARK-22953
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Yinan Li
 Fix For: 2.3.0


User-specified secrets are mounted into both the main container and the 
init-container (when one is used) in a Spark driver/executor pod, using 
{{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the 
secret volumes to the pod, the same secret volumes get added twice: once when 
mounting the secrets into the main container, and again when mounting the 
secrets into the init-container. See 
https://github.com/apache-spark-on-k8s/spark/issues/594.
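A minimal, self-contained sketch of the problem and one possible fix, using made-up 
case classes rather than Spark's actual Kubernetes classes:

{code}
// Hypothetical stand-ins for the pod spec; not Spark's real MountSecretsBootstrap API.
case class Volume(name: String, secretName: String)
case class PodSpec(volumes: Seq[Volume])

// A bootstrap that unconditionally appends the secret volumes adds them once per
// container it mounts into (main container + init-container), duplicating them.
def mountSecrets(pod: PodSpec, secrets: Seq[Volume]): PodSpec =
  pod.copy(volumes = pod.volumes ++ secrets)

// One possible fix: deduplicate volumes by name before building the final pod.
def dedupVolumes(pod: PodSpec): PodSpec =
  pod.copy(volumes = pod.volumes.groupBy(_.name).map(_._2.head).toSeq)

val secrets = Seq(Volume("spark-secret", "my-secret"))
val pod = mountSecrets(mountSecrets(PodSpec(Nil), secrets), secrets)
assert(pod.volumes.size == 2)               // duplicated today
assert(dedupVolumes(pod).volumes.size == 1) // deduplicated
{code}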



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22940:


Assignee: (was: Apache Spark)

> Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't 
> have wget installed
> -
>
> Key: SPARK-22940
> URL: https://issues.apache.org/jira/browse/SPARK-22940
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
> Environment: MacOS Sierra 10.12.6
>Reporter: Bruce Robbins
>Priority: Minor
>
> On platforms that don't have wget installed (e.g., Mac OS X), test suite 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an 
> exception and aborts:
> java.io.IOException: Cannot run program "wget": error=2, No such file or 
> directory
> HiveExternalCatalogVersionsSuite uses wget to download older versions of 
> Spark for compatibility testing. First it uses wget to find a suitable 
> mirror, and then it uses wget to download a tar file from the mirror.
> There are several ways to fix this (in reverse order of difficulty of 
> implementation):
> 1. Require Mac OS X users to install wget if they wish to run unit tests (or 
> at the very least if they wish to run HiveExternalCatalogVersionsSuite). 
> Also, update documentation to make this requirement explicit.
> 2. Fall back on curl when wget is not available.
> 3. Use an HTTP library to query for a suitable mirror and download the tar 
> file.
> Number 2 is easy to implement, and I did so to get the unit test to run. But 
> it relies on another external program if wget is not installed.
> Number 3 is probably slightly more complex to implement and requires more 
> corner-case checking (e.g., redirects, etc.).
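> A hedged sketch of option 2 (not the actual test-suite change), using 
> scala.sys.process to try wget first and fall back to curl when wget is missing:
> {code}
> import scala.sys.process._
> import scala.util.Try
> 
> // Prefers wget; falls back to curl if wget is not on the PATH or fails.
> def download(url: String, outFile: String): Unit = {
>   val wgetExit = Try(Seq("wget", "-O", outFile, url).!).toOption
>   if (!wgetExit.contains(0)) {
>     val curlExit = Seq("curl", "-L", "-o", outFile, url).!
>     require(curlExit == 0, s"both wget and curl failed to download $url")
>   }
> }
> {code}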



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310717#comment-16310717
 ] 

Apache Spark commented on SPARK-22940:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/20147

> Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't 
> have wget installed
> -
>
> Key: SPARK-22940
> URL: https://issues.apache.org/jira/browse/SPARK-22940
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
> Environment: MacOS Sierra 10.12.6
>Reporter: Bruce Robbins
>Priority: Minor
>
> On platforms that don't have wget installed (e.g., Mac OS X), test suite 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an 
> exception and aborts:
> java.io.IOException: Cannot run program "wget": error=2, No such file or 
> directory
> HiveExternalCatalogVersionsSuite uses wget to download older versions of 
> Spark for compatibility testing. First it uses wget to find a suitable 
> mirror, and then it uses wget to download a tar file from the mirror.
> There are several ways to fix this (in reverse order of difficulty of 
> implementation):
> 1. Require Mac OS X users to install wget if they wish to run unit tests (or 
> at the very least if they wish to run HiveExternalCatalogVersionsSuite). 
> Also, update documentation to make this requirement explicit.
> 2. Fall back on curl when wget is not available.
> 3. Use an HTTP library to query for a suitable mirror and download the tar 
> file.
> Number 2 is easy to implement, and I did so to get the unit test to run. But 
> it relies on another external program if wget is not installed.
> Number 3 is probably slightly more complex to implement and requires more 
> corner-case checking (e.g., redirects, etc.).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22940:


Assignee: Apache Spark

> Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't 
> have wget installed
> -
>
> Key: SPARK-22940
> URL: https://issues.apache.org/jira/browse/SPARK-22940
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
> Environment: MacOS Sierra 10.12.6
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> On platforms that don't have wget installed (e.g., Mac OS X), test suite 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an 
> exception and aborts:
> java.io.IOException: Cannot run program "wget": error=2, No such file or 
> directory
> HiveExternalCatalogVersionsSuite uses wget to download older versions of 
> Spark for compatibility testing. First it uses wget to find a suitable 
> mirror, and then it uses wget to download a tar file from the mirror.
> There are several ways to fix this (in reverse order of difficulty of 
> implementation):
> 1. Require Mac OS X users to install wget if they wish to run unit tests (or 
> at the very least if they wish to run HiveExternalCatalogVersionsSuite). 
> Also, update documentation to make this requirement explicit.
> 2. Fall back on curl when wget is not available.
> 3. Use an HTTP library to query for a suitable mirror and download the tar 
> file.
> Number 2 is easy to implement, and I did so to get the unit test to run. But 
> it relies on another external program if wget is not installed.
> Number 3 is probably slightly more complex to implement and requires more 
> corner-case checking (e.g., redirects, etc.).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-01-03 Thread Prateek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310706#comment-16310706
 ] 

Prateek commented on SPARK-22711:
-

Sorry. I will update it.

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown.
> It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", 

[jira] [Commented] (SPARK-11215) Add multiple columns support to StringIndexer

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310690#comment-16310690
 ] 

Apache Spark commented on SPARK-11215:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20146

> Add multiple columns support to StringIndexer
> -
>
> Key: SPARK-11215
> URL: https://issues.apache.org/jira/browse/SPARK-11215
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add multiple-column support to StringIndexer so that users can transform 
> multiple input columns to multiple output columns simultaneously. See the 
> discussion in SPARK-8418.
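> For context, a sketch of the status quo this would improve on: today each column 
> needs its own StringIndexer stage (one pass over the data per column). Column names 
> here are made up:
> {code}
> import org.apache.spark.ml.{Pipeline, PipelineStage}
> import org.apache.spark.ml.feature.StringIndexer
> 
> val cols = Seq("city", "country")
> val indexers: Array[PipelineStage] = cols.map { c =>
>   new StringIndexer().setInputCol(c).setOutputCol(s"${c}Index")
> }.toArray
> 
> // With multi-column support, a single StringIndexer stage could replace this pipeline.
> val pipeline = new Pipeline().setStages(indexers)
> // val model = pipeline.fit(df)  // df: a DataFrame containing the columns above
> {code}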



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22952) Deprecate stageAttemptId in favour of stageAttemptNumber

2018-01-03 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310672#comment-16310672
 ] 

Xianjin YE commented on SPARK-22952:


I will send a PR soon.

> Deprecate stageAttemptId in favour of stageAttemptNumber
> 
>
> Key: SPARK-22952
> URL: https://issues.apache.org/jira/browse/SPARK-22952
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Xianjin YE
>Priority: Minor
> Fix For: 2.3.0
>
>
> As discussed in [PR-20082|https://github.com/apache/spark/pull/20082] for 
> SPARK-22897, we prefer stageAttemptNumber over stageAttemptId.
> This is a follow-up to deprecate stageAttemptId, which will make the public APIs 
> more consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22952) Deprecate stageAttemptId in favour of stageAttemptNumber

2018-01-03 Thread Xianjin YE (JIRA)
Xianjin YE created SPARK-22952:
--

 Summary: Deprecate stageAttemptId in favour of stageAttemptNumber
 Key: SPARK-22952
 URL: https://issues.apache.org/jira/browse/SPARK-22952
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1, 2.1.2
Reporter: Xianjin YE
Priority: Minor
 Fix For: 2.3.0


As discussed in [PR-20082|https://github.com/apache/spark/pull/20082] for 
SPARK-22897, we prefer stageAttemptNumber over stageAttemptId.

This is a follow-up to deprecate stageAttemptId, which will make the public APIs 
more consistent.
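A minimal sketch of the usual deprecation pattern being proposed, using a made-up 
class name rather than Spark's actual StageInfo/TaskContext API:

{code}
// Keep the old name as a deprecated alias that forwards to the new one.
class ExampleStageInfo(val stageAttemptNumber: Int) {
  @deprecated("Use stageAttemptNumber instead", "2.3.0")
  def stageAttemptId: Int = stageAttemptNumber
}
{code}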



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22950) user classpath first cause no class found exception

2018-01-03 Thread Kent Yao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-22950:
-
Description: 

{code:java}
bin/spark-submit --class some.class.enbaled.hive.Support --master yarn 
--deploy-mode client  --conf spark.driver.userClassPathFirst=true  the.jar
{code}

Setting spark.driver.userClassPathFirst=true with the default built-in Hive jars 
causes a ClassNotFoundException while initializing the Hive client:
{code:java}
Please make sure that jars for your version of hive and hadoop are included in 
the paths passed to spark.sql.hive.metastore.jars.
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
... 29 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
... 46 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/session/SessionState
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:136)
... 51 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.session.SessionState
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 52 more
{code}

  was:

{code:java}
Please make sure that jars for your version of hive and hadoop are included in 
the paths passed to spark.sql.hive.metastore.jars.
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
at 

[jira] [Commented] (SPARK-22950) user classpath first cause no class found exception

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310630#comment-16310630
 ] 

Apache Spark commented on SPARK-22950:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/20145

> user classpath first cause no class found exception
> ---
>
> Key: SPARK-22950
> URL: https://issues.apache.org/jira/browse/SPARK-22950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Kent Yao
>
> {code:java}
> Please make sure that jars for your version of hive and hadoop are included 
> in the paths passed to spark.sql.hive.metastore.jars.
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
> at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105)
> at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
> at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
> at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
> ... 29 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
> ... 46 more
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hive/ql/session/SessionState
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:136)
> ... 51 more
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.ql.session.SessionState
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 52 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22950) user classpath first cause no class found exception

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22950:


Assignee: (was: Apache Spark)

> user classpath first cause no class found exception
> ---
>
> Key: SPARK-22950
> URL: https://issues.apache.org/jira/browse/SPARK-22950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Kent Yao
>
> {code:java}
> Please make sure that jars for your version of hive and hadoop are included 
> in the paths passed to spark.sql.hive.metastore.jars.
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
> at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105)
> at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
> at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
> at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
> ... 29 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
> ... 46 more
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hive/ql/session/SessionState
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:136)
> ... 51 more
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.ql.session.SessionState
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 52 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22950) user classpath first cause no class found exception

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22950:


Assignee: Apache Spark

> user classpath first cause no class found exception
> ---
>
> Key: SPARK-22950
> URL: https://issues.apache.org/jira/browse/SPARK-22950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Kent Yao
>Assignee: Apache Spark
>
> {code:java}
> Please make sure that jars for your version of hive and hadoop are included 
> in the paths passed to spark.sql.hive.metastore.jars.
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
> at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105)
> at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
> at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
> at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
> at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
> ... 29 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
> ... 46 more
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hive/ql/session/SessionState
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:136)
> ... 51 more
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.ql.session.SessionState
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 52 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21475) Change the usage of FileInputStream/OutputStream to Files.newInput/OutputStream in the critical path

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310619#comment-16310619
 ] 

Apache Spark commented on SPARK-21475:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20144

> Change the usage of FileInputStream/OutputStream to 
> Files.newInput/OutputStream in the critical path
> 
>
> Key: SPARK-21475
> URL: https://issues.apache.org/jira/browse/SPARK-21475
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
>
> Java's {{FileInputStream}} and {{FileOutputStream}} override {{finalize()}}; even 
> when such a stream is closed correctly and promptly, it still leaves a memory 
> footprint that only gets cleaned up in a full GC. This introduces two side effects:
> 1. Lots of Finalizer-related objects are kept in memory, which increases the memory 
> overhead. In our external shuffle service use case, a busy shuffle service 
> accumulates a bunch of these objects and can potentially lead to OOM.
> 2. The finalizers are only run during full GC, which increases the overhead of 
> full GC and leads to long GC pauses.
> So, to fix this potential issue, we propose to use NIO's 
> Files#newInput/OutputStream instead in some critical paths like shuffle.
> https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful
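> A minimal sketch of the proposed swap (illustrative only, not the actual patch):
> {code}
> import java.io.{File, InputStream, OutputStream}
> import java.nio.file.Files
> 
> // Streams from Files.newInputStream/newOutputStream do not register a finalizer,
> // unlike new FileInputStream(file) / new FileOutputStream(file).
> def openForRead(file: File): InputStream = Files.newInputStream(file.toPath)
> def openForWrite(file: File): OutputStream = Files.newOutputStream(file.toPath)
> {code}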



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-03 Thread Michael Dreibelbis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Dreibelbis updated SPARK-22951:
---
Summary: count() after dropDuplicates() on emptyDataFrame returns incorrect 
value  (was: count() after dropDuplicates() on emptyDataFrame() returns 
incorrect value)

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Michael Dreibelbis
>
> Here is a minimal Spark application to reproduce the issue:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame() returns incorrect value

2018-01-03 Thread Michael Dreibelbis (JIRA)
Michael Dreibelbis created SPARK-22951:
--

 Summary: count() after dropDuplicates() on emptyDataFrame() 
returns incorrect value
 Key: SPARK-22951
 URL: https://issues.apache.org/jira/browse/SPARK-22951
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: Michael Dreibelbis


Here is a minimal Spark application to reproduce the issue:

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}


object DropDupesApp extends App {
  
  override def main(args: Array[String]): Unit = {
val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local")
val sc = new SparkContext(conf)
val sql = SQLContext.getOrCreate(sc)
assert(sql.emptyDataFrame.count == 0) // expected
assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
  }
  
}
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22950) user classpath first cause no class found exception

2018-01-03 Thread Kent Yao (JIRA)
Kent Yao created SPARK-22950:


 Summary: user classpath first cause no class found exception
 Key: SPARK-22950
 URL: https://issues.apache.org/jira/browse/SPARK-22950
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.1.0
Reporter: Kent Yao



{code:java}
Please make sure that jars for your version of hive and hadoop are included in 
the paths passed to spark.sql.hive.metastore.jars.
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
... 29 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
... 46 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/session/SessionState
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:136)
... 51 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.session.SessionState
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 52 more
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21926:
--
Target Version/s:   (was: 2.3.0)

> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> -b) Allow user to set the cardinality of OneHotEncoder.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2018-01-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310582#comment-16310582
 ] 

Joseph K. Bradley commented on SPARK-21926:
---

This will be mostly done in Spark 2.3, but there will be a few items left for 
Spark 2.4.  I'll retarget this.

> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> -b) Allow user to set the cardinality of OneHotEncoder.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310580#comment-16310580
 ] 

Joseph K. Bradley commented on SPARK-22797:
---

[~mlnick]  Shall we retarget this and the other multi-column support JIRAs 
you're shepherding to Spark version 2.4?

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22883:
--
Target Version/s:   (was: 2.3.0)

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22887) ML test for StructuredStreaming: spark.ml.fpm

2018-01-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310579#comment-16310579
 ] 

Joseph K. Bradley commented on SPARK-22887:
---

Not currently, I think, so please go ahead!  It should be pretty 
straightforward, based on the existing tests (e.g., for Linear Regression).

> ML test for StructuredStreaming: spark.ml.fpm
> -
>
> Key: SPARK-22887
> URL: https://issues.apache.org/jira/browse/SPARK-22887
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22735) Add VectorSizeHint to ML features documentation

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22735:
--
Shepherd: Joseph K. Bradley

> Add VectorSizeHint to ML features documentation
> ---
>
> Key: SPARK-22735
> URL: https://issues.apache.org/jira/browse/SPARK-22735
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22882) ML test for StructuredStreaming: spark.ml.classification

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22882:
--
Target Version/s:   (was: 2.3.0)

> ML test for StructuredStreaming: spark.ml.classification
> 
>
> Key: SPARK-22882
> URL: https://issues.apache.org/jira/browse/SPARK-22882
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22887) ML test for StructuredStreaming: spark.ml.fpm

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22887:
--
Target Version/s:   (was: 2.3.0)

> ML test for StructuredStreaming: spark.ml.fpm
> -
>
> Key: SPARK-22887
> URL: https://issues.apache.org/jira/browse/SPARK-22887
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22885:
--
Target Version/s:   (was: 2.3.0)

> ML test for StructuredStreaming: spark.ml.tuning
> 
>
> Key: SPARK-22885
> URL: https://issues.apache.org/jira/browse/SPARK-22885
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22886) ML test for StructuredStreaming: spark.ml.recommendation

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22886:
--
Target Version/s:   (was: 2.3.0)

> ML test for StructuredStreaming: spark.ml.recommendation
> 
>
> Key: SPARK-22886
> URL: https://issues.apache.org/jira/browse/SPARK-22886
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22735) Add VectorSizeHint to ML features documentation

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22735:
--
Issue Type: Documentation  (was: Improvement)

> Add VectorSizeHint to ML features documentation
> ---
>
> Key: SPARK-22735
> URL: https://issues.apache.org/jira/browse/SPARK-22735
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22735) Add VectorSizeHint to ML features documentation

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22735:
--
Component/s: Documentation

> Add VectorSizeHint to ML features documentation
> ---
>
> Key: SPARK-22735
> URL: https://issues.apache.org/jira/browse/SPARK-22735
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22884:
--
Target Version/s:   (was: 2.3.0)

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22915:
--
Target Version/s:   (was: 2.3.0)

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15571) Pipeline unit test improvements

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15571:
--
Summary: Pipeline unit test improvements  (was: Pipeline unit test 
improvements for 2.3)

> Pipeline unit test improvements
> ---
>
> Key: SPARK-15571
> URL: https://issues.apache.org/jira/browse/SPARK-15571
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Issue:
> * There are several pieces of standard functionality shared by all 
> algorithms: Params, UIDs, fit/transform/save/load, etc.  Currently, these 
> pieces are generally tested in ad hoc tests for each algorithm.
> * This has led to inconsistent coverage, especially within the Python API.
> Goal:
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality.
> * Eliminate duplicate code.  Improve test coverage.  Simplify adding these 
> standard unit tests for future algorithms and APIs.
> This will require several subtasks.  If you identify an issue, please create 
> a subtask, or comment below if the issue needs to be discussed first.
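
To make the goal concrete, a hedged sketch (not an existing Spark test utility) of the kind of 
reusable check this could standardize, here a save/load round trip for any {{MLWritable}} stage:

{code:scala}
// Illustrative only: a generic persistence round trip that any suite could share
// instead of re-implementing it per algorithm.
import org.apache.spark.ml.util.{MLReadable, MLWritable}

def checkSaveLoadRoundTrip[T <: MLWritable](stage: T, companion: MLReadable[T], path: String): T = {
  stage.write.overwrite().save(path)
  val loaded = companion.load(path)
  assert(loaded.getClass == stage.getClass, "loaded stage should have the original's class")
  loaded
}

// Hypothetical usage with a concrete stage and its companion object:
// val loaded = checkSaveLoadRoundTrip(new Binarizer(), Binarizer, "/tmp/binarizer-roundtrip")
{code}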



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15571) Pipeline unit test improvements for 2.3

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15571:
--
Shepherd:   (was: Joseph K. Bradley)

> Pipeline unit test improvements for 2.3
> ---
>
> Key: SPARK-15571
> URL: https://issues.apache.org/jira/browse/SPARK-15571
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Issue:
> * There are several pieces of standard functionality shared by all 
> algorithms: Params, UIDs, fit/transform/save/load, etc.  Currently, these 
> pieces are generally tested in ad hoc tests for each algorithm.
> * This has led to inconsistent coverage, especially within the Python API.
> Goal:
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality.
> * Eliminate duplicate code.  Improve test coverage.  Simplify adding these 
> standard unit tests for future algorithms and APIs.
> This will require several subtasks.  If you identify an issue, please create 
> a subtask, or comment below if the issue needs to be discussed first.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15571) Pipeline unit test improvements for 2.3

2018-01-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15571:
--
Target Version/s:   (was: 2.3.0)

> Pipeline unit test improvements for 2.3
> ---
>
> Key: SPARK-15571
> URL: https://issues.apache.org/jira/browse/SPARK-15571
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Issue:
> * There are several pieces of standard functionality shared by all 
> algorithms: Params, UIDs, fit/transform/save/load, etc.  Currently, these 
> pieces are generally tested in ad hoc tests for each algorithm.
> * This has led to inconsistent coverage, especially within the Python API.
> Goal:
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality.
> * Eliminate duplicate code.  Improve test coverage.  Simplify adding these 
> standard unit tests for future algorithms and APIs.
> This will require several subtasks.  If you identify an issue, please create 
> a subtask, or comment below if the issue needs to be discussed first.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20960) make ColumnVector public

2018-01-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20960.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.3.0

> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> ColumnVector is an internal interface in Spark SQL that is currently only used by 
> the vectorized Parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way to exchange data between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in the data 
> source framework, to build a more efficient UDF API, etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow (basically just a wrapper over Arrow), so that external systems (like 
> Python pandas DataFrames) can build a ColumnVector very easily.
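
As a rough sketch of what this enables (package and class names follow the 2.3 work and should 
be read as assumptions): wrap an Arrow vector and read it back through the generic 
{{ColumnVector}} interface.

{code:scala}
// Sketch: expose Arrow data to Spark through the public ColumnVector interface.
// Assumes org.apache.spark.sql.vectorized.ArrowColumnVector and the Arrow 0.8
// vector API bundled with Spark 2.3.
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector}

val allocator = new RootAllocator(Long.MaxValue)
val arrowInts = new IntVector("ints", allocator)
arrowInts.allocateNew(3)
(0 until 3).foreach(i => arrowInts.setSafe(i, i * 10))
arrowInts.setValueCount(3)

// The same reader code works for any ColumnVector implementation, whether it is
// backed by Arrow, on-heap arrays, or off-heap memory.
val column: ColumnVector = new ArrowColumnVector(arrowInts)
(0 until 3).foreach(i => println(column.getInt(i)))
column.close()
{code}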



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22949) Reduce memory requirement for TrainValidationSplit

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22949:


Assignee: Apache Spark

> Reduce memory requirement for TrainValidationSplit
> --
>
> Key: SPARK-22949
> URL: https://issues.apache.org/jira/browse/SPARK-22949
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Critical
>
> There was a fix in {{CrossValidator}} to reduce memory usage on the driver; 
> the same patch should be applied to {{TrainValidationSplit}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22949) Reduce memory requirement for TrainValidationSplit

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22949:


Assignee: (was: Apache Spark)

> Reduce memory requirement for TrainValidationSplit
> --
>
> Key: SPARK-22949
> URL: https://issues.apache.org/jira/browse/SPARK-22949
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> There was a fix in {{CrossValidator}} to reduce memory usage on the driver; 
> the same patch should be applied to {{TrainValidationSplit}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22949) Reduce memory requirement for TrainValidationSplit

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310465#comment-16310465
 ] 

Apache Spark commented on SPARK-22949:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/20143

> Reduce memory requirement for TrainValidationSplit
> --
>
> Key: SPARK-22949
> URL: https://issues.apache.org/jira/browse/SPARK-22949
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> There was a fix in {{CrossValidator}} to reduce memory usage on the driver; 
> the same patch should be applied to {{TrainValidationSplit}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22949) Reduce memory requirement for TrainValidationSplit

2018-01-03 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22949:
---

 Summary: Reduce memory requirement for TrainValidationSplit
 Key: SPARK-22949
 URL: https://issues.apache.org/jira/browse/SPARK-22949
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian
Priority: Critical


There was a fix in {{CrossValidator}} to reduce memory usage on the driver; 
the same patch should be applied to {{TrainValidationSplit}}.
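
Not the actual patch, just a sketch of the pattern the CrossValidator fix follows: evaluate one 
candidate at a time and keep only the metric, so the driver never holds every fitted model at 
once (the parameter types are the public spark.ml ones; the helper itself is hypothetical):

{code:scala}
// Illustrative sketch of a memory-friendly evaluation loop, not Spark's
// TrainValidationSplit implementation: only metrics are retained on the driver.
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

def selectBest[M <: Model[M]](
    estimator: Estimator[M],
    evaluator: Evaluator,
    grid: Array[ParamMap],
    training: Dataset[_],
    validation: Dataset[_]): (ParamMap, Double) = {
  val metrics = grid.map { params =>
    val model = estimator.fit(training, params)   // reference dies with this iteration
    params -> evaluator.evaluate(model.transform(validation))
  }
  if (evaluator.isLargerBetter) metrics.maxBy(_._2) else metrics.minBy(_._2)
}
{code}

The real implementation also has to handle dataset persistence and optional parallel fitting; 
the point here is only that dropping each model reference before fitting the next keeps driver 
memory bounded by a single model.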



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310423#comment-16310423
 ] 

Michael Armbrust commented on SPARK-22947:
--

+1 to [~rxin]'s question.  This seems like it might just be sugar on top of 
[SPARK-8682].

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will enable many time series analysis use cases for Spark 
> SQL.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has a 
> begin time (inclusive) and an end time (exclusive). The user should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that can be used to repartition 
> data for the join and can serve as a predicate to be pushed down to the 
> storage layer.
> Time context is similar to filtering time by begin/end; the main difference 
> is that the time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20150102, 1, 105.0
> 20150102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 50, null
> 20160102, 1, -50, 105.0
> 20160102, 2, 50, 195.0
> {code}
> h2. Optional Design Sketch
> h3. Implementation A
> (This is just initial thought of 

[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310388#comment-16310388
 ] 

Reynold Xin commented on SPARK-22947:
-

Li,

Why are these not just normal inner joins with conditions that describe ranges?
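
For reference, Use Case A from the description below can be approximated today with a plain 
range-condition join plus a window that keeps the latest match per left row; a rough sketch, 
assuming {{time}} is a real date column rather than the integer dates in the example:

{code:scala}
// Sketch: as-of join with a one-day lookback as a range join + window.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._   // assumes an active SparkSession `spark`

val dfA = Seq(("2016-01-01", 100), ("2016-01-02", 50), ("2016-01-04", -50), ("2016-01-05", 100))
  .toDF("time", "quantity").withColumn("time", to_date(col("time")))
val dfB = Seq(("2015-12-31", 100.0), ("2016-01-04", 105.0), ("2016-01-05", 102.0))
  .toDF("time", "price").withColumn("time", to_date(col("time")))

// Tag each left row so the window below keeps exactly one match per row.
val left = dfA.withColumn("_row_id", monotonically_increasing_id())

val candidates = left.as("a").join(
  dfB.as("b"),
  col("b.time") <= col("a.time") && col("b.time") >= date_sub(col("a.time"), 1),
  "left_outer")

val asOf = candidates
  .withColumn("_rank", row_number().over(
    Window.partitionBy(col("_row_id")).orderBy(col("b.time").desc_nulls_last)))
  .where(col("_rank") === 1)
  .select(col("a.time"), col("quantity"), col("price"))
{code}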


> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will enable many time series analysis use cases for Spark 
> SQL.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has a 
> begin time (inclusive) and an end time (exclusive). The user should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that can be used to repartition 
> data for the join and can serve as a predicate to be pushed down to the 
> storage layer.
> Time context is similar to filtering time by begin/end; the main difference 
> is that the time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20150102, 1, 105.0
> 20150102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 50, null
> 20160102, 1, -50, 105.0
> 20160102, 2, 50, 195.0
> {code}
> h2. Optional Design Sketch
> h3. Implementation A
> (This is just initial thought of how to 

[jira] [Created] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package

2018-01-03 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-22948:
--

 Summary: "SparkPodInitContainer" shouldn't be in "rest" package
 Key: SPARK-22948
 URL: https://issues.apache.org/jira/browse/SPARK-22948
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Trivial


Just noticed while playing with the code that this class is in 
{{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since 
there's no REST server (and it's the only class in there, too).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22928) API Documentation issue for org.apache.spark.sql.streaming.Trigger

2018-01-03 Thread Ganesh Chand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesh Chand updated SPARK-22928:
-
Affects Version/s: 1.6.3
   2.0.0
   2.2.1

> API Documentation issue for org.apache.spark.sql.streaming.Trigger
> --
>
> Key: SPARK-22928
> URL: https://issues.apache.org/jira/browse/SPARK-22928
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Java API
>Affects Versions: 1.6.3, 2.0.0, 2.2.0, 2.2.1
>Reporter: Ganesh Chand
>  Labels: documentation
>
> 1) {{Trigger.java}} link is broken in the documentation - 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
> The link redirects to: 
> https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java.scala
>  and results in {{404}} error. 
> Notice {{Trigger.java.scala}}. It should be {{Trigger.java}}
> So, the correct link is 
> https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java
> 2) Because of the above issue, {{org.apache.spark.sql.streaming.Trigger}} has 
> incomplete documentation. See 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
>  As a result, public methods such as {{ProcessingTime()}} aren't documented.
> Update - 01/03/2018: It looks like all {{.java}} files referenced in Scala 
> API documentation suffer from this problem. For example - {{SaveMode}} also 
> has a broken link - 
> https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/SaveMode.java.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22928) API Documentation issue for org.apache.spark.sql.streaming.Trigger

2018-01-03 Thread Ganesh Chand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesh Chand updated SPARK-22928:
-
Description: 
1) {{Trigger.java}} link is broken in the documentation - 
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
The link redirects to: 
https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java.scala
 and results in {{404}} error. 
Notice {{Trigger.java.scala}}. It should be {{Trigger.java}}

So, the correct link is 
https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java

2) Because of the above issue, {{org.apache.spark.sql.streaming.Trigger}} has 
incomplete documentation. See 
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
 As a result, public methods such as {{ProcessingTime()}} aren't documented.

Update - 01/03/2018: It looks like all {{.java}} files referenced in Scala API 
documentation suffer from this problem. For example - {{SaveMode}} also has a 
broken link - 
https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/SaveMode.java.scala



  was:
1) {{Trigger.java}} link is broken in the documentation - 
{{https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger}}.
 The link redirect to 
{{https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java.scala}}
 and results in {{404}} error. 
Notice {{Trigger.java.scala}}. It should be {{Trigger.java}}

So, the correct link is 
{{https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java}}

2) Because of the above issue, {{org.apache.spark.sql.streaming.Trigger}} has 
incomplete documentation. See 
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
 As a result, Public methods such as {{ProcessingTime()}} isn't documented.

Update - 01/03/2018: It looks like all {{.java}} files referenced in Scala API 
documentation suffer from this problem. For example - {{SaveMode}} also has a 
broken link - 
https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/SaveMode.java.scala




> API Documentation issue for org.apache.spark.sql.streaming.Trigger
> --
>
> Key: SPARK-22928
> URL: https://issues.apache.org/jira/browse/SPARK-22928
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Java API
>Affects Versions: 2.2.0
>Reporter: Ganesh Chand
>  Labels: documentation
>
> 1) {{Trigger.java}} link is broken in the documentation - 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
> The link redirects to: 
> https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java.scala
>  and results in {{404}} error. 
> Notice {{Trigger.java.scala}}. It should be {{Trigger.java}}
> So, the correct link is 
> https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java
> 2) Because of the above issue, {{org.apache.spark.sql.streaming.Trigger}} has 
> incomplete documentation. See 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
>  As a result, public methods such as {{ProcessingTime()}} aren't documented.
> Update - 01/03/2018: It looks like all {{.java}} files referenced in Scala 
> API documentation suffer from this problem. For example - {{SaveMode}} also 
> has a broken link - 
> https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/SaveMode.java.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22930) Improve the description of Vectorized UDFs for non-deterministic cases

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22930:


Assignee: Apache Spark

> Improve the description of Vectorized UDFs for non-deterministic cases
> --
>
> Key: SPARK-22930
> URL: https://issues.apache.org/jira/browse/SPARK-22930
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> After we merge this commit 
> https://github.com/apache/spark/commit/ff48b1b338241039a7189e7a3c04333b1256fdb3,
>  we also need to update the function description of Vectorized UDFs. Users 
> are able to create non-deterministic Vectorized UDFs. 
> Also, add the related test cases. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22930) Improve the description of Vectorized UDFs for non-deterministic cases

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22930:


Assignee: (was: Apache Spark)

> Improve the description of Vectorized UDFs for non-deterministic cases
> --
>
> Key: SPARK-22930
> URL: https://issues.apache.org/jira/browse/SPARK-22930
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>
> After we merge this commit 
> https://github.com/apache/spark/commit/ff48b1b338241039a7189e7a3c04333b1256fdb3,
>  we also need to update the function description of Vectorized UDFs. Users 
> are able to create non-deterministic Vectorized UDFs. 
> Also, add the related test cases. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22930) Improve the description of Vectorized UDFs for non-deterministic cases

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310341#comment-16310341
 ] 

Apache Spark commented on SPARK-22930:
--

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/20142

> Improve the description of Vectorized UDFs for non-deterministic cases
> --
>
> Key: SPARK-22930
> URL: https://issues.apache.org/jira/browse/SPARK-22930
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>
> After we merge this commit 
> https://github.com/apache/spark/commit/ff48b1b338241039a7189e7a3c04333b1256fdb3,
>  we also need to update the function description of Vectorized UDFs. Users 
> are able to create non-deterministic Vectorized UDFs. 
> Also, add the related test cases. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22928) API Documentation issue for org.apache.spark.sql.streaming.Trigger

2018-01-03 Thread Ganesh Chand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesh Chand updated SPARK-22928:
-
Description: 
1) {{Trigger.java}} link is broken in the documentation - 
{{https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger}}.
 The link redirects to 
{{https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java.scala}}
 and results in {{404}} error. 
Notice {{Trigger.java.scala}}. It should be {{Trigger.java}}

So, the correct link is 
{{https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java}}

2) Because of the above issue, {{org.apache.spark.sql.streaming.Trigger}} has 
incomplete documentation. See 
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
 As a result, public methods such as {{ProcessingTime()}} aren't documented.

Update - 01/03/2018: It looks like all {{.java}} files referenced in Scala API 
documentation suffer from this problem. For example - {{SaveMode}} also has a 
broken link - 
https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/SaveMode.java.scala



  was:
1) {{Trigger.java}} link is broken in the documentation - 
{{https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger}}.
 The link redirect to 
{{https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java.scala}}
 and results in {{404}} error. 
Notice {{Trigger.java.scala}}. It should be {{Trigger.java}}

So, the correct link is 
{{https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java}}

2) Because of the above issue, {{org.apache.spark.sql.streaming.Trigger}} has 
incomplete documentation. See 
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
 As a result, Public methods such as {{ProcessingTime()}} isn't documented.




> API Documentation issue for org.apache.spark.sql.streaming.Trigger
> --
>
> Key: SPARK-22928
> URL: https://issues.apache.org/jira/browse/SPARK-22928
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Java API
>Affects Versions: 2.2.0
>Reporter: Ganesh Chand
>  Labels: documentation
>
> 1) {{Trigger.java}} link is broken in the documentation - 
> {{https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger}}.
>  The link redirects to 
> {{https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java.scala}}
>  and results in {{404}} error. 
> Notice {{Trigger.java.scala}}. It should be {{Trigger.java}}
> So, the correct link is 
> {{https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java}}
> 2) Because of the above issue, {{org.apache.spark.sql.streaming.Trigger}} has 
> incomplete documentation. See 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.Trigger.
>  As a result, public methods such as {{ProcessingTime()}} aren't documented.
> Update - 01/03/2018: It looks like all {{.java}} files referenced in Scala 
> API documentation suffer from this problem. For example - {{SaveMode}} also 
> has a broken link - 
> https://github.com/apache/spark/tree/v2.2.0/sql/core/src/main/java/org/apache/spark/sql/SaveMode.java.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21809) Change Stage Page to use datatables to support sorting columns and searching

2018-01-03 Thread Nuochen Lyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nuochen Lyu resolved SPARK-21809.
-
Resolution: Fixed

> Change Stage Page to use datatables to support sorting columns and searching
> 
>
> Key: SPARK-21809
> URL: https://issues.apache.org/jira/browse/SPARK-21809
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Nuochen Lyu
>Priority: Minor
>
> Support column sort and search for the Stage page using jQuery DataTables and 
> the REST API. Before this commit, the Stage page was generated as hard-coded 
> HTML and could not support search; sorting was also disabled if any 
> application had more than one attempt. Supporting search and sort (over all 
> applications rather than the 20 entries in the current page) in any case 
> will greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22887) ML test for StructuredStreaming: spark.ml.fpm

2018-01-03 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310258#comment-16310258
 ] 

Sandor Murakozi commented on SPARK-22887:
-

Is anyone working on this? 
If not, I would like to work on it.

> ML test for StructuredStreaming: spark.ml.fpm
> -
>
> Key: SPARK-22887
> URL: https://issues.apache.org/jira/browse/SPARK-22887
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-03 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-22947:
---
Description: 
h2. Background and Motivation
Time series analysis is one of the most common analyses of financial data. In 
time series analysis, as-of join is a very common operation. Supporting as-of 
join in Spark SQL will enable many time series analysis use cases for Spark SQL.

As-of join is “join on time” with inexact time matching criteria. Various 
libraries have implemented as-of join or similar functionality:
Kdb: https://code.kx.com/wiki/Reference/aj
Pandas: 
http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
R: This functionality is called “Last Observation Carried Forward”
https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
Flint: https://github.com/twosigma/flint#temporal-join-functions

This proposal advocates introducing new API in Spark SQL to support as-of join.

h2. Target Personas
Data scientists, data engineers

h2. Goals
* New API in Spark SQL that allows as-of join
* As-of join of multiple tables (>2) should be performant, because it’s very 
common that users need to join multiple data sources together for further 
analysis.
* Define Distribution, Partitioning and shuffle strategy for ordered time 
series data

h2. Non-Goals
These are out of scope for the existing SPIP, should be considered in future 
SPIP as improvement to Spark’s time series analysis ability:
* Utilize partition information from data source, i.e, begin/end of each 
partition to reduce sorting/shuffling
* Define API for user to implement asof join time spec in business calendar 
(i.e. lookback one business day, this is very common in financial data analysis 
because of market calendars)
* Support broadcast join

h2. Proposed API Changes

h3. TimeContext
TimeContext is an object that defines the time scope of the analysis; it has a 
begin time (inclusive) and an end time (exclusive). The user should be able to 
change the time scope of the analysis (e.g., from one month to five years) by 
just changing the TimeContext. 

To the Spark engine, TimeContext is a hint that can be used to repartition data 
for the join and can serve as a predicate to be pushed down to the storage layer.

Time context is similar to filtering time by begin/end; the main difference is 
that the time context can be expanded based on the operation taken (see the 
example in as-of join).

Time context example:

{code:java}
TimeContext timeContext = TimeContext("20160101", "20170101")
{code}

h3. asofJoin
h4. Use Case A (join without key)
Join two DataFrames on time, with one day lookback:

{code:java}
TimeContext timeContext = TimeContext("20160101", "20170101")

dfA = ...
dfB = ...

JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
result = dfA.asofJoin(dfB, joinSpec)
{code}


Example input/output:

{code:java}
dfA:
time, quantity
20160101, 100
20160102, 50
20160104, -50
20160105, 100

dfB:
time, price
20151231, 100.0
20160104, 105.0
20160105, 102.0

output:
time, quantity, price
20160101, 100, 100.0
20160102, 50, null
20160104, -50, 105.0
20160105, 100, 102.0

{code}

Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
is an important illustration of the time context - it is able to expand the 
context to 20151231 on dfB because of the 1 day lookback.

h4. Use Case B (join with key)
To join on time and another key (for instance, id), we use “by” to specify the 
key.

{code:java}

TimeContext timeContext = TimeContext("20160101", "20170101")

dfA = ...
dfB = ...

JoinSpec joinSpec = JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
result = dfA.asofJoin(dfB, joinSpec)
{code}


Example input/output:

{code:java}
dfA:
time, id, quantity
20160101, 1, 100
20160101, 2, 50
20160102, 1, -50
20160102, 2, 50

dfB:
time, id, price
20151231, 1, 100.0
20150102, 1, 105.0
20150102, 2, 195.0

Output:
time, id, quantity, price
20160101, 1, 100, 100.0
20160101, 2, 50, null
20160102, 1, -50, 105.0
20160102, 2, 50, 195.0
{code}

h2. Optional Design Sketch
h3. Implementation A

(This is just initial thought of how to implement this)

(1) Using begin/end of the TimeContext, we first partition the left DataFrame 
into non-overlapping partitions. For the purpose of demonstration, assume we 
partition it into one-day partitions:

{code:java}
[20160101, 20160102) [20160102, 20160103) ... [20161231, 20170101)
{code}


(2) Then we partition right DataFrame into overlapping partitions, taking into 
account tolerance, e.g. one day lookback:

{code:java}
[20151231, 20160102) [20160101, 20160103) ... [20161230, 20170101)
{code}


(3) Pair left and right partitions

(4) For each pair of partitions, because all data for the join is in the 
partition pair, we can now join the partition pair locally.

(5) Use partitioning in (1) as the 

[jira] [Updated] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-03 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-22947:
---
Description: 
h2. Background and Motivation
Time series analysis is one of the most common analyses of financial data. In 
time series analysis, as-of join is a very common operation. Supporting as-of 
join in Spark SQL will enable many time series analysis use cases for Spark SQL.

As-of join is “join on time” with inexact time matching criteria. Various 
libraries have implemented as-of join or similar functionality:
Kdb: https://code.kx.com/wiki/Reference/aj
Pandas: 
http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
R: This functionality is called “Last Observation Carried Forward”
https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
Flint: https://github.com/twosigma/flint#temporal-join-functions

This proposal advocates introducing new API in Spark SQL to support as-of join.

h2. Target Personas
Data scientists, data engineers

h2. Goals
* New API in Spark SQL that allows as-of join
* As-of join of multiple tables (>2) should be performant, because it’s very 
common that users need to join multiple data sources together for further 
analysis.
* Define Distribution, Partitioning and shuffle strategy for ordered time 
series data

h2. Non-Goals
These are out of scope for the existing SPIP, should be considered in future 
SPIP as improvement to Spark’s time series analysis ability:
* Utilize partition information from data source, i.e, begin/end of each 
partition to reduce sorting/shuffling
* Define API for user to implement asof join time spec in business calendar 
(i.e. lookback one business day, this is very common in financial data analysis 
because of market calendars)
* Support broadcast join

h2. Proposed API Changes

h3. TimeContext
TimeContext is an object that defines the time scope of the analysis; it has a 
begin time (inclusive) and an end time (exclusive). The user should be able to 
change the time scope of the analysis (e.g., from one month to five years) by 
just changing the TimeContext. 

To the Spark engine, TimeContext is a hint that can be used to repartition data 
for the join and can serve as a predicate to be pushed down to the storage layer.

Time context is similar to filtering time by begin/end; the main difference is 
that the time context can be expanded based on the operation taken (see the 
example in as-of join).

Time context example:

{code:java}
TimeContext timeContext = TimeContext("20160101", "20170101")
{code}

h3. asofJoin
h4. Use Case A (join without key)
Join two DataFrames on time, with one day lookback:

{code:java}
TimeContext timeContext = TimeContext("20160101", "20170101")

dfA = ...
dfB = ...

JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
result = dfA.asofJoin(dfB, joinSpec)
{code}


Example input/output:

{code:java}
dfA:
time, quantity
20160101, 100
20160102, 50
20160104, -50
20160105, 100

dfB:
time, price
20151231, 100.0
20160104, 105.0
20160105, 102.0

output:
time, quantity, price
20160101, 100, 100.0
20160102, 50, null
20160104, -50, 105.0
20160105, 100, 102.0

{code}

Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
is an important illustration of the time context - it is able to expand the 
context to 20151231 on dfB because of the 1 day lookback.

h4. Use Case B (join with key)
To join on time and another key (for instance, id), we use “by” to specify the 
key.

{code:java}

TimeContext timeContext = TimeContext("20160101", "20170101")

dfA = ...
dfB = ...

JoinSpec joinSpec = JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
result = dfA.asofJoin(dfB, joinSpec)
{code}


Example input/output:

{code:java}
dfA:
time, id, quantity
20160101, 1, 100
20160101, 2, 50
20160102, 1, -50
20160102, 2, 50

dfB:
time, id, price
20151231, 1, 100.0
20150102, 1, 105.0
20150102, 2, 195.0

Output:
time, id, quantity, price
20160101, 1, 100, 100.0
20160101, 2, 50, null
20160102, 1, -50, 105.0
20160102, 2, 50, 195.0
{code}




  was:
h2. Background and Motivation
Time series analysis is one of the most common analysis on financial data. In 
time series analysis, as-of join is a very common operation. Supporting as-of 
join in Spark SQL will allow many use cases of using Spark SQL for time series 
analysis.

As-of join is “join on time” with inexact time matching criteria. Various 
library has implemented asof join or similar functionality:
Kdb: https://code.kx.com/wiki/Reference/aj
Pandas: 
http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
R: This functionality is called “Last Observation Carried Forward”
https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
Flint: 

[jira] [Updated] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-03 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-22947:
---
Attachment: SPIP_ as-of join in Spark SQL (1).pdf

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will enable many time series analysis use cases for Spark 
> SQL.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> See attachment



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-03 Thread Li Jin (JIRA)
Li Jin created SPARK-22947:
--

 Summary: SPIP: as-of join in Spark SQL
 Key: SPARK-22947
 URL: https://issues.apache.org/jira/browse/SPARK-22947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Li Jin


h2. Background and Motivation
Time series analysis is one of the most common analysis on financial data. In 
time series analysis, as-of join is a very common operation. Supporting as-of 
join in Spark SQL will allow many use cases of using Spark SQL for time series 
analysis.

As-of join is “join on time” with inexact time matching criteria. Various 
libraries have implemented as-of join or similar functionality:
Kdb: https://code.kx.com/wiki/Reference/aj
Pandas: 
http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
R: This functionality is called “Last Observation Carried Forward”
https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
Flint: https://github.com/twosigma/flint#temporal-join-functions

This proposal advocates introducing new API in Spark SQL to support as-of join.

h2. Target Personas
Data scientists, data engineers

h2. Goals
* New API in Spark SQL that allows as-of join
* As-of join of multiple tables (>2) should be performant, because it’s very 
common that users need to join multiple data sources together for further 
analysis.
* Define Distribution, Partitioning and shuffle strategy for ordered time 
series data

h2. Non-Goals
These are out of scope for the existing SPIP, should be considered in future 
SPIP as improvement to Spark’s time series analysis ability:
* Utilize partition information from data source, i.e, begin/end of each 
partition to reduce sorting/shuffling
* Define API for user to implement asof join time spec in business calendar 
(i.e. lookback one business day, this is very common in financial data analysis 
because of market calendars)
* Support broadcast join

h2. Proposed API Changes
See attachment
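
For illustration only (the proposed API itself is defined in the attached SPIP; nothing below is that API), here is a minimal sketch of backward as-of-join semantics expressed with the existing DataFrame operators: join on the key with an inequality on time, then keep the most recent match per left-hand row. The data is a small made-up example in the spirit of the one above.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object AsOfJoinSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("asof-sketch").getOrCreate()
  import spark.implicits._

  // Left side: rows we want to enrich. Right side: the observation history.
  val quotes = Seq((20160101L, 1), (20160101L, 2), (20160102L, 1), (20160102L, 2)).toDF("time", "id")
  val trades = Seq((20151231L, 1, 100.0), (20160102L, 1, 105.0), (20160102L, 2, 195.0)).toDF("time", "id", "price")

  // Backward as-of join ("last observation carried forward") with existing operators:
  // for every quote row, keep the latest trade of the same id whose time is <= the quote time.
  val candidates = quotes.as("q")
    .join(trades.as("t"), $"q.id" === $"t.id" && $"t.time" <= $"q.time", "left_outer")
  val latestFirst = Window.partitionBy($"q.time", $"q.id").orderBy($"t.time".desc)
  val asOfJoined = candidates
    .withColumn("rank", row_number().over(latestFirst))
    .where($"rank" === 1)
    .select($"q.time", $"q.id", $"t.price")

  asOfJoined.show()
  spark.stop()
}
{code}

The shuffle and sort this plan implies for every pair of inputs is exactly the cost a dedicated as-of join operator could avoid, which is why the SPIP also covers distribution and partitioning for ordered time series data.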




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22896) Improvement in String interpolation

2018-01-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310119#comment-16310119
 ] 

Dongjoon Hyun commented on SPARK-22896:
---

Thank you!

> Improvement in String interpolation 
> 
>
> Key: SPARK-22896
> URL: https://issues.apache.org/jira/browse/SPARK-22896
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.2.1
>Reporter: Chetan Khatri
>Assignee: Chetan Khatri
>Priority: Trivial
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> String Interpolation Scala Style has been improved as per the scala standard



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22553) Drop FROM in nonReserved

2018-01-03 Thread Suchith J N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310068#comment-16310068
 ] 

Suchith J N commented on SPARK-22553:
-

The change breaks 3 tests

1. *TableIdentifierParserSuite* : table identifier - strict keywords{code:java}
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'from' 
expecting {'SELECT', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE',
{code}

2. *PlanParserSuite* : simple select query
{code:java}
org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at 
input 'select from'(line 1, pos 7)
{code}

3. *DataTypeParserSuite* : parse struct
{code:java}
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'from' 
expecting {'SELECT', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE',
{code}

> Drop FROM in nonReserved
> 
>
> Key: SPARK-22553
> URL: https://issues.apache.org/jira/browse/SPARK-22553
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Trivial
> Attachments: Removed_FROM_from_nonReserved_list.patch
>
>
> A simple query below throws a misleading error because nonReserved has 
> `SELECT` in SqlBase.g4:
> {code}
> scala> Seq((1, 2)).toDF("a", "b").createTempView("t")
> scala> sql("select a, count(1), from t group by 1").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input 
> columns: []; line 1 pos 7;
> 'Aggregate [unresolvedordinal(1)], ['a, count(1) AS count(1)#13L, 'from AS 
> t#11]
> +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
> {code}
> I know nonReserved currently has `SELECT` because of the historical reason 
> (https://github.com/apache/spark/pull/18079#discussion_r118842186). But, 
> since IMHO this is a kind of common mistake (This message annoyed me a few 
> days ago in large SQL queries...), it might be worth dropping it in the 
> reserved.
> FYI: postgresql throws an explicit error in this case:
> {code}
> postgres=# select a, count(1), from test group by b;
> ERROR:  syntax error at or near "from" at character 21
> STATEMENT:  select a, count(1), from test group by b;
> ERROR:  syntax error at or near "from"
> LINE 1: select a, count(1), from test group by b;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22896) Improvement in String interpolation

2018-01-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310055#comment-16310055
 ] 

Sean Owen commented on SPARK-22896:
---

Oops, right. Also merged to 2.3.

> Improvement in String interpolation 
> 
>
> Key: SPARK-22896
> URL: https://issues.apache.org/jira/browse/SPARK-22896
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.2.1
>Reporter: Chetan Khatri
>Assignee: Chetan Khatri
>Priority: Trivial
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> String Interpolation Scala Style has been improved as per the scala standard



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22896) Improvement in String interpolation

2018-01-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310038#comment-16310038
 ] 

Dongjoon Hyun commented on SPARK-22896:
---

[~srowen].
This is in `master` branch only. Can we have this in `branch-2.3` or fix `Fix 
Version/s:` into `2.4`?

> Improvement in String interpolation 
> 
>
> Key: SPARK-22896
> URL: https://issues.apache.org/jira/browse/SPARK-22896
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.2.1
>Reporter: Chetan Khatri
>Assignee: Chetan Khatri
>Priority: Trivial
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> String Interpolation Scala Style has been improved as per the scala standard



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-22553) Drop FROM in nonReserved

2018-01-03 Thread Suchith J N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchith J N updated SPARK-22553:

Comment: was deleted

(was: Shall I submit a PR?)

> Drop FROM in nonReserved
> 
>
> Key: SPARK-22553
> URL: https://issues.apache.org/jira/browse/SPARK-22553
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Trivial
> Attachments: Removed_FROM_from_nonReserved_list.patch
>
>
> A simple query below throws a misleading error because nonReserved has 
> `SELECT` in SqlBase.g4:
> {code}
> scala> Seq((1, 2)).toDF("a", "b").createTempView("t")
> scala> sql("select a, count(1), from t group by 1").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input 
> columns: []; line 1 pos 7;
> 'Aggregate [unresolvedordinal(1)], ['a, count(1) AS count(1)#13L, 'from AS 
> t#11]
> +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
> {code}
> I know nonReserved currently has `SELECT` because of the historical reason 
> (https://github.com/apache/spark/pull/18079#discussion_r118842186). But, 
> since IMHO this is a kind of common mistake (This message annoyed me a few 
> days ago in large SQL queries...), it might be worth dropping it in the 
> reserved.
> FYI: postgresql throws an explicit error in this case:
> {code}
> postgres=# select a, count(1), from test group by b;
> ERROR:  syntax error at or near "from" at character 21
> STATEMENT:  select a, count(1), from test group by b;
> ERROR:  syntax error at or near "from"
> LINE 1: select a, count(1), from test group by b;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22896) Improvement in String interpolation

2018-01-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22896.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20070
[https://github.com/apache/spark/pull/20070]

> Improvement in String interpolation 
> 
>
> Key: SPARK-22896
> URL: https://issues.apache.org/jira/browse/SPARK-22896
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.2.1
>Reporter: Chetan Khatri
>Assignee: Chetan Khatri
>Priority: Trivial
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> String Interpolation Scala Style has been improved as per the scala standard



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22896) Improvement in String interpolation

2018-01-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22896:
-

Assignee: Chetan Khatri

> Improvement in String interpolation 
> 
>
> Key: SPARK-22896
> URL: https://issues.apache.org/jira/browse/SPARK-22896
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.2.1
>Reporter: Chetan Khatri
>Assignee: Chetan Khatri
>Priority: Trivial
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> String Interpolation Scala Style has been improved as per the scala standard



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-01-03 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309910#comment-16309910
 ] 

Thomas Graves commented on SPARK-22683:
---

[~jcuquemelle] Just to confirm: your applications normally have multiple 
stages, and the issue is that executors get started, run very few tasks, and 
then sit for the idle timeout (30s in your case) before being released, thus 
wasting those resources?  If this were just a single stage, the job would finish 
and release them right away, and presumably your next stage doesn't need them.   
The reason I ask is to make sure we aren't simply being too slow about handing 
containers back when they are not needed.   Meaning: you request a bunch of 
containers from YARN, but by the time it gives them to you the number of 
remaining tasks is already smaller than the number of executors; we should give 
those containers back immediately.  I'm assuming in this case it's still trying 
to launch a bunch of containers, but in the meantime the tasks either finish or 
each of those executors only gets a few tasks?  How long is it taking for your 
executors to start?   You aren't localizing something very big, for instance, 
so that startup takes longer than it should?

The only other way I can see to do this is to have Spark be smarter (similar to 
speculative execution): estimate the time it takes for tasks to finish versus 
the time it takes to start executors, and conclude that it won't need more 
executors. That is much more complicated than this config, though.
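
For reference, a minimal sketch of the stock dynamic-allocation knobs being discussed; the values are illustrative only, not a recommendation:

{code}
import org.apache.spark.sql.SparkSession

// Illustrative values only. These are the standard settings referred to above
// (idle timeout, backlog timeout, executor cap, cores per executor).
val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")              // external shuffle service, required on YARN
  .config("spark.dynamicAllocation.executorIdleTimeout", "30s") // the 30s idle timeout mentioned above
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .config("spark.dynamicAllocation.maxExecutors", "200")
  .config("spark.executor.cores", "5")
  .getOrCreate()
{code}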

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependant on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, 

[jira] [Commented] (SPARK-22826) [SQL] findWiderTypeForTwo Fails over StructField of Array

2018-01-03 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309794#comment-16309794
 ] 

Aleksander Eskilson commented on SPARK-22826:
-

This bug would affect 2.3.0, and so would also affect any encoder having fields 
as described above, which are pretty common in Avro data. Which means it would 
also preclude work in a 
[fork|https://github.com/databricks/spark-avro/pull/217] of Spark-Avro that 
enables creating datasets from Avro objects. Any chance the PR for this issue 
can still make it in time for the final 2.3.0 release so that those Spark-Avro 
forks can run over the latest version as we continue working over SPARK-22739?

cc: [~marmbrus]

> [SQL] findWiderTypeForTwo Fails over StructField of Array
> -
>
> Key: SPARK-22826
> URL: https://issues.apache.org/jira/browse/SPARK-22826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>
> The {{findWiderTypeForTwo}} codepath in Catalyst {{TypeCoercion}} fails when 
> applied to two {{StructType}}s having the following fields:
> {noformat}
>   StructType(StructField("a", ArrayType(StringType, containsNull=true)) 
> :: Nil),
>   StructType(StructField("a", ArrayType(StringType, containsNull=false)) 
> :: Nil)
> {noformat}
> When in {{findTightestCommonType}}, the function attempts to recursively find 
> the tightest common type of two arrays. These two arrays are not equal types 
> (since one would admit null elements and the other would not), but 
> {{findTightestCommonType}} has no match case for {{ArrayType}} (or 
> {{MapType}}), so the 
> [get|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L108]
>  operation on the dataType of the {{StructField}} throws a 
> {{NoSuchElementException}}.
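
The failing inputs from the description, written out as a runnable snippet (only the array's {{containsNull}} flag differs between the two structs):

{code}
import org.apache.spark.sql.types._

val s1 = StructType(StructField("a", ArrayType(StringType, containsNull = true)) :: Nil)
val s2 = StructType(StructField("a", ArrayType(StringType, containsNull = false)) :: Nil)
// Per the report, asking TypeCoercion.findWiderTypeForTwo for a common type of s1 and s2
// ends in a NoSuchElementException instead of widening the array's nullability.
{code}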



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22946) Recursive withColumn calls cause org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-03 Thread Harpreet Chopra (JIRA)
Harpreet Chopra created SPARK-22946:
---

 Summary: Recursive withColumn calls cause 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
 Key: SPARK-22946
 URL: https://issues.apache.org/jira/browse/SPARK-22946
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Harpreet Chopra


Recursive calls to withColumn to update the same column cause 
_"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" 
grows beyond 64 KB_

This can be reproduced in Spark 1.x and also in the latest version we are using 
(Spark 2.2.0). Code to reproduce:

import org.apache.spark.sql.functions._

var df = sc.parallelize(Seq(("123", "CustOne"), ("456", "CustTwo"))).toDF("ID", "CustName")
(1 to 20).foreach(x => {
  df = df.withColumn("ID", when(col("ID") === "123", lit("678")).otherwise(col("ID")))
  println(" " + x)
  df.show
})

Stack dump:

at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$cata
lyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato
r.scala:575)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato
r.scala:572)
at 
org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java
:3599)
at 
org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
... 28 more
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/cataly
st/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows
 beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9912)
at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9897)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3317)
at org.codehaus.janino.UnitCompiler.access$8500(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$10.visitLocalVariableAccess(UnitCompiler.java:3285)
at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:3189)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3313)
at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$10.visitAmbiguousName(UnitCompiler.java:3280)
at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3138)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669)
at org.codehaus.janino.UnitCompiler.access$4500(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$7.visitAssignment(UnitCompiler.java:2619)
at org.codehaus.janino.Java$Assignment.accept(Java.java:3405)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
at org.codehaus.janino.UnitCompiler.access$1100(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$4.visitExpressionStatement(UnitCompiler.java:936)
at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2097)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958)
at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1007)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:993)
at org.codehaus.janino.UnitCompiler.access$1000(UnitCompiler.java:185)
at org.codehaus.janino.UnitCompiler$4.visitBlock(UnitCompiler.java:935)
at org.codehaus.janino.Java$Block.accept(Java.java:2012)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1753)
at org.codehaus.janino.UnitCompiler.access$1200(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$4.visitIfStatement(UnitCompiler.java:937)
at org.codehaus.janino.Java$IfStatement.accept(Java.java:2157)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958)
at 

[jira] [Assigned] (SPARK-22945) add java UDF APIs in the functions object

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22945:


Assignee: Wenchen Fan  (was: Apache Spark)

> add java UDF APIs in the functions object
> -
>
> Key: SPARK-22945
> URL: https://issues.apache.org/jira/browse/SPARK-22945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> Currently Scala users can use UDF like
> {code}
> val foo = udf((i: Int) => Math.random() + i).asNondeterministic
> df.select(foo('a))
> {code}
> Python users can also do it with similar APIs. However Java users can't do 
> it, we should add Java UDF APIs in the functions object.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22945) add java UDF APIs in the functions object

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309730#comment-16309730
 ] 

Apache Spark commented on SPARK-22945:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20141

> add java UDF APIs in the functions object
> -
>
> Key: SPARK-22945
> URL: https://issues.apache.org/jira/browse/SPARK-22945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> Currently Scala users can use UDF like
> {code}
> val foo = udf((i: Int) => Math.random() + i).asNondeterministic
> df.select(foo('a))
> {code}
> Python users can also do it with similar APIs. However Java users can't do 
> it, we should add Java UDF APIs in the functions object.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22945) add java UDF APIs in the functions object

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22945:


Assignee: Apache Spark  (was: Wenchen Fan)

> add java UDF APIs in the functions object
> -
>
> Key: SPARK-22945
> URL: https://issues.apache.org/jira/browse/SPARK-22945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> Currently Scala users can use UDF like
> {code}
> val foo = udf((i: Int) => Math.random() + i).asNondeterministic
> df.select(foo('a))
> {code}
> Python users can also do it with similar APIs. However Java users can't do 
> it, we should add Java UDF APIs in the functions object.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22945) add java UDF APIs in the functions object

2018-01-03 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22945:
---

 Summary: add java UDF APIs in the functions object
 Key: SPARK-22945
 URL: https://issues.apache.org/jira/browse/SPARK-22945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan


Currently Scala users can use UDF like
{code}
val foo = udf((i: Int) => Math.random() + i).asNondeterministic
df.select(foo('a))
{code}

Python users can also do it with similar APIs. However Java users can't do it, 
we should add Java UDF APIs in the functions object.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions

2018-01-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20236.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.3.0

> Overwrite a partitioned data source table should only overwrite related 
> partitions
> --
>
> Key: SPARK-20236
> URL: https://issues.apache.org/jira/browse/SPARK-20236
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> When we overwrite a partitioned data source table, currently Spark will 
> truncate the entire table to write new data, or truncate a bunch of 
> partitions according to the given static partitions.
> For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, 
> {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions 
> that starts with {{a=1}}.
> This behavior is kind of reasonable as we can know which partitions will be 
> overwritten before runtime. However, hive has a different behavior that it 
> only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT 
> 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only 
> one data column and is partitioned by {{a}} and {{b}}.
> It seems better if we can follow hive's behavior.
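
A sketch of the Hive-like behavior this issue asks for, runnable in spark-shell. The table and column names are hypothetical, and {{spark.sql.sources.partitionOverwriteMode}} is assumed to be the 2.3 setting that accompanies this change (verify against your Spark version):

{code}
// Hypothetical partitioned table with one data column.
spark.sql("CREATE TABLE tbl (value INT, a INT, b INT) USING parquet PARTITIONED BY (a, b)")
spark.sql("INSERT INTO tbl VALUES (1, 1, 1), (2, 2, 3)")

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// With dynamic mode this should rewrite only partition (a=2, b=3); partition (a=1, b=1) is kept.
spark.sql("INSERT OVERWRITE TABLE tbl SELECT 20, 2, 3")
spark.sql("SELECT * FROM tbl ORDER BY a, b").show()
{code}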



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22934) Make optional clauses order insensitive for CREATE TABLE SQL statement

2018-01-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22934.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Make optional clauses order insensitive for CREATE TABLE SQL statement
> --
>
> Key: SPARK-22934
> URL: https://issues.apache.org/jira/browse/SPARK-22934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Each time, when I write a complex Create Table statement, I have to open the 
> .g4 file to find the EXACT order of clauses in CREATE TABLE statement. When 
> the order is not right, I will get a strange, confusing error message 
> generated from ANTLR4. 
> {noformat}
> CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
> [(col_name1 col_type1 [COMMENT col_comment1], ...)]
> USING datasource
> [OPTIONS (key1=val1, key2=val2, ...)]
> [PARTITIONED BY (col_name1, col_name2, ...)]
> [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
> [LOCATION path]
> [COMMENT table_comment]
> [TBLPROPERTIES (key1=val1, key2=val2, ...)]
> [AS select_statement]
> {noformat}
> The proposal is to make the following clauses order insensitive. 
> {noformat}
> [OPTIONS (key1=val1, key2=val2, ...)]
> [PARTITIONED BY (col_name1, col_name2, ...)]
> [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
> [LOCATION path]
> [COMMENT table_comment]
> [TBLPROPERTIES (key1=val1, key2=val2, ...)]
> {noformat}
> The same idea is also applicable to Create Hive Table. 
> {noformat}
> CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
> [(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
> [COMMENT table_comment]
> [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
> [ROW FORMAT row_format]
> [STORED AS file_format]
> [LOCATION path]
> [TBLPROPERTIES (key1=val1, key2=val2, ...)]
> [AS select_statement]
> {noformat}
> The proposal is to make the following clauses order insensitive. 
> {noformat}
> [COMMENT table_comment]
> [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
> [ROW FORMAT row_format]
> [STORED AS file_format]
> [LOCATION path]
> [TBLPROPERTIES (key1=val1, key2=val2, ...)]
> {noformat}
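
For illustration (table names and options are made up), the kind of statement pair the proposal accepts regardless of clause order, shown via spark.sql:

{code}
// Both statements carry the same clauses; the second one would be rejected before
// this change because its clauses are not in the grammar's fixed order.
spark.sql("""
  CREATE TABLE t1 (id INT, name STRING)
  USING parquet
  OPTIONS (compression 'snappy')
  PARTITIONED BY (id)
  COMMENT 'demo table'
""")
spark.sql("""
  CREATE TABLE t2 (id INT, name STRING)
  USING parquet
  COMMENT 'demo table'
  PARTITIONED BY (id)
  OPTIONS (compression 'snappy')
""")
{code}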



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic

2018-01-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17833.
---
Resolution: Duplicate

> 'monotonicallyIncreasingId()' should be deterministic
> -
>
> Key: SPARK-17833
> URL: https://issues.apache.org/jira/browse/SPARK-17833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Ushey
>Priority: Critical
>
> Right now, it's (IMHO) too easy to shoot yourself in the foot using 
> 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers 
> to function as a 'stable' primary key, for example, and then go on to use 
> that key in e.g. 'joins' and so on.
> Is there any reason why this function can't be made deterministic? Or, could 
> a deterministic analogue of this function be added (e.g. 
> 'withPrimaryKey(columnName = ...)')?
> A solution is to immediately cache / persist the table after calling 
> 'monotonicallyIncreasingId()'; it's also possible that the documentation 
> should spell that out loud and clear.
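
A short sketch of the cache-right-away workaround the description suggests (run in spark-shell; the DataFrame is just an example):

{code}
import org.apache.spark.sql.functions.monotonically_increasing_id

val df = spark.range(5).toDF("x")
// Persist immediately after generating the ids so later joins/lookups reuse the
// materialized values instead of recomputing them non-deterministically.
val withId = df.withColumn("row_id", monotonically_increasing_id()).cache()
withId.count()   // force materialization
withId.show()
{code}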



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22936) providing HttpStreamSource and HttpStreamSink

2018-01-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309682#comment-16309682
 ] 

Sean Owen commented on SPARK-22936:
---

The typical place to list these is https://spark-packages.org/

> providing HttpStreamSource and HttpStreamSink
> -
>
> Key: SPARK-22936
> URL: https://issues.apache.org/jira/browse/SPARK-22936
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: bluejoe
>
> Hi, in my project I completed spark-http-stream, which is now available on 
> https://github.com/bluejoe2008/spark-http-stream. I am wondering whether it 
> would be useful to others and whether it could be integrated as a part of Spark.
> spark-http-stream transfers Spark structured stream over HTTP protocol. 
> Unlike tcp streams, Kafka streams and HDFS file streams, http streams often 
> flow across distributed big data centers on the Web. This feature is very 
> helpful to build global data processing pipelines across different data 
> centers (scientific research institutes, for example) who own separated data 
> sets.
> The following code shows how to load messages from a HttpStreamSource:
> ```
> val lines = spark.readStream.format(classOf[HttpStreamSourceProvider].getName)
>   .option("httpServletUrl", "http://localhost:8080/;)
>   .option("topic", "topic-1");
>   .option("includesTimestamp", "true")
>   .load();
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-03 Thread Matthew Fishkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309663#comment-16309663
 ] 

Matthew Fishkin edited comment on SPARK-22942 at 1/3/18 1:50 PM:
-

So the fact that 

{code:java}

val rightBreakdown = rightOnly.withColumn("has_black", 
udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
+----+-----------------------+---------+
|name|r_infos                |has_black|
+----+-----------------------+---------+
|c   |[[5,white], [4,orange]]|false    |
|d   |[[4,orange], [2,black]]|true     |
+----+-----------------------+---------+
{code}


correctly uses the UDF, but when I do
{code:java}
rightBreakdown.filter("has_black == true").show(false)
{code}

it says that the UDF hit a nullPointer (even though we see the values were 
computed correctly without the filter).. that is all due to the optimizer order 
of operations?


was (Author: mjfish93):
So the fact that 

{code:java}

val rightBreakdown = rightOnly.withColumn("has_black", 
udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
+----+-----------------------+---------+
|name|r_infos                |has_black|
+----+-----------------------+---------+
|c   |[[5,white], [4,orange]]|false    |
|d   |[[4,orange], [2,black]]|true     |
+----+-----------------------+---------+
{code}


correctly uses the UDF, but when I do

`rightBreakdown.filter("has_black == true").show(false)`

it says that the UDF hit a nullPointer (even though we see the values were 
computed correctly without the filter).. that is all due to the optimizer order 
of operations?

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (So two of our columns are 
> nullable now.
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +----+--------------------------------+-----------------------------------+
> |name|infos                           |infos                              |
> +----+--------------------------------+-----------------------------------+
> |b   |[[1,blue], [2,black]]           |null                               |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null                            |[[5,white], [4,orange]]            |
> |d   |null                            |[[4,orange], [2,black]]            |
> +----+--------------------------------+-----------------------------------+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> +----+-----------------------+
> |name|r_infos                |
> +----+-----------------------+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> +----+-----------------------+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> +----+-----------------------+---------+
> |name|r_infos                |has_black|
> +----+-----------------------+---------+
> |c   |[[5,white], [4,orange]]|false    |
> |d   |[[4,orange], [2,black]]|true     |
> +----+-----------------------+---------+
> {code}
> So far, _exactly_ what 

[jira] [Comment Edited] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-03 Thread Matthew Fishkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309663#comment-16309663
 ] 

Matthew Fishkin edited comment on SPARK-22942 at 1/3/18 1:49 PM:
-

So the fact that 

{code:java}

val rightBreakdown = rightOnly.withColumn("has_black", 
udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
+----+-----------------------+---------+
|name|r_infos                |has_black|
+----+-----------------------+---------+
|c   |[[5,white], [4,orange]]|false    |
|d   |[[4,orange], [2,black]]|true     |
+----+-----------------------+---------+
{code}


correctly uses the UDF, but when I do

`rightBreakdown.filter("has_black == true").show(false)`

it says that the UDF hit a nullPointer (even though we see the values were 
computed correctly without the filter).. that is all due to the optimizer order 
of operations?


was (Author: mjfish93):
So the fact that 

[val rightBreakdown = rightOnly.withColumn("has_black", 
udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
+----+-----------------------+---------+
|name|r_infos                |has_black|
+----+-----------------------+---------+
|c   |[[5,white], [4,orange]]|false    |
|d   |[[4,orange], [2,black]]|true     |
+----+-----------------------+---------+]

correctly uses the UDF, but when I do

`rightBreakdown.filter("has_black == true").show(false)`

it says that the UDF hit a nullPointer (even though we see the values were 
computed correctly without the filter).. that is all due to the optimizer order 
of operations?

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (So two of our columns are 
> nullable now.
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +----+--------------------------------+-----------------------------------+
> |name|infos                           |infos                              |
> +----+--------------------------------+-----------------------------------+
> |b   |[[1,blue], [2,black]]           |null                               |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null                            |[[5,white], [4,orange]]            |
> |d   |null                            |[[4,orange], [2,black]]            |
> +----+--------------------------------+-----------------------------------+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> +----+-----------------------+
> |name|r_infos                |
> +----+-----------------------+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> +----+-----------------------+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> +----+-----------------------+---------+
> |name|r_infos                |has_black|
> +----+-----------------------+---------+
> |c   |[[5,white], [4,orange]]|false    |
> |d   |[[4,orange], [2,black]]|true     |
> +----+-----------------------+---------+
> {code}
> So far, _exactly_ what we expected. 
> *However*, when I 

[jira] [Commented] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-03 Thread Matthew Fishkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309663#comment-16309663
 ] 

Matthew Fishkin commented on SPARK-22942:
-

So the fact that 

[val rightBreakdown = rightOnly.withColumn("has_black", 
udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
+----+-----------------------+---------+
|name|r_infos                |has_black|
+----+-----------------------+---------+
|c   |[[5,white], [4,orange]]|false    |
|d   |[[4,orange], [2,black]]|true     |
+----+-----------------------+---------+]

correctly uses the UDF, but when I do

`rightBreakdown.filter("has_black == true").show(false)`

it says that the UDF hit a nullPointer (even though we see the values were 
computed correctly without the filter).. that is all due to the optimizer order 
of operations?
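
For what it's worth, a null-guarded variant of the UDF avoids the failure in a quick sketch (reusing {{rightOnly}} from the report below, in spark-shell). The stack trace in the description suggests the optimizer's EliminateOuterJoin rule evaluates the UDF with a null input while checking whether the filter can drop the outer join, so tolerating a null Seq is enough:

{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Same logic as hasBlack, but returns false on a null input instead of throwing.
val hasBlackSafe = udf((s: Seq[Row]) =>
  s != null && s.exists { case Row(_: Int, color: String) => color == "black" })

val safeBreakdown = rightOnly.withColumn("has_black", hasBlackSafe($"r_infos"))
safeBreakdown.filter("has_black == true").show(false)
{code}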

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (So two of our columns are 
> nullable now.
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +----+--------------------------------+-----------------------------------+
> |name|infos                           |infos                              |
> +----+--------------------------------+-----------------------------------+
> |b   |[[1,blue], [2,black]]           |null                               |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null                            |[[5,white], [4,orange]]            |
> |d   |null                            |[[4,orange], [2,black]]            |
> +----+--------------------------------+-----------------------------------+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> +----+-----------------------+
> |name|r_infos                |
> +----+-----------------------+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> +----+-----------------------+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> +----+-----------------------+---------+
> |name|r_infos                |has_black|
> +----+-----------------------+---------+
> |c   |[[5,white], [4,orange]]|false    |
> |d   |[[4,orange], [2,black]]|true     |
> +----+-----------------------+---------+
> {code}
> So far, _exactly_ what we expected. 
> *However*, when I try to filter `rightBreakdown`, it fails.
> {code:java}
> rightBreakdown.filter("has_black == true").show(false)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$hasBlack$1: (array>) => 
> boolean)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
>   at 
> 

[jira] [Assigned] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.

2018-01-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22938:
---

Assignee: Juliusz Sompolski

> Assert that SQLConf.get is accessed only on the driver.
> ---
>
> Key: SPARK-22938
> URL: https://issues.apache.org/jira/browse/SPARK-22938
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
> Fix For: 2.3.0
>
>
> Assert if code tries to access SQLConf.get on executor.
> This can lead to hard to detect bugs, where the executor will read 
> fallbackConf, falling back to default config values, ignoring potentially 
> changed non-default configs.
> If a config is to be passed to executor code, it needs to be read on the 
> driver, and passed explicitly.
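
A tiny sketch of the pattern this enforces: read the value once on the driver and close over the plain value, rather than touching SQLConf.get in executor-side code (the config key here is just an example):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("conf-on-driver").getOrCreate()

// Read on the driver...
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions").toInt
// ...and ship only the captured Int to executors instead of calling SQLConf.get there.
val histogram = spark.sparkContext
  .parallelize(1 to 1000)
  .map(x => x % shufflePartitions)
  .countByValue()
println(histogram.toSeq.sortBy(_._1).take(5))
spark.stop()
{code}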



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.

2018-01-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22938.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20136
[https://github.com/apache/spark/pull/20136]

> Assert that SQLConf.get is accessed only on the driver.
> ---
>
> Key: SPARK-22938
> URL: https://issues.apache.org/jira/browse/SPARK-22938
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
> Fix For: 2.3.0
>
>
> Assert if code tries to access SQLConf.get on executor.
> This can lead to hard to detect bugs, where the executor will read 
> fallbackConf, falling back to default config values, ignoring potentially 
> changed non-default configs.
> If a config is to be passed to executor code, it needs to be read on the 
> driver, and passed explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22936) providing HttpStreamSource and HttpStreamSink

2018-01-03 Thread bluejoe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309220#comment-16309220
 ] 

bluejoe edited comment on SPARK-22936 at 1/3/18 1:31 PM:
-

The latest spark-http-stream artifact has been released to the central maven 
repository and is free to use in any java/scala projects.

I would appreciate it if spark-http-stream could be listed as a third-party 
Source/Sink in the Spark documentation.


was (Author: bluejoe):
The latest spark-http-stream artifact has been released to the central maven 
repository and is free to use in any java/scala projects.

I would be very appreciated that spark-http-stream is listed as a third-party 
Source/Sink in Spark documentation.

> providing HttpStreamSource and HttpStreamSink
> -
>
> Key: SPARK-22936
> URL: https://issues.apache.org/jira/browse/SPARK-22936
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: bluejoe
>
> Hi, in my project I completed spark-http-stream, which is now available on 
> https://github.com/bluejoe2008/spark-http-stream. I am wondering whether it 
> would be useful to others and whether it could be integrated as a part of Spark.
> spark-http-stream transfers Spark structured stream over HTTP protocol. 
> Unlike tcp streams, Kafka streams and HDFS file streams, http streams often 
> flow across distributed big data centers on the Web. This feature is very 
> helpful to build global data processing pipelines across different data 
> centers (scientific research institutes, for example) who own separated data 
> sets.
> The following code shows how to load messages from a HttpStreamSource:
> ```
> val lines = spark.readStream.format(classOf[HttpStreamSourceProvider].getName)
>   .option("httpServletUrl", "http://localhost:8080/;)
>   .option("topic", "topic-1");
>   .option("includesTimestamp", "true")
>   .load();
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2018-01-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309585#comment-16309585
 ] 

Hyukjin Kwon commented on SPARK-7721:
-

[~rxin] What do you think about doing a script and then integrating it with 
Jenkins? Or would you want me to check "2. Integrating with Jenkins" a bit more 
for clarification?

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> Would be great to have test coverage report for Python. Compared with Scala, 
> it is trickier to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-01-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309582#comment-16309582
 ] 

Hyukjin Kwon commented on SPARK-22711:
--

Thanks [~PrateekRM]. However, it seems it's not quite self-contained. For example, 
the sys import is missing and the setup_environment function is missing too. Could 
you double-check and make it self-runnable?

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a Pyspark program with spark-submit command this error is 
> thrown.
> It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in 

[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309509#comment-16309509
 ] 

Apache Spark commented on SPARK-19228:
--

User 'sergey-rubtsov' has created a pull request for this issue:
https://github.com/apache/spark/pull/20140

> inferSchema function processed csv date column as string and "dateFormat" 
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.0
>Reporter: Sergey Rubtsov
>  Labels: easyfix
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> I need to process user.csv like this:
> {code}
> id,project,started,ended
> sergey.rubtsov,project0,12/12/2012,10/10/2015
> {code}
> When I add date format options:
> {code}
> Dataset<Row> users = spark.read().format("csv")
>         .option("mode", "PERMISSIVE")
>         .option("header", "true")
>         .option("inferSchema", "true")
>         .option("dateFormat", "dd/MM/")
>         .load("src/main/resources/user.csv");
> users.printSchema();
> {code}
> the expected schema should be
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: date (nullable = true)
>  |-- ended: date (nullable = true)
> {code}
> but the actual result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: string (nullable = true)
>  |-- ended: string (nullable = true)
> {code}
> This means that the date is processed as a string and the "dateFormat"
> option is ignored.
> If I add option 
> {code}
> .option("timestampFormat", "dd/MM/")
> {code}
> result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: timestamp (nullable = true)
>  |-- ended: timestamp (nullable = true)
> {code}
> I think the issue is somewhere in the object CSVInferSchema, function
> inferField, lines 80-97: a method "tryParseDate" needs to be added
> before/after "tryParseTimestamp", or the date/timestamp processing logic
> needs to be changed.
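
Until inferSchema honours dateFormat, one workaround sketch (it assumes the
full pattern is dd/MM/yyyy, as the sample row suggests, and the two-argument
to_date available since Spark 2.2) is to load the date columns as strings and
convert them explicitly:

{code:python}
# Workaround sketch: skip inference for the date columns and convert them
# explicitly with to_date; path and format are taken/assumed from the report.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()
df = (spark.read.format("csv")
      .option("mode", "PERMISSIVE")
      .option("header", "true")
      .load("src/main/resources/user.csv"))
df = (df.withColumn("started", to_date(df["started"], "dd/MM/yyyy"))
        .withColumn("ended", to_date(df["ended"], "dd/MM/yyyy")))
df.printSchema()  # started/ended should now be date columns
{code}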



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic

2018-01-03 Thread Sujith Jay Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309411#comment-16309411
 ] 

Sujith Jay Nair commented on SPARK-17833:
-

This issue is resolved in 2.0, as mentioned in [SPARK-14241 | 
https://issues.apache.org/jira/browse/SPARK-14241]

> 'monotonicallyIncreasingId()' should be deterministic
> -
>
> Key: SPARK-17833
> URL: https://issues.apache.org/jira/browse/SPARK-17833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Ushey
>Priority: Critical
>
> Right now, it's (IMHO) too easy to shoot yourself in the foot with
> 'monotonicallyIncreasingId()': it's natural to expect the generated numbers
> to function as a 'stable' primary key, for example, and then go on to use
> that key in e.g. 'joins' and so on.
> Is there any reason why this function can't be made deterministic? Or, could 
> a deterministic analogue of this function be added (e.g. 
> 'withPrimaryKey(columnName = ...)')?
> A solution is to immediately cache / persist the table after calling 
> 'monotonicallyIncreasingId()'; it's also possible that the documentation 
> should spell that out loud and clear.
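
For reference, a minimal sketch of the cache/persist workaround mentioned in
the description (illustrative only, using the PySpark name
monotonically_increasing_id):

{code:python}
# Sketch of the workaround: persist immediately after assigning the id so the
# generated values are fixed before any join or other reuse.
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).withColumn("pk", monotonically_increasing_id()).cache()
df.count()  # materialize the cache so the generated ids are not recomputed
df.show()   # the pk column is now stable for subsequent joins
{code}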



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22898) collect_set aggregation on bucketed table causes an exchange stage

2018-01-03 Thread Modi Tamam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309407#comment-16309407
 ] 

Modi Tamam commented on SPARK-22898:


Sure, no problem. I did double check it on 2.2.1, and it looks just fine.

> collect_set aggregation on bucketed table causes an exchange stage
> --
>
> Key: SPARK-22898
> URL: https://issues.apache.org/jira/browse/SPARK-22898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Modi Tamam
>  Labels: bucketing
>
> I'm using Spark-2.2. I'm POCing Spark's bucketing. I've created a bucketed 
> table, here's the desc formatted my_bucketed_tbl output:
> +--------------------+--------------------+-------+
> |            col_name|           data_type|comment|
> +--------------------+--------------------+-------+
> |              bundle|              string|   null|
> |                 ifa|              string|   null|
> |               date_|                date|   null|
> |                hour|                 int|   null|
> |                    |                    |       |
> |# Detailed Table ...|                    |       |
> |            Database|             default|       |
> |               Table|     my_bucketed_tbl|       |
> |               Owner|            zeppelin|       |
> |             Created|Thu Dec 21 13:43:...|       |
> |         Last Access|Thu Jan 01 00:00:...|       |
> |                Type|            EXTERNAL|       |
> |            Provider|                 orc|       |
> |         Num Buckets|                  16|       |
> |      Bucket Columns|             [`ifa`]|       |
> |        Sort Columns|             [`ifa`]|       |
> |    Table Properties|[transient_lastDd...|       |
> |            Location|hdfs:/user/hive/w...|       |
> |       Serde Library|org.apache.hadoop...|       |
> |         InputFormat|org.apache.hadoop...|       |
> |        OutputFormat|org.apache.hadoop...|       |
> +--------------------+--------------------+-------+
> When I execute an explain of a group-by query, I can see that the exchange
> phase is spared:
> {code:java}
> sql("select ifa,max(bundle) from my_bucketed_tbl group by ifa").explain
> == Physical Plan ==
> SortAggregate(key=[ifa#932], functions=[max(bundle#920)])
> +- SortAggregate(key=[ifa#932], functions=[partial_max(bundle#920)])
>+- *Sort [ifa#932 ASC NULLS FIRST], false, 0
>   +- *FileScan orc default.level_1[bundle#920,ifa#932] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> {code}
> But when I replace Spark's max function with collect_set, I can see that the
> execution plan is the same as for a non-bucketed table, meaning the exchange
> phase is not spared:
> {code:java}
> sql("select ifa,collect_set(bundle) from my_bucketed_tbl group by 
> ifa").explain
> == Physical Plan ==
> ObjectHashAggregate(keys=[ifa#1010], functions=[collect_set(bundle#998, 0, 
> 0)])
> +- Exchange hashpartitioning(ifa#1010, 200)
>+- ObjectHashAggregate(keys=[ifa#1010], 
> functions=[partial_collect_set(bundle#998, 0, 0)])
>   +- *FileScan orc default.level_1[bundle#998,ifa#1010] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> {code}
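
A small reproduction sketch of the pattern above (table name, bucket count and
columns taken from the description; it assumes a PySpark build where
DataFrameWriter.bucketBy is available, which may not be the case on 2.2):

{code:python}
# Reproduction sketch: write a bucketed/sorted table, then compare the physical
# plans of max() and collect_set() aggregations on the bucketing column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("app1", "ifa-1"), ("app2", "ifa-1"), ("app1", "ifa-2")],
    ["bundle", "ifa"])
(df.write.mode("overwrite")
   .bucketBy(16, "ifa")
   .sortBy("ifa")
   .saveAsTable("my_bucketed_tbl"))

spark.sql("select ifa, max(bundle) from my_bucketed_tbl group by ifa").explain()
spark.sql("select ifa, collect_set(bundle) from my_bucketed_tbl "
          "group by ifa").explain()
{code}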



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22714) Spark API Not responding when Fatal exception occurred in event loop

2018-01-03 Thread Sujith Jay Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309366#comment-16309366
 ] 

Sujith Jay Nair commented on SPARK-22714:
-

Hi [~todesking], is this reproducible outside of the Spark REPL? I'm trying to
understand whether this is specific to the Spark shell.

> Spark API Not responding when Fatal exception occurred in event loop
> 
>
> Key: SPARK-22714
> URL: https://issues.apache.org/jira/browse/SPARK-22714
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: todesking
>Priority: Critical
>
> To reproduce, let Spark throw an OOM exception in the event loop:
> {noformat}
> scala> spark.sparkContext.getConf.get("spark.driver.memory")
> res0: String = 1g
> scala> val a = new Array[Int](4 * 1000 * 1000)
> scala> val ds = spark.createDataset(a)
> scala> ds.rdd.zipWithIndex
> [Stage 0:>  (0 + 0) / 
> 3]Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: 
> Java heap space
> [Stage 0:>  (0 + 0) / 
> 3]
> // Spark is not responding
> {noformat}
> While not responding, Spark is waiting on a Promise that is never completed.
> The promise depends on work done in the event loop thread, but that thread is
> dead once a fatal exception is thrown.
> {noformat}
> "main" #1 prio=5 os_prio=31 tid=0x7ffc9300b000 nid=0x1703 waiting on 
> condition [0x70216000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007ad978eb8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
> at 
> org.apache.spark.rdd.ZippedWithIndexRDD.(ZippedWithIndexRDD.scala:50)
> at 
> org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
> at 
> org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
> {noformat}
> I don't know how to fix it properly, but it seems we need to add Fatal error 
> handling to EventLoop.run() in core/EventLoop.scala
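
To help check whether this is shell-specific, a rough sketch of the same steps
as a standalone spark-submit script is shown below (sizes taken from the
report; whether it hits the same OOM depends on the actual driver memory):

{code:python}
# Sketch of the shell repro as a standalone script. Run with, e.g.:
#   spark-submit --driver-memory 1g spark_22714_repro.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-22714-repro").getOrCreate()
ds = spark.range(4 * 1000 * 1000)   # stands in for the 4M-element int array
ds.rdd.zipWithIndex().count()       # zipWithIndex triggers a job, as in the shell
spark.stop()
{code}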



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2018-01-03 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309363#comment-16309363
 ] 

Saisai Shao commented on SPARK-22876:
-

This looks like a bug; ideally we should get the number of attempts from the
RM instead of calculating it from the attempt id.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n
> times (n is the minimum of spark.yarn.maxAppAttempts and
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail
> 20 times.
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> And after checking the source code, I found that in the source file
> ApplicationMaster.scala
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there is a ShutdownHook that checks the attempt id against maxAppAttempts;
> if attempt id >= maxAppAttempts, it will try to unregister the application
> and the application will finish.
> Is this an expected design or a bug?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22944) improve FoldablePropagation

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22944:


Assignee: Wenchen Fan  (was: Apache Spark)

> improve FoldablePropagation
> ---
>
> Key: SPARK-22944
> URL: https://issues.apache.org/jira/browse/SPARK-22944
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22944) improve FoldablePropagation

2018-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22944:


Assignee: Apache Spark  (was: Wenchen Fan)

> improve FoldablePropagation
> ---
>
> Key: SPARK-22944
> URL: https://issues.apache.org/jira/browse/SPARK-22944
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22944) improve FoldablePropagation

2018-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309314#comment-16309314
 ] 

Apache Spark commented on SPARK-22944:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20139

> improve FoldablePropagation
> ---
>
> Key: SPARK-22944
> URL: https://issues.apache.org/jira/browse/SPARK-22944
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22944) improve FoldablePropagation

2018-01-03 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22944:
---

 Summary: improve FoldablePropagation
 Key: SPARK-22944
 URL: https://issues.apache.org/jira/browse/SPARK-22944
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org