[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22814:

Component/s: (was: Input/Output)
 SQL

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
>   table="employees",
>   columnName="emp_no",
>   lowerBound=1L,
>   upperBound=10L,
>   numPartitions=100,
>   connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table. However, many 
> tables have no primary key but do have date/timestamp indexes.






[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-22814:
-
Docs Text:   (was: https://github.com/apache/spark/pull/1)

https://github.com/apache/spark/pull/1

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
>   table="employees",
>   columnName="emp_no",
>   lowerBound=1L,
>   upperBound=10L,
>   numPartitions=100,
>   connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table. However, many 
> tables have no primary key but do have date/timestamp indexes.






[jira] [Issue Comment Deleted] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-22814:
-
Comment: was deleted

(was: https://github.com/apache/spark/pull/1)

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
>   table="employees",
>   columnName="emp_no",
>   lowerBound=1L,
>   upperBound=10L,
>   numPartitions=100,
>   connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table. However, many 
> tables have no primary key but do have date/timestamp indexes.






[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-22814:
-
 Docs Text: https://github.com/apache/spark/pull/1
External issue URL:   (was: https://github.com/apache/spark/pull/1)

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
>   table="employees",
>   columnName="emp_no",
>   lowerBound=1L,
>   upperBound=10L,
>   numPartitions=100,
>   connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table. However, many 
> tables have no primary key but do have date/timestamp indexes.






[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-22814:
-
External issue URL: https://github.com/apache/spark/pull/1

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
>   table="employees",
>   columnName="emp_no",
>   lowerBound=1L,
>   upperBound=10L,
>   numPartitions=100,
>   connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table. However, many 
> tables have no primary key but do have date/timestamp indexes.






[jira] [Commented] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293680#comment-16293680
 ] 

Yuechen Chen commented on SPARK-22814:
--

https://github.com/apache/spark/pull/1


> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
>   table="employees",
>   columnName="emp_no",
>   lowerBound=1L,
>   upperBound=10L,
>   numPartitions=100,
>   connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table. However, many 
> tables have no primary key but do have date/timestamp indexes.






[jira] [Created] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)
Yuechen Chen created SPARK-22814:


 Summary: JDBC support date/timestamp type as partitionColumn
 Key: SPARK-22814
 URL: https://issues.apache.org/jira/browse/SPARK-22814
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.2.1, 1.6.2
Reporter: Yuechen Chen


In Spark, you can partition MySQL queries by partitionColumn.
val df = (spark.read.jdbc(url=jdbcUrl,
  table="employees",
  columnName="emp_no",
  lowerBound=1L,
  upperBound=10L,
  numPartitions=100,
  connectionProperties=connectionProperties))
display(df)

But partitionColumn must be a numeric column from the table. However, many 
tables have no primary key but do have date/timestamp indexes.
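
For reference, partition predicates can already be spelled out by hand with the existing 
jdbc(url, table, predicates, connectionProperties) overload of DataFrameReader, which is one 
way to partition on a date column today. A minimal sketch, assuming a made-up jdbcUrl, 
credentials, and a hire_date column on the employees table:

{code}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-date-partition-sketch").getOrCreate()

// Illustrative connection details only.
val jdbcUrl = "jdbc:mysql://dbhost:3306/employees_db"
val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")

// One predicate per partition; each entry becomes the WHERE clause of one JDBC query.
val predicates = Array(
  "hire_date >= '2017-01-01' AND hire_date < '2017-04-01'",
  "hire_date >= '2017-04-01' AND hire_date < '2017-07-01'",
  "hire_date >= '2017-07-01' AND hire_date < '2017-10-01'",
  "hire_date >= '2017-10-01' AND hire_date < '2018-01-01'"
)

val df = spark.read.jdbc(jdbcUrl, "employees", predicates, connectionProperties)
{code}

The improvement proposed in this issue would let partitionColumn itself be a date/timestamp 
column, with Spark deriving equivalent range predicates from lowerBound, upperBound and 
numPartitions.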






[jira] [Reopened] (SPARK-22496) beeline display operation log

2017-12-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-22496:
--

It was reverted in 
https://github.com/apache/spark/commit/e58f275678fb4f904124a4a2a1762f04c835eb0e 
due to an R test failure.

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
>
> For now, when an end user runs queries in beeline or in Hue through STS, 
> no logs are displayed, and the end user has to wait until the job finishes or fails. 
> Progress information is needed to show end users how the job is running if 
> they are not familiar with the YARN RM or the standalone Spark master UI. 






[jira] [Commented] (SPARK-22812) Failing cran-check on master

2017-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293648#comment-16293648
 ] 

Felix Cheung commented on SPARK-22812:
--

Not exactly... what’s the environment? Seems like something is wrong 
connecting/pulling from CRAN.






> Failing cran-check on master 
> -
>
> Key: SPARK-22812
> URL: https://issues.apache.org/jira/browse/SPARK-22812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hossein Falaki
>
> When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following 
> failure message:
> {code}
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) :
>   dims [product 22] do not match the length of object [0]
> {code}
> cc [~felixcheung] have you experienced this error before?






[jira] [Commented] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2017-12-15 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293619#comment-16293619
 ] 

Huaxin Gao commented on SPARK-22796:


I will work on this. 

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>







[jira] [Assigned] (SPARK-22813) run-tests.py fails when /usr/sbin/lsof does not exist

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22813:


Assignee: Apache Spark

> run-tests.py fails when /usr/sbin/lsof does not exist
> -
>
> Key: SPARK-22813
> URL: https://issues.apache.org/jira/browse/SPARK-22813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>
> Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
> (e.g. lsof is at /usr/bin/lsof instead) gives the error
> {code}
> /bin/sh: 1: /usr/sbin/lsof: not found
> Usage:
>  kill [options] <pid> [...]
> Options:
>  <pid> [...]            send signal to every <pid> listed
>  -<signal>, -s, --signal <signal>
>                         specify the <signal> to be sent
>  -l, --list=[<signal>]  list all signal names, or convert one to a name
>  -L, --table            list all signal names in a nice table
>  -h, --help     display this help and exit
>  -V, --version  output version information and exit
> For more details see kill(1).
> Traceback (most recent call last):
>   File "./dev/run-tests.py", line 626, in <module>
>     main()
>   File "./dev/run-tests.py", line 597, in main
>     build_apache_spark(build_tool, hadoop_version)
>   File "./dev/run-tests.py", line 389, in build_apache_spark
>     build_spark_maven(hadoop_version)
>   File "./dev/run-tests.py", line 329, in build_spark_maven
>     exec_maven(profiles_and_goals)
>   File "./dev/run-tests.py", line 270, in exec_maven
>     kill_zinc_on_port(zinc_port)
>   File "./dev/run-tests.py", line 258, in kill_zinc_on_port
>     subprocess.check_call(cmd, shell=True)
>   File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
> LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
> {code}






[jira] [Commented] (SPARK-22813) run-tests.py fails when /usr/sbin/lsof does not exist

2017-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293617#comment-16293617
 ] 

Apache Spark commented on SPARK-22813:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19998

> run-tests.py fails when /usr/sbin/lsof does not exist
> -
>
> Key: SPARK-22813
> URL: https://issues.apache.org/jira/browse/SPARK-22813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
> (e.g. lsof is at /usr/bin/lsof instead) gives the error
> {code}
> /bin/sh: 1: /usr/sbin/lsof: not found
> Usage:
>  kill [options] <pid> [...]
> Options:
>  <pid> [...]            send signal to every <pid> listed
>  -<signal>, -s, --signal <signal>
>                         specify the <signal> to be sent
>  -l, --list=[<signal>]  list all signal names, or convert one to a name
>  -L, --table            list all signal names in a nice table
>  -h, --help     display this help and exit
>  -V, --version  output version information and exit
> For more details see kill(1).
> Traceback (most recent call last):
>   File "./dev/run-tests.py", line 626, in <module>
>     main()
>   File "./dev/run-tests.py", line 597, in main
>     build_apache_spark(build_tool, hadoop_version)
>   File "./dev/run-tests.py", line 389, in build_apache_spark
>     build_spark_maven(hadoop_version)
>   File "./dev/run-tests.py", line 329, in build_spark_maven
>     exec_maven(profiles_and_goals)
>   File "./dev/run-tests.py", line 270, in exec_maven
>     kill_zinc_on_port(zinc_port)
>   File "./dev/run-tests.py", line 258, in kill_zinc_on_port
>     subprocess.check_call(cmd, shell=True)
>   File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
> LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
> {code}






[jira] [Assigned] (SPARK-22813) run-tests.py fails when /usr/sbin/lsof does not exist

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22813:


Assignee: (was: Apache Spark)

> run-tests.py fails when /usr/sbin/lsof does not exist
> -
>
> Key: SPARK-22813
> URL: https://issues.apache.org/jira/browse/SPARK-22813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
> (e.g. lsof is at /usr/bin/lsof instead) gives the error
> {code}
> /bin/sh: 1: /usr/sbin/lsof: not found
> Usage:
>  kill [options] <pid> [...]
> Options:
>  <pid> [...]            send signal to every <pid> listed
>  -<signal>, -s, --signal <signal>
>                         specify the <signal> to be sent
>  -l, --list=[<signal>]  list all signal names, or convert one to a name
>  -L, --table            list all signal names in a nice table
>  -h, --help     display this help and exit
>  -V, --version  output version information and exit
> For more details see kill(1).
> Traceback (most recent call last):
>   File "./dev/run-tests.py", line 626, in <module>
>     main()
>   File "./dev/run-tests.py", line 597, in main
>     build_apache_spark(build_tool, hadoop_version)
>   File "./dev/run-tests.py", line 389, in build_apache_spark
>     build_spark_maven(hadoop_version)
>   File "./dev/run-tests.py", line 329, in build_spark_maven
>     exec_maven(profiles_and_goals)
>   File "./dev/run-tests.py", line 270, in exec_maven
>     kill_zinc_on_port(zinc_port)
>   File "./dev/run-tests.py", line 258, in kill_zinc_on_port
>     subprocess.check_call(cmd, shell=True)
>   File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
> LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
> {code}






[jira] [Commented] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof

2017-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293589#comment-16293589
 ] 

Apache Spark commented on SPARK-22377:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19998

> Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
> --
>
> Key: SPARK-22377
> URL: https://issues.apache.org/jira/browse/SPARK-22377
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Xin Lu
>Assignee: Hyukjin Kwon
> Fix For: 2.1.3, 2.2.1, 2.3.0
>
>
> It looks like multiple workers in the AMPLab Jenkins cannot execute lsof.  
> Example log below:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console
> spark-build/dev/create-release/release-build.sh: line 344: lsof: command not 
> found
> usage: kill [ -s signal | -p ] [ -a ] pid ...
>        kill -l [ signal ]
> I looked at the jobs and it looks like only amp-jenkins-worker-01 works, so 
> you are getting a successful build every week or so. It is unclear whether the 
> snapshot is actually released.






[jira] [Assigned] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22811:


Assignee: Bago Amirbekian

> pyspark.ml.tests is missing a py4j import.
> --
>
> Key: SPARK-22811
> URL: https://issues.apache.org/jira/browse/SPARK-22811
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Minor
> Fix For: 2.3.0
>
>
> This bug isn't getting caught because the relevant code only gets run if the 
> test environment does not have Hive.
> https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866






[jira] [Commented] (SPARK-22813) run-tests.py fails when /usr/sbin/lsof does not exist

2017-12-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293570#comment-16293570
 ] 

Kazuaki Ishizaki commented on SPARK-22813:
--

Thank you for pointing out the PR that we worked on. I was not able to remember 
the number.

> run-tests.py fails when /usr/sbin/lsof does not exist
> -
>
> Key: SPARK-22813
> URL: https://issues.apache.org/jira/browse/SPARK-22813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
> (e.g. lsof is at /usr/bin/lsof instead) gives the error
> {code}
> /bin/sh: 1: /usr/sbin/lsof: not found
> Usage:
>  kill [options] <pid> [...]
> Options:
>  <pid> [...]            send signal to every <pid> listed
>  -<signal>, -s, --signal <signal>
>                         specify the <signal> to be sent
>  -l, --list=[<signal>]  list all signal names, or convert one to a name
>  -L, --table            list all signal names in a nice table
>  -h, --help     display this help and exit
>  -V, --version  output version information and exit
> For more details see kill(1).
> Traceback (most recent call last):
>   File "./dev/run-tests.py", line 626, in <module>
>     main()
>   File "./dev/run-tests.py", line 597, in main
>     build_apache_spark(build_tool, hadoop_version)
>   File "./dev/run-tests.py", line 389, in build_apache_spark
>     build_spark_maven(hadoop_version)
>   File "./dev/run-tests.py", line 329, in build_spark_maven
>     exec_maven(profiles_and_goals)
>   File "./dev/run-tests.py", line 270, in exec_maven
>     kill_zinc_on_port(zinc_port)
>   File "./dev/run-tests.py", line 258, in kill_zinc_on_port
>     subprocess.check_call(cmd, shell=True)
>   File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
> LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
> {code}






[jira] [Updated] (SPARK-22813) run-tests.py fails when /usr/sbin/lsof does not exist

2017-12-15 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22813:
-
Description: 
Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
(e.g. lsof is at /usr/bin/lsof instead) gives the error

{code}
/bin/sh: 1: /usr/sbin/lsof: not found

Usage:
 kill [options] <pid> [...]

Options:
 <pid> [...]            send signal to every <pid> listed
 -<signal>, -s, --signal <signal>
                        specify the <signal> to be sent
 -l, --list=[<signal>]  list all signal names, or convert one to a name
 -L, --table            list all signal names in a nice table

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
Traceback (most recent call last):
  File "./dev/run-tests.py", line 626, in <module>
    main()
  File "./dev/run-tests.py", line 597, in main
    build_apache_spark(build_tool, hadoop_version)
  File "./dev/run-tests.py", line 389, in build_apache_spark
    build_spark_maven(hadoop_version)
  File "./dev/run-tests.py", line 329, in build_spark_maven
    exec_maven(profiles_and_goals)
  File "./dev/run-tests.py", line 270, in exec_maven
    kill_zinc_on_port(zinc_port)
  File "./dev/run-tests.py", line 258, in kill_zinc_on_port
    subprocess.check_call(cmd, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
{code}

  was:
Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
(e.g. lsof is at /usr/bin/lsof instead) gives the error

```
/bin/sh: 1: /usr/sbin/lsof: not found

Usage:
 kill [options] <pid> [...]

Options:
 <pid> [...]            send signal to every <pid> listed
 -<signal>, -s, --signal <signal>
                        specify the <signal> to be sent
 -l, --list=[<signal>]  list all signal names, or convert one to a name
 -L, --table            list all signal names in a nice table

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
Traceback (most recent call last):
  File "./dev/run-tests.py", line 626, in <module>
    main()
  File "./dev/run-tests.py", line 597, in main
    build_apache_spark(build_tool, hadoop_version)
  File "./dev/run-tests.py", line 389, in build_apache_spark
    build_spark_maven(hadoop_version)
  File "./dev/run-tests.py", line 329, in build_spark_maven
    exec_maven(profiles_and_goals)
  File "./dev/run-tests.py", line 270, in exec_maven
    kill_zinc_on_port(zinc_port)
  File "./dev/run-tests.py", line 258, in kill_zinc_on_port
    subprocess.check_call(cmd, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
```


> run-tests.py fails when /usr/sbin/lsof does not exist
> -
>
> Key: SPARK-22813
> URL: https://issues.apache.org/jira/browse/SPARK-22813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
> (e.g. lsof is at /usr/bin/lsof instead) gives the error
> {code}
> /bin/sh: 1: /usr/sbin/lsof: not found
> Usage:
>  kill [options] <pid> [...]
> Options:
>  <pid> [...]            send signal to every <pid> listed
>  -<signal>, -s, --signal <signal>
>                         specify the <signal> to be sent
>  -l, --list=[<signal>]  list all signal names, or convert one to a name
>  -L, --table            list all signal names in a nice table
>  -h, --help     display this help and exit
>  -V, --version  output version information and exit
> For more details see kill(1).
> Traceback (most recent call last):
>   File "./dev/run-tests.py", line 626, in <module>
>     main()
>   File "./dev/run-tests.py", line 597, in main
>     build_apache_spark(build_tool, hadoop_version)
>   File "./dev/run-tests.py", line 389, in build_apache_spark
>     build_spark_maven(hadoop_version)
>   File "./dev/run-tests.py", line 329, in build_spark_maven
>     exec_maven(profiles_and_goals)
>   File "./dev/run-tests.py", line 270, in exec_maven
>     kill_zinc_on_port(zinc_port)
>   File "./dev/run-tests.py", line 258, in kill_zinc_on_port
>     subprocess.check_call(cmd, shell=True)
>   File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
> LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
> {code}





[jira] [Commented] (SPARK-22813) run-tests.py fails when /usr/sbin/lsof does not exist

2017-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293564#comment-16293564
 ] 

Sean Owen commented on SPARK-22813:
---

https://issues.apache.org/jira/browse/SPARK-22377

> run-tests.py fails when /usr/sbin/lsof does not exist
> -
>
> Key: SPARK-22813
> URL: https://issues.apache.org/jira/browse/SPARK-22813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
> (e.g. lsof is at /usr/bin/lsof instead) gives the error
> ```
> /bin/sh: 1: /usr/sbin/lsof: not found
> Usage:
>  kill [options] <pid> [...]
> Options:
>  <pid> [...]            send signal to every <pid> listed
>  -<signal>, -s, --signal <signal>
>                         specify the <signal> to be sent
>  -l, --list=[<signal>]  list all signal names, or convert one to a name
>  -L, --table            list all signal names in a nice table
>  -h, --help     display this help and exit
>  -V, --version  output version information and exit
> For more details see kill(1).
> Traceback (most recent call last):
>   File "./dev/run-tests.py", line 626, in <module>
>     main()
>   File "./dev/run-tests.py", line 597, in main
>     build_apache_spark(build_tool, hadoop_version)
>   File "./dev/run-tests.py", line 389, in build_apache_spark
>     build_spark_maven(hadoop_version)
>   File "./dev/run-tests.py", line 329, in build_spark_maven
>     exec_maven(profiles_and_goals)
>   File "./dev/run-tests.py", line 270, in exec_maven
>     kill_zinc_on_port(zinc_port)
>   File "./dev/run-tests.py", line 258, in kill_zinc_on_port
>     subprocess.check_call(cmd, shell=True)
>   File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
> LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
> ```






[jira] [Created] (SPARK-22813) run-tests.py fails when /usr/sbin/lsof does not exist

2017-12-15 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22813:


 Summary: run-tests.py fails when /usr/sbin/lsof does not exist
 Key: SPARK-22813
 URL: https://issues.apache.org/jira/browse/SPARK-22813
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.2.1, 2.3.0
Reporter: Kazuaki Ishizaki
Priority: Minor


Running ./dev/run-tests.py for mvn on an OS that does not have /usr/sbin/lsof 
(e.g. lsof is at /usr/bin/lsof instead) gives the error

```
/bin/sh: 1: /usr/sbin/lsof: not found

Usage:
 kill [options] <pid> [...]

Options:
 <pid> [...]            send signal to every <pid> listed
 -<signal>, -s, --signal <signal>
                        specify the <signal> to be sent
 -l, --list=[<signal>]  list all signal names, or convert one to a name
 -L, --table            list all signal names in a nice table

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
Traceback (most recent call last):
  File "./dev/run-tests.py", line 626, in <module>
    main()
  File "./dev/run-tests.py", line 597, in main
    build_apache_spark(build_tool, hadoop_version)
  File "./dev/run-tests.py", line 389, in build_apache_spark
    build_spark_maven(hadoop_version)
  File "./dev/run-tests.py", line 329, in build_spark_maven
    exec_maven(profiles_and_goals)
  File "./dev/run-tests.py", line 270, in exec_maven
    kill_zinc_on_port(zinc_port)
  File "./dev/run-tests.py", line 258, in kill_zinc_on_port
    subprocess.check_call(cmd, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep 
LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
```






[jira] [Resolved] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22811.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19997
[https://github.com/apache/spark/pull/19997]

> pyspark.ml.tests is missing a py4j import.
> --
>
> Key: SPARK-22811
> URL: https://issues.apache.org/jira/browse/SPARK-22811
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Minor
> Fix For: 2.3.0
>
>
> This bug isn't getting caught because the relevant code only gets run if the 
> test environment does not have Hive.
> https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866






[jira] [Created] (SPARK-22812) Failing cran-check on master

2017-12-15 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-22812:
--

 Summary: Failing cran-check on master 
 Key: SPARK-22812
 URL: https://issues.apache.org/jira/browse/SPARK-22812
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Hossein Falaki


When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following 
failure message:

{code}
* checking CRAN incoming feasibility ...Error in 
.check_package_CRAN_incoming(pkgdir) :
  dims [product 22] do not match the length of object [0]
{code}

cc [~felixcheung] have you experienced this error before?






[jira] [Updated] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22811:

Priority: Minor  (was: Major)

> pyspark.ml.tests is missing a py4j import.
> --
>
> Key: SPARK-22811
> URL: https://issues.apache.org/jira/browse/SPARK-22811
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Minor
>
> This bug isn't getting caught because the relevant code only gets run if the 
> test environment does not have Hive.
> https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866






[jira] [Assigned] (SPARK-22810) PySpark supports LinearRegression with huber loss

2017-12-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-22810:
---

Assignee: Yanbo Liang

> PySpark supports LinearRegression with huber loss
> -
>
> Key: SPARK-22810
> URL: https://issues.apache.org/jira/browse/SPARK-22810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Expose Python API for LinearRegression with huber loss.






[jira] [Commented] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293369#comment-16293369
 ] 

Sergei Lebedev commented on SPARK-22805:


After some investigation, it turns out that the majority of space is taken by 
{{SparkListenerTaskEnd}} which reports block statuses. I suspect the space 
reduction after SPARK-20923 would be less impressive.

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.
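
A minimal sketch of the suggested quick win, assuming predefined levels can simply be 
matched by value; the StorageLevelAlias helper below is hypothetical, not Spark's actual 
JsonProtocol code:

{code}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: map each predefined StorageLevel to its name so the event
// log can store e.g. "DISK_ONLY" instead of the four-field JSON object.
object StorageLevelAlias {
  private val predefined: Seq[(String, StorageLevel)] = Seq(
    "NONE" -> StorageLevel.NONE,
    "DISK_ONLY" -> StorageLevel.DISK_ONLY,
    "DISK_ONLY_2" -> StorageLevel.DISK_ONLY_2,
    "MEMORY_ONLY" -> StorageLevel.MEMORY_ONLY,
    "MEMORY_ONLY_2" -> StorageLevel.MEMORY_ONLY_2,
    "MEMORY_ONLY_SER" -> StorageLevel.MEMORY_ONLY_SER,
    "MEMORY_ONLY_SER_2" -> StorageLevel.MEMORY_ONLY_SER_2,
    "MEMORY_AND_DISK" -> StorageLevel.MEMORY_AND_DISK,
    "MEMORY_AND_DISK_2" -> StorageLevel.MEMORY_AND_DISK_2,
    "MEMORY_AND_DISK_SER" -> StorageLevel.MEMORY_AND_DISK_SER,
    "MEMORY_AND_DISK_SER_2" -> StorageLevel.MEMORY_AND_DISK_SER_2,
    "OFF_HEAP" -> StorageLevel.OFF_HEAP
  )

  // Returns the alias for a predefined level, or None for custom levels
  // (e.g. ones built via StorageLevel.apply), which would keep the verbose form.
  def alias(level: StorageLevel): Option[String] =
    predefined.collectFirst { case (name, l) if l == level => name }
}
{code}

On the read side, the parser would accept either an alias string or the old verbose object, 
which is what would keep the change backward compatible.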






[jira] [Assigned] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22811:


Assignee: Apache Spark

> pyspark.ml.tests is missing a py4j import.
> --
>
> Key: SPARK-22811
> URL: https://issues.apache.org/jira/browse/SPARK-22811
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>
> This bug isn't getting caught because the relevant code only gets run if the 
> test environment does not have Hive.
> https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866






[jira] [Commented] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293364#comment-16293364
 ] 

Apache Spark commented on SPARK-22811:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/19997

> pyspark.ml.tests is missing a py4j import.
> --
>
> Key: SPARK-22811
> URL: https://issues.apache.org/jira/browse/SPARK-22811
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>
> This bug isn't getting caught because the relevant code only gets run if the 
> test environment does not have Hive.
> https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866






[jira] [Assigned] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22811:


Assignee: (was: Apache Spark)

> pyspark.ml.tests is missing a py4j import.
> --
>
> Key: SPARK-22811
> URL: https://issues.apache.org/jira/browse/SPARK-22811
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>
> This bug isn't getting caught because the relevant code only gets run if the 
> test environment does not have Hive.
> https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866






[jira] [Created] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22811:
---

 Summary: pyspark.ml.tests is missing a py4j import.
 Key: SPARK-22811
 URL: https://issues.apache.org/jira/browse/SPARK-22811
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bago Amirbekian


This bug isn't getting caught because the relevant code only gets run if the 
test environment does not have Hive.

https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866






[jira] [Commented] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore

2017-12-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293307#comment-16293307
 ] 

Marcelo Vanzin commented on SPARK-19804:


Spark still doesn't have explicit support for Hive 2.2.

> HiveClientImpl does not work with Hive 2.2.0 metastore
> --
>
> Key: SPARK-19804
> URL: https://issues.apache.org/jira/browse/SPARK-19804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps 
> because it hasn't been released yet); but we have some 2.2 patches in CDH and 
> the current code in the isolated client fails. The most probable culprits are 
> the changes added in HIVE-13149.
> The fix is simple, and here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make 
> it easier to support 2.2 when it's out.






[jira] [Commented] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore

2017-12-15 Thread Zhongting Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293302#comment-16293302
 ] 

Zhongting Hu commented on SPARK-19804:
--

[~smilegator], I've made some comments on the above PR. For this issue, does it 
mean that before this fix (the Spark 2.2.0 release), no version of Spark could 
talk to a Hive 2.2.0 metastore?



> HiveClientImpl does not work with Hive 2.2.0 metastore
> --
>
> Key: SPARK-19804
> URL: https://issues.apache.org/jira/browse/SPARK-19804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps 
> because it hasn't been released yet); but we have some 2.2 patches in CDH and 
> the current code in the isolated client fails. The most probable culprits are 
> the changes added in HIVE-13149.
> The fix is simple, and here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make 
> it easier to support 2.2 when it's out.






[jira] [Comment Edited] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292758#comment-16292758
 ] 

Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 9:44 PM:
--

I have a patch which preserves backward compatibility. Will post some numbers a 
bit later.

Also, note that while the format is flexible "in theory", in practice it always 
contains one of the predefined levels.

*Update*: it turns out there's {{StorageLevel.apply}}, so "always" above should be 
read as "almost always".


was (Author: lebedev):
I have a patch which preserves backward compatibility. Will post some numbers a 
bit later.

Also, note that while the format is flexible "in theory", in practice it always 
contains one of the predefined levels.

**Update**: it turns out there's {{StorageLevel.apply}}, so "always" above should 
be read as "almost always".

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.






[jira] [Comment Edited] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292758#comment-16292758
 ] 

Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 9:43 PM:
--

I have a patch which preserves backward compatibility. Will post some numbers a 
bit later.

Also, note that while the format is flexible "in theory", in practice it always 
contains one of the predefined levels.

**Update**: it turns out there's {{StorageLevel.apply}}, so "always" above should 
be read as "almost always".


was (Author: lebedev):
I have a patch which preserves backward compatibility. Will post some numbers a 
bit later.

Also, note that while the format is flexible "in theory", in practice it always 
contains one of the predefined levels.

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.






[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Cricket Temple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293209#comment-16293209
 ] 

Cricket Temple commented on SPARK-22809:


Outputs: When I run it, it plots a picture and prints "See?"

This is certainly "unexpected behavior" for me.
{noformat}
> import a.b
> import a.b as a_b
> a_b2 = a.b
> your_function(a_b)
Yay!
>your_function(a.b)
Boo!
>your_function(a_b2)
Yay!
>your_function(a.b)
Yay!
{noformat}

The problem is that when people port code to pyspark they're going to have 
errors until they go through and update imports to avoid this.  If it's 
possible to trigger this from a library (which I don't know if it is), that 
might be hard to work around.


> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Cricket Temple
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}






[jira] [Assigned] (SPARK-22807) Change configuration options to use "container" instead of "docker"

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22807:


Assignee: Apache Spark

> Change configuration options to use "container" instead of "docker"
> ---
>
> Key: SPARK-22807
> URL: https://issues.apache.org/jira/browse/SPARK-22807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Apache Spark
>
> Based on the discussion in 
> https://github.com/apache/spark/pull/19946#discussion_r157063535, we want to 
> rename "docker" in kubernetes mode configuration to "container".






[jira] [Commented] (SPARK-22807) Change configuration options to use "container" instead of "docker"

2017-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293201#comment-16293201
 ] 

Apache Spark commented on SPARK-22807:
--

User 'foxish' has created a pull request for this issue:
https://github.com/apache/spark/pull/19995

> Change configuration options to use "container" instead of "docker"
> ---
>
> Key: SPARK-22807
> URL: https://issues.apache.org/jira/browse/SPARK-22807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Based on the discussion in 
> https://github.com/apache/spark/pull/19946#discussion_r157063535, we want to 
> rename "docker" in kubernetes mode configuration to "container".






[jira] [Assigned] (SPARK-22807) Change configuration options to use "container" instead of "docker"

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22807:


Assignee: (was: Apache Spark)

> Change configuration options to use "container" instead of "docker"
> ---
>
> Key: SPARK-22807
> URL: https://issues.apache.org/jira/browse/SPARK-22807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Based on the discussion in 
> https://github.com/apache/spark/pull/19946#discussion_r157063535, we want to 
> rename "docker" in kubernetes mode configuration to "container".






[jira] [Assigned] (SPARK-22810) PySpark supports LinearRegression with huber loss

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22810:


Assignee: (was: Apache Spark)

> PySpark supports LinearRegression with huber loss
> -
>
> Key: SPARK-22810
> URL: https://issues.apache.org/jira/browse/SPARK-22810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Expose Python API for LinearRegression with huber loss.






[jira] [Assigned] (SPARK-22810) PySpark supports LinearRegression with huber loss

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22810:


Assignee: Apache Spark

> PySpark supports LinearRegression with huber loss
> -
>
> Key: SPARK-22810
> URL: https://issues.apache.org/jira/browse/SPARK-22810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Expose Python API for LinearRegression with huber loss.






[jira] [Commented] (SPARK-22810) PySpark supports LinearRegression with huber loss

2017-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293133#comment-16293133
 ] 

Apache Spark commented on SPARK-22810:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/19994

> PySpark supports LinearRegression with huber loss
> -
>
> Key: SPARK-22810
> URL: https://issues.apache.org/jira/browse/SPARK-22810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Expose Python API for LinearRegression with huber loss.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22810) PySpark supports LinearRegression with huber loss

2017-12-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-22810:
---

 Summary: PySpark supports LinearRegression with huber loss
 Key: SPARK-22810
 URL: https://issues.apache.org/jira/browse/SPARK-22810
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Yanbo Liang


Expose Python API for LinearRegression with huber loss.
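
For illustration only, a minimal sketch of what the exposed Python API might look like, assuming it mirrors the Scala estimator's {{loss}} and {{epsilon}} params; the final parameter names are whatever the linked PR settles on.

{code:python}
# Sketch only: assumes the PySpark API mirrors the Scala estimator's
# `loss` and `epsilon` params once this ticket lands; not the final API.
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0)), (5.0, Vectors.dense(1.0)), (100.0, Vectors.dense(2.0))],
    ["label", "features"])  # the last row acts as an outlier

# Huber loss is less sensitive to the outlier than plain squared error.
lr = LinearRegression(loss="huber", epsilon=1.35, regParam=0.1)
model = lr.fit(df)
print(model.coefficients, model.intercept)
{code}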



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Cricket Temple (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cricket Temple updated SPARK-22809:
---
Description: 
User code can fail with dotted imports.  Here's a repro script.

{noformat}
import numpy as np
import pandas as pd
import pyspark
import scipy.interpolate
import scipy.interpolate as scipy_interpolate
import py4j

scipy_interpolate2 = scipy.interpolate

sc = pyspark.SparkContext()
spark_session = pyspark.SQLContext(sc)

###
# The details of this dataset are irrelevant  #
# Sorry if you'd have preferred something more boring #
###
x__ = np.linspace(0,10,1000)
freq__ = np.arange(1,5)
x_, freq_ = np.ix_(x__, freq__)
y = np.sin(x_ * freq_).ravel()
x = (x_ * np.ones(freq_.shape)).ravel()
freq = (np.ones(x_.shape) * freq_).ravel()
df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
df_sk = spark_session.createDataFrame(df_pd)
assert(df_sk.toPandas() == df_pd).all().all()

try:
import matplotlib.pyplot as plt
for f, data in df_pd.groupby("freq"):
plt.plot(*data[['x','y']].values.T)
plt.show()
except:
print("I guess we can't plot anything")

def mymap(x, interp_fn):
df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
return interp_fn(df.x.values, df.y.values)(np.pi)

df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()

result = df_by_freq.mapValues(lambda x: mymap(x, 
scipy_interpolate.interp1d)).collect()
assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), atol=1e-6))

try:
result = df_by_freq.mapValues(lambda x: mymap(x, 
scipy.interpolate.interp1d)).collect()
raise Exception("Not going to reach this line")
except py4j.protocol.Py4JJavaError, e:
print("See?")

result = df_by_freq.mapValues(lambda x: mymap(x, 
scipy_interpolate2.interp1d)).collect()
assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), atol=1e-6))

# But now it works!
result = df_by_freq.mapValues(lambda x: mymap(x, 
scipy.interpolate.interp1d)).collect()
assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), atol=1e-6))
{noformat}

  was:
User code can fail with dotted imports.  Here's a repro script.

{noformat}
import numpy as np
import pandas as pd
import pyspark
import scipy.interpolate
import scipy.interpolate as scipy_interpolate
import py4j

sc = pyspark.SparkContext()
spark_session = pyspark.SQLContext(sc)

###
# The details of this dataset are irrelevant  #
# Sorry if you'd have preferred something more boring #
###
x__ = np.linspace(0,10,1000)
freq__ = np.arange(1,5)
x_, freq_ = np.ix_(x__, freq__)
y = np.sin(x_ * freq_).ravel()
x = (x_ * np.ones(freq_.shape)).ravel()
freq = (np.ones(x_.shape) * freq_).ravel()
df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
df_sk = spark_session.createDataFrame(df_pd)
assert(df_sk.toPandas() == df_pd).all().all()

try:
import matplotlib.pyplot as plt
for f, data in df_pd.groupby("freq"):
plt.plot(*data[['x','y']].values.T)
plt.show()
except:
print("I guess we can't plot anything")

def mymap(x, interp_fn):
df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
return interp_fn(df.x.values, df.y.values)(np.pi)

df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()

result = df_by_freq.mapValues(lambda x: mymap(x, 
scipy_interpolate.interp1d)).collect()
assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), atol=1e-6))
try:
result = df_by_freq.mapValues(lambda x: mymap(x, 
scipy.interpolate.interp1d)).collect()
assert(False, "Not going to reach this line")
except py4j.protocol.Py4JJavaError, e:
print("See?")
{noformat}
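
A commonly suggested workaround for closures that reference dotted module attributes (not verified against this exact repro) is to import the submodule inside the function that runs on the executors, so the pickled closure does not depend on driver-side import state. A minimal sketch:

{code:python}
# Sketch of a workaround, not a fix for the underlying issue: import the
# submodule inside the mapped function, so `scipy.interpolate` is resolved
# on the worker rather than as an attribute of the pickled `scipy` reference.
import numpy as np
import pandas as pd

def mymap_local_import(rows):
    import scipy.interpolate  # resolved per task, on the worker
    df = pd.DataFrame.from_records([row.asDict() for row in list(rows)])
    return scipy.interpolate.interp1d(df.x.values, df.y.values)(np.pi)

# Usage (with df_sk from the repro above):
# result = df_sk.rdd.keyBy(lambda r: r.freq).groupByKey() \
#                   .mapValues(mymap_local_import).collect()
{code}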


> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Cricket Temple
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * 

[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293103#comment-16293103
 ] 

Sean Owen commented on SPARK-22809:
---

It's not clear what the problem is. What does this output?
Spark isn't managing imports, so I am not yet clear how this is a Spark-related
issue.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Cricket Temple
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(False, "Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-15 Thread Cricket Temple (JIRA)
Cricket Temple created SPARK-22809:
--

 Summary: pyspark is sensitive to imports with dots
 Key: SPARK-22809
 URL: https://issues.apache.org/jira/browse/SPARK-22809
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Cricket Temple


User code can fail with dotted imports.  Here's a repro script.

{noformat}
import numpy as np
import pandas as pd
import pyspark
import scipy.interpolate
import scipy.interpolate as scipy_interpolate
import py4j

sc = pyspark.SparkContext()
spark_session = pyspark.SQLContext(sc)

###
# The details of this dataset are irrelevant  #
# Sorry if you'd have preferred something more boring #
###
x__ = np.linspace(0,10,1000)
freq__ = np.arange(1,5)
x_, freq_ = np.ix_(x__, freq__)
y = np.sin(x_ * freq_).ravel()
x = (x_ * np.ones(freq_.shape)).ravel()
freq = (np.ones(x_.shape) * freq_).ravel()
df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
df_sk = spark_session.createDataFrame(df_pd)
assert(df_sk.toPandas() == df_pd).all().all()

try:
import matplotlib.pyplot as plt
for f, data in df_pd.groupby("freq"):
plt.plot(*data[['x','y']].values.T)
plt.show()
except:
print("I guess we can't plot anything")

def mymap(x, interp_fn):
df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
return interp_fn(df.x.values, df.y.values)(np.pi)

df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()

result = df_by_freq.mapValues(lambda x: mymap(x, 
scipy_interpolate.interp1d)).collect()
assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), atol=1e-6))
try:
result = df_by_freq.mapValues(lambda x: mymap(x, 
scipy.interpolate.interp1d)).collect()
assert(False, "Not going to reach this line")
except py4j.protocol.Py4JJavaError, e:
print("See?")
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22808) saveAsTable() should be marked as deprecated

2017-12-15 Thread Jason Vaccaro (JIRA)
Jason Vaccaro created SPARK-22808:
-

 Summary: saveAsTable() should be marked as deprecated
 Key: SPARK-22808
 URL: https://issues.apache.org/jira/browse/SPARK-22808
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.1.1
Reporter: Jason Vaccaro


As discussed in SPARK-16803, saveAsTable is not supported as a method for
writing to Hive and insertInto should be used instead. However, in the Java API
documentation for version 2.1.1, the saveAsTable method is not marked as
deprecated, and the programming guides indicate that saveAsTable is the proper
way to write to Hive.
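
For illustration, a sketch of the two write paths being contrasted (shown in PySpark; the Scala/Java DataFrameWriter has the same methods, and the table names here are hypothetical):

{code:python}
# Illustrative sketch of the two write paths discussed above; table names
# are hypothetical and the writes are commented out so the snippet runs
# without a real Hive metastore.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Writes into an existing Hive table, matching columns by position.
# df.write.insertInto("db.existing_table")

# Creates (or appends to) a table using Spark's own table metadata;
# per SPARK-16803 this is the method whose Hive semantics are in question.
# df.write.mode("append").saveAsTable("db.some_table")
{code}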



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293012#comment-16293012
 ] 

Sean Owen commented on SPARK-22683:
---

I don't think the pros/cons have changed here. I don't doubt there are use 
cases this would help, just not compelling enough for yet another knob to tune. 
I do not support this but would not go so far as to veto it.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs), ranging from tens to thousands of splits in the data 
> partitioning and between 400 and 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependant on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
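
To make the proposal concrete, a back-of-the-envelope sketch of how a tasksPerExecutorSlot-style knob would change the executor target; the function name and exact formula are illustrative, not the PR's code.

{code:python}
# Illustrative arithmetic only; names and the exact formula in the PR may differ.
import math

def target_executors(num_pending_tasks,
                     executor_cores=5,
                     task_cpus=1,
                     tasks_per_slot=1,
                     max_executors=None):
    """Executors requested for a stage with `num_pending_tasks` tasks."""
    slots_per_executor = executor_cores // task_cpus
    # Current behaviour is effectively tasks_per_slot = 1: one slot per task.
    wanted = math.ceil(num_pending_tasks / float(slots_per_executor * tasks_per_slot))
    return min(wanted, max_executors) if max_executors else wanted

# 10000 small tasks, 5 cores per executor:
print(target_executors(10000))                    # 2000 executors (today)
print(target_executors(10000, tasks_per_slot=6))  # 334 executors (proposal @ 6)
{code}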



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-22807) Change configuration options to use "container" instead of "docker"

2017-12-15 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22807:
---
Summary: Change configuration options to use "container" instead of 
"docker"  (was: Change commandline options to use "container" instead of 
"docker")

> Change configuration options to use "container" instead of "docker"
> ---
>
> Key: SPARK-22807
> URL: https://issues.apache.org/jira/browse/SPARK-22807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> Based on the discussion in 
> https://github.com/apache/spark/pull/19946#discussion_r157063535, we want to 
> rename "docker" in kubernetes mode configuration to "container".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22807) Change commandline options to use "container" instead of "docker"

2017-12-15 Thread Anirudh Ramanathan (JIRA)
Anirudh Ramanathan created SPARK-22807:
--

 Summary: Change commandline options to use "container" instead of 
"docker"
 Key: SPARK-22807
 URL: https://issues.apache.org/jira/browse/SPARK-22807
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Anirudh Ramanathan


Based on the discussion in 
https://github.com/apache/spark/pull/19946#discussion_r157063535, we want to 
rename "docker" in kubernetes mode configuration to "container".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-15 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292995#comment-16292995
 ] 

Attila Zsolt Piros edited comment on SPARK-22806 at 12/15/17 6:38 PM:
--

Test attached to reproduce the error


was (Author: attilapiros):
Test to reproduce the error

> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
> Attachments: WindowFunctionsWithGroupByError.scala
>
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered ("Window.partitionBy(column).orderBy(column)") and 
> when it is not ordered ("Window.partitionBy(column)").
> Example:
> {code:java}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition")),
> sum("value").over(Window.partitionBy("partition")),
> stddev_pop("value").over(Window.partitionBy("partition"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
>   test("count, sum, stddev_pop functions over ordered by window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition").orderBy("key")),
> sum("value").over(Window.partitionBy("partition").orderBy("key")),
> 
> stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
> {code}
> The "count, sum, stddev_pop functions over ordered by window" fails with the 
> error:
> {noformat}
> == Results ==
> !== Correct Answer - 2 ==   == Spark Answer - 2 ==
> !struct<>   struct<count(value) OVER (PARTITION BY partition ORDER BY key ASC NULLS FIRST unspecifiedframe$()):bigint,sum(value) 
> OVER (PARTITION BY partition ORDER BY key ASC NULLS FIRST 
> unspecifiedframe$()):double,stddev_pop(value) OVER (PARTITION BY partition 
> ORDER BY key ASC NULLS FIRST unspecifiedframe$()):double>
> ![a,2,300.0,50.0]   [a,1,100.0,0.0]
>  [b,2,300.0,50.0]   [b,2,300.0,50.0]
> {noformat}
>  
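
For context, the difference is consistent with the SQL default window frame: once an ORDER BY is added, the frame defaults to the range from unbounded preceding to the current row, so running aggregates are produced instead of whole-partition ones. Whether that makes the reported result "expected" is for this ticket to decide; purely as a sketch (in PySpark, mirroring the Scala test), an explicit whole-partition frame on an ordered window would look like:

{code:python}
# Sketch only (PySpark; the Scala Window API is analogous): pin the frame to
# the whole partition instead of relying on the default frame that ORDER BY
# implies (unbounded preceding .. current row).
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1, 100.0), ("b", 1, 200.0)],
                           ["key", "partition", "value"])

w = (Window.partitionBy("partition")
           .orderBy("key")
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.select("key",
          F.count("value").over(w),
          F.sum("value").over(w),
          F.stddev_pop("value").over(w)).show()
# With the explicit frame both rows show count=2, sum=300.0, stddev_pop=50.0.
{code}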



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-15 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-22806:
---
Attachment: WindowFunctionsWithGroupByError.scala

Test to reproduce the error

> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
> Attachments: WindowFunctionsWithGroupByError.scala
>
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered ("Window.partitionBy(column).orderBy(column)") and 
> when it is not ordered ("Window.partitionBy(column)").
> Example:
> {code:java}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition")),
> sum("value").over(Window.partitionBy("partition")),
> stddev_pop("value").over(Window.partitionBy("partition"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
>   test("count, sum, stddev_pop functions over ordered by window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition").orderBy("key")),
> sum("value").over(Window.partitionBy("partition").orderBy("key")),
> 
> stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
> {code}
> The "count, sum, stddev_pop functions over ordered by window" fails with the 
> error:
> {noformat}
> == Results ==
> !== Correct Answer - 2 ==   == Spark Answer - 2 ==
> !struct<>   struct<count(value) OVER (PARTITION BY partition ORDER BY key ASC NULLS FIRST unspecifiedframe$()):bigint,sum(value) 
> OVER (PARTITION BY partition ORDER BY key ASC NULLS FIRST 
> unspecifiedframe$()):double,stddev_pop(value) OVER (PARTITION BY partition 
> ORDER BY key ASC NULLS FIRST unspecifiedframe$()):double>
> ![a,2,300.0,50.0]   [a,1,100.0,0.0]
>  [b,2,300.0,50.0]   [b,2,300.0,50.0]
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-15 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-22806:
---
Description: 
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered ("Window.partitionBy(column).orderBy(column)") and when 
it is not ordered ("Window.partitionBy(column)").

Example:

{code:java}
test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }
{code}

The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:

{noformat}
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]
{noformat}





 

  was:
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered "Window.partitionBy(column).orderBy(column))" and when 
it is not ordered 'Window.partitionBy(column)".

Example:

{code:java}
test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }
{code}


{noformat}
The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]
{noformat}





 


> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered "Window.partitionBy(column).orderBy(column))" and 
> when it is not ordered 'Window.partitionBy(column)".
> Example:
> {code:java}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", 

[jira] [Updated] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-15 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-22806:
---
Description: 
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered ("Window.partitionBy(column).orderBy(column)") and when 
it is not ordered ("Window.partitionBy(column)").

Example:

{code:java}
test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }
{code}


{noformat}
The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]
{noformat}





 

  was:
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered "Window.partitionBy(column).orderBy(column))" and when 
it is not ordered 'Window.partitionBy(column)".

Example:


{code:scala}

test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }
{code}


The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]



 


> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered "Window.partitionBy(column).orderBy(column))" and 
> when it is not ordered 'Window.partitionBy(column)".
> Example:
> {code:java}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> 

[jira] [Updated] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-15 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-22806:
---
Description: 
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered ("Window.partitionBy(column).orderBy(column)") and when 
it is not ordered ("Window.partitionBy(column)").

Example:


{code:scala}

test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }
{code}


The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]



 

  was:
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered "Window.partitionBy(column).orderBy(column))" and when 
it is not ordered 'Window.partitionBy(column)".

Example:

test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]



 


> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered "Window.partitionBy(column).orderBy(column))" and 
> when it is not ordered 'Window.partitionBy(column)".
> Example:
> {code:scala}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> 

[jira] [Updated] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-15 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-22806:
---
Description: 
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered ("Window.partitionBy(column).orderBy(column)") and when 
it is not ordered ("Window.partitionBy(column)").

Example:

test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]



 

  was:
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered "Window.partitionBy(column).orderBy(column))" and when 
it is not ordered 'Window.partitionBy("column")".

Example:

test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]



 


> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered "Window.partitionBy(column).orderBy(column))" and 
> when it is not ordered 'Window.partitionBy(column)".
> Example:
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> 

[jira] [Created] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-15 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-22806:
--

 Summary: Window Aggregate functions: unexpected result at ordered 
partition
 Key: SPARK-22806
 URL: https://issues.apache.org/jira/browse/SPARK-22806
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Attila Zsolt Piros


I got different results for aggregate functions (even for sum and count) 
when the partition is ordered ("Window.partitionBy(column).orderBy(column)") and 
when it is not ordered ('Window.partitionBy("column")').

Example:

test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]



 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-15 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-22806:
---
Description: 
I got different results for aggregate functions (even for sum and count) when 
the partition is ordered ("Window.partitionBy(column).orderBy(column)") and when 
it is not ordered ('Window.partitionBy("column")').

Example:

test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]



 

  was:
I got different results for the aggregate function (even for sum and count) 
when the partition is ordered "Window.partitionBy(column).orderBy(column))" and 
when it is not ordered 'Window.partitionBy("column")".

Example:

test("count, sum, stddev_pop functions over window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition")),
sum("value").over(Window.partitionBy("partition")),
stddev_pop("value").over(Window.partitionBy("partition"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

  test("count, sum, stddev_pop functions over ordered by window") {
val df = Seq(
  ("a", 1, 100.0),
  ("b", 1, 200.0)).toDF("key", "partition", "value")
df.createOrReplaceTempView("window_table")
checkAnswer(
  df.select(
$"key",
count("value").over(Window.partitionBy("partition").orderBy("key")),
sum("value").over(Window.partitionBy("partition").orderBy("key")),
stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
  ),
  Seq(
Row("a", 2, 300.0, 50.0),
Row("b", 2, 300.0, 50.0)))
  }

The "count, sum, stddev_pop functions over ordered by window" fails with the 
error:
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>   struct
![a,2,300.0,50.0]   [a,1,100.0,0.0]
 [b,2,300.0,50.0]   [b,2,300.0,50.0]



 


> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered "Window.partitionBy(column).orderBy(column))" and 
> when it is not ordered 'Window.partitionBy("column")".
> Example:
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
>   

[jira] [Assigned] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22799:


Assignee: Apache Spark

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049
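
For illustration, a sketch of the kind of guard being requested; the pairs of exclusive params and the error message are assumptions, and the real implementation lives in the Scala ML params code, not in user-level Python like this.

{code:python}
# Illustrative only: the shape of the guard being requested, not Spark's
# actual implementation. The exclusive param pairs below are assumed.
def check_exclusive_params(params):
    """Raise if both the single- and multi-column variants are set."""
    exclusive = [("inputCol", "inputCols"),
                 ("outputCol", "outputCols"),
                 ("splits", "splitsArray")]
    for single, multi in exclusive:
        if params.get(single) is not None and params.get(multi) is not None:
            raise ValueError(
                "Bucketizer: only one of '%s' and '%s' may be set" % (single, multi))

check_exclusive_params({"inputCol": "x", "splits": [0.0, 1.0]})       # ok
# check_exclusive_params({"inputCol": "x", "inputCols": ["x", "y"]})  # raises
{code}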



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22799:


Assignee: (was: Apache Spark)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292972#comment-16292972
 ] 

Apache Spark commented on SPARK-22799:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19993

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22800) Add a SSB query suite

2017-12-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22800.
-
Resolution: Fixed
  Assignee: Takeshi Yamamuro

> Add a SSB query suite
> -
>
> Key: SPARK-22800
> URL: https://issues.apache.org/jira/browse/SPARK-22800
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>
> Add a test suite to ensure all the SSB (Star Schema Benchmark) queries can be 
> successfully analyzed, optimized and compiled without hitting the max 
> iteration threshold.
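
The actual suite is Scala test code; purely for illustration, the invariant it checks amounts to something like the sketch below (query file paths are hypothetical, and the SSB tables are assumed to be registered beforehand).

{code:python}
# Illustration of the invariant the suite checks: every SSB query must get
# through analysis, optimization and physical planning without error.
# Paths are hypothetical; assumes the SSB tables are registered as views.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ssb_queries = ["q1.1", "q1.2", "q1.3"]  # ... through q4.3 in the full benchmark
for name in ssb_queries:
    with open("ssb/%s.sql" % name) as f:
        sql_text = f.read()
    # explain() forces the analyzed, optimized and physical plans to be built,
    # which is where a max-iteration blowup in the optimizer would surface.
    spark.sql(sql_text).explain(extended=True)
{code}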



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22800) Add a SSB query suite

2017-12-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22800:

Fix Version/s: 2.3.0

> Add a SSB query suite
> -
>
> Key: SPARK-22800
> URL: https://issues.apache.org/jira/browse/SPARK-22800
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> Add a test suite to ensure all the SSB(Star Schema Benchmark) queries can be 
> successfully analyzed, optimized and compiled without hitting the max 
> iteration threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22362) Add unit test for Window Aggregate Functions

2017-12-15 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292902#comment-16292902
 ] 

Attila Zsolt Piros commented on SPARK-22362:


I am working on this subtask.


> Add unit test for Window Aggregate Functions
> 
>
> Key: SPARK-22362
> URL: https://issues.apache.org/jira/browse/SPARK-22362
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> * Declarative
> * Imperative
> * UDAF



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22805:


Assignee: (was: Apache Spark)

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.
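
A rough size comparison, echoing the example above; the field names are copied from the quoted JSON, and the real change would sit in Spark's event-log (de)serialization code rather than in user-level Python.

{code:python}
# Rough size comparison only, using the JSON shape quoted above.
import json

verbose = json.dumps({"Use Disk": True, "Use Memory": False,
                      "Deserialized": True, "Replication": 1})
alias = "DISK_ONLY"

print(len(verbose), len(alias))   # roughly 80 bytes vs 9 bytes per partition

# A reader could map the alias back via a lookup table:
ALIASES = {"DISK_ONLY": {"Use Disk": True, "Use Memory": False,
                         "Deserialized": True, "Replication": 1}}
print(ALIASES[alias] == json.loads(verbose))   # True
{code}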



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292876#comment-16292876
 ] 

Apache Spark commented on SPARK-22805:
--

User 'superbobry' has created a pull request for this issue:
https://github.com/apache/spark/pull/19992

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22805:


Assignee: Apache Spark

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Assignee: Apache Spark
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292855#comment-16292855
 ] 

Thomas Graves commented on SPARK-22683:
---

Thanks for the clarification; a few of those I misread, thinking cores per 
executor rather than tasks per slot. This proposal makes more sense to me now.

| I don't think asking upfront vs exponential has any effect over how Yarn 
yields containers.

This is not necessarily true. You have time between heartbeats, and it depends 
on how much capacity the cluster has and how often the NodeManagers heartbeat. 
YARN schedules on NodeManager heartbeats, and it schedules based on your asks: 
you ask once, then you have to check again to see whether it allocated anything. 
If I ask for everything up front and YARN can give it to me before I heartbeat 
again, I can launch all the executors up front. If I do exponential ramp-up, I 
ask for 1, wait 5+ seconds, check and ask for more, wait, and so on. Spark ramps 
up its asks quickly, but you have still wasted some time, and if your tasks are 
really small and your launch time is quick, that can make a difference. You can 
also see more of the skew you referenced, where a bunch of tasks end up running 
on a few executors while waiting for the ramp-up. You can mitigate this by 
asking more often, and it is also affected by overall cluster utilization. In my 
experience I haven't seen the exponential ramp-up help much, so I don't see a 
reason not to just ask for everything up front. But that is the use case I see; 
at the time it was written I believe they did see it help some of their jobs. 
This goes back to the point above: it works well for your workload, but the 
question is how it applies in general to others.
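
For reference, the two request strategies above map roughly onto existing 
dynamic-allocation settings as in the sketch below; the values are arbitrary 
examples, not recommendations, and the SparkConf snippet is illustrative only.

{code:scala}
import org.apache.spark.SparkConf

// "All up front": start at the cap, so the full ask goes to YARN immediately.
val allUpFront = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "500")
  .set("spark.dynamicAllocation.maxExecutors", "500")

// Exponential ramp-up: start small and let the backlog timeouts grow the ask.
val rampUp = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "1")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
  .set("spark.dynamicAllocation.maxExecutors", "500")
{code}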

Thinking about this more, it could really be orthogonal to the allocation 
policy (SPARK-16158), since it is more about dynamic configuration of the 
maximum number of executors. It could apply to different allocation policies 
(the existing exponential one, or the all-up-front one I mentioned). There is a 
JIRA out there to add the ability to limit the number of concurrent tasks for a 
stage/job, which is in line with this, but its intention is different (limiting 
DoS of other services like HBase by Spark), and it would be harder to configure 
that to solve this problem.

[~srowen] I know you had some reservations; were all your concerns addressed? I 
can see how the existing configs don't cover this sufficiently, or at least not 
ideally. You can set a max, but that applies to the entire job, while different 
stages can have greatly different numbers of tasks. I can also see 
schedulerBacklogTimeout not being optimal for this, as it could adversely affect 
run time, and the experiments here seem to confirm that.

The thing I'm thinking about is exactly what the config should be so that it 
makes sense to the user. tasksPerSlot could be misinterpreted as limiting the 
number of tasks run on each slot, and we don't really use "slot" anywhere else. 
I am thinking more along the lines of 
spark.dynamicAllocation.dynamicMaxExecutorsPerStagePercent=[max # of executors 
based on a percentage of the # of tasks required for that stage]. I need to 
think about this some more, though.

I think we would also need to define its interaction with 
spark.dynamicAllocation.maxExecutors, as well as how it behaves as the number of 
running/to-be-run tasks changes. For instance, I assume the number of executors 
here doesn't change until the number of running/to-be-run tasks drops below it, 
so this really just applies to the initial maximum number of executors.
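
A rough sketch of how such a per-stage cap could be computed, purely for 
illustration: the helper name stageMaxExecutors, its parameters, and the 
tasksPerSlot-style ratio are all hypothetical, not existing Spark configuration 
or API.

{code:scala}
// Hypothetical helper, not existing Spark code: derive a per-stage executor cap
// from the number of tasks the stage has to run.
def stageMaxExecutors(tasksToRun: Int,
                      coresPerExecutor: Int,
                      tasksPerSlot: Int,
                      hardMaxExecutors: Int): Int = {
  // One slot is one core; allow `tasksPerSlot` waves of tasks per slot before
  // asking for another executor, then clamp to spark.dynamicAllocation.maxExecutors.
  val wanted = math.ceil(tasksToRun.toDouble / (coresPerExecutor * tasksPerSlot)).toInt
  math.min(math.max(wanted, 1), hardMaxExecutors)
}

// Example: 10000 tasks, 4-core executors, tasksPerSlot = 2, maxExecutors = 2000
// => ceil(10000 / 8) = 1250 executors requested for that stage.
stageMaxExecutors(10000, 4, 2, 2000)
{code}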








> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - 

[jira] [Comment Edited] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292848#comment-16292848
 ] 

Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 5:25 PM:
--

Here are results for a single application with 6K partitions. Admittedly, this 
is not generalizable to every application, but it gives an idea of the 
redundancy due to {{StorageLevel}}:

|| Mode || Size||
|LZ4-compressed|8.1G|
|Decompressed|79G|
|LZ4-compressed with patch|7.2G|
|Decompressed with patch|49G|


was (Author: lebedev):
Here're results for a single application with 6K partitions:

|| Mode || Size||
|LZ4-compressed|8.1G|
|Decompressed|79G|
|LZ4-compressed with patch|7.2G|
|Decompressed with patch|49G|

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292848#comment-16292848
 ] 

Sergei Lebedev commented on SPARK-22805:


Here are results for a single application with 6K partitions:

|| Mode || Size||
|LZ4-compressed|8.1G|
|Decompressed|79G|
|LZ4-compressed with patch|7.2G|
|Decompressed with patch|49G|

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292758#comment-16292758
 ] 

Sergei Lebedev commented on SPARK-22805:


I have a patch which preserves backward compatibility. Will post some numbers a 
bit later.

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292758#comment-16292758
 ] 

Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 4:28 PM:
--

I have a patch which preserves backward compatibility. Will post some numbers a 
bit later.

Also, note that while the format is flexible "in theory", in practice it always 
contains one of the predefined levels.


was (Author: lebedev):
I have a patch which preserves backward compatibility. Will post some number a 
bit later.

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292742#comment-16292742
 ] 

Sean Owen commented on SPARK-22805:
---

The current format is somewhat more flexible but yes it's verbose. 
How much difference does it make in practice?
The problem with changing it is backwards compatibility.
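
One hypothetical way to keep the read path backward compatible is to accept 
both representations when replaying an event log. The sketch below is 
illustrative only, not an actual patch; storageLevelFromJson is a made-up 
helper name, and json4s is assumed because Spark's JsonProtocol already uses it.

{code:scala}
import org.apache.spark.storage.StorageLevel
import org.json4s._

// Illustrative only: accept either the proposed string alias (e.g. "DISK_ONLY")
// or the legacy JSON object form when reading a storage level back.
def storageLevelFromJson(json: JValue): StorageLevel = json match {
  case JString(name) =>
    // New compact form: the predefined level's name.
    StorageLevel.fromString(name)
  case obj: JObject =>
    // Legacy verbose form: rebuild the level from its individual fields.
    val JBool(useDisk)      = obj \ "Use Disk"
    val JBool(useMemory)    = obj \ "Use Memory"
    val JBool(deserialized) = obj \ "Deserialized"
    val JInt(replication)   = obj \ "Replication"
    StorageLevel(useDisk, useMemory, deserialized, replication.toInt)
  case other =>
    throw new IllegalArgumentException(s"Unexpected storage level JSON: $other")
}
{code}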

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses redundant representation for storage 
> levels 
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition would have the storage level field which 
> is 60-70 bytes more than it should be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-15 Thread Sergei Lebedev (JIRA)
Sergei Lebedev created SPARK-22805:
--

 Summary: Use aliases for StorageLevel in event logs
 Key: SPARK-22805
 URL: https://issues.apache.org/jira/browse/SPARK-22805
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1, 2.1.2
Reporter: Sergei Lebedev
Priority: Minor


Fact 1: {{StorageLevel}} has a private constructor, therefore a list of 
predefined levels is not extendable (by the users).

Fact 2: The format of event logs uses redundant representation for storage 
levels 

{code}
>>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
>>> "Replication": 1}')
79
>>> len('DISK_ONLY')
9
{code}

Fact 3: This leads to excessive log sizes for workloads with lots of 
partitions, because every partition would have the storage level field which is 
60-70 bytes more than it should be.

Suggested quick win: use the names of the predefined levels to identify them in 
the event log.
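
For illustration, a minimal sketch of the suggested aliasing on the write side; 
StorageLevelAlias and toEventLogJson are hypothetical names (this is not the 
actual patch), only a subset of the predefined levels is listed, and the field 
accessors are the public {{StorageLevel}} API.

{code:scala}
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch: emit the predefined level's name when there is one, and
// keep the verbose JSON object only for non-standard levels (e.g. custom replication).
object StorageLevelAlias {
  private val names: Map[StorageLevel, String] = Map(
    StorageLevel.NONE                -> "NONE",
    StorageLevel.DISK_ONLY           -> "DISK_ONLY",
    StorageLevel.MEMORY_ONLY         -> "MEMORY_ONLY",
    StorageLevel.MEMORY_ONLY_SER     -> "MEMORY_ONLY_SER",
    StorageLevel.MEMORY_AND_DISK     -> "MEMORY_AND_DISK",
    StorageLevel.MEMORY_AND_DISK_SER -> "MEMORY_AND_DISK_SER"
  )

  def toEventLogJson(level: StorageLevel): String = names.get(level) match {
    case Some(name) => s""""$name""""   // roughly 11 bytes instead of ~79
    case None =>
      s"""{"Use Disk": ${level.useDisk}, "Use Memory": ${level.useMemory}, """ +
        s""""Deserialized": ${level.deserialized}, "Replication": ${level.replication}}"""
  }
}
{code}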



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22804) Using a window function inside of an aggregation causes StackOverflowError

2017-12-15 Thread Sandor Murakozi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandor Murakozi updated SPARK-22804:

Priority: Minor  (was: Major)

> Using a window function inside of an aggregation causes StackOverflowError
> --
>
> Key: SPARK-22804
> URL: https://issues.apache.org/jira/browse/SPARK-22804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sandor Murakozi
>Priority: Minor
>
> {code}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
> df.select(min(avg('value).over(Window.partitionBy('key)))).show
> {code}
> produces 
> {code}
> java.lang.StackOverflowError
>   at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
>   at scala.Option.orElse(Option.scala:289)
> ...
>   at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
>   at 
> scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
>   at 
> scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
>   at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2021)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> {code}
>  
> {code}
> df.select(min(avg('value).over())).show
> {code}
> produces a StackOverflowError as well.
> {code}
> df.select(min(avg('value))).show
> org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate 
> function in the argument of another aggregate function. Please use the inner 
> aggregate function in a sub-query.;;
> ...
> df.select(min(avg('value)).over()).show
> +---+
> |min(avg(value)) OVER (UnspecifiedFrame)|
> +---+
> |2.0|
> +---+
> {code}
> I think this is a valid use case, so in the ideal case it should work. 
> But even if it's not supported I would expect an error message similar to the 
> non-window version. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22804) Using a window function inside of an aggregation causes StackOverflowError

2017-12-15 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292732#comment-16292732
 ] 

Sandor Murakozi commented on SPARK-22804:
-

Indeed, it looks pretty similar. I will check if it's the same.

Thanks for the hint, [~sowen]

> Using a window function inside of an aggregation causes StackOverflowError
> --
>
> Key: SPARK-22804
> URL: https://issues.apache.org/jira/browse/SPARK-22804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sandor Murakozi
>
> {code}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
> df.select(min(avg('value).over(Window.partitionBy('key)))).show
> {code}
> produces 
> {code}
> java.lang.StackOverflowError
>   at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
>   at scala.Option.orElse(Option.scala:289)
> ...
>   at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
>   at 
> scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
>   at 
> scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
>   at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2021)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> {code}
>  
> {code}
> df.select(min(avg('value).over())).show
> {code}
> produces a StackOverflowError as well.
> {code}
> df.select(min(avg('value))).show
> org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate 
> function in the argument of another aggregate function. Please use the inner 
> aggregate function in a sub-query.;;
> ...
> df.select(min(avg('value)).over()).show
> +---+
> |min(avg(value)) OVER (UnspecifiedFrame)|
> +---+
> |2.0|
> +---+
> {code}
> I think this is a valid use case, so in the ideal case it should work. 
> But even if it's not supported I would expect an error message similar to the 
> non-window version. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22804) Using a window function inside of an aggregation causes StackOverflowError

2017-12-15 Thread Sandor Murakozi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandor Murakozi updated SPARK-22804:

Description: 
{code}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2021)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

{code}
 

{code}
df.select(min(avg('value).over())).show
{code}
produces a StackOverflowError as well.

{code}
df.select(min(avg('value))).show

org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate 
function in the argument of another aggregate function. Please use the inner 
aggregate function in a sub-query.;;
...

df.select(min(avg('value)).over()).show

+---+
|min(avg(value)) OVER (UnspecifiedFrame)|
+---+
|2.0|
+---+
{code}

I think this is a valid use case, so in the ideal case it should work. 
But even if it's not supported I would expect an error message similar to the 
non-window version. 


  was:
{code}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 

[jira] [Commented] (SPARK-22804) Using a window function inside of an aggregation causes StackOverflowError

2017-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292698#comment-16292698
 ] 

Sean Owen commented on SPARK-22804:
---

Same as SPARK-21896?

> Using a window function inside of an aggregation causes StackOverflowError
> --
>
> Key: SPARK-22804
> URL: https://issues.apache.org/jira/browse/SPARK-22804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sandor Murakozi
>
> {code}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
> df.select(min(avg('value).over(Window.partitionBy('key)))).show
> {code}
> produces 
> {code}
> java.lang.StackOverflowError
>   at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
>   at scala.Option.orElse(Option.scala:289)
> ...
>   at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
>   at 
> scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
>   at 
> scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
>   at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2021)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> {code}
>  
> {code}
> df.select(min(avg('value).over())).show
> {code}
> produces a StackOverflowError as well.
> {code}
> df.select(min(avg('value))).show
> org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate 
> function in the argument of another aggregate function. Please use the inner 
> aggregate function in a sub-query.;;
> ...
> df.select(min(avg('value)).over()).show
> +---+
> |min(avg(value)) OVER (UnspecifiedFrame)|
> +---+
> |2.0|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22804) Using a window function inside of an aggregation causes StackOverflowError

2017-12-15 Thread Sandor Murakozi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandor Murakozi updated SPARK-22804:

Description: 
{code}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2021)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

{code}
 

{code}
df.select(min(avg('value).over())).show
{code}
produces a StackOverflowError as well.

{code}
df.select(min(avg('value))).show

org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate 
function in the argument of another aggregate function. Please use the inner 
aggregate function in a sub-query.;;
...

df.select(min(avg('value)).over()).show

+---+
|min(avg(value)) OVER (UnspecifiedFrame)|
+---+
|2.0|
+---+
{code}

  was:
{code}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
  at 

[jira] [Updated] (SPARK-22804) Using a window function inside of an aggregation causes StackOverflowError

2017-12-15 Thread Sandor Murakozi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandor Murakozi updated SPARK-22804:

Description: 
{code}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2021)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

{code}
 

{code}
df.select(min(avg('value).over())).show
{code}
produces a similar error.

{code}
df.select(min(avg('value))).show

org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate 
function in the argument of another aggregate function. Please use the inner 
aggregate function in a sub-query.;;
...

df.select(min(avg('value)).over()).show

+---+
|min(avg(value)) OVER (UnspecifiedFrame)|
+---+
|2.0|
+---+
{code}

  was:
{code}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
  at 

[jira] [Updated] (SPARK-22804) Using a window function inside of an aggregation causes StackOverflowError

2017-12-15 Thread Sandor Murakozi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandor Murakozi updated SPARK-22804:

Description: 
{code}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2021)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

{code}
...
{code}
df.select(min(avg('value).over())).show
{code}
produces a similar error.

{code}
df.select(min(avg('value))).show

org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate 
function in the argument of another aggregate function. Please use the inner 
aggregate function in a sub-query.;;
...

df.select(min(avg('value)).over()).show

+---+
|min(avg(value)) OVER (UnspecifiedFrame)|
+---+
|2.0|
+---+
{code}

  was:
{code:scala}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
  at 

[jira] [Created] (SPARK-22804) Using a window function inside of an aggregation causes StackOverflowError

2017-12-15 Thread Sandor Murakozi (JIRA)
Sandor Murakozi created SPARK-22804:
---

 Summary: Using a window function inside of an aggregation causes 
StackOverflowError
 Key: SPARK-22804
 URL: https://issues.apache.org/jira/browse/SPARK-22804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Sandor Murakozi


{code:scala}
import org.apache.spark.sql.expressions.Window

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df.select(min(avg('value).over(Window.partitionBy('key)))).show
{code}

produces 

{code}
java.lang.StackOverflowError
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:106)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$find$1$$anonfun$apply$1.apply(TreeNode.scala:109)
  at scala.Option.orElse(Option.scala:289)
...
  at org.apache.spark.sql.catalyst.trees.TreeNode.find(TreeNode.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$hasWindowFunction(Analyzer.scala:1853)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$65.apply(Analyzer.scala:1877)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
  at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$extract(Analyzer.scala:1877)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2060)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$27.applyOrElse(Analyzer.scala:2021)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

{code}
...
{code:scala}
df.select(min(avg('value).over())).show
{code}
produces a similar error.

{code:scala}
df.select(min(avg('value))).show

org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate 
function in the argument of another aggregate function. Please use the inner 
aggregate function in a sub-query.;;
...

df.select(min(avg('value)).over()).show

+---+
|min(avg(value)) OVER (UnspecifiedFrame)|
+---+
|2.0|
+---+
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22794) Spark Job failed, but the state is succeeded in Yarn Web

2017-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292675#comment-16292675
 ] 

Sean Owen commented on SPARK-22794:
---

This looks like a duplicate of lots of things like SPARK-22708, SPARK-15283 
maybe, etc.

> Spark Job failed, but the state is succeeded in Yarn Web
> 
>
> Key: SPARK-22794
> URL: https://issues.apache.org/jira/browse/SPARK-22794
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
> Attachments: task_is_succeeded_in_yarn_web.png
>
>
> I ran a job in YARN mode, and the job failed:
> {noformat}
> 17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does 
> not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {noformat}
> but in the YARN web UI, the job state shows as SUCCEEDED.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG

2017-12-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292608#comment-16292608
 ] 

Thomas Graves commented on SPARK-22465:
---

Yes I think that makes sense.  

> Cogroup of two disproportionate RDDs could lead into 2G limit BUG
> -
>
> Key: SPARK-22465
> URL: https://issues.apache.org/jira/browse/SPARK-22465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 
> 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Amit Kumar
>Priority: Critical
>
> While running my spark pipeline, it failed with the following exception
> {noformat}
> 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 630.0 in stage 28.0 
> (TID 58670)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After debugging, I found that the issue lies in how Spark handles the cogroup 
> of two RDDs.
> Here is the relevant code from Apache Spark:
> {noformat}
>  /**
>* For each key k in `this` or `other`, return a resulting RDD that 
> contains a tuple with the
>* list of values for that key in `this` as well as `other`.
>*/
>   def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = 
> self.withScope {
> cogroup(other, defaultPartitioner(self, other))
>   }
> /**
>* Choose a partitioner to use for a cogroup-like operation between a 
> number of RDDs.
>*
>* If any of the RDDs already has a partitioner, choose that one.
>*
>* Otherwise, we use a default HashPartitioner. For the number of 
> partitions, if
>* spark.default.parallelism is set, then we'll use the value from 
> SparkContext
>* defaultParallelism, otherwise we'll use the max number of upstream 
> partitions.
>*
>* Unless spark.default.parallelism is set, the number of partitions will 
> be the
>* same as the number of partitions in the largest upstream RDD, as this 
> should
>* be least likely to cause out-of-memory errors.
>*
>* We use two method parameters (rdd, others) to enforce callers passing at 
> least 1 RDD.
>*/
>   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
> val rdds = (Seq(rdd) ++ others)
> val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 
> 0))
> if (hasPartitioner.nonEmpty) {
>   hasPartitioner.maxBy(_.partitions.length).partitioner.get
> } else {
>   if (rdd.context.conf.contains("spark.default.parallelism")) {
> new HashPartitioner(rdd.context.defaultParallelism)
>   } else {
> new HashPartitioner(rdds.map(_.partitions.length).max)
>   }
> }
>   }
> {noformat}
> Given this, suppose we have two pair RDDs:
> RDD1: a small RDD with little data and few partitions
> RDD2: a huge RDD with lots of data and partitions
> Now if we cogroup them
> {noformat}
> val RDD3 = RDD1.cogroup(RDD2)
> {noformat}
> there is a case where this could lead to the SPARK-6235 bug: if RDD1 has a 
> partitioner when it is passed into the cogroup, the cogroup's partitions are 
> decided by that partitioner, which could lead to the huge RDD2 being shuffled 
> into a small number of partitions.
> One way to address this is probably to add a safety check that ignores the 
> partitioner if the numbers of partitions of the two RDDs differ greatly in 
> magnitude.
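
A rough sketch of what such a safety check could look like, for illustration 
only: defaultPartitionerWithCheck is a made-up name, the 10x threshold is an 
arbitrary choice, and this is not what Spark's defaultPartitioner currently does.

{code:scala}
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// Hypothetical variant of defaultPartitioner: reuse an existing partitioner only
// if its partition count is within an order of magnitude of the largest RDD.
def defaultPartitionerWithCheck(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = Seq(rdd) ++ others
  val maxUpstreamParts = rdds.map(_.partitions.length).max
  val existing = rdds
    .filter(_.partitioner.exists(_.numPartitions > 0))
    .sortBy(_.partitions.length)
    .lastOption                 // the partitioned RDD with the most partitions
    .flatMap(_.partitioner)

  existing match {
    case Some(p) if maxUpstreamParts.toDouble / p.numPartitions <= 10.0 => p
    case _ =>
      if (rdd.context.getConf.contains("spark.default.parallelism")) {
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        new HashPartitioner(maxUpstreamParts)
      }
  }
}
{code}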



--
This 

[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292594#comment-16292594
 ] 

Thomas Graves commented on SPARK-22765:
---

I'm not sure how MR style and a 4-core executor go together. Either way, if you 
are allocated executors with 4 cores and only assign 4 tasks to each (like MR 
style), 3 of those tasks might finish quickly while one lasts much longer, and 
you waste resources.

How does MR style solve scheduler inefficiencies? One way or another you have to 
schedule tasks onto the containers you get. That seems like a scheduler issue, 
not a container allocation issue, and it would apply to both schemes, unless you 
happen to get lucky in that it is perhaps not as busy to schedule up front, but 
that is going to depend on when you get containers. If you want to investigate 
improving this, that would be great.

I'm not really sure what this proposal does that you can't do now (with 
SPARK-21656), other than kill an executor after it has run X tasks rather than 
waiting for it to go idle. This might be useful, but again it might hurt you, 
especially in busy clusters where you might only get a percentage of your 
executors up front. Let me know if I'm missing something in your proposal, though.
Note that we saw a significant increase in cluster utilization (meaning we were 
using resources better, not wasting them) when we moved to Pig on Tez with 
container reuse vs. the single-container MR style. Some of this was due to not 
having to run multiple jobs; some was the container reuse. This was across a 
cluster with mixed workloads, though. I would expect container reuse on Spark to 
do the same, but there are obviously specialized workloads.

Let me know what you find with SPARK-21656, and if it's not sufficient please 
add more specifics (a design) on what you propose to change.
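
For reference, the idle-based release being compared against is driven by 
existing settings like the ones below; the values are only illustrative.

{code:scala}
import org.apache.spark.SparkConf

// Illustrative values: today executors are released after sitting idle for a
// while, rather than after running a fixed number of tasks.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
{code}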

> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workload from MR to Spark find a significant 
> resource consumption hike (see SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others conscious 
> about cost, such a hike creates a migration obstacle. This situation can get 
> worse as more users move to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in multi-tenant 
> environments. With its performance-centric design, its inefficiency has also 
> unfortunately shown up, especially when compared with MR. Thus, it's believed 
> that an MR-styled scheduler still has its merit. Based on our research, the 
> inefficiency associated with dynamic allocation comes from many aspects such 
> as executors idling out, bigger executors, many stages in a Spark job (rather 
> than only 2 stages in MR), etc.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and made more adapted to the 
> Spark execution model. This alternative is expected to offer a good 
> performance improvement compared to MR, while keeping similar or even better 
> efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-12-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292595#comment-16292595
 ] 

Steve Loughran commented on SPARK-18294:


Following up on this, one question: Why support the older mapred protocol? 

The standard implementation in Hadoop just relays to the new one; supporting 
both complicates everyone's life, as there are two test paths, two sets of APIs 
to document, and a risk of different failure modes.

The v1 API isn't being actively developed, and really it's time to move off it.
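
For reference, a minimal sketch of what using the old API amounts to, assuming 
a plain FileOutputFormat job; in stock Hadoop the committer obtained this way 
simply wraps the mapreduce-package implementation:
{noformat}
// Illustrative only, not Spark's code: the old-API committer comes from the JobConf.
// For FileOutputFormat jobs this is org.apache.hadoop.mapred.FileOutputCommitter,
// which in recent Hadoop versions delegates to mapreduce.lib.output.FileOutputCommitter.
import org.apache.hadoop.mapred.{JobConf, OutputCommitter}

val jobConf = new JobConf()
val committer: OutputCommitter = jobConf.getOutputCommitter
{noformat}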

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
> Fix For: 2.3.0
>
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we should 
> implement a `HadoopMapRedCommitProtocol` that supports the older `mapred` 
> package's committer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG

2017-12-15 Thread Sujith Jay Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292581#comment-16292581
 ] 

Sujith Jay Nair commented on SPARK-22465:
-

Would something along the lines of 'add a safety check that ignores the 
partitioner if the numbers of partitions on the RDDs are very different in 
magnitude', as the reporter suggests, be a satisfactory solution? Any pointers 
here would be very helpful.
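
A rough sketch of the kind of safety check under discussion, assuming it would 
sit next to defaultPartitioner; the method name and the order-of-magnitude 
threshold are illustrative assumptions, not a committed design:
{noformat}
import org.apache.spark.Partitioner

// Treat an existing partitioner as ineligible when it has far fewer partitions
// than the largest upstream RDD, so callers can fall back to a HashPartitioner.
def isEligiblePartitioner(maxUpstreamPartitions: Int, existing: Partitioner): Boolean = {
  // read "very different in magnitude" as: more than one order of magnitude smaller
  math.log10(maxUpstreamPartitions) - math.log10(existing.numPartitions) < 1
}
{noformat}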

> Cogroup of two disproportionate RDDs could lead into 2G limit BUG
> -
>
> Key: SPARK-22465
> URL: https://issues.apache.org/jira/browse/SPARK-22465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 
> 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Amit Kumar
>Priority: Critical
>
> While running my spark pipeline, it failed with the following exception
> {noformat}
> 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 630.0 in stage 28.0 
> (TID 58670)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After debugging I found that the issue lies with how spark handles cogroup of 
> two RDDs.
> Here is the relevant code from apache spark
> {noformat}
>  /**
>* For each key k in `this` or `other`, return a resulting RDD that 
> contains a tuple with the
>* list of values for that key in `this` as well as `other`.
>*/
>   def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = 
> self.withScope {
> cogroup(other, defaultPartitioner(self, other))
>   }
> /**
>* Choose a partitioner to use for a cogroup-like operation between a 
> number of RDDs.
>*
>* If any of the RDDs already has a partitioner, choose that one.
>*
>* Otherwise, we use a default HashPartitioner. For the number of 
> partitions, if
>* spark.default.parallelism is set, then we'll use the value from 
> SparkContext
>* defaultParallelism, otherwise we'll use the max number of upstream 
> partitions.
>*
>* Unless spark.default.parallelism is set, the number of partitions will 
> be the
>* same as the number of partitions in the largest upstream RDD, as this 
> should
>* be least likely to cause out-of-memory errors.
>*
>* We use two method parameters (rdd, others) to enforce callers passing at 
> least 1 RDD.
>*/
>   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
> val rdds = (Seq(rdd) ++ others)
> val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 
> 0))
> if (hasPartitioner.nonEmpty) {
>   hasPartitioner.maxBy(_.partitions.length).partitioner.get
> } else {
>   if (rdd.context.conf.contains("spark.default.parallelism")) {
> new HashPartitioner(rdd.context.defaultParallelism)
>   } else {
> new HashPartitioner(rdds.map(_.partitions.length).max)
>   }
> }
>   }
> {noformat}
> Given this, suppose we have two pair RDDs.
> RDD1: A small RDD which has less data and fewer partitions
> RDD2: A huge RDD which has loads of data and partitions
> Now in the code, if we were to have a cogroup
> {noformat}
> val RDD3 = RDD1.cogroup(RDD2)
> {noformat}
> there is a case where this could lead to the SPARK-6235 bug, namely if RDD1 
> has a partitioner when it is passed into a cogroup. This is because the 
> cogroup's partitions are then decided by that partitioner, which could lead to 
> the huge RDD2 being shuffled into a small number of partitions.
> One way is probably to add a safety check here that would ignore the 
> partitioner if the numbers of partitions on the two RDDs are very different in 
> magnitude.

[jira] [Commented] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG

2017-12-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292569#comment-16292569
 ] 

Thomas Graves commented on SPARK-22465:
---

I don't have time at the moment to work on this so if you want to pick it up 
that would be great.

> Cogroup of two disproportionate RDDs could lead into 2G limit BUG
> -
>
> Key: SPARK-22465
> URL: https://issues.apache.org/jira/browse/SPARK-22465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 
> 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Amit Kumar
>Priority: Critical
>
> While running my spark pipeline, it failed with the following exception
> {noformat}
> 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 630.0 in stage 28.0 
> (TID 58670)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After debugging I found that the issue lies with how spark handles cogroup of 
> two RDDs.
> Here is the relevant code from apache spark
> {noformat}
>  /**
>* For each key k in `this` or `other`, return a resulting RDD that 
> contains a tuple with the
>* list of values for that key in `this` as well as `other`.
>*/
>   def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = 
> self.withScope {
> cogroup(other, defaultPartitioner(self, other))
>   }
> /**
>* Choose a partitioner to use for a cogroup-like operation between a 
> number of RDDs.
>*
>* If any of the RDDs already has a partitioner, choose that one.
>*
>* Otherwise, we use a default HashPartitioner. For the number of 
> partitions, if
>* spark.default.parallelism is set, then we'll use the value from 
> SparkContext
>* defaultParallelism, otherwise we'll use the max number of upstream 
> partitions.
>*
>* Unless spark.default.parallelism is set, the number of partitions will 
> be the
>* same as the number of partitions in the largest upstream RDD, as this 
> should
>* be least likely to cause out-of-memory errors.
>*
>* We use two method parameters (rdd, others) to enforce callers passing at 
> least 1 RDD.
>*/
>   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
> val rdds = (Seq(rdd) ++ others)
> val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 
> 0))
> if (hasPartitioner.nonEmpty) {
>   hasPartitioner.maxBy(_.partitions.length).partitioner.get
> } else {
>   if (rdd.context.conf.contains("spark.default.parallelism")) {
> new HashPartitioner(rdd.context.defaultParallelism)
>   } else {
> new HashPartitioner(rdds.map(_.partitions.length).max)
>   }
> }
>   }
> {noformat}
> Given this, suppose we have two pair RDDs.
> RDD1: A small RDD which has less data and fewer partitions
> RDD2: A huge RDD which has loads of data and partitions
> Now in the code, if we were to have a cogroup
> {noformat}
> val RDD3 = RDD1.cogroup(RDD2)
> {noformat}
> there is a case where this could lead to the SPARK-6235 bug, namely if RDD1 
> has a partitioner when it is passed into a cogroup. This is because the 
> cogroup's partitions are then decided by that partitioner, which could lead to 
> the huge RDD2 being shuffled into a small number of partitions.
> One way is probably to add a safety check here that would ignore the 
> partitioner if the numbers of partitions on the two RDDs are very different in 
> magnitude.

[jira] [Commented] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292540#comment-16292540
 ] 

Marco Gaido commented on SPARK-22799:
-

may I work on this?

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049
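
A hedged sketch of the kind of guard being asked for, assuming Bucketizer's 
existing single-column (inputCol) and multi-column (inputCols) params; the 
helper name and the error wording are assumptions:
{noformat}
import org.apache.spark.ml.feature.Bucketizer

// Fail fast when both the single-column and multi-column input params are set.
def validateInputParams(b: Bucketizer): Unit = {
  if (b.isSet(b.inputCol) && b.isSet(b.inputCols)) {
    throw new IllegalArgumentException(
      "Bucketizer: only one of inputCol or inputCols may be set, not both.")
  }
}
{noformat}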



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22803) Connection Refused Error is happening sometime.

2017-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22803.
---
Resolution: Invalid

Please don't put questions on JIRA. Use the mailing list.

> Connection Refused Error is happening sometime.
> ---
>
> Key: SPARK-22803
> URL: https://issues.apache.org/jira/browse/SPARK-22803
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: windows,pycharm,python
>Reporter: Annamalai Venugopal
>
> I am often facing a connection refused error, which says
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:52865)
> Traceback (most recent call last):
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\py4j\java_gateway.py",
>  line 827, in _get_connection
> connection = self.deque.pop()
> IndexError: pop from an empty deque
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\py4j\java_gateway.py",
>  line 963, in start
> self.socket.connect((self.address, self.port))
> ConnectionRefusedError: [WinError 10061] No connection could be made because 
> the target machine actively refused it



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22792) PySpark UDF registering issue

2017-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22792.
---
Resolution: Invalid

For JIRAs, you'd need to narrow this down to a clearly-described and narrow 
issue. Like your other JIRA, this is just a paste of your code.

> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows OS, Python pycharm ,Spark
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I am doing a project with PySpark and I am stuck with an issue
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 
> 187, in <module>
> hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
> hypernym_fn(F.column("token_extracted_data")))
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1957, in wrapper
> return udf_obj(*args)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1916, in __call__
> judf = self._judf
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1900, in _judf
> self._judf_placeholder = self._create_judf()
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1909, in _create_judf
> wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1866, in _wrap_function
> pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
>  line 2374, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
>  line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 704, in dumps
> cp.dump(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 148, in dump
> return Pickler.dump(self, obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 409, in dump
> self.save(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 736, in save_tuple
> save(element)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 852, in _batch_setitems
> save(v)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> 

[jira] [Created] (SPARK-22803) Connection Refused Error is happening sometime.

2017-12-15 Thread Annamalai Venugopal (JIRA)
Annamalai Venugopal created SPARK-22803:
---

 Summary: Connection Refused Error is happening sometime.
 Key: SPARK-22803
 URL: https://issues.apache.org/jira/browse/SPARK-22803
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.2.1
 Environment: windows,pycharm,python
Reporter: Annamalai Venugopal


I am often facing a connection refused error, which says

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
server (127.0.0.1:52865)
Traceback (most recent call last):
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\py4j\java_gateway.py",
 line 827, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\py4j\java_gateway.py",
 line 963, in start
self.socket.connect((self.address, self.port))
ConnectionRefusedError: [WinError 10061] No connection could be made because 
the target machine actively refused it





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22802) Regarding max tax size

2017-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-22802.
-

> Regarding max tax size
> --
>
> Key: SPARK-22802
> URL: https://issues.apache.org/jira/browse/SPARK-22802
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows,Pycharm
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> 17/12/15 17:15:55 WARN TaskSetManager: Stage 15 contains a task of very large 
> size (895 KB). The maximum recommended task size is 100 KB.
> Traceback (most recent call last):
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\runpy.py", 
> line 193, in _run_module_as_main
> "__main__", mod_spec)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\runpy.py", 
> line 85, in _run_code
> exec(code, run_globals)
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py",
>  line 216, in <module>
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py",
>  line 202, in main
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py",
>  line 577, in read_int
> EOFError



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22802) Regarding max tax size

2017-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22802.
---
Resolution: Invalid

Don't reopen this.

> Regarding max tax size
> --
>
> Key: SPARK-22802
> URL: https://issues.apache.org/jira/browse/SPARK-22802
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows,Pycharm
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> 17/12/15 17:15:55 WARN TaskSetManager: Stage 15 contains a task of very large 
> size (895 KB). The maximum recommended task size is 100 KB.
> Traceback (most recent call last):
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\runpy.py", 
> line 193, in _run_module_as_main
> "__main__", mod_spec)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\runpy.py", 
> line 85, in _run_code
> exec(code, run_globals)
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py",
>  line 216, in <module>
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py",
>  line 202, in main
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py",
>  line 577, in read_int
> EOFError



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22802) Regarding max tax size

2017-12-15 Thread Annamalai Venugopal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292527#comment-16292527
 ] 

Annamalai Venugopal commented on SPARK-22802:
-

I am doing a project with Spark; the source code is:

import sys
import os
from nltk.stem import SnowballStemmer
from nltk.corpus import wordnet as wn
from os import listdir
from os.path import isfile, join
from pyspark.sql.types import *

from pyspark.ml.feature import RegexTokenizer
from pyspark.sql.functions import udf
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.linalg import SparseVector, VectorUDT

lmtzr = SnowballStemmer("english")

# Set the path for spark installation
# this is the path where you downloaded and uncompressed the Spark download
# Using forward slashes on windows, \\ should work too.
os.environ['SPARK_HOME'] = "C:/Users/avenugopal/Downloads/spark-2.2.0-bin-hadoop2.7/"
# Append the python dir to PYTHONPATH so that pyspark could be found
sys.path.append("C:/Users/avenugopal/Downloads/spark-2.2.0-bin-hadoop2.7/python/")
# Append the python/build to PYTHONPATH so that py4j could be found
sys.path.append('C:/Users/avenugopal/Downloads/spark-2.2.0-bin-hadoop2.7/python/build')
# try the import Spark Modules
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark import SQLContext
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as fun
except ImportError as e:
    print("Error importing Spark Modules", e)
    sys.exit(1)

# Run a word count example on the local machine:

# if len(sys.argv) != 2:
#     print ( sys.stderr, "Usage: wordcount ")
#     exit(-1)
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("spark_wc").set("spark.driver.maxResultSize", "3g").set("spark.executor.memory", "3g")
sc = SparkContext(conf=conf)

sc.range(100)

my_path = "venv//reuters-extracted"
only_files = [f for f in listdir(my_path) if isfile(join(my_path, f))]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("sentence", StringType(), True),
    StructField("file_name", StringType(), True)
])
rowList = []
index = 0
for file in enumerate(only_files):
    with open(os.path.join(my_path, file[1]), 'r') as my_file:
        data = my_file.read().replace('\n', '')
        rowList.append((index, data, file))
        index = index + 1
spark = SparkSession.builder.appName("TokenizerExample").getOrCreate()
# $example on$
# sentenceDataFrame = spark.createDataFrame([
#     (0, "Hi I heard about Spark"),
#     (1, "I wish Java could use case classes"),
#     (2, "Logistic,regression,models,are,neat")
# ], ["id", "sentence"])
sentenceDataFrame = spark.createDataFrame(sc.parallelize(rowList), schema)

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="[^a-zA-Z]")
# alternatively, pattern="\\w+", gaps(False)

countTokens = udf(lambda words: len(words), IntegerType())

regexTokenized = regexTokenizer.transform(sentenceDataFrame)

remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filtered = remover.transform(regexTokenized)
filtered = filtered.drop("sentence").drop("words")

dataType = ArrayType(StringType(), False)
stemmingFn = udf(lambda words: stemmer(words), dataType)


def stemmer(words):
    stemmed_words = [lmtzr.stem(word) for word in words]
    return stemmed_words


stemmed = filtered.withColumn("stemmedWords", stemmingFn(fun.column("filtered")))

print(stemmed.rdd.getNumPartitions())
stemmed.select("stemmedWords").show(truncate=False)

cv = CountVectorizer(inputCol="stemmedWords", outputCol="features")

model = cv.fit(stemmed)
result = model.transform(stemmed)
result.select("features").show(truncate=False)
vocabulary = model.vocabulary
print(len(vocabulary))

idf = IDF(inputCol="features", outputCol="IDF_features")
idfModel = idf.fit(result)
rescaledData = idfModel.transform(result)

rescaledData.select("IDF_features").show(truncate=False)
print(rescaledData.rdd.getNumPartitions())


def filtering(vector):
    size = vector.size
    indices = vector.indices
    values = vector.values
    new_indices = []
    new_values = []
    for iterator, value in zip(indices, values):
        if value >= 0.8:
            new_indices.append(iterator)
            new_values.append(value)
    sparse_vector = SparseVector(size, new_indices, new_values)
    return sparse_vector


filterFn = udf(lambda vector: filtering(vector), VectorUDT())
filteredData = rescaledData.withColumn("filteredData", filterFn("IDF_features"))
filteredData = filteredData.drop("filtered")
filteredData.select("filteredData").show(truncate=False)


def token_extract(vector):
    indices = vector.indices
    tokens = []
    for iterator in indices:
        tokens.append(vocabulary[iterator])
    return tokens


token_extract_fn = udf(lambda vector: token_extract(vector), ArrayType(StringType()))

token_extracted_data = 

[jira] [Reopened] (SPARK-22802) Regarding max tax size

2017-12-15 Thread Annamalai Venugopal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Annamalai Venugopal reopened SPARK-22802:
-


[jira] [Resolved] (SPARK-22802) Regarding max tax size

2017-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22802.
---
Resolution: Invalid

It's not clear what you're even asking, but it should go to the mailing list

> Regarding max tax size
> --
>
> Key: SPARK-22802
> URL: https://issues.apache.org/jira/browse/SPARK-22802
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows,Pycharm
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> 17/12/15 17:15:55 WARN TaskSetManager: Stage 15 contains a task of very large 
> size (895 KB). The maximum recommended task size is 100 KB.
> Traceback (most recent call last):
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\runpy.py", 
> line 193, in _run_module_as_main
> "__main__", mod_spec)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\runpy.py", 
> line 85, in _run_code
> exec(code, run_globals)
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py",
>  line 216, in <module>
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py",
>  line 202, in main
>   File 
> "C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py",
>  line 577, in read_int
> EOFError



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22801) Allow FeatureHasher to specify numeric columns to treat as categorical

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22801:


Assignee: Nick Pentreath  (was: Apache Spark)

> Allow FeatureHasher to specify numeric columns to treat as categorical
> --
>
> Key: SPARK-22801
> URL: https://issues.apache.org/jira/browse/SPARK-22801
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> {{FeatureHasher}} added in SPARK-13964 always treats numeric type columns as 
> numbers and never as categorical features. It is quite common to have 
> categorical features represented as numbers or codes (often say {{Int}}) in 
> data sources. 
> In order to hash these features as categorical, users must first explicitly 
> convert them to strings which is cumbersome. 
> Add a new param {{categoricalCols}} which specifies the numeric columns that 
> should be treated as categorical features.
> *Note* while the reverse case is certainly possible (i.e. numeric features 
> that are encoded as strings and a user would like to treat them as numeric), 
> this is probably less likely and this case won't be supported at this time. 
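
A rough illustration of how the proposed param might be used from the caller's 
side; only the param name {{categoricalCols}} comes from the description above, 
while the setter name and column names here are assumptions:
{noformat}
import org.apache.spark.ml.feature.FeatureHasher

// Hash "region_code" (an Int code) as a categorical feature rather than as a
// number, without first casting the column to a string.
val hasher = new FeatureHasher()
  .setInputCols("region_code", "hour", "amount")
  .setCategoricalCols(Array("region_code"))
  .setOutputCol("features")
{noformat}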



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22801) Allow FeatureHasher to specify numeric columns to treat as categorical

2017-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292467#comment-16292467
 ] 

Apache Spark commented on SPARK-22801:
--

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/19991

> Allow FeatureHasher to specify numeric columns to treat as categorical
> --
>
> Key: SPARK-22801
> URL: https://issues.apache.org/jira/browse/SPARK-22801
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> {{FeatureHasher}} added in SPARK-13964 always treats numeric type columns as 
> numbers and never as categorical features. It is quite common to have 
> categorical features represented as numbers or codes (often say {{Int}}) in 
> data sources. 
> In order to hash these features as categorical, users must first explicitly 
> convert them to strings which is cumbersome. 
> Add a new param {{categoricalCols}} which specifies the numeric columns that 
> should be treated as categorical features.
> *Note* while the reverse case is certainly possible (i.e. numeric features 
> that are encoded as strings and a user would like to treat them as numeric), 
> this is probably less likely and this case won't be supported at this time. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22801) Allow FeatureHasher to specify numeric columns to treat as categorical

2017-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22801:


Assignee: Apache Spark  (was: Nick Pentreath)

> Allow FeatureHasher to specify numeric columns to treat as categorical
> --
>
> Key: SPARK-22801
> URL: https://issues.apache.org/jira/browse/SPARK-22801
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>
> {{FeatureHasher}} added in SPARK-13964 always treats numeric type columns as 
> numbers and never as categorical features. It is quite common to have 
> categorical features represented as numbers or codes (often say {{Int}}) in 
> data sources. 
> In order to hash these features as categorical, users must first explicitly 
> convert them to strings which is cumbersome. 
> Add a new param {{categoricalCols}} which specifies the numeric columns that 
> should be treated as categorical features.
> *Note* while the reverse case is certainly possible (i.e. numeric features 
> that are encoded as strings and a user would like to treat them as numeric), 
> this is probably less likely and this case won't be supported at this time. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22802) Regarding max tax size

2017-12-15 Thread Annamalai Venugopal (JIRA)
Annamalai Venugopal created SPARK-22802:
---

 Summary: Regarding max tax size
 Key: SPARK-22802
 URL: https://issues.apache.org/jira/browse/SPARK-22802
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.2.1
 Environment: Windows,Pycharm
Reporter: Annamalai Venugopal


17/12/15 17:15:55 WARN TaskSetManager: Stage 15 contains a task of very large 
size (895 KB). The maximum recommended task size is 100 KB.
Traceback (most recent call last):
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 
193, in _run_module_as_main
"__main__", mod_spec)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 
85, in _run_code
exec(code, run_globals)
  File 
"C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py",
 line 216, in <module>
  File 
"C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py",
 line 202, in main
  File 
"C:\Users\avenugopal\Downloads\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py",
 line 577, in read_int
EOFError




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22792) PySpark UDF registering issue

2017-12-15 Thread Annamalai Venugopal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292378#comment-16292378
 ] 

Annamalai Venugopal commented on SPARK-22792:
-

def hypernym_generation(token_array_a):
    token_array = token_array_a
    print(token_array)
    hypernym_dictionary = {}
    for token in token_array:
        synsets = wn.synsets(token)
        hypernym_set = []
        # print(type(synsets))
        # for synset in synsets:
        hypernyms = synsets[0].hypernyms()
        if len(hypernyms) > 0:
            for hypernym in hypernyms:
                hypernym_set.append(hypernym.lemma_names()[0])
        hypernym_dictionary[token] = hypernym_set
    return hypernym_dictionary


def convert(token_array):
    print(token_array)
    return hypernym_generation(token_array)


hypernym_fn = udf(convert, MapType(StringType(), ArrayType(StringType(), False), True))
hypernym_extracted_data = result.withColumn("hypernym_extracted_data", hypernym_fn(F.column("token_extracted_data")))
hypernym_extracted_data.select("hypernym_extracted_data").show(truncate=False)
print(hypernym_extracted_data.rdd.getNumPartitions())
sc.stop()

This is the part that produces the error shown in the issue description.


> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows OS, Python pycharm ,Spark
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I am doing a project with PySpark and I am stuck with an issue
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 
> 187, in <module>
> hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
> hypernym_fn(F.column("token_extracted_data")))
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1957, in wrapper
> return udf_obj(*args)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1916, in __call__
> judf = self._judf
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1900, in _judf
> self._judf_placeholder = self._create_judf()
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1909, in _create_judf
> wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1866, in _wrap_function
> pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
>  line 2374, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
>  line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 704, in dumps
> cp.dump(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 148, in dump
> return Pickler.dump(self, obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 409, in dump
> self.save(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 736, in save_tuple
> save(element)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 

[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Description: See the related discussion: 
https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


