[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475943#comment-15475943
 ] 

Shivaram Venkataraman commented on SPARK-17428:
---

I think there are a bunch of issues being discussed here. My initial take would 
be to add support for something simple and then iterate based on user feedback. 
Given that R users generally don't know or care much about package version 
numbers, I'd say an initial cut could handle two flags in spark-submit:

(a) a list of package names, for which we call `install.packages` on each 
machine
(b) a list of package tar.gz files that are installed with `R CMD INSTALL` on 
each machine

We can also make the package installs lazy, i.e. they only get run on a worker 
when an R worker process is launched there; see the sketch below. Will this 
meet the user needs you have in mind, [~yanboliang]?

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR cannot satisfy these requirements elegantly. 
> For example, you have to work with the IT/administrators of the cluster to 
> deploy these R packages on each executor/worker node, which is very 
> inflexible.
> I think we should support third party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.






[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-09-08 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475860#comment-15475860
 ] 

Peng Meng commented on SPARK-6160:
--

Hi [~GayathriMurali], are you still working on this? If not, I can work on it. 
Thanks.

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.






[jira] [Commented] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475849#comment-15475849
 ] 

Apache Spark commented on SPARK-17464:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15021

> SparkR spark.als arguments reg should be 0.1 by default
> ---
>
> Key: SPARK-17464
> URL: https://issues.apache.org/jira/browse/SPARK-17464
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR spark.als argument {{reg}} should be 0.1 by default, which needs to be 
> consistent with ML.






[jira] [Assigned] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17464:


Assignee: (was: Apache Spark)

> SparkR spark.als arguments reg should be 0.1 by default
> ---
>
> Key: SPARK-17464
> URL: https://issues.apache.org/jira/browse/SPARK-17464
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR spark.als argument {{reg}} should be 0.1 by default, which needs to be 
> consistent with ML.






[jira] [Assigned] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17464:


Assignee: Apache Spark

> SparkR spark.als arguments reg should be 0.1 by default
> ---
>
> Key: SPARK-17464
> URL: https://issues.apache.org/jira/browse/SPARK-17464
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> SparkR spark.als argument {{reg}} should be 0.1 by default, which needs to be 
> consistent with ML.






[jira] [Created] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default

2016-09-08 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17464:
---

 Summary: SparkR spark.als arguments reg should be 0.1 by default
 Key: SPARK-17464
 URL: https://issues.apache.org/jira/browse/SPARK-17464
 Project: Spark
  Issue Type: Bug
  Components: ML, SparkR
Reporter: Yanbo Liang
Priority: Minor


SparkR spark.als argument {{reg}} should be 0.1 by default, which needs to be 
consistent with ML.
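
For reference, a minimal SparkR sketch of the call this affects (the data frame 
and column names are placeholders); once the defaults are aligned, omitting 
{{reg}} should behave the same as passing 0.1 explicitly:
{code}
df <- createDataFrame(ratings)  # `ratings`: a local data.frame with user/item/rating columns
model <- spark.als(df, ratingCol = "rating", userCol = "user", itemCol = "item",
                   rank = 10, reg = 0.1, maxIter = 10)
{code}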






[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-09-08 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475769#comment-15475769
 ] 

Peng Meng commented on SPARK-6160:
--

Hi [~josephkb], I have had some discussion with [~srowen] about keeping the 
test statistic info of ChiSqSelector in PR 
https://github.com/apache/spark/pull/14597. 
Could you review that PR, and in particular this commit: 
https://github.com/apache/spark/pull/14597/commits/3d6aecb8441503c9c3d62a2d8a3d48824b9d6637


> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.






[jira] [Commented] (SPARK-15509) R MLlib algorithms should support input columns "features" and "label"

2016-09-08 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475758#comment-15475758
 ] 

WangJianfei commented on SPARK-15509:
-

Please check my issue https://issues.apache.org/jira/browse/SPARK-17447.
Thanks.

> R MLlib algorithms should support input columns "features" and "label"
> --
>
> Key: SPARK-15509
> URL: https://issues.apache.org/jira/browse/SPARK-15509
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xin Ren
> Fix For: 2.1.0
>
>
> Currently in SparkR, when you load a LibSVM dataset using the sqlContext and 
> then pass it to an MLlib algorithm, the ML wrappers will fail since they will 
> try to create a "features" column, which conflicts with the existing 
> "features" column from the LibSVM loader.  E.g., using the "mnist" dataset 
> from LibSVM:
> {code}
> training <- loadDF(sqlContext, ".../mnist", "libsvm")
> model <- naiveBayes(label ~ features, training)
> {code}
> This fails with:
> {code}
> 16/05/24 11:52:41 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   java.lang.IllegalArgumentException: Output column features already exists.
>   at 
> org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
>   at 
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
>   at 
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
>   at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
>   at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
> {code}
> The same issue appears for the "label" column once you rename the "features" 
> column.






[jira] [Commented] (SPARK-17245) NPE thrown by ClientWrapper.conf

2016-09-08 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475756#comment-15475756
 ] 

WangJianfei commented on SPARK-17245:
-

Please check the issue https://issues.apache.org/jira/browse/SPARK-17447.
Thanks.

> NPE thrown by ClientWrapper.conf
> 
>
> Key: SPARK-17245
> URL: https://issues.apache.org/jira/browse/SPARK-17245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.3
>
>
> This issue has been fixed in Spark 2.0. It seems ClientWrapper.conf is trying 
> to access the ThreadLocal SessionState, which has not been set. 
> {code}
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603)
>  
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) 
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) 
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>  
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>  
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) 
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>  
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130) 
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) 
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) 
> {code}






[jira] [Commented] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf

2016-09-08 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475754#comment-15475754
 ] 

WangJianfei commented on SPARK-17387:
-

Please check https://issues.apache.org/jira/browse/SPARK-17447.
Thanks.

> Creating SparkContext() from python without spark-submit ignores user conf
> --
>
> Key: SPARK-17387
> URL: https://issues.apache.org/jira/browse/SPARK-17387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Consider the following scenario: user runs a python application not through 
> spark-submit, but by adding the pyspark module and manually creating a Spark 
> context. Kinda like this:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext
> >>> from pyspark import SparkConf
> >>> conf = SparkConf().set("spark.driver.memory", "4g")
> >>> sc = SparkContext(conf=conf)
> {noformat}
> If you look at the JVM launched by the pyspark code, it ignores the user's 
> configuration:
> {noformat}
> $ ps ax | grep $(pgrep -f SparkSubmit)
> 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g 
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
> {noformat}
> Note the "1g" of memory. If instead you use "pyspark", you get the correct 
> "4g" in the JVM.
> This also affects other configs; for example, you can't really add jars to 
> the driver's classpath using "spark.jars".
> You can work around this by setting the undocumented env variable Spark 
> itself uses:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf 
> >>> spark.driver.memory=4g"
> >>> from pyspark import SparkContext
> >>> sc = SparkContext()
> {noformat}
> But it would be nicer if the configs were automatically propagated.
> BTW the reason for this is that the {{launch_gateway}} function used to start 
> the JVM does not take any parameters, and the only place where it reads 
> arguments for Spark is that env variable.






[jira] [Commented] (SPARK-17449) Relation between heartbeatInterval and network timeout

2016-09-08 Thread Yang Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475727#comment-15475727
 ] 

Yang Liang commented on SPARK-17449:


Sorry, let me clarify. The relation between spark.network.timeout and 
spark.executor.heartbeatInterval is confusing. It should at least be mentioned 
in the documentation.

> Relation between heartbeatInterval and network timeout
> --
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yang Liang
>  Labels: bug
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
> --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
> ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
> out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
> out after 11949 m
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
> out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by 
> several internal
>  * components to convey liveness or execution information for in-progress 
> tasks. It will also
>  * expire the hosts that have not heartbeated for more than 
> spark.network.timeout.
>  */
> private val executorTimeoutMs =
> sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") 
> * 1000
> The relation between spark.network.timeout and 
> spark.executor.heartbeatInterval should at least be mentioned in the 
> documentation; otherwise the errors above are confusing. Should some checks 
> be done when the settings are read?






[jira] [Updated] (SPARK-17449) Relation between heartbeatInterval and network timeout

2016-09-08 Thread Yang Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Liang updated SPARK-17449:
---
Description: 
$ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
--num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
ms exceeds timeout 12 ms
ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
out after 168136 ms


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
out after 11949 m


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
out after 39299 ms


Source Code:

spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala

/**
 * A heartbeat from executors to the driver. This is a shared message used by 
several internal
 * components to convey liveness or execution information for in-progress 
tasks. It will also
 * expire the hosts that have not heartbeated for more than 
spark.network.timeout.
 */


private val executorTimeoutMs =
sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 
1000


The relation between spark.network.timeout and spark.executor.heartbeatInterval 
should at least be mentioned in the documentation; otherwise the errors above 
are confusing. Should some checks be done when the settings are read?


  was:
$ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
--num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
ms exceeds timeout 12 ms
ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
out after 168136 ms


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
out after 11949 m


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
out after 39299 ms


Source Code:

spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala

/**
 * A heartbeat from executors to the driver. This is a shared message used by 
several internal
 * components to convey liveness or execution information for in-progress 
tasks. It will also
 * expire the hosts that have not heartbeated for more than 
spark.network.timeout.
 */


private val executorTimeoutMs =
sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 
1000


The relation between spark.network.timeout and spark.executor.heartbeatInterval 
should at least be mentioned in the documentation; otherwise the errors above 
are confusing.



> Relation between heartbeatInterval and network timeout
> --
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yang Liang
>  Labels: bug
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
> --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
> ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
> out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
> out after 11949 m
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
> out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared 

[jira] [Updated] (SPARK-17449) executorTimeoutMs configure error

2016-09-08 Thread Yang Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Liang updated SPARK-17449:
---
Description: 
$ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
--num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
ms exceeds timeout 12 ms
ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
out after 168136 ms


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
out after 11949 m


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
out after 39299 ms


Source Code:

spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala

/**
 * A heartbeat from executors to the driver. This is a shared message used by 
several internal
 * components to convey liveness or execution information for in-progress 
tasks. It will also
 * expire the hosts that have not heartbeated for more than 
spark.network.timeout.
 */


private val executorTimeoutMs =
sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 
1000


The relation between spark.network.timeout and spark.executor.heartbeatInterval 
should at least be mentioned in the documentation; otherwise the errors above 
are confusing.


  was:
$ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
--num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
ms exceeds timeout 12 ms
ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
out after 168136 ms


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
out after 11949 m


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
out after 39299 ms


Source Code:

spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala

/**
 * A heartbeat from executors to the driver. This is a shared message used by 
several internal
 * components to convey liveness or execution information for in-progress 
tasks. It will also
 * expire the hosts that have not heartbeated for more than 
spark.network.timeout.
 */


private val executorTimeoutMs =
sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 
1000



> executorTimeoutMs configure error
> -
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yang Liang
>  Labels: bug
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
> --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
> ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
> out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
> out after 11949 m
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
> out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by 
> several internal
>  * components to convey liveness or execution information for in-progress 
> tasks. It will also
>  * expire the hosts that have not heartbeated for more than 
> spark.network.timeout.
>  */
> private val 

[jira] [Updated] (SPARK-17449) Relation between

2016-09-08 Thread Yang Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Liang updated SPARK-17449:
---
Summary: Relation between   (was: executorTimeoutMs configure error)

> Relation between 
> -
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yang Liang
>  Labels: bug
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
> --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
> ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
> out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
> out after 11949 m
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
> out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by 
> several internal
>  * components to convey liveness or execution information for in-progress 
> tasks. It will also
>  * expire the hosts that have not heartbeated for more than 
> spark.network.timeout.
>  */
> private val executorTimeoutMs =
> sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") 
> * 1000
> The relation between spark.network.timeout and 
> spark.executor.heartbeatInterval should at least be mentioned in the 
> documentation; otherwise the errors above are confusing.






[jira] [Updated] (SPARK-17449) Relation between heartbeatInterval and network timeout

2016-09-08 Thread Yang Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Liang updated SPARK-17449:
---
Summary: Relation between heartbeatInterval and network timeout  (was: 
Relation between )

> Relation between heartbeatInterval and network timeout
> --
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yang Liang
>  Labels: bug
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
> --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
> ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
> out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
> out after 11949 m
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
> out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by 
> several internal
>  * components to convey liveness or execution information for in-progress 
> tasks. It will also
>  * expire the hosts that have not heartbeated for more than 
> spark.network.timeout.
>  */
> private val executorTimeoutMs =
> sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") 
> * 1000
> The relation between spark.network.timeout and 
> spark.executor.heartbeatInterval should at least be mentioned in the 
> documentation; otherwise the errors above are confusing.






[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-09-08 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475709#comment-15475709
 ] 

Peng Meng commented on SPARK-6160:
--

hi Joseph K. Bradley

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.






[jira] [Issue Comment Deleted] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-09-08 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-6160:
-
Comment: was deleted

(was: hi Joseph K. Bradley)

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.






[jira] [Updated] (SPARK-17449) executorTimeoutMs configure error

2016-09-08 Thread Yang Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Liang updated SPARK-17449:
---
Description: 
$ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
--num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
ms exceeds timeout 12 ms
ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
out after 168136 ms


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
out after 11949 m


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
out after 39299 ms


Source Code:

spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala

/**
 * A heartbeat from executors to the driver. This is a shared message used by 
several internal
 * components to convey liveness or execution information for in-progress 
tasks. It will also
 * expire the hosts that have not heartbeated for more than 
spark.network.timeout.
 */


private val executorTimeoutMs =
sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 
1000


  was:

$ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
--num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
ms exceeds timeout 12 ms
ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
out after 168136 ms


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
out after 11949 m


spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
spark.network.timeout=10s --num-executors 1

WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms 
exceeds timeout 1 ms
ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
out after 39299 ms


Source Code:

spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala

private val executorTimeoutMs =
sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 
1000



> executorTimeoutMs configure error
> -
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yang Liang
>  Labels: bug
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
> --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
> ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
> out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
> out after 11949 m
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
> out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by 
> several internal
>  * components to convey liveness or execution information for in-progress 
> tasks. It will also
>  * expire the hosts that have not heartbeated for more than 
> spark.network.timeout.
>  */
> private val executorTimeoutMs =
> sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") 
> * 1000






[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475664#comment-15475664
 ] 

Jeff Zhang commented on SPARK-17428:


Found another elegant way to specify the version, using devtools:
https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages
{code}
require(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
{code}

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR cannot satisfy these requirements elegantly. 
> For example, you have to work with the IT/administrators of the cluster to 
> deploy these R packages on each executor/worker node, which is very 
> inflexible.
> I think we should support third party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.






[jira] [Comment Edited] (SPARK-17447) performance improvement in Partitioner.DefaultPartitioner

2016-09-08 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475607#comment-15475607
 ] 

WangJianfei edited comment on SPARK-17447 at 9/9/16 2:28 AM:
-

We can scan the RDD array only once to find the RDD with the maximum number of 
partitions whose partitioner is defined.
Compare the current Spark code with my proposed logic.
The current Spark code:
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse
for (r <- bySize if r.partitioner.isDefined && 
r.partitioner.get.numPartitions > 0) {
  return r.partitioner.get
}
if (rdd.context.conf.contains("spark.default.parallelism")) {
  new HashPartitioner(rdd.context.defaultParallelism)
} else {
  new HashPartitioner(bySize.head.partitions.length)
}
  }

This is my basic logic:

// Single pass: track the largest number of partitions among the RDDs whose
// partitioner is defined, instead of sorting all of the RDDs first.
var maxP = 0
if (rdd.partitioner.isDefined && rdd.partitioner.get.numPartitions > 0) {
  maxP = rdd.partitioner.get.numPartitions
}
for (i <- 0 until others.length) {
  if (others(i).partitioner.isDefined &&
      others(i).partitioner.get.numPartitions > maxP) {
    maxP = others(i).partitioner.get.numPartitions
  }
}


was (Author: codlife):
We can scan the RDD array only once to find the RDD with the maximum number of 
partitions whose partitioner is defined.
Compare the current Spark code with my proposed logic.
The current Spark code:
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse
for (r <- bySize if r.partitioner.isDefined && 
r.partitioner.get.numPartitions > 0) {
  return r.partitioner.get
}
if (rdd.context.conf.contains("spark.default.parallelism")) {
  new HashPartitioner(rdd.context.defaultParallelism)
} else {
  new HashPartitioner(bySize.head.partitions.length)
}
  }

this is my basic logic:

  var seq=List[(Int,Int)]()
for(i <- 0 until others.length){
  if(others(i).partitioner.isDefined) {
seq = seq :+ (others(i).partitions.length, i)
  }
}
println("this is seq:")
seq.foreach(println)
var maxP=0
if(rdd.partitioner.isDefined && rdd.partitioner.get.numPartitions >0 ){
  maxP=rdd.partitioner.get.numPartitions
}
for(i <- 0 until seq.length){
  if(others(seq(i)._2).partitioner.isDefined && 
others(seq(i)._2).partitioner.get.numPartitions > maxP){
maxP=others(seq(i)._2).partitioner.get.numPartitions
  }
}

> performance improvement in Partitioner.DefaultPartitioner 
> --
>
> Key: SPARK-17447
> URL: https://issues.apache.org/jira/browse/SPARK-17447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> If there are many RDDs, in some situations the sort will hurt performance 
> severely. Actually we needn't sort the RDDs; we can just scan them once to 
> achieve the same goal.






[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475645#comment-15475645
 ] 

Jeff Zhang commented on SPARK-17428:


I just linked the JIRA of the Python virtualenv work. It seems R supports 
virtualenv-like isolation natively: install.packages can specify the version 
and the installation destination folder, and it is isolated across users. I 
think there are 2 scenarios for the SparkR environment: one where the cluster 
has internet access, another where it does not.
If the cluster has internet access, then I think we can call install.packages 
directly (the lib directory below is a placeholder):
{code}
install.packages("dplyr", lib = "<lib_dir>")
library(dplyr, lib.loc = "<lib_dir>")
{code}
If the cluster doesn't have internet access, then the driver can first download 
the package tarballs and add them through --files, and each executor will try 
to compile and install these packages (again with placeholder paths):
{code}
install.packages("<path_to_tarball>", repos = NULL, type = "source",
                 lib = "<lib_dir>")
library(dplyr, lib.loc = "<lib_dir>")
{code}
For this scenario, if the package has dependencies, it would still try to 
download them from the internet, or the user has to manually figure out the 
dependencies and add them to the Spark app.
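
As a rough sketch of how the driver side could enumerate those dependencies up 
front (assuming the driver itself has CRAN access; {{deps_to_ship}} is a 
hypothetical helper, not a SparkR API):
{code}
# List the transitive CRAN dependencies of a package so the matching source
# tarballs can be downloaded once on the driver and shipped via --files.
deps_to_ship <- function(pkg, repo = "https://cloud.r-project.org") {
  db <- available.packages(contriburl = contrib.url(repo, type = "source"))
  tools::package_dependencies(pkg, db = db,
                              which = c("Depends", "Imports", "LinkingTo"),
                              recursive = TRUE)[[pkg]]
}
deps_to_ship("dplyr")
{code}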


> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR cannot satisfy these requirements elegantly. 
> For example, you have to work with the IT/administrators of the cluster to 
> deploy these R packages on each executor/worker node, which is very 
> inflexible.
> I think we should support third party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.






[jira] [Comment Edited] (SPARK-17447) performance improvement in Partitioner.DefaultPartitioner

2016-09-08 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475607#comment-15475607
 ] 

WangJianfei edited comment on SPARK-17447 at 9/9/16 2:12 AM:
-

We can scan the RDD array only once to find the RDD with the maximum number of 
partitions whose partitioner is defined.
Compare the current Spark code with my proposed logic.
The current Spark code:
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse
for (r <- bySize if r.partitioner.isDefined && 
r.partitioner.get.numPartitions > 0) {
  return r.partitioner.get
}
if (rdd.context.conf.contains("spark.default.parallelism")) {
  new HashPartitioner(rdd.context.defaultParallelism)
} else {
  new HashPartitioner(bySize.head.partitions.length)
}
  }

this is my basic logic:

  var seq=List[(Int,Int)]()
for(i <- 0 until others.length){
  if(others(i).partitioner.isDefined) {
seq = seq :+ (others(i).partitions.length, i)
  }
}
println("this is seq:")
seq.foreach(println)
var maxP=0
if(rdd.partitioner.isDefined && rdd.partitioner.get.numPartitions >0 ){
  maxP=rdd.partitioner.get.numPartitions
}
for(i <- 0 until seq.length){
  if(others(seq(i)._2).partitioner.isDefined && 
others(seq(i)._2).partitioner.get.numPartitions > maxP){
maxP=others(seq(i)._2).partitioner.get.numPartitions
  }
}


was (Author: codlife):
You can look at the source code below. We can scan the RDD array only once to 
find the RDD with the maximum number of partitions whose partitioner is 
defined.
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse
for (r <- bySize if r.partitioner.isDefined && 
r.partitioner.get.numPartitions > 0) {
  return r.partitioner.get
}
if (rdd.context.conf.contains("spark.default.parallelism")) {
  new HashPartitioner(rdd.context.defaultParallelism)
} else {
  new HashPartitioner(bySize.head.partitions.length)
}
  }

> performance improvement in Partitioner.DefaultPartitioner 
> --
>
> Key: SPARK-17447
> URL: https://issues.apache.org/jira/browse/SPARK-17447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> If there are many RDDs, in some situations the sort will hurt performance 
> severely. Actually we needn't sort the RDDs; we can just scan them once to 
> achieve the same goal.






[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475630#comment-15475630
 ] 

Jeff Zhang commented on SPARK-17428:


The source code URL needs to be specified to install a particular version. 
http://stackoverflow.com/questions/17082341/installing-older-version-of-r-package


> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR cannot satisfy these requirements elegantly. 
> For example, you have to work with the IT/administrators of the cluster to 
> deploy these R packages on each executor/worker node, which is very 
> inflexible.
> I think we should support third party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.






[jira] [Commented] (SPARK-17448) There should a limit of k in mllib.Kmeans

2016-09-08 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475619#comment-15475619
 ] 

WangJianfei commented on SPARK-17448:
-

Maybe we can limit k according to the number of elements of the input dataset, 
but as you say, the error is clear.

> There should a limit of k in mllib.Kmeans
> -
>
> Key: SPARK-17448
> URL: https://issues.apache.org/jira/browse/SPARK-17448
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In mllib.KMeans there is no limit on the param k; if we pass a big k, then 
> because of the array(k) allocation, a Java OutOfMemoryError will occur.






[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475612#comment-15475612
 ] 

Felix Cheung edited comment on SPARK-17428 at 9/9/16 1:59 AM:
--

I don't think I see a way to specify a version number for install.packages in R?

Python does compile code - installing packages with pip compiles the Python 
scripts. https://www.google.com/search?q=pyc
And also many packages have native components which will not work without 
installing or compiling as root (or heavy hacking), e.g. matplotlib, scipy.



was (Author: felixcheung):
I don't think I see a way to specify a version number for install.packages in R?

Python does compile code - installing packages with pip compiles the Python 
scripts. https://www.google.com/search?q=pyc
And also many packages have heavy native components which will not work without 
installing as root (or heavy hacking), e.g. matplotlib, scipy.


> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR cannot satisfy these requirements elegantly. 
> For example, you have to work with the IT/administrators of the cluster to 
> deploy these R packages on each executor/worker node, which is very 
> inflexible.
> I think we should support third party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475612#comment-15475612
 ] 

Felix Cheung commented on SPARK-17428:
--

I don't see a way to specify a version number for install.packages in R?

Python does compile code - installing packages with pip byte-compiles the 
Python scripts. https://www.google.com/search?q=pyc
Also, many packages have heavy native components which will not work without 
installing as root (or heavy hacking), e.g. matplotlib, scipy.


> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR can not satisfy this requirements elegantly. 
> For example, you should to mess with the IT/administrators of the cluster to 
> deploy these R packages on each executors/workers node which is very 
> inflexible.
> I think we should support third party R packages for SparkR users as what we 
> do for jar packages in the following two scenarios:
> 1, Users can install R packages from CRAN or custom CRAN-like repository for 
> each executors.
> 2, Users can load their local R packages and install them on each executors.
> To achieve this goal, the first thing is to make SparkR executors support 
> virtualenv like Python conda. I have investigated and found 
> packrat(http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application 
> scope(destroy after the application exit) and don’t need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17447) performance improvement in Partitioner.DefaultPartitioner

2016-09-08 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475607#comment-15475607
 ] 

WangJianfei commented on SPARK-17447:
-

You can look at the source code below: we can scan the RDD array only once to 
find the RDD with the maximum number of partitions whose partitioner is defined.

  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse
    for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.length)
    }
  }
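
A rough single-pass sketch of that idea (assumed method name, written against 
the same context and imports as the pasted Partitioner.scala code, including 
its use of rdd.context.conf; this is not the actual patch):

{code}
def defaultPartitionerSinglePass(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = Seq(rdd) ++ others
  // One fold tracks (a) the RDD with the most partitions overall and (b) the
  // RDD with the most partitions among those that have a usable partitioner,
  // avoiding the O(n log n) sort.
  val (largest, largestPartitioned) =
    rdds.foldLeft((rdds.head, Option.empty[RDD[_]])) { case ((maxAll, maxPart), r) =>
      val newMaxAll = if (r.partitions.length > maxAll.partitions.length) r else maxAll
      val usable = r.partitioner.exists(_.numPartitions > 0)
      val newMaxPart =
        if (usable && maxPart.forall(_.partitions.length < r.partitions.length)) Some(r)
        else maxPart
      (newMaxAll, newMaxPart)
    }
  largestPartitioned.flatMap(_.partitioner).getOrElse {
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(largest.partitions.length)
    }
  }
}
{code}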

> performance improvement in Partitioner.DefaultPartitioner 
> --
>
> Key: SPARK-17447
> URL: https://issues.apache.org/jira/browse/SPARK-17447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> If there are many RDDs in some situations, the sort will hurt performance 
> severely. Actually we needn't sort the RDDs; we can just scan them once to 
> achieve the same goal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475593#comment-15475593
 ] 

Sun Rui edited comment on SPARK-17428 at 9/9/16 1:52 AM:
-

I don't understand the meaning of exact version control. I think a user can 
specify downloaded R packages, or specify a package name and version and let 
SparkR download it from CRAN.

PySpark does not have the compilation issue, as Python code needs no 
compilation. The Python interpreter abstracts the underlying architecture 
differences just as the JVM does.

For the R package compilation issue, maybe we can have the following policies:
1. For binary R packages, just deliver them to worker nodes;
2. For source R packages:
  2.1 if only R code is contained, compilation on the driver node is OK
  2.2 if C/C++ code is contained, by default, compile it on the driver node. 
But we can have an option --compile-on-workers allowing users to choose to 
compile on worker nodes. If the option is specified, users should ensure the 
compilation toolchain and the dependent runtime libraries are ready on worker 
nodes.


was (Author: sunrui):
I don't understand the meaning of exact version control. I think a user can 
specify downloaded R packages or specify a package name and version, and SparkR 
can download it from CRAN.

PySpark does not have the compilation issue, as Python code needs no 
complication. The python interpreter abstracts the underly architecture 
differences just as JVM does.

For R package compilation issue, maybe we can have the following polices:
1. For binary R packages, just deliver them to worker nodes;
2. For source R packges:
  2.1 if only R code is contained, complication on the driver node is OK
  2.2 if C/c++ code is contained, by default, compile it on the driver node. 
But we can have an option --compile-on-workers allowing users to choose to 
compile on worker nodes. If the option is specified, users should ensure the 
compiling tool chain and the dependent runtime libraries be ready on worker 
nodes.

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR can not satisfy this requirements elegantly. 
> For example, you should to mess with the IT/administrators of the cluster to 
> deploy these R packages on each executors/workers node which is very 
> inflexible.
> I think we should support third party R packages for SparkR users as what we 
> do for jar packages in the following two scenarios:
> 1, Users can install R packages from CRAN or custom CRAN-like repository for 
> each executors.
> 2, Users can load their local R packages and install them on each executors.
> To achieve this goal, the first thing is to make SparkR executors support 
> virtualenv like Python conda. I have investigated and found 
> packrat(http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application 
> scope(destroy after the application exit) and don’t need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475593#comment-15475593
 ] 

Sun Rui edited comment on SPARK-17428 at 9/9/16 1:50 AM:
-

I don't understand the meaning of exact version control. I think a user can 
specify downloaded R packages, or specify a package name and version and let 
SparkR download it from CRAN.

PySpark does not have the compilation issue, as Python code needs no 
compilation. The Python interpreter abstracts the underlying architecture 
differences just as the JVM does.

For the R package compilation issue, maybe we can have the following policies:
1. For binary R packages, just deliver them to worker nodes;
2. For source R packages:
  2.1 if only R code is contained, compilation on the driver node is OK
  2.2 if C/C++ code is contained, by default, compile it on the driver node. 
But we can have an option --compile-on-workers allowing users to choose to 
compile on worker nodes. If the option is specified, users should ensure the 
compilation toolchain and the dependent runtime libraries are ready on worker 
nodes.


was (Author: sunrui):
I don't understand the meaning of exact version control. I think a user can 
specify downloaded R packages or specify a package name and version, and SparkR 
can download it from CRAN.

PySpark does not have the compilation issue, as Python code needs no 
complication. The python interpreter abstracts the underly architecture 
differences just as JVM does.

For R package compilation issue, maybe we can have the following polices:
1. For binary R packages, just deliver them to worker nodes;
2. For source R packges:
  2.1 if only R code is contained, complication on the driver node is OK
  2.2 if C/c++ code is contained, by default, compile it on the driver node. 
But we can have an option --compile-on-workers allowing users to choose to 
compile on worker nodes. If the option is specified, users should ensure the 
compiling tool chain be ready on worker nodes.

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR can not satisfy this requirements elegantly. 
> For example, you should to mess with the IT/administrators of the cluster to 
> deploy these R packages on each executors/workers node which is very 
> inflexible.
> I think we should support third party R packages for SparkR users as what we 
> do for jar packages in the following two scenarios:
> 1, Users can install R packages from CRAN or custom CRAN-like repository for 
> each executors.
> 2, Users can load their local R packages and install them on each executors.
> To achieve this goal, the first thing is to make SparkR executors support 
> virtualenv like Python conda. I have investigated and found 
> packrat(http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application 
> scope(destroy after the application exit) and don’t need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475593#comment-15475593
 ] 

Sun Rui commented on SPARK-17428:
-

I don't understand the meaning of exact version control. I think a user can 
specify downloaded R packages, or specify a package name and version and let 
SparkR download it from CRAN.

PySpark does not have the compilation issue, as Python code needs no 
compilation. The Python interpreter abstracts the underlying architecture 
differences just as the JVM does.

For the R package compilation issue, maybe we can have the following policies:
1. For binary R packages, just deliver them to worker nodes;
2. For source R packages:
  2.1 if only R code is contained, compilation on the driver node is OK
  2.2 if C/C++ code is contained, by default, compile it on the driver node. 
But we can have an option --compile-on-workers allowing users to choose to 
compile on worker nodes. If the option is specified, users should ensure the 
compilation toolchain is ready on worker nodes.

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR can not satisfy this requirements elegantly. 
> For example, you should to mess with the IT/administrators of the cluster to 
> deploy these R packages on each executors/workers node which is very 
> inflexible.
> I think we should support third party R packages for SparkR users as what we 
> do for jar packages in the following two scenarios:
> 1, Users can install R packages from CRAN or custom CRAN-like repository for 
> each executors.
> 2, Users can load their local R packages and install them on each executors.
> To achieve this goal, the first thing is to make SparkR executors support 
> virtualenv like Python conda. I have investigated and found 
> packrat(http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application 
> scope(destroy after the application exit) and don’t need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-09-08 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17463:
--

 Summary: Serialization of accumulators in heartbeats is not 
thread-safe
 Key: SPARK-17463
 URL: https://issues.apache.org/jira/browse/SPARK-17463
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Josh Rosen
Priority: Critical


Check out the following {{ConcurrentModificationException}}:

{code}

16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject(ArrayList.java:766)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 

[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-09-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475536#comment-15475536
 ] 

Josh Rosen commented on SPARK-17463:


[~zsxwing], FYI, since you're good at these types of RPC thread-safety issues.
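
For readers less familiar with this failure mode, a minimal hedged sketch in 
plain Scala (not the actual Executor code): java.io serialization walks the 
collection's elements, so serializing a list that another thread is still 
appending to can throw the ConcurrentModificationException shown above, while 
serializing a snapshot taken under a lock avoids the race.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.{ArrayList => JArrayList}

class MetricsHolder {
  private val values = new JArrayList[java.lang.Long]()

  def add(v: Long): Unit = values.synchronized { values.add(java.lang.Long.valueOf(v)) }

  // Unsafe: hands the live, concurrently mutated list to the serializer.
  def serializeUnsafe(): Array[Byte] = serialize(values)

  // Safer: serialize a snapshot copied while holding the same lock as add().
  def serializeSnapshot(): Array[Byte] =
    serialize(values.synchronized(new JArrayList[java.lang.Long](values)))

  private def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }
}
{code}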

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> 

[jira] [Created] (SPARK-17462) Check for places within MLlib which should use VersionUtils to parse Spark version strings

2016-09-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-17462:
-

 Summary: Check for places within MLlib which should use 
VersionUtils to parse Spark version strings
 Key: SPARK-17462
 URL: https://issues.apache.org/jira/browse/SPARK-17462
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Joseph K. Bradley
Priority: Minor


[SPARK-17456] creates a utility for parsing Spark versions.  Several places in 
MLlib use custom regexes or other approaches.  Those should be fixed to use the 
new (tested) utility.  This task is for checking and replacing those instances.
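
For context, a hedged sketch of what such a shared helper might look like (the 
object name and regex are assumptions; the real API comes from SPARK-17456 and 
is not reproduced here):

{code}
// Illustrative only: extract (major, minor) from strings like "2.0.1" or
// "2.1.0-SNAPSHOT", throwing a clear error when the string cannot be parsed.
object VersionParseSketch {
  private val VersionRegex = """^(\d+)\.(\d+)(\..*)?$""".r

  def majorMinor(sparkVersion: String): (Int, Int) = sparkVersion match {
    case VersionRegex(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(s"Cannot parse Spark version: $sparkVersion")
  }
}
{code}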



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-09-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15487.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently when running in Standalone mode, the Spark UI's links to workers and 
> application drivers point to internal/protected network endpoints, so to 
> access the workers/application UI the user's machine has to connect to a VPN 
> or have direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse proxy this 
> information back to the user, so only the Spark master UI needs to be opened 
> up to the internet. 
> The minimal change can be done by adding another route, e.g. 
> http://spark-master.com/target// so when a request goes to target, 
> ProxyServlet kicks in, takes the  and forwards the request to it, 
> and sends the response back to the user.
> More information about discussions for this features can be found on this 
> mailing list thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475418#comment-15475418
 ] 

Apache Spark commented on SPARK-17387:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/15019

> Creating SparkContext() from python without spark-submit ignores user conf
> --
>
> Key: SPARK-17387
> URL: https://issues.apache.org/jira/browse/SPARK-17387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Consider the following scenario: user runs a python application not through 
> spark-submit, but by adding the pyspark module and manually creating a Spark 
> context. Kinda like this:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext
> >>> from pyspark import SparkConf
> >>> conf = SparkConf().set("spark.driver.memory", "4g")
> >>> sc = SparkContext(conf=conf)
> {noformat}
> If you look at the JVM launched by the pyspark code, it ignores the user's 
> configuration:
> {noformat}
> $ ps ax | grep $(pgrep -f SparkSubmit)
> 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g 
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
> {noformat}
> Note the "1g" of memory. If instead you use "pyspark", you get the correct 
> "4g" in the JVM.
> This also affects other configs; for example, you can't really add jars to 
> the driver's classpath using "spark.jars".
> You can work around this by setting the undocumented env variable Spark 
> itself uses:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf 
> >>> spark.driver.memory=4g"
> >>> from pyspark import SparkContext
> >>> sc = SparkContext()
> {noformat}
> But it would be nicer if the configs were automatically propagated.
> BTW the reason for this is that the {{launch_gateway}} function used to start 
> the JVM does not take any parameters, and the only place where it reads 
> arguments for Spark is that env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17460) Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative

2016-09-08 Thread Chris Perluss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Perluss updated SPARK-17460:
--
Affects Version/s: 2.0.0

> Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being 
> negative
> -
>
> Key: SPARK-17460
> URL: https://issues.apache.org/jira/browse/SPARK-17460
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Spark 2.0 in local mode as well as on GoogleDataproc
>Reporter: Chris Perluss
>
> Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes 
> in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0.
> The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of 
> datatype Int.  In my dataset, there is an Array column whose data size 
> exceeds the limits of an Int and so the data size becomes negative.
> The issue can be repeated by running this code in REPL:
> val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that 
> long of a string")).toDS()
> // You might have to remove private[sql] from Dataset.logicalPlan to get this 
> to work
> val stats = ds.logicalPlan.statistics
> yields
> stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
> Statistics(-1890686892,false)
> This causes joinWith to perform a broadcast join even though my data is 
> gigabytes in size, which of course causes the executors to run out of memory.
> Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the 
> logicalPlan.statistics.sizeInBytes is a large negative number and thus it is 
> less than the join threshold of -1.
> I've been able to work around this issue by setting 
> autoBroadcastJoinThreshold to a very large negative number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17461) Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative

2016-09-08 Thread Chris Perluss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Perluss closed SPARK-17461.
-
Resolution: Duplicate

Duplicate of SPARK-17460

> Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being 
> negative
> -
>
> Key: SPARK-17461
> URL: https://issues.apache.org/jira/browse/SPARK-17461
> Project: Spark
>  Issue Type: Bug
> Environment: Spark 2.0 in local mode as well as on GoogleDataproc
>Reporter: Chris Perluss
>
> Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes 
> in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0.
> The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of 
> datatype Int.  In my dataset, there is an Array column whose data size 
> exceeds the limits of an Int and so the data size becomes negative.
> The issue can be repeated by running this code in REPL:
> val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that 
> long of a string")).toDS()
> // You might have to remove private[sql] from Dataset.logicalPlan to get this 
> to work
> val stats = ds.logicalPlan.statistics
> yields
> stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
> Statistics(-1890686892,false)
> This causes joinWith to perform a broadcast join even though my data is 
> gigabytes in size, which of course causes the executors to run out of memory.
> Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the 
> logicalPlan.statistics.sizeInBytes is a large negative number and thus it is 
> less than the join threshold of -1.
> I've been able to work around this issue by setting 
> autoBroadcastJoinThreshold to a very large negative number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17461) Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative

2016-09-08 Thread Chris Perluss (JIRA)
Chris Perluss created SPARK-17461:
-

 Summary: Dataset.joinWith causes OutOfMemory due to logicalPlan 
sizeInBytes being negative
 Key: SPARK-17461
 URL: https://issues.apache.org/jira/browse/SPARK-17461
 Project: Spark
  Issue Type: Bug
 Environment: Spark 2.0 in local mode as well as on GoogleDataproc
Reporter: Chris Perluss


Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes in 
size due to the dataset.logicalPlan.statistics.sizeInBytes < 0.

The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of 
datatype Int.  In my dataset, there is an Array column whose data size exceeds 
the limits of an Int and so the data size becomes negative.

The issue can be repeated by running this code in REPL:
val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that 
long of a string")).toDS()

// You might have to remove private[sql] from Dataset.logicalPlan to get this 
to work
val stats = ds.logicalPlan.statistics

yields

stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
Statistics(-1890686892,false)

This causes joinWith to perform a broadcast join even though my data is 
gigabytes in size, which of course causes the executors to run out of memory.

Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the 
logicalPlan.statistics.sizeInBytes is a large negative number and thus it is 
less than the join threshold of -1.

I've been able to work around this issue by setting autoBroadcastJoinThreshold 
to a very large negative number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17460) Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative

2016-09-08 Thread Chris Perluss (JIRA)
Chris Perluss created SPARK-17460:
-

 Summary: Dataset.joinWith causes OutOfMemory due to logicalPlan 
sizeInBytes being negative
 Key: SPARK-17460
 URL: https://issues.apache.org/jira/browse/SPARK-17460
 Project: Spark
  Issue Type: Bug
 Environment: Spark 2.0 in local mode as well as on GoogleDataproc
Reporter: Chris Perluss


Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes in 
size due to the dataset.logicalPlan.statistics.sizeInBytes < 0.

The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of 
datatype Int.  In my dataset, there is an Array column whose data size exceeds 
the limits of an Int and so the data size becomes negative.

The issue can be repeated by running this code in REPL:
val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that 
long of a string")).toDS()

// You might have to remove private[sql] from Dataset.logicalPlan to get this 
to work
val stats = ds.logicalPlan.statistics

yields

stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
Statistics(-1890686892,false)

This causes joinWith to perform a broadcast join even though my data is 
gigabytes in size, which of course causes the executors to run out of memory.

Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the 
logicalPlan.statistics.sizeInBytes is a large negative number and thus it is 
less than the join threshold of -1.

I've been able to work around this issue by setting autoBroadcastJoinThreshold 
to a very large negative number.
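
A hedged, self-contained illustration of the underlying overflow (plain Scala, 
not the Catalyst code): multiplying a per-element size by a large element count 
in Int arithmetic wraps to a negative value, which then slips under any 
positive broadcast threshold, while the same math in Long stays positive.

{code}
object SizeOverflowSketch {
  def estimatedSizeInt(elementSize: Int, numElements: Int): Int =
    elementSize * numElements                    // can wrap to a negative Int

  def estimatedSizeLong(elementSize: Long, numElements: Long): Long =
    elementSize * numElements                    // stays positive for these inputs

  def main(args: Array[String]): Unit = {
    println(estimatedSizeInt(8, 300000000))      // prints a negative number
    println(estimatedSizeLong(8L, 300000000L))   // prints 2400000000
  }
}
{code}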



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17459) Add Linear Discriminant to dimensionality reduction algorithms

2016-09-08 Thread Joshua Howard (JIRA)
Joshua Howard created SPARK-17459:
-

 Summary: Add Linear Discriminant to dimensionality reduction 
algorithms
 Key: SPARK-17459
 URL: https://issues.apache.org/jira/browse/SPARK-17459
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Joshua Howard
Priority: Minor


The goal is to add linear discriminant analysis as a method of dimensionality 
reduction. The algorithm and code are very similar to PCA, but instead project 
the data set onto vectors that provide class separation. LDA is a more 
effective alternative to PCA in terms of preprocessing for classification 
algorithms.
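
For reference, the standard Fisher criterion behind such a projection (added 
here for context; the ticket itself does not spell out the formulation):

{noformat}
J(w) = \frac{w^T S_B w}{w^T S_W w}
{noformat}

where S_B is the between-class scatter matrix and S_W the within-class scatter 
matrix; the optimal projection directions are the leading eigenvectors of 
S_W^{-1} S_B, of which there are at most (number of classes - 1).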



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17405.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15016
[https://github.com/apache/spark/pull/15016]

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Eric Liang
>Priority: Blocker
> Fix For: 2.1.0
>
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 

[jira] [Commented] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf

2016-09-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475253#comment-15475253
 ] 

Marcelo Vanzin commented on SPARK-17387:


Yeah, that's what I mean. Running the pyspark shell, or using spark-submit to 
run a python script, correctly preserves the user's configuration.

BTW my workaround above is wrong - "pyspark-shell" needs to be the last 
argument in "PYSPARK_SUBMIT_ARGS", not the first.

> Creating SparkContext() from python without spark-submit ignores user conf
> --
>
> Key: SPARK-17387
> URL: https://issues.apache.org/jira/browse/SPARK-17387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Consider the following scenario: user runs a python application not through 
> spark-submit, but by adding the pyspark module and manually creating a Spark 
> context. Kinda like this:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext
> >>> from pyspark import SparkConf
> >>> conf = SparkConf().set("spark.driver.memory", "4g")
> >>> sc = SparkContext(conf=conf)
> {noformat}
> If you look at the JVM launched by the pyspark code, it ignores the user's 
> configuration:
> {noformat}
> $ ps ax | grep $(pgrep -f SparkSubmit)
> 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g 
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
> {noformat}
> Note the "1g" of memory. If instead you use "pyspark", you get the correct 
> "4g" in the JVM.
> This also affects other configs; for example, you can't really add jars to 
> the driver's classpath using "spark.jars".
> You can work around this by setting the undocumented env variable Spark 
> itself uses:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf 
> >>> spark.driver.memory=4g"
> >>> from pyspark import SparkContext
> >>> sc = SparkContext()
> {noformat}
> But it would be nicer if the configs were automatically propagated.
> BTW the reason for this is that the {{launch_gateway}} function used to start 
> the JVM does not take any parameters, and the only place where it reads 
> arguments for Spark is that env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf

2016-09-08 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475242#comment-15475242
 ] 

Bryan Cutler commented on SPARK-17387:
--

[~vanzin] you said if you use PySpark you could get the correct "4g" memory 
option, but I was only able to do this through the command line like this

{noformat}
$> bin/pyspark --conf spark.driver.memory=4g
{noformat}

Is that what you meant?  I think just adding the command-line confs to 
configure the JVM through a plain Python shell would be a simple fix, and still 
be in line with how the Scala spark-shell works too.

> Creating SparkContext() from python without spark-submit ignores user conf
> --
>
> Key: SPARK-17387
> URL: https://issues.apache.org/jira/browse/SPARK-17387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Consider the following scenario: user runs a python application not through 
> spark-submit, but by adding the pyspark module and manually creating a Spark 
> context. Kinda like this:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext
> >>> from pyspark import SparkConf
> >>> conf = SparkConf().set("spark.driver.memory", "4g")
> >>> sc = SparkContext(conf=conf)
> {noformat}
> If you look at the JVM launched by the pyspark code, it ignores the user's 
> configuration:
> {noformat}
> $ ps ax | grep $(pgrep -f SparkSubmit)
> 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g 
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
> {noformat}
> Note the "1g" of memory. If instead you use "pyspark", you get the correct 
> "4g" in the JVM.
> This also affects other configs; for example, you can't really add jars to 
> the driver's classpath using "spark.jars".
> You can work around this by setting the undocumented env variable Spark 
> itself uses:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf 
> >>> spark.driver.memory=4g"
> >>> from pyspark import SparkContext
> >>> sc = SparkContext()
> {noformat}
> But it would be nicer if the configs were automatically propagated.
> BTW the reason for this is that the {{launch_gateway}} function used to start 
> the JVM does not take any parameters, and the only place where it reads 
> arguments for Spark is that env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-08 Thread Ravi Somepalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Somepalli updated SPARK-17458:
---
Description: 
When using pivot with multiple aggregations we need to alias the aggregates to 
avoid special characters in the column names, but the alias is not honored: 

df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show

||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
|small|5.5|two|2.3335|two|
|large|5.5|two|2.0|one|

Expected Output
||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
|small|5.5|two|2.3335|two|
|large|5.5|two|2.0|one|


One approach to fix this issue is to change the class 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala 
and change the outputName method inside 
{code}
object ResolvePivot extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
{code}


{code}
def outputName(value: Literal, aggregate: Expression): String = {
  val suffix = aggregate match {
    case n: NamedExpression => aggregate.asInstanceOf[NamedExpression].name
    case _ => aggregate.sql
  }
  if (singleAgg) value.toString else value + "_" + suffix
}
{code}

Version : 2.0.0
{code}
def outputName(value: Literal, aggregate: Expression): String = {
  if (singleAgg) value.toString else value + "_" + aggregate.sql
}
{code}

  was:
When using pivot and multiple aggregations we need to alias to avoid special 
characters, but alias does not help because 

df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show

||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS 
`COLD` || foo_max(`B`) AS `COLB` ||
|small|   5.5|   two|2.3335|   
two|
|large|   5.5|   two|   2.0|
   one|

Expected Output
||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
|small|   5.5|   two|2.3335|   
two|
|large|   5.5|   two|   2.0|
   one|


One approach you can fix this issue is to change the class
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 and change the outputName method in 
{code}
object ResolvePivot extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
{code}


{code}
def outputName(value: Literal, aggregate: Expression): String = {
  val suffix = aggregate match {
 case n: NamedExpression => 
aggregate.asInstanceOf[NamedExpression].name
 case _ => aggregate.sql
   }
  if (singleAgg) value.toString else value + "_" + suffix
}
{code}

Version : 2.0.0
{code}
def outputName(value: Literal, aggregate: Expression): String = {
  if (singleAgg) value.toString else value + "_" + aggregate.sql
}
{code}


> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>
> When using pivot and multiple aggregations we need to alias to avoid special 
> characters, but alias does not help because 
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|5.5|two|2.3335|two|
> |large|5.5|two|2.0|one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|5.5|two|2.3335|two|
> |large|5.5|two|2.0|one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val 

[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-08 Thread Ravi Somepalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Somepalli updated SPARK-17458:
---
Description: 
When using pivot with multiple aggregations we need to alias the aggregates to 
avoid special characters in the column names, but the alias is not honored: 

df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show

||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
|small|5.5|two|2.3335|two|
|large|5.5|two|2.0|one|

Expected Output
||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
|small|5.5|two|2.3335|two|
|large|5.5|two|2.0|one|


One approach to fix this issue is to change the class 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala 
and change the outputName method inside 
{code}
object ResolvePivot extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
{code}


{code}
def outputName(value: Literal, aggregate: Expression): String = {
  val suffix = aggregate match {
 case n: NamedExpression => 
aggregate.asInstanceOf[NamedExpression].name
 case _ => aggregate.sql
   }
  if (singleAgg) value.toString else value + "_" + suffix
}
{code}

Version : 2.0.0
{code}
def outputName(value: Literal, aggregate: Expression): String = {
  if (singleAgg) value.toString else value + "_" + aggregate.sql
}
{code}

  was:
When using pivot and multiple aggregations we need to alias to avoid special 
characters, but alias does not help because 

df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show

||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS 
`COLD` || foo_max(`B`) AS `COLB` ||
|small|   5.5|   two|2.3335|   
two|
|large|   5.5|   two|   2.0|
   one|

Expected Output
||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
|small|   5.5|   two|2.3335|   
two|
|large|   5.5|   two|   2.0|
   one|


One approach you can fix this issue is to change the class
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 and change the outputName method in 
{code}
object ResolvePivot extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
{code}

{code}
def outputName(value: Literal, aggregate: Expression): String = {
  val suffix = aggregate match {
    case n: NamedExpression => n.name
    case _ => aggregate.sql
  }
  if (singleAgg) value.toString else value + "_" + suffix
}
{code}



> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>
> When using pivot with multiple aggregations we need to alias the aggregates to 
> avoid special characters in the generated column names, but the alias is not 
> honored:
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|5.5|two|2.3335|two|
> |large|5.5|two|2.0|one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|5.5|two|2.3335|two|
> |large|5.5|two|2.0|one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>  case n: NamedExpression => 
> aggregate.asInstanceOf[NamedExpression].name
>  case _ => aggregate.sql
>}
> 

[jira] [Created] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-08 Thread Ravi Somepalli (JIRA)
Ravi Somepalli created SPARK-17458:
--

 Summary: Alias specified for aggregates in a pivot are not honored
 Key: SPARK-17458
 URL: https://issues.apache.org/jira/browse/SPARK-17458
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Ravi Somepalli


When using pivot with multiple aggregations we need to alias the aggregates to 
avoid special characters in the generated column names, but the alias is not 
honored:

df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show

||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
|small|5.5|two|2.3335|two|
|large|5.5|two|2.0|one|

Expected Output
||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
|small|5.5|two|2.3335|two|
|large|5.5|two|2.0|one|


One approach to fix this issue is to change 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala 
and modify the outputName method in
{code}
object ResolvePivot extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
{code}

{code}
def outputName(value: Literal, aggregate: Expression): String = {
  val suffix = aggregate match {
    case n: NamedExpression => n.name
    case _ => aggregate.sql
  }
  if (singleAgg) value.toString else value + "_" + suffix
}
{code}
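
For reference, a minimal spark-shell sketch that reproduces the reported naming 
behavior. The sample rows are illustrative assumptions (the reporter's actual 
data is not part of this report); only the column names A/B/C/D and the 
groupBy/pivot/agg call mirror the description above.

{code}
// Spark 2.0.0 spark-shell; `spark` is the SparkSession provided by the shell.
import spark.implicits._
import org.apache.spark.sql.functions.{avg, max}

// Illustrative rows only: A is pivoted, C is the group key.
val df = Seq(
  ("foo", "one", "small", 1.0),
  ("foo", "two", "small", 3.0),
  ("bar", "two", "small", 5.5),
  ("foo", "one", "large", 2.0),
  ("bar", "one", "large", 4.0),
  ("bar", "two", "large", 7.0)
).toDF("A", "B", "C", "D")

// The COLD/COLB aliases are dropped: the generated columns come out as
// bar_avg(`D`) AS `COLD` etc. instead of bar_COLD.
df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show()
{code}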




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2016-09-08 Thread Nic Eggert (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Eggert updated SPARK-17455:
---
Priority: Minor  (was: Major)

> IsotonicRegression takes non-polynomial time for some inputs
> 
>
> Key: SPARK-17455
> URL: https://issues.apache.org/jira/browse/SPARK-17455
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
>Reporter: Nic Eggert
>Priority: Minor
>
> The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently 
> in MLlib can take O(N!) time for certain inputs, when it should have 
> worst-case complexity of O(N^2).
> To reproduce this, I pulled the private method poolAdjacentViolators out of 
> mllib.regression.IsotonicRegression and into a benchmarking harness.
> Given this input
> {code}
> val x = (1 to length).toArray.map(_.toDouble)
> val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 
> else yi}
> val w = Array.fill(length)(1d)
> val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, 
> x), w) => (y, x, w)}
> {code}
> I vary the length of the input to get these timings:
> || Input Length || Time (us) ||
> | 100 | 1.35 |
> | 200 | 3.14 | 
> | 400 | 116.10 |
> | 800 | 2134225.90 |
> (tests were performed using 
> https://github.com/sirthias/scala-benchmarking-template)
> I can also confirm that I run into this issue on a real dataset I'm working 
> on when trying to calibrate random forest probability output. Some partitions 
> take > 12 hours to run. This isn't a skew issue, since the largest partitions 
> finish in minutes. I can only assume that some partitions cause something 
> approaching this worst-case complexity.
> I'm working on a patch that borrows the implementation that is used in 
> scikit-learn and the R "iso" package, both of which handle this particular 
> input in linear time and are quadratic in the worst case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17455:


Assignee: (was: Apache Spark)

> IsotonicRegression takes non-polynomial time for some inputs
> 
>
> Key: SPARK-17455
> URL: https://issues.apache.org/jira/browse/SPARK-17455
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
>Reporter: Nic Eggert
>
> The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently 
> in MLlib can take O(N!) time for certain inputs, when it should have 
> worst-case complexity of O(N^2).
> To reproduce this, I pulled the private method poolAdjacentViolators out of 
> mllib.regression.IsotonicRegression and into a benchmarking harness.
> Given this input
> {code}
> val x = (1 to length).toArray.map(_.toDouble)
> val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 
> else yi}
> val w = Array.fill(length)(1d)
> val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, 
> x), w) => (y, x, w)}
> {code}
> I vary the length of the input to get these timings:
> || Input Length || Time (us) ||
> | 100 | 1.35 |
> | 200 | 3.14 | 
> | 400 | 116.10 |
> | 800 | 2134225.90 |
> (tests were performed using 
> https://github.com/sirthias/scala-benchmarking-template)
> I can also confirm that I run into this issue on a real dataset I'm working 
> on when trying to calibrate random forest probability output. Some partitions 
> take > 12 hours to run. This isn't a skew issue, since the largest partitions 
> finish in minutes. I can only assume that some partitions cause something 
> approaching this worst-case complexity.
> I'm working on a patch that borrows the implementation that is used in 
> scikit-learn and the R "iso" package, both of which handle this particular 
> input in linear time and are quadratic in the worst case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17455:


Assignee: Apache Spark

> IsotonicRegression takes non-polynomial time for some inputs
> 
>
> Key: SPARK-17455
> URL: https://issues.apache.org/jira/browse/SPARK-17455
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
>Reporter: Nic Eggert
>Assignee: Apache Spark
>
> The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently 
> in MLlib can take O(N!) time for certain inputs, when it should have 
> worst-case complexity of O(N^2).
> To reproduce this, I pulled the private method poolAdjacentViolators out of 
> mllib.regression.IsotonicRegression and into a benchmarking harness.
> Given this input
> {code}
> val x = (1 to length).toArray.map(_.toDouble)
> val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 
> else yi}
> val w = Array.fill(length)(1d)
> val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, 
> x), w) => (y, x, w)}
> {code}
> I vary the length of the input to get these timings:
> || Input Length || Time (us) ||
> | 100 | 1.35 |
> | 200 | 3.14 | 
> | 400 | 116.10 |
> | 800 | 2134225.90 |
> (tests were performed using 
> https://github.com/sirthias/scala-benchmarking-template)
> I can also confirm that I run into this issue on a real dataset I'm working 
> on when trying to calibrate random forest probability output. Some partitions 
> take > 12 hours to run. This isn't a skew issue, since the largest partitions 
> finish in minutes. I can only assume that some partitions cause something 
> approaching this worst-case complexity.
> I'm working on a patch that borrows the implementation that is used in 
> scikit-learn and the R "iso" package, both of which handle this particular 
> input in linear time and are quadratic in the worst case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf

2016-09-08 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475156#comment-15475156
 ] 

Ryan Blue commented on SPARK-17302:
---

In 1.6.x, Spark pulled session config for Hive from a {{HiveConf}}; for example, 
[Spark 
respects|https://github.com/apache/spark/blob/v1.6.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L167]
 {{hive.exec.dynamic.partition.mode}}. This could be set in hive-site.xml or in 
spark-defaults.conf, though the latter had to use {{spark.hadoop...}} to set 
it. Now that the {{SQLConf}} is used instead of {{HiveConf}}, you lose both of 
those methods of setting Hive configuration variables. Using 
{{spark.conf.set("key", "value")}} works for users, but hive-site.xml defaults 
are no longer respected and an administrator can no longer set default values.
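
To make the two configuration paths concrete, here is a small sketch; the 
property name is the one discussed above, and the 1.6 route is the spark.hadoop 
prefix mentioned in the previous paragraph.

{code}
// Spark 1.6.x: picked up from hive-site.xml, or from spark-defaults.conf via the
// spark.hadoop prefix, e.g.
//   spark.hadoop.hive.exec.dynamic.partition.mode  nonstrict

// Spark 2.0.x: only a runtime, per-session setting is honored, e.g. from user code:
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
// hive-site.xml defaults and the spark.hadoop.* route no longer reach the SQLConf
// that the Hive write path reads, so admins cannot set cluster-wide defaults.
{code}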

> Cannot set non-Spark SQL session variables in hive-site.xml, 
> spark-defaults.conf, or using --conf
> -
>
> Key: SPARK-17302
> URL: https://issues.apache.org/jira/browse/SPARK-17302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>
> When configuration changed for 2.0 to the new SparkSession structure, Spark 
> stopped using Hive's internal HiveConf for session state and now uses 
> HiveSessionState and an associated SQLConf. Now, session options like 
> hive.exec.compress.output and hive.exec.dynamic.partition.mode are pulled 
> from this SQLConf. This doesn't include session properties from hive-site.xml 
> (including hive.exec.compress.output), and no longer contains Spark-specific 
> overrides from spark-defaults.conf that used the spark.hadoop.hive... pattern.
> Also, setting these variables on the command-line no longer works because 
> settings must start with "spark.".
> Is there a recommended way to set Hive session properties?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475158#comment-15475158
 ] 

Apache Spark commented on SPARK-17455:
--

User 'neggert' has created a pull request for this issue:
https://github.com/apache/spark/pull/15018

> IsotonicRegression takes non-polynomial time for some inputs
> 
>
> Key: SPARK-17455
> URL: https://issues.apache.org/jira/browse/SPARK-17455
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
>Reporter: Nic Eggert
>
> The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently 
> in MLlib can take O(N!) time for certain inputs, when it should have 
> worst-case complexity of O(N^2).
> To reproduce this, I pulled the private method poolAdjacentViolators out of 
> mllib.regression.IsotonicRegression and into a benchmarking harness.
> Given this input
> {code}
> val x = (1 to length).toArray.map(_.toDouble)
> val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 
> else yi}
> val w = Array.fill(length)(1d)
> val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, 
> x), w) => (y, x, w)}
> {code}
> I vary the length of the input to get these timings:
> || Input Length || Time (us) ||
> | 100 | 1.35 |
> | 200 | 3.14 | 
> | 400 | 116.10 |
> | 800 | 2134225.90 |
> (tests were performed using 
> https://github.com/sirthias/scala-benchmarking-template)
> I can also confirm that I run into this issue on a real dataset I'm working 
> on when trying to calibrate random forest probability output. Some partitions 
> take > 12 hours to run. This isn't a skew issue, since the largest partitions 
> finish in minutes. I can only assume that some partitions cause something 
> approaching this worst-case complexity.
> I'm working on a patch that borrows the implementation that is used in 
> scikit-learn and the R "iso" package, both of which handle this particular 
> input in linear time and are quadratic in the worst case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17457) Spark SQL shows poor performance for group by and sort by on multiple columns

2016-09-08 Thread Sabyasachi Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sabyasachi Nayak updated SPARK-17457:
-
Description: 
In one of our use cases, a Hive query run with Tez takes 45 minutes, but the 
same query run through Spark SQL using HiveContext takes more than 2 hours. 
The query has no joins, only group by and sort by on multiple columns.

spark-submit --class DataLoadingSpark --master yarn --deploy-mode client 
--num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores 5  
--conf spark.yarn.executor.memoryOverhead=2048 --conf 
spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 
--conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf 
--conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m 
-Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio  
DataLoadingSpark.jar --inputFile basket_txn.

Spark UI shows
Input is 500+ GB and Shuffle write is also 500+GB

Spark version - 1.4.0
HDP 2.3.2.0-2950
50 node cluster 1100 Vcores

  was:
In one of our use cases, a Hive query run with Tez takes 45 minutes, but the 
same query run through Spark SQL using HiveContext takes more than 2 hours. 
The query has no joins, only group by and sort by on multiple columns.

spark-submit --class DataLoadingSpark --master yarn --deploy-mode client 
--num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores 5  
--conf spark.yarn.executor.memoryOverhead=2048 --conf 
spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 
--conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf 
--conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m 
-Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio  
DataLoadingSpark.jar --inputFile basket_txn.

Spark version - 1.4.0
HDP 2.3.2.0-2950
50 node cluster 1100 Vcores


> Spark SQL shows poor performance for group by and sort by on multiple columns 
> --
>
> Key: SPARK-17457
> URL: https://issues.apache.org/jira/browse/SPARK-17457
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.4.0
>Reporter: Sabyasachi Nayak
>
> In one of our use cases, a Hive query run with Tez takes 45 minutes, but the 
> same query run through Spark SQL using HiveContext takes more than 2 hours. 
> The query has no joins, only group by and sort by on multiple columns.
> spark-submit --class DataLoadingSpark --master yarn --deploy-mode client 
> --num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores 
> 5  --conf spark.yarn.executor.memoryOverhead=2048 --conf 
> spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 
> --conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf 
> --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m 
> -Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio  
> DataLoadingSpark.jar --inputFile basket_txn.
> Spark UI shows
> Input is 500+ GB and Shuffle write is also 500+GB
> Spark version - 1.4.0
> HDP 2.3.2.0-2950
> 50 node cluster 1100 Vcores



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17457) Spark SQL shows poor performance for group by and sort by on multiple columns

2016-09-08 Thread Sabyasachi Nayak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sabyasachi Nayak updated SPARK-17457:
-
Summary: Spark SQL shows poor performance for group by and sort by on 
multiple columns   (was: Spark SQL shows poor performance for group by on 
multiple columns )

> Spark SQL shows poor performance for group by and sort by on multiple columns 
> --
>
> Key: SPARK-17457
> URL: https://issues.apache.org/jira/browse/SPARK-17457
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.4.0
>Reporter: Sabyasachi Nayak
>
> In one of our use cases, a Hive query run with Tez takes 45 minutes, but the 
> same query run through Spark SQL using HiveContext takes more than 2 hours. 
> The query has no joins, only group by and sort by on multiple columns.
> spark-submit --class DataLoadingSpark --master yarn --deploy-mode client 
> --num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores 
> 5  --conf spark.yarn.executor.memoryOverhead=2048 --conf 
> spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 
> --conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf 
> --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m 
> -Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio  
> DataLoadingSpark.jar --inputFile basket_txn.
> Spark version - 1.4.0
> HDP 2.3.2.0-2950
> 50 node cluster 1100 Vcores



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17457) Spark SQL shows poor performance for group by on multiple columns

2016-09-08 Thread Sabyasachi Nayak (JIRA)
Sabyasachi Nayak created SPARK-17457:


 Summary: Spark SQL shows poor performance for group by on multiple 
columns 
 Key: SPARK-17457
 URL: https://issues.apache.org/jira/browse/SPARK-17457
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.0
Reporter: Sabyasachi Nayak


In one of our use cases, a Hive query run with Tez takes 45 minutes, but the 
same query run through Spark SQL using HiveContext takes more than 2 hours. 
The query has no joins, only group by and sort by on multiple columns.

spark-submit --class DataLoadingSpark --master yarn --deploy-mode client 
--num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores 5  
--conf spark.yarn.executor.memoryOverhead=2048 --conf 
spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 
--conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf 
--conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m 
-Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio  
DataLoadingSpark.jar --inputFile basket_txn.

Spark version - 1.4.0
HDP 2.3.2.0-2950
50 node cluster 1100 Vcores
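
For context, the general shape of such a job through HiveContext is sketched 
below; the table and column names are placeholders, since the actual query is 
not included in this report.

{code}
// Spark 1.4.0 style; `sc` is an existing SparkContext. Names are illustrative only.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT col_a, col_b, col_c, COUNT(*) AS cnt
  FROM basket_txn
  GROUP BY col_a, col_b, col_c
  SORT BY col_a, col_b, col_c
""").write.saveAsTable("basket_txn_agg")   // placeholder output table
{code}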



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17405:
---
Assignee: Eric Liang

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Eric Liang
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 
> org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:783)
> at 
> 

[jira] [Assigned] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17405:


Assignee: (was: Apache Spark)

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 
> org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:783)
> at 
> 

[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475040#comment-15475040
 ] 

Apache Spark commented on SPARK-17405:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15016

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 
> 

[jira] [Assigned] (SPARK-17456) Utility for parsing Spark versions

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17456:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> Utility for parsing Spark versions
> --
>
> Key: SPARK-17456
> URL: https://issues.apache.org/jira/browse/SPARK-17456
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> There are many hacks within Spark's codebase to identify and compare Spark 
> versions.  We should add a simple utility to standardize these code paths, 
> especially since there have been mistakes made in the past.  This will let us 
> add unit tests as well.  This initial patch will only add methods for 
> extracting major and minor versions as Int types in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17456) Utility for parsing Spark versions

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475041#comment-15475041
 ] 

Apache Spark commented on SPARK-17456:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/15017

> Utility for parsing Spark versions
> --
>
> Key: SPARK-17456
> URL: https://issues.apache.org/jira/browse/SPARK-17456
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are many hacks within Spark's codebase to identify and compare Spark 
> versions.  We should add a simple utility to standardize these code paths, 
> especially since there have been mistakes made in the past.  This will let us 
> add unit tests as well.  This initial patch will only add methods for 
> extracting major and minor versions as Int types in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17405:


Assignee: Apache Spark

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 
> org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:783)
> at 
> 

[jira] [Assigned] (SPARK-17456) Utility for parsing Spark versions

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17456:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> Utility for parsing Spark versions
> --
>
> Key: SPARK-17456
> URL: https://issues.apache.org/jira/browse/SPARK-17456
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are many hacks within Spark's codebase to identify and compare Spark 
> versions.  We should add a simple utility to standardize these code paths, 
> especially since there have been mistakes made in the past.  This will let us 
> add unit tests as well.  This initial patch will only add methods for 
> extracting major and minor versions as Int types in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12452) Add exception details to TaskCompletionListener/TaskContext

2016-09-08 Thread Neelesh Shastry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475027#comment-15475027
 ] 

Neelesh Shastry commented on SPARK-12452:
-

This was originally filed for 1.5.2, which does not have TaskFailureListener 
(it was added in 2.0.0, I believe).

> Add exception details to TaskCompletionListener/TaskContext
> ---
>
> Key: SPARK-12452
> URL: https://issues.apache.org/jira/browse/SPARK-12452
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Neelesh Shastry
>Priority: Minor
>
> TaskCompletionListeners are called without success/failure details. 
> If we change this
> {code}
> trait TaskCompletionListener extends EventListener {
>   def onTaskCompletion(context: TaskContext)
> }
> class TaskContextImpl {
>   private[spark] def markTaskCompleted(throwable: Option[Throwable]): Unit = {
>     // ... existing completion handling ...
>     listener.onTaskCompletion(this, throwable)
>   }
> }
> {code}
> to something like
> {code}
> trait TaskCompletionListener extends EventListener {
>   def onTaskCompletion(context: TaskContext, throwable:Option[Throwable]=None)
> }
> {code}
> .. and  in Task.scala
> {code}
> var throwable: Option[Throwable] = None
> val result = try {
>   runTask(context)
> } catch {
>   case t: Throwable =>
>     throwable = Some(t)   // was `throwable = t`, which does not type-check
>     throw t
> } finally {
>   context.markTaskCompleted(throwable)
>   TaskContext.unset()
> }
> {code}
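
A hedged sketch of how calling code could consume the proposed two-argument 
callback; this uses the signature proposed above, not the current API, and 
cleanupResources() is a placeholder.

{code}
// `ctx` is the TaskContext of the running task.
ctx.addTaskCompletionListener(new TaskCompletionListener {
  override def onTaskCompletion(context: TaskContext, throwable: Option[Throwable]): Unit = {
    // Release resources either way, but surface the failure cause when present.
    throwable.foreach(t => println(s"task failed: ${t.getMessage}"))
    cleanupResources()
  }
})
{code}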



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16525) Enable Row Based HashMap in HashAggregateExec

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475023#comment-15475023
 ] 

Apache Spark commented on SPARK-16525:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15016

> Enable Row Based HashMap in HashAggregateExec
> -
>
> Key: SPARK-16525
> URL: https://issues.apache.org/jira/browse/SPARK-16525
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Qifan Pu
> Fix For: 2.1.0
>
>
> Allow `RowBasedHashMapGenerator` to be used in `HashAggregateExec`, so that 
> we can turn on codegen for `RowBasedHashMap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17446) no total size for data source tables in InMemoryCatalog

2016-09-08 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474995#comment-15474995
 ] 

Zhenhua Wang commented on SPARK-17446:
--

Ok, I've added the description.
Thanks.


> no total size for data source tables in InMemoryCatalog
> ---
>
> Key: SPARK-17446
> URL: https://issues.apache.org/jira/browse/SPARK-17446
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>
> For a data source table in InMemoryCatalog, its 
> catalogTable.storage.locationUri is None, so the total size can't be calculated. 
> But we can use the path parameter in catalogTable.storage.properties to 
> calculate the size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17446) no total size for data source tables in InMemoryCatalog

2016-09-08 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17446:
-
Description: For a data source table in InMemoryCatalog, its 
catalogTable.storage.locationUri is None, so the total size can't be calculated. 
But we can use the path parameter in catalogTable.storage.properties to 
calculate the size.
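
A rough sketch of that idea; the field and property names follow the 
description above, but the exact API is an assumption and the real patch may 
differ.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Fall back to the "path" storage property when locationUri is None.
def estimateTotalSize(table: CatalogTable, hadoopConf: Configuration): Option[Long] = {
  val location = table.storage.locationUri.map(_.toString)
    .orElse(table.storage.properties.get("path"))
  location.map { loc =>
    val path = new Path(loc)
    val fs = path.getFileSystem(hadoopConf)
    fs.getContentSummary(path).getLength   // total bytes under the table directory
  }
}
{code}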

> no total size for data source tables in InMemoryCatalog
> ---
>
> Key: SPARK-17446
> URL: https://issues.apache.org/jira/browse/SPARK-17446
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>
> For a data source table in InMemoryCatalog, its 
> catalogTable.storage.locationUri is None, so the total size can't be calculated. 
> But we can use the path parameter in catalogTable.storage.properties to 
> calculate the size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17456) Utility for parsing Spark versions

2016-09-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474917#comment-15474917
 ] 

Joseph K. Bradley commented on SPARK-17456:
---

Linking a JIRA which will require this

> Utility for parsing Spark versions
> --
>
> Key: SPARK-17456
> URL: https://issues.apache.org/jira/browse/SPARK-17456
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are many hacks within Spark's codebase to identify and compare Spark 
> versions.  We should add a simple utility to standardize these code paths, 
> especially since there have been mistakes made in the past.  This will let us 
> add unit tests as well.  This initial patch will only add methods for 
> extracting major and minor versions as Int types in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17456) Utility for parsing Spark versions

2016-09-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-17456:
-

 Summary: Utility for parsing Spark versions
 Key: SPARK-17456
 URL: https://issues.apache.org/jira/browse/SPARK-17456
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


There are many hacks within Spark's codebase to identify and compare Spark 
versions.  We should add a simple utility to standardize these code paths, 
especially since there have been mistakes made in the past.  This will let us 
add unit tests as well.  This initial patch will only add methods for 
extracting major and minor versions as Int types in Scala.
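
As a rough illustration of the kind of helper described (a sketch only, not 
the actual patch):

{code}
// Extract (major, minor) from version strings like "2.0.1" or "2.1.0-SNAPSHOT".
def majorMinorVersion(sparkVersion: String): (Int, Int) = {
  val VersionPattern = """^(\d+)\.(\d+)([.\-].*)?$""".r
  sparkVersion match {
    case VersionPattern(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(s"Cannot extract major/minor version from: $sparkVersion")
  }
}
{code}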



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11035) Launcher: allow apps to be launched in-process

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11035:


Assignee: (was: Apache Spark)

> Launcher: allow apps to be launched in-process
> --
>
> Key: SPARK-11035
> URL: https://issues.apache.org/jira/browse/SPARK-11035
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> The launcher library is currently restricted to launching apps as child 
> processes. That is fine for a lot of cases, especially if the app is running 
> in client mode.
> But in certain cases, especially launching in cluster mode, it's more 
> efficient to avoid launching a new process, since that process won't be doing 
> much.
> We should add support for launching apps in process, even if restricted to 
> cluster mode at first. This will require some rework of the launch paths to 
> avoid using system properties to propagate configuration.
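
For context, current child-process usage of the launcher library looks roughly 
like the sketch below (paths and class names are placeholders); the request 
here is for an alternative that avoids forking the extra JVM.

{code}
import org.apache.spark.launcher.SparkLauncher

// Today this always forks a separate spark-submit JVM, even in cluster mode
// where that child process does little beyond submitting the application.
val process = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // placeholder
  .setMainClass("com.example.MyApp")    // placeholder
  .setMaster("yarn")
  .setDeployMode("cluster")
  .launch()
process.waitFor()
{code}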



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11035) Launcher: allow apps to be launched in-process

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11035:


Assignee: Apache Spark

> Launcher: allow apps to be launched in-process
> --
>
> Key: SPARK-11035
> URL: https://issues.apache.org/jira/browse/SPARK-11035
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> The launcher library is currently restricted to launching apps as child 
> processes. That is fine for a lot of cases, especially if the app is running 
> in client mode.
> But in certain cases, especially launching in cluster mode, it's more 
> efficient to avoid launching a new process, since that process won't be doing 
> much.
> We should add support for launching apps in process, even if restricted to 
> cluster mode at first. This will require some rework of the launch paths to 
> avoid using system properties to propagate configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11035) Launcher: allow apps to be launched in-process

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474872#comment-15474872
 ] 

Apache Spark commented on SPARK-11035:
--

User 'kishorvpatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15009

> Launcher: allow apps to be launched in-process
> --
>
> Key: SPARK-11035
> URL: https://issues.apache.org/jira/browse/SPARK-11035
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> The launcher library is currently restricted to launching apps as child 
> processes. That is fine for a lot of cases, especially if the app is running 
> in client mode.
> But in certain cases, especially launching in cluster mode, it's more 
> efficient to avoid launching a new process, since that process won't be doing 
> much.
> We should add support for launching apps in process, even if restricted to 
> cluster mode at first. This will require some rework of the launch paths to 
> avoid using system properties to propagate configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2016-09-08 Thread Nic Eggert (JIRA)
Nic Eggert created SPARK-17455:
--

 Summary: IsotonicRegression takes non-polynomial time for some 
inputs
 Key: SPARK-17455
 URL: https://issues.apache.org/jira/browse/SPARK-17455
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.4.1, 1.3.1
Reporter: Nic Eggert


The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently in 
MLlib can take O(N!) time for certain inputs, when it should have worst-case 
complexity of O(N^2).

To reproduce this, I pulled the private method poolAdjacentViolators out of 
mllib.regression.IsotonicRegression and into a benchmarking harness.

Given this input
{code}
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 
else yi}
val w = Array.fill(length)(1d)

val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, x), 
w) => (y, x, w)}
{code}

I vary the length of the input to get these timings:

|| Input Length || Time (us) ||
| 100 | 1.35 |
| 200 | 3.14 | 
| 400 | 116.10 |
| 800 | 2134225.90 |

(tests were performed using 
https://github.com/sirthias/scala-benchmarking-template)

I can also confirm that I run into this issue on a real dataset I'm working on 
when trying to calibrate random forest probability output. Some partitions take 
> 12 hours to run. This isn't a skew issue, since the largest partitions finish 
in minutes. I can only assume that some partitions cause something approaching 
this worst-case complexity.

I'm working on a patch that borrows the implementation that is used in 
scikit-learn and the R "iso" package, both of which handle this particular 
input in linear time and are quadratic in the worst case.
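
For reference, a rough sketch (not the MLlib implementation) of the stack-based PAVA formulation those libraries use: each new point is merged backwards into previously pooled blocks while it violates monotonicity, so inputs like the one above are handled in linear time.

{code}
// Sketch of pool-adjacent-violators with backward merging.
// Input: (label, feature, weight) triples already sorted by feature.
def pava(input: Array[(Double, Double, Double)]): Array[Double] = {
  // Each block keeps (weighted label sum, total weight, number of points).
  val blocks = scala.collection.mutable.ArrayBuffer.empty[(Double, Double, Int)]
  for ((y, _, w) <- input) {
    var (sum, weight, count) = (y * w, w, 1)
    // Merge backwards while the previous block's mean exceeds the current mean.
    while (blocks.nonEmpty && blocks.last._1 / blocks.last._2 > sum / weight) {
      val (s, wt, c) = blocks.remove(blocks.length - 1)
      sum += s; weight += wt; count += c
    }
    blocks += ((sum, weight, count))
  }
  // Expand each pooled block back into per-point fitted values.
  blocks.flatMap { case (s, wt, c) => Array.fill(c)(s / wt) }.toArray
}
{code}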



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474688#comment-15474688
 ] 

Apache Spark commented on SPARK-16445:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/15015

> Multilayer Perceptron Classifier wrapper in SparkR
> --
>
> Key: SPARK-16445
> URL: https://issues.apache.org/jira/browse/SPARK-16445
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xin Ren
> Fix For: 2.1.0
>
>
> Follow instructions in SPARK-16442 and implement multilayer perceptron 
> classifier wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17454) Add option to specify Mesos resource offer constraints

2016-09-08 Thread Chris Bannister (JIRA)
Chris Bannister created SPARK-17454:
---

 Summary: Add option to specify Mesos resource offer constraints
 Key: SPARK-17454
 URL: https://issues.apache.org/jira/browse/SPARK-17454
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.0.0
Reporter: Chris Bannister


Currently the driver will accept offers from Mesos that have enough RAM for 
the executor, until its max cores is reached. There is no way to control the 
required CPUs or disk for each executor; it would be very useful to be able to 
apply something similar to spark.mesos.constraints to resource offers instead 
of to attributes on the offer.
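
As a purely hypothetical sketch of what a resource-based offer filter could look like (the minimum-resource thresholds and the helper below are illustrative only, not an existing Spark API or config):

{code}
// Illustrative only: decide whether an offer's scalar resources meet
// hypothetical per-executor minimums for cpus, memory and disk.
case class OfferResources(cpus: Double, memMb: Double, diskMb: Double)

def satisfiesResourceConstraints(offer: OfferResources,
                                 minCpus: Double,
                                 minMemMb: Double,
                                 minDiskMb: Double): Boolean =
  offer.cpus >= minCpus && offer.memMb >= minMemMb && offer.diskMb >= minDiskMb

// Example: only accept offers with at least 4 cores, 8 GB RAM and 50 GB disk.
val accept = satisfiesResourceConstraints(OfferResources(8.0, 16384, 102400), 4.0, 8192, 51200)
{code}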



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-08 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474616#comment-15474616
 ] 

Josh Elser commented on SPARK-17445:


bq. I think one part you're missing, Josh, is that spark-packages.org is an 
index of packages from a wide variety of organizations, where anyone can submit 
a package.

Ah, that is true. I made no effort to understand the curation/policies behind 
the website (because I really don't care what the owners do -- it's their 
website/service). I think this is disjoint from my original concern though -- 
it may be de facto with respect to the available third-party packages for 
Apache Spark, but I still think the PMC could do a better job of clearly 
stating that it is not an "Apache Spark" service. I don't see a downside to my 
original suggestion (and it sounds like you agree).

If it comes down to it, point me at the website source, and I'm happy to submit 
a change which introduces a new page :)

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17453) Broadcast block already exists in MemoryStore

2016-09-08 Thread Chris Bannister (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bannister updated SPARK-17453:

Description: 
Whilst doing a broadcast join we reliably hit this exception; the code worked 
earlier on the 2.0.0 branch before release, and in 1.6. The data for the join 
comes from another RDD that is collected to a Set and then broadcast. This is 
run on a Mesos cluster.

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: java.lang.IllegalArgumentException: requirement 
failed: Block broadcast_17_piece0 is already present in the MemoryStore
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:72)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: requirement failed: Block 
broadcast_17_piece0 is already present in the MemoryStore
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:144)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutBytes$1.apply(BlockManager.scala:792)
at 

[jira] [Created] (SPARK-17453) Broadcast block already exists in MemoryStore

2016-09-08 Thread Chris Bannister (JIRA)
Chris Bannister created SPARK-17453:
---

 Summary: Broadcast block already exists in MemoryStore
 Key: SPARK-17453
 URL: https://issues.apache.org/jira/browse/SPARK-17453
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 2.0.0
Reporter: Chris Bannister


Whilst doing a broadcast join we reliably hit this exception; the code worked 
earlier on the 2.0.0 branch before release, and in 1.6. The data for the join 
comes from another RDD that is collected to a Set and then broadcast. This is 
run on a Mesos cluster.

Stacktrace is attached
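
For reproduction context, a minimal sketch (names and sizes are placeholders) of the pattern described above: collect one RDD to a Set on the driver, broadcast it, and use it in a map-side join/filter.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-repro"))

    // Small side: collected to the driver and broadcast to all executors.
    val keys = sc.parallelize(1 to 1000).collect().toSet
    val broadcastKeys = sc.broadcast(keys)

    // Large side: filtered against the broadcast set (a map-side "join").
    val joined = sc.parallelize(1 to 10000000).filter(x => broadcastKeys.value.contains(x))

    println(joined.count())
    sc.stop()
  }
}
{code}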



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474587#comment-15474587
 ] 

Felix Cheung commented on SPARK-17428:
--

Agree with the above. And to be clear, packrat still calls install.packages, so 
it is handled no differently with regard to the package directory (the lib 
parameter to install.packages) or permission/access:
https://github.com/rstudio/packrat/blob/master/R/install.R#L69

We are likely going to prefer having private packages under the application 
directory in the case of YARN, so they get cleaned up along with the 
application.

It seems like the original point of this JIRA is around private packages and 
installation/deployment - I think we would agree we could handle that (or 
SparkR on YARN already can do that).

My point, though, is that the benefit of such a package management system is 
really the exact version control it gives you.

But even then, building packages from source on worker machines could be 
problematic (this applies both to packrat and to direct calls to install.packages):
https://rstudio.github.io/packrat/limitations.html
- I'm not sure we should assume all worker machines in enterprises have a C 
compiler, or that the user running Spark has permission to build from source.

I don't know where we are at with PySpark, but I'd be very interested in seeing 
how that is resolved - I think both Python and R face similar constraints in 
terms of deployment/package building, versioning, heterogeneous machine 
architectures and so on.

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users have requirements to use third party R packages in 
> executors/workers, but SparkR can not satisfy this requirements elegantly. 
> For example, you should to mess with the IT/administrators of the cluster to 
> deploy these R packages on each executors/workers node which is very 
> inflexible.
> I think we should support third party R packages for SparkR users as what we 
> do for jar packages in the following two scenarios:
> 1, Users can install R packages from CRAN or custom CRAN-like repository for 
> each executors.
> 2, Users can load their local R packages and install them on each executors.
> To achieve this goal, the first thing is to make SparkR executors support 
> virtualenv like Python conda. I have investigated and found 
> packrat(http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R and 
> can isolate the dependent R packages in its own private package space. Then 
> SparkR users can install third party packages in the application 
> scope(destroy after the application exit) and don’t need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2016-09-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474557#comment-15474557
 ] 

Thomas Graves commented on SPARK-17321:
---

so there are 2 possible things here:

1) You are using YARN NM recovery. If this is the case, SPARK-14963 should 
prevent this problem, as the recovery path is supposed to be critical to the NM 
and the NM should not start if it is bad.

2) You aren't using NM recovery. If this is the case then you probably don't 
really care about the levelDB being saved, because you aren't expecting things 
to live across NM restarts. In that case, if people are having issues like this, 
perhaps we should make the code conditional on NM recovery or a Spark config.

Which case are you running?

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0
>Reporter: yunjiong zhao
>
> We run Spark on YARN; after enabling Spark dynamic allocation, we noticed some 
> Spark applications failed randomly due to YarnShuffleService.
> From the log I found
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use another good 
> disk if the first one is broken?
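
A hedged sketch of the idea in the description: scan yarn.nodemanager.local-dirs and pick the first usable directory instead of always taking the first entry (illustrative only, not the current YarnShuffleService code; the paths are placeholders).

{code}
import java.io.File

// Pick the first healthy directory from a comma-separated local-dirs setting.
def firstUsableLocalDir(localDirs: String): Option[File] =
  localDirs.split(",")
    .map(_.trim)
    .filter(_.nonEmpty)
    .map(new File(_))
    .find(dir => (dir.isDirectory || dir.mkdirs()) && dir.canWrite)

// Example with placeholder paths:
val chosen = firstUsableLocalDir("/data1/yarn/local,/data2/yarn/local")
{code}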



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-08 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474543#comment-15474543
 ] 

Matei Zaharia commented on SPARK-17445:
---

I think one part you're missing, Josh, is that spark-packages.org *is* an index 
of packages from a wide variety of organizations, where anyone can submit a 
package. Have you looked through it? Maybe there is some concern about which 
third-party index we highlight on the site, but AFAIK there are no other 
third-party package indexes. Nonetheless it would make sense to have a stable 
URL on the Spark homepage that lists them.

BTW, in the past, we also used a wiki page to track them: 
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects 
so we could just link to that. The spark-packages site provides some nicer 
functionality though such as letting anyone add a package with just a GitHub 
account, listing releases, etc.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17429) spark sql length(1) return error

2016-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474532#comment-15474532
 ] 

Apache Spark commented on SPARK-17429:
--

User 'cenyuhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/15014

> spark sql length(1) return error
> 
>
> Key: SPARK-17429
> URL: https://issues.apache.org/jira/browse/SPARK-17429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> select length(11);
> select length(2.0);
> These SQL statements return errors, but Hive is OK.
> Error in query: cannot resolve 'length(11)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '11' is of int type.; 
> line 1 pos 14
> Error in query: cannot resolve 'length(2.0)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '2.0' is of double 
> type.; line 1 pos 14
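
A hedged workaround sketch for 1.6 (assuming a spark-shell session where sqlContext is predefined): cast the numeric argument to a string explicitly before calling length().

{code}
// Workaround: cast numeric literals to string so length() type-checks.
sqlContext.sql("SELECT length(CAST(11 AS STRING)), length(CAST(2.0 AS STRING))").show()
{code}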



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17429) spark sql length(1) return error

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17429:


Assignee: (was: Apache Spark)

> spark sql length(1) return error
> 
>
> Key: SPARK-17429
> URL: https://issues.apache.org/jira/browse/SPARK-17429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> select length(11);
> select length(2.0);
> These SQL statements return errors, but Hive is OK.
> Error in query: cannot resolve 'length(11)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '11' is of int type.; 
> line 1 pos 14
> Error in query: cannot resolve 'length(2.0)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '2.0' is of double 
> type.; line 1 pos 14



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17429) spark sql length(1) return error

2016-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17429:


Assignee: Apache Spark

> spark sql length(1) return error
> 
>
> Key: SPARK-17429
> URL: https://issues.apache.org/jira/browse/SPARK-17429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>Assignee: Apache Spark
>
> select length(11);
> select length(2.0);
> These SQL statements return errors, but Hive is OK.
> Error in query: cannot resolve 'length(11)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '11' is of int type.; 
> line 1 pos 14
> Error in query: cannot resolve 'length(2.0)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '2.0' is of double 
> type.; line 1 pos 14



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17450) spark sql rownumber OOM

2016-09-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474517#comment-15474517
 ] 

Herman van Hovell commented on SPARK-17450:
---

You could try. You would also have to add the follow-up by davies (that adds 
the spilling logic).

> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> Spark SQL will OOM when using row_number() over too many sorted records... 
> There will be only 1 task to handle all the records.
> This SQL groups by passenger_id; we have 100 million passengers.
> {code} 
>  SELECT
> passenger_id,
> total_order,
> (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 
> 670800 THEN 'V3' END) AS order_rank
>   FROM
>   (
> SELECT
>   passenger_id,
>   1 as total_order
> FROM table
> GROUP BY passenger_id
>   ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:104)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> physical plan:
> {code}
> == Physical Plan ==
> Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 
> <= 670800)) THEN V3 AS order_rank#1]
> +- Window [passenger_id#7L,total_order#0], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS 
> _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC]
>+- Sort [total_order#0 DESC], false, 0
>   +- TungstenExchange SinglePartition, None
>  +- Project [passenger_id#7L,total_order#0]
> +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L,total_order#0])
>+- TungstenExchange hashpartitioning(passenger_id#7L,1000), 
> None
>   +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L])
>  +- Project [passenger_id#7L]
> +- Filter product#9 IN (kuai,gulf)
>+- HiveTableScan [passenger_id#7L,product#9], 
> MetastoreRelation pbs_dw, dwv_order_whole_day, None,
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-17450) spark sql rownumber OOM

2016-09-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474518#comment-15474518
 ] 

Herman van Hovell commented on SPARK-17450:
---

You could try. You would also have to add the follow-up by davies (that adds 
the spilling logic).

> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> Spark SQL will OOM when using row_number() over too many sorted records... 
> There will be only 1 task to handle all the records.
> This SQL groups by passenger_id; we have 100 million passengers.
> {code} 
>  SELECT
> passenger_id,
> total_order,
> (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 
> 670800 THEN 'V3' END) AS order_rank
>   FROM
>   (
> SELECT
>   passenger_id,
>   1 as total_order
> FROM table
> GROUP BY passenger_id
>   ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:104)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> physical plan:
> {code}
> == Physical Plan ==
> Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 
> <= 670800)) THEN V3 AS order_rank#1]
> +- Window [passenger_id#7L,total_order#0], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS 
> _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC]
>+- Sort [total_order#0 DESC], false, 0
>   +- TungstenExchange SinglePartition, None
>  +- Project [passenger_id#7L,total_order#0]
> +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L,total_order#0])
>+- TungstenExchange hashpartitioning(passenger_id#7L,1000), 
> None
>   +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L])
>  +- Project [passenger_id#7L]
> +- Filter product#9 IN (kuai,gulf)
>+- HiveTableScan [passenger_id#7L,product#9], 
> MetastoreRelation pbs_dw, dwv_order_whole_day, None,
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (SPARK-17450) spark sql rownumber OOM

2016-09-08 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17450:
--
Comment: was deleted

(was: You could try. You would also have to add the follow-up by davies (that 
adds the spilling logic).)

> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> Spark SQL will OOM when using row_number() over too many sorted records... 
> There will be only 1 task to handle all the records.
> This SQL groups by passenger_id; we have 100 million passengers.
> {code} 
>  SELECT
> passenger_id,
> total_order,
> (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 
> 670800 THEN 'V3' END) AS order_rank
>   FROM
>   (
> SELECT
>   passenger_id,
>   1 as total_order
> FROM table
> GROUP BY passenger_id
>   ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:104)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> physical plan:
> {code}
> == Physical Plan ==
> Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 
> <= 670800)) THEN V3 AS order_rank#1]
> +- Window [passenger_id#7L,total_order#0], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS 
> _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC]
>+- Sort [total_order#0 DESC], false, 0
>   +- TungstenExchange SinglePartition, None
>  +- Project [passenger_id#7L,total_order#0]
> +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L,total_order#0])
>+- TungstenExchange hashpartitioning(passenger_id#7L,1000), 
> None
>   +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L])
>  +- Project [passenger_id#7L]
> +- Filter product#9 IN (kuai,gulf)
>+- HiveTableScan [passenger_id#7L,product#9], 
> MetastoreRelation pbs_dw, dwv_order_whole_day, None,
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (SPARK-17450) spark sql rownumber OOM

2016-09-08 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474430#comment-15474430
 ] 

cen yuhai edited comment on SPARK-17450 at 9/8/16 5:07 PM:
---

hi,herman, can i merge your pr for native spark window function? 
https://github.com/apache/spark/pull/9819 ?


was (Author: cenyuhai):
hi,herman, can i merge your pr for native spark window function? 
https://github.com/apache/spark/pull/9819 ???

> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> Spark SQL will OOM when using row_number() over too many sorted records... 
> There will be only 1 task to handle all the records.
> This SQL groups by passenger_id; we have 100 million passengers.
> {code} 
>  SELECT
> passenger_id,
> total_order,
> (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 
> 670800 THEN 'V3' END) AS order_rank
>   FROM
>   (
> SELECT
>   passenger_id,
>   1 as total_order
> FROM table
> GROUP BY passenger_id
>   ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:104)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> physical plan:
> {code}
> == Physical Plan ==
> Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 
> <= 670800)) THEN V3 AS order_rank#1]
> +- Window [passenger_id#7L,total_order#0], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS 
> _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC]
>+- Sort [total_order#0 DESC], false, 0
>   +- TungstenExchange SinglePartition, None
>  +- Project [passenger_id#7L,total_order#0]
> +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L,total_order#0])
>+- TungstenExchange hashpartitioning(passenger_id#7L,1000), 
> None
>   +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L])
>  +- Project [passenger_id#7L]
> +- Filter product#9 IN (kuai,gulf)
>+- 

[jira] [Commented] (SPARK-17450) spark sql rownumber OOM

2016-09-08 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474430#comment-15474430
 ] 

cen yuhai commented on SPARK-17450:
---

hi,herman, can i merge your pr for native spark window function? 
https://github.com/apache/spark/pull/9819 ???

> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> Spark SQL will OOM when using row_number() over too many sorted records... 
> There will be only 1 task to handle all the records.
> This SQL groups by passenger_id; we have 100 million passengers.
> {code} 
>  SELECT
> passenger_id,
> total_order,
> (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 
> 670800 THEN 'V3' END) AS order_rank
>   FROM
>   (
> SELECT
>   passenger_id,
>   1 as total_order
> FROM table
> GROUP BY passenger_id
>   ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:104)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> physical plan:
> {code}
> == Physical Plan ==
> Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 
> <= 670800)) THEN V3 AS order_rank#1]
> +- Window [passenger_id#7L,total_order#0], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS 
> _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC]
>+- Sort [total_order#0 DESC], false, 0
>   +- TungstenExchange SinglePartition, None
>  +- Project [passenger_id#7L,total_order#0]
> +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L,total_order#0])
>+- TungstenExchange hashpartitioning(passenger_id#7L,1000), 
> None
>   +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L])
>  +- Project [passenger_id#7L]
> +- Filter product#9 IN (kuai,gulf)
>+- HiveTableScan [passenger_id#7L,product#9], 
> MetastoreRelation pbs_dw, dwv_order_whole_day, None,
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-17443) SparkLauncher should allow stoppingApplication and need not rely on SparkSubmit binary

2016-09-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474386#comment-15474386
 ] 

Marcelo Vanzin commented on SPARK-17443:


The second bullet is actually SPARK-11035.

> SparkLauncher should allow stoppingApplication and need not rely on 
> SparkSubmit binary
> --
>
> Key: SPARK-17443
> URL: https://issues.apache.org/jira/browse/SPARK-17443
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>
> Oozie wants SparkLauncher to support the following things:
> - When the Oozie launcher is killed, the launched Spark application also gets 
> killed
> - SparkLauncher should not have to rely on the spark-submit bash script
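
For the first request, a minimal sketch using the existing launcher API (paths and class names are placeholders): hold on to the SparkAppHandle returned by startApplication() and stop the application from a shutdown hook when the launcher JVM is terminated. Note a shutdown hook only runs on a graceful kill (SIGTERM), not on SIGKILL.

{code}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object StopOnLauncherExit {
  def main(args: Array[String]): Unit = {
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")   // placeholder
      .setMainClass("com.example.MyApp")    // placeholder
      .setMaster("yarn")
      .startApplication()

    // If this launcher JVM is killed, try to take the Spark application down too.
    sys.addShutdownHook {
      if (!handle.getState.isFinal) {
        handle.stop()   // handle.kill() is the forceful variant
      }
    }
  }
}
{code}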



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17446) no total size for data source tables in InMemoryCatalog

2016-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474385#comment-15474385
 ] 

Sean Owen commented on SPARK-17446:
---

[~ZenWzh] there is no detail at all here. Please describe what you mean in the 
description. See 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> no total size for data source tables in InMemoryCatalog
> ---
>
> Key: SPARK-17446
> URL: https://issues.apache.org/jira/browse/SPARK-17446
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17449) executorTimeoutMs configure error

2016-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474381#comment-15474381
 ] 

Sean Owen commented on SPARK-17449:
---

Sorry, I don't see the problem? The configured timeout matches what was 
configured.

> executorTimeoutMs configure error
> -
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yang Liang
>  Labels: bug
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s 
> --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 
> ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed 
> out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed 
> out after 11949 m
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf 
> spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 
> ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed 
> out after 39299 ms
> Source code (spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala):
> private val executorTimeoutMs =
>   sc.conf.getTimeAsSeconds("spark.network.timeout", s"${slaveTimeoutMs}ms") * 1000



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17447) performance improvement in Partitioner.DefaultPartitioner

2016-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474375#comment-15474375
 ] 

Sean Owen commented on SPARK-17447:
---

Why don't they need to be sorted?

> performance improvement in Partitioner.DefaultPartitioner 
> --
>
> Key: SPARK-17447
> URL: https://issues.apache.org/jira/browse/SPARK-17447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> If there are many RDDs, in some situations the sort will hurt performance 
> severely. Actually we needn't sort the RDDs; we can just scan them once to 
> achieve the same goal.
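
To make the claim concrete, a hedged sketch (illustrative, not the current Partitioner code): instead of sorting all RDDs by partition count just to pick the largest, one linear scan with maxBy finds it.

{code}
import org.apache.spark.rdd.RDD

// One linear pass instead of sorting the whole sequence of RDDs
// just to find the one with the most partitions.
def largestByPartitions(rdds: Seq[RDD[_]]): RDD[_] =
  rdds.maxBy(_.partitions.length)
{code}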



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17448) There should a limit of k in mllib.Kmeans

2016-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474374#comment-15474374
 ] 

Sean Owen commented on SPARK-17448:
---

That's true in a thousand contexts: if you make an array that's too big you run 
out of memory.
There's no real way to know what the limit of k is. The error is appropriate 
and clear. I don't see a change to make here.

> There should a limit of k in mllib.Kmeans
> -
>
> Key: SPARK-17448
> URL: https://issues.apache.org/jira/browse/SPARK-17448
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In mllib.KMeans there is no limit on the param k; if we pass a big k, then 
> because of the Array(k) allocation a Java OutOfMemoryError will occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17450) spark sql rownumber OOM

2016-09-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474363#comment-15474363
 ] 

Herman van Hovell commented on SPARK-17450:
---

This is generally a bad idea. All your data is moved to a single executor, sorted 
on that executor and finally processed on that executor. So I am curious to 
know what the use case is.

What makes things worse is the fact that 1.6 keeps all records in a single 
partition in memory, causing an OOM in your case. In 2.0 we spill records to 
disk (above 4000 records) in order to prevent OOMs. Could you try this in 2.0?


> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> Spark SQL will OOM when using row_number() over too many sorted records... 
> There will be only 1 task to handle all the records.
> This SQL groups by passenger_id; we have 100 million passengers.
> {code} 
>  SELECT
> passenger_id,
> total_order,
> (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 
> 670800 THEN 'V3' END) AS order_rank
>   FROM
>   (
> SELECT
>   passenger_id,
>   1 as total_order
> FROM table
> GROUP BY passenger_id
>   ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:104)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> physical plan:
> {code}
> == Physical Plan ==
> Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 
> <= 670800)) THEN V3 AS order_rank#1]
> +- Window [passenger_id#7L,total_order#0], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS 
> _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC]
>+- Sort [total_order#0 DESC], false, 0
>   +- TungstenExchange SinglePartition, None
>  +- Project [passenger_id#7L,total_order#0]
> +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L,total_order#0])
>+- TungstenExchange hashpartitioning(passenger_id#7L,1000), 
> None
>   +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L])
>   

[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2016-09-08 Thread Alexander Kasper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474342#comment-15474342
 ] 

Alexander Kasper commented on SPARK-17321:
--

We discovered the same issue. It seems the shuffle service has no way of being 
notified about the failed disk. The only option then is probably to restart the 
node manager.

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0
>Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed that some 
> Spark applications failed randomly due to the YarnShuffleService.
> From the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use another good 
> disk if the first one is broken?
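
A minimal sketch of that idea, assuming the service keeps some recovery state under one of the configured local dirs. The helper, the use of the LOCAL_DIRS environment variable, and the recovery file name are illustrative only, not the actual YarnShuffleService code:

{code}
import java.io.File

object LocalDirProbe {
  // Hypothetical helper: instead of blindly taking the first entry of
  // yarn.nodemanager.local-dirs, probe each configured dir and use the first
  // one that exists and is writable.
  def firstUsableLocalDir(localDirs: Seq[String]): Option[File] =
    localDirs.iterator
      .map(path => new File(path))
      .find(dir => dir.isDirectory && dir.canWrite)

  def main(args: Array[String]): Unit = {
    // For illustration, read a comma-separated list of dirs from an env variable.
    val localDirs = sys.env.getOrElse("LOCAL_DIRS", "").split(",").toSeq.filter(_.nonEmpty)
    // "registeredExecutors.ldb" is just an illustrative recovery-file name.
    val recoveryFile = firstUsableLocalDir(localDirs)
      .map(dir => new File(dir, "registeredExecutors.ldb"))
    println(recoveryFile)
  }
}
{code}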



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2016-09-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474277#comment-15474277
 ] 

Nattavut Sutyanyong commented on SPARK-17154:
-

The same problem, surfacing with different symptoms, was discussed in SPARK-13801, 
SPARK-14040, and SPARK-17337. We shall find a solution that addresses the root 
cause.

There is a trait called {{MultiInstanceRelation}} that appears intended to handle 
multiple references to the same relation. The proposal 
[~sarutak]-san attached here is an interesting way to resolve the root cause of 
the problem. However, I have a feeling that it might be over-complicated. I am 
working on a prototype that extends the {{MultiInstanceRelation}} trait. The goal 
is to address the symptoms discussed in:

this JIRA (SPARK-17154)
SPARK-13801
SPARK-14040
SPARK-17337

and to get rid of (IMHO) the hack implemented in the call to {{dedupRight}} to 
resolve a self-join scenario.

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -
>
> Key: SPARK-17154
> URL: https://issues.apache.org/jira/browse/SPARK-17154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kousuke Saruta
> Attachments: Solution_Proposal_SPARK-17154.pdf
>
>
> When we join two DataFrames that originate from the same DataFrame, 
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
>   val selected1 = joined.select(df("col3"))
> {code}
> In this case, AnalysisException is thrown.
> Another example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), 
> "right")
>   val selected2 = rightOuterJoined.select(df("col1"))
>   selected2.show
> {code}
> In this case, we expect to get the following answer.
> {code}
> 1
> 2
> 3
> 4
> 5
> {code}
> But the actual result is as follows.
> {code}
> 1
> 2
> null
> 4
> 5
> {code}
> The cause of the problems in these examples is that the logical plan related to 
> the right side DataFrame and the expressions of its output are re-created in 
> the analyzer (at the ResolveReferences rule) when a DataFrame has expressions 
> which share the same exprId.
> The re-created expressions are equal to the original ones except for the exprId.
> This happens when we do a self-join or a similar pattern of operations.
> In the first example, df("col3") returns a Column which includes an 
> expression, and the expression has an exprId (say id1 here).
> After the join, the expression held by the right side DataFrame (df) is 
> re-created; the old and new expressions are equal but the exprId is renewed 
> (say id2 for the new exprId here).
> Because of the mismatch of those exprIds, AnalysisException is thrown.
> In the second example, df("col1") returns a column and the expression 
> contained in the column is assigned an exprId (say id3).
> On the other hand, the column returned by filtered("col1") has an expression 
> with the same exprId (id3).
> After the join, the expressions in the right side DataFrame are re-created and 
> the expression assigned id3 is no longer present in the right side but is 
> present in the left side.
> So, referring df("col1") to the joined DataFrame, we get col1 of right side 
> which includes null.
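
As a side note, a user-level workaround (my sketch, not part of the fix under discussion; it simply avoids the duplicated exprIds rather than addressing them) is to alias the two sides so that columns are referenced through the aliases; `df` below is the DataFrame from the first example:

{code}
import org.apache.spark.sql.functions.col

// Give each side of the self-join its own alias.
val left  = df.filter("col1 != 3").select("col1", "col2").as("l")
val right = df.as("r")

val joined = left.join(right, col("l.col1") === col("r.col1"), "inner")

// Refer to the column through the alias rather than through df("col3"),
// which sidesteps the renewed-exprId mismatch described above.
val selected1 = joined.select(col("r.col3"))
{code}

The same aliasing can also be applied to the right-outer-join example.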



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17348) Incorrect results from subquery transformation

2016-09-08 Thread Nattavut Sutyanyong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nattavut Sutyanyong updated SPARK-17348:

Comment: was deleted

(was: The same problem, surfacing with different symptoms, was discussed in 
SPARK-13801, SPARK-14040, and SPARK-17154. The problem reported here is a 
specific pattern. We shall find a solution that addresses the root cause. I am 
considering closing this JIRA as a duplicate.)

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 for the outer 
> row with T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation in the subquery. The aggregation yields 
> MAX(T2.C1) as 2. Therefore, the predicate 
> T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) is false, and the result should be an empty set.
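
For illustration only, here is one way to hand-decorrelate the query for this particular data (my sketch of the intended semantics, not the transformation the optimizer actually performs); it returns the expected empty result:

{code}
// Assumes t1 and t2 are registered as in the example above.
// Compute MAX(t2.c1) per outer row of t1, then apply the equality that the
// IN predicate implies against that per-row aggregate.
val decorrelated = sql("""
  SELECT t1.c1
  FROM t1
  JOIN (SELECT t1.c1 AS o_c1, t1.c2 AS o_c2, max(t2.c1) AS max_c1
        FROM t1 JOIN t2 ON t1.c2 >= t2.c2
        GROUP BY t1.c1, t1.c2) agg
    ON t1.c1 = agg.o_c1 AND t1.c2 = agg.o_c2 AND t1.c1 = agg.max_c1
""")
decorrelated.show()
// +---+
// | c1|
// +---+
// +---+
{code}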



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe

2016-09-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474239#comment-15474239
 ] 

Nattavut Sutyanyong edited comment on SPARK-14040 at 9/8/16 4:03 PM:
-

The root cause of this problem is the way Spark generates a 
unique identifier for each column, without an obvious way to distinguish 
multiple references to the same column. This problem has been discovered in 
different contexts, and different approaches to fixing it have been 
discussed in various places:

SPARK-13801
SPARK-17337

(a partial fix was implemented in the {{dedupRight()}} method for the {{Join}} 
operator)

and the latest attempt to fix this is in SPARK-17154. We should solve this problem 
at the root cause. I will post my idea in SPARK-17154. We shall close this JIRA 
as a duplicate.




was (Author: nsyca):
The root cause of this problem is the way Spark generates a 
unique identifier for each column, without an obvious way to distinguish 
multiple references to the same column. This problem has been discovered in 
different contexts, and different approaches to fixing it have been 
discussed in various places:

SPARK-14040
SPARK-17337

(a partial fix was implemented in the {{dedupRight()}} method for the {{Join}} 
operator)

and the latest attempt to fix this is in SPARK-17154. We should solve this problem 
at the root cause. I will post my idea in SPARK-17154. We shall close this JIRA 
as a duplicate.



> Null-safe and equality join produces incorrect result with filtered dataframe
> -
>
> Key: SPARK-14040
> URL: https://issues.apache.org/jira/browse/SPARK-14040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Ubuntu Linux 15.10
>Reporter: Denton Cockburn
>
> Initial issue reported here: 
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>   val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>   val a = b.where("c = 1").withColumnRenamed("a", 
> "filta").withColumnRenamed("b", "filtb")
>   a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> 
> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>   a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === 
> $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>   a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === 
> b("c"), "left_outer").show
> But that produced the warning:
>   WARN Column: Constructing trivially true equals predicate, 'c#18232 = 
> c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you 
> still have a.c in your data. It looks like it is picked downstream before b.c 
> and the evaluated condition is actually a.newc = a.c"
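
A possible workaround (my sketch, not verified against 1.6.0) is to alias both sides explicitly so that the reference to "c" in the join condition cannot silently resolve to the left side's copy:

{code}
// Same data as above, but with explicit aliases on both sides of the join.
val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")
  .withColumnRenamed("a", "filta")
  .withColumnRenamed("b", "filtb")
  .as("l")
val r = b.as("r")

a.join(r,
  $"l.filta" <=> $"r.a" and $"l.filtb" <=> $"r.b" and $"l.c" <=> $"r.c",
  "left_outer").show()
// expected: a single row, since the left side only contains the c = 1 row
{code}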



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result

2016-09-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474253#comment-15474253
 ] 

Nattavut Sutyanyong commented on SPARK-17337:
-

The same problem, surfacing with different symptoms, was discussed in SPARK-13801, 
SPARK-14040, and SPARK-17154. The problem reported here is a specific pattern. 
We shall find a solution that addresses the root cause. I am considering 
closing this JIRA as a duplicate.

> Incomplete algorithm for name resolution in Catalyst parser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> While investigating SPARK-16951, I found an incorrect-results case from a NOT 
> IN subquery. I originally thought it was an edge case. Further investigation 
> found that this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation

2016-09-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474246#comment-15474246
 ] 

Nattavut Sutyanyong commented on SPARK-17348:
-

The same problem, surfacing with different symptoms, was discussed in SPARK-13801, 
SPARK-14040, and SPARK-17154. The problem reported here is a specific pattern. 
We shall find a solution that addresses the root cause. I am considering 
closing this JIRA as a duplicate.

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 for the outer 
> row with T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation in the subquery. The aggregation yields 
> MAX(T2.C1) as 2. Therefore, the predicate 
> T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) is false, and the result should be an empty set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe

2016-09-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474239#comment-15474239
 ] 

Nattavut Sutyanyong commented on SPARK-14040:
-

The root cause of this problem is the way Spark generates a 
unique identifier for each column, without an obvious way to distinguish 
multiple references to the same column. This problem has been discovered in 
different contexts, and different approaches to fixing it have been 
discussed in various places:

SPARK-14040
SPARK-17337

(a partial fix was implemented in the {{dedupRight()}} method for the {{Join}} 
operator)

and the latest attempt to fix this is in SPARK-17154. We should solve this problem 
at the root cause. I will post my idea in SPARK-17154. We shall close this JIRA 
as a duplicate.



> Null-safe and equality join produces incorrect result with filtered dataframe
> -
>
> Key: SPARK-14040
> URL: https://issues.apache.org/jira/browse/SPARK-14040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Ubuntu Linux 15.10
>Reporter: Denton Cockburn
>
> Initial issue reported here: 
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>   val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>   val a = b.where("c = 1").withColumnRenamed("a", 
> "filta").withColumnRenamed("b", "filtb")
>   a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> 
> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>   a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === 
> $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>   a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === 
> b("c"), "left_outer").show
> But that produced the warning:
>   WARN Column: Constructing trivially true equals predicate, 'c#18232 = 
> c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you 
> still have a.c in your data. It looks like it is picked downstream before b.c 
> and the evaluated condition is actually a.newc = a.c"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


