[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475943#comment-15475943 ]

Shivaram Venkataraman commented on SPARK-17428:
---

I think there are a bunch of issues being discussed here. My initial take would be to add support for something simple and then iterate based on user feedback. Given that R users generally don't know / care much about package version numbers, I'd say an initial cut that handles two flags in spark-submit: (a) a list of package names, with `install.packages` called for them on each machine, and (b) a list of package tar.gz files, which are installed with `R CMD INSTALL` on each machine. We can also make the package installs lazy, i.e. they only get run on a worker when an R worker process is launched there. Will this meet the user needs you have in mind, [~yanboliang]?

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but
> SparkR cannot satisfy these requirements elegantly. For example, you have
> to ask the IT/administrators of the cluster to deploy these R packages on
> each executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users as we do
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository
> on each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a
> virtualenv-like mechanism, as Python does with conda. I have investigated
> and found that packrat (http://rstudio.github.io/packrat/) is one of the
> candidates to support virtualenv for R. Packrat is a dependency management
> system for R and can isolate the dependent R packages in its own private
> package space. SparkR users can then install third-party packages in the
> application scope (destroyed after the application exits) and don't need to
> bother IT/administrators to install these packages manually.
> I would like to know whether this makes sense.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
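The two proposed spark-submit flags can be sketched as plain command construction. The helper below is a hypothetical illustration (not SparkR internals, and the CRAN mirror URL is an assumption): flag (a) maps package names to `install.packages` calls, flag (b) maps local tarballs to `R CMD INSTALL`.

```python
import shlex

def r_install_commands(cran_packages=(), local_tarballs=()):
    """Build the shell commands a worker would run to install R dependencies.

    Hypothetical sketch of flags (a) and (b) above: one install.packages()
    call per CRAN package name, one `R CMD INSTALL` per local tar.gz file.
    """
    cmds = []
    for pkg in cran_packages:
        # (a) install by name from a CRAN mirror (mirror URL is an assumption)
        r_expr = "install.packages('%s', repos='https://cran.r-project.org')" % pkg
        cmds.append("Rscript -e " + shlex.quote(r_expr))
    for tarball in local_tarballs:
        # (b) install a local source package with R CMD INSTALL
        cmds.append("R CMD INSTALL " + shlex.quote(tarball))
    return cmds
```

To make the installs lazy, as the comment suggests, these commands would only run the first time an R worker process is launched on a node.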
[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475860#comment-15475860 ]

Peng Meng commented on SPARK-6160:
--

Hi [~GayathriMurali], are you still working on this? If not, I can take it over. Thanks.

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but
> these data are thrown out when constructing the ChiSqSelectorModel. The data
> are expensive to recompute, so the ChiSqSelectorModel should store and
> expose them.
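For context, the statistics in question are per-feature chi-squared scores computed while fitting the selector. A minimal pure-Python illustration of the statistic (not Spark's implementation) shows the kind of value that is expensive to recompute and therefore worth storing on the model:

```python
def chi2_statistic(observed):
    """Pearson chi-squared statistic for a contingency table.

    `observed` is a list of rows (feature-value x class-label counts).
    This is the per-feature test statistic that, per this issue, the
    ChiSqSelectorModel should retain rather than discard after fitting.
    """
    total = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # expected count under independence of feature and label
            expected = row_sums[i] * col_sums[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat
```

Computing this requires a pass over the data per feature, which is why recomputation after model construction is costly.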
[jira] [Commented] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default
[ https://issues.apache.org/jira/browse/SPARK-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475849#comment-15475849 ]

Apache Spark commented on SPARK-17464:
--

User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/15021

> SparkR spark.als arguments reg should be 0.1 by default
> ---
>
> Key: SPARK-17464
> URL: https://issues.apache.org/jira/browse/SPARK-17464
> Project: Spark
> Issue Type: Bug
> Components: ML, SparkR
> Reporter: Yanbo Liang
> Priority: Minor
>
> The SparkR spark.als argument {{reg}} should be 0.1 by default, which needs
> to be consistent with ML.
[jira] [Assigned] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default
[ https://issues.apache.org/jira/browse/SPARK-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17464:

Assignee: (was: Apache Spark)

> SparkR spark.als arguments reg should be 0.1 by default
> ---
>
> Key: SPARK-17464
> URL: https://issues.apache.org/jira/browse/SPARK-17464
> Project: Spark
> Issue Type: Bug
> Components: ML, SparkR
> Reporter: Yanbo Liang
> Priority: Minor
>
> The SparkR spark.als argument {{reg}} should be 0.1 by default, which needs
> to be consistent with ML.
[jira] [Assigned] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default
[ https://issues.apache.org/jira/browse/SPARK-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17464:

Assignee: Apache Spark

> SparkR spark.als arguments reg should be 0.1 by default
> ---
>
> Key: SPARK-17464
> URL: https://issues.apache.org/jira/browse/SPARK-17464
> Project: Spark
> Issue Type: Bug
> Components: ML, SparkR
> Reporter: Yanbo Liang
> Assignee: Apache Spark
> Priority: Minor
>
> The SparkR spark.als argument {{reg}} should be 0.1 by default, which needs
> to be consistent with ML.
[jira] [Created] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default
Yanbo Liang created SPARK-17464:
---

Summary: SparkR spark.als arguments reg should be 0.1 by default
Key: SPARK-17464
URL: https://issues.apache.org/jira/browse/SPARK-17464
Project: Spark
Issue Type: Bug
Components: ML, SparkR
Reporter: Yanbo Liang
Priority: Minor

The SparkR spark.als argument {{reg}} should be 0.1 by default, which needs to be consistent with ML.
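The fix is just a default-value change. As a sketch (a hypothetical Python analogue, not the actual SparkR wrapper), the wrapper's {{reg}} default would mirror the ML ALS regularization default of 0.1:

```python
def spark_als_params(rank=10, reg=0.1, max_iter=10):
    """Hypothetical analogue of the SparkR spark.als defaults: `reg`
    defaults to 0.1 so it stays consistent with ML ALS, per this issue.
    (Parameter names other than `reg` are illustrative assumptions.)"""
    return {"rank": rank, "regParam": reg, "maxIter": max_iter}
```

Callers who do not pass `reg` then get the same regularization as the ML API.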
[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475769#comment-15475769 ]

Peng Meng commented on SPARK-6160:
--

Hi [~josephkb], I have had some discussion with [~srowen] about keeping the test statistic info of ChiSqSelector in PR https://github.com/apache/spark/pull/14597. Could you review that PR, in particular this commit: https://github.com/apache/spark/pull/14597/commits/3d6aecb8441503c9c3d62a2d8a3d48824b9d6637?

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but
> these data are thrown out when constructing the ChiSqSelectorModel. The data
> are expensive to recompute, so the ChiSqSelectorModel should store and
> expose them.
[jira] [Commented] (SPARK-15509) R MLlib algorithms should support input columns "features" and "label"
[ https://issues.apache.org/jira/browse/SPARK-15509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475758#comment-15475758 ]

WangJianfei commented on SPARK-15509:
-

Please check my issue https://issues.apache.org/jira/browse/SPARK-17447. Thanks.

> R MLlib algorithms should support input columns "features" and "label"
> --
>
> Key: SPARK-15509
> URL: https://issues.apache.org/jira/browse/SPARK-15509
> Project: Spark
> Issue Type: Improvement
> Components: ML, SparkR
> Reporter: Joseph K. Bradley
> Assignee: Xin Ren
> Fix For: 2.1.0
>
> Currently in SparkR, when you load a LibSVM dataset using the sqlContext and
> then pass it to an MLlib algorithm, the ML wrappers will fail since they
> will try to create a "features" column, which conflicts with the existing
> "features" column from the LibSVM loader. E.g., using the "mnist" dataset
> from LibSVM:
> {code}
> training <- loadDF(sqlContext, ".../mnist", "libsvm")
> model <- naiveBayes(label ~ features, training)
> {code}
> This fails with:
> {code}
> 16/05/24 11:52:41 ERROR RBackendHandler: fit on
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Output column features already exists.
> at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
> at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
> at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
> at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
> at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
> at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
> at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
> at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
> at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
> {code}
> The same issue appears for the "label" column once you rename the "features"
> column.
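The exception above comes from the pipeline's schema check: the RFormula-driven wrappers want to create a "features" (and later a "label") output column, and refuse when the input DataFrame already has one. A standalone sketch of that collision check (a hypothetical helper, not Spark code) makes the failure mode concrete:

```python
def check_output_columns(existing_columns, outputs=("features", "label")):
    """Mimic the schema check that raises in the trace above: fitting
    fails when the input data already has a column the transform wants
    to create. Raises ValueError (Spark raises IllegalArgumentException)."""
    for col in outputs:
        if col in existing_columns:
            raise ValueError("Output column %s already exists." % col)
    return True
```

A LibSVM-loaded DataFrame already has "features" and "label" columns, so this check fails immediately, which is exactly the conflict the issue describes.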
[jira] [Commented] (SPARK-17245) NPE thrown by ClientWrapper.conf
[ https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475756#comment-15475756 ]

WangJianfei commented on SPARK-17245:
-

Please check the issue https://issues.apache.org/jira/browse/SPARK-17447. Thanks.

> NPE thrown by ClientWrapper.conf
>
> Key: SPARK-17245
> URL: https://issues.apache.org/jira/browse/SPARK-17245
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.2
> Reporter: Yin Huai
> Assignee: Yin Huai
> Fix For: 1.6.3
>
> This issue has been fixed in Spark 2.0. It seems ClientWrapper.conf is
> trying to access the ThreadLocal SessionState, which has not been set.
> {code}
> java.lang.NullPointerException
> at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225)
> at org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279)
> at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291)
> at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246)
> at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245)
> at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288)
> at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493)
> at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483)
> at org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}
[jira] [Commented] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf
[ https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475754#comment-15475754 ]

WangJianfei commented on SPARK-17387:
-

Please check https://issues.apache.org/jira/browse/SPARK-17447. Thanks.

> Creating SparkContext() from python without spark-submit ignores user conf
> --
>
> Key: SPARK-17387
> URL: https://issues.apache.org/jira/browse/SPARK-17387
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Reporter: Marcelo Vanzin
> Priority: Minor
>
> Consider the following scenario: a user runs a python application not
> through spark-submit, but by adding the pyspark module and manually
> creating a Spark context. Kinda like this:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul 1 2016, 15:12:24)
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext
> >>> from pyspark import SparkConf
> >>> conf = SparkConf().set("spark.driver.memory", "4g")
> >>> sc = SparkContext(conf=conf)
> {noformat}
> If you look at the JVM launched by the pyspark code, it ignores the user's
> configuration:
> {noformat}
> $ ps ax | grep $(pgrep -f SparkSubmit)
> 12283 pts/2 Sl+ 0:03 /apps/java7/bin/java -cp ... -Xmx1g
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
> {noformat}
> Note the "1g" of memory. If instead you use "pyspark", you get the correct
> "4g" in the JVM.
> This also affects other configs; for example, you can't really add jars to
> the driver's classpath using "spark.jars".
> You can work around this by setting the undocumented env variable Spark
> itself uses:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul 1 2016, 15:12:24)
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf spark.driver.memory=4g"
> >>> from pyspark import SparkContext
> >>> sc = SparkContext()
> {noformat}
> But it would be nicer if the configs were automatically propagated.
> BTW the reason for this is that the {{launch_gateway}} function used to
> start the JVM does not take any parameters, and the only place where it
> reads arguments for Spark is that env variable.
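The env-variable workaround above can be wrapped in a small helper that turns a dict of Spark settings into a PYSPARK_SUBMIT_ARGS value. This is a sketch: the variable name comes from the issue, but the helper and its flag ordering (conf flags first, shell token last, per spark-submit conventions) are assumptions.

```python
import os

def pyspark_submit_args(conf):
    """Build a PYSPARK_SUBMIT_ARGS value from a dict of Spark settings.

    Hypothetical helper around the workaround quoted above; each entry
    becomes a `--conf key=value` flag ahead of the pyspark-shell token.
    """
    flags = " ".join("--conf %s=%s" % kv for kv in sorted(conf.items()))
    return (flags + " pyspark-shell").strip()

# The env var must be set before the SparkContext is created, since
# launch_gateway reads it when starting the JVM.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(
    {"spark.driver.memory": "4g"})
```

After this, creating a SparkContext from a plain python session should pick up the driver memory setting, per the issue's description of the workaround.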
[jira] [Commented] (SPARK-17449) Relation between heartbeatInterval and network timeout
[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475727#comment-15475727 ]

Yang Liang commented on SPARK-17449:

Sorry, let me clarify. The relation between spark.network.timeout and spark.executor.heartbeatInterval is confusing. It should be mentioned in the documentation at least.

> Relation between heartbeatInterval and network timeout
> --
>
> Key: SPARK-17449
> URL: https://issues.apache.org/jira/browse/SPARK-17449
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Yang Liang
> Labels: bug
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 1 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms
> Source code: spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by several internal
>  * components to convey liveness or execution information for in-progress tasks. It will also
>  * expire the hosts that have not heartbeated for more than spark.network.timeout.
>  */
> private val executorTimeoutMs =
>   sc.conf.getTimeAsSeconds("spark.network.timeout", s"${slaveTimeoutMs}ms") * 1000
> The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the documentation at least. Otherwise the errors above are confusing. Perhaps add some checks when the settings are read?
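The quoted code shows the executor timeout falling back to spark.network.timeout, which is why a heartbeat interval larger than that timeout gets executors expired. The sanity check the issue asks for can be sketched standalone (a hypothetical helper, not Spark's config code; the 120s default is the documented default for spark.network.timeout):

```python
def parse_seconds(value):
    """Parse Spark-style time strings such as '200s', '10s', '120000ms'."""
    units = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
    # check longer suffixes first so 'ms' is not misread as 'm' + 's'
    for suffix in sorted(units, key=len, reverse=True):
        if value.endswith(suffix):
            return float(value[:-len(suffix)]) * units[suffix]
    return float(value)  # bare number: assume seconds

def check_heartbeat_conf(heartbeat_interval, network_timeout="120s"):
    """Return True iff the heartbeat interval fits inside the executor
    timeout, which defaults to spark.network.timeout per the quoted code."""
    return parse_seconds(heartbeat_interval) < parse_seconds(network_timeout)
```

Under this check, the failing runs above (heartbeatInterval=200s with network.timeout=10s) would be flagged at configuration time instead of surfacing as lost-executor errors.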
[jira] [Updated] (SPARK-17449) Relation between heartbeatInterval and network timeout
[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Liang updated SPARK-17449: --- Description: $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 m spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms Source Code: spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala /** * A heartbeat from executors to the driver. This is a shared message used by several internal * components to convey liveness or execution information for in-progress tasks. It will also * expire the hosts that have not heartbeated for more than spark.network.timeout. */ private val executorTimeoutMs = sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 1000 The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document at least. Otherwise error above would be confusing. Do some checks when get settings ? 
was: $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 m spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms Source Code: spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala /** * A heartbeat from executors to the driver. This is a shared message used by several internal * components to convey liveness or execution information for in-progress tasks. It will also * expire the hosts that have not heartbeated for more than spark.network.timeout. */ private val executorTimeoutMs = sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 1000 The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document at least. Otherwise error above would be confusing. 
> Relation between heartbeatInterval and network timeout > -- > > Key: SPARK-17449 > URL: https://issues.apache.org/jira/browse/SPARK-17449 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yang Liang > Labels: bug > > $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s > --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 > ms exceeds timeout 12 ms > ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed > out after 168136 ms > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed > out after 11949 m > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed > out after 39299 ms > Source Code: > spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala > /** > * A heartbeat from executors to the driver. This is a shared
[jira] [Updated] (SPARK-17449) executorTimeoutMs configure error
[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Liang updated SPARK-17449: --- Description: $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 m spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms Source Code: spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala /** * A heartbeat from executors to the driver. This is a shared message used by several internal * components to convey liveness or execution information for in-progress tasks. It will also * expire the hosts that have not heartbeated for more than spark.network.timeout. */ private val executorTimeoutMs = sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 1000 The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document at least. Otherwise error above would be confusing. 
was: $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 m spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms Source Code: spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala /** * A heartbeat from executors to the driver. This is a shared message used by several internal * components to convey liveness or execution information for in-progress tasks. It will also * expire the hosts that have not heartbeated for more than spark.network.timeout. 
*/ private val executorTimeoutMs = sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 1000 > executorTimeoutMs configure error > - > > Key: SPARK-17449 > URL: https://issues.apache.org/jira/browse/SPARK-17449 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yang Liang > Labels: bug > > $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s > --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 > ms exceeds timeout 12 ms > ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed > out after 168136 ms > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed > out after 11949 m > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed > out after 39299 ms > Source Code: > spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala > /** > * A heartbeat from executors to the driver. This is a shared message used by > several internal > * components to convey liveness or execution information for in-progress > tasks. It will also > * expire the hosts that have not heartbeated for more than > spark.network.timeout. > */ > private val
[jira] [Updated] (SPARK-17449) Relation between
[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Liang updated SPARK-17449: --- Summary: Relation between (was: executorTimeoutMs configure error) > Relation between > - > > Key: SPARK-17449 > URL: https://issues.apache.org/jira/browse/SPARK-17449 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yang Liang > Labels: bug > > $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s > --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 > ms exceeds timeout 12 ms > ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed > out after 168136 ms > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed > out after 11949 m > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed > out after 39299 ms > Source Code: > spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala > /** > * A heartbeat from executors to the driver. This is a shared message used by > several internal > * components to convey liveness or execution information for in-progress > tasks. It will also > * expire the hosts that have not heartbeated for more than > spark.network.timeout. 
> */ > private val executorTimeoutMs = > sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") > * 1000 > The relation between spark.network.timeout and > spark.executor.heartbeatInterval should be mentioned in the document at > least. Otherwise error above would be confusing.
[jira] [Updated] (SPARK-17449) Relation between heartbeatInterval and network timeout
[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Liang updated SPARK-17449: --- Summary: Relation between heartbeatInterval and network timeout (was: Relation between ) > Relation between heartbeatInterval and network timeout > -- > > Key: SPARK-17449 > URL: https://issues.apache.org/jira/browse/SPARK-17449 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yang Liang > Labels: bug > > $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s > --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 > ms exceeds timeout 12 ms > ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed > out after 168136 ms > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed > out after 11949 m > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed > out after 39299 ms > Source Code: > spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala > /** > * A heartbeat from executors to the driver. This is a shared message used by > several internal > * components to convey liveness or execution information for in-progress > tasks. It will also > * expire the hosts that have not heartbeated for more than > spark.network.timeout. 
> */ > private val executorTimeoutMs = > sc.conf.getTimeAsSeconds("spark.network.timeout", s"${slaveTimeoutMs}ms") > * 1000 > The relation between spark.network.timeout and > spark.executor.heartbeatInterval should at least be mentioned in the > documentation; otherwise the errors above are confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
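The derivation quoted above can be modeled with a minimal sketch (Python used purely for illustration; the real logic is the Scala in HeartbeatReceiver.scala). It shows why spark.executor.heartbeatInterval must stay well below the effective executor timeout. The 120s fallback for spark.storage.blockManagerSlaveTimeoutMs is an assumption based on Spark's documented default, and inputs are plain seconds for simplicity (the real code parses suffixed strings like "10s"):

```python
def executor_timeout_ms(conf):
    # spark.network.timeout falls back to slaveTimeoutMs (from
    # spark.storage.blockManagerSlaveTimeoutMs); 120s is assumed here
    # as its default, matching the "12 ms"-truncated 120000 ms in the
    # logs above.
    slave_timeout_s = int(conf.get("spark.storage.blockManagerSlaveTimeoutMs", 120))
    return int(conf.get("spark.network.timeout", slave_timeout_s)) * 1000

def heartbeat_config_ok(conf):
    # If the heartbeat interval is not below the executor timeout,
    # HeartbeatReceiver will inevitably expire the executor.
    heartbeat_ms = int(conf.get("spark.executor.heartbeatInterval", 10)) * 1000
    return heartbeat_ms < executor_timeout_ms(conf)
```

Under this model, heartbeatInterval=20s against the 120s default timeout is fine, while heartbeatInterval=200s with spark.network.timeout=10s (the second shell invocation above) can never pass.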
[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475709#comment-15475709 ] Peng Meng commented on SPARK-6160: -- hi Joseph K. Bradley > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-6160: - Comment: was deleted (was: hi Joseph K. Bradley) > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17449) executorTimeoutMs configure error
[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Liang updated SPARK-17449: --- Description: $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 m spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms Source Code: spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala /** * A heartbeat from executors to the driver. This is a shared message used by several internal * components to convey liveness or execution information for in-progress tasks. It will also * expire the hosts that have not heartbeated for more than spark.network.timeout. 
*/ private val executorTimeoutMs = sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 1000 was: $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 m spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 1 ms ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms Source Code: spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala private val executorTimeoutMs = sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 1000 > executorTimeoutMs configure error > - > > Key: SPARK-17449 > URL: https://issues.apache.org/jira/browse/SPARK-17449 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yang Liang > Labels: bug > > $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s > --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 > ms exceeds timeout 12 ms > ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed > out after 168136 ms > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 > ms exceeds timeout 1 ms > ERROR 
YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed > out after 11949 m > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed > out after 39299 ms > Source Code: > spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala > /** > * A heartbeat from executors to the driver. This is a shared message used by > several internal > * components to convey liveness or execution information for in-progress > tasks. It will also > * expire the hosts that have not heartbeated for more than > spark.network.timeout. > */ > private val executorTimeoutMs = > sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") > * 1000 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475664#comment-15475664 ] Jeff Zhang commented on SPARK-17428: Found another, more elegant way to specify a version, using devtools: https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages {code} require(devtools) install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org") {code} > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR can not satisfy this requirements elegantly. > For example, you should to mess with the IT/administrators of the cluster to > deploy these R packages on each executors/workers node which is very > inflexible. > I think we should support third party R packages for SparkR users as what we > do for jar packages in the following two scenarios: > 1, Users can install R packages from CRAN or custom CRAN-like repository for > each executors. > 2, Users can load their local R packages and install them on each executors. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. I have investigated and found > packrat(http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope(destroy after the application exit) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether it make sense. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17447) performance improvement in Partitioner.DefaultPartitioner
[ https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475607#comment-15475607 ] WangJianfei edited comment on SPARK-17447 at 9/9/16 2:28 AM: - We can scan the RDD array just once to find the RDD with the maximum number of partitions whose partitioner is defined. Compare Spark's current code with mine. Spark's code: def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) { return r.partitioner.get } if (rdd.context.conf.contains("spark.default.parallelism")) { new HashPartitioner(rdd.context.defaultParallelism) } else { new HashPartitioner(bySize.head.partitions.length) } } My basic logic: var maxP = 0 if (rdd.partitioner.isDefined && rdd.partitioner.get.numPartitions > 0) { maxP = rdd.partitioner.get.numPartitions } for (i <- 0 until others.length) { if (others(i).partitioner.isDefined && others(i).partitioner.get.numPartitions > maxP) { maxP = others(i).partitioner.get.numPartitions } } was (Author: codlife): we can just scan the rdd array only one time to find the rdd with the maxmum partitions and whose partitioner is difined. 
you can just look the code of spark and the code of me: the code of spark: def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) { return r.partitioner.get } if (rdd.context.conf.contains("spark.default.parallelism")) { new HashPartitioner(rdd.context.defaultParallelism) } else { new HashPartitioner(bySize.head.partitions.length) } } this is my basic logic: var seq=List[(Int,Int)]() for(i <- 0 until others.length){ if(others(i).partitioner.isDefined) { seq = seq :+ (others(i).partitions.length, i) } } println("this is seq:") seq.foreach(println) var maxP=0 if(rdd.partitioner.isDefined && rdd.partitioner.get.numPartitions >0 ){ maxP=rdd.partitioner.get.numPartitions } for(i <- 0 until seq.length){ if(others(seq(i)._2).partitioner.isDefined && others(seq(i)._2).partitioner.get.numPartitions > maxP){ maxP=others(seq(i)._2).partitioner.get.numPartitions } } > performance improvement in Partitioner.DefaultPartitioner > -- > > Key: SPARK-17447 > URL: https://issues.apache.org/jira/browse/SPARK-17447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: performance > Original Estimate: 1h > Remaining Estimate: 1h > > if there are many rdds in some situations,the sort will loss he performance > servely,actually we needn't sort the rdds , we can just scan the rdds one > time to gain the same goal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
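The single-pass idea discussed in this thread can be sketched as follows (Python used only for illustration; the code under discussion is Scala). Modeling each RDD as a (has_partitioner, num_partitions) pair is an assumption of this sketch:

```python
def max_defined_partitions(rdds):
    # One O(n) scan instead of sorting all RDDs by partition count:
    # track the largest partition count among RDDs whose partitioner
    # is defined. Returns 0 when none qualifies, in which case the
    # caller would fall back to a HashPartitioner as in
    # defaultPartitioner.
    best = 0
    for has_partitioner, num_partitions in rdds:
        if has_partitioner and num_partitions > best:
            best = num_partitions
    return best
```

This replaces the O(n log n) sortBy(...).reverse with a single scan while preserving the "largest defined partitioner wins" selection rule.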
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475645#comment-15475645 ] Jeff Zhang commented on SPARK-17428: I just linked the jira of Python virtualenv. It seems R supports virtualenv natively: install.packages can specify the version and the installation destination folder, and it is isolated across users. I think there are two scenarios for the SparkR environment: one where the cluster has internet access, and one without. If the cluster has internet access, then I think we can call install.packages directly. {code} install.packages("dplyr", lib="") library(dplyr, lib.loc="") {code} If the cluster doesn't have internet access, then the driver can first download the package tarballs and add them through --files, and each executor will try to compile and install these packages: {code} install.packages(, repos = NULL, type="source", lib="") library(dplyr, lib.loc="") {code} For this scenario, if the package has dependencies, it would still try to download its dependencies from the internet, or the user has to manually figure out its dependencies and add them to the Spark app. > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR can not satisfy this requirements elegantly. > For example, you should to mess with the IT/administrators of the cluster to > deploy these R packages on each executors/workers node which is very > inflexible. > I think we should support third party R packages for SparkR users as what we > do for jar packages in the following two scenarios: > 1, Users can install R packages from CRAN or custom CRAN-like repository for > each executors. > 2, Users can load their local R packages and install them on each executors. 
> To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. I have investigated and found > packrat(http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope(destroy after the application exit) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether it make sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17447) performance improvement in Partitioner.DefaultPartitioner
[ https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475607#comment-15475607 ] WangJianfei edited comment on SPARK-17447 at 9/9/16 2:12 AM: - We can scan the RDD array just once to find the RDD with the maximum number of partitions whose partitioner is defined. Compare Spark's current code with mine. Spark's code: def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) { return r.partitioner.get } if (rdd.context.conf.contains("spark.default.parallelism")) { new HashPartitioner(rdd.context.defaultParallelism) } else { new HashPartitioner(bySize.head.partitions.length) } } My basic logic: var seq = List[(Int, Int)]() for (i <- 0 until others.length) { if (others(i).partitioner.isDefined) { seq = seq :+ (others(i).partitions.length, i) } } println("this is seq:") seq.foreach(println) var maxP = 0 if (rdd.partitioner.isDefined && rdd.partitioner.get.numPartitions > 0) { maxP = rdd.partitioner.get.numPartitions } for (i <- 0 until seq.length) { if (others(seq(i)._2).partitioner.isDefined && others(seq(i)._2).partitioner.get.numPartitions > maxP) { maxP = others(seq(i)._2).partitioner.get.numPartitions } } was (Author: codlife): you can look this source code as below: we can just scan the rdd array only one time to find the rdd with the maxmum partitions and whose partitioner is difined. 
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) { return r.partitioner.get } if (rdd.context.conf.contains("spark.default.parallelism")) { new HashPartitioner(rdd.context.defaultParallelism) } else { new HashPartitioner(bySize.head.partitions.length) } } > performance improvement in Partitioner.DefaultPartitioner > -- > > Key: SPARK-17447 > URL: https://issues.apache.org/jira/browse/SPARK-17447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: performance > Original Estimate: 1h > Remaining Estimate: 1h > > if there are many rdds in some situations,the sort will loss he performance > servely,actually we needn't sort the rdds , we can just scan the rdds one > time to gain the same goal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475630#comment-15475630 ] Jeff Zhang commented on SPARK-17428: Source code url needs to be specified for version. http://stackoverflow.com/questions/17082341/installing-older-version-of-r-package > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR can not satisfy this requirements elegantly. > For example, you should to mess with the IT/administrators of the cluster to > deploy these R packages on each executors/workers node which is very > inflexible. > I think we should support third party R packages for SparkR users as what we > do for jar packages in the following two scenarios: > 1, Users can install R packages from CRAN or custom CRAN-like repository for > each executors. > 2, Users can load their local R packages and install them on each executors. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. I have investigated and found > packrat(http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope(destroy after the application exit) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether it make sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17448) There should a limit of k in mllib.Kmeans
[ https://issues.apache.org/jira/browse/SPARK-17448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475619#comment-15475619 ] WangJianfei commented on SPARK-17448: - Maybe we can limit k according to the number of elements in the input dataset, but as you say, the error is clear. > There should a limit of k in mllib.Kmeans > - > > Key: SPARK-17448 > URL: https://issues.apache.org/jira/browse/SPARK-17448 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: patch > Original Estimate: 1h > Remaining Estimate: 1h > > In the mllib.Kmeans, there are no limit to the param k, if we pass a big K, > then because of he array(k), java outofmemory error will occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
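A fail-fast guard of the kind suggested in this comment could look like the following sketch (Python for illustration only; the function name and messages are hypothetical, not Spark API). It rejects an oversized k before any Array(k)-style allocation is attempted:

```python
def validate_k(k, num_points):
    # Hypothetical guard: surface a clear parameter error up front
    # instead of letting a huge allocation fail later with an
    # OutOfMemoryError.
    if k <= 0:
        raise ValueError(f"k must be positive, got {k}")
    if k > num_points:
        raise ValueError(f"k={k} exceeds the dataset size ({num_points})")
    return k
```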
[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475612#comment-15475612 ] Felix Cheung edited comment on SPARK-17428 at 9/9/16 1:59 AM: -- I don't think I see a way to specify a version number for install.packages in R? Python does compile code - installing packages with pip compiles the python scripts. https://www.google.com/search?q=pyc And also many packages have native components which will not work without installing or compiling as root (or heavy hacking), eg. matplotlib, scipy. was (Author: felixcheung): I don't think I see a way to specify a version number for install.packages in R? Python does compile code - installing packages with pip compiles the python scripts. https://www.google.com/search?q=pyc And also many packages have heavy native components which will not work without installing as root (or heavy hacking), eg. matplotlib, scipy. > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR can not satisfy this requirements elegantly. > For example, you should to mess with the IT/administrators of the cluster to > deploy these R packages on each executors/workers node which is very > inflexible. > I think we should support third party R packages for SparkR users as what we > do for jar packages in the following two scenarios: > 1, Users can install R packages from CRAN or custom CRAN-like repository for > each executors. > 2, Users can load their local R packages and install them on each executors. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. 
I have investigated and found > packrat(http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope(destroy after the application exit) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether it make sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475612#comment-15475612 ] Felix Cheung commented on SPARK-17428: -- I don't think I see a way to specify a version number for install.packages in R? Python does compile code - installing packages with pip compiles the python scripts. https://www.google.com/search?q=pyc And also many packages have heavy native components which will not work without installing as root (or heavy hacking), eg. matplotlib, scipy. > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR can not satisfy this requirements elegantly. > For example, you should to mess with the IT/administrators of the cluster to > deploy these R packages on each executors/workers node which is very > inflexible. > I think we should support third party R packages for SparkR users as what we > do for jar packages in the following two scenarios: > 1, Users can install R packages from CRAN or custom CRAN-like repository for > each executors. > 2, Users can load their local R packages and install them on each executors. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. I have investigated and found > packrat(http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope(destroy after the application exit) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether it make sense. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17447) performance improvement in Partitioner.DefaultPartitioner
[ https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475607#comment-15475607 ] WangJianfei commented on SPARK-17447: - You can look at the source code below: we can scan the RDD array just once to find the RDD with the maximum number of partitions whose partitioner is defined. def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) { return r.partitioner.get } if (rdd.context.conf.contains("spark.default.parallelism")) { new HashPartitioner(rdd.context.defaultParallelism) } else { new HashPartitioner(bySize.head.partitions.length) } } > performance improvement in Partitioner.DefaultPartitioner > -- > > Key: SPARK-17447 > URL: https://issues.apache.org/jira/browse/SPARK-17447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: performance > Original Estimate: 1h > Remaining Estimate: 1h > > if there are many rdds in some situations,the sort will loss he performance > servely,actually we needn't sort the rdds , we can just scan the rdds one > time to gain the same goal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475593#comment-15475593 ] Sun Rui edited comment on SPARK-17428 at 9/9/16 1:52 AM: - I don't understand the meaning of exact version control. I think a user can specify downloaded R packages, or specify a package name and version and SparkR can download it from CRAN. PySpark does not have the compilation issue, as Python code needs no compilation: the Python interpreter abstracts the underlying architecture differences just as the JVM does. For the R package compilation issue, maybe we can have the following policies: 1. For binary R packages, just deliver them to worker nodes; 2. For source R packages: 2.1 if only R code is contained, compilation on the driver node is OK 2.2 if C/C++ code is contained, by default, compile it on the driver node. But we can have an option --compile-on-workers allowing users to choose to compile on worker nodes. If the option is specified, users should ensure the compiling tool chain and the runtime libraries it depends on are ready on worker nodes. was (Author: sunrui): I don't understand the meaning of exact version control. I think a user can specify downloaded R packages or specify a package name and version, and SparkR can download it from CRAN. PySpark does not have the compilation issue, as Python code needs no complication. The python interpreter abstracts the underly architecture differences just as JVM does. For R package compilation issue, maybe we can have the following polices: 1. For binary R packages, just deliver them to worker nodes; 2. For source R packges: 2.1 if only R code is contained, complication on the driver node is OK 2.2 if C/c++ code is contained, by default, compile it on the driver node. But we can have an option --compile-on-workers allowing users to choose to compile on worker nodes. 
If the option is specified, users should ensure the compiling tool chain and the dependent runtime libraries be ready on worker nodes. > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR can not satisfy this requirements elegantly. > For example, you should to mess with the IT/administrators of the cluster to > deploy these R packages on each executors/workers node which is very > inflexible. > I think we should support third party R packages for SparkR users as what we > do for jar packages in the following two scenarios: > 1, Users can install R packages from CRAN or custom CRAN-like repository for > each executors. > 2, Users can load their local R packages and install them on each executors. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. I have investigated and found > packrat(http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope(destroy after the application exit) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether it make sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475593#comment-15475593 ] Sun Rui edited comment on SPARK-17428 at 9/9/16 1:50 AM: - I don't understand the meaning of exact version control. I think a user can specify downloaded R packages, or specify a package name and version and SparkR can download it from CRAN. PySpark does not have the compilation issue, as Python code needs no compilation: the Python interpreter abstracts the underlying architecture differences just as the JVM does. For the R package compilation issue, maybe we can have the following policies: 1. For binary R packages, just deliver them to worker nodes; 2. For source R packages: 2.1 if only R code is contained, compilation on the driver node is OK 2.2 if C/C++ code is contained, by default, compile it on the driver node. But we can have an option --compile-on-workers allowing users to choose to compile on worker nodes. If the option is specified, users should ensure the compiling tool chain and the dependent runtime libraries are ready on worker nodes. was (Author: sunrui): I don't understand the meaning of exact version control. I think a user can specify downloaded R packages or specify a package name and version, and SparkR can download it from CRAN. PySpark does not have the compilation issue, as Python code needs no complication. The python interpreter abstracts the underly architecture differences just as JVM does. For R package compilation issue, maybe we can have the following polices: 1. For binary R packages, just deliver them to worker nodes; 2. For source R packges: 2.1 if only R code is contained, complication on the driver node is OK 2.2 if C/c++ code is contained, by default, compile it on the driver node. But we can have an option --compile-on-workers allowing users to choose to compile on worker nodes. If the option is specified, users should ensure the compiling tool chain be ready on worker nodes. 
> SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR cannot satisfy these requirements elegantly. > For example, you have to ask the IT/administrators of the cluster to > deploy these R packages on each executor/worker node, which is very > inflexible. > I think we should support third party R packages for SparkR users as we > do for jar packages, in the following two scenarios: > 1. Users can install R packages from CRAN or a custom CRAN-like repository on > each executor. > 2. Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. I have investigated and found > packrat (http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope (destroyed after the application exits) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether this makes sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475593#comment-15475593 ] Sun Rui commented on SPARK-17428: - I don't understand the meaning of exact version control. I think a user can specify downloaded R packages, or specify a package name and version and let SparkR download it from CRAN. PySpark does not have the compilation issue, as Python code needs no compilation; the Python interpreter abstracts the underlying architecture differences just as the JVM does. For the R package compilation issue, maybe we can have the following policies: 1. For binary R packages, just deliver them to worker nodes; 2. For source R packages: 2.1 if only R code is contained, compilation on the driver node is OK; 2.2 if C/C++ code is contained, by default, compile it on the driver node. But we can have an option --compile-on-workers allowing users to choose to compile on worker nodes. If the option is specified, users should ensure the compilation tool chain is ready on worker nodes. > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR cannot satisfy these requirements elegantly. > For example, you have to ask the IT/administrators of the cluster to > deploy these R packages on each executor/worker node, which is very > inflexible. > I think we should support third party R packages for SparkR users as we > do for jar packages, in the following two scenarios: > 1. Users can install R packages from CRAN or a custom CRAN-like repository on > each executor. > 2. Users can load their local R packages and install them on each executor. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. 
I have investigated and found > packrat (http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope (destroyed after the application exits) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether this makes sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
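The compile-on-driver vs. compile-on-workers policy proposed above can be sketched as a small dispatch function. This is a hypothetical illustration only: the `install_plan` helper and the package-descriptor shape are invented for this sketch and are not part of SparkR.

```python
# Hedged sketch of the proposed R-package installation policy from the
# comment above. The helper name and package-descriptor shape are
# illustrative only; nothing here is actual SparkR API.
def install_plan(pkg, compile_on_workers=False):
    """Decide where a package gets built/installed.

    pkg: dict with "kind" ("binary" or "source") and, for source
    packages, "has_native_code" (True if C/C++ code is included).
    """
    if pkg["kind"] == "binary":
        # Policy 1: binary packages are simply shipped to worker nodes.
        return "ship-to-workers"
    if not pkg.get("has_native_code", False):
        # Policy 2.1: pure-R source packages can be built on the driver.
        return "compile-on-driver"
    # Policy 2.2: native code compiles on the driver by default, and on
    # the workers only if --compile-on-workers was requested (in which
    # case the tool chain must already be present there).
    return "compile-on-workers" if compile_on_workers else "compile-on-driver"
```

For example, `install_plan({"kind": "source", "has_native_code": True})` returns `"compile-on-driver"` unless the worker-side option is set.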
[jira] [Created] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
Josh Rosen created SPARK-17463: -- Summary: Serialization of accumulators in heartbeats is not thread-safe Key: SPARK-17463 URL: https://issues.apache.org/jira/browse/SPARK-17463 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Josh Rosen Priority: Critical Check out the following {{ConcurrentModificationException}}: {code} 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 attempts org.apache.spark.SparkException: Exception thrown in awaitResult at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.ConcurrentModificationException at java.util.ArrayList.writeObject(ArrayList.java:766) at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at
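The stack trace shows the heartbeat thread serializing an `ArrayList` of accumulator updates while a task thread is still mutating it. The usual fix pattern for this class of bug, sketched here in Python rather than Spark's Scala (the class and method names are invented for illustration), is to snapshot the mutable state under a lock before handing it to the serializer:

```python
import pickle
import threading

# Generic sketch of the fix pattern for this class of bug (not Spark's
# actual code): never hand a live, mutable collection to the serializer
# from another thread. Snapshot it under a lock first.
class AccumulatorUpdates:
    def __init__(self):
        self._lock = threading.Lock()
        self._updates = []

    def add(self, value):
        # The task thread appends updates as the task runs.
        with self._lock:
            self._updates.append(value)

    def snapshot_bytes(self):
        # The heartbeat thread serializes a private copy, so a
        # concurrent add() can never mutate the structure while the
        # serializer is walking it.
        with self._lock:
            return pickle.dumps(list(self._updates))
```

The key design choice is that the copy and the append are both guarded by the same lock, so the serializer only ever sees a consistent snapshot.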
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475536#comment-15475536 ] Josh Rosen commented on SPARK-17463: [~zsxwing], FYI, since you're good at these types of RPC thread-safety issues. > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at >
[jira] [Created] (SPARK-17462) Check for places within MLlib which should use VersionUtils to parse Spark version strings
Joseph K. Bradley created SPARK-17462: - Summary: Check for places within MLlib which should use VersionUtils to parse Spark version strings Key: SPARK-17462 URL: https://issues.apache.org/jira/browse/SPARK-17462 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Joseph K. Bradley Priority: Minor [SPARK-17456] creates a utility for parsing Spark versions. Several places in MLlib use custom regexes or other approaches. Those should be fixed to use the new (tested) utility. This task is for checking and replacing those instances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
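A shared version parser of the kind SPARK-17456 proposes must handle both plain releases and suffixed builds like snapshots; a minimal sketch (the function name is illustrative, not the actual VersionUtils API):

```python
import re

# Minimal sketch of what a shared Spark-version parser must handle:
# plain releases ("2.0.0") and suffixed versions ("2.1.0-SNAPSHOT").
# The function name is illustrative, not the actual VersionUtils API.
def major_minor(version):
    m = re.match(r"^(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError("cannot parse Spark version: " + version)
    return int(m.group(1)), int(m.group(2))

print(major_minor("2.1.0-SNAPSHOT"))  # → (2, 1)
```

Centralizing this avoids the ad-hoc regexes scattered through MLlib that the ticket describes.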
[jira] [Resolved] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15487. -- Resolution: Fixed Fix Version/s: 2.1.0 > Spark Master UI to reverse proxy Application and Workers UI > --- > > Key: SPARK-15487 > URL: https://issues.apache.org/jira/browse/SPARK-15487 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gurvinder >Priority: Minor > Fix For: 2.1.0 > > > Currently, when running in Standalone mode, the Spark UI's links to workers and > application drivers point to internal/protected network endpoints. So > to access the worker/application UIs, the user's machine has to connect to a VPN or needs > to have access to the internal network directly. > Therefore the proposal is to make the Spark master UI reverse proxy this > information back to the user, so only the Spark master UI needs to be opened up > to the internet. > The minimal change can be done by adding another route, e.g. > http://spark-master.com/target// so that when a request goes to a target, > ProxyServlet kicks in, forwards the request to it, > and sends the response back to the user. > More information about the discussion of this feature can be found in this > mailing list thread > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
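The proposed routing can be illustrated with a toy path-rewriting function. The per-target URL layout and the worker addresses below are invented for this sketch; they are not the layout the actual patch uses.

```python
# Toy sketch of master-UI reverse proxying: requests under a per-target
# prefix on the master UI get rewritten to the internal worker/app UI
# address. The "/target/<id>/" layout and all addresses are illustrative.
def rewrite(path, targets):
    parts = path.split("/", 3)  # "", "target", "<id>", "<rest of path>"
    if len(parts) >= 3 and parts[1] == "target" and parts[2] in targets:
        rest = "/" + parts[3] if len(parts) > 3 else "/"
        return targets[parts[2]] + rest
    return None  # not a proxied path; serve from the master UI itself

targets = {"worker-1": "http://10.0.0.5:8081"}
print(rewrite("/target/worker-1/logs", targets))  # → http://10.0.0.5:8081/logs
```

With something like this in front of a proxy servlet, only the master UI endpoint needs to be reachable from outside the cluster.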
[jira] [Commented] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf
[ https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475418#comment-15475418 ] Apache Spark commented on SPARK-17387: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/15019 > Creating SparkContext() from python without spark-submit ignores user conf > -- > > Key: SPARK-17387 > URL: https://issues.apache.org/jira/browse/SPARK-17387 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Consider the following scenario: user runs a python application not through > spark-submit, but by adding the pyspark module and manually creating a Spark > context. Kinda like this: > {noformat} > $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python > Python 2.7.12 (default, Jul 1 2016, 15:12:24) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> from pyspark import SparkContext > >>> from pyspark import SparkConf > >>> conf = SparkConf().set("spark.driver.memory", "4g") > >>> sc = SparkContext(conf=conf) > {noformat} > If you look at the JVM launched by the pyspark code, it ignores the user's > configuration: > {noformat} > $ ps ax | grep $(pgrep -f SparkSubmit) > 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g > -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell > {noformat} > Note the "1g" of memory. If instead you use "pyspark", you get the correct > "4g" in the JVM. > This also affects other configs; for example, you can't really add jars to > the driver's classpath using "spark.jars". 
> You can work around this by setting the undocumented env variable Spark > itself uses: > {noformat} > $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python > Python 2.7.12 (default, Jul 1 2016, 15:12:24) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import os > >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf > >>> spark.driver.memory=4g" > >>> from pyspark import SparkContext > >>> sc = SparkContext() > {noformat} > But it would be nicer if the configs were automatically propagated. > BTW the reason for this is that the {{launch_gateway}} function used to start > the JVM does not take any parameters, and the only place where it reads > arguments for Spark is that env variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17460) Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative
[ https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Perluss updated SPARK-17460: -- Affects Version/s: 2.0.0 > Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being > negative > - > > Key: SPARK-17460 > URL: https://issues.apache.org/jira/browse/SPARK-17460 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Spark 2.0 in local mode as well as on GoogleDataproc >Reporter: Chris Perluss > > Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes > in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0. > The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of > datatype Int. In my dataset, there is an Array column whose data size > exceeds the limits of an Int and so the data size becomes negative. > The issue can be repeated by running this code in REPL: > val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that > long of a string")).toDS() > // You might have to remove private[sql] from Dataset.logicalPlan to get this > to work > val stats = ds.logicalPlan.statistics > yields > stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = > Statistics(-1890686892,false) > This causes joinWith to perform a broadcast join even though my > data is gigabytes in size, which of course causes the executors to run out of > memory. > Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the > logicalPlan.statistics.sizeInBytes is a large negative number and thus it is > less than the join threshold of -1. > I've been able to work around this issue by setting > autoBroadcastJoinThreshold to a very large negative number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
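The sign flip described above is classic signed 32-bit overflow: a per-element size times a large element count wraps around when accumulated in an Int. A small Python sketch of the wraparound (the element size and count below are made up, not taken from the report):

```python
# Sketch of the Int overflow behind the negative sizeInBytes: once the
# accumulated byte count exceeds 2^31 - 1, a signed 32-bit Int wraps to
# a negative value. The sizes below are invented for illustration.
def to_int32(x):
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

element_size = 64            # hypothetical per-element default size in bytes
num_elements = 50_000_000    # hypothetical large array column
size = to_int32(element_size * num_elements)
print(size)  # negative, so the plan looks "smaller" than any broadcast threshold
```

A negative estimate compares as less than any broadcast threshold, including -1, which is exactly why disabling auto-broadcast with `-1` did not help the reporter.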
[jira] [Closed] (SPARK-17461) Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative
[ https://issues.apache.org/jira/browse/SPARK-17461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Perluss closed SPARK-17461. - Resolution: Duplicate Duplicate of SPARK-17460 > Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being > negative > - > > Key: SPARK-17461 > URL: https://issues.apache.org/jira/browse/SPARK-17461 > Project: Spark > Issue Type: Bug > Environment: Spark 2.0 in local mode as well as on GoogleDataproc >Reporter: Chris Perluss > > Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes > in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0. > The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of > datatype Int. In my dataset, there is an Array column whose data size > exceeds the limits of an Int and so the data size becomes negative. > The issue can be repeated by running this code in REPL: > val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that > long of a string")).toDS() > // You might have to remove private[sql] from Dataset.logicalPlan to get this > to work > val stats = ds.logicalPlan.statistics > yields > stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = > Statistics(-1890686892,false) > This causes joinWith to perform a broadcast join even though my > data is gigabytes in size, which of course causes the executors to run out of > memory. > Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the > logicalPlan.statistics.sizeInBytes is a large negative number and thus it is > less than the join threshold of -1. > I've been able to work around this issue by setting > autoBroadcastJoinThreshold to a very large negative number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17461) Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative
Chris Perluss created SPARK-17461: - Summary: Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative Key: SPARK-17461 URL: https://issues.apache.org/jira/browse/SPARK-17461 Project: Spark Issue Type: Bug Environment: Spark 2.0 in local mode as well as on GoogleDataproc Reporter: Chris Perluss Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0. The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of datatype Int. In my dataset, there is an Array column whose data size exceeds the limits of an Int and so the data size becomes negative. The issue can be repeated by running this code in REPL: val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that long of a string")).toDS() // You might have to remove private[sql] from Dataset.logicalPlan to get this to work val stats = ds.logicalPlan.statistics yields stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(-1890686892,false) This causes joinWith to perform a broadcast join even though my data is gigabytes in size, which of course causes the executors to run out of memory. Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the logicalPlan.statistics.sizeInBytes is a large negative number and thus it is less than the join threshold of -1. I've been able to work around this issue by setting autoBroadcastJoinThreshold to a very large negative number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17460) Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative
Chris Perluss created SPARK-17460: - Summary: Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative Key: SPARK-17460 URL: https://issues.apache.org/jira/browse/SPARK-17460 Project: Spark Issue Type: Bug Environment: Spark 2.0 in local mode as well as on GoogleDataproc Reporter: Chris Perluss Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0. The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of datatype Int. In my dataset, there is an Array column whose data size exceeds the limits of an Int and so the data size becomes negative. The issue can be repeated by running this code in REPL: val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that long of a string")).toDS() // You might have to remove private[sql] from Dataset.logicalPlan to get this to work val stats = ds.logicalPlan.statistics yields stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(-1890686892,false) This causes joinWith to perform a broadcast join even though my data is gigabytes in size, which of course causes the executors to run out of memory. Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the logicalPlan.statistics.sizeInBytes is a large negative number and thus it is less than the join threshold of -1. I've been able to work around this issue by setting autoBroadcastJoinThreshold to a very large negative number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17459) Add Linear Discriminant to dimensionality reduction algorithms
Joshua Howard created SPARK-17459: - Summary: Add Linear Discriminant to dimensionality reduction algorithms Key: SPARK-17459 URL: https://issues.apache.org/jira/browse/SPARK-17459 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Joshua Howard Priority: Minor The goal is to add linear discriminant analysis as a method of dimensionality reduction. The algorithm and code are very similar to PCA, but instead project the data set onto vectors that provide class separation. LDA is a more effective alternative to PCA in terms of preprocessing for classification algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
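The projection idea can be shown with a deliberately simplified sketch: project points onto the direction between the class means. Full LDA weights this direction by the inverse within-class scatter matrix, Sw^-1 (m1 - m0); identity covariance is assumed here to keep the sketch short, and all data is invented.

```python
# Deliberately simplified illustration of LDA-style dimensionality
# reduction: project onto the between-class mean difference. Real LDA
# multiplies by the inverse within-class scatter matrix; identity
# covariance is assumed here for brevity.
def lda_direction(class0, class1):
    m0 = [sum(c) / len(class0) for c in zip(*class0)]  # mean of class 0
    m1 = [sum(c) / len(class1) for c in zip(*class1)]  # mean of class 1
    return [a - b for a, b in zip(m1, m0)]

def project(point, w):
    # 1-D projection: dot product with the discriminant direction.
    return sum(p * wi for p, wi in zip(point, w))

w = lda_direction([(0.0, 1.0), (1.0, 0.0)], [(4.0, 1.0), (5.0, 0.0)])
print(w)  # → [4.0, 0.0]
```

Unlike PCA, which keeps directions of maximum variance regardless of labels, the direction here is chosen to separate the classes, which is why LDA can be the better preprocessing step for classifiers.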
[jira] [Resolved] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525
[ https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17405. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15016 [https://github.com/apache/spark/pull/15016] > Simple aggregation query OOMing after SPARK-16525 > - > > Key: SPARK-17405 > URL: https://issues.apache.org/jira/browse/SPARK-17405 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Eric Liang >Priority: Blocker > Fix For: 2.1.0 > > > Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the > following query ran fine via Beeline / Thrift Server and the Spark shell, but > after that patch it is consistently OOMING: > {code} > CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, > smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, > int_col_9, string_col_10) AS ( > SELECT * FROM (VALUES > (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), > TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, > TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'), > (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 > 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS > INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'), > (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 > 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 > 00:00:00.0'), '211', -959, CAST(NULL AS STRING)), > (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), > CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', > CAST(NULL AS INT), '936'), > (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 > 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, > TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'), > (CAST(-306.138230875 AS DOUBLE), true, 
TIMESTAMP('1997-10-07 > 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 > 00:00:00.0'), '-345', 566, '-574'), > (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 > 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), > TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'), > (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), > CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), > TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142') > ) as t); > CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, > boolean_col_4, timestamp_col_5, decimal3317_col_6) AS ( > SELECT * FROM (VALUES > ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), > false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))), > ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), > true, CAST(NULL AS TIMESTAMP), -653.51000BD), > ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), > false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD), > ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), > false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD), > ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), > false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD), > ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), > false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD) > ) as t); > SELECT > CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col, > LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS > STRING)) AS decimal_col, > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col > FROM table_3 t1 > INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND > ((t2.string_col_10) = 
(t1.string_col_1))) AND ((t2.string_col_10) = > (t1.string_col_1)) > WHERE > (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9) > GROUP BY > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9); > {code} > Here's the OOM: > {code} > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 > (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 > bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100) > at
[jira] [Commented] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf
[ https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475253#comment-15475253 ] Marcelo Vanzin commented on SPARK-17387: Yeah, that's what I mean. Running the pyspark shell, or using spark-submit to run a python script, correctly preserves the user's configuration. BTW my workaround above is wrong - "pyspark-shell" needs to be the last argument in "PYSPARK_SUBMIT_ARGS", not the first. > Creating SparkContext() from python without spark-submit ignores user conf > -- > > Key: SPARK-17387 > URL: https://issues.apache.org/jira/browse/SPARK-17387 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Consider the following scenario: user runs a python application not through > spark-submit, but by adding the pyspark module and manually creating a Spark > context. Kinda like this: > {noformat} > $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python > Python 2.7.12 (default, Jul 1 2016, 15:12:24) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> from pyspark import SparkContext > >>> from pyspark import SparkConf > >>> conf = SparkConf().set("spark.driver.memory", "4g") > >>> sc = SparkContext(conf=conf) > {noformat} > If you look at the JVM launched by the pyspark code, it ignores the user's > configuration: > {noformat} > $ ps ax | grep $(pgrep -f SparkSubmit) > 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g > -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell > {noformat} > Note the "1g" of memory. If instead you use "pyspark", you get the correct > "4g" in the JVM. > This also affects other configs; for example, you can't really add jars to > the driver's classpath using "spark.jars". 
> You can work around this by setting the undocumented env variable Spark > itself uses: > {noformat} > $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python > Python 2.7.12 (default, Jul 1 2016, 15:12:24) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import os > >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf > >>> spark.driver.memory=4g" > >>> from pyspark import SparkContext > >>> sc = SparkContext() > {noformat} > But it would be nicer if the configs were automatically propagated. > BTW the reason for this is that the {{launch_gateway}} function used to start > the JVM does not take any parameters, and the only place where it reads > arguments for Spark is that env variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
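Combining the workaround above with the correction in the later comment (that "pyspark-shell" must be the *last* argument in PYSPARK_SUBMIT_ARGS, not the first), a corrected version of the workaround might look like the sketch below. This only shows how the environment variable has to be assembled; it assumes pyspark is already on PYTHONPATH, and it must run before the SparkContext is created.

```python
import os

# Spark's launcher reads extra spark-submit arguments from this undocumented
# environment variable. Per the correction above, the "pyspark-shell" token
# must come LAST, after any --conf flags.
os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=4g pyspark-shell"

# The variable must be set before the JVM gateway is launched, i.e. before:
#   from pyspark import SparkContext
#   sc = SparkContext()
tokens = os.environ['PYSPARK_SUBMIT_ARGS'].split()
```

With the tokens in this order, launch_gateway passes the --conf pair through to spark-submit and the driver JVM gets the requested 4g heap.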
[jira] [Commented] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf
[ https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475242#comment-15475242 ] Bryan Cutler commented on SPARK-17387: -- [~vanzin] you said if you use PySpark you could get the correct "4g" memory option, but I was only able to do this through the command line like this {noformat} $> bin/pyspark --conf spark.driver.memory=4g {noformat} Is that what you meant? I think just adding the command line confs to configure the JVM through a plain Python shell would be a simple fix, and still be inline with how the Scala spark-shell works too. > Creating SparkContext() from python without spark-submit ignores user conf > -- > > Key: SPARK-17387 > URL: https://issues.apache.org/jira/browse/SPARK-17387 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Consider the following scenario: user runs a python application not through > spark-submit, but by adding the pyspark module and manually creating a Spark > context. Kinda like this: > {noformat} > $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python > Python 2.7.12 (default, Jul 1 2016, 15:12:24) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> from pyspark import SparkContext > >>> from pyspark import SparkConf > >>> conf = SparkConf().set("spark.driver.memory", "4g") > >>> sc = SparkContext(conf=conf) > {noformat} > If you look at the JVM launched by the pyspark code, it ignores the user's > configuration: > {noformat} > $ ps ax | grep $(pgrep -f SparkSubmit) > 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g > -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell > {noformat} > Note the "1g" of memory. If instead you use "pyspark", you get the correct > "4g" in the JVM. 
> This also affects other configs; for example, you can't really add jars to > the driver's classpath using "spark.jars". > You can work around this by setting the undocumented env variable Spark > itself uses: > {noformat} > $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python > Python 2.7.12 (default, Jul 1 2016, 15:12:24) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import os > >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf > >>> spark.driver.memory=4g" > >>> from pyspark import SparkContext > >>> sc = SparkContext() > {noformat} > But it would be nicer if the configs were automatically propagated. > BTW the reason for this is that the {{launch_gateway}} function used to start > the JVM does not take any parameters, and the only place where it reads > arguments for Spark is that env variable.
[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Somepalli updated SPARK-17458: --- Description: When using pivot and multiple aggregations we need to alias to avoid special characters, but alias does not help because df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| Expected Output ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| One approach you can fix this issue is to change the class sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala and change the outputName method in {code} object ResolvePivot extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { {code} {code} def outputName(value: Literal, aggregate: Expression): String = { val suffix = aggregate match { case n: NamedExpression => aggregate.asInstanceOf[NamedExpression].name case _ => aggregate.sql } if (singleAgg) value.toString else value + "_" + suffix } {code} Version : 2.0.0 {code} def outputName(value: Literal, aggregate: Expression): String = { if (singleAgg) value.toString else value + "_" + aggregate.sql } {code} was: When using pivot and multiple aggregations we need to alias to avoid special characters, but alias does not help because df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| Expected Output ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| One approach you can fix this issue is to change the class 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala and change the outputName method in {code} object ResolvePivot extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { {code} {code} def outputName(value: Literal, aggregate: Expression): String = { val suffix = aggregate match { case n: NamedExpression => aggregate.asInstanceOf[NamedExpression].name case _ => aggregate.sql } if (singleAgg) value.toString else value + "_" + suffix } {code} Version : 2.0.0 {code} def outputName(value: Literal, aggregate: Expression): String = { if (singleAgg) value.toString else value + "_" + aggregate.sql } {code? > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val
[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Somepalli updated SPARK-17458: --- Description: When using pivot and multiple aggregations we need to alias to avoid special characters, but alias does not help because df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| Expected Output ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| One approach you can fix this issue is to change the class sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala and change the outputName method in {code} object ResolvePivot extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { {code} {code} def outputName(value: Literal, aggregate: Expression): String = { val suffix = aggregate match { case n: NamedExpression => aggregate.asInstanceOf[NamedExpression].name case _ => aggregate.sql } if (singleAgg) value.toString else value + "_" + suffix } {code} Version : 2.0.0 {code} def outputName(value: Literal, aggregate: Expression): String = { if (singleAgg) value.toString else value + "_" + aggregate.sql } {code? 
was: When using pivot and multiple aggregations we need to alias to avoid special characters, but alias does not help because df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| Expected Output ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| One approach you can fix this issue is to change the class sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala and change the outputName method in {code} object ResolvePivot extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { {code} {code} def outputName(value: Literal, aggregate: Expression): String = { val suffix = aggregate match { case n: NamedExpression => aggregate.asInstanceOf[NamedExpression].name case _ => aggregate.sql } if (singleAgg) value.toString else value + "_" + suffix } {code} > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and 
change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val suffix = aggregate match { > case n: NamedExpression => > aggregate.asInstanceOf[NamedExpression].name > case _ => aggregate.sql >} >
[jira] [Created] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
Ravi Somepalli created SPARK-17458: -- Summary: Alias specified for aggregates in a pivot are not honored Key: SPARK-17458 URL: https://issues.apache.org/jira/browse/SPARK-17458 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Ravi Somepalli When using pivot and multiple aggregations we need to alias to avoid special characters, but alias does not help because df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| Expected Output ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || |small| 5.5| two|2.3335| two| |large| 5.5| two| 2.0| one| One approach you can fix this issue is to change the class sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala and change the outputName method in {code} object ResolvePivot extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { {code} {code} def outputName(value: Literal, aggregate: Expression): String = { val suffix = aggregate match { case n: NamedExpression => aggregate.asInstanceOf[NamedExpression].name case _ => aggregate.sql } if (singleAgg) value.toString else value + "_" + suffix } {code}
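The naming logic the reporter proposes can be illustrated with a small stand-in in Python. This is a sketch of the proposed `outputName` behavior only, not the actual Catalyst code; the function and parameter names here are assumptions for illustration.

```python
def output_name(value, aggregate_sql, alias=None, single_agg=False):
    """Build a pivoted column name.

    Mirrors the proposed fix: if the aggregate carries a user alias
    (a NamedExpression in Catalyst), use it as the suffix; otherwise
    fall back to the aggregate's SQL text, which is what produces the
    unwieldy names reported above.
    """
    if single_agg:
        return str(value)
    suffix = alias if alias is not None else aggregate_sql
    return "{}_{}".format(value, suffix)

# Current 2.0.0 behavior: the full SQL of the aggregate leaks into the name.
current = output_name("bar", "avg(`D`) AS `COLD`")
# Proposed behavior: honor the alias given via .as("COLD").
proposed = output_name("bar", "avg(`D`) AS `COLD`", alias="COLD")
```

The only behavioral change is which suffix is chosen when the aggregate expression is named, which turns `bar_avg(`D`) AS `COLD`` into `bar_COLD`.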
[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs
[ https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nic Eggert updated SPARK-17455: --- Priority: Minor (was: Major) > IsotonicRegression takes non-polynomial time for some inputs > > > Key: SPARK-17455 > URL: https://issues.apache.org/jira/browse/SPARK-17455 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 >Reporter: Nic Eggert >Priority: Minor > > The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently > in MLlib can take O(N!) time for certain inputs, when it should have > worst-case complexity of O(N^2). > To reproduce this, I pulled the private method poolAdjacentViolators out of > mllib.regression.IsotonicRegression and into a benchmarking harness. > Given this input > {code} > val x = (1 to length).toArray.map(_.toDouble) > val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 > else yi} > val w = Array.fill(length)(1d) > val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, > x), w) => (y, x, w)} > {code} > I vary the length of the input to get these timings: > || Input Length || Time (us) || > | 100 | 1.35 | > | 200 | 3.14 | > | 400 | 116.10 | > | 800 | 2134225.90 | > (tests were performed using > https://github.com/sirthias/scala-benchmarking-template) > I can also confirm that I run into this issue on a real dataset I'm working > on when trying to calibrate random forest probability output. Some partitions > take > 12 hours to run. This isn't a skew issue, since the largest partitions > finish in minutes. I can only assume that some partitions cause something > approaching this worst-case complexity. > I'm working on a patch that borrows the implementation that is used in > scikit-learn and the R "iso" package, both of which handle this particular > input in linear time and are quadratic in the worst case. 
[jira] [Assigned] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs
[ https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17455: Assignee: (was: Apache Spark) > IsotonicRegression takes non-polynomial time for some inputs > > > Key: SPARK-17455 > URL: https://issues.apache.org/jira/browse/SPARK-17455 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 >Reporter: Nic Eggert > > The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently > in MLlib can take O(N!) time for certain inputs, when it should have > worst-case complexity of O(N^2). > To reproduce this, I pulled the private method poolAdjacentViolators out of > mllib.regression.IsotonicRegression and into a benchmarking harness. > Given this input > {code} > val x = (1 to length).toArray.map(_.toDouble) > val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 > else yi} > val w = Array.fill(length)(1d) > val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, > x), w) => (y, x, w)} > {code} > I vary the length of the input to get these timings: > || Input Length || Time (us) || > | 100 | 1.35 | > | 200 | 3.14 | > | 400 | 116.10 | > | 800 | 2134225.90 | > (tests were performed using > https://github.com/sirthias/scala-benchmarking-template) > I can also confirm that I run into this issue on a real dataset I'm working > on when trying to calibrate random forest probability output. Some partitions > take > 12 hours to run. This isn't a skew issue, since the largest partitions > finish in minutes. I can only assume that some partitions cause something > approaching this worst-case complexity. > I'm working on a patch that borrows the implementation that is used in > scikit-learn and the R "iso" package, both of which handle this particular > input in linear time and are quadratic in the worst case. 
[jira] [Assigned] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs
[ https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17455: Assignee: Apache Spark > IsotonicRegression takes non-polynomial time for some inputs > > > Key: SPARK-17455 > URL: https://issues.apache.org/jira/browse/SPARK-17455 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 >Reporter: Nic Eggert >Assignee: Apache Spark > > The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently > in MLlib can take O(N!) time for certain inputs, when it should have > worst-case complexity of O(N^2). > To reproduce this, I pulled the private method poolAdjacentViolators out of > mllib.regression.IsotonicRegression and into a benchmarking harness. > Given this input > {code} > val x = (1 to length).toArray.map(_.toDouble) > val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 > else yi} > val w = Array.fill(length)(1d) > val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, > x), w) => (y, x, w)} > {code} > I vary the length of the input to get these timings: > || Input Length || Time (us) || > | 100 | 1.35 | > | 200 | 3.14 | > | 400 | 116.10 | > | 800 | 2134225.90 | > (tests were performed using > https://github.com/sirthias/scala-benchmarking-template) > I can also confirm that I run into this issue on a real dataset I'm working > on when trying to calibrate random forest probability output. Some partitions > take > 12 hours to run. This isn't a skew issue, since the largest partitions > finish in minutes. I can only assume that some partitions cause something > approaching this worst-case complexity. > I'm working on a patch that borrows the implementation that is used in > scikit-learn and the R "iso" package, both of which handle this particular > input in linear time and are quadratic in the worst case. 
[jira] [Commented] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf
[ https://issues.apache.org/jira/browse/SPARK-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475156#comment-15475156 ] Ryan Blue commented on SPARK-17302: --- In 1.6.x, Spark pulled session config for Hive from a {{HiveConf}}, for example [Spark respects|https://github.com/apache/spark/blob/v1.6.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L167] {{hive.exec.dynamic.partition.mode}}. This could be set in hive-conf.xml or in spark-defaults.conf, though the latter had to use {{spark.hadoop...}} to set it. Now that the {{SQLConf}} is used instead of {{HiveConf}}, you lose both of those methods of setting Hive configuration variables. Using {{spark.conf.set("key", "value")}} works for users, but hive-site.xml defaults are no longer respected and an administrator can no longer set default values. > Cannot set non-Spark SQL session variables in hive-site.xml, > spark-defaults.conf, or using --conf > - > > Key: SPARK-17302 > URL: https://issues.apache.org/jira/browse/SPARK-17302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue > > When configuration changed for 2.0 to the new SparkSession structure, Spark > stopped using Hive's internal HiveConf for session state and now uses > HiveSessionState and an associated SQLConf. Now, session options like > hive.exec.compress.output and hive.exec.dynamic.partition.mode are pulled > from this SQLConf. This doesn't include session properties from hive-site.xml > (including hive.exec.compress.output), and no longer contains Spark-specific > overrides from spark-defaults.conf that used the spark.hadoop.hive... pattern. > Also, setting these variables on the command-line no longer works because > settings must start with "spark.". > Is there a recommended way to set Hive session properties? 
[jira] [Commented] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs
[ https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475158#comment-15475158 ] Apache Spark commented on SPARK-17455: -- User 'neggert' has created a pull request for this issue: https://github.com/apache/spark/pull/15018 > IsotonicRegression takes non-polynomial time for some inputs > > > Key: SPARK-17455 > URL: https://issues.apache.org/jira/browse/SPARK-17455 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 >Reporter: Nic Eggert > > The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently > in MLlib can take O(N!) time for certain inputs, when it should have > worst-case complexity of O(N^2). > To reproduce this, I pulled the private method poolAdjacentViolators out of > mllib.regression.IsotonicRegression and into a benchmarking harness. > Given this input > {code} > val x = (1 to length).toArray.map(_.toDouble) > val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 > else yi} > val w = Array.fill(length)(1d) > val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, > x), w) => (y, x, w)} > {code} > I vary the length of the input to get these timings: > || Input Length || Time (us) || > | 100 | 1.35 | > | 200 | 3.14 | > | 400 | 116.10 | > | 800 | 2134225.90 | > (tests were performed using > https://github.com/sirthias/scala-benchmarking-template) > I can also confirm that I run into this issue on a real dataset I'm working > on when trying to calibrate random forest probability output. Some partitions > take > 12 hours to run. This isn't a skew issue, since the largest partitions > finish in minutes. I can only assume that some partitions cause something > approaching this worst-case complexity. 
> I'm working on a patch that borrows the implementation that is used in > scikit-learn and the R "iso" package, both of which handle this particular > input in linear time and are quadratic in the worst case.
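For reference, the stack-based pool-adjacent-violators formulation used by scikit-learn-style implementations can be sketched in a few lines of Python. This is an illustrative implementation under the ticket's setup, not the MLlib code or the PR: each point starts as its own block, and adjacent blocks are merged (weighted-averaged) while they violate monotonicity, so every point is pushed once and every merge removes a block.

```python
def pava(y, w=None):
    """Isotonic (non-decreasing) regression via pool adjacent violators.

    Runs in O(n) over the merges: each input point is pushed exactly once
    and every merge permanently removes a block.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block: [weighted mean, total weight, number of points pooled]
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two blocks are out of order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m1, w1, c1 = blocks.pop()
            m0, w0, c0 = blocks.pop()
            wt = w0 + w1
            blocks.append([(m0 * w0 + m1 * w1) / wt, wt, c0 + c1])
    # Expand each block back to one fitted value per input point.
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

# The adversarial input from the ticket (length 800): a descending sequence
# with every other element nudged down by 1.5.
length = 800
x = [float(i) for i in range(1, length + 1)]
y = [yi - 1.5 if i % 2 == 1 else yi for i, yi in enumerate(reversed(x))]
fit = pava(y)
```

On this input the whole sequence pools into a single constant block (the overall mean), and the run stays linear rather than exploding the way the timings in the table above show.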
[jira] [Updated] (SPARK-17457) Spark SQL shows poor performance for group by and sort by on multiple columns
[ https://issues.apache.org/jira/browse/SPARK-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sabyasachi Nayak updated SPARK-17457: - Description: In one of the use cases when we are running one hive query with Tez it is taking 45 minutes. But the same query when I am running in Spark SQL using hivecontext it is taking more than 2 hours. This query has no joins, only group by and sort by on multiple columns. spark-submit --class DataLoadingSpark --master yarn --deploy-mode client --num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores 5 --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 --conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m -Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio DataLoadingSpark.jar --inputFile basket_txn. Spark UI shows Input is 500+ GB and Shuffle write is also 500+GB Spark version - 1.4.0 HDP 2.3.2.0-2950 50 node cluster 1100 Vcores was: In one of the use cases when we are running one hive query with Tez it is taking 45 minutes. But the same query when I am running in Spark SQL using hivecontext it is taking more than 2 hours. This query has no joins, only group by and sort by on multiple columns. spark-submit --class DataLoadingSpark --master yarn --deploy-mode client --num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores 5 --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 --conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m -Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio DataLoadingSpark.jar --inputFile basket_txn.
Spark version - 1.4.0 HDP 2.3.2.0-2950 50 node cluster 1100 Vcores > Spark SQL shows poor performance for group by and sort by on multiple columns > -- > > Key: SPARK-17457 > URL: https://issues.apache.org/jira/browse/SPARK-17457 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.0 >Reporter: Sabyasachi Nayak > > In one of the use cases when we are running one hive query with Tez it is > taking 45 minutes. But the same query when I am running in Spark SQL using > hivecontext it is taking more than 2 hours. This query has no joins, only group > by and sort by on multiple columns. > spark-submit --class DataLoadingSpark --master yarn --deploy-mode client > --num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores > 5 --conf spark.yarn.executor.memoryOverhead=2048 --conf > spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 > --conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf > --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m > -Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio > DataLoadingSpark.jar --inputFile basket_txn. > Spark version - 1.4.0 > HDP 2.3.2.0-2950 > 50 node cluster 1100 Vcores
[jira] [Updated] (SPARK-17457) Spark SQL shows poor performance for group by and sort by on multiple columns
[ https://issues.apache.org/jira/browse/SPARK-17457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sabyasachi Nayak updated SPARK-17457: - Summary: Spark SQL shows poor performance for group by and sort by on multiple columns (was: Spark SQL shows poor performance for group by on multiple columns ) > Spark SQL shows poor performance for group by and sort by on multiple columns > -- > > Key: SPARK-17457 > URL: https://issues.apache.org/jira/browse/SPARK-17457 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.0 >Reporter: Sabyasachi Nayak > > In one of the use cases when we are running one hive query with Tez it is > taking 45 minutes. But the same query when I am running in Spark SQL using > hivecontext it is taking more than 2 hours. This query has no joins, only group > by and sort by on multiple columns. > spark-submit --class DataLoadingSpark --master yarn --deploy-mode client > --num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores > 5 --conf spark.yarn.executor.memoryOverhead=2048 --conf > spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 > --conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf > --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m > -Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio > DataLoadingSpark.jar --inputFile basket_txn. > Spark version - 1.4.0 > HDP 2.3.2.0-2950 > 50 node cluster 1100 Vcores
[jira] [Created] (SPARK-17457) Spark SQL shows poor performance for group by on multiple columns
Sabyasachi Nayak created SPARK-17457: Summary: Spark SQL shows poor performance for group by on multiple columns Key: SPARK-17457 URL: https://issues.apache.org/jira/browse/SPARK-17457 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Sabyasachi Nayak In one of the use cases when we are running one hive query with Tez it is taking 45 minutes. But the same query when I am running in Spark SQL using hivecontext it is taking more than 2 hours. This query has no joins, only group by and sort by on multiple columns. spark-submit --class DataLoadingSpark --master yarn --deploy-mode client --num-executors 60 --executor-memory 16G --driver-memory 4G --executor-cores 5 --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.shuffle.consolidateFiles=true --conf spark.shuffle.memoryFraction=0.5 --conf spark.storage.memoryFraction=.1 --conf spark.io.compression.codec=lzf --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=1024m -XX:PermSize=512m -Dhdp.version=2.3.2.0-2950" --conf spark.shuffle.blockTransferService=nio DataLoadingSpark.jar --inputFile basket_txn. Spark version - 1.4.0 HDP 2.3.2.0-2950 50 node cluster 1100 Vcores
[jira] [Updated] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525
[ https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17405: --- Assignee: Eric Liang > Simple aggregation query OOMing after SPARK-16525 > - > > Key: SPARK-17405 > URL: https://issues.apache.org/jira/browse/SPARK-17405 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Eric Liang >Priority: Blocker > > Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the > following query ran fine via Beeline / Thrift Server and the Spark shell, but > after that patch it is consistently OOMING: > {code} > CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, > smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, > int_col_9, string_col_10) AS ( > SELECT * FROM (VALUES > (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), > TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, > TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'), > (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 > 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS > INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'), > (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 > 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 > 00:00:00.0'), '211', -959, CAST(NULL AS STRING)), > (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), > CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', > CAST(NULL AS INT), '936'), > (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 > 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, > TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'), > (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 > 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 > 00:00:00.0'), '-345', 566, 
'-574'), > (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 > 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), > TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'), > (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), > CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), > TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142') > ) as t); > CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, > boolean_col_4, timestamp_col_5, decimal3317_col_6) AS ( > SELECT * FROM (VALUES > ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), > false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))), > ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), > true, CAST(NULL AS TIMESTAMP), -653.51000BD), > ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), > false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD), > ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), > false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD), > ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), > false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD), > ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), > false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD) > ) as t); > SELECT > CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col, > LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS > STRING)) AS decimal_col, > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col > FROM table_3 t1 > INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND > ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = > (t1.string_col_1)) > WHERE > (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9) > 
GROUP BY > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9); > {code} > Here's the OOM: > {code} > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 > (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 > bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100) > at > org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:783) > at >
[jira] [Assigned] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525
[ https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17405: Assignee: (was: Apache Spark) > Simple aggregation query OOMing after SPARK-16525 > - > > Key: SPARK-17405 > URL: https://issues.apache.org/jira/browse/SPARK-17405 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the > following query ran fine via Beeline / Thrift Server and the Spark shell, but > after that patch it is consistently OOMING: > {code} > CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, > smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, > int_col_9, string_col_10) AS ( > SELECT * FROM (VALUES > (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), > TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, > TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'), > (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 > 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS > INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'), > (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 > 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 > 00:00:00.0'), '211', -959, CAST(NULL AS STRING)), > (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), > CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', > CAST(NULL AS INT), '936'), > (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 > 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, > TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'), > (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 > 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 > 00:00:00.0'), '-345', 566, '-574'), > 
(CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 > 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), > TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'), > (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), > CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), > TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142') > ) as t); > CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, > boolean_col_4, timestamp_col_5, decimal3317_col_6) AS ( > SELECT * FROM (VALUES > ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), > false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))), > ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), > true, CAST(NULL AS TIMESTAMP), -653.51000BD), > ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), > false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD), > ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), > false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD), > ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), > false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD), > ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), > false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD) > ) as t); > SELECT > CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col, > LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS > STRING)) AS decimal_col, > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col > FROM table_3 t1 > INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND > ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = > (t1.string_col_1)) > WHERE > (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9) > GROUP BY > 
COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9); > {code} > Here's the OOM: > {code} > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 > (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 > bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100) > at > org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:783) > at >
[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525
[ https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475040#comment-15475040 ] Apache Spark commented on SPARK-17405: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15016 > Simple aggregation query OOMing after SPARK-16525 > - > > Key: SPARK-17405 > URL: https://issues.apache.org/jira/browse/SPARK-17405 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the > following query ran fine via Beeline / Thrift Server and the Spark shell, but > after that patch it is consistently OOMING: > {code} > CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, > smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, > int_col_9, string_col_10) AS ( > SELECT * FROM (VALUES > (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), > TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, > TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'), > (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 > 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS > INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'), > (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 > 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 > 00:00:00.0'), '211', -959, CAST(NULL AS STRING)), > (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), > CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', > CAST(NULL AS INT), '936'), > (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 > 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, > TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'), > (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 > 
00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 > 00:00:00.0'), '-345', 566, '-574'), > (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 > 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), > TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'), > (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), > CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), > TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142') > ) as t); > CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, > boolean_col_4, timestamp_col_5, decimal3317_col_6) AS ( > SELECT * FROM (VALUES > ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), > false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))), > ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), > true, CAST(NULL AS TIMESTAMP), -653.51000BD), > ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), > false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD), > ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), > false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD), > ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), > false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD), > ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), > false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD) > ) as t); > SELECT > CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col, > LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS > STRING)) AS decimal_col, > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col > FROM table_3 t1 > INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND > ((t2.string_col_10) = (t1.string_col_1))) AND 
((t2.string_col_10) = > (t1.string_col_1)) > WHERE > (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9) > GROUP BY > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9); > {code} > Here's the OOM: > {code} > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 > (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 > bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100) > at >
[jira] [Assigned] (SPARK-17456) Utility for parsing Spark versions
[ https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17456: Assignee: Apache Spark (was: Joseph K. Bradley) > Utility for parsing Spark versions > -- > > Key: SPARK-17456 > URL: https://issues.apache.org/jira/browse/SPARK-17456 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > There are many hacks within Spark's codebase to identify and compare Spark > versions. We should add a simple utility to standardize these code paths, > especially since there have been mistakes made in the past. This will let us > add unit tests as well. This initial patch will only add methods for > extracting major and minor versions as Int types in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
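The kind of helper described above can be sketched as follows — a minimal illustration in Python (the actual utility would be Scala in Spark's codebase, and the function name here is a hypothetical stand-in):

```python
import re


def major_minor_version(version: str):
    """Extract (major, minor) as ints from a version string such as
    '2.1.0' or '2.0.0-SNAPSHOT'; raise ValueError on unparseable input."""
    m = re.match(r"^(\d+)\.(\d+)(\..*)?$", version)
    if m is None:
        raise ValueError("cannot parse version string: %r" % version)
    return int(m.group(1)), int(m.group(2))
```

Centralizing the parsing like this is what makes the code paths unit-testable, which is the point of the issue.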
[jira] [Commented] (SPARK-17456) Utility for parsing Spark versions
[ https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475041#comment-15475041 ] Apache Spark commented on SPARK-17456: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/15017 > Utility for parsing Spark versions > -- > > Key: SPARK-17456 > URL: https://issues.apache.org/jira/browse/SPARK-17456 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > There are many hacks within Spark's codebase to identify and compare Spark > versions. We should add a simple utility to standardize these code paths, > especially since there have been mistakes made in the past. This will let us > add unit tests as well. This initial patch will only add methods for > extracting major and minor versions as Int types in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525
[ https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17405: Assignee: Apache Spark > Simple aggregation query OOMing after SPARK-16525 > - > > Key: SPARK-17405 > URL: https://issues.apache.org/jira/browse/SPARK-17405 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > > Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the > following query ran fine via Beeline / Thrift Server and the Spark shell, but > after that patch it is consistently OOMING: > {code} > CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, > smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, > int_col_9, string_col_10) AS ( > SELECT * FROM (VALUES > (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), > TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, > TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'), > (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 > 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS > INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'), > (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 > 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 > 00:00:00.0'), '211', -959, CAST(NULL AS STRING)), > (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), > CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', > CAST(NULL AS INT), '936'), > (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 > 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, > TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'), > (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 > 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 > 00:00:00.0'), '-345', 
566, '-574'), > (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 > 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), > TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'), > (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), > CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), > TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142') > ) as t); > CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, > boolean_col_4, timestamp_col_5, decimal3317_col_6) AS ( > SELECT * FROM (VALUES > ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), > false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))), > ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), > true, CAST(NULL AS TIMESTAMP), -653.51000BD), > ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), > false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD), > ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), > false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD), > ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), > false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD), > ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), > false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD) > ) as t); > SELECT > CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col, > LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, > t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS > STRING)) AS decimal_col, > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col > FROM table_3 t1 > INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND > ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = > (t1.string_col_1)) > WHERE > (t2.smallint_col_4) IN (t2.int_col_9, 
t2.int_col_9) > GROUP BY > COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9); > {code} > Here's the OOM: > {code} > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 > (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 > bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100) > at > org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:783) > at >
[jira] [Assigned] (SPARK-17456) Utility for parsing Spark versions
[ https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17456: Assignee: Joseph K. Bradley (was: Apache Spark) > Utility for parsing Spark versions > -- > > Key: SPARK-17456 > URL: https://issues.apache.org/jira/browse/SPARK-17456 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > There are many hacks within Spark's codebase to identify and compare Spark > versions. We should add a simple utility to standardize these code paths, > especially since there have been mistakes made in the past. This will let us > add unit tests as well. This initial patch will only add methods for > extracting major and minor versions as Int types in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12452) Add exception details to TaskCompletionListener/TaskContext
[ https://issues.apache.org/jira/browse/SPARK-12452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475027#comment-15475027 ] Neelesh Shastry commented on SPARK-12452: - This was originally filed for 1.5.2, which does not have TaskFailureListener (it's in 2.0.0, I believe) > Add exception details to TaskCompletionListener/TaskContext > --- > > Key: SPARK-12452 > URL: https://issues.apache.org/jira/browse/SPARK-12452 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Neelesh Shastry >Priority: Minor > > TaskCompletionListeners are called without success/failure details. > If we change this > {code} > trait TaskCompletionListener extends EventListener { > def onTaskCompletion(context: TaskContext) > } > class TaskContextImpl { > > private[spark] def markTaskCompleted(throwable: Option[Throwable]): Unit > > listener.onTaskCompletion(this, throwable) > } > {code} > to something like > {code} > trait TaskCompletionListener extends EventListener { > def onTaskCompletion(context: TaskContext, throwable: Option[Throwable] = None) > } > {code} > .. and in Task.scala > {code} > var throwable: Option[Throwable] = None > try { > runTask(context) > } catch { > case t: Throwable => throwable = Some(t) > } finally { > context.markTaskCompleted(throwable) > TaskContext.unset() > } > {code}
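The pattern proposed in the issue above — running the task body, capturing any failure, and passing it (or nothing, on success) to the completion listeners — can be sketched as follows. This is a minimal language-neutral illustration in Python; the names are hypothetical stand-ins, not Spark's actual API:

```python
class TaskContext:
    """Minimal stand-in for a task context that holds completion listeners."""

    def __init__(self):
        self._listeners = []

    def add_task_completion_listener(self, listener):
        self._listeners.append(listener)

    def mark_task_completed(self, error=None):
        # Hand the failure (or None on success) to every listener,
        # mirroring the Option[Throwable] parameter proposed in the issue.
        for listener in self._listeners:
            listener(self, error)


def run_task(context, task_body):
    """Run task_body, always notifying listeners, with failure details."""
    error = None
    try:
        return task_body()
    except Exception as exc:
        error = exc
        raise
    finally:
        context.mark_task_completed(error)
```

Listeners registered on the context then see `None` for a successful task and the raised exception for a failed one, which is the distinction the issue says is currently missing.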
[jira] [Commented] (SPARK-16525) Enable Row Based HashMap in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475023#comment-15475023 ] Apache Spark commented on SPARK-16525: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15016 > Enable Row Based HashMap in HashAggregateExec > - > > Key: SPARK-16525 > URL: https://issues.apache.org/jira/browse/SPARK-16525 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Qifan Pu > Fix For: 2.1.0 > > > Allow `RowBasedHashMapGenerator` to be used in `HashAggregateExec`, so that > we can turn on codegen for `RowBasedHashMap`.
[jira] [Commented] (SPARK-17446) no total size for data source tables in InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474995#comment-15474995 ] Zhenhua Wang commented on SPARK-17446: -- Ok, I've added the description. Thanks. > no total size for data source tables in InMemoryCatalog > --- > > Key: SPARK-17446 > URL: https://issues.apache.org/jira/browse/SPARK-17446 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Zhenhua Wang > > For a data source table in InMemoryCatalog, its > catalogTable.storage.locationUri is None, so the total size can't be > calculated. But we can use the path parameter in catalogTable.storage.properties > to calculate the size.
[jira] [Updated] (SPARK-17446) no total size for data source tables in InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-17446: - Description: For a data source table in InMemoryCatalog, its catalogTable.storage.locationUri is None, so the total size can't be calculated. But we can use the path parameter in catalogTable.storage.properties to calculate the size. > no total size for data source tables in InMemoryCatalog > --- > > Key: SPARK-17446 > URL: https://issues.apache.org/jira/browse/SPARK-17446 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Zhenhua Wang > > For a data source table in InMemoryCatalog, its > catalogTable.storage.locationUri is None, so the total size can't be > calculated. But we can use the path parameter in catalogTable.storage.properties > to calculate the size.
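The workaround described above — deriving a table's total size from its `path` storage property when `locationUri` is absent — amounts to walking the directory and summing file sizes. A minimal sketch in Python (the helper name is hypothetical; Spark's own implementation would use Hadoop filesystem APIs in Scala):

```python
import os


def total_size_from_path(path: str) -> int:
    """Sum the sizes of all files under `path`, analogous to computing a
    table's total size from its storage `path` property."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```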
[jira] [Commented] (SPARK-17456) Utility for parsing Spark versions
[ https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474917#comment-15474917 ] Joseph K. Bradley commented on SPARK-17456: --- Linking a JIRA which will require this > Utility for parsing Spark versions > -- > > Key: SPARK-17456 > URL: https://issues.apache.org/jira/browse/SPARK-17456 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > There are many hacks within Spark's codebase to identify and compare Spark > versions. We should add a simple utility to standardize these code paths, > especially since there have been mistakes made in the past. This will let us > add unit tests as well. This initial patch will only add methods for > extracting major and minor versions as Int types in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17456) Utility for parsing Spark versions
Joseph K. Bradley created SPARK-17456: - Summary: Utility for parsing Spark versions Key: SPARK-17456 URL: https://issues.apache.org/jira/browse/SPARK-17456 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor There are many hacks within Spark's codebase to identify and compare Spark versions. We should add a simple utility to standardize these code paths, especially since there have been mistakes made in the past. This will let us add unit tests as well. This initial patch will only add methods for extracting major and minor versions as Int types in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11035) Launcher: allow apps to be launched in-process
[ https://issues.apache.org/jira/browse/SPARK-11035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11035: Assignee: (was: Apache Spark) > Launcher: allow apps to be launched in-process > -- > > Key: SPARK-11035 > URL: https://issues.apache.org/jira/browse/SPARK-11035 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin > > The launcher library is currently restricted to launching apps as child > processes. That is fine for a lot of cases, especially if the app is running > in client mode. > But in certain cases, especially launching in cluster mode, it's more > efficient to avoid launching a new process, since that process won't be doing > much. > We should add support for launching apps in process, even if restricted to > cluster mode at first. This will require some rework of the launch paths to > avoid using system properties to propagate configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11035) Launcher: allow apps to be launched in-process
[ https://issues.apache.org/jira/browse/SPARK-11035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11035: Assignee: Apache Spark > Launcher: allow apps to be launched in-process > -- > > Key: SPARK-11035 > URL: https://issues.apache.org/jira/browse/SPARK-11035 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > The launcher library is currently restricted to launching apps as child > processes. That is fine for a lot of cases, especially if the app is running > in client mode. > But in certain cases, especially launching in cluster mode, it's more > efficient to avoid launching a new process, since that process won't be doing > much. > We should add support for launching apps in process, even if restricted to > cluster mode at first. This will require some rework of the launch paths to > avoid using system properties to propagate configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11035) Launcher: allow apps to be launched in-process
[ https://issues.apache.org/jira/browse/SPARK-11035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474872#comment-15474872 ] Apache Spark commented on SPARK-11035: -- User 'kishorvpatil' has created a pull request for this issue: https://github.com/apache/spark/pull/15009 > Launcher: allow apps to be launched in-process > -- > > Key: SPARK-11035 > URL: https://issues.apache.org/jira/browse/SPARK-11035 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin > > The launcher library is currently restricted to launching apps as child > processes. That is fine for a lot of cases, especially if the app is running > in client mode. > But in certain cases, especially launching in cluster mode, it's more > efficient to avoid launching a new process, since that process won't be doing > much. > We should add support for launching apps in process, even if restricted to > cluster mode at first. This will require some rework of the launch paths to > avoid using system properties to propagate configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs
Nic Eggert created SPARK-17455: -- Summary: IsotonicRegression takes non-polynomial time for some inputs Key: SPARK-17455 URL: https://issues.apache.org/jira/browse/SPARK-17455 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.4.1, 1.3.1 Reporter: Nic Eggert The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently in MLlib can take O(N!) time for certain inputs, when it should have worst-case complexity of O(N^2). To reproduce this, I pulled the private method poolAdjacentViolators out of mllib.regression.IsotonicRegression and into a benchmarking harness. Given this input {code} val x = (1 to length).toArray.map(_.toDouble) val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi} val w = Array.fill(length)(1d) val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, x), w) => (y, x, w)} {code} I vary the length of the input to get these timings: || Input Length || Time (us) || | 100 | 1.35 | | 200 | 3.14 | | 400 | 116.10 | | 800 | 2134225.90 | (tests were performed using https://github.com/sirthias/scala-benchmarking-template) I can also confirm that I run into this issue on a real dataset I'm working on when trying to calibrate random forest probability output. Some partitions take > 12 hours to run. This isn't a skew issue, since the largest partitions finish in minutes. I can only assume that some partitions cause something approaching this worst-case complexity. I'm working on a patch that borrows the implementation that is used in scikit-learn and the R "iso" package, both of which handle this particular input in linear time and are quadratic in the worst case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
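The stack-based pool-merging formulation the reporter mentions (used by scikit-learn and handling this input in linear time) is not shown in the issue itself; the following is only a hedged Scala sketch of that general technique, with the illustrative name {{PavaSketch}} rather than the actual MLlib patch:

```scala
// Sketch of pool-adjacent-violators with backtracking merges, assuming the
// input is already sorted by feature value. Each pool tracks
// (weighted sum, total weight, point count); whenever a new point violates
// the non-decreasing constraint, it is merged backwards into earlier pools.
// Every merge removes a pool, so total merge work is bounded by the number
// of pools ever created, giving quadratic worst-case behavior instead of
// the pathological blow-up described in the benchmark above.
object PavaSketch {
  def isotonic(y: Array[Double], w: Array[Double]): Array[Double] = {
    case class Pool(var sum: Double, var weight: Double, var count: Int) {
      def mean: Double = sum / weight
    }
    val pools = scala.collection.mutable.ArrayBuffer.empty[Pool]
    for (i <- y.indices) {
      pools += Pool(y(i) * w(i), w(i), 1)
      // Merge while the non-decreasing constraint is violated.
      while (pools.length > 1 &&
             pools(pools.length - 2).mean > pools.last.mean) {
        val last = pools.remove(pools.length - 1)
        pools.last.sum += last.sum
        pools.last.weight += last.weight
        pools.last.count += last.count
      }
    }
    // Expand each pool back to one fitted value per input point.
    pools.iterator.flatMap(p => Iterator.fill(p.count)(p.mean)).toArray
  }
}
```

For example, the violating sequence (3, 1, 2) with unit weights collapses into a single leading pool and yields the fit (2, 2, 2), while an already-monotone input passes through unchanged.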
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474688#comment-15474688 ] Apache Spark commented on SPARK-16445: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/15015 > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > Fix For: 2.1.0 > > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17454) Add option to specify Mesos resource offer constraints
Chris Bannister created SPARK-17454: --- Summary: Add option to specify Mesos resource offer constraints Key: SPARK-17454 URL: https://issues.apache.org/jira/browse/SPARK-17454 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.0.0 Reporter: Chris Bannister Currently the driver will accept offers from Mesos which have enough RAM for the executor, until its max cores limit is reached. There is no way to control the required CPUs or disk for each executor; it would be very useful to be able to apply something similar to spark.mesos.constraints to resource offers instead of to attributes on the offer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474616#comment-15474616 ] Josh Elser commented on SPARK-17445: bq. I think one part you're missing, Josh, is that spark-packages.org is an index of packages from a wide variety of organizations, where anyone can submit a package. Ah, that is true. I made no effort to understand the curation/policies behind the website (because I really don't care what the owners do -- it's their website/service). I think this is disjoint from my original concern though -- it may be the de facto index of available third-party packages for Apache Spark, but I still think the PMC could do a better job of clearly stating that it is not an "Apache Spark" service. I don't see a downside to my original suggestion (and it sounds like you agree). If it comes down to it, point me at the website source, and I'm happy to submit a change which introduces a new page :) > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17453) Broadcast block already exists in MemoryStore
[ https://issues.apache.org/jira/browse/SPARK-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bannister updated SPARK-17453: Description: Whilst doing a broadcast join we reliably hit this exception, the code worked earlier on in 2.0.0 branch before release, and in 1.6. The data for the join is coming from another RDD which is collected to a Set and then broadcast. This is run in a Mesos cluster. Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1911) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) at org.apache.spark.rdd.RDD.collect(RDD.scala:892) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.IOException: java.lang.IllegalArgumentException: requirement failed: Block broadcast_17_piece0 is already present in the MemoryStore at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:72) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.IllegalArgumentException: requirement failed: Block broadcast_17_piece0 is already present in the MemoryStore at scala.Predef$.require(Predef.scala:224) at org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:144) at org.apache.spark.storage.BlockManager$$anonfun$doPutBytes$1.apply(BlockManager.scala:792) at
[jira] [Created] (SPARK-17453) Broadcast block already exists in MemoryStore
Chris Bannister created SPARK-17453: --- Summary: Broadcast block already exists in MemoryStore Key: SPARK-17453 URL: https://issues.apache.org/jira/browse/SPARK-17453 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 2.0.0 Reporter: Chris Bannister Whilst doing a broadcast join we reliably hit this exception; the code worked earlier on the 2.0.0 branch before release, and in 1.6. The data for the join comes from another RDD which is collected to a Set and then broadcast. This is run in a Mesos cluster. Stacktrace is attached -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv
[ https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474587#comment-15474587 ] Felix Cheung commented on SPARK-17428: -- Agree with the above. And to be clear, packrat still calls install.packages, so this won't change how things are handled with respect to the package directory (the lib parameter to install.packages) or permissions/access: https://github.com/rstudio/packrat/blob/master/R/install.R#L69 We are likely going to prefer having private packages under the application directory in the case of YARN, so they will get cleaned up along with the application. It seems like the original point of this JIRA is around private packages and installation/deployment - I think we agree we could handle that (or SparkR on YARN can already do that). My point, though, is that the real benefit of such a package management system is the exact version control it gives you. But even then, building packages from source on worker machines could be problematic (this applies both to packrat and to calls to install.packages): https://rstudio.github.io/packrat/limitations.html - I'm not sure we should assume that all worker machines in enterprises have a C compiler, or that the user running Spark has permission to build source code. I don't know where we are at with PySpark, but I'd be very interested in seeing how that is resolved - I think both Python and R face similar constraints in terms of deployment/package building, versioning, heterogeneous machine architecture and so on. > SparkR executors/workers support virtualenv > --- > > Key: SPARK-17428 > URL: https://issues.apache.org/jira/browse/SPARK-17428 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > Many users have requirements to use third party R packages in > executors/workers, but SparkR can not satisfy this requirements elegantly. 
> For example, you should to mess with the IT/administrators of the cluster to > deploy these R packages on each executors/workers node which is very > inflexible. > I think we should support third party R packages for SparkR users as what we > do for jar packages in the following two scenarios: > 1, Users can install R packages from CRAN or custom CRAN-like repository for > each executors. > 2, Users can load their local R packages and install them on each executors. > To achieve this goal, the first thing is to make SparkR executors support > virtualenv like Python conda. I have investigated and found > packrat(http://rstudio.github.io/packrat/) is one of the candidates to > support virtualenv for R. Packrat is a dependency management system for R and > can isolate the dependent R packages in its own private package space. Then > SparkR users can install third party packages in the application > scope(destroy after the application exit) and don’t need to bother > IT/administrators to install these packages manually. > I would like to know whether it make sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474557#comment-15474557 ] Thomas Graves commented on SPARK-17321: --- So there are two possible cases here: 1) You are using YARN NM recovery. If so, SPARK-14963 should prevent this problem, since the recovery path is supposed to be critical to the NM and the NM should not start if it is bad. 2) You aren't using NM recovery. In that case you probably don't really care about the levelDB being saved, because you aren't expecting things to live across NM restarts. If people are having issues like this, perhaps we should make the code conditional on NM recovery or a Spark config. Which case are you running? > YARN shuffle service should use good disk from yarn.nodemanager.local-dirs > -- > > Key: SPARK-17321 > URL: https://issues.apache.org/jira/browse/SPARK-17321 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.2, 2.0.0 >Reporter: yunjiong zhao > > We run spark on yarn, after enabled spark dynamic allocation, we notice some > spark application failed randomly due to YarnShuffleService. 
> From log I found > {quote} > 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: > Error while initializing Netty pipeline > java.lang.NullPointerException > at > org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) > at > org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) > at > org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) > at > io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {quote} > Which caused by the first disk in yarn.nodemanager.local-dirs was broken. > If we enabled spark.yarn.shuffle.stopOnFailure(SPARK-16505) we might lost > hundred nodes which is unacceptable. > We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good > disks if the first one is broken? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474543#comment-15474543 ] Matei Zaharia commented on SPARK-17445: --- I think one part you're missing, Josh, is that spark-packages.org *is* an index of packages from a wide variety of organizations, where anyone can submit a package. Have you looked through it? Maybe there is some concern about which third-party index we highlight on the site, but AFAIK there are no other third-party package indexes. Nonetheless it would make sense to have a stable URL on the Spark homepage that lists them. BTW, in the past, we also used a wiki page to track them: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects so we could just link to that. The spark-packages site provides some nicer functionality though such as letting anyone add a package with just a GitHub account, listing releases, etc. > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17429) spark sql length(1) return error
[ https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474532#comment-15474532 ] Apache Spark commented on SPARK-17429: -- User 'cenyuhai' has created a pull request for this issue: https://github.com/apache/spark/pull/15014 > spark sql length(1) return error > > > Key: SPARK-17429 > URL: https://issues.apache.org/jira/browse/SPARK-17429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > select length(11); > select length(2.0); > these sql will return errors, but hive is ok. > Error in query: cannot resolve 'length(11)' due to data type mismatch: > argument 1 requires (string or binary) type, however, '11' is of int type.; > line 1 pos 14 > Error in query: cannot resolve 'length(2.0)' due to data type mismatch: > argument 1 requires (string or binary) type, however, '2.0' is of double > type.; line 1 pos 14 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17429) spark sql length(1) return error
[ https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17429: Assignee: (was: Apache Spark) > spark sql length(1) return error > > > Key: SPARK-17429 > URL: https://issues.apache.org/jira/browse/SPARK-17429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > select length(11); > select length(2.0); > these sql will return errors, but hive is ok. > Error in query: cannot resolve 'length(11)' due to data type mismatch: > argument 1 requires (string or binary) type, however, '11' is of int type.; > line 1 pos 14 > Error in query: cannot resolve 'length(2.0)' due to data type mismatch: > argument 1 requires (string or binary) type, however, '2.0' is of double > type.; line 1 pos 14 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17429) spark sql length(1) return error
[ https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17429: Assignee: Apache Spark > spark sql length(1) return error > > > Key: SPARK-17429 > URL: https://issues.apache.org/jira/browse/SPARK-17429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai >Assignee: Apache Spark > > select length(11); > select length(2.0); > these sql will return errors, but hive is ok. > Error in query: cannot resolve 'length(11)' due to data type mismatch: > argument 1 requires (string or binary) type, however, '11' is of int type.; > line 1 pos 14 > Error in query: cannot resolve 'length(2.0)' due to data type mismatch: > argument 1 requires (string or binary) type, however, '2.0' is of double > type.; line 1 pos 14 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
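As the error message shows, length() in these Spark versions only accepts string or binary arguments, while Hive coerces numerics implicitly. Until the linked pull request lands, a common workaround is an explicit cast; the tiny helper below is purely illustrative (the name {{LengthCompat}} is not part of any Spark API):

```scala
// Illustrative helper: rewrite length(<non-string expr>) into the form that
// Spark 1.6.x accepts, by casting the argument to string first -- which is
// roughly what Hive does implicitly.
object LengthCompat {
  def lengthCompat(expr: String): String =
    s"length(cast($expr AS string))"
}
```

So `select length(11)` becomes `select length(cast(11 AS string))`, which resolves on both Spark SQL and Hive.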
[jira] [Commented] (SPARK-17450) spark sql rownumber OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474517#comment-15474517 ] Herman van Hovell commented on SPARK-17450: --- You could try. You would also have to add the follow-up by davies (that adds the spilling logic). > spark sql rownumber OOM > --- > > Key: SPARK-17450 > URL: https://issues.apache.org/jira/browse/SPARK-17450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > spark sql will be OOM when using row_number() over too much sorted records... > There will be only 1 task to handle all records > This sql group by passenger_id, we have 100 million passengers. > {code} > SELECT > passenger_id, > total_order, > (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND > 670800 THEN 'V3' END) AS order_rank > FROM > ( > SELECT > passenger_id, > 1 as total_order > FROM table > GROUP BY passenger_id > ) dd1 > {code} > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project 
[passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) > +- Project [passenger_id#7L] > +- Filter product#9 IN (kuai,gulf) >+- HiveTableScan [passenger_id#7L,product#9], > MetastoreRelation pbs_dw, dwv_order_whole_day, None, > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
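The physical plan above shows why this runs out of memory: the unpartitioned row_number() window forces a TungstenExchange SinglePartition, so a single task must buffer all 100 million rows. A common workaround (not part of this issue's discussion) is to compute a global rank with a distributed sort plus zipWithIndex instead of a window function. The sketch below mimics the idea on a local Scala collection; on a cluster the same shape would be `rdd.sortBy(...).zipWithIndex()`, which keeps the sort distributed, and {{RankSketch}}/{{Row}} are illustrative names:

```scala
// Illustrative only: assign a global rank without a single-partition window.
// Sorting is by total_order descending (passengerId breaks ties), and the
// 0-based index from zipWithIndex stands in for row_number() - 1.
object RankSketch {
  case class Row(passengerId: Long, totalOrder: Int)

  def rankTopN(rows: Seq[Row], n: Long): Seq[(Row, Option[String])] =
    rows.sortBy(r => (-r.totalOrder, r.passengerId))
      .zipWithIndex
      .map { case (r, idx) =>
        // row_number() is 1-based, so ranks 1..n get the 'V3' label.
        (r, if (idx + 1 <= n) Some("V3") else None)
      }
}
```

With an RDD, only the final index assignment needs a small amount of coordination (zipWithIndex runs one extra job to count per-partition sizes), so no single task has to hold the whole dataset.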
[jira] [Issue Comment Deleted] (SPARK-17450) spark sql rownumber OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17450: -- Comment: was deleted (was: You could try. You would also have to add the follow-up by davies (that adds the spilling logic).) > spark sql rownumber OOM > --- > > Key: SPARK-17450 > URL: https://issues.apache.org/jira/browse/SPARK-17450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > spark sql will be OOM when using row_number() over too much sorted records... > There will be only 1 task to handle all records > This sql group by passenger_id, we have 100 million passengers. > {code} > SELECT > passenger_id, > total_order, > (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND > 670800 THEN 'V3' END) AS order_rank > FROM > ( > SELECT > passenger_id, > 1 as total_order > FROM table > GROUP BY passenger_id > ) dd1 > {code} > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project 
[passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) > +- Project [passenger_id#7L] > +- Filter product#9 IN (kuai,gulf) >+- HiveTableScan [passenger_id#7L,product#9], > MetastoreRelation pbs_dw, dwv_order_whole_day, None, > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (SPARK-17450) Spark SQL row_number() OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474430#comment-15474430 ] cen yuhai edited comment on SPARK-17450 at 9/8/16 5:07 PM: --- hi,herman, can i merge your pr for native spark window function? https://github.com/apache/spark/pull/9819 ? was (Author: cenyuhai): hi,herman, can i merge your pr for native spark window function? https://github.com/apache/spark/pull/9819 ??? > spark sql rownumber OOM > --- > > Key: SPARK-17450 > URL: https://issues.apache.org/jira/browse/SPARK-17450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > spark sql will be OOM when using row_number() over too much sorted records... > There will be only 1 task to handle all records > This sql group by passenger_id, we have 100 million passengers. > {code} > SELECT > passenger_id, > total_order, > (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND > 670800 THEN 'V3' END) AS order_rank > FROM > ( > SELECT > passenger_id, > 1 as total_order > FROM table > GROUP BY passenger_id > ) dd1 > {code} > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 
DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project [passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) > +- Project [passenger_id#7L] > +- Filter product#9 IN (kuai,gulf) >+-
[jira] [Commented] (SPARK-17450) Spark SQL row_number() OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474430#comment-15474430 ] cen yuhai commented on SPARK-17450: --- hi,herman, can i merge your pr for native spark window function? https://github.com/apache/spark/pull/9819 ??? > spark sql rownumber OOM > --- > > Key: SPARK-17450 > URL: https://issues.apache.org/jira/browse/SPARK-17450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > spark sql will be OOM when using row_number() over too much sorted records... > There will be only 1 task to handle all records > This sql group by passenger_id, we have 100 million passengers. > {code} > SELECT > passenger_id, > total_order, > (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND > 670800 THEN 'V3' END) AS order_rank > FROM > ( > SELECT > passenger_id, > 1 as total_order > FROM table > GROUP BY passenger_id > ) dd1 > {code} > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project 
[passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) > +- Project [passenger_id#7L] > +- Filter product#9 IN (kuai,gulf) >+- HiveTableScan [passenger_id#7L,product#9], > MetastoreRelation pbs_dw, dwv_order_whole_day, None, > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-17443) SparkLauncher should allow stopping an application and not rely on the SparkSubmit binary
[ https://issues.apache.org/jira/browse/SPARK-17443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474386#comment-15474386 ] Marcelo Vanzin commented on SPARK-17443: The second bullet is actually SPARK-11035. > SparkLauncher should allow stoppingApplication and need not rely on > SparkSubmit binary > -- > > Key: SPARK-17443 > URL: https://issues.apache.org/jira/browse/SPARK-17443 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Kishor Patil > > Oozie wants SparkLauncher to support the following things: > - When oozie launcher is killed, the launched Spark application also gets > killed > - Spark Launcher to not have to rely on spark-submit bash script -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17446) no total size for data source tables in InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474385#comment-15474385 ] Sean Owen commented on SPARK-17446: --- [~ZenWzh] there is no detail at all here. Please describe what you mean in the description. See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > no total size for data source tables in InMemoryCatalog > --- > > Key: SPARK-17446 > URL: https://issues.apache.org/jira/browse/SPARK-17446 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Zhenhua Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17449) executorTimeoutMs configuration error
[ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474381#comment-15474381 ] Sean Owen commented on SPARK-17449: --- Sorry, I don't see the problem? the configured timeout matches what is configured. > executorTimeoutMs configure error > - > > Key: SPARK-17449 > URL: https://issues.apache.org/jira/browse/SPARK-17449 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yang Liang > Labels: bug > > $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s > --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 > ms exceeds timeout 12 ms > ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed > out after 168136 ms > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed > out after 11949 m > spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf > spark.network.timeout=10s --num-executors 1 > WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 > ms exceeds timeout 1 ms > ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed > out after 39299 ms > Source Code: > spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala > private val executorTimeoutMs = > sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") > * 1000 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
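The interaction between the two settings in the report above can be illustrated in plain Python (this is an illustration with an assumed `to_ms` helper, not Spark's actual configuration parser): whenever `spark.executor.heartbeatInterval` is configured larger than `spark.network.timeout`, the driver's expiry check is guaranteed to fire before the first heartbeat can arrive, which matches the lost-executor logs quoted in the issue.

```python
import re

# Assumed helper (not Spark's implementation): convert a Spark-style time
# string such as "200s" or "120000ms" into milliseconds.
def to_ms(value: str) -> int:
    m = re.fullmatch(r"(\d+)(ms|s|min)", value)
    if m is None:
        raise ValueError("unsupported time string: %r" % value)
    n, unit = int(m.group(1)), m.group(2)
    return n * {"ms": 1, "s": 1000, "min": 60_000}[unit]

heartbeat_interval = to_ms("200s")  # spark.executor.heartbeatInterval
network_timeout = to_ms("10s")      # spark.network.timeout (driver-side expiry)

# The driver expires the executor long before the first heartbeat arrives:
assert heartbeat_interval > network_timeout
```

A sanity check along these lines (interval strictly smaller than the timeout) is what the reporter's configuration is missing.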
[jira] [Commented] (SPARK-17447) performance improvement in Partitioner.DefaultPartitioner
[ https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474375#comment-15474375 ] Sean Owen commented on SPARK-17447: --- Why don't they need to be sorted? > performance improvement in Partitioner.DefaultPartitioner > -- > > Key: SPARK-17447 > URL: https://issues.apache.org/jira/browse/SPARK-17447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: performance > Original Estimate: 1h > Remaining Estimate: 1h > > If there are many RDDs, sorting them can hurt performance severely. We don't > actually need to sort the RDDs; a single scan over them achieves the same goal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
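The reporter's claim can be made concrete with a small illustration in plain Python (the partition counts below are made up): picking the RDD with the most partitions needs only one linear scan, not a full sort.

```python
# Hypothetical numPartitions values of candidate RDDs (made-up data).
rdd_partition_counts = [4, 17, 8, 2, 17, 3]

# Sorting approach: O(n log n) work just to read off the head element.
most_by_sort = sorted(rdd_partition_counts, reverse=True)[0]

# Single-scan approach: O(n), same answer.
most_by_scan = max(rdd_partition_counts)

assert most_by_sort == most_by_scan == 17
```

Whether this matters in practice depends on how many RDDs are passed in, which is presumably the point of Sean's question.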
[jira] [Commented] (SPARK-17448) There should be a limit on k in mllib.KMeans
[ https://issues.apache.org/jira/browse/SPARK-17448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474374#comment-15474374 ] Sean Owen commented on SPARK-17448: --- That's true in a thousand contexts: if you make an array that's too big you run out of memory. There's no real way to know what the limit of k is. The error is appropriate and clear. I don't see a change to make here. > There should be a limit on k in mllib.KMeans > - > > Key: SPARK-17448 > URL: https://issues.apache.org/jira/browse/SPARK-17448 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: patch > Original Estimate: 1h > Remaining Estimate: 1h > > In mllib.KMeans there is no limit on the parameter k; if we pass a very large > k, the Array(k) allocation causes a java.lang.OutOfMemoryError. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
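A rough back-of-envelope estimate (assuming centers are stored as dense arrays of 8-byte doubles, and ignoring JVM object overhead) shows why an absurdly large k fails before any clustering work happens:

```python
# Estimate the heap needed just to hold k cluster centers of dimension dim,
# assuming dense double storage (8 bytes per element).
def center_storage_bytes(k: int, dim: int) -> int:
    return k * dim * 8

# e.g. 100 million centers of dimension 100 need ~80 GB for the doubles
# alone, far beyond a typical executor heap.
needed_gb = center_storage_bytes(100_000_000, 100) / 1e9
```

As Sean notes, the "right" limit depends on the heap and the feature dimension, so there is no universal cap the library could enforce; an estimate like this is something the caller can do.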
[jira] [Commented] (SPARK-17450) Spark SQL row_number() OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474363#comment-15474363 ] Herman van Hovell commented on SPARK-17450: --- This is generally a bad idea. All your data is moved to a single executor, sorted on that executor, and finally processed on that executor. So I am curious to know what the use case is. What makes things worse is the fact that 1.6 keeps all records in a single partition in memory, causing an OOM in your case. In 2.0 we spill records to disk (above 4000 records) in order to prevent OOMs. Could you try this in 2.0? > Spark SQL row_number() OOM > --- > > Key: SPARK-17450 > URL: https://issues.apache.org/jira/browse/SPARK-17450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > Spark SQL will OOM when using row_number() over too many sorted records; > only one task handles all the records. > This SQL groups by passenger_id; we have 100 million passengers. 
> {code} > SELECT > passenger_id, > total_order, > (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND > 670800 THEN 'V3' END) AS order_rank > FROM > ( > SELECT > passenger_id, > 1 as total_order > FROM table > GROUP BY passenger_id > ) dd1 > {code} > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project [passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) >
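The single-partition sort that Herman describes can in principle be avoided for row_number(). A sketch in plain Python (this is not Spark's implementation) shows the standard two-phase trick, under the assumption that rows have already been range-partitioned by the ordering key so that every value in partition i orders before every value in partition i+1: sort each partition locally, then assign each partition a starting offset via a prefix sum over partition sizes.

```python
def distributed_row_number(partitions):
    """Assign a global row_number without gathering all rows in one place.

    Assumes range partitioning: every value in partitions[i] orders (here:
    descending) before every value in partitions[i + 1].
    """
    sizes = [len(p) for p in partitions]
    offsets = [0]
    for s in sizes[:-1]:                  # prefix sum of partition sizes
        offsets.append(offsets[-1] + s)
    numbered = []
    for offset, part in zip(offsets, partitions):
        # Only a per-partition (local) sort is needed.
        for i, row in enumerate(sorted(part, reverse=True)):
            numbered.append((row, offset + i + 1))
    return numbered

# Toy data, already range-partitioned in descending order.
parts = [[9, 7, 8], [6, 5], [3, 1, 2]]
ranks = distributed_row_number(parts)
# Values 9..1 receive row numbers 1..8 with no single-partition shuffle.
```

This mirrors why adding a PARTITION BY clause (or a pre-computed partition key) to a window specification sidesteps the SinglePartition exchange visible in the physical plan above.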
[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474342#comment-15474342 ] Alexander Kasper commented on SPARK-17321: -- We discovered the same issue. It seems the shuffle service has no way of being notified about the failed disk. Only option then is probably to restart the node manager. > YARN shuffle service should use good disk from yarn.nodemanager.local-dirs > -- > > Key: SPARK-17321 > URL: https://issues.apache.org/jira/browse/SPARK-17321 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.2, 2.0.0 >Reporter: yunjiong zhao > > We run spark on yarn, after enabled spark dynamic allocation, we notice some > spark application failed randomly due to YarnShuffleService. > From log I found > {quote} > 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: > Error while initializing Netty pipeline > java.lang.NullPointerException > at > org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) > at > org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) > at > org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) > at > io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) > at > 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {quote} > Which caused by the first disk in yarn.nodemanager.local-dirs was broken. > If we enabled spark.yarn.shuffle.stopOnFailure(SPARK-16505) we might lost > hundred nodes which is unacceptable. > We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good > disks if the first one is broken? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
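The behavior the reporter asks for can be sketched in plain Python (the helper name and the write-probe strategy are assumptions, not the shuffle service's actual code): walk the configured local dirs in order and use the first one that accepts a write, instead of failing permanently when the first entry sits on a broken disk.

```python
import tempfile

def first_usable_dir(dirs):
    """Return the first directory that accepts a write probe."""
    for d in dirs:
        try:
            # Creating (and immediately discarding) a temp file proves the
            # directory exists and is writable.
            with tempfile.TemporaryFile(dir=d):
                return d
        except OSError:
            continue  # e.g. broken disk or missing mount: try the next dir
    raise OSError("no usable local dir among %r" % (dirs,))

# Example: the first entry simulates a failed disk.
good = tempfile.mkdtemp()
chosen = first_usable_dir(["/definitely-missing-disk", good])
```

A periodic re-probe would also be needed in a long-running service, since (as noted in the comment above) the shuffle service currently has no way to learn about a disk that fails after startup.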
[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations
[ https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474277#comment-15474277 ] Nattavut Sutyanyong commented on SPARK-17154: - The same problem surfaced in different symptoms was discussed in SPARK-13801, SPARK-14040, and SPARK-17337. We shall find a solution that addresses the root cause. There is a trait called {{MultiInstanceRelation}} that looks like trying to address the multiple references of the same relation. The proposal [~sarutak]-san attached here is an interesting way to resolve the root cause of the problem. However, I have a feeling that it might be over-complicated. I am working on a prototype to extend the {{MultiInstanceRelation}} trait. The goal is to address the symptoms discussed in: this JIRA (SPARK-17154) SPARK-13801 SPARK-14040 SPARK-17337 and get rid of (IMHO) the hack implemented in the call to {{dedupRight}} to resolve a self-join scenario. > Wrong result can be returned or AnalysisException can be thrown after > self-join or similar operations > - > > Key: SPARK-17154 > URL: https://issues.apache.org/jira/browse/SPARK-17154 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Kousuke Saruta > Attachments: Solution_Proposal_SPARK-17154.pdf > > > When we join two DataFrames which are originated from a same DataFrame, > operations to the joined DataFrame can fail. > One reproducible example is as follows. > {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val joined = filtered.join(df, filtered("col1") === df("col1"), "inner") > val selected1 = joined.select(df("col3")) > {code} > In this case, AnalysisException is thrown. > Another example is as follows. 
> {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), > "right") > val selected2 = rightOuterJoined.select(df("col1")) > selected2.show > {code} > In this case, we will expect to get the answer like as follows. > {code} > 1 > 2 > 3 > 4 > 5 > {code} > But the actual result is as follows. > {code} > 1 > 2 > null > 4 > 5 > {code} > The cause of the problems in the examples is that the logical plan related to > the right side DataFrame and the expressions of its output are re-created in > the analyzer (at ResolveReference rule) when a DataFrame has expressions > which have a same exprId each other. > Re-created expressions are equally to the original ones except exprId. > This will happen when we do self-join or similar pattern operations. > In the first example, df("col3") returns a Column which includes an > expression and the expression have an exprId (say id1 here). > After join, the expresion which the right side DataFrame (df) has is > re-created and the old and new expressions are equally but exprId is renewed > (say id2 for the new exprId here). > Because of the mismatch of those exprIds, AnalysisException is thrown. > In the second example, df("col1") returns a column and the expression > contained in the column is assigned an exprId (say id3). > On the other hand, a column returned by filtered("col1") has an expression > which has the same exprId (id3). > After join, the expressions in the right side DataFrame are re-created and > the expression assigned id3 is no longer present in the right side but > present in the left side. > So, referring df("col1") to the joined DataFrame, we get col1 of right side > which includes null. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-17348) Incorrect results from subquery transformation
[ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nattavut Sutyanyong updated SPARK-17348: Comment: was deleted (was: The same problem surfaced in different symptoms was discussed in SPARK-13801, SPARK-14040, and SPARK-17154. The problem reported here is a specific pattern. We shall find a solution that addresses the root cause. I am considering closing this JIRA as a duplicate.) > Incorrect results from subquery transformation > -- > > Key: SPARK-17348 > URL: https://issues.apache.org/jira/browse/SPARK-17348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > {noformat} > Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1") > Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= > t2.c2)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result of the above query should be an empty set. Here is an > explanation: > Both rows from T2 satisfies the correlated predicate T1.C2 >= T2.C2 when > T1.C1 = 1 so both rows needs to be processed in the same group of the > aggregation process in the subquery. The result of the aggregation yields > MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate > T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
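The explanation in the issue description can be replayed in plain Python (a hand-evaluation of the intended query semantics, not Spark code): the correlated predicate filters t2 before the aggregate, so MAX is taken over the surviving group, and the IN test then compares t1.c1 against that maximum.

```python
t1 = [(1, 1)]            # rows of t1 as (c1, c2)
t2 = [(1, 1), (2, 0)]    # rows of t2 as (c1, c2)

result = []
for c1, c2 in t1:
    # Correlated WHERE clause: keep t2 rows satisfying t1.c2 >= t2.c2.
    group = [t2_c1 for t2_c1, t2_c2 in t2 if c2 >= t2_c2]
    # The subquery yields MAX(t2.c1) over the surviving group; here both
    # t2 rows survive, so the maximum is 2.
    if group and c1 in {max(group)}:
        result.append(c1)

# 1 IN (2) is false, so the correct answer is the empty set.
assert result == []
```

The transformed plan that returns `c1 = 1` is therefore incorrect, exactly as the deleted comment and the description argue.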
[jira] [Comment Edited] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe
[ https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474239#comment-15474239 ] Nattavut Sutyanyong edited comment on SPARK-14040 at 9/8/16 4:03 PM: - The root cause of this problem is the way Spark generates a unique identifier for each column, without an obvious way to distinguish multiple references to the same column. This problem has been discovered in different contexts, and different approaches to fixing it have been discussed in various places: SPARK-13801, SPARK-17337, a partial fix implemented in the {{dedupRight()}} method for the {{Join}} operator, and the latest attempt in SPARK-17154. We should solve this problem at the root cause. I will post my idea in SPARK-17154. We shall close this JIRA as a duplicate. was (Author: nsyca): The root cause of this problem is the way Spark implemented to generate a unique identifier for each column without an obvious way to distinguish multiple references to the same column. This problem has been discovered in different contexts and different approaches to fix this problem have been discussed in various places: SPARK-14040 SPARK-17337 A partial fix was implemented in the {{dedupRight()}} method for the {{Join}} operator. and the latest attempt to fix this in SPARK-17154. We should solve this problem at the root cause. I will post my idea in SPARK-17154. We shall close this JIRA as a duplicate. 
> Null-safe and equality join produces incorrect result with filtered dataframe > - > > Key: SPARK-14040 > URL: https://issues.apache.org/jira/browse/SPARK-14040 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Ubuntu Linux 15.10 >Reporter: Denton Cockburn > > Initial issue reported here: > http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results > val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c") > val a = b.where("c = 1").withColumnRenamed("a", > "filta").withColumnRenamed("b", "filtb") > a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> > b("c"), "left_outer").show > Produces 2 rows instead of the expected 1. > a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === > $"b" and $"newc" === b("c"), "left_outer").show > Also produces 2 rows instead of the expected 1. > The only one that seemed to work correctly was: > a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === > b("c"), "left_outer").show > But that produced a warning: > WARN Column: Constructing trivially true equals predicate, 'c#18232 = > c#18232' > As pointed out by commenter zero323: > "The second behavior looks indeed like a bug related to the fact that you > still have a.c in your data. It looks like it is picked downstream before b.c > and the evaluated condition is actually a.newc = a.c"
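The expected result of the reproduction above can be verified outside Spark. This is a minimal Python sketch of the join semantics (rows from the report; since no NULLs are involved, null-safe equality {{<=>}} reduces to plain equality here):

```python
# rows of b as (a, b, c)
b = [("a", "b", 1), ("a", "b", 2)]

# a = b.where("c = 1"), with columns renamed to (filta, filtb, c);
# renaming does not change the row values.
a = [row for row in b if row[2] == 1]

# Expected left outer join on filta<=>a AND filtb<=>b AND a.c<=>b.c:
# pair each row of `a` with the rows of `b` matching on all three columns.
joined = [(ra, rb) for ra in a for rb in b
          if ra[0] == rb[0] and ra[1] == rb[1] and ra[2] == rb[2]]

print(len(joined))  # 1 -- the single expected row, not the 2 rows Spark produces
```

Only `("a", "b", 1)` from `b` matches on the `c` column, so the correct answer is one row; Spark returns two because the `a("c") <=> b("c")` conjunct degenerates into a trivially true self-comparison.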
[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474253#comment-15474253 ] Nattavut Sutyanyong commented on SPARK-17337: - The same problem, surfacing with different symptoms, was discussed in SPARK-13801, SPARK-14040, and SPARK-17154. The problem reported here is a specific pattern. We shall find a solution that addresses the root cause. I am considering closing this JIRA as a duplicate. > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > While investigating SPARK-16951, I found an incorrect-results case from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem.
[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation
[ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474246#comment-15474246 ] Nattavut Sutyanyong commented on SPARK-17348: - The same problem, surfacing with different symptoms, was discussed in SPARK-13801, SPARK-14040, and SPARK-17154. The problem reported here is a specific pattern. We shall find a solution that addresses the root cause. I am considering closing this JIRA as a duplicate. > Incorrect results from subquery transformation > -- > > Key: SPARK-17348 > URL: https://issues.apache.org/jira/browse/SPARK-17348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > {noformat} > Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1") > Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= > t2.c2)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result of the above query should be an empty set. Here is an > explanation: > Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when > T1.C1 = 1, so both rows need to be processed in the same group of the > aggregation in the subquery. The aggregation yields > MAX(T2.C1) as 2. Therefore, evaluating the predicate > T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should produce an empty set.
[jira] [Commented] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe
[ https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474239#comment-15474239 ] Nattavut Sutyanyong commented on SPARK-14040: - The root cause of this problem is the way Spark generates a unique identifier for each column, without an obvious way to distinguish multiple references to the same column. This problem has been discovered in different contexts, and different approaches to fixing it have been discussed in various places: SPARK-14040 SPARK-17337 A partial fix was implemented in the {{dedupRight()}} method for the {{Join}} operator, and the latest attempt to fix it is in SPARK-17154. We should solve this problem at its root cause. I will post my idea in SPARK-17154. We shall close this JIRA as a duplicate. > Null-safe and equality join produces incorrect result with filtered dataframe > - > > Key: SPARK-14040 > URL: https://issues.apache.org/jira/browse/SPARK-14040 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Ubuntu Linux 15.10 >Reporter: Denton Cockburn > > Initial issue reported here: > http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results > val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c") > val a = b.where("c = 1").withColumnRenamed("a", > "filta").withColumnRenamed("b", "filtb") > a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> > b("c"), "left_outer").show > Produces 2 rows instead of the expected 1. > a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === > $"b" and $"newc" === b("c"), "left_outer").show > Also produces 2 rows instead of the expected 1. 
> The only one that seemed to work correctly was: > a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === > b("c"), "left_outer").show > But that produced a warning: > WARN Column: Constructing trivially true equals predicate, 'c#18232 = > c#18232' > As pointed out by commenter zero323: > "The second behavior looks indeed like a bug related to the fact that you > still have a.c in your data. It looks like it is picked downstream before b.c > and the evaluated condition is actually a.newc = a.c"
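The trivially-true-predicate warning above is exactly the unique-identifier problem described in the comments. A toy Python model can illustrate the mechanism (a hypothetical sketch only; Catalyst's actual attribute classes and {{dedupRight()}} logic are more involved): deriving one DataFrame from another copies the attribute IDs, so a condition like {{a("c") <=> b("c")}} ends up comparing an attribute with itself.

```python
import itertools

_ids = itertools.count()

def attr(name):
    """A toy 'attribute': (name, unique id), loosely mimicking Catalyst exprIds."""
    return (name, next(_ids))

# b defines columns a, b, c, each minted with a fresh id.
b_cols = {n: attr(n) for n in ("a", "b", "c")}

# a is derived from b by filter + rename: the surviving column "c"
# keeps the SAME id -- no new attribute is minted for the derived plan.
a_cols = {"filta": b_cols["a"], "filtb": b_cols["b"], "c": b_cols["c"]}

# The join condition a("c") <=> b("c") therefore references one attribute
# twice, which the optimizer can fold into a trivially true predicate.
print(a_cols["c"] == b_cols["c"])  # True
```

Because both sides of the equality resolve to the same identifier, the conjunct no longer constrains the join, which is consistent with the 'c#18232 = c#18232' warning in the report.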