[jira] [Commented] (SPARK-12239) SparkR - Not distributing SparkR module in YARN
[ https://issues.apache.org/jira/browse/SPARK-12239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081688#comment-15081688 ] Sen Fang commented on SPARK-12239: --

A quick note: I tested the fix I proposed before for this specific case, and it seems to work fine for

{code}
sparkR.init()
sparkR.init("local[2]")
sparkR.init("yarn-client")
{code}

inside RStudio. This change would have no impact on bin/sparkR etc., since the SparkContext there is initialized using {{sparkR.init()}}. Again, the code change is very simple:

{code}
submitOps <- getClientModeSparkSubmitOpts(
  paste(
    if (nzchar(master)) paste("--master", master),
    Sys.getenv("SPARKR_SUBMIT_ARGS", "sparkr-shell")
  ),
  sparkEnvirMap)
{code}

Given that this works:

{code}
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--master yarn-client sparkr-shell")
sc <- sparkR.init()
{code}

but this doesn't:

{code}
sc <- sparkR.init("yarn-client")
{code}

I feel it is rather clear that the current behavior is inconsistent and unintuitive to users. As always, I am happy to put together a PR for whichever approach people feel is the best long-term solution.

> SparkR - Not distributing SparkR module in YARN
>
> Key: SPARK-12239
> URL: https://issues.apache.org/jira/browse/SPARK-12239
> Project: Spark
> Issue Type: Bug
> Components: SparkR, YARN
> Affects Versions: 1.5.2, 1.5.3
> Reporter: Sebastian YEPES FERNANDEZ
> Priority: Critical
>
> Hello,
> I am trying to use SparkR in a YARN environment and I have encountered the following problem:
> Everything works correctly when using bin/sparkR, but if I try running the same jobs using sparkR directly through R it does not work.
> I have managed to track down what is causing the problem: when sparkR is launched through R, the "SparkR" module is not distributed to the worker nodes.
> I have tried working around this issue using the setting "spark.yarn.dist.archives", but it does not work: it deploys the file/extracted folder with the extension ".zip", while workers are actually looking for a folder with the name "sparkr".
> Is there currently any way to make this work?
> {code}
> # spark-defaults.conf
> spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip
>
> # R
> library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
> sc <- sparkR.init(appName="SparkR", master="yarn-client", sparkEnvir=list(spark.executor.instances="1"))
> sqlContext <- sparkRSQL.init(sc)
> df <- createDataFrame(sqlContext, faithful)
> head(df)
> 15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> {code}
> Container stderr:
> {code}
> 15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.7 KB, free 530.0 MB)
> 15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file '/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R': No such file or directory
> 15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> at java.net.ServerSocket.implAccept(ServerSocket.java:545)
> at java.net.ServerSocket.accept(ServerSocket.java:513)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
> {code}
> Worker node that ran the container:
> {code}
> # ls -la /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
> total 71M
> drwx--x--- 3 yarn hadoop 4.0K Dec 9 09:04 .
> drwx--x--- 7 yarn hadoop 4.0K Dec 9 09:04 ..
> -rw-r--r-- 1 yarn hadoop 110 Dec 9 09:03 container_tokens
> -rw-r--r-- 1 yarn hadoop 12 Dec 9 09:03 .container_tokens.crc
> -rwx------ 1 yarn hadoop 736 Dec 9 09:03 default_container_executor_session.sh
> -rw-r--r-- 1 yarn hadoop 16 Dec 9 09:03 .default_container_executor_session.sh.crc
> -rwx------ 1 yarn hadoop 790 Dec 9 09:03 default_container_executor.sh
> -rw-r--r-- 1 yarn hadoop 16 Dec 9 09:03 .default_container_executor.sh.crc
> -rwxr-xr-x 1 yarn hadoop 61K Dec 9 09:04 hadoop-lzo-0.6.0.2.3.2.0-2950.jar
> -rwxr-xr-x 1 yarn hadoop 317K Dec 9 09:04 kafka-clients-0.8.2.2.jar
> -rwx------ 1 yarn hadoop 6.0K Dec 9 09:03 launch_container.sh
> -rw-r--r-- 1 yarn hadoop 56 Dec 9 09:03 .launch_container.sh.crc
> {code}
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074462#comment-15074462 ] Sen Fang commented on SPARK-5159: -

As of Spark 1.5.2, we have a similar issue that might be related to this JIRA. I haven't tested this in 1.6.0 just yet and will report back if it is still an issue.

The symptom is: if the thriftserver is started by a user who doesn't have permission to access a table's directory on HDFS, then even when a correctly privileged user establishes a SQL connection and executes a query, the query fails with an error message saying the user who started the thriftserver doesn't have permission to list the folder. However, as in Hive, the listing should be performed on behalf of the connecting user instead.

I can report back more detailed steps to reproduce this problem once we test under 1.6.0 and confirm the issue still exists.

> Thrift server does not respect hive.server2.enable.doAs=true
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: Andrew Ray
>
> I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12239) SparkR - Not distributing SparkR module in YARN
[ https://issues.apache.org/jira/browse/SPARK-12239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073138#comment-15073138 ] Sen Fang commented on SPARK-12239: --

Thanks [~sunrui] for your suggestions. Both sound very reasonable to me. If we decide on one particular approach, I am happy to put together a PR. Given the prevalence of RStudio as an IDE for day-to-day R users, I think this is a particularly beneficial workflow to streamline.

> SparkR - Not distributing SparkR module in YARN
>
> Key: SPARK-12239
> URL: https://issues.apache.org/jira/browse/SPARK-12239
> Project: Spark
> Issue Type: Bug
> Components: SparkR, YARN
> Affects Versions: 1.5.2, 1.5.3
> Reporter: Sebastian YEPES FERNANDEZ
> Priority: Critical
>
> Hello,
> I am trying to use SparkR in a YARN environment and I have encountered the following problem:
> Everything works correctly when using bin/sparkR, but if I try running the same jobs using sparkR directly through R it does not work.
> I have managed to track down what is causing the problem: when sparkR is launched through R, the "SparkR" module is not distributed to the worker nodes.
> I have tried working around this issue using the setting "spark.yarn.dist.archives", but it does not work: it deploys the file/extracted folder with the extension ".zip", while workers are actually looking for a folder with the name "sparkr".
> Is there currently any way to make this work?
> {code}
> # spark-defaults.conf
> spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip
>
> # R
> library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
> sc <- sparkR.init(appName="SparkR", master="yarn-client", sparkEnvir=list(spark.executor.instances="1"))
> sqlContext <- sparkRSQL.init(sc)
> df <- createDataFrame(sqlContext, faithful)
> head(df)
> 15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> {code}
> Container stderr:
> {code}
> 15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.7 KB, free 530.0 MB)
> 15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file '/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R': No such file or directory
> 15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> at java.net.ServerSocket.implAccept(ServerSocket.java:545)
> at java.net.ServerSocket.accept(ServerSocket.java:513)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
> {code}
> Worker node that ran the container:
> {code}
> # ls -la /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
> total 71M
> drwx--x--- 3 yarn hadoop 4.0K Dec 9 09:04 .
> drwx--x--- 7 yarn hadoop 4.0K Dec 9 09:04 ..
> -rw-r--r-- 1 yarn hadoop 110 Dec 9 09:03 container_tokens
> -rw-r--r-- 1 yarn hadoop 12 Dec 9 09:03 .container_tokens.crc
> -rwx------ 1 yarn hadoop 736 Dec 9 09:03 default_container_executor_session.sh
> -rw-r--r-- 1 yarn hadoop 16 Dec 9 09:03 .default_container_executor_session.sh.crc
> -rwx------ 1 yarn hadoop 790 Dec 9 09:03 default_container_executor.sh
> -rw-r--r-- 1 yarn hadoop 16 Dec 9 09:03 .default_container_executor.sh.crc
> -rwxr-xr-x 1 yarn hadoop 61K Dec 9 09:04 hadoop-lzo-0.6.0.2.3.2.0-2950.jar
> -rwxr-xr-x 1 yarn hadoop 317K Dec 9 09:04 kafka-clients-0.8.2.2.jar
> -rwx------ 1 yarn hadoop 6.0K Dec 9 09:03 launch_container.sh
> -rw-r--r-- 1 yarn hadoop 56 Dec 9 09:03 .launch_container.sh.crc
> -rwxr-xr-x 1 yarn hadoop 2.2M Dec 9 09:04 spark-cassandra-connector_2.10-1.5.0-M3.jar
> -rwxr-xr-x 1 yarn hadoop 7.1M Dec 9 09:04 spark-csv-assembly-1.3.0.jar
> lrwxrwxrwx 1 yarn hadoop 119 Dec 9 09:03 __spark__.jar -> /hadoop/hdfs/disk03/hadoop/yarn/local/usercache/spark/filecache/361/spark-assembly-1.5.3-SNAPSHOT-hadoop2.7.1.jar
> lrwxrwxrwx 1 yarn hadoop 84 Dec 9 09:03 sparkr.zip -> /hadoop/hdfs/disk01/hadoop/yarn/local/usercache/spark/filecache/359/sparkr.zip
> -rwxr-xr-x 1 yarn hadoop 1.8M Dec 9 09:04 spark-streaming_2.10-1.5.3-SNAPSHOT.jar
> -rwxr-xr-x 1 yarn hadoop 11M Dec 9 09:04
[jira] [Commented] (SPARK-12239) SparkR - Not distributing SparkR module in YARN
[ https://issues.apache.org/jira/browse/SPARK-12239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072029#comment-15072029 ] Sen Fang commented on SPARK-12239: --

I recently got hit by this issue as well, and I think we can make some improvements to make this more robust. It came up exactly when we attempted to use SparkR in RStudio/RStudio Server.

[~sunrui] The workaround you suggest absolutely works, but I don't see it on the page you cited. Has that page been changed?

[~shivaram] I believe the problem is that {{sparkR.init}} always launches in local mode first, e.g.:

{code}
Launching java with spark-submit command spark-submit sparkr-shell /var/folders/yw/mfqkln8172l93g6yfnt2k2zwgp/T//RtmpCwXLoF/backend_port432252948808
{code}

It currently kind of works for YARN, as documented for {{sparkR.init}}:

{code}
sc <- sparkR.init("yarn-client", "SparkR", "/home/spark", ...
{code}

only because the actual Spark context is initiated by this call: https://github.com/apache/spark/blob/v1.6.0-rc4/R/pkg/R/sparkR.R#L212

However, this happens after the deploy step you cited above, where Spark determines the cluster manager from the spark-submit arguments: https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L228

Therefore you end up with a functional YARN-based SparkContext without sparkr.zip. Shouldn't the fix be as easy as appending the actual master to the spark-submit command we build inside R? Or am I missing something?
[jira] [Created] (SPARK-12526) `ifelse`, `when`, `otherwise` unable to take Column as value
Sen Fang created SPARK-12526:

Summary: `ifelse`, `when`, `otherwise` unable to take Column as value
Key: SPARK-12526
URL: https://issues.apache.org/jira/browse/SPARK-12526
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.5.2, 1.6.0
Reporter: Sen Fang

When passing a Column to {{ifelse}}, {{when}}, or {{otherwise}}, it errors out with

{code}
attempt to replicate an object of type 'environment'
{code}

The problem lies in the use of the base R {{ifelse}} function, which is a vectorized version of the {{if ... else ...}} idiom, but is unable to replicate a Column's jobj because it is an environment. Considering {{callJMethod}} was never designed to be vectorized, the safe option is to replace {{ifelse}} with {{if ... else ...}} instead. However, technically this is inconsistent with base R's {{ifelse}}, which is meant to be vectorized. I can send a PR for review first and discuss further whether there is any scenario at all in which `ifelse`, `when`, `otherwise` would be used in a vectorized way.

A dummy example is:

{code}
ifelse(lit(1) == lit(1), lit(2), lit(3))
{code}

A concrete example might be:

{code}
ifelse(df$mpg > 0, df$mpg, 0)
{code}
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021527#comment-15021527 ] Sen Fang commented on SPARK-2336: -

I finally took a crack at the hybrid spill tree for kNN, and the results so far appear to be promising. For anyone who is still interested, you can find it as a Spark package at: https://github.com/saurfang/spark-knn

The implementation is written for the ml API and scales well in terms of both the number of observations and the number of vector dimensions. The kNN itself is flexible, and the package comes with KNNClassifier and KNNRegression for (optionally weighted) classification and regression. There are a few implementation details I am still trying to iron out; otherwise, I look forward to benchmarking it against other implementations such as kNN-join, KD-Tree, and LSH.

> Approximate k-NN Models for MLLib
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Brian Gawalt
> Priority: Minor
> Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to also offer approximate k-Nearest Neighbor. A promising approach would involve building a kd-tree variant within each partition, a la http://www.autonlab.org/autonweb/14714.html?branch=1=2
> This could offer a simple non-linear ML model that can label new data with much lower latency than the plain-vanilla kNN versions.
[jira] [Created] (SPARK-11906) Speculation Tasks Cause ProgressBar UI Overflow
Sen Fang created SPARK-11906:

Summary: Speculation Tasks Cause ProgressBar UI Overflow
Key: SPARK-11906
URL: https://issues.apache.org/jira/browse/SPARK-11906
Project: Spark
Issue Type: Bug
Components: Web UI
Reporter: Sen Fang
Priority: Trivial

When there are speculative tasks in a stage, the number of started tasks plus completed tasks can be greater than the total number of tasks. This causes the started progress block to overflow to the next line; visually, the light blue progress block is no longer visible when this happens.

The fix should be as trivial as capping the number of started tasks at the total minus the completed tasks. https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L322
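The proposed cap is a one-line guard. A minimal sketch in plain Python (the actual fix would live in UIUtils.makeProgressBar in Scala; `started_bar_width` and its signature are invented here for illustration):

```python
def started_bar_width(started, completed, total):
    """Width of the 'started' (light blue) segment of the progress bar.

    With speculative attempts, started + completed can exceed total and
    the bar overflows; cap the started segment at total - completed.
    """
    return max(min(started, total - completed), 0)

# 10 tasks, 8 completed, 5 running (3 of them speculative copies):
# only 2 slots remain, so the started segment is capped at 2.
print(started_bar_width(5, 8, 10))
```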
[jira] [Created] (SPARK-11263) lintr Throws Warnings on Commented Code in Documentation
Sen Fang created SPARK-11263:

Summary: lintr Throws Warnings on Commented Code in Documentation
Key: SPARK-11263
URL: https://issues.apache.org/jira/browse/SPARK-11263
Project: Spark
Issue Type: Task
Components: SparkR
Reporter: Sen Fang
Priority: Minor

This comes from a discussion in https://github.com/apache/spark/pull/9205

Currently lintr throws many warnings of the form "style: Commented code should be removed." For example:

{code}
R/RDD.R:260:3: style: Commented code should be removed.
# unpersist(rdd) # rdd@@env$isCached == FALSE
^~~
R/RDD.R:283:3: style: Commented code should be removed.
# sc <- sparkR.init()
^~~
R/RDD.R:284:3: style: Commented code should be removed.
# setCheckpointDir(sc, "checkpoint")
^~
{code}

Some of them are legitimate warnings, but most are simply code examples for functions that are not part of the public API. For example:

{code}
# @examples
#\dontrun{
# sc <- sparkR.init()
# rdd <- parallelize(sc, 1:10, 2L)
# cache(rdd)
#}
{code}

One workaround is to convert them back to Roxygen docs but assign {{#' @rdname .ignore}}, in which case Roxygen will skip these functions with the message {{Skipping invalid path: .ignore.Rd}}.

That being said, I feel people usually describe R package documentation as "expert friendly". The convention seems to be to provide as much documentation as possible but not to export functions that are unstable or developer-only. If users choose to use them anyway, they acknowledge the risk by using {{:::}}.
[jira] [Created] (SPARK-11244) sparkR.stop doesn't clean up .sparkRSQLsc in environment
Sen Fang created SPARK-11244:

Summary: sparkR.stop doesn't clean up .sparkRSQLsc in environment
Key: SPARK-11244
URL: https://issues.apache.org/jira/browse/SPARK-11244
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.5.1
Reporter: Sen Fang

Currently {{sparkR.stop}} removes the relevant variables from {{.sparkREnv}} for the SparkContext and the backend. However, it doesn't clean up {{.sparkRSQLsc}} and {{.sparkRHivesc}}. As a result,

{code}
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sparkR.stop()
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sqlContext
{code}

produces

{code}
sqlContext
Error in callJMethod(x, "getClass") :
  Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
{code}
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936268#comment-14936268 ] Sen Fang commented on SPARK-9999: -

Another idea is to do something similar to the F# TypeProvider approach: http://fsharp.github.io/FSharp.Data/

I haven't looked into this extensively just yet, but as far as I understand it uses compile-time macros to generate classes based on data sources. In that sense, it is slightly similar to protobuf, where you generate a Java class based on a schema definition. This makes the dataframe type safe at the very upstream. With a bit of IDE plugin support, you would even be able to have autocomplete and type checking as you write code, which would be very nice. I'm not sure whether it would be scalable to propagate this type information downstream (in an aggregated or transformed dataframe), though. As I understand it, macros and type providers in Scala provide similar capabilities.

> RDD-like API on top of Catalyst/DataFrame
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
> Issue Type: Story
> Components: SQL
> Reporter: Reynold Xin
>
> The RDD API is very flexible, and as a result harder to optimize its execution in some cases. The DataFrame API, on the other hand, is much easier to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to use UDFs, lack of strong types in Scala/Java).
> As a Spark user, I want an API that sits somewhere in the middle of the spectrum so I can write most of my applications with that API, and yet it can be optimized well by Spark to achieve performance and stability.
[jira] [Commented] (SPARK-8939) YARN EC2 default setting fails with IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933233#comment-14933233 ] Sen Fang commented on SPARK-8939: -

FYI, the PR has been merged into branch-1.5 in the spark-ec2 repo: https://github.com/amplab/spark-ec2/pull/10

> YARN EC2 default setting fails with IllegalArgumentException
>
> Key: SPARK-8939
> URL: https://issues.apache.org/jira/browse/SPARK-8939
> Project: Spark
> Issue Type: Bug
> Components: EC2
> Affects Versions: 1.5.0
> Reporter: Andrew Or
>
> I just set it up from scratch using the spark-ec2 script. Then I ran
> {code}
> bin/spark-shell --master yarn
> {code}
> which failed with
> {code}
> 15/07/09 03:44:29 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Unknown/unsupported param List(--num-executors, , --executor-memory, 6154m, --executor-memory, 6154m, --executor-cores, 2, --name, Spark shell)
> {code}
> This goes away if I provide `--num-executors`, but we should fix the default.
[jira] [Commented] (SPARK-8939) YARN EC2 default setting fails with IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744792#comment-14744792 ] Sen Fang commented on SPARK-8939: -

The issue seems to lie in https://github.com/amplab/spark-ec2/blob/branch-1.5/templates/root/spark/conf/spark-env.sh#L7 where {{spark_worker_instances}} is set to the empty string here: https://github.com/apache/spark/blob/branch-1.5/ec2/spark_ec2.py#L810

This results in the following replacement in spark-env.sh:

{code}
if [ -n ]; then
  export SPARK_WORKER_INSTANCES=
fi
{code}

Bash evaluates this condition to true: with a single argument, {{[ -n ]}} tests whether the literal string "-n" is non-empty, which it always is. The simple fix is to surround the variable with quotes in the if statement.

However, I'm a bit confused: https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/SparkConf.scala#L481 suggests this environment variable has been deprecated in favor of `SPARK_EXECUTOR_INSTANCES`, but `--worker-instances` is referenced in a lot of places, so I'm not sure whether we want to remove/deprecate it. I'm submitting a PR for now, since I have tested my fix for YARN and also verified it doesn't break regular standalone mode.
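The quoting bug above is easy to reproduce in any POSIX shell: with a single argument, `[ -n ]` tests whether the literal string "-n" is non-empty, which is always true. A minimal sketch of the broken template and the quoted fix (the variable name here stands in for the empty template substitution and is invented for illustration):

```shell
SPARK_WORKER_INSTANCES_SUB=""   # stands in for the empty substitution

# Broken: the substitution vanished, leaving `[ -n ]`, which is always true
if [ -n ]; then
  unquoted=taken
else
  unquoted=skipped
fi

# Fixed: quoting yields `[ -n "" ]`, which is false for an empty value
if [ -n "$SPARK_WORKER_INSTANCES_SUB" ]; then
  quoted=taken
else
  quoted=skipped
fi

echo "unquoted=$unquoted quoted=$quoted"
```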
[jira] [Created] (SPARK-10543) Peak Execution Memory Quantile should be Pre-task Basis
Sen Fang created SPARK-10543:

Summary: Peak Execution Memory Quantile should be Pre-task Basis
Key: SPARK-10543
URL: https://issues.apache.org/jira/browse/SPARK-10543
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.5.0
Reporter: Sen Fang
Priority: Minor

Currently the Peak Execution Memory quantiles seem to be cumulative rather than per-task. For example, I have seen a value of 2TB on the quantile metric in one of my jobs, while each individual task shows less than 1GB in the bottom table.

[~andrewor14] In your PR https://github.com/apache/spark/pull/7770, the screenshot shows a Max Peak Execution Memory of 792.5KB, while in the bottom table it's about 50KB per task (unless your workload is skewed).

The fix seems straightforward: use the `update` rather than the `value` from the accumulable. I'm happy to provide a PR if people agree this is the right behavior.
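The difference between the two accumulable fields can be shown with toy numbers (the figures below are invented for illustration; `value` is the running total after each task, `update` is that task's own delta):

```python
# (value, update) pairs as four tasks finish; figures are hypothetical.
task_metrics = [(100, 100), (250, 150), (400, 150), (420, 20)]

# Quantiles/max over `value` grow cumulatively and overstate the peak...
max_over_value = max(value for value, _ in task_metrics)
# ...while `update` reflects what each task individually used.
max_over_update = max(update for _, update in task_metrics)

print(max_over_value, max_over_update)
```

With these numbers the cumulative metric reports 420 as the "peak" even though no single task used more than 150, which is exactly the 2TB-vs-1GB discrepancy described above.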
[jira] [Updated] (SPARK-10543) Peak Execution Memory Quantile should be Per-task Basis
[ https://issues.apache.org/jira/browse/SPARK-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sen Fang updated SPARK-10543:

Summary: Peak Execution Memory Quantile should be Per-task Basis (was: Peak Execution Memory Quantile should be Pre-task Basis)
[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=14703626#comment-14703626 ] Sen Fang commented on SPARK-3533: -

I also have a working implementation based on Hadoop 1 + MultipleTextOutputFormat in Scala. [~silasdavis] let me know if you have time to put up a PR. If not, I can get one together this weekend.

Add saveAsTextFileByKey() method to RDDs

Key: SPARK-3533
URL: https://issues.apache.org/jira/browse/SPARK-3533
Project: Spark
Issue Type: Improvement
Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys. For example, say I have an RDD like this:

{code}
a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
a.keys().distinct().collect()
['B', 'F', 'N']
{code}

Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition. So the output would look something like:

{code}
/path/prefix/B [/part-1, /part-2, etc]
/path/prefix/F [/part-1, /part-2, etc]
/path/prefix/N [/part-1, /part-2, etc]
{code}

Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark. Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once.
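The requested semantics can be modeled without Hadoop at all, one output bucket per distinct key, using the example data from the issue (`save_by_key` is a hypothetical stand-in; a real implementation would write part- files via MultipleTextOutputFormat rather than return a dict):

```python
def save_by_key(pairs, prefix="/path/prefix"):
    # Toy model of the proposed saveAsTextFileByKey(): group each value
    # under one output path per distinct key instead of writing files.
    out = {}
    for key, value in pairs:
        out.setdefault(f"{prefix}/{key}", []).append(value)
    return out

names = ['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']
dirs = save_by_key((name[0], name) for name in names)
print(sorted(dirs))
```

This mirrors the desired layout: one directory each for keys B, F, and N, with the matching values grouped under it.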
[jira] [Created] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits
Sen Fang created SPARK-8443:
--------------------------------

             Summary: GenerateMutableProjection Exceeds JVM Code Size Limits
                 Key: SPARK-8443
                 URL: https://issues.apache.org/jira/browse/SPARK-8443
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0
            Reporter: Sen Fang

GenerateMutableProjection puts all expression columns into a single apply function. When there are a lot of columns, the apply function's code size exceeds the 64KB limit, which is a hard JVM limit that cannot be changed. This came up when we were aggregating about 100 columns with the codegen and unsafe features enabled.

I wrote a unit test that reproduces this issue: https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala

The test currently fails at 2048 expressions. Master seems more tolerant than branch-1.4 because its generated code is more concise; still, although the code on master has changed since branch-1.4, I am able to reproduce the problem on master as well.

For now I have hacked around the problem in branch-1.4 by wrapping each expression in a separate function and calling those functions sequentially in apply. The proper fix is probably to check the length of the projectCode and break it up as necessary. (This seems easier on master, since code there is built from strings rather than quasiquotes.)

Let me know if anyone has additional thoughts on this; I'm happy to contribute a pull request.
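The branch-1.4 workaround described above, wrapping each expression in its own function and calling them sequentially from apply, can be illustrated with a toy string-based code generator. This is pure Python, not Spark's actual Janino/Scala codegen, and every name in it is hypothetical; it only shows the chunking idea that keeps each generated method small:

```python
# Illustrative only: split one huge generated apply() into many small
# helper functions, each safely under a per-method size limit.
CHUNK_SIZE = 100  # expressions per helper function

def generate_projection(num_exprs):
    """Generate source for a projection computing out[i] = row[i] * 2,
    chunked into helpers instead of one monolithic function."""
    helpers = []
    for chunk_id in range(0, num_exprs, CHUNK_SIZE):
        body = "\n".join(
            f"    out[{i}] = row[{i}] * 2"
            for i in range(chunk_id, min(chunk_id + CHUNK_SIZE, num_exprs))
        )
        helpers.append(f"def apply_{chunk_id}(row, out):\n{body}")
    # apply() just calls the helpers in order, so no single function is huge
    calls = "\n".join(
        f"    apply_{c}(row, out)" for c in range(0, num_exprs, CHUNK_SIZE)
    )
    return "\n\n".join(helpers) + f"""

def apply(row):
    out = [None] * {num_exprs}
{calls}
    return out
"""

namespace = {}
exec(generate_projection(250), namespace)  # 250 expressions -> 3 helpers
result = namespace["apply"](list(range(250)))
```

In Spark's codegen the analogous fix would chunk the generated Java source by measured code length rather than by a fixed expression count.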
[jira] [Updated] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sen Fang updated SPARK-8443:
--------------------------------
Attaching the stack trace produced by the unit test:
{code}
[info] - code size limit *** FAILED *** (7 seconds, 103 milliseconds)
[info]   com.google.common.util.concurrent.UncheckedExecutionException: org.codehaus.janino.JaninoRuntimeException: Code of method (Ljava/lang/Object;)Ljava/lang/Object; of class SC$SpecificProjection grows beyond 64 KB
[info]   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263)
[info]   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
[info]   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
[info]   at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
[info]   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:285)
[info]   at org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:50)
[info]   at org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:48)
[info]   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
[info]   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
[info]   at scala.collection.immutable.Range.foreach(Range.scala:141)
[info]   at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
[info]   at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105)
[info]   at org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply$mcV$sp(CodeGenerationSuite.scala:47)
[info]   at org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47)
[info]   at org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
[info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
{code}
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526153#comment-14526153 ]

Sen Fang commented on SPARK-2336:
--

Hey Longbao, great to hear from you. To the best of my understanding, in the paper I cited above they distribute the input points by pushing them through the top tree (figure 4). Of course, for a precise result this would require backtracking, which isn't ideal. What they propose instead is to use a buffer boundary, as in a spill tree. Unlike in a spill tree, however, you push an input target to both children if it falls within the buffer zone, because the top tree was built as a metric tree (they explain that using a spill tree as the top tree carries a high memory penalty). So every input point might end up in multiple subtrees, and you need a reduceByKey at the end to keep only the top K neighbors.

Is your implementation available somewhere? I'm having a hard time finding time to finish my implementation this month. It would be great if we could eventually compare our implementations, validate them, and benchmark.

> Approximate k-NN Models for MLLib
> ---------------------------------
>
>                 Key: SPARK-2336
>                 URL: https://issues.apache.org/jira/browse/SPARK-2336
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Brian Gawalt
>            Priority: Minor
>              Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to also offer approximate k-Nearest Neighbor. A promising approach would involve building a kd-tree variant within each partition, a la http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with much lower latency than the plain-vanilla kNN versions.
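The final merge Sen describes, where each overlapping subtree emits candidate neighbors for a point and a reduceByKey keeps only the overall top K, can be sketched as a plain reduce function (pure Python; all names are illustrative):

```python
import heapq

def top_k_merge(candidates_a, candidates_b, k):
    """Merge two candidate lists of (distance, neighbor_id) pairs,
    keeping only the k nearest overall -- the kind of function you
    would pass to reduceByKey, keyed by query point."""
    return heapq.nsmallest(k, candidates_a + candidates_b)

# A query point that fell into the buffer zone lands in two subtrees,
# each of which reports its own local nearest neighbors:
from_tree_1 = [(0.3, 'p7'), (1.2, 'p4')]
from_tree_2 = [(0.5, 'p9'), (0.9, 'p2')]
merged = top_k_merge(from_tree_1, from_tree_2, k=3)
# nearest three overall: [(0.3, 'p7'), (0.5, 'p9'), (0.9, 'p2')]
```

Because the function is associative and commutative over candidate lists, it composes correctly no matter how many subtrees a point was duplicated into.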
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396306#comment-14396306 ]

Sen Fang commented on SPARK-2336:
--

I'm tentatively going to give the hybrid spill tree implementation described in the OP's post a try. Specifically I'm going to follow the implementation described in: http://dx.doi.org/10.1109/WACV.2007.18

According to the paper, the algorithm scales well in the number of observations, bounded by the available memory across the cluster (billions of points in the paper's example). The original hybrid spill tree paper claims the algorithm scales much better than the alternatives (LSH, metric tree) in the number of features (up to hundreds). Furthermore, random projection is often used to reduce the dimensionality of the feature space; random projection is out of scope for my implementation and should probably be implemented as a separate dimension reduction technique. (In the paper, 4000 features are projected down to 100.)

The average runtime for training and prediction is generally O(log n). Training runtime suffers as the buffer size is turned up: the search is only approximate at overlapping nodes, and the larger the buffer size, the more accurate the search. Prediction runtime suffers when a backtracking search is needed at a non-overlapping node (which becomes more likely as the balance threshold increases). More specifically, a metric tree gives an accurate but less performant search, while a spill tree has overlapping nodes and uses the more efficient defeatist search, whose accuracy is governed by the buffer size *tau* (the larger it is, the more accurate the search, but the deeper the tree). A hybrid spill tree is constructed with both overlapping (spill tree) and non-overlapping (metric tree) nodes and uses the balance threshold *rho* to trade tree depth against search efficiency.

A high-level overview of the algorithm is as follows:
1. Sample M data points (M = number of partitions)
2. Build the top-level metric tree
3. Repartition the RDD by assigning each point to a leaf node of the above tree
4. Build a hybrid spill tree at each partition

This concludes the training phase of kNN. Prediction is achieved by identifying the subtree a point falls into, then running the prediction on that subtree.

Let me know if anyone has any thoughts, concerns or corrections on the things I said above.
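The four training steps above can be sketched end-to-end in miniature. This is pure Python over 1-D integer points, with no Spark: the "top tree" collapses to a single-level nearest-pivot split, pivots are chosen evenly rather than sampled randomly so the demo is deterministic, and step 4's per-partition spill trees are replaced by raw lists searched brute-force. All names are made up:

```python
def nearest_pivot(point, pivots):
    # stand-in for "push the point through the top tree to a leaf"
    return min(range(len(pivots)), key=lambda i: abs(pivots[i] - point))

def train(points, num_partitions):
    # step 1: pick M pivots (sampled randomly in the real algorithm;
    # chosen evenly here for determinism)
    step = len(points) // num_partitions
    pivots = points[::step][:num_partitions]
    # steps 2-3: the one-level "top tree" routes every point to a leaf,
    # which in Spark would be a repartition of the RDD
    partitions = {i: [] for i in range(num_partitions)}
    for p in points:
        partitions[nearest_pivot(p, pivots)].append(p)
    # step 4 would build a hybrid spill tree per partition; raw lists here
    return pivots, partitions

def predict(query, pivots, partitions, k):
    # route the query to its subtree, then search only that partition
    subtree = partitions[nearest_pivot(query, pivots)]
    return sorted(subtree, key=lambda p: abs(p - query))[:k]

points = list(range(100))
pivots, parts = train(points, num_partitions=4)
neighbors = predict(42.4, pivots, parts, k=3)
# → [42, 43, 41]
```

The accuracy loss of this sketch is exactly the one discussed above: a query near a partition boundary misses neighbors in the adjacent partition, which is what the spill tree's buffer zone (duplicating boundary points into both children) mitigates.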