[jira] [Commented] (SPARK-6468) Fix the race condition of subDirs in DiskBlockManager
[ https://issues.apache.org/jira/browse/SPARK-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375873#comment-14375873 ] Apache Spark commented on SPARK-6468: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/5136 Fix the race condition of subDirs in DiskBlockManager - Key: SPARK-6468 URL: https://issues.apache.org/jira/browse/SPARK-6468 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.3.0 Reporter: Shixiong Zhu Priority: Minor There are two race conditions on subDirs in DiskBlockManager: 1. `getAllFiles` does not use correct locks to read the contents of `subDirs`. Although it's designed for testing, it's still worth adding correct locks to eliminate the race condition. 2. The double-check in `getFile(filename: String)` has a race condition. If a thread finds `subDirs(dirId)(subDirId)` is not null outside the `synchronized` block, it may not be able to see the correct contents of the File instance pointed to by `subDirs(dirId)(subDirId)` according to the Java memory model (there is no volatile variable here). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6468) Fix the race condition of subDirs in DiskBlockManager
Shixiong Zhu created SPARK-6468: --- Summary: Fix the race condition of subDirs in DiskBlockManager Key: SPARK-6468 URL: https://issues.apache.org/jira/browse/SPARK-6468 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.3.0 Reporter: Shixiong Zhu Priority: Minor There are two race conditions on subDirs in DiskBlockManager: 1. `getAllFiles` does not use correct locks to read the contents of `subDirs`. Although it's designed for testing, it's still worth adding correct locks to eliminate the race condition. 2. The double-check in `getFile(filename: String)` has a race condition. If a thread finds `subDirs(dirId)(subDirId)` is not null outside the `synchronized` block, it may not be able to see the correct contents of the File instance pointed to by `subDirs(dirId)(subDirId)` according to the Java memory model (there is no volatile variable here). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
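To make the second point concrete, here is a minimal, hypothetical sketch of the double-checked pattern described above (not the actual DiskBlockManager source; names and structure are simplified). The unsafe variant reads the array slot outside the lock, so under the Java memory model a thread may observe a non-null but not safely published File; the safe variant keeps both the read and the write under the same per-directory lock.
{code}
import java.io.File

class SubDirCacheSketch(localDirs: Array[File], subDirsPerLocalDir: Int) {
  // One inner array per local dir; slots are filled lazily.
  private val subDirs = Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))

  // Unsafe: the first read happens outside the lock and the slot is not volatile,
  // so there is no happens-before edge with the thread that wrote it.
  def getFileUnsafe(dirId: Int, subDirId: Int, filename: String): File = {
    val racyRead = subDirs(dirId)(subDirId)
    val subDir =
      if (racyRead != null) racyRead
      else subDirs(dirId).synchronized { getOrCreate(dirId, subDirId) }
    new File(subDir, filename)
  }

  // Safe: always read and write the slot under the same lock.
  def getFileSafe(dirId: Int, subDirId: Int, filename: String): File =
    new File(subDirs(dirId).synchronized { getOrCreate(dirId, subDirId) }, filename)

  // Must be called while holding subDirs(dirId)'s lock.
  private def getOrCreate(dirId: Int, subDirId: Int): File = {
    val existing = subDirs(dirId)(subDirId)
    if (existing != null) existing
    else {
      val created = new File(localDirs(dirId), "%02x".format(subDirId))
      created.mkdirs()
      subDirs(dirId)(subDirId) = created
      created
    }
  }
}
{code}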
[jira] [Commented] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375766#comment-14375766 ] Theodore Vasiloudis commented on SPARK-2394: Just adding some more info here for people who end up here through searches: Steps 1-3 can be completed by running this script on each machine in your cluster: https://gist.github.com/thvasilo/7696d21cb3205f5cb11d There should be an easy way to execute this script when the cluster is being launched; I tried using the --user-data flag but that doesn't seem to do that. Otherwise you'd have to rsync this script to each machine (easy, use ~/spark-ec2/copy-dir after you've copied the file to your master) and then ssh into each machine and run it (not so easy). For Step 4, make sure that core-site.xml is changed in both the hadoop config and the spark-conf/ directory. Also, as suggested in the hadoop-lzo docs:
{quote}
Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out the line:
{code}
JAVA_LIBRARY_PATH=''
{code}
{quote}
Here's how I set the vars in spark-env.sh:
{code}
export SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/
export SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar
{code}
And what I added to both core-site.xml files:
{code:xml}
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
{code}
As for the code (Step 5) itself, I've tried the different variations suggested in the mailing list and other places and ended up using the following: https://gist.github.com/thvasilo/cd99709eacb44c8a8cff Note that this uses the sequenceFile reader, specifically for the Google Ngrams. Setting minPartitions is important in order to get good parallelization with what you do with the data later on (3*cores in your cluster seems like a good value). You can run the above job using:
{code}
./bin/spark-submit --jars local:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar --class your.package.here.TestNgrams --master $SPARK_MASTER $SPARK_JAR dummy_arg
{code}
You should of course set the env variables for your spark master and the location of your fat jar. Note that I'm passing the hadoop-lzo jar as local; that assumes that every node has built the jar, which is done by the script given above. Do the above and you should get the count and the first line of the data when running the job. Make it easier to read LZO-compressed files from EC2 clusters - Key: SPARK-2394 URL: https://issues.apache.org/jira/browse/SPARK-2394 Project: Spark Issue Type: Improvement Components: EC2, Input/Output Affects Versions: 1.0.0 Reporter: Nicholas Chammas Priority: Minor Labels: compression Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, among other things, for putting together interesting and easily reproducible public demos of Spark's capabilities.
The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average {{spark-ec2}} cluster to read input compressed in this way. This is what one has to go through to get a Spark cluster created with {{spark-ec2}} to read LZO-compressed files: # Install the latest LZO release, perhaps via {{yum}}. # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven. # Install Maven. For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually. # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E]. # Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}. This seems like a bit too much work for what we're trying to accomplish. If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}} cluster -- it would be great if we could somehow make this less painful.
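For step 5 in the list above, a minimal sketch of the kind of call involved might look like the following. It assumes the twitter/hadoop-lzo jar is on the classpath and uses its LzoTextInputFormat class; the input path is a placeholder, and the class/package names should be checked against the hadoop-lzo build actually installed.
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
// Provided by the hadoop-lzo jar; verify the package name against your build.
import com.hadoop.mapreduce.LzoTextInputFormat

object LzoReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lzo-read-sketch"))
    val lines = sc.newAPIHadoopFile(
        "hdfs:///path/to/data.lzo",            // placeholder path
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map { case (_, line) => line.toString } // drop the byte-offset key
    println(s"count: ${lines.count()}")
    println(s"first: ${lines.first()}")
    sc.stop()
  }
}
{code}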
[jira] [Comment Edited] (SPARK-1702) Mesos executor won't start because of a ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375599#comment-14375599 ] Littlestar edited comment on SPARK-1702 at 3/23/15 11:00 AM: - I met this on spark 1.3.0 + mesos 0.21.1 with run-example SparkPi Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:321) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:266) Could not find the main class: org.apache.spark.executor.MesosExecutorBackend was (Author: cnstar9988): I met this on spark 1.3.0 + mesos 0.21.1 with run-example SparkPi Mesos executor won't start because of a ClassNotFoundException -- Key: SPARK-1702 URL: https://issues.apache.org/jira/browse/SPARK-1702 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Labels: executors, mesos, spark Some discussion here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html Fix here (which is probably not the right fix): https://github.com/apache/spark/pull/620 This was broken in v0.9.0, was fixed in v0.9.1 and is now broken again. Error in Mesos executor stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0502 17:31:42.672224 14688 exec.cpp:131] Version: 0.18.0 I0502 17:31:42.674959 14707 exec.cpp:205] Executor registered on slave 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:42 INFO MesosExecutorBackend: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/05/02 17:31:42 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:43 INFO SecurityManager: Changing view acls to: vagrant 14/05/02 17:31:43 INFO SecurityManager: SecurityManager, is authentication enabled: false are ui acls enabled: false users with view permissions: Set(vagrant) 14/05/02 17:31:43 INFO Slf4jLogger: Slf4jLogger started 14/05/02 17:31:43 INFO Remoting: Starting remoting 14/05/02 17:31:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@localhost:50843] 14/05/02 17:31:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@localhost:50843] java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:165) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:176) at org.apache.spark.executor.Executor.<init>(Executor.scala:106) at org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:56) Exception in thread "Thread-0" I0502 17:31:43.710039 14707 exec.cpp:412] Deactivating the executor libprocess The problem is that it can't find the class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375766#comment-14375766 ] Theodore Vasiloudis edited comment on SPARK-2394 at 3/23/15 11:38 AM: -- Just adding some more info here for people who end up here through searches: Steps 1-3 can be completed by running this script on each machine in your cluster: https://gist.github.com/thvasilo/7696d21cb3205f5cb11d There should be an easy way to execute this script when the cluster is being launched; I tried using the --user-data flag but that doesn't seem to do that. Otherwise you'd have to rsync this script to each machine (easy, use ~/spark-ec2/copy-dir after you've copied the file to your master) and then ssh into each machine and run it (not so easy). For Step 4, make sure that core-site.xml is changed in both the hadoop config and the spark-conf/ directory. Also, as suggested in the hadoop-lzo docs:
{quote}
Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out the line:
{code}
JAVA_LIBRARY_PATH=''
{code}
{quote}
Here's how I set the vars in spark-env.sh:
{code}
export SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/
export SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar
{code}
And what I added to both core-site.xml files:
{code:xml}
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
{code}
Here is an easy way to test if everything works (replace ephemeral with persistent if you are using that):
{code}
echo "hello world" > test.log
lzop test.log
ephemeral-hdfs/bin/hadoop fs -copyFromLocal test.log.lzo /user/root/test.log.lzo
# Test local
ephemeral-hdfs/bin/hadoop jar /root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/root/test.log.lzo
# Test distributed
ephemeral-hdfs/bin/hadoop jar /root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/root/test.log.lzo
{code}
As for the code (Step 5) itself, I've tried the different variations suggested in the mailing list and other places and ended up using the following: https://gist.github.com/thvasilo/cd99709eacb44c8a8cff Note that this uses the sequenceFile reader, specifically for the Google Ngrams. Setting minPartitions is important in order to get good parallelization with what you do with the data later on (3*cores in your cluster seems like a good value). You can run the above job using:
{code}
./bin/spark-submit --jars local:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar --class your.package.here.TestNgrams --master $SPARK_MASTER $SPARK_JAR dummy_arg
{code}
You should of course set the env variables for your spark master and the location of your fat jar. Note that I'm passing the hadoop-lzo jar as local; that assumes that every node has built the jar, which is done by the script given above. Do the above and you should get the count and the first line of the data when running the job.
was (Author: tvas): Just adding some more info here for people who end up here through searches: Steps 1-3 can be completed by running this script on each machine on you cluster: https://gist.github.com/thvasilo/7696d21cb3205f5cb11d There should be an easy way to execute this script when the cluster is being launched, I tried using the --user-data flag but that doesn't seem to do that. Otherwise you'd have to rsync this script into each machine (easy, use ~/spark-ec2/copy-dir after you've copied the file to you master) and then ssh into each machine and run it (not so easy) For Step 4, make sure that the core-site.xml in changed in both the hadoop config, as well as the spark-conf/ directory. Also as suggested in the hadoop-lzo docs {quote} Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out the line: {code} JAVA_LIBRARY_PATH='' {code} {quote} Here's how I set the vars in spark-env.sh: {code} export SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/ export SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar {code} And what I added to both core-site.xml {code:xml} property nameio.compression.codecs/name
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375707#comment-14375707 ] vijay commented on SPARK-6435: -- I tested this on Linux with the 1.3.0 release, works fine. Apparently it is a windows-specific issue: on windows only the 1st jar is picked up. This appears to be a problem with parsing the command line, introduced by the change in windows scripts between 1.2.0 and 1.3.0. A simple fix to bin\windows-utils.cmd resolves the issue. I ran this command to test with 'real' jars:
{code}
%SPARK_HOME%\bin\spark-shell --master local --jars c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar,c:\temp\guava-14.0.1.jar
{code}
Here are some snippets from the console - note that only the 1st jar is added; I can load classes from the 1st jar but not the 2nd:
{code}
15/03/23 10:57:41 INFO SparkUI: Started SparkUI at http://vgarla-t440P.fritz.box:4040
15/03/23 10:57:41 INFO SparkContext: Added JAR file:/c:/code/elasticsearch-1.4.2/lib/lucene-core-4.10.2.jar at http://192.168.178.41:54601/jars/lucene-core-4.10.2.jar with timestamp 1427104661969
15/03/23 10:57:42 INFO Executor: Starting executor ID driver on host localhost
...
scala> import org.apache.lucene.util.IOUtils
import org.apache.lucene.util.IOUtils

scala> import com.google.common.base.Strings
<console>:20: error: object Strings is not a member of package com.google.common.base
{code}
Looking at the command line in jvisualvm, I see that only the 1st jar is added:
{code}
Main class: org.apache.spark.deploy.SparkSubmit
Arguments: --class org.apache.spark.repl.Main --master local --jars c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar spark-shell c:\temp\guava-14.0.1.jar
{code}
In spark 1.2.0, spark-shell2.cmd just passed arguments as is to the java command line:
{code}
cmd /V /E /C %SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.repl.Main %* spark-shell
{code}
In spark 1.3.0, spark-shell2.cmd calls windows-utils.cmd to parse arguments into SUBMISSION_OPTS and APPLICATION_OPTS. Only the first jar in the list passed to --jars makes it into the SUBMISSION_OPTS; the remaining jars are added to APPLICATION_OPTS:
{code}
call %SPARK_HOME%\bin\windows-utils.cmd %*
if %ERRORLEVEL% equ 1 (
  call :usage
  exit /b 1
)
echo SUBMISSION_OPTS=%SUBMISSION_OPTS%
echo APPLICATION_OPTS=%APPLICATION_OPTS%
cmd /V /E /C %SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.repl.Main %SUBMISSION_OPTS% spark-shell %APPLICATION_OPTS%
{code}
The problem is that by the time the command line arguments get to windows-utils.cmd, the windows command line processor has split the comma-separated list into distinct arguments. The windows way of saying "treat this as a single arg" is to surround it in double quotes. However, when I surround the jars in quotes, I get an error:
{code}
%SPARK_HOME%\bin\spark-shell --master local --jars "c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar,c:\temp\guava-14.0.1.jar"
c:\temp\guava-14.0.1.jar==x was unexpected at this time.
{code}
Digging in, I see this is caused by this line from windows-utils.cmd:
{code}
if "x%2"=="x" (
{code}
Replacing the quotes with square brackets does the trick:
{code}
if [x%2]==[x] (
{code}
Now the command line is processed correctly.
spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error:
{code}
scala> import com.google.common.base.Strings
<console>:19: error: object Strings is not a member of package com.google.common.base
       import com.google.common.base.Strings
              ^
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2167) spark-submit should return exit code based on failure/success
[ https://issues.apache.org/jira/browse/SPARK-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376001#comment-14376001 ] Sean Owen commented on SPARK-2167: -- (Thanks [~tgraves] for having a look at some of these older issues. You'd know a lot about what's still in play or likely obsolete.) spark-submit should return exit code based on failure/success - Key: SPARK-2167 URL: https://issues.apache.org/jira/browse/SPARK-2167 Project: Spark Issue Type: New Feature Components: Deploy Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Guoqiang Li spark-submit script and Java class should exit with 0 for success and non-zero with failure so that other command line tools and workflow managers (like oozie) can properly tell if the spark app succeeded or failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6436) io/netty missing from external shuffle service jars for yarn
[ https://issues.apache.org/jira/browse/SPARK-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves closed SPARK-6436. Resolution: Invalid This is working for me. Sorry for the confusion, I had build environment issues. io/netty missing from external shuffle service jars for yarn Key: SPARK-6436 URL: https://issues.apache.org/jira/browse/SPARK-6436 Project: Spark Issue Type: Bug Components: Shuffle, YARN Affects Versions: 1.3.0 Reporter: Thomas Graves I was trying to use the external shuffle service on yarn but it appears that io/netty isn't included in the network jars. I loaded up network-common, network-yarn, and network-shuffle. If there is some other jar supposed to be included please let me know. 2015-03-20 14:25:07,142 [main] FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager java.lang.NoClassDefFoundError: io/netty/channel/EventLoopGroup at org.apache.spark.network.shuffle.ExternalShuffleBlockManager.init(ExternalShuffleBlockManager.java:64) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.init(ExternalShuffleBlockHandler.java:53) at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:105) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376009#comment-14376009 ] Christophe PRÉAUD commented on SPARK-6469: -- Sorry if I'm saying something stupid, but I would expect {{LOCAL_DIRS}} (according to Spark comments in [Utils.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L749], {{YARN_LOCAL_DIRS}} is for Hadoop 0.23, and {{LOCAL_DIRS}} for Hadoop 2.X) to be set in yarn-client mode. Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4227) Document external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376022#comment-14376022 ] Thomas Graves commented on SPARK-4227: -- Looks like I had build issues. The instructions http://spark.apache.org/docs/1.3.0/job-scheduling.html work. Document external shuffle service - Key: SPARK-4227 URL: https://issues.apache.org/jira/browse/SPARK-4227 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical We should add spark.shuffle.service.enabled to the Configuration page and give instructions for launching the shuffle service as an auxiliary service on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376039#comment-14376039 ] Christophe PRÉAUD commented on SPARK-6469: -- Not exactly, sorry if this was not clear from my description: * when I am running YARN on Hadoop 2 in *cluster* mode, both {{LOCAL_DIRS}} and {{CONTAINER_ID}} are correctly set. * when I am running YARN on Hadoop 2 in *client* mode, neither {{LOCAL_DIRS}} nor {{CONTAINER_ID}} is correctly set. Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6255) Python MLlib API missing items: Classification
[ https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375994#comment-14375994 ] Apache Spark commented on SPARK-6255: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/5137 Python MLlib API missing items: Classification -- Key: SPARK-6255 URL: https://issues.apache.org/jira/browse/SPARK-6255 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. LogisticRegressionWithLBFGS * setNumClasses * setValidateData LogisticRegressionModel * getThreshold * numClasses * numFeatures SVMWithSGD * setValidateData SVMModel * getThreshold -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376017#comment-14376017 ] Sean Owen commented on SPARK-6469: -- Ah I get you, I think you have a point. So, you are running in YARN on Hadoop 2 in cluster mode and neither {{YARN_LOCAL_DIRS}} or {{CONTAINER_ID}} is set. Paging [~sandyr] [~tgraves] [~vanzin] for thoughts on whether that's to be expected, not, or means a check here has to be adjusted. Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS
[ https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376046#comment-14376046 ] Debasish Das commented on SPARK-3735: - We might want to consider doing some of these things through indexed RDD exposed through an API...right now ALS is completely join based...can we do something nicer if we have access to an efficient read only cache from ALS mapPartitions...Idea here is to think about zeros explicitly and not adding the implicit heuristic which is generally hard to tune... Sending the factor directly or AtA based on the cost in ALS --- Key: SPARK-3735 URL: https://issues.apache.org/jira/browse/SPARK-3735 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is common to have some super popular products in the dataset. In this case, sending many user factors to the target product block could be more expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i r_ij` to the product block. The cost of sending a single factor is `k`, while the cost of sending a normal equation is much more expensive, `k * (k + 3) / 2`. However, if we use normal equation for all products associated with a user, we don't need to send this user factor. Determining the optimal assignment is hard. But we could use a simple heuristic. Inside any rating block, 1) order the product ids by the number of user ids associated with them in desc order 2) starting from the most popular product, mark popular products as use normal eq and calculate the cost Remember the best assignment that comes with the lowest cost and use it for computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
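As a rough illustration of the cost comparison driving the heuristic in the description above (a simplification, not the proposed block-level assignment algorithm): sending one user factor to a product block costs k values, while sending one normal-equation contribution costs k * (k + 3) / 2, so the normal-equation path only pays off for products with enough distinct senders. The names below are illustrative only.
{code}
// Simplified per-product version of the trade-off; the issue proposes a more
// careful assignment that also accounts for user factors that need not be sent
// at all once every product they rate uses the normal-equation path.
object AlsCostSketch {
  def factorCost(k: Int): Long = k.toLong                  // one factor vector
  def normalEqCost(k: Int): Long = k.toLong * (k + 3) / 2  // upper-triangular AtA plus Atb

  // Prefer shipping the normal equation when shipping all the individual
  // factors for this product would be more expensive.
  def preferNormalEq(numUsersForProduct: Long, k: Int): Boolean =
    numUsersForProduct * factorCost(k) > normalEqCost(k)

  def main(args: Array[String]): Unit = {
    val k = 50
    println(preferNormalEq(numUsersForProduct = 10, k))    // false: few raters
    println(preferNormalEq(numUsersForProduct = 10000, k)) // true: very popular product
  }
}
{code}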
[jira] [Updated] (SPARK-6463) AttributeSet.equal should compare size
[ https://issues.apache.org/jira/browse/SPARK-6463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] June updated SPARK-6463: Summary: AttributeSet.equal should compare size (was: [SPARK][SQL] AttributeSet.equal should compare size) AttributeSet.equal should compare size -- Key: SPARK-6463 URL: https://issues.apache.org/jira/browse/SPARK-6463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: June Priority: Minor AttributeSet.equal should compare both member and size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375580#comment-14375580 ] Littlestar edited comment on SPARK-6461 at 3/23/15 9:04 AM: when I add MESOS_HADOOP_CONF_DIR at all mesos-master-env.sh and mesos-slave-env.sh , It throws the following error. Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend similar to https://issues.apache.org/jira/browse/SPARK-1702 was (Author: cnstar9988): when I add MESOS_HADOOP_CONF_DIR at all mesos-master-env.sh and mesos-slave-env.sh , It throws the following error. Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend similar to https://github.com/apache/spark/pull/620 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use mesos run spak 1.3.0 ./run-example SparkPi but failed. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos spark.executorEnv.PATH spark.executorEnv.HADOOP_HOME spark.executorEnv.JAVA_HOME E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz' sh: hadoop: command not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375976#comment-14375976 ] Sean Owen commented on SPARK-6469: -- So, if {{YARN_LOCAL_DIRS}} is set, then {{isRunningInYarnContainer}} is {{true}} and it uses this for the local dir. {{CONTAINER_ID}} is not relevant to this. What local directory are you expecting it to use, if {{YARN_LOCAL_DIRS}} isn't set? Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375905#comment-14375905 ] Thomas Graves commented on SPARK-6449: -- [~rdub] Was there an exception in the log higher up? Wondering if it shows the entire exception for the out of memory. Driver OOM results in reported application result SUCCESS - Key: SPARK-6449 URL: https://issues.apache.org/jira/browse/SPARK-6449 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Ryan Williams I ran a job yesterday that according to the History Server and YARN RM finished with status {{SUCCESS}}. Clicking around on the history server UI, there were too few stages run, and I couldn't figure out why that would have been. Finally, inspecting the end of the driver's logs, I saw: {code} 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext Exception in thread Driver scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered. 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1426705269584_0055 {code} The driver OOM'd, [the {{catch}} block that presumably should have caught it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484] threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and written to the event log. This should be logged as a failed job and reported as such to YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
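A hypothetical, simplified illustration of the failure mode described in this issue (not the actual ApplicationMaster code): java.lang.OutOfMemoryError extends Error rather than Exception, so a pattern match over a caught Throwable that only lists Exception-like cases matches nothing, itself throws scala.MatchError, and the shutdown hook then reports the default SUCCEEDED status.
{code}
// Sketch only; case selection and status strings are illustrative.
def finalStatusFor(cause: Throwable): String = cause match {
  case _: InterruptedException => "KILLED"
  case e: Exception            => s"FAILED: ${e.getMessage}"
  // No case covers java.lang.Error subclasses such as OutOfMemoryError,
  // so reaching here with one raises scala.MatchError at runtime.
}

// A safer version reports every Throwable explicitly instead of falling through.
def finalStatusForSafe(cause: Throwable): String = cause match {
  case _: InterruptedException => "KILLED"
  case e: Exception            => s"FAILED: ${e.getMessage}"
  case t: Throwable            => s"FAILED (fatal error): ${t.getMessage}"
}
{code}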
[jira] [Comment Edited] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375941#comment-14375941 ] Christophe PRÉAUD edited comment on SPARK-6469 at 3/23/15 2:16 PM: --- Attached a simple application to check the value of the {{CONTAINER_ID}} environment variable. * Check in yarn-cluster mode {code} /opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars testyarnvars_2.10-1.0.jar 2/dev/null {code} (the stdout of the application on the YARN web ui reads: {{CONTAINER_ID: container_142761810_0151_01_01}} * Check in yarn-client mode: {code} /opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars testyarnvars_2.10-1.0.jar 2/dev/null {code} CONTAINER_ID: null was (Author: preaudc): Attached a simple application to check the value of the {{CONTAINER_ID}} environment variable. * Check in yarn-cluster mode {code} /opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2/dev/null {code} (the stdout of the application on the YARN web ui reads: {{CONTAINER_ID: container_142761810_0151_01_01}} * Check in yarn-client mode: {code} /opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2/dev/null {code} CONTAINER_ID: null Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375941#comment-14375941 ] Christophe PRÉAUD edited comment on SPARK-6469 at 3/23/15 2:15 PM: --- Attached a simple application to check the value of the {{CONTAINER_ID}} environment variable. * Check in yarn-cluster mode {code} /opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2/dev/null {code} (the stdout of the application on the YARN web ui reads: {{CONTAINER_ID: container_142761810_0151_01_01}} * Check in yarn-client mode: {code} /opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2/dev/null {code} CONTAINER_ID: null was (Author: preaudc): Attached a simple application to check the value of the {{CONTAINER_ID}} environment variable. * Check in yarn-cluster mode {code} /opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2/dev/null {code} (the stdout of the application on the YARN wen ui reads: {{CONTAINER_ID: container_142761810_0151_01_01}} * Check in yarn-client mode: {code} /opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2/dev/null {code} CONTAINER_ID: null Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe PRÉAUD updated SPARK-6469: - Attachment: TestYarnVars.scala Attached a simple application to check the value of the {{CONTAINER_ID}} environment variable. * Check in yarn-cluster mode:
{code}
/opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
(the stdout of the application on the YARN web ui reads: {{CONTAINER_ID: container_142761810_0151_01_01}})
* Check in yarn-client mode:
{code}
/opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
CONTAINER_ID: null
Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
Christophe PRÉAUD created SPARK-6469: Summary: Local directories configured for YARN are not used in yarn-client mode Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
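The TestYarnVars.scala attachment referenced above is not reproduced in this thread; a minimal, hypothetical sketch of such a check (assuming nothing beyond a plain SparkContext and standard environment lookups) might look like this.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Prints the YARN-related environment variables discussed in this issue from the
// driver process. In yarn-cluster mode the driver runs inside a YARN container,
// so CONTAINER_ID and LOCAL_DIRS are expected to be set; in yarn-client mode the
// driver runs on the gateway machine, outside any container.
object TestYarnVars {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TestYarnVars"))
    for (name <- Seq("CONTAINER_ID", "LOCAL_DIRS", "YARN_LOCAL_DIRS")) {
      println(s"$name: ${sys.env.getOrElse(name, "null")}")
    }
    sc.stop()
  }
}
{code}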
[jira] [Commented] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375929#comment-14375929 ] Nicholas Chammas commented on SPARK-2394: - Thank you for posting this information for others! Make it easier to read LZO-compressed files from EC2 clusters - Key: SPARK-2394 URL: https://issues.apache.org/jira/browse/SPARK-2394 Project: Spark Issue Type: Improvement Components: EC2, Input/Output Affects Versions: 1.0.0 Reporter: Nicholas Chammas Priority: Minor Labels: compression Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, among other things, for putting together interesting and easily reproducible public demos of Spark's capabilities. The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average {{spark-ec2}} cluster to read input compressed in this way. This is what one has to go through to get a Spark cluster created with {{spark-ec2}} to read LZO-compressed files: # Install the latest LZO release, perhaps via {{yum}}. # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven. # Install Maven. For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually. # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E]. # Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}. This seems like a bit too much work for what we're trying to accomplish. If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}} cluster -- it would be great if we could somehow make this less painful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6255) Python MLlib API missing items: Classification
[ https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376095#comment-14376095 ] Yanbo Liang commented on SPARK-6255: [~josephkb] Can you assign it to me? Python MLlib API missing items: Classification -- Key: SPARK-6255 URL: https://issues.apache.org/jira/browse/SPARK-6255 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. LogisticRegressionWithLBFGS * setNumClasses * setValidateData LogisticRegressionModel * getThreshold * numClasses * numFeatures SVMWithSGD * setValidateData SVMModel * getThreshold -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6451) Support CombineSum in Code Gen
[ https://issues.apache.org/jira/browse/SPARK-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376088#comment-14376088 ] Apache Spark commented on SPARK-6451: - User 'gvramana' has created a pull request for this issue: https://github.com/apache/spark/pull/5138 Support CombineSum in Code Gen -- Key: SPARK-6451 URL: https://issues.apache.org/jira/browse/SPARK-6451 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Since we are using CombineSum at the reducer side for the SUM function, we need to make it work in code gen. Otherwise, code gen will not convert Aggregates with a SUM function to GeneratedAggregates (the code gen version). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376065#comment-14376065 ] Thomas Graves commented on SPARK-6469: -- Note if its purely the documentation confused you then we should update documentation to clarify the client/cluster mode differences. Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376061#comment-14376061 ] Thomas Graves commented on SPARK-6469: -- Are you saying they are not set on the driver node in yarn client mode? If so that is what I would expect since the driver is not running on the YARN cluster, its running on the gateway (wherever you launch it). Is the driver now chosing local directories for the executors to use? If not what problem is this causing exactly? Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376084#comment-14376084 ] Christophe PRÉAUD commented on SPARK-6469: -- The problem I have is that spark temporary files are written in {{/tmp}} in yarn-client mode, but your explanation makes sense, the gateway is indeed not on the YARN cluster so this is expected. I agree though that an update in the documentation to clarify this would be welcome :-) Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376092#comment-14376092 ] Thomas Graves commented on SPARK-6469: -- Yeah so you should be able to set the spark.local.dir config to change that directory in yarn client mode for the driver. Executors will still use the yarn approved directories. We should change this jira to clarify documentation then. Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
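To make the driver-side workaround suggested above concrete, a minimal sketch (the application name and directory path are illustrative):
{code}
// Hedged sketch: in yarn-client mode the driver does not run in a YARN container,
// so point its scratch space somewhere other than /tmp via spark.local.dir.
// Executors launched on the cluster still use the YARN-configured directories.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-dir-example")            // illustrative app name
  .set("spark.local.dir", "/data/spark-tmp")  // illustrative path on the gateway machine
val sc = new SparkContext(conf)
{code}
The same setting can also be passed as {{--conf spark.local.dir=...}} on {{spark-submit}} or placed in {{spark-defaults.conf}}.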
[jira] [Updated] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe PRÉAUD updated SPARK-6469: - Issue Type: Documentation (was: Bug) Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6460) Implement OpensslAesCtrCryptoCodec to enable encrypted shuffle algorithms which openssl provides
liyunzhang_intel created SPARK-6460: --- Summary: Implement OpensslAesCtrCryptoCodec to enable encrypted shuffle algorithms which openssl provides Key: SPARK-6460 URL: https://issues.apache.org/jira/browse/SPARK-6460 Project: Spark Issue Type: Bug Components: Shuffle Reporter: liyunzhang_intel SPARK-5682 only implements the encrypted shuffle algorithm provided by JCE. OpensslAesCtrCryptoCodec needs to be implemented to support the algorithms provided by OpenSSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6460) Implement OpensslAesCtrCryptoCodec to enable encrypted shuffle algorithms which openssl provides
[ https://issues.apache.org/jira/browse/SPARK-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-6460: Issue Type: Sub-task (was: Bug) Parent: SPARK-5682 Implement OpensslAesCtrCryptoCodec to enable encrypted shuffle algorithms which openssl provides Key: SPARK-6460 URL: https://issues.apache.org/jira/browse/SPARK-6460 Project: Spark Issue Type: Sub-task Components: Shuffle Reporter: liyunzhang_intel SPARK-5682 only implements the encrypted shuffle algorithm provided by JCE. OpensslAesCtrCryptoCodec needs to be implemented to support the algorithms provided by OpenSSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6483) Spark SQL udf(ScalaUdf) is very slow
[ https://issues.apache.org/jira/browse/SPARK-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377338#comment-14377338 ] Apache Spark commented on SPARK-6483: - User 'zzcclp' has created a pull request for this issue: https://github.com/apache/spark/pull/5154 Spark SQL udf(ScalaUdf) is very slow Key: SPARK-6483 URL: https://issues.apache.org/jira/browse/SPARK-6483 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.4.0 Environment: 1. Spark version is 1.3.0 2. 3 node per 80G/20C 3. read 250G parquet files from hdfs Reporter: zzc Test case: 1. register floor func with command: sqlContext.udf.register(floor, (ts: Int) = ts - ts % 300), then run with sql select chan, floor(ts) as tt, sum(size) from qlogbase3 group by chan, floor(ts), *it takes 17 minutes.* {quote} == Physical Plan == Aggregate false, [chan#23015,PartialGroup#23500], [chan#23015,PartialGroup#23500 AS tt#23494,CombineSum(PartialSum#23499L) AS c2#23495L] Exchange (HashPartitioning [chan#23015,PartialGroup#23500], 54) Aggregate true, [chan#23015,scalaUDF(ts#23016)], [chan#23015,*scalaUDF*(ts#23016) AS PartialGroup#23500,SUM(size#23023L) AS PartialSum#23499L] PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[115] at map at newParquet.scala:562 {quote} 2. run with sql select chan, (ts - ts % 300) as tt, sum(size) from qlogbase3 group by chan, (ts - ts % 300), *it takes only 5 minutes.* {quote} == Physical Plan == Aggregate false, [chan#23015,PartialGroup#23349], [chan#23015,PartialGroup#23349 AS tt#23343,CombineSum(PartialSum#23348L) AS c2#23344L] Exchange (HashPartitioning [chan#23015,PartialGroup#23349], 54) Aggregate true, [chan#23015,(ts#23016 - (ts#23016 % 300))], [chan#23015,*(ts#23016 - (ts#23016 % 300))* AS PartialGroup#23349,SUM(size#23023L) AS PartialSum#23348L] PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[83] at map at newParquet.scala:562 {quote} 3. use *HiveContext* with sql select chan, floor((ts - ts % 300)) as tt, sum(size) from qlogbase3 group by chan, floor((ts - ts % 300)) *it takes only 5 minutes too. * {quote} == Physical Plan == Aggregate false, [chan#23015,PartialGroup#23108L], [chan#23015,PartialGroup#23108L AS tt#23102L,CombineSum(PartialSum#23107L) AS _c2#23103L] Exchange (HashPartitioning [chan#23015,PartialGroup#23108L], 54) Aggregate true, [chan#23015,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016 - (ts#23016 % 300)))], [chan#23015,*HiveGenericUdf*#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016 - (ts#23016 % 300))) AS PartialGroup#23108L,SUM(size#23023L) AS PartialSum#23107L] PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[28] at map at newParquet.scala:562 {quote} *Why? ScalaUdf is so slow?? How to improve it?* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377343#comment-14377343 ] Reynold Xin commented on SPARK-3306: Can you elaborate on why this needs to be in Spark and can't live outside? It seems to me this can be implemented entirely outside of Spark. In particular: 1. Use a global singleton object that manages resources. 2. The singleton can register a shutdown hook to clear resources upon JVM exit. It would probably take just a few lines of code to implement the above two. Addition of external resource dependency in executors - Key: SPARK-3306 URL: https://issues.apache.org/jira/browse/SPARK-3306 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yan Currently, Spark executors only support static and read-only external resources of side files and jar files. With emerging disparate data sources, there is a need to support more versatile external resources, such as connections to data sources, to facilitate efficient data access to those sources. For one, the JDBCRDD, with some modifications, could benefit from this feature by reusing JDBC connections previously established within the same Spark context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
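A minimal sketch of the approach suggested in the comment above — a global singleton that hands out resources and registers a JVM shutdown hook to release them — assuming a generic JDBC-style resource. All names here are illustrative application code, not a Spark API:
{code}
import java.sql.{Connection, DriverManager}
import scala.collection.mutable

// Hedged sketch: lives entirely in application code. Each executor JVM initializes
// this object once; connections are created lazily, cached, and closed by a
// shutdown hook when the JVM exits.
object ConnectionRegistry {
  private val connections = mutable.Map.empty[String, Connection]

  def get(url: String): Connection = synchronized {
    connections.getOrElseUpdate(url, DriverManager.getConnection(url))
  }

  // Registered when the singleton is first touched on the executor.
  sys.addShutdownHook {
    synchronized { connections.values.foreach(c => scala.util.Try(c.close())) }
  }
}

// Illustrative usage from a task, e.g. inside mapPartitions (lookup and the JDBC
// URL are hypothetical):
// rdd.mapPartitions { rows =>
//   val conn = ConnectionRegistry.get("jdbc:postgresql://host/db")
//   rows.map(row => lookup(conn, row))
// }
{code}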
[jira] [Commented] (SPARK-6477) Run MIMA tests before the Spark test suite
[ https://issues.apache.org/jira/browse/SPARK-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376740#comment-14376740 ] Apache Spark commented on SPARK-6477: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/5145 Run MIMA tests before the Spark test suite -- Key: SPARK-6477 URL: https://issues.apache.org/jira/browse/SPARK-6477 Project: Spark Issue Type: Improvement Components: Build Reporter: Brennon York Priority: Minor Right now the MIMA tests are the last thing to run, yet they run very quickly and, if they fail, there is no need for the entire Spark test suite to have run first. I propose we move the MIMA tests to run before the full Spark suite so that builds that fail the MIMA checks return much faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376147#comment-14376147 ] Marcelo Vanzin commented on SPARK-6229: --- [~adav] the problem of exposing the pipeline as an API is twofold: * Every application needs to understand the internals of the pipeline. For example, in your EncryptionHandler suggestion, SSL and SASL benefit from being placed in different locations inside the pipeline. How do you expose that in an external API? And why make client code even care about that? Also, I don't necessarily agree that SSL is an application concern. It's a transport-level protocol - which is one of the reasons it would be placed in a separate place in the stack from a SASL handler, for example. * Handling SASL and SSL inside the network library does not necessarily make it any less unit-testable, stable or fast. It just makes it easier for clients to use those things. Instead of writing a bunch of code that needs to be synchronized between client and server, all they need is a proper configuration object. Configuration (= data) is much easier to change and fix than code. The SecurityManager issue is already solved in the transport library. When I moved the SASL code to network/common I moved {{SecretKeyHolder}} with it. So there you have it: an application-agnostic interface for providing security secrets for the network library. So, again, what I'm suggesting here is not to hardcode SSL and SASL into the library. I'm suggesting an easier interface for people to configure SSL and SASL that doesn't require writing any extra code. If they don't want either of those, they still have that option, but instead of deleting / disabling / conditioning code, they'd change a couple of lines in a config file. They'd get the same stable, fast network library without SSL or SASL, without having to change a single line of code. Another problem with your example ({{transport.setEncryptionHandler}} vs. {{if (sasl) ...}}) is that the latter would be needed anyway if you want SASL. Why not then also have an AuthenticationHandler alongside the encryption handler? As for your factory comment, that's already there, in a way. The bootstrap functionality is basically a way to insert things into the channel being instantiated. What I'm proposing here is twofold: first, extend that interface so that the bootstrap implementation can modify the pipeline (and also allow server bootstraps, for reasons I explained in my first long comment), and second, control which bootstraps get activated via configuration, not via code. Note that, internally within the library, you'd have basically what you're saying: SSL and SASL would be plugins of a sort that you can insert into the transport layer only if you want. The difference that I'm trying to convey here is that the *external* interface of the library doesn't expose those. Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid Kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. 
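To illustrate the "configure, don't code" point being argued above, here is a purely illustrative sketch — the trait, classes, and configuration keys below are hypothetical and are not the actual network/common API. The idea is that which bootstraps get applied to a channel is decided from configuration, so client code never references SASL or SSL directly:
{code}
// Hypothetical sketch only -- not the real transport library interface.
trait ChannelBootstrap {
  def name: String
  def install(): Unit   // in a real library this would modify the Netty pipeline
}

class SaslBootstrap extends ChannelBootstrap {
  val name = "sasl"
  def install(): Unit = println("inserting SASL auth/encryption handler")
}

class SslBootstrap extends ChannelBootstrap {
  val name = "ssl"
  def install(): Unit = println("inserting SSL handler at the transport level")
}

object TransportSetup {
  // Configuration (= data) decides which bootstraps are active; an application that
  // wants neither leaves both flags off and gets the plain pipeline, with no code changes.
  def bootstrapsFor(conf: Map[String, String]): Seq[ChannelBootstrap] = {
    val ssl  = if (conf.getOrElse("ssl.enabled", "false").toBoolean)  Seq(new SslBootstrap)  else Nil
    val sasl = if (conf.getOrElse("auth.enabled", "false").toBoolean) Seq(new SaslBootstrap) else Nil
    ssl ++ sasl
  }
}
{code}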
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid closed SPARK-4746. --- Resolution: Won't Fix looks like there isn't interest in this, closing to clean up jira integration tests should be separated from faster unit tests Key: SPARK-4746 URL: https://issues.apache.org/jira/browse/SPARK-4746 Project: Spark Issue Type: Bug Components: Tests Reporter: Imran Rashid Assignee: Imran Rashid Priority: Trivial Currently there isn't a good way for a developer to skip the longer integration tests. This can slow down local development. See http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html One option is to use scalatest's notion of test tags to tag all integration tests, so they could easily be skipped -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
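For context on the tagging idea mentioned in the issue above, a minimal sketch of how ScalaTest tags can mark long-running integration tests so a fast local run can exclude them (the tag and suite names are illustrative):
{code}
import org.scalatest.{FunSuite, Tag}

// Hedged sketch: a custom tag for expensive end-to-end tests.
object IntegrationTest extends Tag("org.apache.spark.tags.IntegrationTest")

class ExampleSuite extends FunSuite {
  test("fast unit test") {
    assert(1 + 1 === 2)
  }

  test("slow end-to-end test", IntegrationTest) {
    // expensive cluster setup and assertions would go here
  }
}
{code}
Tagged tests can then be excluded with ScalaTest's tag-exclusion mechanism, e.g. passing {{-l org.apache.spark.tags.IntegrationTest}} to the runner.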
[jira] [Updated] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe PRÉAUD updated SPARK-6469: - Description: According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. It should be noted though that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} was: According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not spark.local.dir which should be ignored. If this works correctly in yarn-cluster mode, I've found out that it is not the case in yarn-client mode. The problem seems to originate in the method [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686]. Indeed, I've checked with a simple application that the {{CONTAINER_ID}} environment variable is correctly set in yarn-cluster mode (to something like {{container_142761810_0151_01_01}}, but not in yarn-client mode. Local directories configured for YARN are not used in yarn-client mode -- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. It should be noted though that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe PRÉAUD updated SPARK-6469: - Summary: The YARN driver in yarn-client mode will not use the local directories configured for YARN (was: Local directories configured for YARN are not used in yarn-client mode) The YARN driver in yarn-client mode will not use the local directories configured for YARN --- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. It should be noted though that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe PRÉAUD updated SPARK-6469: - Description: According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. However it should be noted that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} Can this please be clarified in the Spark YARN documentation above? was: According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. It should be noted though that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} The YARN driver in yarn-client mode will not use the local directories configured for YARN --- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. However it should be noted that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} Can this please be clarified in the Spark YARN documentation above? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376111#comment-14376111 ] Christophe PRÉAUD commented on SPARK-6469: -- I've changed the JIRA title, type and description, is this ok? Thanks to all of you for your help! The YARN driver in yarn-client mode will not use the local directories configured for YARN --- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. However it should be noted that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} Can this please be clarified in the Spark YARN documentation above? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6469: - Component/s: (was: Spark Core) YARN The YARN driver in yarn-client mode will not use the local directories configured for YARN --- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Documentation Components: YARN Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. However it should be noted that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} Can this please be clarified in the Spark YARN documentation above? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376119#comment-14376119 ] Thomas Graves commented on SPARK-6469: -- looks good, thanks. The YARN driver in yarn-client mode will not use the local directories configured for YARN --- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Documentation Components: YARN Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. However it should be noted that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} Can this please be clarified in the Spark YARN documentation above? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6308) VectorUDT is displayed as `vecto` in dtypes
[ https://issues.apache.org/jira/browse/SPARK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6308: - Assignee: Manoj Kumar VectorUDT is displayed as `vecto` in dtypes --- Key: SPARK-6308 URL: https://issues.apache.org/jira/browse/SPARK-6308 Project: Spark Issue Type: Bug Components: MLlib, SQL Reporter: Xiangrui Meng Assignee: Manoj Kumar VectorUDT should override simpleString instead of relying on the default implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6435: - Component/s: Windows Great debugging! [~tsudukim] do you have thoughts on this? I think this bit was part of your change in https://github.com/apache/spark/commit/8d932475e6759e869c16ce6cac203a2e56558716#diff-7ac5881d6bad553b23f5225775c8fde3 So, it sounds like you do need to quote the comma-separated arg? but then quoting doesn't work as expected? The {{x%2==x}} idiom is used several places in the Windows scripts. Is the square bracket syntax definitely preferred? spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala import com.google.common.base.Strings console:19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6320) Adding new query plan strategy to SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376434#comment-14376434 ] Michael Armbrust commented on SPARK-6320: - If that can be done in a minimally invasive way that sounds reasonable to me. Adding new query plan strategy to SQLContext Key: SPARK-6320 URL: https://issues.apache.org/jira/browse/SPARK-6320 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Youssef Hatem Priority: Minor Hi, I would like to add a new strategy to {{SQLContext}}. To do this I created a new class which extends {{Strategy}}. In my new class I need to call {{planLater}} function. However this method is defined in {{SparkPlanner}} (which itself inherits the method from {{QueryPlanner}}). To my knowledge the only way to make {{planLater}} function visible to my new strategy is to define my strategy inside another class that extends {{SparkPlanner}} and inherits {{planLater}} as a result, by doing so I will have to extend the {{SQLContext}} such that I can override the {{planner}} field with the new {{Planner}} class I created. It seems that this is a design problem because adding a new strategy seems to require extending {{SQLContext}} (unless I am doing it wrong and there is a better way to do it). Thanks a lot, Youssef -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6200) Support dialect in SQL
[ https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376450#comment-14376450 ] Michael Armbrust commented on SPARK-6200: - Making comments there would be a good idea. Generally, I don't think extra complexity is worth it just to have short names. Likely most users will either use the default value or will just be cutting and copying from some documentation into a config file once. If this was an interface we expected them to toggle a lot it would probably be different. Support dialect in SQL -- Key: SPARK-6200 URL: https://issues.apache.org/jira/browse/SPARK-6200 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Created a new dialect manager,support dialect command and add new dialect use sql statement etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6200) Support dialect in SQL
[ https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6200. - Resolution: Duplicate Support dialect in SQL -- Key: SPARK-6200 URL: https://issues.apache.org/jira/browse/SPARK-6200 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Created a new dialect manager,support dialect command and add new dialect use sql statement etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6255) Python MLlib API missing items: Classification
[ https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6255: - Assignee: Yanbo Liang Python MLlib API missing items: Classification -- Key: SPARK-6255 URL: https://issues.apache.org/jira/browse/SPARK-6255 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. LogisticRegressionWithLBFGS * setNumClasses * setValidateData LogisticRegressionModel * getThreshold * numClasses * numFeatures SVMWithSGD * setValidateData SVMModel * getThreshold -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376430#comment-14376430 ] Imran Rashid commented on SPARK-5928: - sorry to hear that [~douglaz]. To help understand / prioritize this, can you share a bit more info? a) how much data were you shuffling? b) were you able to fix this by increasing the number of partitions? how many partitions did you need to use in the end? c) did you get a mix of snappy errors as well? d) did you also run into SPARK-5945 as a result of your failures ? thanks Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches. I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376495#comment-14376495 ] Patrick Wendell commented on SPARK-2331: By the way - [~rxin] recently pointed out to me that EmptyRDD is private[spark]. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/EmptyRDD.scala#L27 Given that I'm sort of confused how people were using it before. I'm not totally sure how making a class private[spark] affects its use in a return type. SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] -- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Ian Hummel Priority: Minor The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) {code} val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
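To make the typing issue concrete: because {{sc.emptyRDD[T]}} is declared to return {{EmptyRDD[T]}}, a fold that unions it with ordinary RDDs needs an explicit type annotation on the zero value. A small sketch of the annotation the reporter refers to, plus the proposed declaration change (hedged; the widened signature is the proposal, not current behaviour):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Compiles only with the explicit [RDD[String]] annotation on foldLeft, because the
// zero value sc.emptyRDD[String] is statically an EmptyRDD[String]:
def unionOfPaths(sc: SparkContext, paths: Seq[String]): RDD[String] =
  paths.foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) =>
    rdd.union(sc.textFile(path))
  }

// The proposal is simply to widen the declared return type, e.g.
//   def emptyRDD[T: ClassTag]: RDD[T] = new EmptyRDD[T](this)
// so the annotation above becomes unnecessary and EmptyRDD can stay private[spark].
{code}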
[jira] [Created] (SPARK-6472) Elements of an array of structs cannot be accessed.
Yin Huai created SPARK-6472: --- Summary: Elements of an array of structs cannot be accessed. Key: SPARK-6472 URL: https://issues.apache.org/jira/browse/SPARK-6472 Project: Spark Issue Type: Bug Reporter: Yin Huai Priority: Blocker I tried the following snippet with HiveContext. {code} import sqlContext._ val rdd = sc.parallelize({a:[{b:1}, {b:2}]} :: Nil) val df = jsonRDD(rdd) df.registerTempTable(jt) // This one does not work. df.select(a[0]).collect // This one is fine. sql(select a[0] from jt).collect {code} The exception is {code} df.select(a[0]).collect org.apache.spark.sql.AnalysisException: cannot resolve 'a[0]' given input columns a; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:118) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:117) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:122) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1108) at 
org.apache.spark.sql.DataFrame.init(DataFrame.scala:133) at org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:465) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:480) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6472) Elements of an array of structs cannot be accessed.
[ https://issues.apache.org/jira/browse/SPARK-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6472: Component/s: SQL Elements of an array of structs cannot be accessed. --- Key: SPARK-6472 URL: https://issues.apache.org/jira/browse/SPARK-6472 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker I tried the following snippet with HiveContext. {code} import sqlContext._ val rdd = sc.parallelize({a:[{b:1}, {b:2}]} :: Nil) val df = jsonRDD(rdd) df.registerTempTable(jt) // This one does not work. df.select(a[0]).collect // This one is fine. sql(select a[0] from jt).collect {code} The exception is {code} df.select(a[0]).collect org.apache.spark.sql.AnalysisException: cannot resolve 'a[0]' given input columns a; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:118) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:117) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:122) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1108) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133) at org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:465) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:480) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6472) Elements of an array of structs cannot be accessed.
[ https://issues.apache.org/jira/browse/SPARK-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6472. - Resolution: Not a Problem It is not a problem. For select, we support column name string. I need to use selectExpr to access an array element. Elements of an array of structs cannot be accessed. --- Key: SPARK-6472 URL: https://issues.apache.org/jira/browse/SPARK-6472 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker I tried the following snippet with HiveContext. {code} import sqlContext._ val rdd = sc.parallelize({a:[{b:1}, {b:2}]} :: Nil) val df = jsonRDD(rdd) df.registerTempTable(jt) // This one does not work. df.select(a[0]).collect // This one is fine. sql(select a[0] from jt).collect {code} The exception is {code} df.select(a[0]).collect org.apache.spark.sql.AnalysisException: cannot resolve 'a[0]' given input columns a; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:118) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:117) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:122) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:45) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1108) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133) at org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:465) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:480) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
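For anyone hitting the same error, the resolution above amounts to using {{selectExpr}} (which parses its argument as an expression) rather than {{select}} with a plain column-name string. A minimal sketch reusing the example from the report, assuming a spark-shell style {{sc}} and {{sqlContext}}:
{code}
// select("a[0]") treats the string as a column name, so analysis fails to resolve it;
// selectExpr parses it as an expression and works as expected.
val df = sqlContext.jsonRDD(sc.parallelize("""{"a":[{"b":1}, {"b":2}]}""" :: Nil))
df.selectExpr("a[0]").collect()

// Equivalent SQL route, as noted in the report:
df.registerTempTable("jt")
sqlContext.sql("select a[0] from jt").collect()
{code}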
[jira] [Updated] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2331: --- Description: The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) {code} val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } {code} was: The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] -- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Ian Hummel Priority: Minor The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) {code} val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376454#comment-14376454 ] Imran Rashid commented on SPARK-5945: - Hi [~ilganeli], sorry for taking a while to respond. I think the main issue here is not so much just implementing the code (as [~SuYan] already has shown the small required patch). The big issue is figuring out what the desired semantics are (see the questions I listed above), which means just getting feedback from all the required people on this one. But if you want to drive that process, that sounds great, it would really be appreciated! Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. 
Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n) // need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x) }.groupByKey().count()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
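On question 2 above, a cap on stage retries boils down to a per-stage failure counter checked before resubmission. A standalone sketch of that bookkeeping, with made-up names and threshold, not Spark's actual scheduler code:
{code}
import scala.collection.mutable

// Hypothetical illustration: count fetch-failure resubmissions per stage and
// abort the job once a (made-up) limit is exceeded.
val maxStageRetries = 4
val stageRetryCounts = mutable.Map.empty[Int, Int].withDefaultValue(0)

def handleFetchFailure(stageId: Int)(resubmitStage: Int => Unit, abortJob: String => Unit): Unit = {
  stageRetryCounts(stageId) += 1
  if (stageRetryCounts(stageId) > maxStageRetries) {
    abortJob(s"Stage $stageId has been resubmitted ${stageRetryCounts(stageId)} times; giving up")
  } else {
    resubmitStage(stageId)
  }
}
{code}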
[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-5945: Assignee: Ilya Ganelin Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Assignee: Ilya Ganelin While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376501#comment-14376501 ] vijay commented on SPARK-6435: -- I came up with square brackets after 2 minutes of googling/stackoverflowing; a more thorough search/understanding of bat scripts might result in a better/different solution (I can rule myself out of the more thorough bat script understanding). That being said, this test is used to check for an empty string. Square brackets is the most upvoted solution: http://stackoverflow.com/questions/2541767/what-is-the-proper-way-to-test-if-variable-is-empty-in-a-batch-file-if-not-1 spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala import com.google.common.base.Strings console:19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6443) Could not submit app in standalone cluster mode when HA is enabled
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-6443: Description: After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: *./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar * and it failed with error message: ??Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 9 more?? So my guess is right. I will fix it in related PR. was: After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. Could not submit app in standalone cluster mode when HA is enabled -- Key: SPARK-6443 URL: https://issues.apache.org/jira/browse/SPARK-6443 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Tao Wang After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 
3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: *./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar * and it failed with error message: ??Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at
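For context, the stack trace shows the whole comma-separated string being handed to Master.toAkkaUrl as if it were a single master. A fix has to split the HA list into one URL per master first; an illustrative sketch, not the actual patch:
{code}
// Split a HA master string like "spark://host1:7077,host2:7077" into one
// spark:// URL per master so each can be resolved to its own actor URL.
def parseStandaloneMasterUrls(masterUrl: String): Seq[String] = {
  require(masterUrl.startsWith("spark://"), s"Invalid master URL: $masterUrl")
  masterUrl.stripPrefix("spark://").split(",").map("spark://" + _).toSeq
}

// parseStandaloneMasterUrls("spark://doggie153:7077,doggie159:7077")
//   returns Seq("spark://doggie153:7077", "spark://doggie159:7077")
{code}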
[jira] [Created] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):
Earthson Lu created SPARK-6465: -- Summary: GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor): Key: SPARK-6465 URL: https://issues.apache.org/jira/browse/SPARK-6465 Project: Spark Issue Type: Bug Components: DataFrame Affects Versions: 1.3.0 Environment: Spark 1.3, YARN 2.6.0, CentOS Reporter: Earthson Lu I can not find a issue for this. register for GenericRowWithSchema is lost in org.apache.spark.sql.execution.SparkSqlSerializer. Is this the only thing we need to do? Here is the log ``` 15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID 31978, datanode06.site): com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050) at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062) at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
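Registering extra classes with Spark's Kryo serializer is normally done through a custom registrator; a sketch of that follows, with the class name taken from the stack trace above. Whether registration alone gets past the missing no-arg constructor is exactly the open question in this ticket, so treat this as a starting point rather than a fix:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Register the Row implementation that Kryo fails to instantiate by default.
class RowKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(Class.forName(
      "org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema"))
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[RowKryoRegistrator].getName)
{code}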
[jira] [Commented] (SPARK-6466) Remove unnecessary attributes when resolving GroupingSets
[ https://issues.apache.org/jira/browse/SPARK-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375657#comment-14375657 ] Apache Spark commented on SPARK-6466: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5134 Remove unnecessary attributes when resolving GroupingSets - Key: SPARK-6466 URL: https://issues.apache.org/jira/browse/SPARK-6466 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor When resolving GroupingSets, we currently list all outputs of GroupingSets's child plan. However, the columns that are not in groupBy expressions and not used by aggregation expressions are unnecessary and can be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6467) Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis
Cheng Lian created SPARK-6467: - Summary: Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis Key: SPARK-6467 URL: https://issues.apache.org/jira/browse/SPARK-6467 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6397: -- Description: Virtual columns like GROUPING__ID should never be considered as missing input, and thus should be excluded from {{QueryPlan.missingInput}}. (was: Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis.) Exclude virtual columns from QueryPlan.missingInput --- Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Virtual columns like GROUPING__ID should never be considered as missing input, and thus should be excluded from {{QueryPlan.missingInput}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1480) Choose classloader consistently inside of Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375572#comment-14375572 ] Littlestar commented on SPARK-1480: --- I meet this bug on spark 1.3.0 + mesos 0.21.1 100%.. I0323 16:32:18.933440 14504 fetcher.cpp:64] Extracted resource '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S4/frameworks/20150323-152848-1214949568-5050-21134-0009/executors/20150323-100710-1214949568-5050-3453-S4/runs/3d8f22f5-7fed-44ed-b5f9-98a219133911/spark-1.3.0-bin-2.4.0.tar.gz' into '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S4/frameworks/20150323-152848-1214949568-5050-21134-0009/executors/20150323-100710-1214949568-5050-3453-S4/runs/3d8f22f5-7fed-44ed-b5f9-98a219133911' Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:321) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:266) Could not find the main class: org.apache.spark.executor.MesosExecutorBackend Choose classloader consistently inside of Spark codebase Key: SPARK-1480 URL: https://issues.apache.org/jira/browse/SPARK-1480 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.0.0 The Spark codebase is not always consistent on which class loader it uses when classlaoders are explicitly passed to things like serializers. This caused SPARK-1403 and also causes a bug where when the driver has a modified context class loader it is not translated correctly in local mode to the (local) executor. In most cases what we want is the following behavior: 1. If there is a context classloader on the thread, use that. 2. Otherwise use the classloader that loaded Spark. We should just have a utility function for this and call that function whenever we need to get a classloader. Note that SPARK-1403 is a workaround for this exact problem (it sets the context class loader because downstream code assumes it is set). Once this gets fixed in a more general way SPARK-1403 can be reverted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
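The two-step rule spelled out in the issue description is small enough to sketch; this is an illustration of the intended behaviour, not necessarily the exact helper that ended up in Spark:
{code}
object ClassLoaderUtil {
  // Prefer the calling thread's context classloader when one is set; otherwise
  // fall back to the classloader that loaded this utility (i.e. Spark) itself.
  def getContextOrDefaultClassLoader: ClassLoader =
    Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
}

// Call sites would then resolve classes consistently, e.g.
// Class.forName(className, true, ClassLoaderUtil.getContextOrDefaultClassLoader)
{code}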
[jira] [Comment Edited] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375590#comment-14375590 ] iward edited comment on SPARK-3720 at 3/23/15 9:10 AM: --- hi,[~zhzhan] , I have the same problem with your issues of spark-2883.And I just contact orcFile on spark,I can not quite understand your patch ,I would like to ask you a few questions: #1,why spark would read the whole files,what's the detail of problem on spark? #2,could you tell me what should we do to solve the problem? thanks was (Author: iward): hi,[~zhzhan] , I have the same problem.And I just contact orcFile on spark,I can not quite understand your patch ,I would like to ask you a few questions: #1,why spark would read the whole files,what's the detail of problem on spark? #2,could you tell me what should we do to solve the problem? thanks support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: Fei Wang Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on hdfs.ORC file format has many advantages such as: 1 a single file as the output of each task, which reduces the NameNode's load 2 Hive type support including datetime, decimal, and the complex types (struct, list, map, and union) 3 light-weight indexes stored within the file skip row groups that don't pass predicate filtering seek to a given row 4 block-mode compression based on data type run-length encoding for integer columns dictionary encoding for string columns 5 concurrent reads of the same file using separate RecordReaders 6 ability to split files without scanning for markers 7 bound the amount of memory needed for reading or writing 8 metadata stored using Protocol Buffers, which allows addition and removal of fields Now spark sql support Parquet, support ORC provide people more opts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
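To make the feature request concrete, here is a rough sketch of how ORC could be exposed if it mirrored the existing Parquet support; the "orc" source name is an assumption for illustration, not an API that exists in the version this ticket targets:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)     // assumes sc from spark-shell
val df = hiveContext.table("web_logs")    // placeholder source table

// Hypothetical: write the table out as ORC and read it back through the
// generic data source API, the way Parquet is handled today.
df.save("/tmp/web_logs_orc", "orc")
val orcDf = hiveContext.load("/tmp/web_logs_orc", "orc")
orcDf.registerTempTable("web_logs_orc")
{code}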
[jira] [Commented] (SPARK-6430) Cannot resolve column correctlly when using left semi join
[ https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375453#comment-14375453 ] Michael Armbrust commented on SPARK-6430: - Actually, I might be wrong. Let me investigate. Cannot resolve column correctlly when using left semi join -- Key: SPARK-6430 URL: https://issues.apache.org/jira/browse/SPARK-6430 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Spark 1.3.0 on yarn mode Reporter: zzc My code: {quote} case class TestData(key: Int, value: String) case class TestData2(a: Int, b: Int) import org.apache.spark.sql.execution.joins._ import sqlContext.implicits._ val testData = sc.parallelize( (1 to 100).map(i = TestData(i, i.toString))).toDF() testData.registerTempTable(testData) val testData2 = sc.parallelize( TestData2(1, 1) :: TestData2(1, 2) :: TestData2(2, 1) :: TestData2(2, 2) :: TestData2(3, 1) :: TestData2(3, 2) :: Nil, 2).toDF() testData2.registerTempTable(testData2) //val tmp = sqlContext.sql(SELECT * FROM testData *LEFT SEMI JOIN* testData2 ON key = a ) val tmp = sqlContext.sql(SELECT testData2.b, count(testData2.b) FROM testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by testData2.b) tmp.explain() {quote} Error log: {quote} org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given input columns key, value; line 1 pos 108 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) {quote} {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is correct, {quote} SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a group by testData2.b SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a group by testData2.b {quote} are incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
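For reference, a directly pasteable form of the reproducer, assuming a spark-shell session where {{sc}} and {{sqlContext}} are in scope:
{code}
case class TestData(key: Int, value: String)
case class TestData2(a: Int, b: Int)
import sqlContext.implicits._

val testData = sc.parallelize((1 to 100).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

val testData2 = sc.parallelize(
  TestData2(1, 1) :: TestData2(1, 2) ::
  TestData2(2, 1) :: TestData2(2, 2) ::
  TestData2(3, 1) :: TestData2(3, 2) :: Nil, 2).toDF()
testData2.registerTempTable("testData2")

// Resolves fine:
sqlContext.sql("SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a").explain()

// Fails with "cannot resolve 'testData2.b' given input columns key, value":
sqlContext.sql(
  "SELECT testData2.b, count(testData2.b) FROM testData " +
  "LEFT SEMI JOIN testData2 ON key = testData2.a GROUP BY testData2.b").explain()
{code}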
[jira] [Reopened] (SPARK-6430) Cannot resolve column correctlly when using left semi join
[ https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-6430: - Cannot resolve column correctlly when using left semi join -- Key: SPARK-6430 URL: https://issues.apache.org/jira/browse/SPARK-6430 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Spark 1.3.0 on yarn mode Reporter: zzc My code: {quote} case class TestData(key: Int, value: String) case class TestData2(a: Int, b: Int) import org.apache.spark.sql.execution.joins._ import sqlContext.implicits._ val testData = sc.parallelize( (1 to 100).map(i = TestData(i, i.toString))).toDF() testData.registerTempTable(testData) val testData2 = sc.parallelize( TestData2(1, 1) :: TestData2(1, 2) :: TestData2(2, 1) :: TestData2(2, 2) :: TestData2(3, 1) :: TestData2(3, 2) :: Nil, 2).toDF() testData2.registerTempTable(testData2) //val tmp = sqlContext.sql(SELECT * FROM testData *LEFT SEMI JOIN* testData2 ON key = a ) val tmp = sqlContext.sql(SELECT testData2.b, count(testData2.b) FROM testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by testData2.b) tmp.explain() {quote} Error log: {quote} org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given input columns key, value; line 1 pos 108 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) {quote} {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is correct, {quote} SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a group by testData2.b SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a group by testData2.b {quote} are incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
Littlestar created SPARK-6461: - Summary: spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use Mesos to run Spark 1.3.0 ./run-example SparkPi, but it failed. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos spark.executorEnv.PATH spark.executorEnv.HADOOP_HOME spark.executorEnv.JAVA_HOME E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz' sh: hadoop: command not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
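For completeness, the same executor environment can also be set programmatically on the SparkConf; a minimal sketch with placeholder paths. Whether those values actually reach the Mesos fetcher is the open question in this ticket:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to the spark.executorEnv.* lines in spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("SparkPi")
  .setExecutorEnv("JAVA_HOME", "/home/test/jdk")
  .setExecutorEnv("HADOOP_HOME", "/home/test/hadoop-2.4.0")
  .setExecutorEnv("PATH", "/home/test/jdk/bin:/home/test/hadoop-2.4.0/bin:/usr/bin:/bin")

val sc = new SparkContext(conf)
{code}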
[jira] [Updated] (SPARK-6430) Cannot resolve column correctlly when using left semi join
[ https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6430: Target Version/s: 1.3.1 Cannot resolve column correctlly when using left semi join -- Key: SPARK-6430 URL: https://issues.apache.org/jira/browse/SPARK-6430 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Spark 1.3.0 on yarn mode Reporter: zzc My code: {quote} case class TestData(key: Int, value: String) case class TestData2(a: Int, b: Int) import org.apache.spark.sql.execution.joins._ import sqlContext.implicits._ val testData = sc.parallelize( (1 to 100).map(i = TestData(i, i.toString))).toDF() testData.registerTempTable(testData) val testData2 = sc.parallelize( TestData2(1, 1) :: TestData2(1, 2) :: TestData2(2, 1) :: TestData2(2, 2) :: TestData2(3, 1) :: TestData2(3, 2) :: Nil, 2).toDF() testData2.registerTempTable(testData2) //val tmp = sqlContext.sql(SELECT * FROM testData *LEFT SEMI JOIN* testData2 ON key = a ) val tmp = sqlContext.sql(SELECT testData2.b, count(testData2.b) FROM testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by testData2.b) tmp.explain() {quote} Error log: {quote} org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given input columns key, value; line 1 pos 108 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) {quote} {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is correct, {quote} SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a group by testData2.b SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a group by testData2.b {quote} are incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6456) Spark Sql throwing exception on large partitioned data
[ https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375481#comment-14375481 ] Cheng Lian edited comment on SPARK-6456 at 3/23/15 7:21 AM: How many partitions are there? Also, what's the version of the Hive metastore? For now, Spark SQL only support Hive 0.12.0 and 0.13.1. Spark 1.1 and prior versions only support Hive 0.12.0. was (Author: lian cheng): How many partitions are there? Spark Sql throwing exception on large partitioned data -- Key: SPARK-6456 URL: https://issues.apache.org/jira/browse/SPARK-6456 Project: Spark Issue Type: Bug Components: SQL Reporter: pankaj Fix For: 1.2.1 Spark connects with Hive Metastore. I am able to run simple queries like show table and select. but throws below exception while running query on the hive Table having large number of partitions. {noformat} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at`enter code here` org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6463) [SPARK][SQL] AttributeSet.equal should compare size
June created SPARK-6463: --- Summary: [SPARK][SQL] AttributeSet.equal should compare size Key: SPARK-6463 URL: https://issues.apache.org/jira/browse/SPARK-6463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: June Priority: Minor AttributeSet.equal should compare both members and size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
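The failure mode being described is the classic one for set equality implemented as a one-way containment check; a standalone illustration in plain Scala, not the actual Catalyst code:
{code}
// A one-way containment check wrongly reports a strict subset as equal to its superset.
case class NaiveAttributeSet(names: Set[String]) {
  def equalByMembershipOnly(other: NaiveAttributeSet): Boolean =
    names.forall(other.names.contains)            // missing the size comparison

  def equalByMembershipAndSize(other: NaiveAttributeSet): Boolean =
    names.size == other.names.size && names.forall(other.names.contains)
}

val a = NaiveAttributeSet(Set("id"))
val b = NaiveAttributeSet(Set("id", "name"))
// a.equalByMembershipOnly(b)    => true  (wrong: b has an extra member)
// a.equalByMembershipAndSize(b) => false (correct)
{code}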
[jira] [Comment Edited] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375468#comment-14375468 ] Littlestar edited comment on SPARK-6461 at 3/23/15 8:39 AM: each mesos slave node has JAVA and HADOOP DataNode. I also add the following setting to mesos-master-env.sh and mesos-slave-env.sh. export MESOS_JAVA_HOME=/home/test/jdk export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0 export MESOS_HADOOP_CONF_DIR=/home/test/hadoop-2.4.0/etc/hadoop export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin /usr/bin/env: bash: No such file or directory thanks. was (Author: cnstar9988): each mesos slave node has JAVA and HADOOP DataNode. I also add the following setting to mesos-master-env.sh and mesos-slave-env.sh. export MESOS_JAVA_HOME=/home/test/jdk export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0 export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin /usr/bin/env: bash: No such file or directory thanks. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use mesos run spak 1.3.0 ./run-example SparkPi but failed. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos spark.executorEnv.PATH spark.executorEnv.HADOOP_HOME spark.executorEnv.JAVA_HOME E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz' sh: hadoop: command not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6463) AttributeSet.equal should compare size
[ https://issues.apache.org/jira/browse/SPARK-6463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375581#comment-14375581 ] Apache Spark commented on SPARK-6463: - User 'sisihj' has created a pull request for this issue: https://github.com/apache/spark/pull/5133 AttributeSet.equal should compare size -- Key: SPARK-6463 URL: https://issues.apache.org/jira/browse/SPARK-6463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: June Priority: Minor AttributeSet.equal should compare both members and size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375580#comment-14375580 ] Littlestar edited comment on SPARK-6461 at 3/23/15 8:49 AM: when I add MESOS_HADOOP_CONF_DIR at all mesos-master-env.sh and mesos-slave-env.sh , It throws the following error. Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend similar to https://github.com/apache/spark/pull/620 was (Author: cnstar9988): when I add MESOS_HADOOP_CONF_DIR at all mesos-master-env.sh and mesos-slave-env.sh , It throws the following error. Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use mesos run spak 1.3.0 ./run-example SparkPi but failed. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos spark.executorEnv.PATH spark.executorEnv.HADOOP_HOME spark.executorEnv.JAVA_HOME E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz' sh: hadoop: command not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375590#comment-14375590 ] iward edited comment on SPARK-3720 at 3/23/15 9:05 AM: --- hi,[~zhzhan] , I have the same problem.And I just contact orcFile on spark,I can not quite understand your patch ,I would like to ask you a few questions: #1,why spark would read the whole files,what's the detail of problem on spark? #2,could you tell me what should we do to solve the problem? thanks was (Author: iward): hi,Zhan Zhang , I have the same problem.And I just contact orcFile on spark,I can not quite understand your patch ,I would like to ask you a few questions: #1,why spark would read the whole files,what's the detail of problem on spark? #2,could you tell me what should we do to solve the problem? thanks support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: Fei Wang Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on hdfs.ORC file format has many advantages such as: 1 a single file as the output of each task, which reduces the NameNode's load 2 Hive type support including datetime, decimal, and the complex types (struct, list, map, and union) 3 light-weight indexes stored within the file skip row groups that don't pass predicate filtering seek to a given row 4 block-mode compression based on data type run-length encoding for integer columns dictionary encoding for string columns 5 concurrent reads of the same file using separate RecordReaders 6 ability to split files without scanning for markers 7 bound the amount of memory needed for reading or writing 8 metadata stored using Protocol Buffers, which allows addition and removal of fields Now spark sql support Parquet, support ORC provide people more opts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1702) Mesos executor won't start because of a ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375599#comment-14375599 ] Littlestar commented on SPARK-1702: --- I met this on spak 1.3.0 + mesos 0.21.1 Mesos executor won't start because of a ClassNotFoundException -- Key: SPARK-1702 URL: https://issues.apache.org/jira/browse/SPARK-1702 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Labels: executors, mesos, spark Some discussion here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html Fix here (which is probably not the right fix): https://github.com/apache/spark/pull/620 This was broken in v0.9.0, was fixed in v0.9.1 and is now broken again. Error in Mesos executor stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0502 17:31:42.672224 14688 exec.cpp:131] Version: 0.18.0 I0502 17:31:42.674959 14707 exec.cpp:205] Executor registered on slave 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:42 INFO MesosExecutorBackend: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/05/02 17:31:42 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:43 INFO SecurityManager: Changing view acls to: vagrant 14/05/02 17:31:43 INFO SecurityManager: SecurityManager, is authentication enabled: false are ui acls enabled: false users with view permissions: Set(vagrant) 14/05/02 17:31:43 INFO Slf4jLogger: Slf4jLogger started 14/05/02 17:31:43 INFO Remoting: Starting remoting 14/05/02 17:31:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@localhost:50843] 14/05/02 17:31:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@localhost:50843] java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:165) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:176) at org.apache.spark.executor.Executor.init(Executor.scala:106) at org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:56) Exception in thread Thread-0 I0502 17:31:43.710039 14707 exec.cpp:412] Deactivating the executor libprocess The problem is that it can't find the class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1702) Mesos executor won't start because of a ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375599#comment-14375599 ] Littlestar edited comment on SPARK-1702 at 3/23/15 9:05 AM: I met this on spak 1.3.0 + mesos 0.21.1 with run-example SparkPi was (Author: cnstar9988): I met this on spak 1.3.0 + mesos 0.21.1 Mesos executor won't start because of a ClassNotFoundException -- Key: SPARK-1702 URL: https://issues.apache.org/jira/browse/SPARK-1702 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Labels: executors, mesos, spark Some discussion here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html Fix here (which is probably not the right fix): https://github.com/apache/spark/pull/620 This was broken in v0.9.0, was fixed in v0.9.1 and is now broken again. Error in Mesos executor stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR I0502 17:31:42.672224 14688 exec.cpp:131] Version: 0.18.0 I0502 17:31:42.674959 14707 exec.cpp:205] Executor registered on slave 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:42 INFO MesosExecutorBackend: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/05/02 17:31:42 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140501-182306-16842879-5050-10155-0 14/05/02 17:31:43 INFO SecurityManager: Changing view acls to: vagrant 14/05/02 17:31:43 INFO SecurityManager: SecurityManager, is authentication enabled: false are ui acls enabled: false users with view permissions: Set(vagrant) 14/05/02 17:31:43 INFO Slf4jLogger: Slf4jLogger started 14/05/02 17:31:43 INFO Remoting: Starting remoting 14/05/02 17:31:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@localhost:50843] 14/05/02 17:31:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@localhost:50843] java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:165) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:176) at org.apache.spark.executor.Executor.init(Executor.scala:106) at org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:56) Exception in thread Thread-0 I0502 17:31:43.710039 14707 exec.cpp:412] Deactivating the executor libprocess The problem is that it can't find the class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd
[ https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-6464: Description: Nowadays, the transformation *coalesce* is often used to increase or reduce the number of partitions in order to get good performance. But *coalesce* can't make sure that a child partition will be executed in the same executor as its parent partitions, and this can lead to a large amount of network transfer. In some scenarios, such as the +small and cached rdd+ mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure the child partition is executed on that executor. This avoids network transfer, reduces the number of tasks to schedule, and frees CPU cores for other jobs. In this scenario, our performance improved by 20% compared to before. was: Nowadays, the transformation *coalesce* was always used to expand or reduce the number of the partition in order to gain a good performance. But *coalesce* can't make sure that the child partition will be executed in the same executor as the parent partition. And this will lead to have a large network transfer. In some scenario such as I metioned in the title Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd --- Key: SPARK-6464 URL: https://issues.apache.org/jira/browse/SPARK-6464 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: SaintBacchus Nowadays, the transformation *coalesce* is often used to increase or reduce the number of partitions in order to get good performance. But *coalesce* can't make sure that a child partition will be executed in the same executor as its parent partitions, and this can lead to a large amount of network transfer. In some scenarios, such as the +small and cached rdd+ mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure the child partition is executed on that executor. This avoids network transfer, reduces the number of tasks to schedule, and frees CPU cores for other jobs. In this scenario, our performance improved by 20% compared to before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
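A usage sketch to make the proposal concrete; {{processCoalesce}} below is the hypothetical transformation this ticket proposes and does not exist yet, shown next to the existing {{coalesce}} for contrast:
{code}
// Existing behaviour: coalesce(n) reduces the partition count but does not
// guarantee that a child partition runs on the executor that cached its parents.
val cached = sc.textFile("/data/small-input").cache()   // placeholder path
val viaCoalesce = cached.coalesce(8)
viaCoalesce.map(line => line.length).count()

// Proposed behaviour (hypothetical API): collapse all cached partitions that
// live on the same executor into a single local partition, so the follow-up
// work runs where the data already sits and no shuffle traffic is needed.
// val viaProcessCoalesce = cached.processCoalesce()
// viaProcessCoalesce.map(line => line.length).count()
{code}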
[jira] [Updated] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6397: -- Summary: Exclude virtual columns from QueryPlan.missingInput (was: Override QueryPlan.missingInput when necessary and rely on CheckAnalysis) Exclude virtual columns from QueryPlan.missingInput --- Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6451) Support CombineSum in Code Gen
[ https://issues.apache.org/jira/browse/SPARK-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375456#comment-14375456 ] Venkata Ramana G commented on SPARK-6451: - Working on the same. Support CombineSum in Code Gen -- Key: SPARK-6451 URL: https://issues.apache.org/jira/browse/SPARK-6451 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Since we are using CombineSum at the reducer side for the SUM function, we need to make it work in code gen. Otherwise, code gen will not convert Aggregates with a SUM function to GeneratedAggregates (the code gen version). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6462) UpdateStateByKey should allow inner join of new with old keys
Andre Schumacher created SPARK-6462: --- Summary: UpdateStateByKey should allow inner join of new with old keys Key: SPARK-6462 URL: https://issues.apache.org/jira/browse/SPARK-6462 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Andre Schumacher In a nutshell: provide a (inner join) instead of a cogroup for updateStateByKey in StateDStream. Details: It is common to read data (saw weblog data) from a streaming source (say Kafka) and each time update the state of a relatively small number of keys. If only the state changes need to be propagated to a downstream sink then one could avoid filtering out unchanged state in the user program and instead provide this functionality in the API (say by adding a updateStateChangesByKey method). Note that this is related but not identical to: https://issues.apache.org/jira/browse/SPARK-2629 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
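Until something like an updateStateChangesByKey exists, the usual workaround is to carry a changed flag inside the state and filter on it before writing to the sink; a sketch over a DStream of (key, count) pairs, with illustrative names and types:
{code}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

// State carries the running total plus a flag saying whether this batch touched the key.
case class CounterState(total: Long, changedInLastBatch: Boolean)

def trackChangedKeys(events: DStream[(String, Long)]): DStream[(String, CounterState)] = {
  val updated = events.updateStateByKey[CounterState] {
    (newValues: Seq[Long], old: Option[CounterState]) =>
      val previous = old.map(_.total).getOrElse(0L)
      Some(CounterState(previous + newValues.sum, changedInLastBatch = newValues.nonEmpty))
  }
  // Propagate only the keys whose state actually changed in this batch --
  // the filtering this ticket would like the API to do internally.
  updated.filter { case (_, state) => state.changedInLastBatch }
}
{code}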
[jira] [Commented] (SPARK-1480) Choose classloader consistently inside of Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375576#comment-14375576 ] Littlestar commented on SPARK-1480: --- same as https://issues.apache.org/jira/browse/SPARK-6461 run-example SparkPi can reproduce this bug. Choose classloader consistently inside of Spark codebase Key: SPARK-1480 URL: https://issues.apache.org/jira/browse/SPARK-1480 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.0.0 The Spark codebase is not always consistent on which class loader it uses when classlaoders are explicitly passed to things like serializers. This caused SPARK-1403 and also causes a bug where when the driver has a modified context class loader it is not translated correctly in local mode to the (local) executor. In most cases what we want is the following behavior: 1. If there is a context classloader on the thread, use that. 2. Otherwise use the classloader that loaded Spark. We should just have a utility function for this and call that function whenever we need to get a classloader. Note that SPARK-1403 is a workaround for this exact problem (it sets the context class loader because downstream code assumes it is set). Once this gets fixed in a more general way SPARK-1403 can be reverted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375590#comment-14375590 ] iward commented on SPARK-3720: -- Hi Zhan Zhang, I have the same problem, and I have only just started working with ORC files on Spark, so I cannot quite understand your patch. I would like to ask you a few questions: #1, why would Spark read the whole files? What exactly is the problem in Spark? #2, could you tell me what we should do to solve the problem? Thanks. support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: Fei Wang Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1) a single file as the output of each task, which reduces the NameNode's load; 2) Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3) light-weight indexes stored within the file (skip row groups that don't pass predicate filtering, seek to a given row); 4) block-mode compression based on data type (run-length encoding for integer columns, dictionary encoding for string columns); 5) concurrent reads of the same file using separate RecordReaders; 6) ability to split files without scanning for markers; 7) a bound on the amount of memory needed for reading or writing; 8) metadata stored using Protocol Buffers, which allows addition and removal of fields. Spark SQL already supports Parquet; supporting ORC would give people more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
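Illustrative only: until native ORC support lands, ORC data that is already registered as a Hive table can be queried through HiveContext's HiveQL path, which goes through Hive's SerDe rather than the reader this ticket proposes. The table name, columns, and filter below are assumptions.
{code}
import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext `sc` and a hypothetical metastore table
// `logs_orc` that is STORED AS ORC; the query runs via Hive's SerDe path.
val hiveContext = new HiveContext(sc)
hiveContext.sql(
  "SELECT event_type, COUNT(*) FROM logs_orc WHERE dt = '2015-03-23' GROUP BY event_type"
).collect().foreach(println)
{code}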
[jira] [Commented] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375468#comment-14375468 ] Littlestar commented on SPARK-6461: --- each mesos slave node has JAVA and HADOOP DataNode. I also add the following setting to mesos-master-env.sh and mesos-slave-env.sh. export MESOS_JAVA_HOME=/home/test/jdk export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0 export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin /usr/bin/env: bash: No such file or directory thanks. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use mesos run spak 1.3.0 ./run-example SparkPi but failed. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos spark.executorEnv.PATH spark.executorEnv.HADOOP_HOME spark.executorEnv.JAVA_HOME E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz' sh: hadoop: command not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
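As a side note on the configuration being discussed: executor environment variables can also be set programmatically through SparkConf, which is just another way of populating spark.executorEnv.* and is subject to the same question of whether the Mesos executor actually receives them. The paths below are placeholders.
{code}
import org.apache.spark.SparkConf

// Equivalent to spark.executorEnv.* entries in spark-defaults.conf (placeholder paths).
val conf = new SparkConf()
  .setAppName("SparkPi")
  .setExecutorEnv("PATH", "/home/test/jdk/bin:/home/test/hadoop-2.4.0/bin:/usr/bin:/bin")
  .setExecutorEnv("JAVA_HOME", "/home/test/jdk")
  .setExecutorEnv("HADOOP_HOME", "/home/test/hadoop-2.4.0")
{code}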
[jira] [Updated] (SPARK-6443) Could not submit app in standalone cluster mode when HA is enabled
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-6443: Description: After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar and it failed with error message: Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 9 more But in client mode it ended with correct result. So my guess is right. I will fix it in the related PR. was: After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 
3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: *./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar * and it failed with error message: ??Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 9 more?? So my guess is right. I will fix it in related PR. Could not submit app in standalone cluster mode when HA is enabled -- Key: SPARK-6443 URL: https://issues.apache.org/jira/browse/SPARK-6443 Project: Spark Issue Type: Bug Components: Spark Submit Reporter:
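A hedged sketch of the direction the reporter's fix would need to take (not the actual patch): in standalone HA mode, --master carries a comma-separated list, so the client should resolve each host:port to its own master actor URL instead of rejecting the combined string.
{code}
// Sketch only; the akka URL layout follows the standalone master naming convention,
// but the real fix belongs in the deploy client / Master.toAkkaUrl code path.
val masterProperty = "spark://doggie153:7077,doggie159:7077"
val masterAkkaUrls = masterProperty.stripPrefix("spark://").split(",").map { hostPort =>
  s"akka.tcp://sparkMaster@$hostPort/user/Master"
}
// masterAkkaUrls: Array(akka.tcp://sparkMaster@doggie153:7077/user/Master,
//                       akka.tcp://sparkMaster@doggie159:7077/user/Master)
{code}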
[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375583#comment-14375583 ] Littlestar commented on SPARK-1403: --- I want to reopen this bug, because I can reproduce it at spark 1.3.0 + mesos 0.21.1 with run-example SparkPi Spark on Mesos does not set Thread's context class loader - Key: SPARK-1403 URL: https://issues.apache.org/jira/browse/SPARK-1403 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: ubuntu 12.04 on vagrant Reporter: Bharath Bhushan Priority: Blocker Fix For: 1.0.0 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark executor on mesos slave throws a java.lang.ClassNotFoundException for org.apache.spark.serializer.JavaSerializer. The lengthy discussion is here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd
[ https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-6464: Description: Nowadays, the transformation *coalesce* is commonly used to expand or reduce the number of partitions in order to gain good performance. But *coalesce* can't make sure that the child partition will be executed on the same executor as the parent partitions, and this can lead to a large amount of network transfer. In some scenarios, such as the one mentioned in the title Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd --- Key: SPARK-6464 URL: https://issues.apache.org/jira/browse/SPARK-6464 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: SaintBacchus Nowadays, the transformation *coalesce* is commonly used to expand or reduce the number of partitions in order to gain good performance. But *coalesce* can't make sure that the child partition will be executed on the same executor as the parent partitions, and this can lead to a large amount of network transfer. In some scenarios, such as the one mentioned in the title -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6466) Remove unnecessary attributes when resolving GroupingSets
Liang-Chi Hsieh created SPARK-6466: -- Summary: Remove unnecessary attributes when resolving GroupingSets Key: SPARK-6466 URL: https://issues.apache.org/jira/browse/SPARK-6466 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor When resolving GroupingSets, we currently list all outputs of GroupingSets's child plan. However, the columns that are not in groupBy expressions and not used by aggregation expressions are unnecessary and can be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
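To make the observation concrete, a hedged HiveQL example (table and column names assumed): column d below appears in neither the GROUP BY expressions nor the aggregates, so there is no need for the resolved GroupingSets plan to carry it through from the child's output.
{code}
// Assumes an existing HiveContext `hiveContext` and a table t(a, b, c, d).
hiveContext.sql(
  "SELECT a, b, COUNT(c) FROM t GROUP BY a, b GROUPING SETS ((a), (a, b))"
).collect()
{code}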
[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6397: -- Affects Version/s: 1.3.0 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6397: -- Assignee: Yadong Qi Override QueryPlan.missingInput when necessary and rely on CheckAnalysis Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375580#comment-14375580 ] Littlestar commented on SPARK-6461: --- when I add MESOS_HADOOP_CONF_DIR at all mesos-master-env.sh and mesos-slave-env.sh , It throws the following error. Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use mesos run spak 1.3.0 ./run-example SparkPi but failed. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos spark.executorEnv.PATH spark.executorEnv.HADOOP_HOME spark.executorEnv.JAVA_HOME E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz' sh: hadoop: command not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd
SaintBacchus created SPARK-6464: --- Summary: Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd Key: SPARK-6464 URL: https://issues.apache.org/jira/browse/SPARK-6464 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: SaintBacchus -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd
[ https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-6464: Description: Nowadays, the transformation *coalesce* is commonly used to expand or reduce the number of partitions in order to gain good performance. But *coalesce* can't make sure that the child partition will be executed on the same executor as the parent partitions, and this can lead to a large amount of network transfer. In some scenarios, such as the +small and cached rdd+ mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure the child partition will be executed on that executor. This avoids network transfer, reduces task scheduling overhead, and frees the CPU cores for other jobs. In this scenario, our performance improved by 20% compared to before. was: Nowadays, the transformation *coalesce* is commonly used to expand or reduce the number of partitions in order to gain good performance. But *coalesce* can't make sure that the child partition will be executed on the same executor as the parent partitions, and this can lead to a large amount of network transfer. In some scenarios, such as the +small and cached rdd+ mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure the child partition will be executed on that executor. This avoids network transfer, reduces task scheduling overhead, and frees the CPU cores for other jobs. In this scenario, our performance improved by 20% compared to before. Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd --- Key: SPARK-6464 URL: https://issues.apache.org/jira/browse/SPARK-6464 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: SaintBacchus Nowadays, the transformation *coalesce* is commonly used to expand or reduce the number of partitions in order to gain good performance. But *coalesce* can't make sure that the child partition will be executed on the same executor as the parent partitions, and this can lead to a large amount of network transfer. In some scenarios, such as the +small and cached rdd+ mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure the child partition will be executed on that executor. This avoids network transfer, reduces task scheduling overhead, and frees the CPU cores for other jobs. In this scenario, our performance improved by 20% compared to before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
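For illustration, a hedged sketch of the existing coalesce behavior the ticket wants to improve on; the path and partition count are placeholders, and an existing SparkContext `sc` is assumed.
{code}
// With today's API, coalesce(shuffle = false) merges partitions but gives no
// guarantee that the merged partition runs on the executor holding the cached
// parent partitions, so cached blocks may still travel over the network.
val cached = sc.textFile("hdfs:///data/small-input").cache()
cached.count()                               // materialize the cache on the executors
val merged = cached.coalesce(4, shuffle = false)
println(merged.count())
{code}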
[jira] [Comment Edited] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375468#comment-14375468 ] Littlestar edited comment on SPARK-6461 at 3/23/15 9:29 AM: each mesos slave node has JAVA and HADOOP DataNode. Now I add the following setting to mesos-master-env.sh and mesos-slave-env.sh. export MESOS_JAVA_HOME=/home/test/jdk export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0 export MESOS_HADOOP_CONF_DIR=/home/test/hadoop-2.4.0/etc/hadoop export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin /usr/bin/env: bash: No such file or directory thanks. was (Author: cnstar9988): each mesos slave node has JAVA and HADOOP DataNode. I also add the following setting to mesos-master-env.sh and mesos-slave-env.sh. export MESOS_JAVA_HOME=/home/test/jdk export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0 export MESOS_HADOOP_CONF_DIR=/home/test/hadoop-2.4.0/etc/hadoop export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin /usr/bin/env: bash: No such file or directory thanks. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use mesos run spak 1.3.0 ./run-example SparkPi but failed. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos spark.executorEnv.PATH spark.executorEnv.HADOOP_HOME spark.executorEnv.JAVA_HOME E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz' sh: hadoop: command not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375680#comment-14375680 ] Cheng Lian commented on SPARK-6397: --- Hey [~smolav], after some discussion with [~waterman] in his PRs, we decided to fix the GROUPING__ID virtual column issue first. So I updated the title and description of this JIRA ticket, and created SPARK-6467 for the original one. You may link your PR to that one. Thanks! I should have created another JIRA ticket for the fix introduced in [~waterman]'s PR, but I realized the problem too late after merging it. Exclude virtual columns from QueryPlan.missingInput --- Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Virtual columns like GROUPING__ID should never be considered as missing input, and thus should be excluded from {{QueryPlan.missingInput}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
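For readers unfamiliar with the virtual column in question, a hedged HiveQL illustration (table and column names assumed): GROUPING__ID is synthesized by grouping-set style aggregations rather than produced by the child plan, which is why it should not be reported as missing input.
{code}
// Assumes an existing HiveContext `hiveContext` and a table t(a, b, c).
hiveContext.sql(
  "SELECT a, b, GROUPING__ID, COUNT(c) FROM t GROUP BY a, b WITH CUBE"
).collect()
{code}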
[jira] [Resolved] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6397. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5132 [https://github.com/apache/spark/pull/5132] Exclude virtual columns from QueryPlan.missingInput --- Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Fix For: 1.3.1, 1.4.0 Virtual columns like GROUPING__ID should never be considered as missing input, and thus should be excluded from {{QueryPlan.missingInput}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6456) Spark Sql throwing exception on large partitioned data
[ https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6456: -- Description: Spark connects with Hive Metastore. I am able to run simple queries like show table and select. but throws below exception while running query on the hive Table having large number of partitions. {noformat} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at`enter code here` org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) {noformat} was: Observation: Spark connects with hive Metastore. i am able to run simple queries like show table and select. but throws below exception while running query on the hive Table having large number of partitions. 
{code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at`enter code here` org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at
[jira] [Commented] (SPARK-6456) Spark Sql throwing exception on large partitioned data
[ https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375481#comment-14375481 ] Cheng Lian commented on SPARK-6456: --- How many partitions are there? Spark Sql throwing exception on large partitioned data -- Key: SPARK-6456 URL: https://issues.apache.org/jira/browse/SPARK-6456 Project: Spark Issue Type: Bug Components: SQL Reporter: pankaj Fix For: 1.2.1 Spark connects with Hive Metastore. I am able to run simple queries like show table and select. but throws below exception while running query on the hive Table having large number of partitions. {noformat} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at`enter code here` org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
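Not part of the ticket, but a commonly suggested mitigation while the underlying issue (fetching all partitions from the metastore in one call) is addressed: raising the Hive metastore client socket timeout. Whether this conf takes effect through HiveContext depends on the Hive version bundled with the build, so treat it as an assumption to verify; the table name is hypothetical.
{code}
// Assumes an existing HiveContext `hiveContext`; the timeout value is illustrative
// (Hive 0.13-era configs take this value in seconds).
hiveContext.setConf("hive.metastore.client.socket.timeout", "600")
hiveContext.sql("SELECT COUNT(*) FROM heavily_partitioned_table").collect()
{code}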
[jira] [Updated] (SPARK-5320) Joins on simple table created using select gives error
[ https://issues.apache.org/jira/browse/SPARK-5320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5320: Assignee: Yuri Saito Joins on simple table created using select gives error -- Key: SPARK-5320 URL: https://issues.apache.org/jira/browse/SPARK-5320 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Kuldeep Assignee: Yuri Saito Fix For: 1.3.1, 1.4.0 Register {{select 0 as a, 1 as b}} as table zeroone. Register {{select 0 as x, 1 as y}} as table zeroone2. The following SQL: {{select * from zeroone ta join zeroone2 tb on ta.a = tb.x}} gives the error: java.lang.UnsupportedOperationException: LeafNode NoRelation$ must implement statistics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
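A hedged reproduction sketch of the steps described in the report, assuming an existing SQLContext `sqlContext`; apart from the table names and the join condition taken from the report, everything here is illustrative.
{code}
// Register the two single-row tables described in the report.
sqlContext.sql("SELECT 0 AS a, 1 AS b").registerTempTable("zeroone")
sqlContext.sql("SELECT 0 AS x, 1 AS y").registerTempTable("zeroone2")

// The join that reportedly fails with:
//   java.lang.UnsupportedOperationException: LeafNode NoRelation$ must implement statistics.
sqlContext.sql("SELECT * FROM zeroone ta JOIN zeroone2 tb ON ta.a = tb.x").collect()
{code}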
[jira] [Commented] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375478#comment-14375478 ] Littlestar commented on SPARK-6461: --- In spark/bin, some shell scripts use #!/usr/bin/env bash. I changed #!/usr/bin/env bash to #!/bin/bash and that worked. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use mesos run spak 1.3.0 ./run-example SparkPi but failed. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos spark.executorEnv.PATH spark.executorEnv.HADOOP_HOME spark.executorEnv.JAVA_HOME E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz' sh: hadoop: command not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org