[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318915#comment-14318915 ] Chris T commented on SPARK-5436: I haven't been able to make headway on this. [~MechCoder], I suggest you take this on. Validate GradientBoostedTrees during training - Key: SPARK-5436 URL: https://issues.apache.org/jira/browse/SPARK-5436 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley For Gradient Boosting, it would be valuable to compute test error on a separate validation set during training. That way, training could stop early based on the test error (or some other metric specified by the user). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
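The validation-based early stopping requested above can be sketched generically. This is a hedged illustration, not MLlib's API: `fit_iteration`, `val_error`, and the patience of 3 rounds are all assumptions standing in for GBT internals.

```python
def boost_with_validation(fit_iteration, val_error, max_iters, tol=1e-5, patience=3):
    """Stop boosting when the validation error stops improving.

    fit_iteration(i) stands in for fitting one more tree; val_error(i)
    returns the error of the current ensemble on a held-out set.
    Both are hypothetical callbacks, not Spark APIs.
    """
    best_val, best_iter = float("inf"), 0
    for i in range(max_iters):
        fit_iteration(i)
        err = val_error(i)
        if err < best_val - tol:
            best_val, best_iter = err, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds: stop early
    return best_iter, best_val
```

The caller would then truncate the ensemble at `best_iter` rather than keeping all `max_iters` trees.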
[jira] [Commented] (SPARK-5754) Spark AM not launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318654#comment-14318654 ] Inigo commented on SPARK-5754: -- So, I did some test of what works and what doesn't: * '-Dspark.driver.port=21390': Error: Could not find or load main class '-Dspark.driver.port=21390' * -Dspark.driver.port=21390: OK * -Dspark.driver.port=21390: OK * -Dspark.driver.port='21390': OK * -Dspark.driver.port=21390: OK Another one that is problematic is: * -XX:OnOutOfMemoryError='kill %p': Error: Could not find or load main class ... * -XX:OnOutOfMemoryError=kill %p: Java cannot parse the option * -XX:OnOutOfMemoryError='kill %p': Java cannot parse the option Spark AM not launching on Windows - Key: SPARK-5754 URL: https://issues.apache.org/jira/browse/SPARK-5754 Project: Spark Issue Type: Bug Components: Windows, YARN Affects Versions: 1.1.1, 1.2.0 Environment: Windows Server 2012, Hadoop 2.4.1. Reporter: Inigo I'm trying to run Spark Pi on a YARN cluster running on Windows and the AM container fails to start. The problem seems to be in the generation of the YARN command which adds single quotes (') surrounding some of the java options. In particular, the part of the code that is adding those is the escapeForShell function in YarnSparkHadoopUtil. Apparently, Windows does not like the quotes for these options. 
Here is an example of the command that the container tries to execute:

{code}
@call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.yarn.secondary.jars=' '-Dspark.app.name=org.apache.spark.examples.SparkPi' '-Dspark.master=yarn-cluster' org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2
{code}

Once I transform it into:

{code}
@call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp -Dspark.yarn.secondary.jars= -Dspark.app.name=org.apache.spark.examples.SparkPi -Dspark.master=yarn-cluster org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2
{code}

everything seems to start. How should I deal with this? Should I create a separate function like escapeForShell for Windows and call it whenever I detect this is for Windows? Or should I add some sanity check on YARN? I checked a little and there seem to be people who are able to run Spark on YARN on Windows, so it might be something else. I didn't find anything related on Jira either.
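A platform-aware variant of the escaping step could look like the sketch below. The function name mirrors `escapeForShell` from YarnSparkHadoopUtil, but the Windows rules shown (double quotes, and only when the argument actually contains whitespace or quotes) are an assumption for illustration, not what Spark ships:

```python
import platform

def escape_for_shell(arg, windows=None):
    """Quote one launcher argument for the target shell.

    On Windows, cmd.exe does not strip single quotes, so a '-Dfoo=bar'
    wrapped in them is treated as an unknown main class. Use double
    quotes there, and only when quoting is actually needed.
    """
    if windows is None:
        windows = platform.system() == "Windows"
    if windows:
        if any(c in arg for c in ' \t"'):
            return '"' + arg.replace('"', '\\"') + '"'
        return arg  # plain -D options pass through unquoted
    # POSIX shells: single-quote, escaping embedded single quotes
    return "'" + arg.replace("'", "'\\''") + "'"
```

With these assumed rules, `-Dspark.driver.port=21390` would pass through untouched on Windows, while `kill %p` would be double-quoted.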
[jira] [Resolved] (SPARK-5757) Use json4s instead of DataFrame.toJSON in model export
[ https://issues.apache.org/jira/browse/SPARK-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5757. -- Resolution: Fixed Fix Version/s: 1.3.0 Use json4s instead of DataFrame.toJSON in model export -- Key: SPARK-5757 URL: https://issues.apache.org/jira/browse/SPARK-5757 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.3.0 We use DataFrame.toJSON to save/load model metadata, which depends on DataFrame's JSON support and subject to changes made there. To avoid conflicts, e.g., https://github.com/apache/spark/pull/4544, we should use json4s directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5780) The loggings of Python unittests are noisy and scaring in
[ https://issues.apache.org/jira/browse/SPARK-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318836#comment-14318836 ] Apache Spark commented on SPARK-5780: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4572 The loggings of Python unittests are noisy and scaring in -- Key: SPARK-5780 URL: https://issues.apache.org/jira/browse/SPARK-5780 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu There is a lot of logging coming from the driver and workers; it's noisy and alarming, and it contains many exceptions, so people are confused about whether the tests are failing or not. We should mute the logging during tests and only show it if a test fails.
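One way to get the behavior described, sketched with the standard library rather than PySpark's actual test harness: buffer log output during a test and only surface the buffer when the test fails.

```python
import contextlib
import io
import logging

@contextlib.contextmanager
def captured_logs(logger=None):
    """Temporarily route a logger's output into an in-memory buffer.

    The caller decides what to do with the buffer afterwards: discard
    it when the test passed, print it when the test failed.
    """
    logger = logger or logging.getLogger()
    buf = io.StringIO()
    saved = logger.handlers[:]
    logger.handlers = [logging.StreamHandler(buf)]
    try:
        yield buf
    finally:
        logger.handlers = saved
```

A test runner would wrap each test body in `with captured_logs() as buf:` and print `buf.getvalue()` only in the failure path.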
[jira] [Commented] (SPARK-5522) Accelerate the History Server start
[ https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318857#comment-14318857 ] Ryan Williams commented on SPARK-5522: -- I think [SPARK-4558|https://issues.apache.org/jira/browse/SPARK-4558] is talking about the same problem. Accelerate the History Server start Key: SPARK-5522 URL: https://issues.apache.org/jira/browse/SPARK-5522 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Liangliang Gu When starting the history server, all the log files will be fetched and parsed in order to get the applications' metadata, e.g. App Name, Start Time, Duration, etc. In our production cluster, there are 2600 log files (160G) in HDFS and it costs 3 hours to restart the history server, which is a little too long for us. It would be better if the history server could show logs with missing information during start-up and fill in the missing information after fetching and parsing each log file.
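The proposal can be sketched as a two-phase load; `parse_fn` below is a hypothetical stand-in for the slow fetch-and-parse of one HDFS log file, and the snapshot list stands in for what the web UI could serve at each point in time:

```python
def build_listing(log_files, parse_fn):
    """Serve a complete listing immediately with placeholder metadata,
    then fill in real metadata as each log file is parsed."""
    listing = {f: {"app_name": None, "status": "loading"} for f in log_files}
    snapshots = [dict(listing)]              # what the UI could show at startup
    for f in log_files:                      # slow pass, file by file
        listing[f] = dict(parse_fn(f), status="ready")
        snapshots.append(dict(listing))
    return snapshots
```

In a real server the slow pass would run on a background thread while the UI reads the shared listing; the sequential version only shows the ordering of states.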
[jira] [Created] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
Mark Khaitman created SPARK-5782: Summary: Python Worker / Pyspark Daemon Memory Issue Key: SPARK-5782 URL: https://issues.apache.org/jira/browse/SPARK-5782 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.2.1, 1.3.0, 1.2.2 Environment: CentOS 7, Spark Standalone Reporter: Mark Khaitman I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers as the default 1 task - 1 core configuration in our environment. This can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB python worker memory limit and a few gigs of executor memory. A related issue is that the individual python workers are not supposed to exceed 512MB by very much; beyond that they're supposed to spill to disk. I came across this bit of code in shuffle.py which *may* have something to do with allowing some of our python workers to somehow reach 2GB each (which, when multiplied by the number of cores per executor and the number of joins occurring in some cases, causes the out-of-memory killer to step up to its unfortunate job!) :(

{code}
def _next_limit(self):
    """
    Return the next memory limit. If the memory is not released
    after spilling, it will dump the data only when the used memory
    starts to increase.
    """
    return max(self.memory_limit, get_used_memory() * 1.05)
{code}

I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code!
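To see why `_next_limit` matters here: if memory is never released after a spill, the limit ratchets upward to 105% of current usage each time, so a worker configured for 512 MB can legitimately drift far past it. A small simulation (numbers are illustrative, not measurements from the reporter's cluster):

```python
def next_limit(memory_limit, used_memory):
    # Same logic as the quoted shuffle.py snippet: never shrink below the
    # configured limit, but allow 105% of whatever is currently in use.
    return max(memory_limit, used_memory * 1.05)

# Worst case: usage always grows to meet the new ceiling before the next spill.
limit = used = 512.0
for _ in range(10):
    limit = next_limit(512.0, used)
    used = limit
# After 10 spills the effective cap is 512 * 1.05**10, roughly 834 MB,
# and it keeps compounding with every further spill.
```

Multiply that drift by cores per executor and concurrent joins, and the OOM killer scenario above follows.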
[jira] [Created] (SPARK-5783) Include filename, line number in eventlog-parsing error message
Ryan Williams created SPARK-5783: Summary: Include filename, line number in eventlog-parsing error message Key: SPARK-5783 URL: https://issues.apache.org/jira/browse/SPARK-5783 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor While investigating why some recent applications were not showing up in my History Server UI, I found error message blocks like this in the history server logs: {code} 15/02/12 18:51:55 ERROR scheduler.ReplayListenerBus: Exception in parsing Spark event log. java.lang.ClassCastException: org.json4s.JsonAST$JNothing$ cannot be cast to org.json4s.JsonAST$JObject at org.apache.spark.util.JsonProtocol$.mapFromJson(JsonProtocol.scala:814) at org.apache.spark.util.JsonProtocol$.executorInfoFromJson(JsonProtocol.scala:805) ... at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1.run(FsHistoryProvider.scala:84) 15/02/12 18:51:55 ERROR scheduler.ReplayListenerBus: Malformed line: {Event:SparkListenerExecutorAdded,Timestamp:1422897479154,Executor ID:12,Executor Info:{Host:demeter-csmaz11-1.demeter.hpc.mssm.edu,Total Cores:4}} {code} Turns out certain files had some malformed lines due to having been generated by a forked Spark with some WIP event-log functionality. It would be nice if the first line specified the file the error was found in, and the last line specified the line number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
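The requested message shape could look like this sketch, with plain `json` standing in for `JsonProtocol` and the function name purely illustrative:

```python
import json

def replay(path, lines):
    """Parse event-log lines, reporting file name and line number on failure."""
    events, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            events.append(json.loads(line))
        except ValueError:
            # json.JSONDecodeError subclasses ValueError
            errors.append("Exception parsing Spark event log %s at line %d: %s"
                          % (path, lineno, line.strip()))
    return events, errors
```

With both the file and the line number in one message, a malformed line like the `SparkListenerExecutorAdded` example above is immediately attributable to a specific log.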
[jira] [Resolved] (SPARK-5776) JIRA version not of form x.y.z breaks merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5776. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4570 [https://github.com/apache/spark/pull/4570] JIRA version not of form x.y.z breaks merge_spark_pr.py --- Key: SPARK-5776 URL: https://issues.apache.org/jira/browse/SPARK-5776 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Sean Owen Priority: Minor Fix For: 1.4.0 It appears that the version 2+ I added to JIRA breaks the merge script since it expects x.y.z only. I will try to patch the logic quickly. Worst case, we can name the version 2.0.0 if we have to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
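The failure mode is simple: the merge script expects three dot-separated numbers. A hedged sketch of the kind of normalization that would tolerate a version name like `2+` (the helper is hypothetical, not the script's actual code):

```python
import re

THREE_PART = re.compile(r"^\d+\.\d+\.\d+$")

def normalize_version(name):
    """Return an x.y.z form of a JIRA version name, or None if hopeless."""
    if THREE_PART.match(name):
        return name
    digits = re.findall(r"\d+", name)           # '2+' -> ['2']
    if not digits:
        return None
    return ".".join((digits + ["0", "0"])[:3])  # pad to three components
```

Under this sketch the JIRA version `2+` would be read as `2.0.0`, matching the fallback Sean mentions.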
[jira] [Updated] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode
[ https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5655: - Affects Version/s: (was: 1.3.0) Fix Version/s: 1.2.2 YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode - Key: SPARK-5655 URL: https://issues.apache.org/jira/browse/SPARK-5655 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.1 Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2 Reporter: Andrew Rowson Assignee: Andrew Rowson Priority: Critical Labels: hadoop Fix For: 1.3.0, 1.2.2 When running a Spark job on a YARN cluster which doesn't run containers under the same user as the nodemanager, and also when using the YARN auxiliary shuffle service, jobs fail with something similar to: {code:java} java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied) {code} The root cause of this here: https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287 Spark will attempt to chmod 700 any application directories it creates during the job, which includes files created in the nodemanager's usercache directory. The owner of these files is the container UID, which on a secure cluster is the name of the user creating the job, and on an nonsecure cluster but with the yarn.nodemanager.container-executor.class configured is the value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user. The problem with this is that the auxiliary shuffle manager runs as part of the nodemanager, which is typically running as the user 'yarn'. This can't access these files that are only owner-readable. 
YARN already attempts to secure files created under appcache but keep them readable by the nodemanager, by setting the group of the appcache directory to 'yarn' and also setting the setgid flag. This means that files and directories created under this should also have the 'yarn' group. Normally this means that the nodemanager should also be able to read these files, but Spark setting chmod 700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod 700 functionality makes this work on YARN, and still makes the application files only readable by the owner and the group:

{code}
/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb 6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb 6 18:30 ..
-rw-r----- 1 nobody yarn 206M Feb 6 18:30 shuffle_0_0_0.data
{code}

But this may not be the right approach on non-YARN. Perhaps an additional step to see if this chmod 700 step is necessary (i.e., non-YARN) is required. Sadly, I don't have a non-YARN environment to test, otherwise I'd be able to suggest a patch. I believe this is a related issue in the MapReduce framework: https://issues.apache.org/jira/browse/MAPREDUCE-3728
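A minimal demonstration of the suggested relaxation (mode 0750 instead of 0700), assuming POSIX permission semantics; whether Spark should actually do this on non-YARN deployments is exactly the open question above:

```python
import os
import stat
import tempfile

d = tempfile.mkdtemp()
os.chmod(d, 0o750)  # owner: rwx, group (e.g. 'yarn'): r-x, others: none
mode = stat.S_IMODE(os.stat(d).st_mode)

owner_full = (mode & stat.S_IRWXU) == stat.S_IRWXU   # owner keeps full access
group_can_read = bool(mode & stat.S_IRGRP)           # shuffle service's group can read
others_locked_out = (mode & stat.S_IRWXO) == 0       # everyone else still excluded
```

Combined with YARN's setgid appcache directory, group-read is what lets the nodemanager-hosted shuffle service open the index and data files.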
[jira] [Commented] (SPARK-5778) Throw if nonexistent spark.metrics.conf file is provided
[ https://issues.apache.org/jira/browse/SPARK-5778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318823#comment-14318823 ] Apache Spark commented on SPARK-5778: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4571 Throw if nonexistent spark.metrics.conf file is provided -- Key: SPARK-5778 URL: https://issues.apache.org/jira/browse/SPARK-5778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor Spark looks for a {{MetricsSystem}} configuration file when the {{spark.metrics.conf}} parameter is set, [defaulting to the path {{metrics.properties}} when it's not set|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L52-L55]. In the event of a failure to find or parse the file, [the exception is caught and an error is logged|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L61]. This seems like reasonable behavior in the general case where the user has not specified a {{spark.metrics.conf}} file. However, I've been bitten several times by having specified a file that all or some executors did not have present (I typo'd the path, or forgot to add an additional {{--files}} flag to make my local metrics config file get shipped to all executors), the error was swallowed and I was confused about why I'd captured no metrics from a job that appeared to have run successfully. I'd like to change the behavior to actually throw if the user has specified a configuration file that doesn't exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5779) Python broadcast does not work with Kryo serializer
Davies Liu created SPARK-5779: - Summary: Python broadcast does not work with Kryo serializer Key: SPARK-5779 URL: https://issues.apache.org/jira/browse/SPARK-5779 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.1, 1.3.0 Reporter: Davies Liu Priority: Critical The PythonBroadcast, which was introduced in 1.2, cannot be serialized by Kryo.
[jira] [Commented] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318869#comment-14318869 ] Nicholas Chammas commented on SPARK-5765: - FWIW [~srowen], the last time I had this discussion with [~pwendell] I believe the direction was to have separate JIRAs for separate pieces of work. That way they can be assigned and credited to separate people. If 3 people work on fixing Bash word splitting bugs in different places, how do you track those contributions in JIRA in an automated way? Per Patrick's announcement some weeks ago, JIRA is now the source for release notes credits. word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Word split problem in the spark directory path in the scripts run-example and compute-classpath.sh. This was introduced in the fix for SPARK-4504.
[jira] [Commented] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318637#comment-14318637 ] Yin Huai commented on SPARK-4856: - [~chenghao] I think it is fine to use NullType for an empty string during the process of inferring the schema. However, I think we should not always treat an empty string as null (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L402). If the inferred data type is StringType, we should return the empty string. Otherwise, we are destroying information. If the inferred data type is any other data type, I think it is reasonable to return a null. Null empty string should not be considered as StringType at begining in Json schema inferring --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.3.0 We have data like:

{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string (the headers in line 3) will be considered a String, the real nested data type (the struct-typed headers in line 1) is ignored, and we then get the headers (in line 1) as StringType, which is not our expectation.
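Yin's merging rule, where NullType yields to any concrete type so an empty `headers` in one record doesn't clobber the struct inferred from another, can be sketched outside Spark like this. The type names are illustrative, not Catalyst's:

```python
def infer_type(value):
    """Very small stand-in for per-record JSON schema inference."""
    if value == "" or value is None:
        return "NullType"          # empty string starts as NullType, per the comment
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    return "StringType" if isinstance(value, str) else "OtherType"

def merge_types(a, b):
    """Combine the types inferred from two records."""
    if a == "NullType":
        return b                   # NullType yields to any concrete type
    if b == "NullType":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        return {k: merge_types(a.get(k, "NullType"), b.get(k, "NullType"))
                for k in set(a) | set(b)}
    return a if a == b else "StringType"   # conflicting scalars: widen to string
```

Applied to the example data, the record with `"headers":""` no longer forces the struct-typed `headers` down to a string.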
[jira] [Commented] (SPARK-5747) Review all Bash scripts for word splitting bugs
[ https://issues.apache.org/jira/browse/SPARK-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318717#comment-14318717 ] Apache Spark commented on SPARK-5747: - User 'dyross' has created a pull request for this issue: https://github.com/apache/spark/pull/4540 Review all Bash scripts for word splitting bugs --- Key: SPARK-5747 URL: https://issues.apache.org/jira/browse/SPARK-5747 Project: Spark Issue Type: Umbrella Components: Build Reporter: Nicholas Chammas Triggered by [this discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/1-2-1-start-all-sh-broken-td10583.html]. Bash word splitting is a nefarious problem. http://mywiki.wooledge.org/WordSplitting

Bad: {code}command $variable{code} Good: {code}command "$variable"{code}
Bad: {code}command $variable/path{code} Good: {code}command "$variable/path"{code}
Bad: {code}command $variable/stuff*{code} Good: {code}command "$variable/stuff"*{code}

It's that simple.
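The effect is easy to demonstrate by driving a shell from Python (assuming a POSIX `sh` is available on the test machine):

```python
import subprocess

def run(script):
    """Run a small shell script and return its stdout."""
    return subprocess.run(["sh", "-c", script],
                          capture_output=True, text=True).stdout

# Unquoted: $v undergoes word splitting and becomes two separate words.
unquoted = run('v="a b"; for w in $v; do echo "<$w>"; done')
# Quoted: "$v" stays a single word.
quoted = run('v="a b"; for w in "$v"; do echo "<$w>"; done')
```

A path with a space behaves the same way, which is how an unquoted `$SPARK_HOME` breaks `run-example` and `compute-classpath.sh`.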
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4180: -- Labels: (was: backport-needed) SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.2.0, 1.3.0 Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4180. --- Resolution: Fixed Fix Version/s: (was: 1.2.1) 1.2.0 Target Version/s: 1.2.0 (was: 1.2.0, 1.0.3, 1.1.2) I'm going to resolve this as fixed since it was included in 1.2.0. Now that we're about to release 1.3, I don't think that we need to backport this into branch-1.0, so I'm going to remove the {{backport-needed}} label. SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.3.0, 1.2.0 Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
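The guard described above, one active context per process, enforced with a lock in the constructor and `stop()`, amounts to the pattern below. This is a generic sketch, not SparkContext's code:

```python
import threading

class Context:
    """At most one active Context per process; a second constructor call
    before stop() raises instead of leaving behavior unspecified."""
    _lock = threading.Lock()
    _active = None

    def __init__(self):
        with Context._lock:
            if Context._active is not None:
                raise RuntimeError("Another Context is already running; "
                                   "call stop() on it first")
            Context._active = self

    def stop(self):
        with Context._lock:
            if Context._active is self:
                Context._active = None
```

Failing fast in the constructor turns the confusing downstream errors mentioned in SPARK-4080 into one clear message at the point of misuse.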
[jira] [Commented] (SPARK-5776) JIRA version not of form x.y.z breaks merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318777#comment-14318777 ] Apache Spark commented on SPARK-5776: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4570 JIRA version not of form x.y.z breaks merge_spark_pr.py --- Key: SPARK-5776 URL: https://issues.apache.org/jira/browse/SPARK-5776 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Sean Owen Priority: Minor It appears that the version 2+ I added to JIRA breaks the merge script since it expects x.y.z only. I will try to patch the logic quickly. Worst case, we can name the version 2.0.0 if we have to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5781) Add metadata files for JSON datasets
Yin Huai created SPARK-5781: --- Summary: Add metadata files for JSON datasets Key: SPARK-5781 URL: https://issues.apache.org/jira/browse/SPARK-5781 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai If we save a dataset in JSON format (e.g. through DataFrame.save), we should also persist the schema of the table, so we can avoid inferring the schema when we want to query it in the future.
[jira] [Commented] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1431#comment-1431 ] Sean Owen commented on SPARK-5765: -- I don't think JIRAs are for splitting up work on one issue among people. This seems like one issue. PRs track implementation and there can be several for a JIRA. You're right that there's this problem of one Assignee field. Usually there's clearly one contributor; where not I hope we'd give kudos to the newer contributor; that's 99% of all cases. Here, no problem, we have JIRAs enough to go around. word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Work split problem in spark directory path in scripts run-example and compute-classpath.sh This was introduced in defect fix SPARK-4504 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-984) SPARK_TOOLS_JAR not set if multiple tools jars exists
[ https://issues.apache.org/jira/browse/SPARK-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-984: - Assignee: (was: Josh Rosen) SPARK_TOOLS_JAR not set if multiple tools jars exists - Key: SPARK-984 URL: https://issues.apache.org/jira/browse/SPARK-984 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.8.1, 0.9.0 Reporter: Aaron Davidson Priority: Minor If you have multiple tools assemblies (e.g., if you assembled on 0.8.1 and 0.9.0 before, for instance), then this error is thrown in spark-class: {noformat}./spark-class: line 115: [: /home/aaron/spark/tools/target/scala-2.9.3/spark-tools-assembly-0.8.1-incubating-SNAPSHOT.jar: binary operator expected{noformat} This is because of a flaw in the bash script: {noformat}if [ -e $TOOLS_DIR/target/scala-$SCALA_VERSION/*assembly*[0-9Tg].jar ]; then{noformat} which does not parse correctly if the path resolves to multiple files. The error is non-fatal, but a nuisance and presumably breaks whatever SPARK_TOOLS_JAR is used for. Currently, we error if multiple Spark assemblies are found, so we could do something similar for tools assemblies. The only issue is that means that the user will always have to go through both errors (clean the assembly/ jars then tools/ jar) when it appears that the tools/ jar is not actually important for normal operation. The second possibility is to infer the correct tools jar using the single available assembly jar, but this is slightly complicated by the code path if $FWDIR/RELEASE exists. Since I'm not 100% on what SPARK_TOOLS_JAR is even for, I'm assigning this to Josh who wrote the code initially. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5776) JIRA version not of form x.y.z breaks merge_spark_pr.py
Sean Owen created SPARK-5776: Summary: JIRA version not of form x.y.z breaks merge_spark_pr.py Key: SPARK-5776 URL: https://issues.apache.org/jira/browse/SPARK-5776 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Sean Owen Priority: Minor It appears that the version 2+ I added to JIRA breaks the merge script since it expects x.y.z only. I will try to patch the logic quickly. Worst case, we can name the version 2.0.0 if we have to.
[jira] [Assigned] (SPARK-5776) JIRA version not of form x.y.z breaks merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-5776: Assignee: Sean Owen JIRA version not of form x.y.z breaks merge_spark_pr.py --- Key: SPARK-5776 URL: https://issues.apache.org/jira/browse/SPARK-5776 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 It appears that the version 2+ I added to JIRA breaks the merge script since it expects x.y.z only. I will try to patch the logic quickly. Worst case, we can name the version 2.0.0 if we have to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4180: -- Fix Version/s: 1.2.1 1.3.0 SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Labels: backport-needed Fix For: 1.3.0, 1.2.1 Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode
[ https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5655. -- Resolution: Fixed Fix Version/s: 1.3.0 YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode - Key: SPARK-5655 URL: https://issues.apache.org/jira/browse/SPARK-5655 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0, 1.2.1 Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2 Reporter: Andrew Rowson Assignee: Andrew Rowson Priority: Critical Labels: hadoop Fix For: 1.3.0 When running a Spark job on a YARN cluster which doesn't run containers under the same user as the nodemanager, and also when using the YARN auxiliary shuffle service, jobs fail with something similar to: {code:java} java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied) {code} The root cause of this here: https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287 Spark will attempt to chmod 700 any application directories it creates during the job, which includes files created in the nodemanager's usercache directory. The owner of these files is the container UID, which on a secure cluster is the name of the user creating the job, and on an nonsecure cluster but with the yarn.nodemanager.container-executor.class configured is the value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user. The problem with this is that the auxiliary shuffle manager runs as part of the nodemanager, which is typically running as the user 'yarn'. This can't access these files that are only owner-readable. 
YARN already attempts to secure files created under appcache but keeps them readable by the nodemanager, by setting the group of the appcache directory to 'yarn' and also setting the setgid flag. This means that files and directories created under it should also have the 'yarn' group. Normally this means that the nodemanager should also be able to read these files, but Spark's chmod 700 wipes this out. I'm not sure what the right approach is here. Commenting out the chmod 700 functionality makes this work on YARN, and still leaves the application files readable only by the owner and the group: {code} /data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c # ls -lah total 206M drwxr-s--- 2 nobody yarn 4.0K Feb 6 18:30 . drwxr-s--- 12 nobody yarn 4.0K Feb 6 18:30 .. -rw-r----- 1 nobody yarn 206M Feb 6 18:30 shuffle_0_0_0.data {code} But this may not be the right approach on non-YARN. Perhaps an additional check for whether this chmod 700 step is necessary (i.e. non-YARN) is required. Sadly, I don't have a non-YARN environment to test, otherwise I'd be able to suggest a patch. I believe there is a related issue in the MapReduce framework: https://issues.apache.org/jira/browse/MAPREDUCE-3728 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
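The permission interaction can be sketched in plain shell (locally, with made-up paths; group ownership like 'yarn' cannot be reproduced without root). The point is that an explicit chmod 700 drops the group bits that the setgid appcache scheme relies on:

```shell
# Simulate YARN's appcache layout: a group-accessible directory with the
# setgid bit set, so new entries inherit the directory's group.
workdir=$(mktemp -d)
appcache="$workdir/appcache"
mkdir "$appcache"
chmod 2770 "$appcache"                  # drwxrws---
group_before=$(ls -ld "$appcache" | cut -c5-7)

# What Spark effectively does to its application directories:
chmod 700 "$appcache"
group_after=$(ls -ld "$appcache" | cut -c5-7)

echo "group bits before: $group_before"   # rws: group can read and traverse
echo "group bits after:  $group_after"    # group access gone
rm -rf "$workdir"
```

After the chmod, the group triplet no longer contains the read bit, which is exactly why a nodemanager running as 'yarn' can no longer serve the shuffle files.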
[jira] [Created] (SPARK-5778) Throw if nonexistent spark.metrics.conf file is provided
Ryan Williams created SPARK-5778: Summary: Throw if nonexistent spark.metrics.conf file is provided Key: SPARK-5778 URL: https://issues.apache.org/jira/browse/SPARK-5778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor Spark looks for a {{MetricsSystem}} configuration file when the {{spark.metrics.conf}} parameter is set, [defaulting to the path {{metrics.properties}} when it's not set|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L52-L55]. In the event of a failure to find or parse the file, [the exception is caught and an error is logged|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L61]. This seems like reasonable behavior in the general case where the user has not specified a {{spark.metrics.conf}} file. However, I've been bitten several times by having specified a file that all or some executors did not have present (I typo'd the path, or forgot to add an additional {{--files}} flag to make my local metrics config file get shipped to all executors), the error was swallowed and I was confused about why I'd captured no metrics from a job that appeared to have run successfully. I'd like to change the behavior to actually throw if the user has specified a configuration file that doesn't exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
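The proposed behaviour can be sketched as a shell analogy (the actual change would live in MetricsConfig.scala; the function and its messages here are illustrative only): tolerate a missing file only when the user did not explicitly ask for one.

```shell
# Fail fast only when the user explicitly configured a metrics file that
# does not exist; a missing *default* metrics.properties stays non-fatal.
check_metrics_conf() {
  explicit_conf=$1   # value of spark.metrics.conf, or "" if unset
  if [ -n "$explicit_conf" ] && [ ! -f "$explicit_conf" ]; then
    echo "error: spark.metrics.conf=$explicit_conf does not exist" >&2
    return 1
  fi
  return 0
}

check_metrics_conf "" && echo "no explicit conf: tolerated"
check_metrics_conf /no/such/metrics.properties || echo "explicit missing conf: rejected"
```

This preserves the current lenient default while surfacing the typo'd-path and forgotten `--files` cases described above.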
[jira] [Updated] (SPARK-5522) Accelerate the History Server start
[ https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5522: - Summary: Accelerate the History Server start (was: Accelerate the Histroty Server start) Accelerate the History Server start --- Key: SPARK-5522 URL: https://issues.apache.org/jira/browse/SPARK-5522 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Liangliang Gu When starting the history server, all the log files will be fetched and parsed in order to get the applications' meta data e.g. App Name, Start Time, Duration, etc. In our production cluster, there exist 2600 log files (160G) in HDFS and it costs 3 hours to restart the history server, which is a little bit too long for us. It would be better, if the history server can show logs with missing information during start-up and fill the missing information after fetching and parsing a log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5210) Support log rolling in EventLogger
[ https://issues.apache.org/jira/browse/SPARK-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5210. --- Resolution: Later Assignee: (was: Josh Rosen) I'm closing this issue for now, since my original motivation for this feature has changed and there's no reason to let it clutter up the JIRA in the meantime. Support log rolling in EventLogger -- Key: SPARK-5210 URL: https://issues.apache.org/jira/browse/SPARK-5210 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Reporter: Josh Rosen For long-running Spark applications (e.g. running for days / weeks), the Spark event log may grow to be very large. As a result, it would be useful if EventLoggingListener supported log file rolling / rotation. Adding this feature will involve changes to the HistoryServer in order to be able to load event logs from a sequence of files instead of a single file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode
[ https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5655: - Assignee: Andrew Rowson YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode - Key: SPARK-5655 URL: https://issues.apache.org/jira/browse/SPARK-5655 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0, 1.2.1 Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2 Reporter: Andrew Rowson Assignee: Andrew Rowson Priority: Critical Labels: hadoop Fix For: 1.3.0 When running a Spark job on a YARN cluster which doesn't run containers under the same user as the nodemanager, and also when using the YARN auxiliary shuffle service, jobs fail with something similar to: {code:java} java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied) {code} The root cause of this here: https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287 Spark will attempt to chmod 700 any application directories it creates during the job, which includes files created in the nodemanager's usercache directory. The owner of these files is the container UID, which on a secure cluster is the name of the user creating the job, and on an nonsecure cluster but with the yarn.nodemanager.container-executor.class configured is the value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user. The problem with this is that the auxiliary shuffle manager runs as part of the nodemanager, which is typically running as the user 'yarn'. This can't access these files that are only owner-readable. YARN already attempts to secure files created under appcache but keep them readable by the nodemanager, by setting the group of the appcache directory to 'yarn' and also setting the setgid flag. 
This means that files and directories created under it should also have the 'yarn' group. Normally this means that the nodemanager should also be able to read these files, but Spark's chmod 700 wipes this out. I'm not sure what the right approach is here. Commenting out the chmod 700 functionality makes this work on YARN, and still leaves the application files readable only by the owner and the group: {code} /data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c # ls -lah total 206M drwxr-s--- 2 nobody yarn 4.0K Feb 6 18:30 . drwxr-s--- 12 nobody yarn 4.0K Feb 6 18:30 .. -rw-r----- 1 nobody yarn 206M Feb 6 18:30 shuffle_0_0_0.data {code} But this may not be the right approach on non-YARN. Perhaps an additional check for whether this chmod 700 step is necessary (i.e. non-YARN) is required. Sadly, I don't have a non-YARN environment to test, otherwise I'd be able to suggest a patch. I believe there is a related issue in the MapReduce framework: https://issues.apache.org/jira/browse/MAPREDUCE-3728 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5777) Completes data source filter types and remove CatalystScan
Cheng Lian created SPARK-5777: - Summary: Completes data source filter types and remove CatalystScan Key: SPARK-5777 URL: https://issues.apache.org/jira/browse/SPARK-5777 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1, 1.2.0, 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Currently the data source API only supports a fraction of common filters, esp. {{And}} is not supported yet. To workaround this issue and enable full filter push-down optimization in the Parquet data source, {{CatalystScan}} was introduced to receive full Catalyst filter expressions. This class should be removed, since in principle, data source implementations shouldn't touch Catalyst expressions (which are not part of the public developer API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5767) Migrate Parquet data source to the write support of data source API
Cheng Lian created SPARK-5767: - Summary: Migrate Parquet data source to the write support of data source API Key: SPARK-5767 URL: https://issues.apache.org/jira/browse/SPARK-5767 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Migrate to the newly introduced data source write support API (SPARK-5658). Add support for overwriting and appending to existing tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4819) Remove Guava's Optional from public API
[ https://issues.apache.org/jira/browse/SPARK-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4819: - Target Version/s: 2+ Remove Guava's Optional from public API - Key: SPARK-4819 URL: https://issues.apache.org/jira/browse/SPARK-4819 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Filing this mostly so this isn't forgotten. Spark currently exposes Guava types in its public API (the {{Optional}} class is used in the Java bindings). This makes it hard to properly hide Guava from user applications, and makes mixing different Guava versions with Spark a little sketchy (even if things should work, since those classes are pretty simple in general). Since this changes the public API, it has to be done in a release that allows such breakages. But it would be nice to at least have a transition plan for deprecating the affected APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3369: - Priority: Major (was: Critical) Target Version/s: 2+ (was: 1.2.0) Affects Version/s: 1.2.1 Assignee: Sean Owen Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2, 1.2.1 Reporter: Sean Owen Assignee: Sean Owen Labels: breaking_change Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency because it seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that this is, of course, part of a public API. Workarounds I can think of: promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}; or make a series of methods accepting a {{FlatMapFunction2}}, etc., with the desired signature, and deprecate the existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3266: - Target Version/s: 2+ (was: 1.1.1, 1.2.0) Assignee: Sean Owen JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2, 1.1.0, 1.2.0 Reporter: Amey Chaugule Assignee: Sean Owen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318041#comment-14318041 ] Sean Owen commented on SPARK-5770: -- Can you be more specific about where you think the code path fails to copy the new file? Your PR does not touch the copying code but disables overwrite entirely, which is not OK. Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar content and upload it again. We can see that the jar file on local disk has been updated, but the classloader still loads the old one. The executor log has no error or exception pointing to this. I used spark-shell to test this, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3365) Failure to save Lists to Parquet
[ https://issues.apache.org/jira/browse/SPARK-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317768#comment-14317768 ] Yi Tian commented on SPARK-3365: The reason is Spark generated wrong schema for type {{List}} in {{ScalaReflection.scala}} for example: the generated schema for type {{Seq\[String\]}} is: {code} {name:x,type:{type:array,elementType:string,containsNull:true},nullable:true,metadata:{}} {code} the generated schema for type {{List\[String\]}} is: {code} {name:x,type:{type:struct,fields:[]},nullable:true,metadata:{}} {code} The related code is [here|https://github.com/apache/spark/blob/500dc2b4b3136029457e708859fe27da93b1f9e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L110] The order of resolution is: # UserCustomType # Option\[\_\] # Product # Array\[Byte\] # Array\[\_\] # Seq\[\_\] # Map\[\_, _\] # String # Timestamp # java.sql.Date # BigDecimal # java.math.BigDecimal # Decimal # java.lang.Integer # ... I think the {{List}} type should belong to {{Seq\[\_\]}} pattern, so we should move {{Product}} behind {{Seq\[\_\]}}. May I open a PR for this issue? Failure to save Lists to Parquet Key: SPARK-3365 URL: https://issues.apache.org/jira/browse/SPARK-3365 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Reproduction, same works if type is Seq. 
(props to [~chrisgrier] for finding this) {code} scala case class Test(x: List[String]) defined class Test scala sparkContext.parallelize(Test(List()) :: Nil).saveAsParquetFile(bug) 23:09:51.807 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:99) at parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:92) at parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
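The proposed reordering is easy to motivate with a loose match-ordering analogy in shell (the real fix is reordering cases in ScalaReflection.scala, where {{List}} matches both {{Product}} and {{Seq\[\_\]}}): in both Scala pattern matching and a shell case, the first matching branch wins, so a general pattern placed too early shadows the more specific one below it.

```shell
# Hypothetical classifier: the "buggy" ordering tests the product-like
# pattern first, so a List is misclassified; the "fixed" ordering tests
# the sequence patterns first, mirroring the proposed move of Product
# behind Seq.
classify_buggy() {
  case $1 in
    List*|Tuple*) echo product ;;   # too early: swallows List
    Seq*)         echo seq ;;
  esac
}

classify_fixed() {
  case $1 in
    Seq*|List*)   echo seq ;;       # collection patterns first
    Tuple*)       echo product ;;
  esac
}

echo "buggy: List[String] -> $(classify_buggy 'List[String]')"
echo "fixed: List[String] -> $(classify_fixed 'List[String]')"
```

The misclassified branch is what produces the empty {{struct}} schema (and the downstream divide-by-zero in the Parquet writer) described above.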
[jira] [Updated] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
[ https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-5763: Description: In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinions about this? was: In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinion about this? Sort-based Groupby and Join to resolve skewed data -- Key: SPARK-5763 URL: https://issues.apache.org/jira/browse/SPARK-5763 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinions about this? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
[ https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-5763: Description: In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinion about this? was:In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. Sort-based Groupby and Join to resolve skewed data -- Key: SPARK-5763 URL: https://issues.apache.org/jira/browse/SPARK-5763 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinion about this? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5508) [hive context] Unable to query array once saved as parquet
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301349#comment-14301349 ] Ayoub Benali edited comment on SPARK-5508 at 2/12/15 9:53 AM: -- I narrowed down the problem: the issue seems to come from the {{insert into table persisted_table select * from tmp_table}} command. Using Scala code, saving the SchemaRDD as a Parquet file, reloading it, saving it in the Hive metastore, and querying the array-typed column works just fine. So when I do the insertion from tmp_table to persisted_table using HiveContext, the data in the array-typed column seems to be inserted into Parquet in the wrong way, which breaks the query afterwards. I tried the Spark SQL CLI and the insertIntoTable method to do the insertion as well, but both led to the same issue when querying the table. was (Author: ayoub): I narrowed down the problem: the issue seems to come from the {{insert into table persisted_table select * from tmp_table}} command. Using Scala code, saving the SchemaRDD as a Parquet file, reloading it, saving it in the Hive metastore, and querying the array-typed column works just fine. So when I do the insertion from tmp_table to persisted_table using HiveContext, the data in the array-typed column seems to be inserted into Parquet in the wrong way, which breaks the query afterwards. I tried the Spark SQL CLI to do the insertion as well, but it led to the same issue when querying the table. 
[hive context] Unable to query array once saved as parquet -- Key: SPARK-5508 URL: https://issues.apache.org/jira/browse/SPARK-5508 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: mesos, cdh Reporter: Ayoub Benali Labels: hivecontext, parquet When the table is saved as Parquet, we cannot query a field which is an array of structs, as shown below:
{noformat}
scala> val data1 = """{
     | "timestamp": 1422435598,
     | "data_array": [
     | {
     | "field1": 1,
     | "field2": 2
     | }
     | ]
     | }"""
scala> val data2 = """{
     | "timestamp": 1422435598,
     | "data_array": [
     | {
     | "field1": 3,
     | "field2": 4
     | }
     | ]
     | }"""
scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val rdd = hiveContext.jsonRDD(jsonRDD)
scala> rdd.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
 |-- timestamp: integer (nullable = true)
scala> rdd.registerTempTable("tmp_table")
scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
scala> hiveContext.sql("set parquet.compression=GZIP")
scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
scala> hiveContext.sql("create external table if not exists persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
scala> hiveContext.sql("insert into table persisted_table select * from tmp_table").collect
scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://*/test_table/part-1 at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at
[jira] [Updated] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map
[ https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5739: - Component/s: MLlib Priority: Minor (was: Major) Size exceeds Integer.MAX_VALUE in File Map -- Key: SPARK-5739 URL: https://issues.apache.org/jira/browse/SPARK-5739 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1 Environment: Spark 1.1.1 on a cluster with 12 nodes. Every node has 128GB RAM and 24 cores. The data is just 40GB, and there are 48 parallel tasks on a node. Reporter: DjvuLee Priority: Minor I ran the k-means algorithm on randomly generated data, but this problem occurred after some iterations. I tried several times, and the problem is reproducible. Because the data is randomly generated, I wonder whether this is a bug. Or, if random data can lead to a scenario where the size is bigger than Integer.MAX_VALUE, can we check the size before using the file map? 2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.util.SizeEstimator - Failed to check whether UseCompressedOops is set; assuming yes [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598) at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79) at 
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270) at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143) at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126) at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338) at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348) at KMeansDataGenerator$.main(kmeans.scala:105) at KMeansDataGenerator.main(kmeans.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55) at java.lang.reflect.Method.invoke(Method.java:619) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
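The reporter's suggestion above (check the size before memory-mapping) comes from the fact that Java's `FileChannel.map` rejects regions larger than `Integer.MAX_VALUE` (2^31 - 1 bytes), which is exactly the exception in the stack trace. A hedged sketch of a pre-check that splits an oversized block into sub-2GB slices; `chunk_ranges` is a hypothetical helper, not Spark code:

```python
# Java's FileChannel.map cannot map a region larger than Integer.MAX_VALUE
# (2**31 - 1 bytes). Instead of failing mid-job, a caller could split the
# block into slices that each fit under the limit.
INT_MAX = 2**31 - 1

def chunk_ranges(size_bytes, limit=INT_MAX):
    """Yield (offset, length) pairs covering size_bytes, each <= limit."""
    offset = 0
    while offset < size_bytes:
        length = min(limit, size_bytes - offset)
        yield (offset, length)
        offset += length
```

For a 5GB block this yields three mappable slices instead of one illegal 5GB mapping.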
[jira] [Commented] (SPARK-5766) Slow RowMatrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-5766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317961#comment-14317961 ] Sean Owen commented on SPARK-5766: -- Given that RowMatrix is a row-by-row representation, it would have to be vector-matrix multiplication I think. In principle I think you can leverage native code here; the only question is whether it overcomes the overhead of the call for typical inputs, but it's possible. Slow RowMatrix multiplication - Key: SPARK-5766 URL: https://issues.apache.org/jira/browse/SPARK-5766 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Amaru Cuba Gyllensten Priority: Minor Labels: matrix Looking at the source code for RowMatrix multiplication by a local matrix, it seems like it is going through all column vectors of the matrix, doing a pairwise dot product on each column. It seems like this could be sped up by using gemm, performing full matrix-matrix multiplication on the local data (or gemv, for vector-matrix multiplication), as is done in BlockMatrix or Matrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
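The proposed change is behavior-preserving: multiplying column by column and performing one full matrix-matrix product give the same result, so the only difference is how many calls reach native BLAS. A pure-Python sketch of that equivalence (illustrative only; the real speedup comes from replacing the per-column path with a single native gemm call):

```python
def matvec(rows, col):
    """Dot each row of `rows` with the vector `col` (one per-column pass)."""
    return [sum(r * c for r, c in zip(row, col)) for row in rows]

def columnwise(rows, local):
    """Per-column products, as the issue says RowMatrix currently does."""
    ncols = len(local[0])
    cols = [[local[i][j] for i in range(len(local))] for j in range(ncols)]
    out_cols = [matvec(rows, col) for col in cols]
    return [list(t) for t in zip(*out_cols)]  # transpose back to row-major

def gemm(rows, local):
    """One full matrix-matrix multiply (what BLAS gemm would do natively)."""
    ncols = len(local[0])
    return [[sum(row[k] * local[k][j] for k in range(len(local)))
             for j in range(ncols)] for row in rows]
```

Both paths produce identical output, so switching to gemm only changes where the work happens, not the answer.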
[jira] [Updated] (SPARK-5644) Delete tmp dir when sc is stop
[ https://issues.apache.org/jira/browse/SPARK-5644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5644: - Assignee: Weizhong Delete tmp dir when sc is stop -- Key: SPARK-5644 URL: https://issues.apache.org/jira/browse/SPARK-5644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Assignee: Weizhong Priority: Minor Fix For: 1.4.0 We run the driver as a long-lived service that never exits. In this service process we create a SparkContext, run a job, and then stop the context. Because we only call sc.stop() and never exit the service process, the tmp dirs created by HttpFileServer and SparkEnv are not deleted after the SparkContext is stopped. This leads to too many tmp dirs accumulating when we create many SparkContexts to run jobs in this service process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
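The fix direction, tracking temp directories and deleting them when the context is stopped, can be sketched generically; the `Service` class and method names here are illustrative, not Spark's actual code:

```python
import shutil
import tempfile

class Service:
    """Illustrative long-lived process: each context start registers a
    temp dir, and stop() removes them, so repeated start/stop cycles in
    the same process do not accumulate directories."""

    def __init__(self):
        self._tmp_dirs = []

    def start_context(self):
        # Stand-in for the dirs HttpFileServer/SparkEnv would create.
        d = tempfile.mkdtemp(prefix="spark-")
        self._tmp_dirs.append(d)
        return d

    def stop(self):
        # Clean up everything this context created, even though the
        # surrounding process keeps running.
        for d in self._tmp_dirs:
            shutil.rmtree(d, ignore_errors=True)
        self._tmp_dirs.clear()
```

The key point is that cleanup is tied to stop(), not to process exit.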
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318021#comment-14318021 ] Manoj Kumar commented on SPARK-5436: Hi, I would like to give this a go. [~ChrisT] are you still working on this? Otherwise I would love to carry this forward. Validate GradientBoostedTrees during training - Key: SPARK-5436 URL: https://issues.apache.org/jira/browse/SPARK-5436 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley For Gradient Boosting, it would be valuable to compute test error on a separate validation set during training. That way, training could stop early based on the test error (or some other metric specified by the user). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
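The validation-based early stopping the issue asks for follows a standard pattern; this generic sketch (not MLlib's API — `boost_with_validation` and `errors` are hypothetical) stops boosting once validation error fails to improve by a tolerance:

```python
def boost_with_validation(errors, tol, max_iters):
    """Illustrative early stopping for boosting. `errors(i)` returns the
    validation error after adding the (i+1)-th tree. Stop when the
    improvement over the best error so far falls below `tol`."""
    best = float("inf")
    for i in range(max_iters):
        err = errors(i)
        if best - err < tol:
            return i  # stop early: no meaningful improvement on validation
        best = err
    return max_iters
```

The same loop supports any user-specified metric, as the issue suggests, by changing what `errors` computes.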
[jira] [Commented] (SPARK-3365) Failure to save Lists to Parquet
[ https://issues.apache.org/jira/browse/SPARK-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317949#comment-14317949 ] Cheng Lian commented on SPARK-3365: --- Hey [~tianyi], please open a PR for this. However, I'd suggest adding a {{List[_]}} clause before {{Product}}, rather than moving the latter. Thanks! Failure to save Lists to Parquet Key: SPARK-3365 URL: https://issues.apache.org/jira/browse/SPARK-3365 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Reproduction, same works if type is Seq. (props to [~chrisgrier] for finding this) {code} scala> case class Test(x: List[String]) defined class Test scala> sparkContext.parallelize(Test(List()) :: Nil).saveAsParquetFile("bug") 23:09:51.807 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:99) at parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:92) at parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318088#comment-14318088 ] Al M commented on SPARK-5768: - So when it says *Memory Used* 3.2GB / 20GB, it actually means we are using 3.2GB of memory for caching out of a total of 20GB available for caching? Calling the column 'Storage Memory' would be clearer to me. If changing the column heading is not an option, then a tooltip explaining that it refers to memory used for storage would help. I'd find it pretty useful to have another column that shows my total memory usage. Right now I can only see this by running 'free' or 'top' on every machine, or by looking at the Yarn UI. Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5768: - Component/s: (was: YARN) Web UI Issue Type: Improvement (was: Bug) It sounds like you are looking at the memory available for caching, which is ~0.54 (0.6*0.9) of the total. It's correct in that sense, but this is a common misconception. I suggest this ticket track a small UI change to make this clearer. What do you think would be clearer? Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
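Sean's ~0.54 figure is the product of the Spark 1.x defaults spark.storage.memoryFraction (0.6) and spark.storage.safetyFraction (0.9). A quick check against the reporter's numbers, treating '40g' as the executor heap (a ballpark sketch only — the exact UI value also depends on JVM overhead):

```python
def storage_memory_gb(executor_memory_gb,
                      memory_fraction=0.6, safety_fraction=0.9):
    """Approximate memory the 1.x UI reports as available for caching:
    heap * spark.storage.memoryFraction * spark.storage.safetyFraction."""
    return executor_memory_gb * memory_fraction * safety_fraction

# 40g executor heap -> roughly the ~20GB the reporter sees in the UI
approx = storage_memory_gb(40)
```

So a 40g executor legitimately shows about half that as "Memory Used" capacity, which is the misconception the ticket tracks.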
[jira] [Created] (SPARK-5769) Set params in constructor and setParams() in Python ML pipeline API
Xiangrui Meng created SPARK-5769: Summary: Set params in constructor and setParams() in Python ML pipeline API Key: SPARK-5769 URL: https://issues.apache.org/jira/browse/SPARK-5769 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng As discussed in the design doc of SPARK-4586, we want to make Python users happy (no setters/getters) while keeping a low maintenance cost by forcing keyword arguments in the constructor and in setParams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
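The convention described above, forcing keyword arguments in both the constructor and setParams, can be enforced directly with Python's keyword-only parameters. A generic sketch (hypothetical `Estimator` class and param names, not the actual pipeline API):

```python
class Estimator:
    """Illustrative: params are settable only via keyword arguments, both
    in the constructor and in setParams, so positional mixups are
    impossible and no per-param setters/getters are needed."""

    def __init__(self, *, maxIter=100, regParam=0.0):
        # The bare `*` makes every parameter keyword-only.
        self.setParams(maxIter=maxIter, regParam=regParam)

    def setParams(self, *, maxIter=None, regParam=None):
        if maxIter is not None:
            self.maxIter = maxIter
        if regParam is not None:
            self.regParam = regParam
        return self  # allow chaining
```

Calling `Estimator(10)` raises a TypeError, which is exactly the "forcing" behavior the design doc wants.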
[jira] [Commented] (SPARK-5769) Set params in constructor and setParams() in Python ML pipeline API
[ https://issues.apache.org/jira/browse/SPARK-5769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318005#comment-14318005 ] Apache Spark commented on SPARK-5769: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4564 Set params in constructor and setParams() in Python ML pipeline API --- Key: SPARK-5769 URL: https://issues.apache.org/jira/browse/SPARK-5769 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng As discussed in the design doc of SPARK-4586, we want to make Python users happy (no setters/getters) while keeping a low maintenance cost by forcing keyword arguments in the constructor and in setParams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
meiyoula created SPARK-5770: --- Summary: Use addJar() to upload a new jar file to executor, it can't be added to classloader Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar's content and upload it again. We can see that the jar file on the local disk has been updated, but the classloader still loads the old one. The executor log shows no error or exception pointing to this. I used spark-shell to test it, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318027#comment-14318027 ] Apache Spark commented on SPARK-5770: - User 'XuTingjun' has created a pull request for this issue: https://github.com/apache/spark/pull/4565 Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar's content and upload it again. We can see that the jar file on the local disk has been updated, but the classloader still loads the old one. The executor log shows no error or exception pointing to this. I used spark-shell to test it, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
Al M created SPARK-5768: --- Summary: Spark UI Shows incorrect memory under Yarn Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.1, 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result
[ https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317976#comment-14317976 ] Apache Spark commented on SPARK-4553: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4563 query for parquet table with string fields in spark sql hive get binary result -- Key: SPARK-4553 URL: https://issues.apache.org/jira/browse/SPARK-4553 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Cheng Lian Priority: Blocker Run: create table test_parquet(key int, value string) stored as parquet; insert into table test_parquet select * from src; select * from test_parquet; The result looks like the following (raw byte arrays instead of strings): ... 282 [B@38fda3b 138 [B@1407a24 238 [B@12de6fb 419 [B@6c97695 15 [B@4885067 118 [B@156a8d3 72 [B@65d20dd 90 [B@4c18906 307 [B@60b24cc 19 [B@59cf51b 435 [B@39fdf37 10 [B@4f799d7 277 [B@3950951 273 [B@596bf4b 306 [B@3e91557 224 [B@3781d61 309 [B@2d0d128 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5767) Migrate Parquet data source to the write support of data source API
[ https://issues.apache.org/jira/browse/SPARK-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317977#comment-14317977 ] Apache Spark commented on SPARK-5767: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4563 Migrate Parquet data source to the write support of data source API --- Key: SPARK-5767 URL: https://issues.apache.org/jira/browse/SPARK-5767 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Migrate to the newly introduced data source write support API (SPARK-5658). Add support for overwriting and appending to existing tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5766) Slow RowMatrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-5766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318013#comment-14318013 ] Amaru Cuba Gyllensten commented on SPARK-5766: -- Yeah, I noticed it when multiplying a 10,000 by 2000 IndexedRowMatrix with its transpose (represented as a local matrix), and doing some reductions on the rows. Running on my local machine, the multiplication in spark took about 7 times longer than an implementation where the left hand matrix was chunked and each chunk (consisting of ~1000 rows) was multiplied with gemm (or similar). This might be an unfair comparison, as it kinda requires the rows to be stored locally as dense matrices. (A use case which might be covered by the upcoming BlockMatrix?) Slow RowMatrix multiplication - Key: SPARK-5766 URL: https://issues.apache.org/jira/browse/SPARK-5766 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Amaru Cuba Gyllensten Priority: Minor Labels: matrix Looking at the source code for RowMatrix multiplication by a local matrix, it seems like it is going through all column vectors of the matrix, doing a pairwise dot product on each column. It seems like this could be sped up by using gemm, performing full matrix-matrix multiplication on the local data (or gemv, for vector-matrix multiplication), as is done in BlockMatrix or Matrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5786) Documentation of Narrow Dependencies
Imran Rashid created SPARK-5786: --- Summary: Documentation of Narrow Dependencies Key: SPARK-5786 URL: https://issues.apache.org/jira/browse/SPARK-5786 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Imran Rashid Narrow dependencies can really improve job performance by skipping shuffles entirely. However, aside from being mentioned in some early papers and during some meetups, they aren't explained (or even mentioned) in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5759) ExecutorRunnable should catch YarnException while NMClient start container
[ https://issues.apache.org/jira/browse/SPARK-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5759: - Affects Version/s: 1.2.0 ExecutorRunnable should catch YarnException while NMClient start container -- Key: SPARK-5759 URL: https://issues.apache.org/jira/browse/SPARK-5759 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Lianhui Wang Sometimes, for various reasons, an exception occurs while NMClient starts a container. For example, if spark_shuffle is not configured on some machines, it throws: java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist. Because YarnAllocator uses a ThreadPoolExecutor to start containers, we cannot tell which container or hostname threw the exception. I think we should catch YarnException in ExecutorRunnable when starting the container. Then, when an exception occurs, we know the container id and hostname of the failed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
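The proposed fix amounts to catching the exception at the point where the container's identity is known and re-raising it with that context attached, instead of letting it surface anonymously from a thread pool. A generic Python sketch (hypothetical names; the real fix would catch YarnException inside ExecutorRunnable):

```python
def start_container(container_id, hostname, launch):
    """Illustrative: wrap the launch call so any failure carries the
    container id and hostname, rather than surfacing anonymously from
    a worker thread in a pool."""
    try:
        launch()
    except Exception as exc:
        raise RuntimeError(
            f"Failed to start container {container_id} on {hostname}: {exc}"
        ) from exc
```

With this shape, the log line for a failure immediately identifies which container on which host misfired.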
[jira] [Created] (SPARK-5787) Protect JVM from some not-important exceptions
Davies Liu created SPARK-5787: - Summary: Protect JVM from some not-important exceptions Key: SPARK-5787 URL: https://issues.apache.org/jira/browse/SPARK-5787 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Critical Any uncaught exception will shut down the executor JVM, so we should catch all exceptions that do not really hurt the executor (i.e., the executor is still functional). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5765: - Assignee: Venkata Ramana G word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.2 Reporter: Venkata Ramana G Assignee: Venkata Ramana G Word split problem in the Spark directory path in the scripts run-example and compute-classpath.sh. This was introduced by the fix for SPARK-4504. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2774) Set preferred locations for reduce tasks
[ https://issues.apache.org/jira/browse/SPARK-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319056#comment-14319056 ] Apache Spark commented on SPARK-2774: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/4576 Set preferred locations for reduce tasks Key: SPARK-2774 URL: https://issues.apache.org/jira/browse/SPARK-2774 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Currently we do not set preferred locations for reduce tasks in Spark. This patch proposes setting preferred locations based on the map output sizes and locations tracked by the MapOutputTracker. This is useful in two conditions 1. When you have a small job in a large cluster it can be useful to co-locate map and reduce tasks to avoid going over the network 2. If there is a lot of data skew in the map stage outputs, then it is beneficial to place the reducer close to the largest output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
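The policy described above, placing reducers close to the largest map outputs, boils down to ranking hosts by how many bytes of a reduce partition's input they hold. A hedged sketch of that selection logic (hypothetical function, not the MapOutputTracker API):

```python
def preferred_locations(bytes_by_host, top_n=2):
    """Illustrative: given map-output sizes per host for one reduce
    partition, prefer the hosts holding the most data, so skewed
    partitions are scheduled next to their largest input."""
    ranked = sorted(bytes_by_host.items(), key=lambda kv: kv[1], reverse=True)
    return [host for host, _ in ranked[:top_n]]
```

This naturally covers both conditions in the issue: in a small job all the data sits on a few hosts, and under skew the single largest output dominates the ranking.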
[jira] [Closed] (SPARK-5760) StandaloneRestClient/Server error behavior is incorrect
[ https://issues.apache.org/jira/browse/SPARK-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5760. Resolution: Fixed Fix Version/s: 1.3.0 StandaloneRestClient/Server error behavior is incorrect --- Key: SPARK-5760 URL: https://issues.apache.org/jira/browse/SPARK-5760 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.3.0 There are three main known issues: (1) Server would always send the JSON to the servlet's output stream. However, if the response code is not 200, the client reads from the error stream instead. The server must write to the correct stream depending on the response code. (2) If the server returns an empty response (no JSON), then both output and error streams are null at the client, leading to NPEs. This happens if the server throws an internal exception that it cannot recover from. (3) The default error handling servlet did not match the URL cases correctly, because there are empty strings in the list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
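Issue (1) above is the usual HTTP-client rule: a 2xx response is read from the normal stream, anything else from the error stream (mirroring Java's HttpURLConnection behavior). A minimal sketch of the selection logic, not the actual StandaloneRestClient code:

```python
def pick_stream(status_code, output_stream, error_stream):
    """Illustrative: the server must write JSON to whichever stream the
    client will read, which is determined by the response code."""
    if 200 <= status_code < 300:
        return output_stream
    return error_stream
```

Issue (2) then follows: if the server writes nothing at all, both streams are empty and the client must guard against that before parsing.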
[jira] [Commented] (SPARK-5726) Hadamard Vector Product Transformer
[ https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319182#comment-14319182 ] Sean Owen commented on SPARK-5726: -- You can ignore this comment, but I wonder if it would be even more immediately recognized if called ElementwiseProduct or something like that, as that's all the Hadamard product is right? Hadamard Vector Product Transformer --- Key: SPARK-5726 URL: https://issues.apache.org/jira/browse/SPARK-5726 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Octavian Geagla Assignee: Octavian Geagla I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html A draft of this feature is implemented, documented, and tested already. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting I'm curious if there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there are not any examples which use Vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
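As the comment notes, the Hadamard product is just the element-wise product of two equal-length vectors. A minimal sketch of such a transformer (generic Python, not the proposed MLlib code):

```python
def elementwise_product(weights, vector):
    """Illustrative Hadamard (element-wise) product transformer: scale
    each component of `vector` by the matching weight."""
    if len(weights) != len(vector):
        raise ValueError("weight and input vector lengths must match")
    return [w * x for w, x in zip(weights, vector)]
```

A zero weight masks a component and a weight of one passes it through, which is the component-wise scaling use case from the original thread.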
[jira] [Commented] (SPARK-3570) Shuffle write time does not include time to open shuffle files
[ https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318960#comment-14318960 ] Kay Ousterhout commented on SPARK-3570: --- There are a bunch of times when files are opened and written to that are not currently logged, so I did some investigation of this to figure out when the time may be significant and is therefore worth logging. I did this on a 5-machine cluster using ext3 (which exacerbates disk access issues, making it easy to see when times may be long) running query 3a of the big data benchmark (which struggles with disk access because of the many shuffle files). Here's what I found: SortShuffleWriter.write, call to shuffleBlockManager.getDataFile: this just opens 1 file, and typically takes about 100us, so not worth adding logging SortShuffleWriter.write, call to shuffleBlockManager.getIndexFile: this writes a single index file and typically took about 0.1ms (as high as 1ms). Also doesn't seem worth logging. ExternalSorter.spillToPartitionFiles, creating the disk writers for each partition: because this creates one file for each partition, the time to create all of the files adds up: this took 75-100ms ExternalSorter.writePartitionedFile, copying the data from the partitioned files to a single file: because this reads and writes all of the shuffle data, it can be long; ~13ms on the workload I looked at. ExternalSorter.writePartitionedFile, time to call blockManager.getDiskWriter on line 748: getDiskWriter *CAN* take a long time because of the call to file.length(), which may hit disk. However, in this case, each call takes 20us or less (and this is likely noisy -- getting too small to measure reliably). To totally speculate, I'd guess that because this is called many times on the same file, as opposed to different files, and the file is actively being written to, the length is cached in memory by the OS. 
To summarize, this all leads to the intuitive conclusion that we only need to log when we're writing lots of data (e.g., when copying all of the shuffle data to a single file) or when we're opening a lot of files. Shuffle write time does not include time to open shuffle files -- Key: SPARK-3570 URL: https://issues.apache.org/jira/browse/SPARK-3570 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.2, 1.1.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Attachments: 3a_1410854905_0_job_log_waterfall.pdf, 3a_1410957857_0_job_log_waterfall.pdf Currently, the reported shuffle write time does not include time to open the shuffle files. This time can be very significant when the disk is highly utilized and many shuffle files exist on the machine (I'm not sure how severe this is in 1.0 onward -- since shuffle files are automatically deleted, this may be less of an issue because there are fewer old files sitting around). In experiments I did, in extreme cases, adding the time to open files can increase the shuffle write time from 5ms (of a 2 second task) to 1 second. We should fix this for better performance debugging. Thanks [~shivaram] for helping to diagnose this problem. cc [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5785) Pyspark does not support narrow dependencies
Imran Rashid created SPARK-5785: --- Summary: Pyspark does not support narrow dependencies Key: SPARK-5785 URL: https://issues.apache.org/jira/browse/SPARK-5785 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Imran Rashid Joins (cogroups, etc.) are always considered to have wide dependencies in PySpark; they are never narrow. This can cause unnecessary shuffles. E.g., this simple job should shuffle rddA and rddB once each, but it will also do a third shuffle of the unioned data: {code} rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) joined = rddA.join(rddB) joined.count() rddA._partitionFunc == rddB._partitionFunc True {code} (Or the docs should somewhere explain that this feature is missing from spark.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5788) Capture exceptions in Python write thread
[ https://issues.apache.org/jira/browse/SPARK-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319135#comment-14319135 ] Apache Spark commented on SPARK-5788: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4577 Capture exceptions in Python write thread -- Key: SPARK-5788 URL: https://issues.apache.org/jira/browse/SPARK-5788 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0, 1.2.1 Reporter: Davies Liu Priority: Blocker An exception in the Python writer thread will shut down the executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
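The fix amounts to wrapping the writer thread's body so exceptions are recorded and handled by the owner, rather than escaping the thread and taking the executor down with them. A generic Python sketch of that capture pattern (illustrative names, not the actual PySpark code):

```python
import threading

class WriterThread(threading.Thread):
    """Illustrative: catch any exception raised in run() and store it so
    the owning code can inspect and handle it, instead of letting the
    failure propagate out of the thread uncontrolled."""

    def __init__(self, target):
        super().__init__(daemon=True)
        self._target_fn = target
        self.exception = None  # populated if the target raises

    def run(self):
        try:
            self._target_fn()
        except Exception as exc:  # deliberate broad catch: record, don't die
            self.exception = exc
```

After join(), the owner checks `thread.exception` and decides how to fail the task, keeping the surrounding process alive.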
[jira] [Updated] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5765: - Affects Version/s: 1.3.0 1.2.1 word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Assignee: Venkata Ramana G Fix For: 1.3.0, 1.1.2, 1.2.2 Word-split problem in the Spark directory path in the scripts run-example and compute-classpath.sh. This was introduced by the defect fix for SPARK-4504. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5765: - Affects Version/s: (was: 1.2.2) (was: 1.3.0) word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Assignee: Venkata Ramana G Fix For: 1.3.0, 1.1.2, 1.2.2 Word-split problem in the Spark directory path in the scripts run-example and compute-classpath.sh. This was introduced by the defect fix for SPARK-4504. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5762. Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Target Version/s: 1.3.0, 1.2.2 (was: 1.3.0) Shuffle write time is incorrect for sort-based shuffle -- Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.3.0, 1.2.2 For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
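The bug above is about which work gets timed. A minimal Python sketch (an analogy for the bypass-merge-sort path, not Spark's actual Scala code) makes the point: the final concatenation is also shuffle write work, so its duration must be added to the write-time metric alongside the per-partition writes:

```python
# Hypothetical sketch: write one file per partition, then concatenate them
# into a final file -- and include BOTH phases in the measured write time.
import os
import tempfile
import time

def write_partitions_and_concat(partitions):
    tmp = tempfile.mkdtemp()
    write_time = 0.0
    part_files = []
    for i, data in enumerate(partitions):
        start = time.perf_counter()
        path = os.path.join(tmp, "part-%d" % i)
        with open(path, "wb") as f:
            f.write(data)
        write_time += time.perf_counter() - start
        part_files.append(path)
    # The concatenation below is also shuffle write work: time it too.
    # Omitting this timer is precisely the bug described in SPARK-5762.
    start = time.perf_counter()
    final = os.path.join(tmp, "final")
    with open(final, "wb") as out:
        for path in part_files:
            with open(path, "rb") as f:
                out.write(f.read())
    write_time += time.perf_counter() - start
    return final, write_time

final_file, total_write_time = write_partitions_and_concat([b"aa", b"bb", b"cc"])
```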
[jira] [Closed] (SPARK-5780) The loggings of Python unittests are noisy and scaring in
[ https://issues.apache.org/jira/browse/SPARK-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5780. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 (was: 1.4.0) The loggings of Python unittests are noisy and scaring in -- Key: SPARK-5780 URL: https://issues.apache.org/jira/browse/SPARK-5780 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Davies Liu Fix For: 1.3.0 There is a bunch of logging coming from the driver and worker; it's noisy and scary, and there are a lot of exceptions in it, so people are confused about whether the tests are failing or not. We should mute the logging during tests and only show it if any test failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5780) The loggings of Python unittests are noisy and scaring in
[ https://issues.apache.org/jira/browse/SPARK-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5780: - Affects Version/s: (was: 1.4.0) The loggings of Python unittests are noisy and scaring in -- Key: SPARK-5780 URL: https://issues.apache.org/jira/browse/SPARK-5780 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Davies Liu Fix For: 1.3.0 There is a bunch of logging coming from the driver and worker; it's noisy and scary, and there are a lot of exceptions in it, so people are confused about whether the tests are failing or not. We should mute the logging during tests and only show it if any test failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
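The behavior the ticket asks for — mute logs during tests, surface them only on failure — can be sketched with the standard `logging` module. This is a hypothetical illustration of the pattern, not the code from the Spark PR:

```python
# Hypothetical sketch: buffer a logger's output while a test runs, discard
# it if the test passes, and return it for display if the test fails.
import io
import logging

def run_quietly(test_func, logger):
    """Run test_func with logging captured; dump logs only on failure."""
    buf = io.StringIO()
    handler = logging.StreamHandler(buf)
    old_handlers = logger.handlers
    logger.handlers = [handler]
    try:
        test_func()
        return True, ""  # passed: discard the captured noise
    except AssertionError:
        return False, buf.getvalue()  # failed: surface the logs
    finally:
        logger.handlers = old_handlers

log = logging.getLogger("demo")
log.setLevel(logging.INFO)

def passing_test():
    log.info("noisy but harmless")

def failing_test():
    log.info("useful context")
    assert False

ok1, out1 = run_quietly(passing_test, log)
ok2, out2 = run_quietly(failing_test, log)
```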
[jira] [Resolved] (SPARK-1192) Around 30 parameters in Spark are used but undocumented and some are having confusing name
[ https://issues.apache.org/jira/browse/SPARK-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1192. -- Resolution: Won't Fix PR was withdrawn; this probably deserves a rethink if it were reconsidered anyway, so let's resolve. Around 30 parameters in Spark are used but undocumented and some are having confusing name -- Key: SPARK-1192 URL: https://issues.apache.org/jira/browse/SPARK-1192 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu I grepped the code in the core component and found that around 30 parameters in the implementation are actually used but undocumented. By reading the source code, I found that some of them are actually very useful for the user. I suggest making a complete document on the parameters. Also, some parameters have confusing names: spark.shuffle.copier.threads - this parameter controls how many threads are used when starting a Netty-based shuffle service, but we cannot get this information from the name; spark.shuffle.sender.port - a similar problem to the above: when you use a Netty-based shuffle receiver, you also have to set up a Netty-based sender... this parameter sets the port used by the Netty sender, but the name does not convey this information -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5690) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion
[ https://issues.apache.org/jira/browse/SPARK-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5690. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion - Key: SPARK-5690 URL: https://issues.apache.org/jira/browse/SPARK-5690 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Andrew Or Priority: Critical Labels: flaky-test Fix For: 1.3.0 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/ {code} org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion Failing for the past 1 build (Since Failed#1647 ) Took 30 sec. Error Message Driver driver-20150209035158- did not finish within 30 seconds. Stacktrace sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish within 30 seconds. 
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$class.fail(Assertions.scala:1328) at org.scalatest.FunSuite.fail(FunSuite.scala:1555) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at
[jira] [Closed] (SPARK-5761) Revamp StandaloneRestProtocolSuite
[ https://issues.apache.org/jira/browse/SPARK-5761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5761. Resolution: Fixed Fix Version/s: 1.3.0 Revamp StandaloneRestProtocolSuite -- Key: SPARK-5761 URL: https://issues.apache.org/jira/browse/SPARK-5761 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.3.0 It currently runs an end-to-end test, which is both slow and reported as flaky here: SPARK-5690. We should make it test the individual components more closely and make it more like a unit test suite instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318935#comment-14318935 ] Ryan Ovas commented on SPARK-4897: -- I'm interested in using Spark in my startup, but everything we do is in Python 3.4 which makes adopting Spark difficult for me as well. I was surprised and disappointed (since I will have trouble using it myself) to see that there is no Python 3.x support when (as [~ianozsvald] suggested) the community as a whole is moving towards Python 3.4. Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type help, copyright, credits or license for more information. 
Traceback (most recent call last): File /Users/joshrosen/Documents/Spark/python/pyspark/shell.py, line 28, in module import pyspark File /Users/joshrosen/Documents/spark/python/pyspark/__init__.py, line 41, in module from pyspark.context import SparkContext File /Users/joshrosen/Documents/spark/python/pyspark/context.py, line 26, in module from pyspark import accumulators File /Users/joshrosen/Documents/spark/python/pyspark/accumulators.py, line 97, in module from pyspark.cloudpickle import CloudPickler File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 120, in module class CloudPickler(pickle.Pickler): File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
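The `AttributeError` above has a narrow cause worth illustrating: in Python 3 the name `pickle.Pickler` is usually the C-accelerated `_pickle.Pickler`, which has no class-level `dispatch` table, while the pure-Python `pickle._Pickler` still does. One possible workaround (a hedged sketch, not PySpark's actual fix; `SketchPickler` is a hypothetical name) is to subclass the pure-Python implementation:

```python
# Hypothetical sketch: the pure-Python Pickler keeps the per-type dispatch
# dict that cloudpickle-style code copies, so subclass it instead of the
# C-accelerated pickle.Pickler.
import io
import pickle

class SketchPickler(pickle._Pickler):
    # Copying the dispatch table works against the pure-Python class,
    # where the copy() in the traceback above fails on _pickle.Pickler.
    dispatch = pickle._Pickler.dispatch.copy()

buf = io.BytesIO()
SketchPickler(buf).dump([1, 2])
roundtrip = pickle.loads(buf.getvalue())
```

The trade-off is speed: `pickle._Pickler` is considerably slower than the C implementation, which is one reason the ticket also floats switching to Dill.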
[jira] [Commented] (SPARK-5784) Add StatsDSink to MetricsSystem
[ https://issues.apache.org/jira/browse/SPARK-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318993#comment-14318993 ] Apache Spark commented on SPARK-5784: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4574 Add StatsDSink to MetricsSystem --- Key: SPARK-5784 URL: https://issues.apache.org/jira/browse/SPARK-5784 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor [StatsD|https://github.com/etsy/statsd/] is a common wrapper for Graphite; it would be useful to support sending metrics to StatsD in addition to [the existing Graphite support|https://github.com/apache/spark/blob/6a1be026cf37e4c8bf39133dfb4a73f7caedcc26/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala]. [readytalk/metrics-statsd|https://github.com/readytalk/metrics-statsd] is a StatsD adapter for the [dropwizard/metrics|https://github.com/dropwizard/metrics] library that Spark uses. The Maven repository at http://dl.bintray.com/readytalk/maven/ serves {{metrics-statsd}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5726) Hadamard Vector Product Transformer
[ https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5726: - Assignee: Octavian Geagla Hadamard Vector Product Transformer --- Key: SPARK-5726 URL: https://issues.apache.org/jira/browse/SPARK-5726 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Octavian Geagla Assignee: Octavian Geagla I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html A draft of this feature is implemented, documented, and tested already. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting I'm curious if there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there are not any examples which use Vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5726) Hadamard Vector Product Transformer
[ https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319163#comment-14319163 ] Xiangrui Meng commented on SPARK-5726: -- This is a nice feature. I like the name `HadamardProduct` better than `HadamardScaler`. The former is more explicit, though it always reminds me of the Hadamard transform. Could you submit a PR? Hadamard Vector Product Transformer --- Key: SPARK-5726 URL: https://issues.apache.org/jira/browse/SPARK-5726 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Octavian Geagla I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html A draft of this feature is implemented, documented, and tested already. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting I'm curious if there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there are not any examples which use Vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
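The transformer's core operation is simple enough to sketch in a few lines. The names below are illustrative only, not the final MLlib API: component-wise (Hadamard) multiplication of each input vector by a fixed scaling vector.

```python
# Hypothetical sketch of the proposed transformer's core operation:
# multiply every input vector element-wise by a fixed scaling vector.

def hadamard_transform(scaling_vec, vectors):
    """Scale every vector component-wise by scaling_vec."""
    return [
        [x * w for x, w in zip(v, scaling_vec)]
        for v in vectors
    ]

scaled = hadamard_transform(
    [0.0, 1.0, 2.0],
    [[1.0, 1.0, 1.0], [3.0, 2.0, 1.0]],
)
```

Unlike a `StandardScaler`-style transformer, the weights here are supplied by the user rather than fit from data, which is why the naming debate (`HadamardProduct` vs. `HadamardScaler`) matters.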
[jira] [Updated] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
[ https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Khaitman updated SPARK-5782: - Description: I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. This can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB Python worker memory limit and a few gigs of executor memory. Another related issue is that the individual Python workers are not supposed to exceed far beyond 512MB; otherwise they're supposed to spill to disk. Some of our Python workers are somehow reaching 2GB each (which, when multiplied by the number of cores per executor * the number of joins occurring in some cases, causes the Out-of-Memory killer to step up to its unfortunate job! :( ) I originally thought _next_limit in shuffle.py had an issue, though I initially misread it. The logic looks good there :) My current suspicion is that the 512MB limit is not being checked somewhere. I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code! was: I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. 
This can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB Python worker memory limit and a few gigs of executor memory. Another related issue is that the individual Python workers are not supposed to exceed far beyond 512MB; otherwise they're supposed to spill to disk. I came across this bit of code in shuffle.py which *may* have something to do with allowing some of our Python workers to somehow reach 2GB each (which, when multiplied by the number of cores per executor * the number of joins occurring in some cases, causes the Out-of-Memory killer to step up to its unfortunate job! :( )
{code}
def _next_limit(self):
    """
    Return the next memory limit. If the memory is not released
    after spilling, it will dump the data only when the used memory
    starts to increase.
    """
    return max(self.memory_limit, get_used_memory() * 1.05)
{code}
I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code! Python Worker / Pyspark Daemon Memory Issue --- Key: SPARK-5782 URL: https://issues.apache.org/jira/browse/SPARK-5782 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.3.0, 1.2.1, 1.2.2 Environment: CentOS 7, Spark Standalone Reporter: Mark Khaitman I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. 
This can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB Python worker memory limit and a few gigs of executor memory. Another related issue is that the individual Python workers are not supposed to exceed far beyond 512MB; otherwise they're supposed to spill to disk. Some of our Python workers are somehow reaching 2GB each (which, when multiplied by the number of cores per executor * the number of joins occurring in some cases, causes the Out-of-Memory killer to step up to its unfortunate job! :( ) I originally thought _next_limit in shuffle.py had an issue, though I initially misread it. The logic looks good
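The reporter's suspicion — that the memory limit is not being checked somewhere — corresponds to a well-known pattern. The sketch below is a hypothetical illustration, not PySpark's shuffle.py (`SpillingMerger` and the fake `used_memory_func` are stand-ins): compare process memory against the limit on every insert and spill once it is exceeded, so a worker cannot grow unbounded toward 2GB.

```python
# Hypothetical sketch: a merger that checks a memory limit on every put()
# and spills (here: just clears and counts) once the limit is exceeded.

class SpillingMerger:
    def __init__(self, memory_limit_mb, used_memory_func):
        self.memory_limit = memory_limit_mb
        self.get_used_memory = used_memory_func
        self.data = {}
        self.spills = 0

    def put(self, key, value):
        self.data[key] = value
        if self.get_used_memory() > self.memory_limit:
            self._spill()

    def _spill(self):
        # Real code would write self.data to disk here; we only count.
        self.data.clear()
        self.spills += 1

# Simulated memory usage: a 500MB baseline plus 10MB per buffered entry,
# against a 512MB limit -- so every second put() triggers a spill.
merger = SpillingMerger(512, lambda: 500 + len(merger.data) * 10)
for i in range(10):
    merger.put(i, i)
```

The bug the ticket describes is what happens when a code path skips this check: buffered data keeps growing and the OS OOM killer, not the spill logic, ends up enforcing the limit.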
[jira] [Updated] (SPARK-5783) Include filename, line number in eventlog-parsing error message
[ https://issues.apache.org/jira/browse/SPARK-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5783: - Affects Version/s: (was: 1.2.1) 1.0.0 Include filename, line number in eventlog-parsing error message --- Key: SPARK-5783 URL: https://issues.apache.org/jira/browse/SPARK-5783 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Ryan Williams Priority: Minor While investigating why some recent applications were not showing up in my History Server UI, I found error message blocks like this in the history server logs: {code} 15/02/12 18:51:55 ERROR scheduler.ReplayListenerBus: Exception in parsing Spark event log. java.lang.ClassCastException: org.json4s.JsonAST$JNothing$ cannot be cast to org.json4s.JsonAST$JObject at org.apache.spark.util.JsonProtocol$.mapFromJson(JsonProtocol.scala:814) at org.apache.spark.util.JsonProtocol$.executorInfoFromJson(JsonProtocol.scala:805) ... at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1.run(FsHistoryProvider.scala:84) 15/02/12 18:51:55 ERROR scheduler.ReplayListenerBus: Malformed line: {Event:SparkListenerExecutorAdded,Timestamp:1422897479154,Executor ID:12,Executor Info:{Host:demeter-csmaz11-1.demeter.hpc.mssm.edu,Total Cores:4}} {code} Turns out certain files had some malformed lines due to having been generated by a forked Spark with some WIP event-log functionality. It would be nice if the first line specified the file the error was found in, and the last line specified the line number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
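The requested error reporting is straightforward to sketch. This is a plain-Python analogy, not the Scala `ReplayListenerBus` (`replay_event_log` is a hypothetical name): parse line by line and, on failure, raise an error naming the file and the 1-based line number.

```python
# Hypothetical sketch: include filename and line number when a line of an
# event log fails to parse, instead of only dumping the malformed line.
import json

def replay_event_log(path, lines):
    events = []
    for lineno, line in enumerate(lines, start=1):
        try:
            events.append(json.loads(line))
        except ValueError as e:
            raise ValueError(
                "Malformed line %d in %s: %r" % (lineno, path, line)) from e
    return events

err = None
try:
    replay_event_log("app-1234.log", ['{"Event": "ok"}', "{not json"])
except ValueError as e:
    err = str(e)
```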
[jira] [Commented] (SPARK-5746) INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table.
[ https://issues.apache.org/jira/browse/SPARK-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319014#comment-14319014 ] Yin Huai commented on SPARK-5746: - Here are the places where we need to take care of overwrite: CreateMetastoreDataSourceAsSelect, CreatableRelationProvider.createRelation, and InsertableRelation.insert. INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table. -- Key: SPARK-5746 URL: https://issues.apache.org/jira/browse/SPARK-5746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker With the newly introduced write support of the data source API, {{JSONRelation}} and {{ParquetRelation2}} both suffer from this bug. The root cause is that we removed the source table before insertion ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]). The correct solution is to first insert into a temporary folder, and then overwrite the source table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
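The proposed fix — write to a temporary location first, then replace the destination — is a classic filesystem pattern. The sketch below is a hedged analogy (plain files, not the actual JSONRelation/ParquetRelation2 code; `insert_overwrite` is a hypothetical name) showing why the source can be read safely even when it is the same path being overwritten:

```python
# Hypothetical sketch: INSERT OVERWRITE where source == destination.
# Read the source first, write the result to a temp file in the same
# directory, then atomically replace the destination -- never delete the
# source before the new data is fully written.
import os
import tempfile

def insert_overwrite(dest_path, transform):
    # Read the "source" (which may be dest_path itself) before touching it.
    with open(dest_path) as f:
        source_rows = f.read().splitlines()
    new_rows = transform(source_rows)
    # Write to a temp file in the same directory, then atomically replace.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path))
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(new_rows))
    os.replace(tmp_path, dest_path)

tmp_dir = tempfile.mkdtemp()
table = os.path.join(tmp_dir, "table.txt")
with open(table, "w") as f:
    f.write("a\nb")
insert_overwrite(table, lambda rows: [r.upper() for r in rows])
```

Deleting the destination before writing (the buggy order) would make the read fail with a FileNotFoundError, which is exactly the symptom in the ticket.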
[jira] [Commented] (SPARK-5735) Replace uses of EasyMock with Mockito
[ https://issues.apache.org/jira/browse/SPARK-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319140#comment-14319140 ] Apache Spark commented on SPARK-5735: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4578 Replace uses of EasyMock with Mockito - Key: SPARK-5735 URL: https://issues.apache.org/jira/browse/SPARK-5735 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Josh Rosen There are a few reasons we should drop EasyMock. First, we should have a single mocking framework in our tests in general to keep things consistent. Second, EasyMock has caused us some dependency pain in our tests due to objenesis. We aren't totally sure but suspect such conflicts might be causing non deterministic test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5759) ExecutorRunnable should catch YarnException while NMClient start container
[ https://issues.apache.org/jira/browse/SPARK-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5759. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Lianhui Wang Target Version/s: 1.3.0 ExecutorRunnable should catch YarnException while NMClient start container -- Key: SPARK-5759 URL: https://issues.apache.org/jira/browse/SPARK-5759 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Lianhui Wang Assignee: Lianhui Wang Fix For: 1.3.0 Sometimes, for various reasons, an exception is thrown while NMClient starts a container. For example, if we do not configure spark_shuffle on some machines, it will throw an exception: java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist. Because YarnAllocator uses a ThreadPoolExecutor to start containers, we cannot find which container or hostname threw the exception. I think we should catch YarnException in ExecutorRunnable when starting the container. If there are exceptions, we can then know the container id or hostname of the failed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5335. -- Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Issue resolved by pull request 4122 [https://github.com/apache/spark/pull/4122] Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor Fix For: 1.3.0, 1.2.2 When I try to remove security groups using the --delete-groups option of the script, it fails because in a VPC one should remove security groups by id, not by name as is done now.
{code}
$ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript
Are you sure you want to destroy the cluster SparkByScript?
The following instances will be terminated:
Searching for existing cluster SparkByScript...
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster SparkByScript (y/N): y
Terminating master...
Terminating slaves...
Deleting security groups (this will take some time)...
Waiting for cluster to enter 'terminated' state.
Cluster is now in 'terminated' state. Waited 0 seconds.
Attempt 1
Deleting rules in security group SparkByScript-slaves
Deleting rules in security group SparkByScript-master
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?><Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>60313fac-5d47-48dd-a8bf-e9832948c0a6</RequestID></Response>
Failed to delete security group SparkByScript-slaves
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?><Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>74ff8431-c0c1-4052-9ecb-c0adfa7eeeac</RequestID></Response>
Failed to delete security group SparkByScript-master
Attempt 2
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
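The core of the fix is a small lookup change: in a VPC, resolve each group's name to its GroupId and delete by id. The sketch below shows only that resolution logic with fake data (`group_ids_to_delete` is a hypothetical helper; the real spark-ec2 script would fetch the groups and issue the delete calls via boto):

```python
# Hypothetical sketch: EC2 rejects name-based deletion for VPC security
# groups, so map the cluster's group names to their ids and delete by id.

def group_ids_to_delete(groups, cluster_name):
    """Pick the ids of the cluster's master/slaves groups by name."""
    wanted = {cluster_name + "-master", cluster_name + "-slaves"}
    return [g["id"] for g in groups if g["name"] in wanted]

# Fake listing, shaped like the names in the error log above.
groups = [
    {"name": "SparkByScript-master", "id": "sg-11111111"},
    {"name": "SparkByScript-slaves", "id": "sg-22222222"},
    {"name": "default", "id": "sg-33333333"},
]
ids = group_ids_to_delete(groups, "SparkByScript")
```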
[jira] [Updated] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5335: - Assignee: Vladimir Grigor Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor Assignee: Vladimir Grigor Fix For: 1.3.0, 1.2.2 When I try to remove security groups using the --delete-groups option of the script, it fails: in a VPC, security groups must be deleted by id, whereas the script currently deletes them by name.
{code}
$ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript
Are you sure you want to destroy the cluster SparkByScript?
The following instances will be terminated:
Searching for existing cluster SparkByScript...
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster SparkByScript (y/N): y
Terminating master...
Terminating slaves...
Deleting security groups (this will take some time)...
Waiting for cluster to enter 'terminated' state.
Cluster is now in 'terminated' state. Waited 0 seconds.
Attempt 1
Deleting rules in security group SparkByScript-slaves
Deleting rules in security group SparkByScript-master
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?><Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>60313fac-5d47-48dd-a8bf-e9832948c0a6</RequestID></Response>
Failed to delete security group SparkByScript-slaves
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?><Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>74ff8431-c0c1-4052-9ecb-c0adfa7eeeac</RequestID></Response>
Failed to delete security group SparkByScript-master
Attempt 2
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
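A minimal sketch of the fix the reporter describes: resolve names to ids first and delete by id, since VPC security groups cannot be referenced by name. This is illustrative Python against the boto EC2 API; the function and variable names are hypothetical, not spark-ec2's actual code.

```python
def delete_security_groups_by_id(conn, group_names):
    """Delete security groups by id (required in a VPC), not by name.

    `conn` is assumed to be a boto EC2 connection. Names and structure
    here are illustrative, not spark-ec2's actual implementation.
    """
    # Resolve names to ids first; VPC groups cannot be referenced by name.
    groups = [g for g in conn.get_all_security_groups()
              if g.name in group_names]
    deleted = []
    for group in groups:
        # boto's delete_security_group accepts group_id for VPC groups.
        if conn.delete_security_group(group_id=group.id):
            deleted.append(group.id)
    return deleted
```

The same call with `name=` is what triggers the InvalidParameterValue error shown above.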
[jira] [Updated] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5762: - Target Version/s: 1.3.0 (was: 1.3.0, 1.2.2) Shuffle write time is incorrect for sort-based shuffle -- Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.3.0 For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time.
[jira] [Updated] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5762: - Fix Version/s: (was: 1.2.2) Shuffle write time is incorrect for sort-based shuffle -- Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.3.0 For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time.
[jira] [Commented] (SPARK-5790) VertexRDD's won't zip properly for `diff` capability
[ https://issues.apache.org/jira/browse/SPARK-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319316#comment-14319316 ] Brennon York commented on SPARK-5790: - FWIW this issue is a blocker for [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600], which I'm working on, since `diff` relies on `zipPartitions`, which causes this. If someone could assign this to me, I'll continue working on this issue. VertexRDD's won't zip properly for `diff` capability Key: SPARK-5790 URL: https://issues.apache.org/jira/browse/SPARK-5790 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York For VertexRDDs with differing partition sizes, one cannot run operations like `diff`, as it will throw an IllegalArgumentException. The code below provides an example:
{code}
import org.apache.spark.graphx._
import org.apache.spark.rdd._

val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 3L).map(id => (id, id.toInt+1)))
setA.collect.foreach(println(_))
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)))
setB.collect.foreach(println(_))
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
val setC: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)) ++ sc.parallelize(6L until 8L).map(id => (id, id.toInt+2)))
setA.diff(setC).collect // java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
{code}
[jira] [Updated] (SPARK-5790) VertexRDD's won't zip properly for `diff` capability
[ https://issues.apache.org/jira/browse/SPARK-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5790: - Assignee: Brennon York VertexRDD's won't zip properly for `diff` capability Key: SPARK-5790 URL: https://issues.apache.org/jira/browse/SPARK-5790 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York Assignee: Brennon York For VertexRDDs with differing partition sizes, one cannot run operations like `diff`, as it will throw an IllegalArgumentException. The code below provides an example:
{code}
import org.apache.spark.graphx._
import org.apache.spark.rdd._

val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 3L).map(id => (id, id.toInt+1)))
setA.collect.foreach(println(_))
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)))
setB.collect.foreach(println(_))
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
val setC: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)) ++ sc.parallelize(6L until 8L).map(id => (id, id.toInt+2)))
setA.diff(setC).collect // java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
{code}
[jira] [Created] (SPARK-5790) VertexRDD's won't zip properly for `diff` capability
Brennon York created SPARK-5790: --- Summary: VertexRDD's won't zip properly for `diff` capability Key: SPARK-5790 URL: https://issues.apache.org/jira/browse/SPARK-5790 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York For VertexRDDs with differing partition sizes, one cannot run operations like `diff`, as it will throw an IllegalArgumentException. The code below provides an example:
{code:scala}
import org.apache.spark.graphx._
import org.apache.spark.rdd._

val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 3L).map(id => (id, id.toInt+1)))
setA.collect.foreach(println(_))
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)))
setB.collect.foreach(println(_))
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
val setC: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)) ++ sc.parallelize(6L until 8L).map(id => (id, id.toInt+2)))
setA.diff(setC).collect // java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
{code}
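The failure above is generic to zipping: `zipPartitions` requires both RDDs to have the same number of partitions before any elements are paired. A minimal sketch of that precondition, in plain Python standing in for the Scala above purely to illustrate the check that throws (not Spark's actual implementation):

```python
def zip_partitions(parts_a, parts_b):
    """Mimic Spark's zipPartitions precondition: partition counts must match.

    parts_a / parts_b are lists of partitions (lists of elements). This is
    an illustration of the check only, not Spark's code.
    """
    if len(parts_a) != len(parts_b):
        # Spark raises IllegalArgumentException with essentially this message.
        raise ValueError("Can't zip RDDs with unequal numbers of partitions")
    return [list(zip(pa, pb)) for pa, pb in zip(parts_a, parts_b)]
```

In the repro, `setC` is built from a union of two parallelized ranges, so its partition count differs from `setA`'s, and the check fires before `diff` can compare any vertices.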
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319400#comment-14319400 ] Cheng Hao commented on SPARK-5791: -- Can you also attach the performance comparison result for this query? [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operations
[jira] [Updated] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Zhou updated SPARK-5791: --- Description: Spark SQL shows poor performance when multiple tables do join operation (was: Spark SQL shows poor performance when multiple tables do join operation compared with Hive on MapReduce.) [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operation
[jira] [Resolved] (SPARK-3299) [SQL] Public API in SQLContext to list tables
[ https://issues.apache.org/jira/browse/SPARK-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3299. - Resolution: Fixed Fix Version/s: 1.3.0 [SQL] Public API in SQLContext to list tables - Key: SPARK-3299 URL: https://issues.apache.org/jira/browse/SPARK-3299 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Assignee: Bill Bejeck Priority: Minor Labels: newbie Fix For: 1.3.0 There is no public API in SQLContext to list the current tables. This would be pretty helpful.
[jira] [Updated] (SPARK-3168) The ServletContextHandler of webui lacks a SessionManager
[ https://issues.apache.org/jira/browse/SPARK-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3168: - Component/s: (was: Spark Core) Web UI Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) I'm wondering if this is still live. I agree that it shouldn't be on by default, and then, the details of what this enables matter less. They do incur overhead in cookies and memory, etc. The ServletContextHandler of webui lacks a SessionManager - Key: SPARK-3168 URL: https://issues.apache.org/jira/browse/SPARK-3168 Project: Spark Issue Type: Improvement Components: Web UI Environment: CAS Reporter: meiyoula Priority: Minor When I use CAS to implement single sign-on for the web UI, an exception occurs:
{code}
WARN [qtp1076146544-24] / org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
java.lang.IllegalStateException: No SessionManager
at org.eclipse.jetty.server.Request.getSession(Request.java:1269)
at org.eclipse.jetty.server.Request.getSession(Request.java:1248)
at org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:178)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
at org.jasig.cas.client.authentication.AuthenticationFilter.doFilter(AuthenticationFilter.java:116)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
at org.jasig.cas.client.session.SingleSignOutFilter.doFilter(SingleSignOutFilter.java:76)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:744)
{code}
[jira] [Created] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
Yi Zhou created SPARK-5791: -- Summary: [Spark SQL] show poor performance when multiple table do join operation Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operation compared with Hive on MapReduce.
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319364#comment-14319364 ] Yi Zhou commented on SPARK-5791: For example:
{code:sql}
SELECT *
FROM inventory inv
JOIN ( SELECT i_item_id, i_item_sk
       FROM item
       WHERE i_current_price > 0.98 AND i_current_price < 1.5 ) items
  ON inv.inv_item_sk = items.i_item_sk
JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30 AND datediff(d_date, '2001-05-08') <= 30;
{code}
[Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operation compared with Hive on MapReduce.
[jira] [Closed] (SPARK-5764) Delete the cache and lock file after executor fetching the jar
[ https://issues.apache.org/jira/browse/SPARK-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula closed SPARK-5764. --- Resolution: Not a Problem Delete the cache and lock file after executor fetching the jar -- Key: SPARK-5764 URL: https://issues.apache.org/jira/browse/SPARK-5764 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Every time an executor fetches a jar from the HTTP server, a lock file and a cache file are created locally. After the fetch, these two files are useless, and when the jar package is big the cache file is also big, which wastes disk space.
[jira] [Created] (SPARK-5765) word split problem in run-example and compute-classpath
Venkata Ramana G created SPARK-5765: --- Summary: word split problem in run-example and compute-classpath Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.2.1, 1.3.0, 1.1.2 Reporter: Venkata Ramana G Word split problem in the Spark directory path in the run-example and compute-classpath.sh scripts. This was introduced by the fix for SPARK-4504.
[jira] [Created] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
Kay Ousterhout created SPARK-5762: - Summary: Shuffle write time is incorrect for sort-based shuffle Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time.
[jira] [Commented] (SPARK-5754) Spark AM not launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317850#comment-14317850 ] Sean Owen commented on SPARK-5754: -- We just resolved http://issues.apache.org/jira/browse/SPARK-4267, which is kind of the flip side of this problem. Now, all arguments are quoted since they may contain spaces, and spaces would break the command. Some quoting seems to be needed at some level to handle this case. I wonder why the single quote is problematic on Windows; does it have to be a double quote? Maybe the logic can be improved to quote only args with a space in them, but that still leaves the question of how to correctly quote args with spaces on Windows. Spark AM not launching on Windows - Key: SPARK-5754 URL: https://issues.apache.org/jira/browse/SPARK-5754 Project: Spark Issue Type: Bug Components: Windows, YARN Affects Versions: 1.1.1, 1.2.0 Environment: Windows Server 2012, Hadoop 2.4.1. Reporter: Inigo I'm trying to run Spark Pi on a YARN cluster running on Windows, and the AM container fails to start. The problem seems to be in the generation of the YARN command, which adds single quotes (') around some of the Java options. In particular, the code adding those quotes is the escapeForShell function in YarnSparkHadoopUtil. Apparently, Windows does not like the quotes for these options.
Here is an example of the command that the container tries to execute:
{code}
@call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.yarn.secondary.jars=' '-Dspark.app.name=org.apache.spark.examples.SparkPi' '-Dspark.master=yarn-cluster' org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2
{code}
Once I transform it into:
{code}
@call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp -Dspark.yarn.secondary.jars= -Dspark.app.name=org.apache.spark.examples.SparkPi -Dspark.master=yarn-cluster org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2
{code}
everything seems to start. How should I deal with this? Create a separate function like escapeForShell for Windows and call it whenever I detect this is Windows? Or should I add some sanity check on the YARN side? I checked a little, and there seem to be people who are able to run Spark on YARN on Windows, so it might be something else. I didn't find anything related on JIRA either.
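The behavior the reporter and Sean Owen describe can be sketched as a platform-aware quoting rule: cmd.exe does not strip single quotes, so on Windows only arguments that actually contain whitespace should be wrapped, and in double quotes. This is a hypothetical helper in Python for illustration, not the actual escapeForShell in YarnSparkHadoopUtil (which is Scala):

```python
def escape_for_shell(arg, windows=False):
    """Sketch of platform-aware argument quoting (hypothetical helper).

    On Windows, cmd.exe treats single quotes as literal characters, so
    '-Dspark.driver.port=21390' (quotes included) is passed to java as-is
    and java reports "Could not find or load main class". Only whitespace
    forces quoting there, and it must use double quotes.
    """
    if windows:
        if any(c.isspace() for c in arg):
            # Double quotes are the only quoting cmd.exe understands.
            return '"' + arg.replace('"', '\\"') + '"'
        return arg  # bare token: quoting would become part of the value
    # POSIX-style behavior: single-quote everything, escaping embedded quotes.
    return "'" + arg.replace("'", "'\\''") + "'"
```

With this rule, `-Dspark.driver.port=21390` stays unquoted on Windows while `kill %p` becomes `"kill %p"`, matching the cases Inigo tested above.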
[jira] [Commented] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map
[ https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317782#comment-14317782 ] Sean Owen commented on SPARK-5739: -- What would that really do, though, except change one error into another? I think the current exception is quite clear. Size exceeds Integer.MAX_VALUE in File Map -- Key: SPARK-5739 URL: https://issues.apache.org/jira/browse/SPARK-5739 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1 Environment: Spark 1.1.1 on a cluster with 12 nodes. Every node has 128GB RAM and 24 cores. The data is just 40GB, and there are 48 parallel tasks on a node. Reporter: DjvuLee I ran the k-means algorithm on randomly generated data, but this problem occurred after some iterations. I tried several times, and the problem is reproducible. Because the data is randomly generated, I suspect there is a bug. Or, if random data can lead to a scenario where the size is bigger than Integer.MAX_VALUE, can we check the size before using the file map?
{code}
2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.util.SizeEstimator - Failed to check whether UseCompressedOops is set; assuming yes
[error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270)
at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
at KMeansDataGenerator$.main(kmeans.scala:105)
at KMeansDataGenerator.main(kmeans.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:619)
{code}
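The underlying limit is that `FileChannel.map` returns a buffer indexed by a Java `int`, so a single mapped block cannot exceed 2^31 - 1 bytes (about 2 GB). The pre-check the reporter asks about could look like the sketch below; this is illustrative Python with hypothetical names, not Spark's actual DiskStore code:

```python
MAX_MAPPED_BYTES = 2**31 - 1  # Java int limit for a single mapped buffer

def check_mappable(block_size_bytes):
    """Fail with a clearer message before trying to memory-map an
    oversized block. Illustrative sketch only.
    """
    if block_size_bytes > MAX_MAPPED_BYTES:
        raise ValueError(
            "Block of %d bytes exceeds the ~2 GB limit of a single mapped "
            "buffer; use more partitions so each block is smaller."
            % block_size_bytes)
    return block_size_bytes
```

As Sean Owen notes above, this mainly trades one error for another; the practical workaround is increasing the number of partitions so no single block crosses the limit.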
[jira] [Commented] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317799#comment-14317799 ] Apache Spark commented on SPARK-5762: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/4559 Shuffle write time is incorrect for sort-based shuffle -- Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time.
[jira] [Commented] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317882#comment-14317882 ] Apache Spark commented on SPARK-5765: - User 'gvramana' has created a pull request for this issue: https://github.com/apache/spark/pull/4561 word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Word split problem in the Spark directory path in the run-example and compute-classpath.sh scripts. This was introduced by the fix for SPARK-4504.