[GitHub] spark pull request: [SPARK-1773] Update outdated docs for spark-su...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/701#issuecomment-42729688 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14866/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763089 Merged build triggered.
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/646#discussion_r12507598 --- Diff: conf/spark-env.sh.template --- @@ -38,6 +38,7 @@ # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) +# - SPARK_DRIVER_MEMORY, Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb) --- End diff -- If so,
```
if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ] && [ $DEPLOY_MODE = client ]; then
  export SPARK_MEM=$DRIVER_MEMORY
fi
```
is not correct
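witgo's objection can be reproduced in isolation. A minimal sketch with hypothetical values (`2g` is arbitrary): leaving `DEPLOY_MODE` unset, even though client is the effective default mode, means the export never happens.

```shell
# Hypothetical repro of the bug: the user relies on the implicit "client"
# default instead of passing --deploy-mode explicitly.
unset SPARK_MEM DEPLOY_MODE
DRIVER_MEMORY=2g

# The check quoted above requires DEPLOY_MODE to be set explicitly,
# so the export is skipped even though client mode is in effect.
if [ -n "$DRIVER_MEMORY" ] && [ -n "$DEPLOY_MODE" ] && [ "$DEPLOY_MODE" = client ]; then
  export SPARK_MEM="$DRIVER_MEMORY"
fi

echo "SPARK_MEM=${SPARK_MEM:-unset}"   # prints: SPARK_MEM=unset
```

The driver thus silently falls back to its default heap size whenever the user omits `--deploy-mode`.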
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
GitHub user pwendell opened a pull request: https://github.com/apache/spark/pull/730 SPARK-1652: Set driver memory correctly in spark-submit. The previous check didn't account for the fact that the default deploy mode is client unless otherwise specified. Also, this sets the more narrowly defined SPARK_DRIVER_MEM instead of setting SPARK_MEM. You can merge this pull request into a Git repository by running: $ git pull https://github.com/pwendell/spark spark-submit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/730.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #730 commit f5081465658ee9035a498ac1fb8a532b89e895cc Author: Patrick Wendell pwend...@gmail.com Date: 2014-05-11T06:06:12Z SPARK-1652: Set driver memory correctly in spark-submit. The previous check didn't account for the fact that the default deploy mode is client unless otherwise specified. Also, this sets the more narrowly defined SPARK_DRIVER_MEM instead of setting SPARK_MEM.
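A sketch of the fix as described in the PR text, not the actual spark-submit code. Variable names are assumptions (the PR says SPARK_DRIVER_MEM, while the spark-env template spells it SPARK_DRIVER_MEMORY, so the name below is illustrative): default the deploy mode to client before testing it, and export the narrower driver-memory variable rather than SPARK_MEM.

```shell
# Illustrative sketch of the described fix; names and values are assumed.
unset DRIVER_MEMORY DEPLOY_MODE SPARK_DRIVER_MEMORY
DRIVER_MEMORY="${DRIVER_MEMORY:-2g}"   # demo value; normally parsed from --driver-memory
DEPLOY_MODE="${DEPLOY_MODE:-client}"   # the fix: client is the default unless specified

if [ -n "$DRIVER_MEMORY" ] && [ "$DEPLOY_MODE" = client ]; then
  # Narrower than SPARK_MEM: this affects only the driver JVM's heap.
  export SPARK_DRIVER_MEMORY="$DRIVER_MEMORY"
fi
echo "SPARK_DRIVER_MEMORY=$SPARK_DRIVER_MEMORY"   # prints: SPARK_DRIVER_MEMORY=2g
```

Defaulting the variable before the test is what makes the check pass when the user never passes `--deploy-mode` at all.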
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/646#discussion_r12507600 --- Diff: conf/spark-env.sh.template --- @@ -38,6 +38,7 @@ # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) +# - SPARK_DRIVER_MEMORY, Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb) --- End diff -- Yes - does this work for you?: https://github.com/apache/spark/pull/730/files
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763096 Merged build started.
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763144 This was reported by @witgo.
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user witgo closed the pull request at: https://github.com/apache/spark/pull/646
[GitHub] spark pull request: Synthetic GraphX Benchmark
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/720#issuecomment-42756947 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14876/
[GitHub] spark pull request: SPARK-1759
GitHub user pjopensource opened a pull request: https://github.com/apache/spark/pull/693 SPARK-1759 You can merge this pull request into a Git repository by running: $ git pull https://github.com/pjopensource/spark branch-0.9 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/693.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #693 commit a017d2ee597473d483350b5a3d57713fae392522 Author: Jason Penn pjopensou...@gmail.com Date: 2014-05-08T08:48:09Z modified: sbt/sbt
[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/685#issuecomment-42501713 Merged build started.
[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-42623759 Thanks, merged.
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/646#discussion_r12507619 --- Diff: conf/spark-env.sh.template --- @@ -38,6 +38,7 @@ # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) +# - SPARK_DRIVER_MEMORY, Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb) --- End diff -- Yes, it works for me. `./bin/spark-shell --driver-memory 2g` results in:
```
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java -cp ::/Users/witgo/work/code/java/spark/dist/conf:/Users/witgo/work/code/java/spark/dist/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop0.23.9.jar -Djava.library.path= -Xms2g -Xmx2g org.apache.spark.deploy.SparkSubmit spark-internal --driver-memory 2g --class org.apache.spark.repl.Main
```
But in `--driver-memory 2g --class org.apache.spark.repl.Main`, the `--driver-memory 2g` is unnecessary
[GitHub] spark pull request: [SQL] Improve column pruning.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/729#issuecomment-42763487 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [SQL] Improve column pruning.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/729#issuecomment-42763488 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14883/
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42763506 Jenkins, test this please.
[GitHub] spark pull request: SPARK-1686: keep schedule() calling in the mai...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/639#discussion_r12490657 --- Diff: core/src/main/scala/org/apache/spark/deploy/master/MasterMessages.scala --- @@ -39,4 +39,6 @@ private[master] object MasterMessages { case object RequestWebUIPort case class WebUIPortResponse(webUIBoundPort: Int) + + case object TriggerSchedule --- End diff -- I think this is no longer used.
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42763556 Merged build started.
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763555 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14884/
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42763553 Merged build triggered.
[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/685#discussion_r12503350 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/NumericParser.scala --- @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} + +private[mllib] object NumericTokenizer { + val NUMBER = -1 + val END = -2 +} + +import NumericTokenizer._ + +/** + * Simple tokenizer for a numeric structure consisting of three types: + * + * - number: a double in Java's floating number format + * - array: an array of numbers stored as `[v0,v1,...,vn]` + * - tuple: a list of numbers, arrays, or tuples stored as `(...)` + * + * @param s input string + * @param start start index + * @param end end index + */ +private[mllib] class NumericTokenizer(s: String, start: Int, end: Int) { --- End diff -- Tiny comment - `StringTokenizer` is considered just about deprecated (http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html). Guava's `Splitter` does a better job and can return delimiters, and is already available in the project.
[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/685#discussion_r12502576 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/NumericParser.scala --- @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} + +private[mllib] object NumericTokenizer { + val NUMBER = -1 + val END = -2 +} + +import NumericTokenizer._ + +/** + * Simple tokenizer for a numeric structure consisting of three types: + * + * - number: a double in Java's floating number format + * - array: an array of numbers stored as `[v0,v1,...,vn]` + * - tuple: a list of numbers, arrays, or tuples stored as `(...)` + * + * @param s input string + * @param start start index + * @param end end index + */ +private[mllib] class NumericTokenizer(s: String, start: Int, end: Int) { --- End diff -- Much of the logic here could perhaps be replaced by `new StringTokenizer(s.substring(start, end), "[](),", true)`. StringTokenizer can be configured to return the tokens (namely '[', ',', etc) so it should be enough for parsing these data types.
[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/730#issuecomment-42763554 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42763346 /cc @scrapcodes who worked on the IndestructibleActorSystem
[GitHub] spark pull request: SPARK-1795 - Add recursive directory file sear...
Github user patrickotoole commented on a diff in the pull request: https://github.com/apache/spark/pull/537#discussion_r12507593 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala --- @@ -327,18 +327,18 @@ class StreamingContext private[streaming] ( * @param directory HDFS directory to monitor for new file * @param filter Function to filter paths to process * @param newFilesOnly Should process only new files and ignore existing files in the directory + * @param recursive Should search through the directory recursively to find new files * @tparam K Key type for reading HDFS file * @tparam V Value type for reading HDFS file * @tparam F Input format for reading HDFS file */ def fileStream[ -K: ClassTag, -V: ClassTag, -F <: NewInputFormat[K, V]: ClassTag - ] (directory: String, filter: Path => Boolean, newFilesOnly: Boolean): InputDStream[(K, V)] = { -new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly) - } - + K: ClassTag, + V: ClassTag, + F <: NewInputFormat[K, V]: ClassTag +] (directory: String, filter: Path => Boolean, newFilesOnly: Boolean, recursive: Boolean): DStream[(K, V)] = { + new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly, recursive) + } --- End diff -- I have included a default value on the FileInputDStream but not on the API itself. Wondering if we want to introduce default values to the more granular version of the API. Currently, it looks like the exposed API essentially has two versions for these methods -- one that assumes default values and one that exposes all the parameters of the DStream constructor. Thoughts?
[GitHub] spark pull request: SPARK-1756: Add missing description to spark-e...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/646#issuecomment-42763279 A better solution: [PR 730](https://github.com/apache/spark/pull/730)
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42764076 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14885/
[GitHub] spark pull request: remove outdated runtime Information scala home
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/728#issuecomment-42764075 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [WIP]SPARK-1706: Allow multiple executors per ...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/636#discussion_r12507640 --- Diff: core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala --- @@ -28,6 +28,7 @@ private[spark] class ApplicationDescription( extends Serializable { val user = System.getProperty("user.name", "unknown") - + // only valid when spark.executor.multiPerWorker is set to true + var maxCorePerExecutor = maxCores --- End diff -- Yes, but in ApplicationInfo, app.coresLeft is equal to app.maxCores. So in schedule, when the sum of all workers' free cores is greater than app.coresLeft, leftCoreToAssign is actually equal to maxCorePerExecutor.
[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...
GitHub user ajtulloch opened a pull request: https://github.com/apache/spark/pull/725 SPARK-1791 - SVM implementation does not use threshold parameter Summary: https://issues.apache.org/jira/browse/SPARK-1791 Simple fix, and backward compatible, since - anyone who set the threshold was getting completely wrong answers. - anyone who did not set the threshold had the default 0.0 value for the threshold anyway. Test Plan: Unit test added that is verified to fail under the old implementation, and pass under the new implementation. Reviewers: CC: You can merge this pull request into a Git repository by running: $ git pull https://github.com/ajtulloch/spark SPARK-1791-SVM Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/725.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #725 commit 6f7075a75927b08c3c1211642632e03c2246f5cd Author: Andrew Tulloch and...@tullo.ch Date: 2014-05-10T23:22:02Z SPARK-1791 - SVM implementation does not use threshold parameter Summary: https://issues.apache.org/jira/browse/SPARK-1791 Simple fix, and backward compatible, since - anyone who set the threshold was getting completely wrong answers. - anyone who did not set the threshold had the default 0.0 value for the threshold anyway. Test Plan: Unit test added that is verified to fail under the old implementation, and pass under the new implementation. Reviewers: CC:
[GitHub] spark pull request: [SPARK-1774] Respect SparkSubmit --jars on YAR...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/710
[GitHub] spark pull request: Enabled incremental build that comes with sbt ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/525
[GitHub] spark pull request: [SQL] Upgrade parquet library.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/684#issuecomment-42750659 Thanks for taking a look, Michael - I merged this.
[GitHub] spark pull request: [SPARK-1773] Update outdated docs for spark-su...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/701#discussion_r12505284 --- Diff: docs/spark-standalone.md --- @@ -150,20 +162,30 @@ You can also pass an option `--cores numCores` to control the number of cores Spark supports two deploy modes. Spark applications may run with the driver inside the client process or entirely inside the cluster. -The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For info on the lower-level invocations used to launch an app inside the cluster, read ahead. +The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For more detail, see the [cluster mode overview](cluster-overview.html).
```
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... // other options
  <application-jar> \
  [application-arguments]
```
-## Launching Applications Inside the Cluster +<main-class>: The entry point for your application (e.g. org.apache.spark.examples.SparkPi) +<master-url>: The URL of the master node (e.g. spark://23.195.26.187:7077) +<deploy-mode>: Whether to deploy this application within the cluster or from an external client (e.g. client) +<application-jar>: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. +<application-arguments>: Arguments passed to the main method of <main-class> + +Behind the scenes, this invokes the standalone Client to launch your application, which is also the legacy way to launch your application before Spark 1.0. --- End diff -- Since this is deprecated we shouldn't document it. Otherwise new users of Spark will use this.
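For concreteness, here is the template from the diff above filled in with the doc's own example values. This is an illustrative invocation only; the jar path and argument are hypothetical.

```shell
# Hypothetical spark-submit invocation using the example values from the
# documentation diff (SparkPi class, standalone master URL, client mode).
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://23.195.26.187:7077 \
  --deploy-mode client \
  hdfs://path/to/spark-examples.jar \
  100
```

Note that the jar URL must be visible from every node in the cluster, which is why the doc recommends an `hdfs://` path or a `file://` path present on all nodes.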
[GitHub] spark pull request: The org.datanucleus:* should not be packaged i...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/688 The org.datanucleus:* should not be packaged into spark-assembly-*.jar You can merge this pull request into a Git repository by running: $ git pull https://github.com/witgo/spark SPARK-1644 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/688.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #688 commit a5974140300f4fc0ab2dee72a9807bc37daff0a1 Author: witgo wi...@qq.com Date: 2014-05-08T01:59:18Z The org.datanucleus:* should not be packaged into spark-assembly-*.jar
[GitHub] spark pull request: SPARK-1569 Spark on Yarn, authentication broke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/649#issuecomment-42427401 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/706#issuecomment-42697568 https://issues.apache.org/jira/browse/SPARK-1776
[GitHub] spark pull request: L-BFGS Documentation
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12460491 --- Diff: docs/mllib-optimization.md --- @@ -163,3 +177,100 @@ each iteration, to compute the gradient direction. Available algorithms for gradient descent: * [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent) + +### Limited-memory BFGS +L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various +ML algorithms such as Linear Regression and Logistic Regression, you have to pass the gradient of the objective +function and an updater into the optimizer yourself instead of using training APIs like +[LogisticRegression.LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegression). +See the example below. This will be addressed in the next release. + +L1 regularization via +[Updater.L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.Updater) will not work, since the +soft-thresholding logic in L1Updater is designed for gradient descent. + +The L-BFGS method +[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS) +has the following parameters: + +* `gradient` is a class that computes the gradient of the objective function +being optimized, i.e., with respect to a single training example, at the +current parameter value. MLlib includes gradient classes for common loss +functions, e.g., hinge, logistic, least-squares. The gradient class takes as +input a training example, its label, and the current parameter value. +* `updater` is a class originally designed for gradient descent which computes +the actual gradient descent step. However, we're able to obtain the gradient and +loss of the regularized objective for L-BFGS by ignoring the logic that applies +only to gradient descent, such as the adaptive step size. We will refactor +this into a regularizer, replacing the updater, to separate the logic between +regularization and the step update later. +* `numCorrections` is the number of corrections used in the L-BFGS update. 10 is +recommended. +* `maxNumIterations` is the maximum number of iterations that L-BFGS can run. +* `regParam` is the regularization parameter when using regularization. +* `return` A tuple containing two elements. The first element is a column matrix +containing weights for every feature, and the second element is an array containing +the loss computed for every iteration. + +Here is an example that trains binary logistic regression with L2 regularization using +the L-BFGS optimizer. {% highlight scala %} +import org.apache.spark.SparkContext +import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.mllib.classification.LogisticRegressionModel +import breeze.linalg.{DenseVector => BDV} --- End diff -- We don't need `breeze` for the example. See my next comment.
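The soft-thresholding operation that ties L1Updater to gradient descent is small enough to sketch directly. Below is an illustrative Python version of the operator only (MLlib's actual L1Updater is Scala and couples this with the adaptive step size; the function name here is ours, not MLlib's):

```python
import math

def soft_threshold(weights, shrinkage):
    # Proximal step for L1 regularization: sign(w) * max(|w| - shrinkage, 0).
    # Components whose magnitude falls below the shrinkage are zeroed out,
    # which is what makes this specific to gradient-descent-style updates.
    return [math.copysign(max(abs(w) - shrinkage, 0.0), w) for w in weights]

print(soft_threshold([1.5, -0.2, 0.05], 0.1))
```

Small weights are driven exactly to zero, producing the sparsity L1 regularization is used for.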
[GitHub] spark pull request: [Docs] Update YARN and standalone docs
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/701#issuecomment-42713025 Merged build triggered.
[GitHub] spark pull request: SPARK-1789. Multiple versions of Netty depende...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/723#discussion_r12507895 --- Diff: core/src/test/scala/org/apache/spark/LocalSparkContext.scala --- @@ -17,8 +17,7 @@ package org.apache.spark -import org.jboss.netty.logging.InternalLoggerFactory -import org.jboss.netty.logging.Slf4JLoggerFactory +import _root_.io.netty.util.internal.logging.{Slf4JLoggerFactory, InternalLoggerFactory} --- End diff -- Yes you're right. I had thought it was conflicting with `scala.io` somehow but according to my IDE it's the Spark subpackage. (The abundance of import syntax in Scala is not my favorite thing.)
[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/685#discussion_r12502581 --- Diff: python/pyspark/mllib/linalg.py --- @@ -234,6 +233,45 @@ def dense(elements): return array(elements, dtype=float64) +@staticmethod +def parse(s): + """Parses a string resulting from Vectors.stringify() into a vector. + + >>> Vectors.parse("[0.0,1.0]") + array([ 0., 1.]) + >>> print Vectors.parse("(2,[1],[1.0])") + (2,[1],[1.0]) + """ + return Vectors._parse_structured(eval(s)) --- End diff -- We can't use eval here; it will be both slow and a security risk. If it's a lot of work to create a fast parser in Python (which it might be), maybe we should instead do JSON, with just tuples.
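The standard safe alternative to `eval` for this kind of literal input is `ast.literal_eval`, which accepts only Python literals and cannot execute code (it does not address the speed concern, only the security one). A sketch; the `parse_vector` helper and its return format are illustrative, not PySpark's API:

```python
import ast

def parse_vector(s):
    # ast.literal_eval parses Python literal syntax only, so a malicious
    # string like "__import__('os').system('rm -rf /')" raises ValueError
    # instead of executing.
    obj = ast.literal_eval(s)
    if isinstance(obj, tuple):
        # sparse form: (size, indices, values)
        size, indices, values = obj
        return ("sparse", size, list(indices), list(values))
    # dense form: [v0, v1, ...]
    return ("dense", list(obj))

print(parse_vector("[0.0,1.0]"))
print(parse_vector("(2,[1],[1.0])"))
```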
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42716337 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14854/
[GitHub] spark pull request: SPARK-1565, update examples to be used with sp...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/552#issuecomment-42520748 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1789. Multiple versions of Netty depende...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/723
[GitHub] spark pull request: Fixed streaming examples docs to use run-examp...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/722#issuecomment-42738393 Merged build triggered.
[GitHub] spark pull request: Synthetic GraphX Benchmark
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/720#issuecomment-42756921 Merged build started.
[GitHub] spark pull request: SPARK-1565 (Addendum): Replace `run-example` w...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/704#discussion_r12462266 --- Diff: bin/run-example --- @@ -49,46 +31,31 @@ fi if [[ -z "$SPARK_EXAMPLES_JAR" ]]; then echo "Failed to find Spark examples assembly in $FWDIR/lib or $FWDIR/examples/target" >&2 - echo "You need to build Spark with sbt/sbt assembly before running this program" >&2 + echo "You need to build Spark before running this program" >&2 exit 1 fi +SPARK_EXAMPLES_JAR_REL=${SPARK_EXAMPLES_JAR#$FWDIR/} -# Since the examples JAR ideally shouldn't include spark-core (that dependency should be -# provided), also add our standard Spark classpath, built using compute-classpath.sh. -CLASSPATH=`$FWDIR/bin/compute-classpath.sh` -CLASSPATH="$SPARK_EXAMPLES_JAR:$CLASSPATH" -if $cygwin; then -CLASSPATH=`cygpath -wp "$CLASSPATH"` -export SPARK_EXAMPLES_JAR=`cygpath -w "$SPARK_EXAMPLES_JAR"` -fi -# Find java binary -if [ -n "${JAVA_HOME}" ]; then - RUNNER="${JAVA_HOME}/bin/java" -else - if [ `command -v java` ]; then -RUNNER="java" - else -echo "JAVA_HOME is not set" >&2 -exit 1 - fi -fi +EXAMPLE_CLASS="<example-class>" +EXAMPLE_ARGS="[example args]" +EXAMPLE_MASTER=${MASTER:-"<master>"} -# Set JAVA_OPTS to be able to load native libraries and to set heap size -JAVA_OPTS="$SPARK_JAVA_OPTS" -# Load extra JAVA_OPTS from conf/java-opts, if it exists -if [ -e "$FWDIR/conf/java-opts" ] ; then - JAVA_OPTS="$JAVA_OPTS `cat $FWDIR/conf/java-opts`" +if [ -n "$1" ]; then + EXAMPLE_CLASS="$1" + shift fi -export JAVA_OPTS if [ "$SPARK_PRINT_LAUNCH_COMMAND" == "1" ]; then - echo -n "Spark Command: " - echo "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@" - echo - echo +if [ -n "$1" ]; then + EXAMPLE_ARGS="$@" fi -exec "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@" +echo "NOTE: This script has been replaced with ./bin/spark-submit. Please run:" >&2 +echo +echo " ./bin/spark-submit \\" >&2 --- End diff -- I thought about this some more, I think maybe we should just call spark-submit with the supplied master instead of telling the user this stuff. Or we could call spark-submit and then print out to the user how to run this.
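Delegating as pwendell suggests is mostly a matter of assembling the spark-submit argument vector from what run-example was given. A hypothetical sketch of that mapping (the class prefix and jar name are placeholders, not the actual script's values):

```python
def build_spark_submit_argv(example_class, master, example_args):
    # Instead of printing migration instructions, run-example could exec
    # spark-submit directly with the master it was supplied.
    return (["./bin/spark-submit",
             "--master", master,
             "--class", "org.apache.spark.examples." + example_class,
             "examples-assembly.jar"]   # placeholder jar path
            + list(example_args))

argv = build_spark_submit_argv("SparkPi", "local[2]", ["10"])
print(" ".join(argv))
```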
[GitHub] spark pull request: [SPARK-1688] Propagate PySpark worker stderr t...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/603#discussion_r12388303 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala --- @@ -208,6 +170,19 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String] } } + /** + * Redirect a worker's stdout and stderr to our stderr. + */ + private def redirectWorkerStreams(stdout: InputStream, stderr: InputStream) { +try { + new RedirectThread(stdout, System.err, "stdout reader for " + pythonExec).start() + new RedirectThread(stderr, System.err, "stderr reader for " + pythonExec).start() +} catch { + case e: Throwable => --- End diff -- It could. The reason for paranoia is because I don't want both the stderr RedirectThread and the catch block in `startDaemon` to contend for the error stream. But sure, it doesn't have to be a throwable.
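The RedirectThread pattern discussed above, a background thread that pumps a child process's stream into our own stderr, looks roughly like this in Python (names and buffer size are illustrative, not Spark's Scala implementation):

```python
import io
import threading

def redirect_stream(source, sink):
    # Copy everything from `source` to `sink` on a daemon thread, the way a
    # stream-redirect thread forwards a worker's stdout/stderr. Returns the
    # thread so callers can join() it if they want to wait for EOF.
    def pump():
        for chunk in iter(lambda: source.read(1024), b""):
            sink.write(chunk)
    t = threading.Thread(target=pump, daemon=True)
    t.start()
    return t

# Demonstrate with in-memory streams standing in for the child's pipes.
src = io.BytesIO(b"worker stderr line\n")
dst = io.BytesIO()
redirect_stream(src, dst).join()
print(dst.getvalue())
```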
[GitHub] spark pull request: Include the sbin/spark-config.sh in spark-exec...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/651#issuecomment-42608067 Merged build triggered.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/636#issuecomment-42504239 Merged build started.
[GitHub] spark pull request: SPARK-1668: Add implicit preference as an opti...
Github user techaddict commented on a diff in the pull request: https://github.com/apache/spark/pull/597#discussion_r12389144 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala --- @@ -121,11 +157,14 @@ object MovieLensALS { } /** Compute RMSE (Root Mean Squared Error). */ - def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long) = { + def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) = { + +def evalRating(r: Double) = --- End diff -- ya better.
[GitHub] spark pull request: Typo fix: fetchting - fetching
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/680#issuecomment-42485074 Thanks. Merged.
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/706#discussion_r12508408 --- Diff: project/SparkBuild.scala --- @@ -297,7 +273,7 @@ object SparkBuild extends Build { val chillVersion = "0.3.6" val codahaleMetricsVersion = "3.0.0" val jblasVersion = "1.2.3" - val jets3tVersion = if ("^2\\.[3-9]+".r.findFirstIn(hadoopVersion).isDefined) "0.9.0" else "0.7.1" + val jets3tVersion = "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).fold("0.7.1")(_ => "0.9.0") --- End diff -- I myself learnt it not so long ago and noticed it spurred some discussions in the Scala community (with [Martin Odersky himself](http://stackoverflow.com/a/5332657/1305344)). [Functional Programming in Scala](http://www.manning.com/bjarnason/) reads on page 69 (in the pdf version): *It's fine to use pattern matching, though you should be able to implement all the functions besides map and getOrElse without resorting to pattern matching.* So I followed the advice and applied Option.map(...).getOrElse(...) quite intensively, but...IntelliJ IDEA, just before I committed the change, suggested replacing it with fold. I thought I'd need to change my habits once more and sent the PR. With all that said, it's not clear what the most idiomatic approach is, but [pattern matching is in my opinion a step back from map/getOrElse](https://github.com/scala/scala/blob/2.12.x/src/library/scala/Option.scala#L156) and there's no need for it. I'd appreciate being corrected and could even replace `Option.fold` with `map`/`getOrElse` if that would make the PR better for more committers.
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/706#issuecomment-42768086 This is what Patrick was trying to say in https://issues.apache.org/jira/browse/SPARK-1776. So while we have this on our minds, even if we merge this patch it will eventually have to be replaced entirely. We are still experimenting, so let's see if we all agree on this. OTOH, we should retain the comments in case we decide to merge this patch. Jenkins, test this please.
[GitHub] spark pull request: [WIP]SPARK-1706: Allow multiple executors per ...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/636#discussion_r12508764 --- Diff: core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala --- @@ -28,6 +28,7 @@ private[spark] class ApplicationDescription( extends Serializable { val user = System.getProperty(user.name, unknown) - + // only valid when spark.executor.multiPerWorker is set to true + var maxCorePerExecutor = maxCores --- End diff -- https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/ApplicationInfo.scala#L84, coreLeft implementation, https://github.com/CodingCat/spark/blob/SPARK-1706/core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala#L63,
[GitHub] spark pull request: [SPARK-1690] Tolerating empty elements when sa...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/644#issuecomment-42752706 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-1688] Propagate PySpark worker stderr t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/603#issuecomment-42469138 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14780/
[GitHub] spark pull request: [SQL] Fix Performance Issue in data type casti...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/679
[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/686#issuecomment-42494200 @CodingCat @kayousterhout @rxin
[GitHub] spark pull request: SPARK-1668: Add implicit preference as an opti...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/597#discussion_r12407485 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala --- @@ -121,11 +157,23 @@ object MovieLensALS { } /** Compute RMSE (Root Mean Squared Error). */ - def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long) = { + def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) = { + +def mapPredictedRating(r: Double) = + if (!implicitPrefs) { --- End diff -- Can we change it to the following: ~~~ if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r ~~~
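The suggested one-liner clamps implicit-preference predictions (which are confidence-like scores) to [0, 1] before computing RMSE, while passing explicit ratings through unchanged. A direct Python transliteration of that suggestion:

```python
def map_predicted_rating(r, implicit_prefs):
    # With implicit preferences, ALS predicts a confidence-like score,
    # so clamp it to [0, 1]; with explicit ratings, return it as-is.
    return max(min(r, 1.0), 0.0) if implicit_prefs else r

print(map_predicted_rating(1.7, True))    # clamped from above
print(map_predicted_rating(-0.3, True))   # clamped from below
print(map_predicted_rating(1.7, False))   # unchanged
```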
[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/497#issuecomment-42489547 Merged build finished.
[GitHub] spark pull request: [Docs] Update YARN docs
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/701#discussion_r12456854 --- Diff: docs/running-on-yarn.md --- @@ -68,11 +69,12 @@ In yarn-cluster mode, the driver runs on a different machine than the client, so $ ./bin/spark-submit --class my.main.Class \ --master yarn-cluster \ +--deploy-mode cluster \ --jars my-other-jar.jar,my-other-other-jar.jar my-main-jar.jar -yarn-cluster 5 +[app arguments] --- End diff -- Same as above, --master should probably just be yarn. And to be concrete like the other parts of the example, maybe use apparg1 apparg2 instead of [app arguments]?
[GitHub] spark pull request: Bug fix of sparse vector conversion
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/661
[GitHub] spark pull request: [SPARK-1631] Correctly set the Yarn app name w...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/539#issuecomment-42512812 I dug through the code a bunch. For those following this PR, here's a quick summary. Looks like the way we set the Spark app name on YARN is actually quite convoluted because there are two types of app names: 1. The one observed by the RM 2. The one observed by SparkContext This PR is mainly concerned with (1), and only in YARN client mode. Before the changes, there are two ways of affecting (1), in order of decreasing priority: (a) set `spark.app.name` in spark-defaults (b) set `SPARK_YARN_APP_NAME` in spark-env After the changes, this becomes (a) set `SPARK_YARN_APP_NAME` in spark-env (b) set `spark.app.name` in SparkConf (through `conf.setAppName`) (c) set `spark.app.name` in spark-defaults This guarantees that if an environment variable is explicitly specified, then the RM should use that name. Otherwise, use whatever SparkContext is using for its app name, in which case (1) == (2). To me, the semantics are pretty clear. Notice, however, that both before and after, there is no way for SparkSubmit's --name parameter to affect either (1) or (2). This is a separate bug with SparkSubmit that I have summarized in [SPARK-1755](https://issues.apache.org/jira/browse/SPARK-1755).
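The post-change precedence described above is a simple first-non-None-wins chain. An illustrative Python sketch of that resolution order (the function, its parameters, and the fallback are hypothetical, not Spark's actual code):

```python
def resolve_app_name(env_name=None, conf_name=None, defaults_name=None):
    # Priority order after the change: SPARK_YARN_APP_NAME from spark-env,
    # then spark.app.name set on SparkConf, then spark.app.name from
    # spark-defaults. First source that is set wins.
    for candidate in (env_name, conf_name, defaults_name):
        if candidate is not None:
            return candidate
    return "Spark"  # hypothetical fallback, not from the PR

print(resolve_app_name(conf_name="MyApp", defaults_name="DefaultApp"))
```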
[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/497#issuecomment-42483855 Merged build started.
[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/686#issuecomment-42494319 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14790/
[GitHub] spark pull request: SPARK-1565 (Addendum): Replace `run-example` w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/704#issuecomment-42633729 Merged build triggered.
[GitHub] spark pull request: SPARK-1588 Re-add support for SPARK_YARN_USER_...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/514#issuecomment-42525085 Closing this assuming it has been resolved through other PR's. Please let me know in case this is not the case.
[GitHub] spark pull request: Include the sbin/spark-config.sh in spark-exec...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/651#discussion_r12450552 --- Diff: sbin/spark-executor --- @@ -19,5 +19,10 @@ FWDIR=$(cd `dirname $0`/..; pwd) +sbin=`dirname $0` +sbin=`cd $sbin; pwd` + +. $sbin/spark-config.sh --- End diff -- Hey actually - instead of doing this, do you mind just inlining the setting up of the PYTHONPATH into this file? `spark-config.sh` is intended to configure Spark's standalone daemons, it's not meant to be used at all in mesos mode. I realize the current approach fixes the problem, but it would be cleaner to just have the mesos executor setup $PYTHONPATH directly. ``` export PYTHONPATH=$FWDIR/python:$PYTHONPATH export PYTHONPATH=$FWDIR/python/lib/py4j-0.8.1-src.zip:$PYTHONPATH ```
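The two export lines suggested above just prepend entries to PYTHONPATH, most recent first. A Python sketch of the same prepend logic (the /opt/spark paths are placeholders; the real script uses $FWDIR):

```python
import os

def prepend_pythonpath(entries, environ):
    # Prepend each entry to PYTHONPATH in order, mirroring the two shell
    # export lines: Spark's python dir ends up first, the py4j zip second.
    for entry in reversed(entries):
        existing = environ.get("PYTHONPATH", "")
        environ["PYTHONPATH"] = entry + (os.pathsep + existing if existing else "")
    return environ["PYTHONPATH"]

env = {}
path = prepend_pythonpath(
    ["/opt/spark/python", "/opt/spark/python/lib/py4j-0.8.1-src.zip"], env)
print(path)
```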
[GitHub] spark pull request: [Spark-1461] Deferred Expression Evaluation (s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/446#issuecomment-42718954 Build finished. All automated tests passed.
[GitHub] spark pull request: [SPARK-1779] add warning when memoryFraction i...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/714#issuecomment-42770141 Hi @pwendell, thanks. I didn't notice validateSettings in SparkConf, and it's a good place to validate this.
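The check being discussed (warning when a memoryFraction setting is out of range) amounts to a bounds validation at configuration time. An illustrative Python sketch in the spirit of SparkConf.validateSettings; the function name and message are hypothetical, not Spark's actual code:

```python
def validate_memory_fraction(name, value):
    # A memory fraction must be a proportion of the heap, i.e. in (0, 1].
    # Validating once at configuration time catches typos like "6" for "0.6"
    # before they cause confusing runtime behavior.
    if not 0.0 < value <= 1.0:
        raise ValueError("%s should be between 0 and 1 (was %s)" % (name, value))
    return value

print(validate_memory_fraction("spark.storage.memoryFraction", 0.6))
```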
[GitHub] spark pull request: [SPARK-1470] Spark logger moving to use scala-...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/332#issuecomment-42771828 @pwendell @marmbrus @rxin typesafehub/scala-logging#4 has been solved by typesafehub/scala-logging#15. This PR should now be ready to merge into master.
[GitHub] spark pull request: [WIP]SPARK-1706: Allow multiple executors per ...
Github user CodingCat closed the pull request at: https://github.com/apache/spark/pull/636
[GitHub] spark pull request: [WIP]SPARK-1706: Allow multiple executors per ...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/636#discussion_r12509207

--- Diff: core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala ---
```
@@ -28,6 +28,7 @@ private[spark] class ApplicationDescription(
   extends Serializable {

   val user = System.getProperty("user.name", "unknown")
-
+  // only valid when spark.executor.multiPerWorker is set to true
+  var maxCorePerExecutor = maxCores
```
--- End diff --

Yes, thanks, I see.
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/732

SPARK-1798. Tests should clean up temp files

Three issues related to temp files that tests generate - these should be touched up for hygiene but are not urgent.

Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of the former.

The `work/` directory is not deleted by `mvn clean`, in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.

Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.

_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-1798

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/732.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #732

commit bdd0f410e17f69c5463dbb3014240e7f552dccda
Author: Sean Owen so...@cloudera.com
Date: 2014-05-06T21:19:07Z
Remove duplicate module dir in log4j.properties output path for tests

commit b21b3566fa51b5375ca809449004a873cb5feefb
Author: Sean Owen so...@cloudera.com
Date: 2014-05-06T21:21:48Z
Remove work/ and checkpoint/ dirs with mvn clean

commit 5af578e5be2450371ec14aa33d1a679f029fb720
Author: Sean Owen so...@cloudera.com
Date: 2014-05-06T21:22:23Z
Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
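The cleanup pattern described in the PR (register temp dirs at creation, `deleteOnExit()` as a safety net, recursive deletion in an `@After` method) could look roughly like the following. This is a hypothetical sketch of the proposed trait, not code from the patch; the names are made up:

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical sketch of the test trait floated above (in the spirit of
// LocalSparkContext): create temp dirs through one method, register them,
// and clean everything up after the test.
trait TempDirCleanup {
  private var tempDirs: List[File] = Nil

  /** Create a temp directory, registering it for cleanup and deleteOnExit(). */
  def createTempDir(prefix: String = "spark-test"): File = {
    val dir = Files.createTempDirectory(prefix).toFile
    dir.deleteOnExit() // safety net if cleanupTempDirs() is never called
    tempDirs = dir :: tempDirs
    dir
  }

  /** Recursively delete a file or directory, like Utils.deleteRecursively. */
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) Option(f.listFiles).getOrElse(Array.empty).foreach(deleteRecursively)
    f.delete()
  }

  /** Call from an @After method to remove everything this test created. */
  def cleanupTempDirs(): Unit = {
    tempDirs.foreach(deleteRecursively)
    tempDirs = Nil
  }
}
```

A subclass would call `createTempDir()` wherever it needs scratch space and `cleanupTempDirs()` once in teardown, instead of each test hand-rolling its own deletion logic.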
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42772900 Merged build started.
[GitHub] spark pull request: Synthetic GraphX Benchmark
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/720#discussion_r12502303

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/SynthBenchmark.scala ---
```
@@ -0,0 +1,136 @@
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.graphx.PartitionStrategy
+import org.apache.spark.graphx.PartitionStrategy.{CanonicalRandomVertexCut, EdgePartition2D, EdgePartition1D, RandomVertexCut}
+import org.apache.spark.{SparkContext, SparkConf}
+import org.apache.spark.graphx.util.GraphGenerators
+import java.io.{PrintWriter, FileOutputStream}
+
+/**
+ * The SynthBenchmark application can be used to run various GraphX algorithms on
+ * synthetic log-normal graphs. The intent of this code is to enable users to
+ * profile the GraphX system without access to large graph datasets.
+ */
+object SynthBenchmark {
```
--- End diff --

Hey @jegonzal - we've generally been trying to put runnable programs like this in the examples package. Would it be possible to do that here? The main benefits are that (i) users understand this is not a stable API and (ii) users can launch it with `run-example` or `spark-submit`, which allows them to easily do things like launch on YARN or ask for a specific number of executors when running the program.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42773948 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42773949 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14886/
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/732#issuecomment-42773573 Merged build started.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42772897 Merged build triggered.
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/706#discussion_r12509377

--- Diff: project/SparkBuild.scala ---
```
@@ -297,7 +273,7 @@ object SparkBuild extends Build {
   val chillVersion = "0.3.6"
   val codahaleMetricsVersion = "3.0.0"
   val jblasVersion = "1.2.3"
-  val jets3tVersion = if ("^2\\.[3-9]+".r.findFirstIn(hadoopVersion).isDefined) "0.9.0" else "0.7.1"
+  val jets3tVersion = "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).fold("0.7.1")(_ => "0.9.0")
```
--- End diff --

We discussed this a while ago, soon after Spark development moved to Scala 2.10. The overwhelming preference was that we use map/getOrElse instead of fold. When working on Spark, consider turning off that particular inspection and warning in the IntelliJ preferences.
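The style point above is about two equivalent ways to collapse an `Option`. A small self-contained illustration of both forms (the version strings match the diff; the input Hadoop versions are made-up examples):

```scala
// Both functions compute the same result; Spark's style guide prefers the
// map/getOrElse form over Option.fold.
def jets3tVersionFold(hadoopVersion: String): String =
  "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).fold("0.7.1")(_ => "0.9.0")

def jets3tVersionMap(hadoopVersion: String): String =
  "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).map(_ => "0.9.0").getOrElse("0.7.1")
```

`map`/`getOrElse` reads left-to-right (transform the present case, then supply the default), which is why it was preferred; `fold` puts the default first, which many readers find backwards.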
[GitHub] spark pull request: L-BFGS Documentation
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12460376

--- Diff: docs/mllib-optimization.md ---
```
@@ -163,3 +177,100 @@ each iteration, to compute the gradient direction.

 Available algorithms for gradient descent:

 * [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+
+### Limited-memory BFGS
+L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various
+ML algorithms such as Linear Regression, and Logistic Regression, you have to pass the gradient of objective
+function, and updater into optimizer yourself instead of using the training APIs like
+[LogisticRegression.LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegression).
```
--- End diff --

last `LogisticRegression` -> `LogisticRegressionWithSGD`
[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/706#discussion_r12489193

--- Diff: project/SparkBuild.scala ---
```
@@ -16,17 +16,18 @@
  */

 import sbt._
-import sbt.Classpaths.publishTask
-import sbt.Keys._
+import Keys._
+import Classpaths.publishTask
 import sbtassembly.Plugin._
 import AssemblyKeys._

-import scala.util.Properties
+import util.Properties
```
--- End diff --

https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/629#issuecomment-42673811 Although there's now a 'fix' in, it is only going to do automatically what you are doing by hand. My hunch is that perhaps the script, or env variable, already contains a classpath with 0.9.0 jars in it, which means that when they're replaced with differently-named jars, suddenly it can't find basic classes. Just a guess. (Not sure what support will do with it given that this is into unsupported territory. You can ask. There's a reasonable question here about understanding the deployment. They'll reach out to the right people as needed.)
[GitHub] spark pull request: Update RoutingTable.scala
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/647#issuecomment-42465768 Merged build triggered.
[GitHub] spark pull request: [SPARK-1755] Respect SparkSubmit --name on YAR...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/699#issuecomment-42630934 I've merged this, thanks.
[GitHub] spark pull request: L-BFGS Documentation
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/702#discussion_r12460451

--- Diff: docs/mllib-optimization.md ---
```
@@ -163,3 +177,100 @@ each iteration, to compute the gradient direction.

 Available algorithms for gradient descent:

 * [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+
+### Limited-memory BFGS
+L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various
+ML algorithms such as Linear Regression, and Logistic Regression, you have to pass the gradient of objective
+function, and updater into optimizer yourself instead of using the training APIs like
+[LogisticRegression.LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegression).
+See the example below. It will be addressed in the next release.
+
+The L1 regularization by using
+[Updater.L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.Updater) will not work since the
+soft-thresholding logic in L1Updater is designed for gradient descent.
+
+The L-BFGS method
+[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS)
+has the following parameters:
+
+* `gradient` is a class that computes the gradient of the objective function
+being optimized, i.e., with respect to a single training example, at the
+current parameter value. MLlib includes gradient classes for common loss
+functions, e.g., hinge, logistic, least-squares. The gradient class takes as
+input a training example, its label, and the current parameter value.
+* `updater` is a class originally designed for gradient descent which computes
```
--- End diff --

Let's remove this paragraph or move it to a separate `Developer` section. For the guide, we'd better separate user-facing content from developer-facing content.
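The "soft-thresholding logic" the doc refers to is the proximal update that L1 regularization applies after a gradient step. A self-contained sketch of that update (this is an illustration, not MLlib's actual `L1Updater`; the names and signatures here are made up) shows why it is tied to gradient descent:

```scala
// Standalone sketch of L1 soft-thresholding in a proximal gradient step.
// Not MLlib's L1Updater: an illustration of the idea only.
def softThreshold(w: Double, shrinkage: Double): Double =
  math.signum(w) * math.max(0.0, math.abs(w) - shrinkage)

// One L1-regularized gradient-descent update: take the plain gradient step,
// then shrink each weight toward zero by stepSize * regParam. The shrinkage
// amount depends directly on the step size, which is why this logic assumes a
// gradient-descent-style update and does not transfer to L-BFGS, whose step is
// chosen by a line search rather than a fixed stepSize.
def l1Update(weights: Array[Double], gradient: Array[Double],
             stepSize: Double, regParam: Double): Array[Double] = {
  val shrinkage = stepSize * regParam
  weights.zip(gradient).map { case (w, g) => softThreshold(w - stepSize * g, shrinkage) }
}
```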
[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...
Github user ankurdave commented on a diff in the pull request: https://github.com/apache/spark/pull/497#discussion_r12456034

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/ReplicatedVertexView.scala ---
```
@@ -21,192 +21,102 @@
 import scala.reflect.{classTag, ClassTag}

 import org.apache.spark.SparkContext._
 import org.apache.spark.rdd.RDD
-import org.apache.spark.util.collection.{PrimitiveVector, OpenHashSet}

 import org.apache.spark.graphx._

 /**
- * A view of the vertices after they are shipped to the join sites specified in
- * `vertexPlacement`. The resulting view is co-partitioned with `edges`. If `prevViewOpt` is
- * specified, `updatedVerts` are treated as incremental updates to the previous view. Otherwise, a
- * fresh view is created.
- *
- * The view is always cached (i.e., once it is evaluated, it remains materialized). This avoids
- * constructing it twice if the user calls graph.triplets followed by graph.mapReduceTriplets, for
- * example. However, it means iterative algorithms must manually call `Graph.unpersist` on previous
- * iterations' graphs for best GC performance. See the implementation of
- * [[org.apache.spark.graphx.Pregel]] for an example.
+ * Manages shipping vertex attributes to the edge partitions of an
+ * [[org.apache.spark.graphx.EdgeRDD]]. Vertex attributes may be partially shipped to construct a
+ * triplet view with vertex attributes on only one side, and they may be updated. An active vertex
+ * set may additionally be shipped to the edge partitions. Be careful not to store a reference to
+ * `edges`, since it may be modified when the attribute shipping level is upgraded.
  */
 private[impl]
-class ReplicatedVertexView[VD: ClassTag](
-    updatedVerts: VertexRDD[VD],
-    edges: EdgeRDD[_],
-    routingTable: RoutingTable,
-    prevViewOpt: Option[ReplicatedVertexView[VD]] = None) {
+class ReplicatedVertexView[VD: ClassTag, ED: ClassTag](
+    var edges: EdgeRDD[ED, VD],
+    var hasSrcId: Boolean = false,
+    var hasDstId: Boolean = false) {

   /**
-   * Within each edge partition, create a local map from vid to an index into the attribute
-   * array. Each map contains a superset of the vertices that it will receive, because it stores
-   * vids from both the source and destination of edges. It must always include both source and
-   * destination vids because some operations, such as GraphImpl.mapReduceTriplets, rely on this.
+   * Return a new `ReplicatedVertexView` with the specified `EdgeRDD`, which must have the same
+   * shipping level.
    */
-  private val localVertexIdMap: RDD[(Int, VertexIdToIndexMap)] = prevViewOpt match {
-    case Some(prevView) =>
-      prevView.localVertexIdMap
-    case None =>
-      edges.partitionsRDD.mapPartitions(_.map {
-        case (pid, epart) =>
-          val vidToIndex = new VertexIdToIndexMap
-          epart.foreach { e =>
-            vidToIndex.add(e.srcId)
-            vidToIndex.add(e.dstId)
-          }
-          (pid, vidToIndex)
-      }, preservesPartitioning = true).cache().setName("ReplicatedVertexView localVertexIdMap")
-  }
-
-  private lazy val bothAttrs: RDD[(PartitionID, VertexPartition[VD])] = create(true, true)
-  private lazy val srcAttrOnly: RDD[(PartitionID, VertexPartition[VD])] = create(true, false)
-  private lazy val dstAttrOnly: RDD[(PartitionID, VertexPartition[VD])] = create(false, true)
-  private lazy val noAttrs: RDD[(PartitionID, VertexPartition[VD])] = create(false, false)
-
-  def unpersist(blocking: Boolean = true): ReplicatedVertexView[VD] = {
-    bothAttrs.unpersist(blocking)
-    srcAttrOnly.unpersist(blocking)
-    dstAttrOnly.unpersist(blocking)
-    noAttrs.unpersist(blocking)
-    // Don't unpersist localVertexIdMap because a future ReplicatedVertexView may be using it
-    // without modification
-    this
+  def withEdges[VD2: ClassTag, ED2: ClassTag](
+      edges_ : EdgeRDD[ED2, VD2]): ReplicatedVertexView[VD2, ED2] = {
+    new ReplicatedVertexView(edges_, hasSrcId, hasDstId)
   }

-  def get(includeSrc: Boolean, includeDst: Boolean): RDD[(PartitionID, VertexPartition[VD])] = {
-    (includeSrc, includeDst) match {
-      case (true, true) => bothAttrs
-      case (true, false) => srcAttrOnly
-      case (false, true) => dstAttrOnly
-      case (false, false) => noAttrs
-    }
+  /**
+   * Return a new `ReplicatedVertexView` where edges are reversed and shipping levels are swapped to
+   * match.
+   */
+  def reverse() = {
+    val newEdges = edges.mapEdgePartitions((pid, part) => part.reverse)
+    new ReplicatedVertexView(newEdges, hasDstId, hasSrcId)
   }
```
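The `hasSrcId`/`hasDstId` flags in the diff above track which vertex attributes have already been shipped to the edge partitions. A simplified, self-contained sketch of that bookkeeping (this models the idea only; it is not the GraphX implementation):

```scala
// Simplified model of the shipping-level bookkeeping in ReplicatedVertexView.
// Not the real GraphX code: it just shows which sides still need shipping when
// a triplet view requests src and/or dst attributes.
case class ShippingLevel(hasSrcId: Boolean, hasDstId: Boolean) {
  /** Which sides must be shipped to satisfy a request for (includeSrc, includeDst)? */
  def neededShipments(includeSrc: Boolean, includeDst: Boolean): (Boolean, Boolean) =
    (includeSrc && !hasSrcId, includeDst && !hasDstId)

  /** The level after upgrading to satisfy the request. */
  def upgraded(includeSrc: Boolean, includeDst: Boolean): ShippingLevel =
    ShippingLevel(hasSrcId || includeSrc, hasDstId || includeDst)

  /** Reversing edges swaps the roles of src and dst, as in reverse() above. */
  def reverse: ShippingLevel = ShippingLevel(hasDstId, hasSrcId)
}
```

Upgrades are monotone (attributes are only ever added), which is why callers must not hold a stale reference to `edges`: a later request can mutate the view to a higher shipping level.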
[GitHub] spark pull request: SPARK-1569 Spark on Yarn, authentication broke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/649#issuecomment-42423123 Merged build triggered.
[GitHub] spark pull request: SPARK-1556. jets3t dep doesn't update properly...
Github user darose commented on the pull request: https://github.com/apache/spark/pull/629#issuecomment-42635850 I tried downloading the latest from the master branch of both Spark and Shark, and built and ran them (in the hopes of getting past that class-incompatibility issue), but no luck there either. Now it's dying on `Exception in thread "main" java.lang.NoClassDefFoundError: org/json4s/JsonAST$JValue`. I'm really at wits' end here - I've spent days trying to get this to run. Is there a matched set of Spark/Shark binaries available somewhere that can run on CDH5 (i.e., that includes this jets3t fix)? Or if not, could someone provide instructions on how I might build them? I'm going to have to fall back to slow-as-molasses Hive if I can't find a way through this.
[GitHub] spark pull request: Use numpy directly for matrix multiply.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/687
[GitHub] spark pull request: [SPARK-1774] Respect SparkSubmit --jars on YAR...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/710#issuecomment-42634045 We may not always want to automatically add the jar to `spark.jar`. I will spend some time looking into cases in which we want to skip this (which likely involves `yarn-cluster` and python files). Do not merge this yet.
[GitHub] spark pull request: [Docs] Update YARN and standalone docs
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/701#issuecomment-42715097 Thanks for the update; the changes look good to me. I did notice one other thing in the YARN docs: we don't say what parameters spark-shell takes. It used to list the environment variables used to configure it, but now it doesn't say how to configure it, and I don't see any other docs that do. I actually went to look at the code to see how it is now done. So perhaps I'll file a separate JIRA for that, unless you know of another PR handling it?
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/732#issuecomment-42773567 Merged build triggered.
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/732#issuecomment-42774629 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1798. Tests should clean up temp files
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/732#issuecomment-42774630 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14887/
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-42774784 Merged build triggered.
[GitHub] spark pull request: Add init script to the debian packaging
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/733#issuecomment-42775785 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
GitHub user CodingCat opened a pull request: https://github.com/apache/spark/pull/731

SPARK-1706: Allow multiple executors per worker in Standalone mode

resubmit of SPARK-1706 for a totally different algorithm

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingCat/spark SPARK-1706-2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/731.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #731

commit 01308816e4eee4efba40dc8ce374bf0166edc3b9
Author: CodingCat zhunans...@gmail.com
Date: 2014-05-04T18:13:41Z
make master support multiple executors per worker

commit 893974d4c51e0e5de22b4b3f85f5413310a3b86d
Author: CodingCat zhunans...@gmail.com
Date: 2014-05-04T18:17:12Z
memory dominated allocation

commit cdd2556102bface6a80a22fbdb7375ed853fafc3
Author: CodingCat zhunans...@gmail.com
Date: 2014-05-11T14:53:03Z
memory dominated allocation
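The "memory dominated allocation" commits suggest sizing executors on a worker by whichever resource (cores or memory) runs out first. A hypothetical self-contained sketch of that idea (this is not the code from the patch; all names are made up):

```scala
// Hypothetical sketch of memory-dominated executor allocation on one worker:
// the number of executors that fit is bounded by both the core budget and the
// memory budget, so the binding constraint is whichever is exhausted first.
// Not the actual code from the patch.
case class WorkerResources(freeCores: Int, freeMemoryMb: Int)

def executorsThatFit(worker: WorkerResources,
                     coresPerExecutor: Int,
                     memoryPerExecutorMb: Int): Int = {
  val byCores = worker.freeCores / coresPerExecutor
  val byMemory = worker.freeMemoryMb / memoryPerExecutorMb
  math.min(byCores, byMemory)
}
```

For example, a worker with 16 free cores but only 4 GB free memory can host just two 4-core/2 GB executors: memory, not cores, dominates the allocation.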