[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4725#issuecomment-75521453 Ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4725#issuecomment-75519690 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-5724] fix the misconfiguration in AkkaU...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4512
[GitHub] spark pull request: [SPARK-5943][Streaming] Update the test to use...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4722
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r25155799

--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters

 import scala.collection.JavaConversions._
+import scala.util.parsing.json._

 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil

 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      Map(
+        "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
+        "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        "timestamp" -> cell.getTimestamp.toString,
+        "type" -> Type.codeToType(cell.getTypeByte).toString,
+        "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.map(JSONObject(_).toString()).mkString("\n")
--- End diff --

`output` is a `Buffer[Map[String, String]]`, since there are several records in an HBase Result. However, `JSONObject` has only one constructor, `JSONObject(obj: Map[String, Any])`, so `JSONObject(output).toString()` would cause a compilation failure. ^^
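[Editor's sketch] The constraint described above — `scala.util.parsing.json.JSONObject` wraps a single `Map`, so a collection of per-cell maps has to be serialized element by element — can be illustrated with a minimal, self-contained example. The record maps here are made-up stand-ins for real HBase cells, not actual HBase data:

```scala
import scala.util.parsing.json.JSONObject

object PerRecordJson {
  def main(args: Array[String]): Unit = {
    // Stand-ins for the Map[String, String] built from each HBase cell.
    val records: Seq[Map[String, String]] = Seq(
      Map("row" -> "r1", "qualifier" -> "q1", "value" -> "v1"),
      Map("row" -> "r1", "qualifier" -> "q2", "value" -> "v2")
    )
    // JSONObject's only constructor takes a single Map, so wrap each
    // record separately and join with newlines. JSONObject(records)
    // would not compile, because records is a Seq, not a Map.
    val output = records.map(JSONObject(_).toString()).mkString("\n")
    println(output)
  }
}
```

This mirrors the `output.map(JSONObject(_).toString()).mkString("\n")` line in the diff: one JSON object per record, newline-separated, rather than one JSON value for the whole collection.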
[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4725#discussion_r25157906

--- Diff: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala ---
@@ -36,7 +36,7 @@ object HBaseTest {
     // Initialize hBase table if necessary
     val admin = new HBaseAdmin(conf)
     if (!admin.isTableAvailable(args(0))) {
-      val tableDesc = new HTableDescriptor(args(0))
+      val tableDesc = new HTableDescriptor(TableName.valueOf(args(0)))
--- End diff --

Do you happen to know how long ago this constructor was added? I want to figure out if this makes it incompatible with any HBase <= 0.98.7, which is presumably the earliest version kind of 'supported' by the examples. Are there other deprecations in the HBase examples that can be improved? I suspect the examples were written for HBase ~0.94.x
[GitHub] spark pull request: [SPARK-5943][Streaming] Update the test to use...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4722#issuecomment-75526163 LGTM. The new method is in branch-1.3, so can be back-ported, and I think this qualifies as a good tiny fix. I verified these are all the occurrences.
[GitHub] spark pull request: [SPARK-4730][YARN] Warn against deprecated YAR...
Github user zuxqoj commented on a diff in the pull request: https://github.com/apache/spark/pull/3590#discussion_r25155375

--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -78,11 +79,25 @@ private[spark] class YarnClientSchedulerBackend(
       ("--queue", "SPARK_YARN_QUEUE", "spark.yarn.queue"),
       ("--name", "SPARK_YARN_APP_NAME", "spark.app.name")
     )
+    // Warn against the following deprecated environment variables: env var -> suggestion
+    val deprecatedEnvVars = Map(
+      "SPARK_MASTER_MEMORY" -> "SPARK_DRIVER_MEMORY or --driver-memory through spark-submit",
+      "SPARK_WORKER_INSTANCES" -> "SPARK_WORKER_INSTANCES or --num-executors through spark-submit",
+      "SPARK_WORKER_MEMORY" -> "SPARK_EXECUTOR_MEMORY or --executor-memory through spark-submit",
+      "SPARK_WORKER_CORES" -> "SPARK_EXECUTOR_CORES or --executor-cores through spark-submit")
+    // Do the same for deprecated properties: property -> suggestion
+    val deprecatedProps = Map("spark.master.memory" -> "--driver-memory through spark-submit")
--- End diff --

SPARK_MASTER_MEMORY and spark.master.memory are not applicable in yarn-client mode and should be removed; please refer to SPARK-1953.
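[Editor's sketch] The warning pattern under discussion — keep a map from deprecated name to suggested replacement, then warn only for variables that are actually set — can be sketched as follows. This is an illustration, not the code that was merged; the object and method names here are invented:

```scala
object DeprecationWarnings {
  // Deprecated env var -> suggested replacement (entries taken from the diff under review).
  val deprecatedEnvVars: Map[String, String] = Map(
    "SPARK_MASTER_MEMORY" -> "SPARK_DRIVER_MEMORY or --driver-memory through spark-submit",
    "SPARK_WORKER_MEMORY" -> "SPARK_EXECUTOR_MEMORY or --executor-memory through spark-submit",
    "SPARK_WORKER_CORES"  -> "SPARK_EXECUTOR_CORES or --executor-cores through spark-submit")

  // Build one warning message per deprecated variable present in `env`.
  def warnings(env: Map[String, String]): Seq[String] =
    deprecatedEnvVars.collect {
      case (envVar, suggestion) if env.contains(envVar) =>
        s"$envVar is deprecated; use $suggestion instead"
    }.toSeq

  def main(args: Array[String]): Unit =
    // In real code this would go through the logger rather than stderr.
    warnings(sys.env).foreach(w => Console.err.println(s"WARN: $w"))
}
```

Passing the environment in as a plain `Map` keeps the lookup logic testable without mutating process state.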
[GitHub] spark pull request: [SPARK-5174][SPARK-5175] provide more APIs in ...
Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/3984#issuecomment-75528392 sure, thanks
[GitHub] spark pull request: [SPARK-5802][MLLIB] cache transformed data in ...
Github user joshdevins commented on the pull request: https://github.com/apache/spark/pull/4593#issuecomment-75540678 I have the same concern as @dbtsai in his comment. Most consumers of this API will already be caching their dataset before the learning phase. Without user care, this will effectively introduce double caching (in terms of the data size of cached RDDs) and will cause many jobs to fail after upgrading, by exceeding the heap available for the RDD cache. Furthermore, we are making assumptions about how to cache -- in-memory only in this case. Should we parameterise this? Perhaps that will help send the message in the API that caching is also done before learning. (FWIW, in-memory is definitely the right default choice here.) See the email thread on the dev list for where I encountered this bug: http://mail-archives.apache.org/mod_mbox/spark-dev/201502.mbox/%3CCAH5MZvMBjqOST-9Nr9k1z1rUODfSiczr_fV9kwqDFqAMNLC2Zw%40mail.gmail.com%3E
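[Editor's sketch] One common way to avoid the double-caching concern raised here — a sketch, not the fix actually merged for SPARK-5802 — is to persist the training input only when the caller has not already cached it, by inspecting the RDD's storage level, and to unpersist afterwards only if we were the ones who persisted. The helper name is invented:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CacheGuard {
  // Hypothetical helper: cache `data` for the duration of `train` only if
  // the user has not already persisted it, avoiding a second cached copy.
  def withCaching[D, T](data: RDD[D])(train: RDD[D] => T): T = {
    val handlePersistence = data.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) data.persist(StorageLevel.MEMORY_AND_DISK)
    try train(data)
    finally if (handlePersistence) data.unpersist()
  }
}
```

This also addresses the parameterisation point indirectly: since an already-persisted input is left untouched, a caller who prefers a different storage level can simply persist with it before calling.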
[GitHub] spark pull request: [SPARK-5802][MLLIB] cache transformed data in ...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/4593#issuecomment-75550855 @dbtsai, @joshdevins here's the issue I have. I'm using the new ml pipeline with hyperparameter grid search. Because the folds don't depend on the hyperparameters, I've reimplemented LogisticRegression a bit so that it does not unpersist the data:

```scala
class CustomLogisticRegression extends LogisticRegression {
  var oldInstances: RDD[LabeledPoint] = null

  override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
    println(s"Fitting dataset ${dataset.id} with ParamMap $paramMap.")
    transformSchema(dataset.schema, paramMap, logging = true)
    import dataset.sqlContext._
    val map = this.paramMap ++ paramMap
    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
      .map { case Row(label: Double, features: Vector) => LabeledPoint(label, features) }
    // For parallel grid search
    this.synchronized({
      if (oldInstances == null || oldInstances.id != instances.id) {
        if (oldInstances != null) {
          oldInstances.unpersist()
        }
        oldInstances = instances
        instances.setName(s"Instances for LR with ParamMap $paramMap and RDD ${dataset.id}")
        instances.persist(StorageLevel.MEMORY_AND_DISK)
      }
    })
    val lr = (new LogisticRegressionWithLBFGS)
      .setValidateData(false)
    lr.optimizer
      .setRegParam(map(regParam))
      .setNumIterations(map(maxIter))
    val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
    // instances.unpersist()
    // copy model params
    Params.inheritValues(map, this, lrm)
    lrm
  }
}
```

Then for 3 folds in cross-validation and 3 hyperparameters for LogisticRegression I got something like this:

```
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.5 }
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.1 }
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.01 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.5 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.1 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.01 }
```

So persistence at the model level is needed to cache folds for hyperparameter grid search, but persistence at the GLM level is needed to speed up the StandardScaler transformation etc. I don't know yet how to do this efficiently without double caching.
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user xunyuw closed the pull request at: https://github.com/apache/spark/pull/4728
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user xunyuw opened a pull request: https://github.com/apache/spark/pull/4728 Merge pull request #1 from apache/master SYNC 2015-02-08 20:00 You can merge this pull request into a Git repository by running: $ git pull https://github.com/xunyuw/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4728.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4728 commit 027f3928651aede18758471cf75b20230bc434fc Author: Xunyu Wang xunyu.w...@hotmail.com Date: 2015-02-08T12:34:48Z Merge pull request #1 from apache/master SYNC 2015-02-08 20:00
[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...
Github user potix2 closed the pull request at: https://github.com/apache/spark/pull/4725
[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...
Github user potix2 commented on a diff in the pull request: https://github.com/apache/spark/pull/4725#discussion_r25161690

--- Diff: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala ---
@@ -36,7 +36,7 @@ object HBaseTest {
     // Initialize hBase table if necessary
     val admin = new HBaseAdmin(conf)
     if (!admin.isTableAvailable(args(0))) {
-      val tableDesc = new HTableDescriptor(args(0))
+      val tableDesc = new HTableDescriptor(TableName.valueOf(args(0)))
--- End diff --

Sorry, I don't know when that constructor was added. I understand that my proposal breaks compatibility with the earliest supported version. There are no other deprecations in the HBase examples, so I am closing this PR.
[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4725#discussion_r25162029

--- Diff: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala ---
@@ -36,7 +36,7 @@ object HBaseTest {
     // Initialize hBase table if necessary
     val admin = new HBaseAdmin(conf)
     if (!admin.isTableAvailable(args(0))) {
-      val tableDesc = new HTableDescriptor(args(0))
+      val tableDesc = new HTableDescriptor(TableName.valueOf(args(0)))
--- End diff --

@potix2 no, it may be just fine. I was asking you to check it. It would be good to know when the new method was added to make sure this doesn't needlessly break recent versions, but I agree with this change as long as the constructor was available in HBase <= 0.98.7, and preferably a few previous versions.
[GitHub] spark pull request: Merge pull request #2 from apache/master
GitHub user xunyuw reopened a pull request: https://github.com/apache/spark/pull/4727 Merge pull request #2 from apache/master SYNC 2015-02-23 20:00 You can merge this pull request into a Git repository by running: $ git pull https://github.com/xunyuw/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4727.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4727 commit 027f3928651aede18758471cf75b20230bc434fc Author: Xunyu Wang xunyu.w...@hotmail.com Date: 2015-02-08T12:34:48Z Merge pull request #1 from apache/master SYNC 2015-02-08 20:00
[GitHub] spark pull request: Merge pull request #2 from apache/master
Github user xunyuw closed the pull request at: https://github.com/apache/spark/pull/4727
[GitHub] spark pull request: Merge pull request #2 from apache/master
GitHub user xunyuw opened a pull request: https://github.com/apache/spark/pull/4727 Merge pull request #2 from apache/master SYNC 2015-02-23 20:00 You can merge this pull request into a Git repository by running: $ git pull https://github.com/xunyuw/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4727.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4727 commit 027f3928651aede18758471cf75b20230bc434fc Author: Xunyu Wang xunyu.w...@hotmail.com Date: 2015-02-08T12:34:48Z Merge pull request #1 from apache/master SYNC 2015-02-08 20:00
[GitHub] spark pull request: Merge pull request #2 from apache/master
Github user xunyuw closed the pull request at: https://github.com/apache/spark/pull/4727
[GitHub] spark pull request: [SPARK-3147][MLLib] A/B testing
Github user feynmanliang commented on the pull request: https://github.com/apache/spark/pull/4716#issuecomment-75543022 `[error] * abstract method numDim()Int in interface org.apache.spark.mllib.stat.MultivariateStatisticalSummary does not have a correspondent in old version` Would it be bett
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user xunyuw closed the pull request at: https://github.com/apache/spark/pull/4726
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user xunyuw commented on the pull request: https://github.com/apache/spark/pull/4726#issuecomment-75531993 SYNC 2015-02-23 20:00
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4726#issuecomment-75532001 Mind closing this PR?
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user xunyuw opened a pull request: https://github.com/apache/spark/pull/4726 Merge pull request #1 from apache/master SYNC 2015-02-23 20:00 You can merge this pull request into a Git repository by running: $ git pull https://github.com/xunyuw/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4726.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4726 commit 027f3928651aede18758471cf75b20230bc434fc Author: Xunyu Wang xunyu.w...@hotmail.com Date: 2015-02-08T12:34:48Z Merge pull request #1 from apache/master SYNC 2015-02-08 20:00
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user xunyuw closed the pull request at: https://github.com/apache/spark/pull/4726
[GitHub] spark pull request: [SPARK 5280] RDF Loader added + documentation
Github user lukovnikov commented on the pull request: https://github.com/apache/spark/pull/4650#issuecomment-75533258 @maropu tests are added and build tests passed. Is it ready for merging now?
[GitHub] spark pull request: [SPARK-2087] [SQL] Multiple thriftserver sessi...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/4382#issuecomment-75535316 /cc @liancheng can you review this for me?
[GitHub] spark pull request: [SPARK-3147][MLLib] A/B testing
Github user feynmanliang closed the pull request at: https://github.com/apache/spark/pull/4716
[GitHub] spark pull request: [SPARK-3147][MLLib] A/B testing
GitHub user feynmanliang reopened a pull request: https://github.com/apache/spark/pull/4716 [SPARK-3147][MLLib] A/B testing Implementation of A/B testing using Streaming API. This contribution is my original work and I license the work to the project under the project's open source license. You can merge this pull request into a Git repository by running: $ git pull https://github.com/feynmanliang/spark ab_testing Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4716.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4716 commit 105401a89216516565236f59a66a22cc91830686 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-10T19:36:27Z Add broken implementation of AB testing. commit cb73e790c435a4819fb62bc6c37717f4b882aee4 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-10T21:07:29Z Fix AB testing implementation and add unit tests. commit e0d5beccf54914ebdc5663dbe4ba71944f3183e2 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-10T22:54:26Z Extract t-testing code out of OnlineABTesting. commit 2100de641a2e86efeaa0f559500c7ced6f7d51a9 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T04:56:30Z Add peace period for dropping first k entries of each A/B group. commit 708380e980ed46ac1beb7665f7854fcf36ebc403 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T05:09:18Z Add numDim to MultivariateOnlineSummarizer. commit ec7f700fbca15d84bba126edaaa50d53ce5fc7be Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T06:02:41Z Refactored ABTestingMethod into sealed trait. commit 3f19e15aa3b7056262b601686643ed962846cdc3 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T06:29:49Z Add (non-sliding) testing window functionality. 
commit c56f9237aa81a70e8572e2ecb851ebaf5cdfa473 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T15:19:46Z Fix peace period implementation. commit 0d738815eb1cd49096112d8be7e9124345af0604 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T17:31:05Z Fix test window batching. commit abf59d5e8f817f847af77aef7514fb740dbbf69d Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T17:56:15Z Handle (inelegantly) closure capture for ABTestMethod commit e05eaaf3bb21bbed4c123d9ec6514e84ae75adcb Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T18:20:19Z Improve handling of OnlineABTestMethod closure by moving DStream processing method into Serializable class. commit 964a555746273a3afa542e34fdc6b86be60a5db9 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T18:52:37Z Fixed flaky peacePeriod test. commit 79c1d44c6232b0a4af5df4dc14cdc83919cfdea9 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-11T20:39:58Z Add ScalaDocs and format to style guide. commit e030c12337dce99abcf26f7d02c5d00a78f58c9b Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-12T00:02:20Z Add OnlineABTestExample. commit e8e1f82b16fbdd8446e21b32bb39b413e1ae30d1 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-12T00:03:12Z Format code to style guide. commit 17eef4eb22d918198dd03f2a931f009863fadcf5 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-19T04:43:36Z Switch MultivariateOnlineSummarizer to univariate StatsCounter. commit a2ad38be8a77eef045581282b3dbc9d6a1544870 Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-19T14:45:15Z Reduce number of passes in pairSummaries. commit 4bb8636e5317a542ff0b29270548bd933199c6eb Author: Feynman Liang feynman.li...@gmail.com Date: 2015-01-19T14:45:41Z Add test for behavior when missing data from one group. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
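The streaming t-test at the heart of this patch can be sketched outside Spark: each A/B group keeps running count/mean/variance statistics (as a univariate `StatsCounter` does), and a Welch two-sample t-statistic is computed from the merged summaries per batch. A minimal sketch under those assumptions — the class and function names here are illustrative, not the PR's actual API:

```python
import math

class RunningStats:
    """Welford online mean/variance, analogous to Spark's StatsCounter."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def merge(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # sample variance; 0.0 until we have at least two points
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

def welch_t(a, b):
    """Welch's two-sample t-statistic from two sets of running stats."""
    se = math.sqrt(a.variance / a.n + b.variance / b.n)
    return (a.mean - b.mean) / se

group_a, group_b = RunningStats(), RunningStats()
for x in [1.0, 1.2, 0.9, 1.1]:
    group_a.merge(x)
for x in [2.0, 2.1, 1.9, 2.2]:
    group_b.merge(x)
t = welch_t(group_a, group_b)  # strongly negative: group B's mean is higher
```

A "peace period" as in the commits above would simply skip the first k calls to `merge` for each group before the statistics start accumulating.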
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4726#issuecomment-75532361 Can one of the admins verify this patch?
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user xunyuw reopened a pull request: https://github.com/apache/spark/pull/4726 Merge pull request #1 from apache/master SYNC 2015-02-23 20:00 You can merge this pull request into a Git repository by running: $ git pull https://github.com/xunyuw/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4726.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4726 commit 027f3928651aede18758471cf75b20230bc434fc Author: Xunyu Wang xunyu.w...@hotmail.com Date: 2015-02-08T12:34:48Z Merge pull request #1 from apache/master SYNC 2015-02-08 20:00
[GitHub] spark pull request: [SPARK-5926] [SQL] make DataFrame.explain leve...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/4707#issuecomment-75536583 retest please
[GitHub] spark pull request: [WIP][SPARK-4902][CORE] gap-sampling performan...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3744#issuecomment-75561628 @witgo is this still live and have you followed up on Xiangrui's comment?
[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3626#issuecomment-75564514 @alanctgardner have you had a look at @jkbradley 's feedback? I'm wondering if this is still live. It needs a rebase if so.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75563929 I am also not clear this is a good thing. As a default, it doesn't change anything. There is probably not a globally correct ratio, even if it's not 1, but this implies there is. Is there evidence that a default besides 1.0 is better in most cases? The docs don't even suggest what the tradeoff is here. Won't this potentially cause more shuffles when the ratio is not 1? I think this is something that must be set on a case-by-case basis, and that can already be done, even as a function of the parent RDD partitions, by the caller. Can we elaborate on this or close it?
[GitHub] spark pull request: [SPARK-4006] Block Manager - Double Register C...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2854#issuecomment-75564774 Mind closing this PR?
[GitHub] spark pull request: [SPARK-4340] [Core] add java opts argument sub...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3205#issuecomment-75564995 ok to test
[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...
Github user alanctgardner commented on a diff in the pull request: https://github.com/apache/spark/pull/3626#discussion_r25172516 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala --- @@ -65,6 +66,25 @@ class NaiveBayesModel private[mllib] ( override def predict(testData: Vector): Double = { labels(brzArgmax(brzPi + brzTheta * testData.toBreeze)) } + + def classProbabilities(testData: RDD[Vector]): --- End diff -- Sorry for the delay, I have no strong preference but predictProbabilities makes sense for consistency. I can make that change and the style ones mentioned. My stats background is not super-strong; @jatinpreet seemed to imply there's a correctness issue with this PR. Can anyone comment on whether I've got the math wrong?
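The quantity the method above computes is, up to normalization, `brzPi + brzTheta * x` — per-class log prior plus log likelihood. Turning those scores into class probabilities only needs a log-sum-exp normalization. A hedged sketch of the math in plain Python (not the MLlib API; `log_pi` and `log_theta` stand in for the model's `brzPi`/`brzTheta`):

```python
import math

def class_probabilities(log_pi, log_theta, x):
    """Posterior P(class | x) for multinomial naive Bayes.
    log_pi: per-class log prior; log_theta: per-class, per-feature
    log likelihoods; x: feature counts for one example."""
    scores = [lp + sum(t * xi for t, xi in zip(row, x))
              for lp, row in zip(log_pi, log_theta)]
    m = max(scores)                        # log-sum-exp for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# two classes, two features, uniform priors
probs = class_probabilities(
    log_pi=[math.log(0.5), math.log(0.5)],
    log_theta=[[math.log(0.8), math.log(0.2)],
               [math.log(0.3), math.log(0.7)]],
    x=[2.0, 1.0])
```

The result is a proper distribution (sums to 1), and `argmax(probs)` agrees with the existing `predict`, which takes the argmax of the same unnormalized scores.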
[GitHub] spark pull request: [SPARK-4340] [Core] add java opts argument sub...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3205#issuecomment-75565369 [Test build #27852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27852/consoleFull) for PR 3205 at commit [`9f8db81`](https://github.com/apache/spark/commit/9f8db81cef7287a92b9752f2c09c01b3ddf0d8ac). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3147][MLLib] A/B testing
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4716#issuecomment-75580259 Let's remove `numDim`.
[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/4723#issuecomment-75581979 Hi @tdas , do we need to add a Python version of `createRDD` for the direct Kafka stream? Seems this API requires a Python wrapper of Java objects like `OffsetRange`.
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3920#issuecomment-75582752 @GenTang This PR looks good to me now, thanks! @JoshRosen I think it's ready to go.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25183946 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java --- @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.launcher; + +import java.io.File; +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.regex.Pattern; + +import static org.apache.spark.launcher.CommandBuilderUtils.*; + +/** + * Command builder for internal Spark classes. + * <p/> + * This class handles building the command to launch all internal Spark classes except for + * SparkSubmit (which is handled by the {@link SparkSubmitCommandBuilder} class). + */ +class SparkClassCommandBuilder extends SparkLauncher implements CommandBuilder { --- End diff -- Yes, that part is sort of weird. But it's the only way to expose all the methods that should be public without having a public abstract base class like before. So it's kinda the best solution I have if SparkLauncher is to remain public; if it's not, we can break the common parts into an abstract class.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25184021 --- Diff: launcher/src/main/java/org/apache/spark/launcher/CommandBuilder.java --- @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.launcher; + +import java.io.IOException; +import java.util.List; +import java.util.Map; + +/** + * Internal interface that defines a command builder. + */ +interface CommandBuilder { --- End diff -- `Main.java` actually uses `CommandBuilder`.
[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/4723#discussion_r25178444 --- Diff: examples/src/main/python/streaming/direct_kafka_wordcount.py ---
@@ -0,0 +1,55 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+ Counts words in UTF8 encoded, '\n' delimited text directly received from Kafka in every 2 seconds.
+ Usage: direct_kafka_wordcount.py <broker_list> <topic>
+
+ To run this on your local machine, you need to setup Kafka and create a producer first, see
+ http://kafka.apache.org/documentation.html#quickstart
+
+ and then run the example
+    `$ bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/\
+      spark-streaming-kafka-assembly-*.jar \
+      examples/src/main/python/streaming/direct_kafka_wordcount.py \
+      localhost:9092 test`
+"""
+
+import sys
+
+from pyspark import SparkContext
+from pyspark.streaming import StreamingContext
+from pyspark.streaming.kafka import KafkaUtils
+
+if __name__ == "__main__":
+    if len(sys.argv) != 3:
+        print >> sys.stderr, "Usage: direct_kafka_wordcount.py <broker_list> <topic>"
+        exit(-1)
+
+    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
+    ssc = StreamingContext(sc, 2)
+
+    brokers, topic = sys.argv[1:]
+    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
--- End diff -- Hi @davies, thanks for your comment, I will add this as an argument.
[GitHub] spark pull request: [SPARK-5939][MLLib] make FPGrowth example app ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4714
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r25180080 --- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala --- @@ -18,20 +18,34 @@ package org.apache.spark.examples.pythonconverters import scala.collection.JavaConversions._ +import scala.util.parsing.json._ import org.apache.spark.api.python.Converter import org.apache.hadoop.hbase.client.{Put, Result} import org.apache.hadoop.hbase.io.ImmutableBytesWritable import org.apache.hadoop.hbase.util.Bytes +import org.apache.hadoop.hbase.KeyValue.Type +import org.apache.hadoop.hbase.CellUtil /** - * Implementation of [[org.apache.spark.api.python.Converter]] that converts an - * HBase Result to a String + * Implementation of [[org.apache.spark.api.python.Converter]] that converts all + * the records in an HBase Result to a String */ class HBaseResultToStringConverter extends Converter[Any, String] { override def convert(obj: Any): String = { +import collection.JavaConverters._ val result = obj.asInstanceOf[Result] -Bytes.toStringBinary(result.value()) +val output = result.listCells.asScala.map(cell => +Map( + "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)), + "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)), + "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)), + "timestamp" -> cell.getTimestamp.toString, + "type" -> Type.codeToType(cell.getTypeByte).toString, + "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell)) +) +) +output.map(JSONObject(_).toString()).mkString("\n") --- End diff -- That makes sense. JSON will escape the `\n` in a String, so it's safe to have `\n` as the separator.
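The point about `\n` being a safe record separator is easy to verify: JSON string encoding escapes literal newlines inside values, so a newline in a cell value can never collide with the separator between records. A quick illustration in Python (the converter above uses Scala's `scala.util.parsing.json.JSONObject`, but the escaping guarantee is the same in any conformant JSON encoder):

```python
import json

records = [
    {"row": "r1", "value": "line1\nline2"},  # value contains a real newline
    {"row": "r2", "value": "plain"},
]

# join records with '\n', as the converter does with mkString("\n")
output = "\n".join(json.dumps(r) for r in records)

# inside the encoded strings the newline became the two characters '\' 'n',
# so splitting on real newlines recovers exactly one JSON object per record
parts = output.split("\n")
```

Parsing each part back yields the original values, embedded newlines intact.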
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75581853 You can implement this by expressing parallelism as a function of the parent RDD right? yeah you have to write the expression but does an alternative multiplier arg do much better? yeah mostly I'm questioning a global setting.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75584312 @srowen good point. I think a ratio argument is prettier than an expression, but arguably not enough to warrant clogging up the API.
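The alternative being discussed — computing the target partition count from the parent RDD at each call site, rather than via a global ratio setting — is essentially a one-liner for the caller. A sketch in plain Python (hypothetical helper; `parent_partitions` stands in for the parent RDD's partition count):

```python
import math

def target_partitions(parent_partitions, ratio=1.0, minimum=1):
    """Per-call parallelism as a function of the parent's partition count.
    The caller picks the ratio case by case instead of relying on one
    global setting that is unlikely to suit every job."""
    return max(minimum, int(math.ceil(parent_partitions * ratio)))

small = target_partitions(10, ratio=0.5)  # shrink output for a reduce
same = target_partitions(10)              # default ratio of 1.0 changes nothing
```

This mirrors the objection in the thread: with the default of 1.0 the ratio changes nothing, and any non-default value is a per-job decision the caller can already express.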
[GitHub] spark pull request: [SPARK-5950][SQL] Enable inserting array into ...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/4729 [SPARK-5950][SQL] Enable inserting array into Hive table saved as Parquet using DataSource API Currently `ParquetConversions` in `HiveMetastoreCatalog` does not really work. One reason is that the table is not part of the children nodes of `InsertIntoTable`, so the replacement never happens. When we create a Parquet table in Hive with an ARRAY field, `ArrayType` has `containsNull` set to true by default, and this affects the table's schema. But when inserting data into the table later, the schema of the inserted data can have `containsNull` as either true or false, which makes the insert/read fail. A similar problem is reported in https://issues.apache.org/jira/browse/SPARK-5508. Hive seems to support only arrays with nullable elements, so this PR enables the same behavior. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 hive_parquet Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4729.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4729 commit 4e3bd5568e644bc81e2539a917329486ea968a92 Author: Liang-Chi Hsieh vii...@gmail.com Date: 2015-02-23T17:03:30Z Enable inserting array into hive table saved as parquet using datasource.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r25183693 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java --- @@ -0,0 +1,684 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.launcher; + +import java.io.BufferedReader; +import java.io.File; +import java.io.FileFilter; +import java.io.FileInputStream; +import java.io.InputStreamReader; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Properties; +import java.util.jar.JarFile; +import java.util.regex.Pattern; + +import static org.apache.spark.launcher.CommandBuilderUtils.*; + +/** + * Launcher for Spark applications. + * <p/> + * Use this class to start Spark applications programmatically. The class uses a builder pattern + * to allow clients to configure the Spark application and launch it as a child process. + * <p/> + * Note that launching Spark applications using this class will not automatically load environment + * variables from the spark-env.sh or spark-env.cmd scripts in the configuration directory. + */ +public class SparkLauncher { + + /** The Spark master. */ + public static final String SPARK_MASTER = "spark.master"; + + /** Configuration key for the driver memory. */ + public static final String DRIVER_MEMORY = "spark.driver.memory"; + /** Configuration key for the driver class path. */ + public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath"; + /** Configuration key for the driver VM options. */ + public static final String DRIVER_EXTRA_JAVA_OPTIONS = "spark.driver.extraJavaOptions"; + /** Configuration key for the driver native library path. */ + public static final String DRIVER_EXTRA_LIBRARY_PATH = "spark.driver.extraLibraryPath"; + + /** Configuration key for the executor memory. */ + public static final String EXECUTOR_MEMORY = "spark.executor.memory"; --- End diff -- Yes. I tried to add the most common set of job config options here.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75593300 [Test build #27854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27854/consoleFull) for PR 4708 at commit [`b85c5fe`](https://github.com/apache/spark/commit/b85c5fe14fdece4769fc98bbedcba80252b325bf). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r25182187 --- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala --- @@ -18,20 +18,34 @@ package org.apache.spark.examples.pythonconverters import scala.collection.JavaConversions._ +import scala.util.parsing.json._ import org.apache.spark.api.python.Converter import org.apache.hadoop.hbase.client.{Put, Result} import org.apache.hadoop.hbase.io.ImmutableBytesWritable import org.apache.hadoop.hbase.util.Bytes +import org.apache.hadoop.hbase.KeyValue.Type +import org.apache.hadoop.hbase.CellUtil /** - * Implementation of [[org.apache.spark.api.python.Converter]] that converts an - * HBase Result to a String + * Implementation of [[org.apache.spark.api.python.Converter]] that converts all + * the records in an HBase Result to a String */ class HBaseResultToStringConverter extends Converter[Any, String] { override def convert(obj: Any): String = { +import collection.JavaConverters._ val result = obj.asInstanceOf[Result] -Bytes.toStringBinary(result.value()) +val output = result.listCells.asScala.map(cell => +Map( + "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)), + "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)), + "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)), + "timestamp" -> cell.getTimestamp.toString, + "type" -> Type.codeToType(cell.getTypeByte).toString, + "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell)) +) +) +output.map(JSONObject(_).toString()).mkString("\n") --- End diff -- Great! In fact, HBase itself will escape `\n` too. That's why I chose `\n` in the first place. Thanks!
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user ilganeli commented on a diff in the pull request: https://github.com/apache/spark/pull/4708#discussion_r25183148 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -830,39 +836,39 @@ class DAGScheduler( try { // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep). // For ResultTask, serialize and broadcast (rdd, func). - val taskBinaryBytes: Array[Byte] = -if (stage.isShuffleMap) { - closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array() -} else { - closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array() -} + val taskBinaryBytes: Array[Byte] = stage match { +case a: ShuffleMapStage => + closureSerializer.serialize((a.rdd, a.shuffleDep): AnyRef).array() +case b: ResultStage => + closureSerializer.serialize((b.rdd, b.resultOfJob.get.func): AnyRef).array() + } + taskBinary = sc.broadcast(taskBinaryBytes) } catch { // In the case of a failure during serialization, abort the stage. case e: NotSerializableException => abortStage(stage, "Task not serializable: " + e.toString) runningStages -= stage -return --- End diff -- This was a mistake introduced when I was doing the second round of refactoring (copying code back from when I pulled this all out to its own method). When this code is within its own method then we can just look at the return value of the method and the weird return breaks become unnecessary. I'll add a comment for these in the meantime.
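The refactor under review — replacing an `isShuffleMap` boolean on a single `Stage` class with a match on two dedicated subclasses — can be sketched in miniature. The names below are illustrative, not Spark's actual API; the point is that each branch only sees fields that exist on that stage type:

```python
class Stage:
    def __init__(self, rdd):
        self.rdd = rdd

class ShuffleMapStage(Stage):
    def __init__(self, rdd, shuffle_dep):
        super().__init__(rdd)
        self.shuffle_dep = shuffle_dep  # only shuffle stages carry a dep

class ResultStage(Stage):
    def __init__(self, rdd, func):
        super().__init__(rdd)
        self.func = func  # only result stages carry the result function

def task_binary(stage):
    """Dispatch on the subclass instead of checking a boolean flag,
    mirroring the `stage match { case ... }` in the diff above."""
    if isinstance(stage, ShuffleMapStage):
        return (stage.rdd, stage.shuffle_dep)
    elif isinstance(stage, ResultStage):
        return (stage.rdd, stage.func)
    raise TypeError("unknown stage type")

binary = task_binary(ShuffleMapStage("rdd0", "dep0"))
```

Compared with a boolean flag plus `Option.get` calls, the subclass split makes invalid field access a type error rather than a runtime failure.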
[GitHub] spark pull request: [SPARK-5939][MLLib] make FPGrowth example app ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4714#issuecomment-75580799 LGTM. Merged into master and branch-1.3. Thanks!
[GitHub] spark pull request: [SPARK-5950][SQL] Enable inserting array into ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4729#issuecomment-75588235 [Test build #27853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27853/consoleFull) for PR 4729 at commit [`4e3bd55`](https://github.com/apache/spark/commit/4e3bd5568e644bc81e2539a917329486ea968a92). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-75593031 @pwendell I see what you mean about compatibility. Let me play with the code a bit, it might not be hard to do something like that as part of this patch.
[GitHub] spark pull request: SPARK-5951
GitHub user zuxqoj opened a pull request: https://github.com/apache/spark/pull/4730 SPARK-5951 Remove unreachable driver memory properties in yarn client mode You can merge this pull request into a Git repository by running: $ git pull https://github.com/zuxqoj/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4730.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4730 commit 977dc967eb3f2e718df68729d614efc48a47c9da Author: mohit.goyal mohit.go...@guavus.com Date: 2015-02-23T17:35:24Z remove not rechable deprecated variables in yarn client mode
[GitHub] spark pull request: [SPARK-4340] [Core] add java opts argument sub...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3205#issuecomment-75581446 [Test build #27852 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27852/consoleFull) for PR 3205 at commit [`9f8db81`](https://github.com/apache/spark/commit/9f8db81cef7287a92b9752f2c09c01b3ddf0d8ac). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75580971 In general, a fixed number of partitions is very difficult to work with when configuring a shuffle. Suppose I have a job where I know a `flatMap` is going to blow up the size of my data by two. If I want to minimize reduce-side spilling in a shuffle that comes after the `flatMap`, I want the parallelism of the shuffle to be double that of the input stage. Because the size of my input data could change between different runs of my job, a ratio is a much more natural way to express my needs than a constant. It's unclear to me whether a global default is useful at all, but a configurable parallelism ratio per shuffle operation definitely is. (Systems like Crunch take this approach).
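The ratio idea in the comment above can be sketched in a few lines. This is a hypothetical helper, not an actual Spark or Crunch API: it derives the shuffle's partition count from the parent stage's partition count and a per-shuffle ratio.

```scala
// Hypothetical helper: reduce-side parallelism as a ratio of map-side parallelism,
// so the setting stays meaningful as input sizes change between runs.
def shufflePartitions(parentPartitions: Int, ratio: Double): Int =
  math.max(1, math.ceil(parentPartitions * ratio).toInt)

// A flatMap that roughly doubles the data suggests ratio = 2.0,
// regardless of how large the input happens to be on a given run.
val small = shufflePartitions(100, 2.0)  // 200
val large = shufflePartitions(5000, 2.0) // 10000
```

With a fixed partition count, the same job would be well-tuned on one input size and badly tuned on another; the ratio expresses the intent ("double the input stage's parallelism") directly.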
[GitHub] spark pull request: [SPARK-4340] [Core] add java opts argument sub...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3205#issuecomment-75581461 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27852/ Test PASSed.
[GitHub] spark pull request: [SPARK-5951][YARN] Remove unreachable driver m...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4730#issuecomment-75594177 Can one of the admins verify this patch?
[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...
GitHub user potix2 opened a pull request: https://github.com/apache/spark/pull/4725 [Examples] fix deprecated method use in HBaseTest HTableDescriptor(String name) is deprecated. https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/TableName.html You can merge this pull request into a Git repository by running: $ git pull https://github.com/potix2/spark fix-warning-hbase-example Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4725.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4725 commit f613b861afa037f78fc981933789cfc730c9a062 Author: Katsunori Kanda pot...@gmail.com Date: 2015-02-23T10:21:16Z [Examples] fix deprecated method use in HBaseTest
[GitHub] spark pull request: [SPARK-5944] [PySpark] fix version in Python A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4731#issuecomment-75629173 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27861/ Test PASSed.
[GitHub] spark pull request: [SPARK-5944] [PySpark] fix version in Python A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4731#issuecomment-75629156 [Test build #27861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27861/consoleFull) for PR 4731 at commit [`08cbc3f`](https://github.com/apache/spark/commit/08cbc3f2f6ea21ecfb491e89b521679d4fb24879). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5912] [docs] [mllib] Small fixes to Chi...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/4732 [SPARK-5912] [docs] [mllib] Small fixes to ChiSqSelector docs Fixes: * typo in Scala example * Removed comment usually applied on sparse data since that is debatable * small edits to text for clarity CC: @avulanov I noticed a typo post-hoc and ended up making a few small edits. Do the changes look OK? You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkbradley/spark chisqselector-docs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4732.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4732 commit 3f3f9f4968ff1a8f45be6dbaead54eb1ea6df406 Author: Joseph K. Bradley jos...@databricks.com Date: 2015-02-23T21:18:06Z small fixes to ChiSqSelector docs
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75637201 Looks pretty good to me, but left a few more comments. Also, please take a look at the various logging strings to see whether some of them can be expressed more readably using string interpolation.
[GitHub] spark pull request: [SPARK-5912] [docs] [mllib] Small fixes to Chi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4732#issuecomment-75637616 [Test build #27862 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27862/consoleFull) for PR 4732 at commit [`3f3f9f4`](https://github.com/apache/spark/commit/3f3f9f4968ff1a8f45be6dbaead54eb1ea6df406). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/4708#discussion_r25202931
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -210,40 +210,58 @@ class DAGScheduler(
    * The jobId value passed in will be used if the stage doesn't already exist with
    * a lower jobId (jobId always increases across jobs.)
    */
-  private def getShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): Stage = {
+  private def getShuffleMapStage(
+      shuffleDep: ShuffleDependency[_, _, _],
+      jobId: Int): ShuffleMapStage = {
     shuffleToMapStage.get(shuffleDep.shuffleId) match {
       case Some(stage) => stage
       case None =>
         // We are going to register ancestor shuffle dependencies
         registerShuffleDependencies(shuffleDep, jobId)
         // Then register current shuffleDep
-        val stage =
-          newOrUsedStage(
-            shuffleDep.rdd, shuffleDep.rdd.partitions.size, shuffleDep, jobId,
-            shuffleDep.rdd.creationSite)
+        val stage = newOrUsedShuffleStage(shuffleDep, jobId)
         shuffleToMapStage(shuffleDep.shuffleId) = stage
         stage
     }
   }

   /**
-   * Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
-   * of a shuffle map stage in newOrUsedStage. The stage will be associated with the provided
-   * jobId. Production of shuffle map stages should always use newOrUsedStage, not newStage
-   * directly.
+   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle map stage in
+   * newOrUsedShuffleStage. The stage will be associated with the provide jobId.
+   * Production of shuffle map stages should always use newOrUsedShuffleStage, not
+   * newShuffleMapStage directly.
    */
-  private def newStage(
+  private def newShuffleMapStage(
       rdd: RDD[_],
       numTasks: Int,
-      shuffleDep: Option[ShuffleDependency[_, _, _]],
+      shuffleDep: ShuffleDependency[_, _, _],
       jobId: Int,
-      callSite: CallSite)
-    : Stage =
-  {
+      callSite: CallSite): ShuffleMapStage = {
     val parentStages = getParentStages(rdd, jobId)
     val id = nextStageId.getAndIncrement()
-    val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, jobId, callSite)
+    val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, parentStages,
+      jobId, callSite, shuffleDep)
+
+    stageIdToStage(id) = stage
+    updateJobIdStageIdMaps(jobId, stage)
+    stage
+  }
+
+  /**
+   * Create a ResultStage -- either directly for use as a result stage, or as part of the
+   * (re)-creation of a shuffle map stage in newOrUsedShuffleStage. The stage will be associated
+   * with the provided jobId.
+   */
+  private def newResultStage(
+      rdd: RDD[_],
+      numTasks: Int,
+      jobId: Int,
+      callSite: CallSite): ResultStage = {
+    val parentStages = getParentStages(rdd, jobId)
+    val id = nextStageId.getAndIncrement()
+    val stage: ResultStage = new ResultStage(id, rdd, numTasks, parentStages, jobId, callSite)
+
--- End diff --
I'd rather avoid the code duplication in newShuffleMapStage and newResultStage. This can be done in generic fashion via runtime reflection:
```scala
import scala.reflect.runtime.{universe => ru}
...
private def newStage[T <: Stage : ru.TypeTag](
    rdd: RDD[_],
    numTasks: Int,
    shuffleDep: Option[ShuffleDependency[_, _, _]],
    jobId: Int,
    callSite: CallSite): T = {
  val m = ru.runtimeMirror(getClass.getClassLoader)
  val classT = ru.typeOf[T].typeSymbol.asClass
  val cm = m.reflectClass(classT)
  val ctor = ru.typeOf[T].declaration(ru.nme.CONSTRUCTOR).asMethod
  val ctorm = cm.reflectConstructor(ctor)
  val parentStages = getParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = shuffleDep.map { shufDep =>
    ctorm(id, rdd, numTasks, parentStages, jobId, callSite, shufDep)
  }.getOrElse(ctorm(id, rdd, numTasks, parentStages, jobId, callSite)).asInstanceOf[T]
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
...
val stage = newStage[ShuffleMapStage](rdd, numTasks, Some(shuffleDep), jobId, rdd.creationSite)
...
finalStage = newStage[ResultStage](finalRDD, partitions.size, None, jobId, callSite)
```
...but I'd want to see the performance numbers on that before deciding not to go with a less flexible approach that avoids reflection:
```scala
private def newStage[T <: Stage](
```
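The reflection-free approach the reviewer alludes to could simply take the stage constructor as a function, which keeps the shared bookkeeping in one place while letting the compiler check the constructor arguments. The sketch below uses simplified placeholder classes, not Spark's real stage hierarchy:

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark's stage classes.
class Stage(val id: Int, val name: String)
class ShuffleMapStage(id: Int, name: String, val shuffleId: Int) extends Stage(id, name)
class ResultStage(id: Int, name: String) extends Stage(id, name)

var nextStageId = 0
val stageIdToStage = mutable.Map[Int, Stage]()

// Shared bookkeeping lives here; each call site supplies the constructor,
// so no runtime mirror is needed and there is no reflection cost.
def newStage[T <: Stage](ctor: Int => T): T = {
  val id = nextStageId
  nextStageId += 1
  val stage = ctor(id)
  stageIdToStage(id) = stage
  stage
}

val mapStage = newStage(id => new ShuffleMapStage(id, "map", shuffleId = 7))
val resultStage = newStage(id => new ResultStage(id, "result"))
```

Compared with the `TypeTag`-based version, this trades a generic constructor lookup for one explicit lambda per call site, and fails at compile time rather than at runtime if a constructor signature changes.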
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75638966 Oops, did not realize that a test was still running (glad it passed)
[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4723#issuecomment-75594343 [Test build #27855 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27855/consoleFull) for PR 4723 at commit [`5381db1`](https://github.com/apache/spark/commit/5381db1ad833ab72a2eb15b0f30d745c1bfbe764). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/4708#discussion_r25186634
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -77,52 +71,9 @@ private[spark] class Stage(
   /** Pointer to the latest [StageInfo] object, set by DAGScheduler. */
   var latestInfo: StageInfo = StageInfo.fromStage(this)

-  def isAvailable: Boolean = {
-    if (!isShuffleMap) {
-      true
-    } else {
-      numAvailableOutputs == numPartitions
-    }
-  }
-
-  def addOutputLoc(partition: Int, status: MapStatus) {
-    val prevList = outputLocs(partition)
-    outputLocs(partition) = status :: prevList
-    if (prevList == Nil) {
-      numAvailableOutputs += 1
-    }
-  }
-
-  def removeOutputLoc(partition: Int, bmAddress: BlockManagerId) {
-    val prevList = outputLocs(partition)
-    val newList = prevList.filterNot(_.location == bmAddress)
-    outputLocs(partition) = newList
-    if (prevList != Nil && newList == Nil) {
-      numAvailableOutputs -= 1
-    }
-  }
+  var numAvailableOutputs = 0
--- End diff --
...for nextAttemptId, too.
[GitHub] spark pull request: [SPARK-5253] [ML] LinearRegression with L1/L2 ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4259#issuecomment-75600260 @dbtsai I'd like to make a pass over this, but I realized that it has conflicts because of the developer api PR committed last week: [https://github.com/apache/spark/pull/3637] Could you please rebase? I don't think there are any more big PRs coming up which will make you rebase again. Thank you!
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75604216 [Test build #27858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27858/consoleFull) for PR 4708 at commit [`d548caf`](https://github.com/apache/spark/commit/d548cafab4b6f36ee7e9bed696419567f0bc3d94). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75609741 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27854/ Test PASSed.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75610748 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27856/ Test PASSed.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75614718 [Test build #27857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull) for PR 4709 at commit [`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features. `
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/4708#discussion_r25186416 --- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala --- @@ -47,26 +47,20 @@ import org.apache.spark.util.CallSite * be updated for each attempt. --- End diff -- Remove unused BlockManagerId from imports
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75599168 [Test build #27857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull) for PR 4709 at commit [`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-75604198 [Test build #27859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27859/consoleFull) for PR 4048 at commit [`a1f1665`](https://github.com/apache/spark/commit/a1f16654a77caa3ef2e35d7e3ace830aa1708bdd). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75594276 [Test build #27856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27856/consoleFull) for PR 4708 at commit [`6da3a71`](https://github.com/apache/spark/commit/6da3a7101c3c8087a9a924b998889eb6e1b3446f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5950][SQL] Enable inserting array into ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4729#issuecomment-75597599 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27853/ Test FAILed.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/4708#discussion_r25186558
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -77,52 +71,9 @@ private[spark] class Stage(
   /** Pointer to the latest [StageInfo] object, set by DAGScheduler. */
   var latestInfo: StageInfo = StageInfo.fromStage(this)

-  def isAvailable: Boolean = {
-    if (!isShuffleMap) {
-      true
-    } else {
-      numAvailableOutputs == numPartitions
-    }
-  }
-
-  def addOutputLoc(partition: Int, status: MapStatus) {
-    val prevList = outputLocs(partition)
-    outputLocs(partition) = status :: prevList
-    if (prevList == Nil) {
-      numAvailableOutputs += 1
-    }
-  }
-
-  def removeOutputLoc(partition: Int, bmAddress: BlockManagerId) {
-    val prevList = outputLocs(partition)
-    val newList = prevList.filterNot(_.location == bmAddress)
-    outputLocs(partition) = newList
-    if (prevList != Nil && newList == Nil) {
-      numAvailableOutputs -= 1
-    }
-  }
+  var numAvailableOutputs = 0
--- End diff --
Add explicit type declaration
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/4708#discussion_r25188257
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -228,22 +227,41 @@ class DAGScheduler(
   }

   /**
-   * Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
-   * of a shuffle map stage in newOrUsedStage. The stage will be associated with the provided
-   * jobId. Production of shuffle map stages should always use newOrUsedStage, not newStage
-   * directly.
+   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle map stage in
+   * newOrUsedShuffleStage. The stage will be associated with the provided
+   * jobId. Production of shuffle map stages should always use newOrUsedShuffleStage,
+   * not newShuffleMapStage directly.
--- End diff --
nit: reformat a little...
```scala
  /**
   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle map stage in
   * newOrUsedShuffleStage. The stage will be associated with the provided jobId.
   * Production of shuffle map stages should always use newOrUsedShuffleStage, not
   * newShuffleMapStage directly.
   */
```
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4709#discussion_r25188678

--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,55 @@
 data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. The number of features to select can be determined using a validation set. Feature selection is usually applied to sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation on vectors.
+
+### ChiSqSelector
+ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector orders categorical features by the value of a Chi-Squared test of independence from the class label, and then selects (filters) the top requested features.
+
+#### Model Fitting
+
+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
+following parameter in the constructor:
+
+* `numTopFeatures`: the number of top features the selector will select (filter).
+
+We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
+`ChiSqSelector` which takes an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then
+returns a model which can transform the input dataset into the reduced feature space.
+
+This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer),
+which can apply the Chi-Squared feature selection to a `Vector` to produce a reduced `Vector`, or to
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs the actual feature filtering can also be instantiated independently from an array of feature indices, which must be sorted in ascending order.
+
+#### Example
+
+The following example shows the basic use of ChiSqSelector.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLUtils
+
+// load some data in libsvm format; each feature value is in the range 0..255
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// discretize the features into 16 equal bins
+val discretizedData = data.map { lp =>
+  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 }))
+}
+// create a ChiSqSelector that will select the top 50 features
+val selector = new ChiSqSelector(50)
+// fit the ChiSqSelector model to the data
+val transformer = selector.fit(discretizedData)
+// filter the data down to the top 50 features
+val filteredData = transformer.transform(discretizedData)
--- End diff --

Since transform() takes an RDD[Vector], you'll need to map the data to features, and then zip the transformed features with the labels.
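A sketch of the change the reviewer is asking for, reusing the variable names from the quoted example (the exact wiring here is an assumption, not code from an actual commit): extract the feature vectors before calling transform(), then zip the filtered features back together with the labels.

```scala
// transform() expects an RDD[Vector], not an RDD[LabeledPoint],
// so pull out the features first (hypothetical fix sketch):
val features = discretizedData.map(_.features)          // RDD[Vector]
val filteredFeatures = transformer.transform(features)  // RDD[Vector], reduced to 50 features
// re-attach each label to its filtered feature vector
val filteredData = discretizedData.map(_.label).zip(filteredFeatures).map {
  case (label, vector) => LabeledPoint(label, vector)
}
```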
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75603592 I think that last issue is the only one--thanks!
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75609726

[Test build #27854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27854/consoleFull) for PR 4708 at commit [`b85c5fe`](https://github.com/apache/spark/commit/b85c5fe14fdece4769fc98bbedcba80252b325bf).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75610731

[Test build #27856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27856/consoleFull) for PR 4708 at commit [`6da3a71`](https://github.com/apache/spark/commit/6da3a7101c3c8087a9a924b998889eb6e1b3446f).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75610561 Sorry for this, still sleeping...
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75611280

[Test build #27860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27860/consoleFull) for PR 4709 at commit [`19a8a4e`](https://github.com/apache/spark/commit/19a8a4e9b8c3b5607c87fb1eae19810f90b9ad6a).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5924] Add the ability to specify withMe...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4704#discussion_r25192351

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -29,7 +29,18 @@ import org.apache.spark.sql.types.{StructField, StructType}

 /**
  * Params for [[StandardScaler]] and [[StandardScalerModel]].
  */
-private[feature] trait StandardScalerParams extends Params with HasInputCol with HasOutputCol
+private[feature] trait StandardScalerParams extends Params with HasInputCol with HasOutputCol {
+  val withMean: BooleanParam = new BooleanParam(this,
--- End diff --

Add doc with `@group param`
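A sketch of what the requested doc comment could look like (the description text and the string arguments are my assumptions for illustration, not taken from the PR):

```scala
/**
 * Whether to center the data with mean before scaling.
 * @group param
 */
val withMean: BooleanParam = new BooleanParam(this, "withMean",
  "whether to center the data with mean before scaling")
```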
[GitHub] spark pull request: [SPARK-5944] fix version in Python API docs
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/4731

[SPARK-5944] fix version in Python API docs

use RELEASE_VERSION when building the Python API docs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark api_version

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4731.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4731

commit 08cbc3f2f6ea21ecfb491e89b521679d4fb24879
Author: Davies Liu dav...@databricks.com
Date: 2015-02-23T19:10:45Z

    fix python docs
[GitHub] spark pull request: [SPARK-5950][SQL] Enable inserting array into ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4729#issuecomment-75597588

[Test build #27853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27853/consoleFull) for PR 4729 at commit [`4e3bd55`](https://github.com/apache/spark/commit/4e3bd5568e644bc81e2539a917329486ea968a92).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Params(`
[GitHub] spark pull request: [SPARK-5927][MLlib] Modify FPGrowth's partitio...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4706#issuecomment-75599387 @viirya Your proposal definitely works better in some cases, while the current implementation works better in some others. I think we both agree on this. The question is which partitioning scheme fits real datasets better. I don't have a clear answer. If there are some standard benchmark datasets, we can compare the performance.
[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/4708#issuecomment-75602215 @JoshRosen I'm happy to take a look at this but won't be able to get to it until Friday. Feel free to merge it sooner than that if you're eager to get it in; otherwise I'll take a look Friday!
[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3626#issuecomment-75608187 @alanctgardner That will be great if you change it to predictProbabilities; thanks. I agree with what @jatinpreet was saying about the correctness, and with @srowen 's comment on how to fix it: The value of ```brzPi + brzTheta * testData.toBreeze``` is a log probability, which needs to be exponentiated before you normalize it here: [https://github.com/apache/spark/pull/3626/files?diff=split#diff-6d8eff78be2fb624d4a076db334208a4R84] Could you please rebase off of master and make these couple of updates? After that, I can make a final pass. Thanks!
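The fix being described can be sketched in plain Scala (a standalone, hypothetical helper, not the actual NaiveBayesModel code): since `brzPi + brzTheta * testData.toBreeze` yields log-scale class scores, exponentiate them before normalizing, shifting by the maximum first so that `math.exp` cannot overflow.

```scala
object PredictProbabilitiesSketch {
  // Turn unnormalized log class scores into probabilities that sum to 1.
  // Subtracting the max first is the standard log-sum-exp trick for
  // numerical stability; it cancels out in the final normalization.
  def fromLogScores(logScores: Array[Double]): Array[Double] = {
    val max = logScores.max
    val exps = logScores.map(s => math.exp(s - max))
    val total = exps.sum
    exps.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // log(0.2), log(0.6), log(0.2) should round-trip back to 0.2, 0.6, 0.2
    val probs = fromLogScores(Array(0.2, 0.6, 0.2).map(math.log))
    println(probs.mkString(", "))
  }
}
```

Normalizing the raw log scores directly (as the original patch did) would instead divide logarithms, which does not produce a probability distribution.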
[GitHub] spark pull request: [SPARK-5924] Add the ability to specify withMe...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4704#discussion_r25192356

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -44,12 +55,18 @@ class StandardScaler extends Estimator[StandardScalerModel] with StandardScalerP
   /** @group setParam */
   def setOutputCol(value: String): this.type = set(outputCol, value)
-
+
+  /** @grour setParam */
--- End diff --

`@group`
[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4723#issuecomment-75610987

[Test build #27855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27855/consoleFull) for PR 4723 at commit [`5381db1`](https://github.com/apache/spark/commit/5381db1ad833ab72a2eb15b0f30d745c1bfbe764).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4723#issuecomment-75610996 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27855/