[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4725#issuecomment-75521453
  
Ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4725#issuecomment-75519690
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-5724] fix the misconfiguration in AkkaU...

2015-02-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4512





[GitHub] spark pull request: [SPARK-5943][Streaming] Update the test to use...

2015-02-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4722





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-23 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r25155799
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters
 
 import scala.collection.JavaConversions._
+import scala.util.parsing.json._
 
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      Map(
+        "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
+        "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        "timestamp" -> cell.getTimestamp.toString,
+        "type" -> Type.codeToType(cell.getTypeByte).toString,
+        "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.map(JSONObject(_).toString()).mkString("\n")
--- End diff --

`output` is a `Buffer[Map[String, String]]`, since there are several records 
in an HBase Result.  
However, `JSONObject` has only one constructor, `JSONObject(obj: Map[String, 
Any])`, so `JSONObject(output).toString()` would fail to compile. ^^
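The compile-time distinction can be sketched in plain Scala. Since `scala.util.parsing.json` is not available in newer Scala versions, the sketch below hand-rolls a minimal per-record serializer; `JsonPerRecord` and `toJsonObject` are hypothetical names for illustration, not part of the PR:

```scala
// Sketch: serialize each record's Map to its own JSON object, then join with
// newlines -- mirroring output.map(JSONObject(_).toString()).mkString("\n").
// toJsonObject is a hypothetical stand-in for scala.util.parsing.json.JSONObject.
object JsonPerRecord {
  def toJsonObject(m: Map[String, String]): String =
    m.map { case (k, v) => s"\"$k\": \"$v\"" }.mkString("{", ", ", "}")

  def render(records: Seq[Map[String, String]]): String =
    // One JSON object per HBase cell. JSONObject(records) would not compile,
    // because records is a Seq, not a Map[String, Any].
    records.map(toJsonObject).mkString("\n")

  def main(args: Array[String]): Unit = {
    val cells = Seq(
      Map("row" -> "r1", "qualifier" -> "q1", "value" -> "v1"),
      Map("row" -> "r1", "qualifier" -> "q2", "value" -> "v2")
    )
    println(render(cells))
  }
}
```

This is also why the per-element `map` is needed: each `Map[String, String]` individually satisfies the constructor's expected type, while the enclosing `Buffer` does not.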





[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...

2015-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4725#discussion_r25157906
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala 
---
@@ -36,7 +36,7 @@ object HBaseTest {
 // Initialize hBase table if necessary
 val admin = new HBaseAdmin(conf)
 if (!admin.isTableAvailable(args(0))) {
-  val tableDesc = new HTableDescriptor(args(0))
+  val tableDesc = new HTableDescriptor(TableName.valueOf(args(0)))
--- End diff --

Do you happen to know how long ago this constructor was added? I want to 
figure out if this makes it incompatible with any HBase >= 0.98.7, which is 
presumably the earliest version kind of 'supported' by the examples.

Are there other deprecations in the HBase examples that can be improved? I 
suspect the examples were written for HBase ~0.94.x.





[GitHub] spark pull request: [SPARK-5943][Streaming] Update the test to use...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4722#issuecomment-75526163
  
LGTM. The new method is in branch-1.3, so can be back-ported, and I think 
this qualifies as a good tiny fix. I verified these are all the occurrences.





[GitHub] spark pull request: [SPARK-4730][YARN] Warn against deprecated YAR...

2015-02-23 Thread zuxqoj
Github user zuxqoj commented on a diff in the pull request:

https://github.com/apache/spark/pull/3590#discussion_r25155375
  
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -78,11 +79,25 @@ private[spark] class YarnClientSchedulerBackend(
       ("--queue", "SPARK_YARN_QUEUE", "spark.yarn.queue"),
       ("--name", "SPARK_YARN_APP_NAME", "spark.app.name")
     )
+    // Warn against the following deprecated environment variables: env var -> suggestion
+    val deprecatedEnvVars = Map(
+      "SPARK_MASTER_MEMORY" -> "SPARK_DRIVER_MEMORY or --driver-memory through spark-submit",
+      "SPARK_WORKER_INSTANCES" -> "SPARK_WORKER_INSTANCES or --num-executors through spark-submit",
+      "SPARK_WORKER_MEMORY" -> "SPARK_EXECUTOR_MEMORY or --executor-memory through spark-submit",
+      "SPARK_WORKER_CORES" -> "SPARK_EXECUTOR_CORES or --executor-cores through spark-submit")
+    // Do the same for deprecated properties: property -> suggestion
+    val deprecatedProps = Map("spark.master.memory" -> "--driver-memory through spark-submit")
--- End diff --

`SPARK_MASTER_MEMORY` and `spark.master.memory` are not applicable in 
yarn-client mode and should be removed; please refer to SPARK-1953.
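Whichever entries end up in the map, the mechanics of the warning pass can be sketched in plain Scala, independent of YARN. The object and method names below are illustrative, not the actual YarnClientSchedulerBackend code, and per SPARK-1953 this sketch omits `SPARK_MASTER_MEMORY`:

```scala
// Sketch of warning against deprecated env vars: for each deprecated variable
// that is actually set in the given environment, emit a message pointing at
// its replacement. Passing the env map in makes the logic testable.
object DeprecationWarnings {
  val deprecatedEnvVars: Map[String, String] = Map(
    "SPARK_WORKER_INSTANCES" -> "SPARK_WORKER_INSTANCES or --num-executors through spark-submit",
    "SPARK_WORKER_MEMORY"    -> "SPARK_EXECUTOR_MEMORY or --executor-memory through spark-submit",
    "SPARK_WORKER_CORES"     -> "SPARK_EXECUTOR_CORES or --executor-cores through spark-submit")

  def warnings(env: Map[String, String]): Seq[String] =
    deprecatedEnvVars.collect {
      case (envVar, suggestion) if env.contains(envVar) =>
        s"NOTE: $envVar is deprecated. Use $suggestion instead."
    }.toSeq

  def main(args: Array[String]): Unit =
    warnings(sys.env).foreach(println)
}
```

Only variables that are actually set produce output, so users with a clean environment see nothing.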





[GitHub] spark pull request: [SPARK-5174][SPARK-5175] provide more APIs in ...

2015-02-23 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/3984#issuecomment-75528392
  
sure, thanks





[GitHub] spark pull request: [SPARK-5802][MLLIB] cache transformed data in ...

2015-02-23 Thread joshdevins
Github user joshdevins commented on the pull request:

https://github.com/apache/spark/pull/4593#issuecomment-75540678
  
I have the same concern as @dbtsai. Most consumers of this API will already 
be caching their dataset before the learning phase. Without care on the 
user's side, this effectively introduces double caching (in terms of the data 
size of cached RDDs) and will cause many jobs to fail after upgrading, by 
exceeding the heap available for the RDD cache. Furthermore, we are making 
assumptions about how to cache -- in-memory only in this case. Should we 
parameterise this? Perhaps that would help signal in the API that caching is 
also done before learning. (FWIW, in-memory is definitely the right default 
choice here.)

See the email thread on dev where I specifically encountered this bug: 

http://mail-archives.apache.org/mod_mbox/spark-dev/201502.mbox/%3CCAH5MZvMBjqOST-9Nr9k1z1rUODfSiczr_fV9kwqDFqAMNLC2Zw%40mail.gmail.com%3E






[GitHub] spark pull request: [SPARK-5802][MLLIB] cache transformed data in ...

2015-02-23 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/4593#issuecomment-75550855
  
@dbtsai, @joshdevins here's the issue I have. I'm using the new ml pipeline 
with hyperparameter grid search. Because the folds don't depend on the 
hyperparameters, I've reimplemented LogisticRegression a bit so that it does 
not unpersist the data:
```scala
class CustomLogisticRegression extends LogisticRegression {
  var oldInstances: RDD[LabeledPoint] = null

  override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
    println(s"Fitting dataset ${dataset.id} with ParamMap $paramMap.")
    transformSchema(dataset.schema, paramMap, logging = true)
    import dataset.sqlContext._
    val map = this.paramMap ++ paramMap
    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
      .map {
        case Row(label: Double, features: Vector) =>
          LabeledPoint(label, features)
      }

    // For parallel grid search
    this.synchronized({
      if (oldInstances == null || oldInstances.id != instances.id) {
        if (oldInstances != null) {
          oldInstances.unpersist()
        }
        oldInstances = instances
        instances.setName(s"Instances for LR with ParamMap $paramMap and RDD ${dataset.id}")
        instances.persist(StorageLevel.MEMORY_AND_DISK)
      }
    })

    val lr = (new LogisticRegressionWithLBFGS)
      .setValidateData(false)

    lr.optimizer
      .setRegParam(map(regParam))
      .setNumIterations(map(maxIter))
    val lrOldModel = lr.run(instances)
    val lrm = new LogisticRegressionModel(this, map, lrOldModel.weights)
    // instances.unpersist()
    // copy model params
    Params.inheritValues(map, this, lrm)
    lrm
  }
}
```

Then, for 3 folds in cross-validation and 3 hyperparameter values for 
LogisticRegression, I get something like this:

```
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.5
}
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.1
}
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.01
}

Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.5
}
Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.1
}
Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.01
}
```

So persistence at the model level is needed to cache the folds for the 
hyperparameter grid search, but persistence at the GLM level is needed to 
speed up the StandardScaler transformation, etc. I don't yet know how to do 
this efficiently without double caching.
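A common way to avoid the double caching described above is for the algorithm to persist only when the caller has not already cached the input, and to unpersist only what it persisted itself. The sketch below models that guard with a minimal stand-in class so it runs without a Spark cluster; `FakeRDD`, `cacheIfNeeded`, and the `StorageLevel` objects are illustrative, not Spark API (in real Spark the check would be `data.getStorageLevel == StorageLevel.NONE`):

```scala
// Sketch: only cache inside the algorithm when the caller hasn't already,
// and report whether we did, so we know whether to unpersist afterwards.
object CacheGuard {
  sealed trait StorageLevel
  case object NONE extends StorageLevel
  case object MEMORY_ONLY extends StorageLevel
  case object MEMORY_AND_DISK extends StorageLevel

  // Tiny stand-in for an RDD that exposes its storage level.
  final class FakeRDD(var level: StorageLevel = NONE) {
    def getStorageLevel: StorageLevel = level
    def persist(l: StorageLevel): this.type = { level = l; this }
  }

  /** Persist data at `level` only if it is not cached yet. Returns true if
    * this call did the persisting (and so should also unpersist later). */
  def cacheIfNeeded(data: FakeRDD, level: StorageLevel = MEMORY_ONLY): Boolean =
    if (data.getStorageLevel == NONE) { data.persist(level); true }
    else false
}
```

Training code would call `val weCached = cacheIfNeeded(instances)` up front and unpersist at the end only when `weCached` is true, leaving caller-managed caching untouched; making the storage level a parameter with a default also addresses the "should we parameterise this?" question from the earlier thread.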





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread xunyuw
Github user xunyuw closed the pull request at:

https://github.com/apache/spark/pull/4728





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread xunyuw
GitHub user xunyuw opened a pull request:

https://github.com/apache/spark/pull/4728

Merge pull request #1 from apache/master

SYNC 2015-02-08 20:00

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xunyuw/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4728.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4728


commit 027f3928651aede18758471cf75b20230bc434fc
Author: Xunyu Wang xunyu.w...@hotmail.com
Date:   2015-02-08T12:34:48Z

Merge pull request #1 from apache/master

SYNC 2015-02-08 20:00







[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...

2015-02-23 Thread potix2
Github user potix2 closed the pull request at:

https://github.com/apache/spark/pull/4725





[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...

2015-02-23 Thread potix2
Github user potix2 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4725#discussion_r25161690
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala 
---
@@ -36,7 +36,7 @@ object HBaseTest {
 // Initialize hBase table if necessary
 val admin = new HBaseAdmin(conf)
 if (!admin.isTableAvailable(args(0))) {
-  val tableDesc = new HTableDescriptor(args(0))
+  val tableDesc = new HTableDescriptor(TableName.valueOf(args(0)))
--- End diff --

Sorry, I didn't know when that constructor was added. I understand that my 
proposal could break compatibility with the earliest supported version.
Since there are no other deprecations in the HBase examples, I'll close this PR.





[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...

2015-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4725#discussion_r25162029
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala 
---
@@ -36,7 +36,7 @@ object HBaseTest {
 // Initialize hBase table if necessary
 val admin = new HBaseAdmin(conf)
 if (!admin.isTableAvailable(args(0))) {
-  val tableDesc = new HTableDescriptor(args(0))
+  val tableDesc = new HTableDescriptor(TableName.valueOf(args(0)))
--- End diff --

@potix2 no, it may be just fine; I was asking you to check. It would be good 
to know when the new method was added, to make sure this doesn't needlessly 
break recent versions, but I agree with this change as long as the constructor 
was available in >= 0.98.7, and preferably a few versions before that.





[GitHub] spark pull request: Merge pull request #2 from apache/master

2015-02-23 Thread xunyuw
GitHub user xunyuw reopened a pull request:

https://github.com/apache/spark/pull/4727

Merge pull request #2 from apache/master

SYNC 2015-02-23 20:00

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xunyuw/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4727


commit 027f3928651aede18758471cf75b20230bc434fc
Author: Xunyu Wang xunyu.w...@hotmail.com
Date:   2015-02-08T12:34:48Z

Merge pull request #1 from apache/master

SYNC 2015-02-08 20:00







[GitHub] spark pull request: Merge pull request #2 from apache/master

2015-02-23 Thread xunyuw
Github user xunyuw closed the pull request at:

https://github.com/apache/spark/pull/4727





[GitHub] spark pull request: Merge pull request #2 from apache/master

2015-02-23 Thread xunyuw
GitHub user xunyuw opened a pull request:

https://github.com/apache/spark/pull/4727

Merge pull request #2 from apache/master

SYNC 2015-02-23 20:00

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xunyuw/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4727


commit 027f3928651aede18758471cf75b20230bc434fc
Author: Xunyu Wang xunyu.w...@hotmail.com
Date:   2015-02-08T12:34:48Z

Merge pull request #1 from apache/master

SYNC 2015-02-08 20:00







[GitHub] spark pull request: Merge pull request #2 from apache/master

2015-02-23 Thread xunyuw
Github user xunyuw closed the pull request at:

https://github.com/apache/spark/pull/4727





[GitHub] spark pull request: [SPARK-3147][MLLib] A/B testing

2015-02-23 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/4716#issuecomment-75543022
  
`[error]  * abstract method numDim()Int in interface 
org.apache.spark.mllib.stat.MultivariateStatisticalSummary does not have a 
correspondent in old version`

Would it be bett





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread xunyuw
Github user xunyuw closed the pull request at:

https://github.com/apache/spark/pull/4726





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread xunyuw
Github user xunyuw commented on the pull request:

https://github.com/apache/spark/pull/4726#issuecomment-75531993
  
SYNC 2015-02-23 20:00





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4726#issuecomment-75532001
  
Mind closing this PR?





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread xunyuw
GitHub user xunyuw opened a pull request:

https://github.com/apache/spark/pull/4726

Merge pull request #1 from apache/master

SYNC 2015-02-23 20:00

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xunyuw/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4726.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4726


commit 027f3928651aede18758471cf75b20230bc434fc
Author: Xunyu Wang xunyu.w...@hotmail.com
Date:   2015-02-08T12:34:48Z

Merge pull request #1 from apache/master

SYNC 2015-02-08 20:00







[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread xunyuw
Github user xunyuw closed the pull request at:

https://github.com/apache/spark/pull/4726





[GitHub] spark pull request: [SPARK 5280] RDF Loader added + documentation

2015-02-23 Thread lukovnikov
Github user lukovnikov commented on the pull request:

https://github.com/apache/spark/pull/4650#issuecomment-75533258
  
@maropu tests are added and build tests passed. Is it ready for merging now?





[GitHub] spark pull request: [SPARK-2087] [SQL] Multiple thriftserver sessi...

2015-02-23 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/4382#issuecomment-75535316
  
/cc @liancheng can you review this for me?





[GitHub] spark pull request: [SPARK-3147][MLLib] A/B testing

2015-02-23 Thread feynmanliang
Github user feynmanliang closed the pull request at:

https://github.com/apache/spark/pull/4716





[GitHub] spark pull request: [SPARK-3147][MLLib] A/B testing

2015-02-23 Thread feynmanliang
GitHub user feynmanliang reopened a pull request:

https://github.com/apache/spark/pull/4716

[SPARK-3147][MLLib] A/B testing

Implementation of A/B testing using Streaming API.

This contribution is my original work and I license the work to the project 
under the project's open source license.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/feynmanliang/spark ab_testing

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4716.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4716


commit 105401a89216516565236f59a66a22cc91830686
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-10T19:36:27Z

Add broken implementation of AB testing.

commit cb73e790c435a4819fb62bc6c37717f4b882aee4
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-10T21:07:29Z

Fix AB testing implementation and add unit tests.

commit e0d5beccf54914ebdc5663dbe4ba71944f3183e2
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-10T22:54:26Z

Extract t-testing code out of OnlineABTesting.

commit 2100de641a2e86efeaa0f559500c7ced6f7d51a9
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T04:56:30Z

Add peace period for dropping first k entries of each A/B group.

commit 708380e980ed46ac1beb7665f7854fcf36ebc403
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T05:09:18Z

Add numDim to MultivariateOnlineSummarizer.

commit ec7f700fbca15d84bba126edaaa50d53ce5fc7be
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T06:02:41Z

Refactored ABTestingMethod into sealed trait.

commit 3f19e15aa3b7056262b601686643ed962846cdc3
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T06:29:49Z

Add (non-sliding) testing window functionality.

commit c56f9237aa81a70e8572e2ecb851ebaf5cdfa473
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T15:19:46Z

Fix peace period implementation.

commit 0d738815eb1cd49096112d8be7e9124345af0604
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T17:31:05Z

Fix test window batching.

commit abf59d5e8f817f847af77aef7514fb740dbbf69d
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T17:56:15Z

Handle (inelegantly) closure capture for ABTestMethod

commit e05eaaf3bb21bbed4c123d9ec6514e84ae75adcb
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T18:20:19Z

Improve handling of OnlineABTestMethod closure by moving DStream processing 
method into Serializable class.

commit 964a555746273a3afa542e34fdc6b86be60a5db9
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T18:52:37Z

Fixed flaky peacePeriod test.

commit 79c1d44c6232b0a4af5df4dc14cdc83919cfdea9
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-11T20:39:58Z

Add ScalaDocs and format to style guide.

commit e030c12337dce99abcf26f7d02c5d00a78f58c9b
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-12T00:02:20Z

Add OnlineABTestExample.

commit e8e1f82b16fbdd8446e21b32bb39b413e1ae30d1
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-12T00:03:12Z

Format code to style guide.

commit 17eef4eb22d918198dd03f2a931f009863fadcf5
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-19T04:43:36Z

Switch MultivariateOnlineSummarizer to univariate StatsCounter.

commit a2ad38be8a77eef045581282b3dbc9d6a1544870
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-19T14:45:15Z

Reduce number of passes in pairSummaries.

commit 4bb8636e5317a542ff0b29270548bd933199c6eb
Author: Feynman Liang feynman.li...@gmail.com
Date:   2015-01-19T14:45:41Z

Add test for behavior when missing data from one group.
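The commit list above sketches the pieces of the streaming A/B test: per-group summaries, a "peace period" that drops the first k entries of each group, and a t-test over the remainder. As a rough illustration of the statistic involved (not the PR's Scala code — the data and helper names below are invented), here is Welch's two-sample t over two groups after a peace period:

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def peace_period(xs, k):
    """Drop the first k observations of a group, as in the commits above."""
    return xs[k:]

group_a = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1]
group_b = [1.5, 1.6, 1.4, 1.7, 1.5, 1.6]
t = welch_t(peace_period(group_a, 2), peace_period(group_b, 2))
```

A large |t| suggests the group means differ; in a streaming setting this statistic would be recomputed per batch or over a sliding test window.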







[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4726#issuecomment-75532361
  
Can one of the admins verify this patch?





[GitHub] spark pull request: Merge pull request #1 from apache/master

2015-02-23 Thread xunyuw
GitHub user xunyuw reopened a pull request:

https://github.com/apache/spark/pull/4726

Merge pull request #1 from apache/master

SYNC 2015-02-23 20:00

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xunyuw/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4726.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4726


commit 027f3928651aede18758471cf75b20230bc434fc
Author: Xunyu Wang xunyu.w...@hotmail.com
Date:   2015-02-08T12:34:48Z

Merge pull request #1 from apache/master

SYNC 2015-02-08 20:00







[GitHub] spark pull request: [SPARK-5926] [SQL] make DataFrame.explain leve...

2015-02-23 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/4707#issuecomment-75536583
  
retest please





[GitHub] spark pull request: [WIP][SPARK-4902][CORE] gap-sampling performan...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3744#issuecomment-75561628
  
@witgo is this still live and have you followed up on Xiangrui's comment?





[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3626#issuecomment-75564514
  
@alanctgardner have you had a look at @jkbradley's feedback? I'm wondering whether this is still live. It needs a rebase if so.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75563929
  
I am also not clear that this is a good thing. As a default, it doesn't change anything. There is probably not a globally correct ratio, even if it's not 1, but this setting implies there is one. Is there evidence that a default other than 1.0 is better in most cases? The docs don't even suggest what the tradeoff is here.

Won't this potentially cause more shuffles when the ratio is not 1? I think 
this is something that must be set on a case-by-case basis, and that can 
already be done, even as a function of the parent RDD partitions, by the caller.

Can we elaborate on this or close it?





[GitHub] spark pull request: [SPARK-4006] Block Manager - Double Register C...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/2854#issuecomment-75564774
  
Mind closing this PR?





[GitHub] spark pull request: [SPARK-4340] [Core] add java opts argument sub...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3205#issuecomment-75564995
  
ok to test





[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...

2015-02-23 Thread alanctgardner
Github user alanctgardner commented on a diff in the pull request:

https://github.com/apache/spark/pull/3626#discussion_r25172516
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -65,6 +66,25 @@ class NaiveBayesModel private[mllib] (
   override def predict(testData: Vector): Double = {
     labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
   }
+
+  def classProbabilities(testData: RDD[Vector]):
--- End diff --

Sorry for the delay, I have no strong preference, but `predictProbabilities` makes sense for consistency. I can make that change and the style ones mentioned.

My stats background is not super-strong; @jatinpreet seemed to imply there's a correctness issue with this PR. Can anyone comment on whether I've got the math wrong?
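For readers following the naming discussion: a `predictProbabilities`-style method for naive Bayes normalizes the joint log-likelihoods (the `brzPi + brzTheta * x` term in the diff above) into per-class probabilities. The sketch below is a hedged illustration of that math in plain Python, not MLlib's code; the model values are invented:

```python
import math

def predict_probabilities(log_prior, log_theta, x):
    """Per-class posterior probabilities for multinomial naive Bayes."""
    # Joint log-likelihood per class: log pi_c + sum_i theta_{c,i} * x_i
    joint = [lp + sum(t_i * x_i for t_i, x_i in zip(row, x))
             for lp, row in zip(log_prior, log_theta)]
    # Normalize with log-sum-exp for numerical stability.
    m = max(joint)
    log_z = m + math.log(sum(math.exp(j - m) for j in joint))
    return [math.exp(j - log_z) for j in joint]

# Illustrative two-class, two-feature model (not real fitted values).
log_prior = [math.log(0.6), math.log(0.4)]
log_theta = [[math.log(0.7), math.log(0.3)],
             [math.log(0.2), math.log(0.8)]]
probs = predict_probabilities(log_prior, log_theta, [3.0, 1.0])
```

The log-sum-exp step is what makes this numerically safe when the joint log-likelihoods are very negative.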





[GitHub] spark pull request: [SPARK-4340] [Core] add java opts argument sub...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3205#issuecomment-75565369
  
  [Test build #27852 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27852/consoleFull)
 for   PR 3205 at commit 
[`9f8db81`](https://github.com/apache/spark/commit/9f8db81cef7287a92b9752f2c09c01b3ddf0d8ac).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3147][MLLib] A/B testing

2015-02-23 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4716#issuecomment-75580259
  
Let's remove `numDim`.





[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...

2015-02-23 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/4723#issuecomment-75581979
  
Hi @tdas, do we need to add a Python version of `createRDD` for the direct Kafka stream? It seems this API requires a Python wrapper of a Java object like `OffsetRange`.
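A minimal, hypothetical sketch of what a Python-side mirror of `OffsetRange` might carry — a (topic, partition) pair plus a half-open offset span. The field and method names here are invented for illustration; the thread does not define the real wrapper's API:

```python
class OffsetRange(object):
    """Hypothetical Python mirror of a Kafka offset range: a
    (topic, partition) pair plus the half-open span
    [fromOffset, untilOffset)."""

    def __init__(self, topic, partition, fromOffset, untilOffset):
        if fromOffset > untilOffset:
            raise ValueError("fromOffset must not exceed untilOffset")
        self.topic = topic
        self.partition = partition
        self.fromOffset = fromOffset
        self.untilOffset = untilOffset

    def count(self):
        # Number of messages covered by this range.
        return self.untilOffset - self.fromOffset

r = OffsetRange("test", 0, 100, 250)
```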





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-23 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3920#issuecomment-75582752
  
@GenTang  This PR looks good to me now, thanks

@JoshRosen I think it's ready to go.





[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...

2015-02-23 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3916#discussion_r25183946
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java 
---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Command builder for internal Spark classes.
+ * <p/>
+ * This class handles building the command to launch all internal Spark classes except for
+ * SparkSubmit (which is handled by the {@link SparkSubmitCommandBuilder} class).
+ */
+class SparkClassCommandBuilder extends SparkLauncher implements CommandBuilder {
--- End diff --

Yes, that part is sort of weird. But it's the only way to expose all the 
methods that should be public without having a public abstract base class like 
before. So it's kinda the best solution I have if SparkLauncher is to remain 
public; if it's not, we can break the common parts into an abstract class.





[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...

2015-02-23 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3916#discussion_r25184021
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/CommandBuilder.java ---
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Internal interface that defines a command builder.
+ */
+interface CommandBuilder {
--- End diff --

`Main.java` actually uses `CommandBuilder`.





[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...

2015-02-23 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/4723#discussion_r25178444
  
--- Diff: examples/src/main/python/streaming/direct_kafka_wordcount.py ---
@@ -0,0 +1,55 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+ Counts words in UTF8 encoded, '\n' delimited text directly received from Kafka in every 2 seconds.
+ Usage: direct_kafka_wordcount.py <broker_list> <topic>
+
+ To run this on your local machine, you need to setup Kafka and create a producer first, see
+ http://kafka.apache.org/documentation.html#quickstart
+
+ and then run the example
+    `$ bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/\
+      spark-streaming-kafka-assembly-*.jar \
+      examples/src/main/python/streaming/direct_kafka_wordcount.py \
+      localhost:9092 test`
+"""
+
+import sys
+
+from pyspark import SparkContext
+from pyspark.streaming import StreamingContext
+from pyspark.streaming.kafka import KafkaUtils
+
+if __name__ == "__main__":
+    if len(sys.argv) != 3:
+        print >> sys.stderr, "Usage: direct_kafka_wordcount.py <broker_list> <topic>"
+        exit(-1)
+
+    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
+    ssc = StreamingContext(sc, 2)
+
+    brokers, topic = sys.argv[1:]
+    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
--- End diff --

Hi @davies, thanks for your comment. I will add this as an argument.





[GitHub] spark pull request: [SPARK-5939][MLLib] make FPGrowth example app ...

2015-02-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4714





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-23 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r25180080
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters
 
 import scala.collection.JavaConversions._
+import scala.util.parsing.json._
 
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that 
converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that 
converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+        Map(
+          "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
+          "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+          "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+          "timestamp" -> cell.getTimestamp.toString,
+          "type" -> Type.codeToType(cell.getTypeByte).toString,
+          "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
+        )
+    )
+    output.map(JSONObject(_).toString()).mkString("\n")
--- End diff --

That makes sense. JSON will escape the `\n` in a String, so it's safe to use `\n` as the separator.
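The point about `\n` being safe can be checked directly: JSON serializers escape an embedded newline as the two characters `\n`, so each serialized record stays on one physical line and a literal newline between records is unambiguous. A small illustration (sample records are invented):

```python
import json

records = [
    {"row": "r1", "value": "line1\nline2"},  # value holds a raw newline
    {"row": "r2", "value": "plain"},
]
# json.dumps escapes the embedded newline as the two characters "\n",
# so every serialized record occupies exactly one physical line and a
# newline between records is an unambiguous separator.
blob = "\n".join(json.dumps(r, sort_keys=True) for r in records)
lines = blob.split("\n")
```

Splitting the blob on `"\n"` recovers exactly one JSON document per record, and parsing restores the original embedded newline.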





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75581853
  
You can implement this by expressing parallelism as a function of the parent RDD, right? Yes, you have to write the expression yourself, but does an alternative multiplier argument do much better? Mostly I'm questioning a global setting.
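Caller-side, the same effect needs no global setting: compute the child partition count from the parent's. A minimal sketch of that idea — `target_partitions` is a hypothetical helper, not a Spark API:

```python
def target_partitions(parent_partitions, ratio=1.0, minimum=1):
    """Derive a child partition count from the parent RDD's, clamped to
    a sane minimum. Equivalent in spirit to a per-call parallelism
    ratio, without introducing a global config knob."""
    return max(minimum, int(parent_partitions * ratio))

# e.g. rdd.reduceByKey(f, target_partitions(rdd.getNumPartitions(), 0.5))
half = target_partitions(200, ratio=0.5)   # 100
floor = target_partitions(3, ratio=0.1)    # clamped to 1
```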





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-23 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75584312
  
@srowen good point.  I think a ratio argument is prettier than an 
expression, but arguably not enough to warrant clogging up the API.





[GitHub] spark pull request: [SPARK-5950][SQL] Enable inserting array into ...

2015-02-23 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/4729

[SPARK-5950][SQL] Enable inserting array into Hive table saved as Parquet 
using DataSource API

Currently `ParquetConversions` in `HiveMetastoreCatalog` does not really work. One reason is that the table is not among the child nodes of `InsertIntoTable`, so the replacement never happens.

When we create a Parquet table in Hive with an ARRAY field, `ArrayType` has `containsNull` set to true by default, and this becomes part of the table's schema. But when inserting data into the table later, the schema of the inserted data can have `containsNull` as either true or false, which makes the insert/read fail.

A similar problem is reported in https://issues.apache.org/jira/browse/SPARK-5508.

Hive seems to support only arrays with nullable elements, so this PR enables the same behavior.
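A toy model of the nullability mismatch described above, assuming the table's default is `containsNull = true`; the `can_insert` helper is invented for illustration and is not Spark's actual resolution logic:

```python
from collections import namedtuple

# Minimal stand-in for Spark SQL's ArrayType(elementType, containsNull).
ArrayType = namedtuple("ArrayType", ["elementType", "containsNull"])

def can_insert(table_elem, data_elem):
    """Relaxed compatibility check: an array whose elements may not be
    null can always be written into a nullable-element column, but not
    the other way around."""
    if table_elem.elementType != data_elem.elementType:
        return False
    return table_elem.containsNull or not data_elem.containsNull

hive_default = ArrayType("int", containsNull=True)   # the Hive default
strict = ArrayType("int", containsNull=False)
```

Under this check, data with `containsNull = false` can be widened into a `containsNull = true` column, which matches the direction of the relaxation the PR describes.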



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 hive_parquet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4729


commit 4e3bd5568e644bc81e2539a917329486ea968a92
Author: Liang-Chi Hsieh vii...@gmail.com
Date:   2015-02-23T17:03:30Z

Enable inserting array into hive table saved as parquet using datasource.







[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...

2015-02-23 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3916#discussion_r25183693
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -0,0 +1,684 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.launcher;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileFilter;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.jar.JarFile;
+import java.util.regex.Pattern;
+
+import static org.apache.spark.launcher.CommandBuilderUtils.*;
+
+/**
+ * Launcher for Spark applications.
+ * <p/>
+ * Use this class to start Spark applications programmatically. The class uses a builder pattern
+ * to allow clients to configure the Spark application and launch it as a child process.
+ * <p/>
+ * Note that launching Spark applications using this class will not automatically load environment
+ * variables from the spark-env.sh or spark-env.cmd scripts in the configuration directory.
+ */
+public class SparkLauncher {
+
+  /** The Spark master. */
+  public static final String SPARK_MASTER = "spark.master";
+
+  /** Configuration key for the driver memory. */
+  public static final String DRIVER_MEMORY = "spark.driver.memory";
+  /** Configuration key for the driver class path. */
+  public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath";
+  /** Configuration key for the driver VM options. */
+  public static final String DRIVER_EXTRA_JAVA_OPTIONS = "spark.driver.extraJavaOptions";
+  /** Configuration key for the driver native library path. */
+  public static final String DRIVER_EXTRA_LIBRARY_PATH = "spark.driver.extraLibraryPath";
+
+  /** Configuration key for the executor memory. */
+  public static final String EXECUTOR_MEMORY = "spark.executor.memory";
--- End diff --

Yes. I tried to add the most common set of job config options here.
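To illustrate the builder pattern the javadoc above describes — chainable setters that accumulate configuration and then produce the child-process command line — here is a toy Python sketch; the class, method names, and flags are invented, not the real `SparkLauncher` API:

```python
class LauncherSketch(object):
    """Toy builder: chainable setters accumulate configuration, and
    build() emits the child-process argv. Names here are illustrative,
    not the real SparkLauncher API."""

    def __init__(self):
        self._conf = {}
        self._app = None

    def set_conf(self, key, value):
        self._conf[key] = value
        return self  # returning self makes the calls chainable

    def set_app_resource(self, path):
        self._app = path
        return self

    def build(self):
        argv = ["spark-submit"]
        for key in sorted(self._conf):
            argv += ["--conf", "%s=%s" % (key, self._conf[key])]
        argv.append(self._app)
        return argv

cmd = (LauncherSketch()
       .set_conf("spark.master", "local[2]")
       .set_conf("spark.driver.memory", "1g")
       .set_app_resource("app.py")
       .build())
```

The design choice being debated in the thread is where this builder surface lives: on a public class that command builders extend, or on an abstract base shared by the internal builders.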





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75593300
  
  [Test build #27854 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27854/consoleFull)
 for   PR 4708 at commit 
[`b85c5fe`](https://github.com/apache/spark/commit/b85c5fe14fdece4769fc98bbedcba80252b325bf).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-23 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r25182187
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters
 
 import scala.collection.JavaConversions._
+import scala.util.parsing.json._
 
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that 
converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that 
converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+import collection.JavaConverters._
 val result = obj.asInstanceOf[Result]
-Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      Map(
+        "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
+        "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        "timestamp" -> cell.getTimestamp.toString,
+        "type" -> Type.codeToType(cell.getTypeByte).toString,
+        "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.map(JSONObject(_).toString()).mkString("\n")
--- End diff --

Great! In fact, HBase itself will escape `\n` too. That's why I chose `\n` in the first place.
Thanks!
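A small standalone sketch (plain Java, not the HBase API) of why that escaping matters when records are joined with `\n`: if real newlines inside a value are escaped to the two-character sequence `\n` before joining, splitting on newlines recovers exactly one record per line.

```java
public class NewlineJoinSketch {
    // Escape real newlines inside a value so they can't be mistaken
    // for the record separator.
    static String escape(String value) {
        return value.replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String[] records = { "row1:a\nb", "row2:c" };
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records.length; i++) {
            if (i > 0) sb.append('\n');
            sb.append(escape(records[i]));
        }
        String joined = sb.toString();
        // Splitting on the separator yields exactly one line per record,
        // even though the first record's value contains a newline.
        System.out.println(joined.split("\n").length); // prints 2
    }
}
```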






[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread ilganeli
Github user ilganeli commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25183148
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -830,39 +836,39 @@ class DAGScheduler(
 try {
   // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
   // For ResultTask, serialize and broadcast (rdd, func).
-  val taskBinaryBytes: Array[Byte] =
-    if (stage.isShuffleMap) {
-      closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
-    } else {
-      closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
-    }
+  val taskBinaryBytes: Array[Byte] = stage match {
+    case a: ShuffleMapStage =>
+      closureSerializer.serialize((a.rdd, a.shuffleDep): AnyRef).array()
+    case b: ResultStage =>
+      closureSerializer.serialize((b.rdd, b.resultOfJob.get.func): AnyRef).array()
+  }
+
+
   taskBinary = sc.broadcast(taskBinaryBytes)
 } catch {
   // In the case of a failure during serialization, abort the stage.
      case e: NotSerializableException =>
        abortStage(stage, "Task not serializable: " + e.toString)
 runningStages -= stage
-return
--- End diff --

This was a mistake introduced when I was doing the second round of 
refactoring (copying code back from when I pulled this all out to its own 
method). When this code is within its own method then we can just look at the 
return value of the method and the weird return breaks become unnecessary. I'll 
add a comment for these in the meantime. 





[GitHub] spark pull request: [SPARK-5939][MLLib] make FPGrowth example app ...

2015-02-23 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4714#issuecomment-75580799
  
LGTM. Merged into master and branch-1.3. Thanks!





[GitHub] spark pull request: [SPARK-5950][SQL] Enable inserting array into ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4729#issuecomment-75588235
  
  [Test build #27853 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27853/consoleFull)
 for   PR 4729 at commit 
[`4e3bd55`](https://github.com/apache/spark/commit/4e3bd5568e644bc81e2539a917329486ea968a92).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...

2015-02-23 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/3916#issuecomment-75593031
  
@pwendell I see what you mean about compatibility. Let me play with the 
code a bit, it might not be hard to do something like that as part of this 
patch.





[GitHub] spark pull request: SPARK-5951

2015-02-23 Thread zuxqoj
GitHub user zuxqoj opened a pull request:

https://github.com/apache/spark/pull/4730

SPARK-5951

Remove unreachable driver memory properties in yarn client mode

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zuxqoj/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4730.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4730


commit 977dc967eb3f2e718df68729d614efc48a47c9da
Author: mohit.goyal mohit.go...@guavus.com
Date:   2015-02-23T17:35:24Z

remove not rechable deprecated variables in yarn client mode







[GitHub] spark pull request: [SPARK-4340] [Core] add java opts argument sub...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3205#issuecomment-75581446
  
  [Test build #27852 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27852/consoleFull)
 for   PR 3205 at commit 
[`9f8db81`](https://github.com/apache/spark/commit/9f8db81cef7287a92b9752f2c09c01b3ddf0d8ac).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-23 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75580971
  
In general, a fixed number of partitions is very difficult to work with 
when configuring a shuffle.  Suppose I have a job where I know a `flatMap` is 
going to blow up the size of my data by two.  If I want to minimize reduce-side 
spilling in a shuffle that comes after the `flatMap`, I want the parallelism of 
the shuffle to be double that of the input stage.  Because the size of my input 
data could change between different runs of my job, a ratio is a much more 
natural way to express my needs than a constant.

It's unclear to me whether a global default is useful at all, but a 
configurable parallelism ratio per shuffle operation definitely is.  (Systems 
like Crunch take this approach).
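As a sketch of the idea (plain Java, hypothetical helper name, not Spark's actual API), the ratio turns into a partition count only at shuffle time, so it scales with the input:

```java
public class ParallelismRatioSketch {
    // Hypothetical helper: derive shuffle parallelism from the input
    // stage's partition count and a configured ratio, instead of a
    // fixed global number of partitions.
    static int shufflePartitions(int inputPartitions, double ratio) {
        return Math.max(1, (int) Math.ceil(inputPartitions * ratio));
    }

    public static void main(String[] args) {
        // Input stage has 200 partitions and a flatMap doubles the data:
        // a ratio of 2.0 asks for 400 reduce-side partitions, and keeps
        // scaling automatically if the input grows on a later run.
        System.out.println(shufflePartitions(200, 2.0)); // prints 400
    }
}
```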






[GitHub] spark pull request: [SPARK-4340] [Core] add java opts argument sub...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3205#issuecomment-75581461
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27852/
Test PASSed.





[GitHub] spark pull request: [SPARK-5951][YARN] Remove unreachable driver m...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4730#issuecomment-75594177
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [Examples] fix deprecated method use in HBaseT...

2015-02-23 Thread potix2
GitHub user potix2 opened a pull request:

https://github.com/apache/spark/pull/4725

[Examples] fix deprecated method use in HBaseTest

HTableDescriptor(String name) is deprecated.


https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/TableName.html

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/potix2/spark fix-warning-hbase-example

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4725.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4725


commit f613b861afa037f78fc981933789cfc730c9a062
Author: Katsunori Kanda pot...@gmail.com
Date:   2015-02-23T10:21:16Z

[Examples] fix deprecated method use in HBaseTest







[GitHub] spark pull request: [SPARK-5944] [PySpark] fix version in Python A...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4731#issuecomment-75629173
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27861/
Test PASSed.





[GitHub] spark pull request: [SPARK-5944] [PySpark] fix version in Python A...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4731#issuecomment-75629156
  
  [Test build #27861 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27861/consoleFull)
 for   PR 4731 at commit 
[`08cbc3f`](https://github.com/apache/spark/commit/08cbc3f2f6ea21ecfb491e89b521679d4fb24879).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5912] [docs] [mllib] Small fixes to Chi...

2015-02-23 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/4732

[SPARK-5912] [docs] [mllib] Small fixes to ChiSqSelector docs

Fixes:
* typo in Scala example
* Removed comment usually applied on sparse data since that is debatable
* small edits to text for clarity

CC: @avulanov  I noticed a typo post-hoc and ended up making a few small 
edits.  Do the changes look OK?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark chisqselector-docs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4732.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4732


commit 3f3f9f4968ff1a8f45be6dbaead54eb1ea6df406
Author: Joseph K. Bradley jos...@databricks.com
Date:   2015-02-23T21:18:06Z

small fixes to ChiSqSelector docs







[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75637201
  
Look pretty good to me, but left a few more comments.  Also, please take a 
look at the various logging strings to see whether some of them can be 
expressed more readably using string interpolation. 





[GitHub] spark pull request: [SPARK-5912] [docs] [mllib] Small fixes to Chi...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4732#issuecomment-75637616
  
  [Test build #27862 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27862/consoleFull)
 for   PR 4732 at commit 
[`3f3f9f4`](https://github.com/apache/spark/commit/3f3f9f4968ff1a8f45be6dbaead54eb1ea6df406).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25202931
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -210,40 +210,58 @@ class DAGScheduler(
* The jobId value passed in will be used if the stage doesn't already 
exist with
* a lower jobId (jobId always increases across jobs.)
*/
-  private def getShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], 
jobId: Int): Stage = {
+  private def getShuffleMapStage(
+  shuffleDep: ShuffleDependency[_, _, _],
+  jobId: Int): ShuffleMapStage = {
 shuffleToMapStage.get(shuffleDep.shuffleId) match {
      case Some(stage) => stage
      case None =>
 // We are going to register ancestor shuffle dependencies
 registerShuffleDependencies(shuffleDep, jobId)
 // Then register current shuffleDep
-val stage =
-  newOrUsedStage(
-shuffleDep.rdd, shuffleDep.rdd.partitions.size, shuffleDep, 
jobId,
-shuffleDep.rdd.creationSite)
+val stage = newOrUsedShuffleStage(shuffleDep, jobId)
 shuffleToMapStage(shuffleDep.shuffleId) = stage
- 
+
 stage
 }
   }
 
   /**
-   * Create a Stage -- either directly for use as a result stage, or as 
part of the (re)-creation
-   * of a shuffle map stage in newOrUsedStage.  The stage will be 
associated with the provided
-   * jobId. Production of shuffle map stages should always use 
newOrUsedStage, not newStage
-   * directly.
+   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle map stage in
+   * newOrUsedShuffleStage.  The stage will be associated with the provided jobId.
+   * Production of shuffle map stages should always use newOrUsedShuffleStage, not
+   * newShuffleMapStage directly.
*/
-  private def newStage(
+  private def newShuffleMapStage(
   rdd: RDD[_],
   numTasks: Int,
-  shuffleDep: Option[ShuffleDependency[_, _, _]],
+  shuffleDep: ShuffleDependency[_, _, _],
   jobId: Int,
-  callSite: CallSite)
-: Stage =
-  {
+  callSite: CallSite): ShuffleMapStage = {
 val parentStages = getParentStages(rdd, jobId)
 val id = nextStageId.getAndIncrement()
-val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, 
jobId, callSite)
+val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, 
parentStages,
+  jobId, callSite, shuffleDep)
+
+stageIdToStage(id) = stage
+updateJobIdStageIdMaps(jobId, stage)
+stage
+  }
+
+  /**
+   * Create a ResultStage -- either directly for use as a result stage, or 
as part of the
+   * (re)-creation of a shuffle map stage in newOrUsedShuffleStage.  The 
stage will be associated
+   * with the provided jobId.
+   */
+  private def newResultStage(
+  rdd: RDD[_],
+  numTasks: Int,
+  jobId: Int,
+  callSite: CallSite): ResultStage = {
+val parentStages = getParentStages(rdd, jobId)
+val id = nextStageId.getAndIncrement()
+val stage: ResultStage = new ResultStage(id, rdd, numTasks, 
parentStages, jobId, callSite)
+
--- End diff --

I'd rather avoid the code duplication in newShuffleMapStage and 
newResultStage.  This can be done in generic fashion via runtime reflection:
```scala
import scala.reflect.runtime.{universe => ru}
...
  private def newStage[T <: Stage : ru.TypeTag](
      rdd: RDD[_],
      numTasks: Int,
      shuffleDep: Option[ShuffleDependency[_, _, _]],
      jobId: Int,
      callSite: CallSite): T = {
    val m = ru.runtimeMirror(getClass.getClassLoader)
    val classT = ru.typeOf[T].typeSymbol.asClass
    val cm = m.reflectClass(classT)
    val ctor = ru.typeOf[T].declaration(ru.nme.CONSTRUCTOR).asMethod
    val ctorm = cm.reflectConstructor(ctor)
    val parentStages = getParentStages(rdd, jobId)
    val id = nextStageId.getAndIncrement()
    val stage = shuffleDep.map { shufDep =>
      ctorm(id, rdd, numTasks, parentStages, jobId, callSite, shufDep)
    }.getOrElse(ctorm(id, rdd, numTasks, parentStages, jobId, callSite)).asInstanceOf[T]

    stageIdToStage(id) = stage
    updateJobIdStageIdMaps(jobId, stage)
    stage
  }
...
  val stage = newStage[ShuffleMapStage](rdd, numTasks, Some(shuffleDep), jobId, rdd.creationSite)
...
  finalStage = newStage[ResultStage](finalRDD, partitions.size, None, jobId, callSite)
```
...but I'd want to see the performance numbers on that before deciding not to
go with a less flexible approach that avoids reflection:
```scala
  private def newStage[T <: Stage](
   

[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75638966
  
Oops, did not realize that a test was still running (glad it passed)





[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4723#issuecomment-75594343
  
  [Test build #27855 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27855/consoleFull)
 for   PR 4723 at commit 
[`5381db1`](https://github.com/apache/spark/commit/5381db1ad833ab72a2eb15b0f30d745c1bfbe764).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25186634
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -77,52 +71,9 @@ private[spark] class Stage(
   /** Pointer to the latest [StageInfo] object, set by DAGScheduler. */
   var latestInfo: StageInfo = StageInfo.fromStage(this)
 
-  def isAvailable: Boolean = {
-if (!isShuffleMap) {
-  true
-} else {
-  numAvailableOutputs == numPartitions
-}
-  }
-
-  def addOutputLoc(partition: Int, status: MapStatus) {
-val prevList = outputLocs(partition)
-outputLocs(partition) = status :: prevList
-if (prevList == Nil) {
-  numAvailableOutputs += 1
-}
-  }
-
-  def removeOutputLoc(partition: Int, bmAddress: BlockManagerId) {
-val prevList = outputLocs(partition)
-val newList = prevList.filterNot(_.location == bmAddress)
-outputLocs(partition) = newList
-if (prevList != Nil && newList == Nil) {
-  numAvailableOutputs -= 1
-}
-  }
+  var numAvailableOutputs = 0
--- End diff --

...for nextAttemptId, too.





[GitHub] spark pull request: [SPARK-5253] [ML] LinearRegression with L1/L2 ...

2015-02-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4259#issuecomment-75600260
  
@dbtsai I'd like to make a pass over this, but I realized that it has 
conflicts because of the developer api PR committed last week: 
[https://github.com/apache/spark/pull/3637]  Could you please rebase?  I don't 
think there are any more big PRs coming up which will make you rebase again.  
Thank you!





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75604216
  
  [Test build #27858 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27858/consoleFull)
 for   PR 4708 at commit 
[`d548caf`](https://github.com/apache/spark/commit/d548cafab4b6f36ee7e9bed696419567f0bc3d94).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75609741
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27854/
Test PASSed.





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75610748
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27856/
Test PASSed.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75614718
  
  [Test build #27857 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull)
 for   PR 4709 at commit 
[`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `ChiSqSelector stands for Chi-Squared feature selection. It operates on 
the labeled data. ChiSqSelector orders categorical features based on their 
values of Chi-Squared test on independence from class and filters (selects) top 
given features.  `






[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25186416
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -47,26 +47,20 @@ import org.apache.spark.util.CallSite
  * be updated for each attempt.
--- End diff --

Remove unused BlockManagerId from imports





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75599168
  
  [Test build #27857 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull)
 for   PR 4709 at commit 
[`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4048#issuecomment-75604198
  
  [Test build #27859 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27859/consoleFull)
 for   PR 4048 at commit 
[`a1f1665`](https://github.com/apache/spark/commit/a1f16654a77caa3ef2e35d7e3ace830aa1708bdd).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75594276
  
  [Test build #27856 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27856/consoleFull)
 for   PR 4708 at commit 
[`6da3a71`](https://github.com/apache/spark/commit/6da3a7101c3c8087a9a924b998889eb6e1b3446f).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5950][SQL] Enable inserting array into ...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4729#issuecomment-75597599
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27853/
Test FAILed.





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25186558
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -77,52 +71,9 @@ private[spark] class Stage(
   /** Pointer to the latest [[StageInfo]] object, set by DAGScheduler. */
   var latestInfo: StageInfo = StageInfo.fromStage(this)
 
-  def isAvailable: Boolean = {
-if (!isShuffleMap) {
-  true
-} else {
-  numAvailableOutputs == numPartitions
-}
-  }
-
-  def addOutputLoc(partition: Int, status: MapStatus) {
-val prevList = outputLocs(partition)
-outputLocs(partition) = status :: prevList
-if (prevList == Nil) {
-  numAvailableOutputs += 1
-}
-  }
-
-  def removeOutputLoc(partition: Int, bmAddress: BlockManagerId) {
-val prevList = outputLocs(partition)
-val newList = prevList.filterNot(_.location == bmAddress)
-outputLocs(partition) = newList
-if (prevList != Nil && newList == Nil) {
-  numAvailableOutputs -= 1
-}
-  }
+  var numAvailableOutputs = 0
--- End diff --

Add explicit type declaration





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25188257
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -228,22 +227,41 @@ class DAGScheduler(
   }
 
   /**
-   * Create a Stage -- either directly for use as a result stage, or as 
part of the (re)-creation
-   * of a shuffle map stage in newOrUsedStage.  The stage will be 
associated with the provided
-   * jobId. Production of shuffle map stages should always use 
newOrUsedStage, not newStage
-   * directly.
+   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle 
map stage in 
+   * newOrUsedShuffleStage.  The stage will be associated with the provided
+   * jobId. Production of shuffle map stages should always use 
newOrUsedShuffleStage, 
+   * not newShuffleMapStage directly.
--- End diff --

nit: reformat a little...
```scala
  /**
   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle map 
stage in 
   * newOrUsedShuffleStage.  The stage will be associated with the provided 
jobId.
   * Production of shuffle map stages should always use 
newOrUsedShuffleStage, not
   * newShuffleMapStage directly.
   */
```





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4709#discussion_r25188678
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,55 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) selects 
the most relevant features for use in model construction. The number of 
features to select can be tuned on a validation set. Feature selection is 
usually applied to sparse data, for example in text classification. It reduces 
the size of the vector space and, in turn, the complexity of any subsequent 
operations on vectors. 
+
+### ChiSqSelector
+ChiSqSelector implements Chi-Squared feature selection. It operates on labeled 
data with categorical features, ordering features by a Chi-Squared test of 
independence from the class label and selecting (filtering) the top given 
features.  
+
+#### Model Fitting
+

+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 has the
+following parameters in the constructor:
+
+* `numTopFeatures`: the number of top features the selector will select 
(filter).
+
+We provide a 
[`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) 
method in
+`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with 
categorical features, learn the summary statistics, and then
+return a model which can transform the input dataset into the reduced 
feature space.
+
+This model implements 
[`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+which can apply the Chi-Squared feature selection on a `Vector` to produce 
a reduced `Vector` or on
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs the actual feature filtering can also be 
instantiated independently, given an array of feature indices sorted in 
ascending order.
+
+#### Example
+
+The following example shows the basic use of ChiSqSelector.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.feature.ChiSqSelector
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLUtils
+
+// load some data in libsvm format; each feature value is in the range 0..255
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// discretize data into 16 equal bins
+val discretizedData = data.map { lp =>
+  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 }))
+}
+// create a ChiSqSelector that will select the top 50 features
+val selector = new ChiSqSelector(50)
+// fit the ChiSqSelector model
+val transformer = selector.fit(discretizedData)
+// filter down to the top 50 features
+val filteredData = transformer.transform(discretizedData)
--- End diff --

Since transform() takes an RDD[Vector], you'll need to map the data to 
features, and then zip the transformed features with the labels.
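The pattern the comment describes — strip the data down to its feature vectors, transform those, then zip the transformed features back together with the labels — can be sketched, hypothetically, with plain Scala collections standing in for RDDs. `Point`, `transform`, and the kept indices here are illustrative stand-ins, not Spark API:

```scala
// Spark-free sketch of "map to features, transform, zip back with labels".
object ZipLabelsSketch {
  final case class Point(label: Double, features: Array[Double])

  // stand-in for a fitted selector's transform(): keep features 1 and 3
  def transform(features: Array[Double]): Array[Double] =
    Array(features(1), features(3))

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Point(1.0, Array(0.0, 5.0, 0.0, 7.0)),
      Point(0.0, Array(1.0, 2.0, 3.0, 4.0)))
    val labels   = data.map(_.label)                    // strip off the labels
    val filtered = data.map(p => transform(p.features)) // transform features only
    // zip the transformed features back together with their labels
    val rebuilt = labels.zip(filtered).map { case (l, f) => Point(l, f) }
    rebuilt.foreach(p => println(s"${p.label} -> ${p.features.mkString(",")}"))
    // prints "1.0 -> 5.0,7.0" then "0.0 -> 2.0,4.0"
  }
}
```

With real RDDs the shape is the same, with the caveat that `zip` requires both RDDs to have identical partitioning and element counts, which holds here because both sides derive from the same `map` over the same input.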





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75603592
  
I think that last issue is the only one--thanks!





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75609726
  
  [Test build #27854 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27854/consoleFull)
 for   PR 4708 at commit 
[`b85c5fe`](https://github.com/apache/spark/commit/b85c5fe14fdece4769fc98bbedcba80252b325bf).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75610731
  
  [Test build #27856 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27856/consoleFull)
 for   PR 4708 at commit 
[`6da3a71`](https://github.com/apache/spark/commit/6da3a7101c3c8087a9a924b998889eb6e1b3446f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75610561
  
Sorry for this, still sleeping...





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75611280
  
  [Test build #27860 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27860/consoleFull)
 for   PR 4709 at commit 
[`19a8a4e`](https://github.com/apache/spark/commit/19a8a4e9b8c3b5607c87fb1eae19810f90b9ad6a).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5924] Add the ability to specify withMe...

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4704#discussion_r25192351
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -29,7 +29,18 @@ import org.apache.spark.sql.types.{StructField, 
StructType}
 /**
  * Params for [[StandardScaler]] and [[StandardScalerModel]].
  */
-private[feature] trait StandardScalerParams extends Params with 
HasInputCol with HasOutputCol
+private[feature] trait StandardScalerParams extends Params with 
HasInputCol with HasOutputCol {
+  val withMean: BooleanParam = new BooleanParam(this, 
--- End diff --

Add doc with `@group param`





[GitHub] spark pull request: [SPARK-5944] fix version in Python API docs

2015-02-23 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/4731

[SPARK-5944] fix version in Python API docs

use RELEASE_VERSION when building the Python API docs

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark api_version

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4731.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4731


commit 08cbc3f2f6ea21ecfb491e89b521679d4fb24879
Author: Davies Liu dav...@databricks.com
Date:   2015-02-23T19:10:45Z

fix python docs







[GitHub] spark pull request: [SPARK-5950][SQL] Enable inserting array into ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4729#issuecomment-75597588
  
  [Test build #27853 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27853/consoleFull)
 for   PR 4729 at commit 
[`4e3bd55`](https://github.com/apache/spark/commit/4e3bd5568e644bc81e2539a917329486ea968a92).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Params(`






[GitHub] spark pull request: [SPARK-5927][MLlib] Modify FPGrowth's partitio...

2015-02-23 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4706#issuecomment-75599387
  
@viirya Your proposal definitely works better in some cases, while the 
current implementation works better in some others. I think we both agree on 
this. The question is which partitioning scheme fits real datasets better. I 
don't have a clear answer. If there are some standard benchmark datasets, we 
can compare the performance.





[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75602215
  
@JoshRosen I'm happy to take a look at this but won't be able to get to it 
until Friday.  Feel free to merge it sooner than that if you're eager to get it 
in; otherwise I'll take a look Friday!





[GitHub] spark pull request: [MLLIB] SPARK-4362: Added classProbabilities m...

2015-02-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3626#issuecomment-75608187
  
@alanctgardner  That will be great if you change it to 
predictProbabilities; thanks.  I agree with what @jatinpreet was saying about 
the correctness, and with @srowen 's comment on how to fix it: The value of 
```brzPi + brzTheta * testData.toBreeze``` is a log probability, which needs to 
be exponentiated before you normalize it here: 
[https://github.com/apache/spark/pull/3626/files?diff=split#diff-6d8eff78be2fb624d4a076db334208a4R84]

Could you please rebase off of master and make these couple of updates?  
After that, I can make a final pass.  Thanks!
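The fix being suggested — the raw class scores of the form `brzPi + brzTheta * testData.toBreeze` are log probabilities, so they must be exponentiated before normalization — can be sketched without Spark or Breeze using plain Scala arrays. The log-sum-exp trick (subtracting the max before exponentiating) keeps the computation stable even when the log scores are very negative:

```scala
// Sketch: normalizing a vector of LOG class scores into probabilities.
// Exponentiating directly would underflow for very negative scores, so
// subtract the max first; the shift cancels out in the normalization.
object LogProbSketch {
  def normalizeLogProbs(logProbs: Array[Double]): Array[Double] = {
    val max   = logProbs.max
    val exps  = logProbs.map(lp => math.exp(lp - max)) // arguments are <= 0
    val total = exps.sum
    exps.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // scores equal to log(0.2), log(0.3), log(0.5), shifted by a large constant:
    // naive exp() would underflow all three to 0.0, losing the distribution
    val scores = Array(math.log(0.2) - 500, math.log(0.3) - 500, math.log(0.5) - 500)
    val probs  = normalizeLogProbs(scores)
    println(probs.mkString(","))  // recovers 0.2, 0.3, 0.5 up to floating point
  }
}
```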





[GitHub] spark pull request: [SPARK-5924] Add the ability to specify withMe...

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4704#discussion_r25192356
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -44,12 +55,18 @@ class StandardScaler extends 
Estimator[StandardScalerModel] with StandardScalerP
 
   /** @group setParam */
   def setOutputCol(value: String): this.type = set(outputCol, value)
-
+  
+  /** @grour setParam */
--- End diff --

`@group`





[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4723#issuecomment-75610987
  
  [Test build #27855 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27855/consoleFull)
 for   PR 4723 at commit 
[`5381db1`](https://github.com/apache/spark/commit/5381db1ad833ab72a2eb15b0f30d745c1bfbe764).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5946][Streaming] Add Python API for dir...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4723#issuecomment-75610996
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27855/
Test PASSed.




