[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-21 Thread petro-rudenko
Github user petro-rudenko closed the pull request at:

https://github.com/apache/spark/pull/5510





[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-20 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/5510#issuecomment-94418041
  
For my case I can live with the default behaviour. It's just not intuitive that an 
empty ParamGridBuilder returns an array of size 1, and it's also not clear how to handle 
just one parameter. E.g. if there's only one param, just set it explicitly and don't 
use cross-validation.





[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-20 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/5510#issuecomment-94419332
  
For my case it means:
```scala
// Pseudocode for the equivalence I mean: a grid with a single value for one param
// behaves the same as setting that param explicitly and building an empty grid.
(new ParamGridBuilder).addGrid(lr.regParam, Array(0.1)).build() ==
  { lr.setRegParam(0.1); (new ParamGridBuilder).build() }
```

So if there's only one param, just overwrite the default value and run again as with 
an empty param map.





[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-15 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/5510#issuecomment-93412249
  
Ideally CrossValidator should handle the following cases (a rough sketch of this 
dispatch follows below):
1) No parameters at all: just run est.fit(dataset, new ParamMap).
2) 1 param: set this param on the estimator (assume it's a weird way to 
override the default param) and again do step 1.
3) 2+ params: do cross-validation.
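
Roughly, the dispatch could look like this (an illustrative sketch only; 
`runCrossValidation` is a hypothetical helper standing in for the usual k-fold grid 
search, not Spark API):
```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

def fitWithGrid[M <: Model[M]](
    est: Estimator[M],
    dataset: DataFrame,
    paramMaps: Array[ParamMap])(
    runCrossValidation: Array[ParamMap] => M): M = {
  paramMaps.length match {
    case 0 => est.fit(dataset, new ParamMap)    // 1) no params: plain fit with defaults
    case 1 => est.fit(dataset, paramMaps.head)  // 2) one param: override the default, no CV
    case _ => runCrossValidation(paramMaps)     // 3) 2+ params: real cross-validation
  }
}
```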





[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-15 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/5510#issuecomment-93373411
  
Maybe handle an empty estimatorParamMaps in CrossValidator?
```scala
/** @group setParam */
def setEstimatorParamMaps(value: Array[ParamMap]): this.type = {
  if (value.isEmpty) {
    set(estimatorParamMaps, Array(new ParamMap))
  } else {
    set(estimatorParamMaps, value)
  }
}
```

?





[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-14 Thread petro-rudenko
GitHub user petro-rudenko opened a pull request:

https://github.com/apache/spark/pull/5510

[SPARK-6901][Ml]ParamGridBuilder.build with no grids should return an empty array

ParamGridBuilder.build with no grids returns an array containing a single empty param map:
```scala
assert((new ParamGridBuilder).build().size == 1)
```
I have logic that skips CrossValidator when the ParamGridBuilder is empty. It's 
confusing because a ParamGridBuilder with one grid point also returns an array of size 1:
```scala
assert((new ParamGridBuilder).addGrid(lr.regParam, Array(0.1)).build().size == 1)
```
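
For context, the user-side check I have in mind looks roughly like this (hypothetical 
names; `lr`, `crossval`, and `training` are assumed to exist):
```scala
// Skip CrossValidator entirely when the grid is empty.
val paramGrid = (new ParamGridBuilder).build()
val model =
  if (paramGrid.isEmpty) {
    lr.fit(training, new ParamMap)  // plain fit with default params
  } else {
    crossval.setEstimatorParamMaps(paramGrid).fit(training, new ParamMap)
  }
```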

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/petro-rudenko/spark SPARK-6901

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5510.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5510


commit 742bd209c7fd5fc82c65a86a1b28de2470db018b
Author: Peter Rudenko petro.rude...@gmail.com
Date:   2015-04-14T15:12:22Z

[SPARK-6901][Ml]ParamGridBuilder.build with no grids should return an empty 
array







[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...

2015-04-14 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/5510#discussion_r28339279
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamGridBuilder.scala ---
@@ -100,10 +100,11 @@ class ParamGridBuilder {
   * Builds and returns all combinations of parameters specified by the param grid.
   */
   def build(): Array[ParamMap] = {
-var paramMaps = Array(new ParamMap)
+var paramMaps = Array.empty[ParamMap]
--- End diff --

Do you mean like this:
```scala
def build(): Array[ParamMap] = {
  if (paramGrid.isEmpty) {
    Array.empty[ParamMap]
  } else {
    var paramMaps = Array(new ParamMap)
    paramGrid.foreach { case (param, values) =>
      val newParamMaps = values.flatMap { v =>
        paramMaps.map(_.copy.put(param.asInstanceOf[Param[Any]], v))
      }
      paramMaps = newParamMaps.toArray
    }
    paramMaps
  }
}
```
?





[GitHub] spark pull request: [SPARK-2991] Implement RDD lazy transforms for...

2015-04-06 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/1909#issuecomment-90063723
  
+1 for this. It would be a useful feature for computing a distributed cumulative sum, for example.
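
For reference, one way to compute a distributed cumulative sum today, using 
per-partition offsets (just a sketch of the use case, unrelated to how this PR 
implements lazy transforms):
```scala
import org.apache.spark.rdd.RDD

// Cumulative sum over an RDD[Double]: sum each partition, build running offsets
// on the driver, then emit running totals within each partition.
def cumulativeSum(data: RDD[Double]): RDD[Double] = {
  val partitionSums = data.mapPartitions(it => Iterator(it.sum)).collect()
  val offsets = partitionSums.scanLeft(0.0)(_ + _)
  data.mapPartitionsWithIndex { (idx, it) =>
    var running = offsets(idx)
    it.map { v => running += v; running }
  }
}
```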





[GitHub] spark pull request: [SPARK-5885][MLLIB] Add VectorAssembler as a f...

2015-04-03 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/5196#discussion_r27739585
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.AlphaComponent
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{HasInputCols, HasOutputCol, ParamMap}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors}
+import org.apache.spark.sql.{Column, DataFrame, Row}
+import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
+import org.apache.spark.sql.catalyst.expressions.CreateStruct
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: AlphaComponent ::
+ * A feature transformer that merges multiple columns into a vector column.
+ */
+@AlphaComponent
+class VectorAssembler extends Transformer with HasInputCols with HasOutputCol {
+
+  /** @group setParam */
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  override def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame = {
+    val map = this.paramMap ++ paramMap
+    val assembleFunc = udf { r: Row =>
+      VectorAssembler.assemble(r.toSeq: _*)
+    }
+    val args = map(inputCols).map(c => UnresolvedAttribute(c))
+    dataset.select(col("*"), assembleFunc(new Column(CreateStruct(args))).as(map(outputCol)))
+  }
+
+  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
+    val map = this.paramMap ++ paramMap
+    val inputColNames = map(inputCols)
+    val outputColName = map(outputCol)
+    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
+    for (dataType <- inputDataTypes) {
+      if (!(dataType == DoubleType || dataType.isInstanceOf[VectorUDT])) {
+        throw new IllegalArgumentException(s"Data type $dataType is not supported.")
+      }
+    }
+    if (schema.fieldNames.contains(outputColName)) {
+      throw new IllegalArgumentException(s"Output column $outputColName already exists.")
+    }
+    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, false))
+  }
+}
+
+@AlphaComponent
+object VectorAssembler {
+
+  private[feature] def assemble(vv: Any*): Vector = {
+    val indices = ArrayBuilder.make[Int]
+    val values = ArrayBuilder.make[Double]
+    var cur = 0
+    vv.foreach {
+      case v: Double =>
--- End diff --

It would be good to also support Integers and just convert them to Double.
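
Something along these lines (an illustrative standalone sketch, not the actual 
VectorAssembler.assemble implementation; names are made up):
```scala
import scala.collection.mutable.ArrayBuilder
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Assemble-like helper that also accepts Int and widens it to Double.
def assembleWithInts(vv: Any*): Vector = {
  val indices = ArrayBuilder.make[Int]
  val values = ArrayBuilder.make[Double]
  var cur = 0
  def append(d: Double): Unit = {
    if (d != 0.0) { indices += cur; values += d }
    cur += 1
  }
  vv.foreach {
    case d: Double => append(d)
    case i: Int    => append(i.toDouble)  // the suggested extra case
    case other =>
      throw new IllegalArgumentException(s"Type ${other.getClass.getName} is not supported.")
  }
  Vectors.sparse(cur, indices.result(), values.result())
}
```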





[GitHub] spark pull request: [SPARK-5885][MLLIB] Add VectorAssembler as a f...

2015-04-02 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/5196#discussion_r27645880
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.AlphaComponent
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{HasInputCols, HasOutputCol, ParamMap}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors}
+import org.apache.spark.sql.{Column, DataFrame, Row}
+import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
+import org.apache.spark.sql.catalyst.expressions.CreateStruct
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: AlphaComponent ::
+ * A feature transformer that merges multiple columns into a vector column.
+ */
+@AlphaComponent
+class VectorAssembler extends Transformer with HasInputCols with HasOutputCol {
--- End diff --

Maybe call it FeatureUnion to keep the same semantics as 
[sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)?





[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer

2015-03-31 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/4735#discussion_r27486767
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.AlphaComponent
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute.NominalAttribute
+import org.apache.spark.ml.param._
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{StringType, StructType}
+import org.apache.spark.util.collection.OpenHashMap
+
+/**
+ * Base trait for [[LabelIndexer]] and [[LabelIndexerModel]].
+ */
+private[feature] trait LabelIndexerBase extends Params with HasLabelCol with HasOutputCol {
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = {
+    val map = this.paramMap ++ paramMap
+    val labelType = schema(map(labelCol)).dataType
+    require(labelType == StringType, s"The label column must be string-typed but got $labelType.")
+    val inputFields = schema.fields
+    val outputColName = map(outputCol)
+    require(inputFields.forall(_.name != outputColName),
+      s"Output column $outputColName already exists.")
+    val attr = NominalAttribute.defaultAttr.withName(map(outputCol))
+    val outputFields = inputFields :+ attr.toStructField()
+    StructType(outputFields)
+  }
+}
+
+/**
+ * :: AlphaComponent ::
+ * A label indexer that maps a string column of labels to an ML column of label indices.
+ * The indices are in [0, numLabels), ordered by label frequencies.
+ * So the most frequent label gets index 0.
+ */
+@AlphaComponent
+class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase {
+
+  /** @group setParam */
+  def setLabelCol(value: String): this.type = set(labelCol, value)
--- End diff --

If it's supposed to be a general indexer, not just for the label column, maybe it 
makes sense to call it ColumnIndexer and use setInputCol instead.





[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer

2015-03-31 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/4735#discussion_r27510186
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.AlphaComponent
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute.NominalAttribute
+import org.apache.spark.ml.param._
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{StringType, StructType}
+import org.apache.spark.util.collection.OpenHashMap
+
+/**
+ * Base trait for [[LabelIndexer]] and [[LabelIndexerModel]].
+ */
+private[feature] trait LabelIndexerBase extends Params with HasLabelCol with HasOutputCol {
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = {
+    val map = this.paramMap ++ paramMap
+    val labelType = schema(map(labelCol)).dataType
+    require(labelType == StringType, s"The label column must be string-typed but got $labelType.")
+    val inputFields = schema.fields
+    val outputColName = map(outputCol)
+    require(inputFields.forall(_.name != outputColName),
+      s"Output column $outputColName already exists.")
+    val attr = NominalAttribute.defaultAttr.withName(map(outputCol))
+    val outputFields = inputFields :+ attr.toStructField()
+    StructType(outputFields)
+  }
+}
+
+/**
+ * :: AlphaComponent ::
+ * A label indexer that maps a string column of labels to an ML column of label indices.
+ * The indices are in [0, numLabels), ordered by label frequencies.
+ * So the most frequent label gets index 0.
+ */
+@AlphaComponent
+class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase {
+
+  /** @group setParam */
+  def setLabelCol(value: String): this.type = set(labelCol, value)
--- End diff --

Yes, but in the JIRA's example it uses setInputCol rather than setFeatureCol:
```scala
val i = new LabelIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
```
which makes more sense to me.





[GitHub] spark pull request: [SPARK-6608] [SQL] Makes DataFrame.rdd a lazy ...

2015-03-30 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/5265#issuecomment-87670835
  
+1 for this, since, for example, [the caching logic in the ml 
package](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L64) 
doesn't work properly.
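
My understanding of the problem, as a small illustration (assuming `df` is an 
existing DataFrame):
```scala
// If DataFrame.rdd builds a new RDD on every call, caching the first extraction
// doesn't help later ones: they are different RDD instances with different ids.
val first = df.rdd
first.cache()
val second = df.rdd
println(first.id == second.id)  // false with a non-lazy `rdd`, so `second` recomputes
```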





[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer

2015-03-30 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/4735#discussion_r27399968
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.AlphaComponent
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute.NominalAttribute
+import org.apache.spark.ml.param._
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{StringType, StructType}
+import org.apache.spark.util.collection.OpenHashMap
+
+/**
+ * Base trait for [[LabelIndexer]] and [[LabelIndexerModel]].
+ */
+private[feature] trait LabelIndexerBase extends Params with HasLabelCol with HasOutputCol {
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = {
+    val map = this.paramMap ++ paramMap
+    val labelType = schema(map(labelCol)).dataType
+    require(labelType == StringType, s"The label column must be string-typed but got $labelType.")
+    val inputFields = schema.fields
+    val outputColName = map(outputCol)
+    require(inputFields.forall(_.name != outputColName),
+      s"Output column $outputColName already exists.")
+    val attr = NominalAttribute.defaultAttr.withName(map(outputCol))
+    val outputFields = inputFields :+ attr.toStructField()
+    StructType(outputFields)
+  }
+}
+
+/**
+ * :: AlphaComponent ::
+ * A label indexer that maps a string column of labels to an ML column of label indices.
+ * The indices are in [0, numLabels), ordered by label frequencies.
+ * So the most frequent label gets index 0.
+ */
+@AlphaComponent
+class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase {
+
+  /** @group setParam */
+  def setLabelCol(value: String): this.type = set(labelCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  // TODO: handle unseen labels
+
+  override def fit(dataset: DataFrame, paramMap: ParamMap): LabelIndexerModel = {
+    val map = this.paramMap ++ paramMap
+    val counts = dataset.select(map(labelCol)).map(_.getString(0)).countByValue()
+    val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray
--- End diff --

Maybe it makes sense to reuse the implementation from 
[DatasetIndexer](https://github.com/apache/spark/pull/3000/), or vice versa, so that 
the transformation logic lives in one place. Then, if performance ever needs to be 
optimized (e.g. to take advantage of columnar storage, since some storage formats can 
provide column-cardinality metadata), it only has to change in one place.





[GitHub] spark pull request: [ML][docs][minor] Define LabeledDocument/Docum...

2015-03-24 Thread petro-rudenko
Github user petro-rudenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/5135#discussion_r27043852
  
--- Diff: docs/ml-guide.md ---
@@ -655,6 +660,36 @@ import org.apache.spark.sql.DataFrame;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SQLContext;
 
+// Labeled and unlabeled instance types.
--- End diff --

Yes, it's annoying when copy/pasting a bunch of code into the spark shell and it 
fails because these classes are not declared.
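
For reference, the two classes the snippet expects are (reproduced from the earlier 
ml-guide example from memory, so double-check against the guide):
```scala
case class LabeledDocument(id: Long, text: String, label: Double)
case class Document(id: Long, text: String)
```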





[GitHub] spark pull request: [ML][docs][minor] Define LabeledDocument/Docum...

2015-03-23 Thread petro-rudenko
GitHub user petro-rudenko opened a pull request:

https://github.com/apache/spark/pull/5135

[ML][docs][minor] Define LabeledDocument/Document classes in CV example

To make the Cross-Validation example code snippet easier to copy/paste, 
LabeledDocument/Document need to be defined in it, since they are defined in a 
previous example.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/petro-rudenko/spark patch-3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5135


commit 1d35383bf893aa7c14fb4750d730b3bf6c92cfe7
Author: Peter Rudenko petro.rude...@gmail.com
Date:   2015-03-23T11:28:19Z

[SQL][docs][minor] Define LabeledDocument/Document classes in CV example

To make the Cross-Validation example code snippet easier to copy/paste, 
LabeledDocument/Document need to be defined in it, since they are defined in a 
previous example.







[GitHub] spark pull request: SPARK-4682 [CORE] Consolidate various 'Clock' ...

2015-02-25 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/4514#issuecomment-75989874
  
Having a problem compiling Spark with sbt due to the following error:
```
$ build/sbt -Phadoop-2.4 compile
[error] 
/home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:127:
 type mismatch;
[error]  found   : org.apache.spark.util.SystemClock
[error]  required: org.apache.spark.Clock
[error]   private var clock: Clock = new SystemClock()
[error]  ^
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Resolving org.objenesis#objenesis;1.2 ...
[info] Updating {file:/home/peter/soft/spark_src/}streaming-mqtt...
[info] Resolving org.apache.hadoop#hadoop-mapreduce-client-common;2.4.0 ...
[error] 
/home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:66:
 reference to Clock is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.util._
[error] and import org.apache.spark._
[error] clock: Clock = new SystemClock())
[error]^
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Resolving org.apache.hadoop#hadoop-annotations;2.4.0 ...
[info] Updating {file:/home/peter/soft/spark_src/}streaming-twitter...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Resolving org.apache.spark#spark-network-shuffle_2.10;1.3.0-SNAPSHOT 
...
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn]  * com.google.guava:guava:(14.0.1, 11.0.2) -> 15.0
[warn] Run 'evicted' to see detailed eviction warnings
[info] Resolving org.objenesis#objenesis;1.2 ...
[info] Updating {file:/home/peter/soft/spark_src/}streaming-flume...
[info] Resolving commons-net#commons-net;3.1 ...
[info] Updating {file:/home/peter/soft/spark_src/}tools...
[info] Resolving net.sf.py4j#py4j;0.8.2.1 ...
[warn] 
/home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:34:
 imported `Clock' is permanently hidden by definition of trait Clock in package 
worker
[warn] import org.apache.spark.util.{Clock, SystemClock}
[warn]   ^
[info] Resolving org.twitter4j#twitter4j-core;3.0.3 ...
[error] 
/home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:61:
 type mismatch;
[error]  found   : org.apache.spark.util.SystemClock
[error]  required: org.apache.spark.deploy.worker.Clock
[error]   private var clock: Clock = new SystemClock()
[error]  ^
[error] 
/home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:190:
 value getTimeMillis is not a member of org.apache.spark.deploy.worker.Clock
[error]   val processStart = clock.getTimeMillis()
[error]^
[error] 
/home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:192:
 value getTimeMillis is not a member of org.apache.spark.deploy.worker.Clock
[error]   if (clock.getTimeMillis() - processStart < successfulRunDuration * 1000) {
[error] ^
```





[GitHub] spark pull request: SPARK-4682 [CORE] Consolidate various 'Clock' ...

2015-02-25 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/4514#issuecomment-75994711
  
Thanks, works now.





[GitHub] spark pull request: [SPARK-5802][MLLIB] cache transformed data in ...

2015-02-23 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/4593#issuecomment-75550855
  
@dbtsai, @joshdevins here's an issue I have. I'm using the new ml pipeline 
with hyperparameter grid search. Because the folds don't depend on the 
hyperparameters, I've reimplemented LogisticRegression a bit so that it doesn't 
unpersist the data:
```scala
class CustomLogisticRegression extends LogisticRegression {
  var oldInstances: RDD[LabeledPoint] = null

  override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
    println(s"Fitting dataset ${dataset.id} with ParamMap $paramMap.")
    transformSchema(dataset.schema, paramMap, logging = true)
    import dataset.sqlContext._
    val map = this.paramMap ++ paramMap
    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
      .map {
        case Row(label: Double, features: Vector) =>
          LabeledPoint(label, features)
      }

    // For parallel grid search
    this.synchronized({
      if (oldInstances == null || oldInstances.id != instances.id) {
        if (oldInstances != null) {
          oldInstances.unpersist()
        }
        oldInstances = instances
        instances.setName(s"Instances for LR with ParamMap $paramMap and RDD ${dataset.id}")
        instances.persist(StorageLevel.MEMORY_AND_DISK)
      }
    })

    val lr = (new LogisticRegressionWithLBFGS)
      .setValidateData(false)

    lr.optimizer
      .setRegParam(map(regParam))
      .setNumIterations(map(maxIter))
    val lrOldModel = lr.run(instances)
    val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
    // instances.unpersist()
    // copy model params
    Params.inheritValues(map, this, lrm)
    lrm
  }
}
```

Then for 3 folds in cross-validation and 3 hyperparameter values for 
LogisticRegression I got something like this:

```
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.5
}
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.1
}
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.01
}

Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.5
}
Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.1
}
Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.01
}
```

So persistence at the model level is needed to cache the folds for the hyperparameter 
grid search, while persistence at the GLM level is needed to speed up the 
StandardScaler transformation etc. I don't know yet how to do this efficiently 
without double caching.





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-02-16 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-74563955
  
@jkbradley I can setValidateData in GLM, but not in the LogisticRegression class from 
the new API. For my case I found a trick to customize anything I want (add an 
org.apache.spark.ml package to my project and extend any class). Once this API is 
public it will be easier to customize (e.g. use LogisticRegressionWithSGD instead of 
LRWithLBFGS) from the user's own namespace.





[GitHub] spark pull request: [Ml] SPARK-5804 Explicitly manage cache in Cro...

2015-02-13 Thread petro-rudenko
GitHub user petro-rudenko opened a pull request:

https://github.com/apache/spark/pull/4595

[Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop

On a big dataset, explicitly unpersisting the train and validation folds allows more 
data to be loaded into memory in the next loop iteration. On my environment 
(single node, 8 GB worker RAM, 2 GB dataset file, 3 folds for cross-validation), 
it saved more than 5 minutes.
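
The change boils down to the following pattern inside the k-fold loop (a simplified 
sketch; `dataset`, `sqlCtx`, `schema`, and `numFolds` are assumed to come from the 
surrounding CrossValidator code):
```scala
import org.apache.spark.mllib.util.MLUtils

// Cache each fold while it is being used, then release it before the next iteration.
val splits = MLUtils.kFold(dataset.rdd, numFolds, 0)
splits.zipWithIndex.foreach { case ((training, validation), fold) =>
  val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
  val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
  // ... fit every candidate model on trainingDataset and score it on validationDataset ...
  trainingDataset.unpersist()    // free memory before the next fold is materialized
  validationDataset.unpersist()
}
```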

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/petro-rudenko/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4595


commit c5f3265a13c39c693d1fd13d46fadff89d2ab6da
Author: Peter Rudenko petro.rude...@gmail.com
Date:   2015-02-13T19:21:56Z

[Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop

On a big dataset, explicitly unpersisting the train and validation folds allows more 
data to be loaded into memory in the next loop iteration. On my environment 
(single node, 8 GB worker RAM, 2 GB dataset file, 3 folds for cross-validation), 
it saved more than 5 minutes.







[GitHub] spark pull request: [Ml] SPARK-5796 Don't transform data on a last...

2015-02-13 Thread petro-rudenko
GitHub user petro-rudenko opened a pull request:

https://github.com/apache/spark/pull/4590

[Ml] SPARK-5796 Don't transform data on a last estimator in Pipeline

If it's the last estimator in a Pipeline, there's no need to transform the data, 
since there's no next stage that would consume it.
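
A loose sketch of the idea (names only approximately follow Pipeline.fit; this is not 
the exact diff):
```scala
// Only transform the running dataset when a later stage will actually consume it.
stages.zipWithIndex.foreach { case (stage, index) =>
  val isLast = index == stages.length - 1
  stage match {
    case estimator: Estimator[_] =>
      val model = estimator.fit(curDataset, paramMap)
      transformers += model
      if (!isLast) curDataset = model.transform(curDataset, paramMap)
    case transformer: Transformer =>
      transformers += transformer
      if (!isLast) curDataset = transformer.transform(curDataset, paramMap)
  }
}
```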

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/petro-rudenko/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4590.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4590


commit d13ec3324429919dcea549b00bae2e83ba51073c
Author: Peter Rudenko petro.rude...@gmail.com
Date:   2015-02-13T12:41:44Z

[Ml] SPARK-5796 Don't transform data on a last estimator in Pipeline

If it's the last estimator in a Pipeline, there's no need to transform the data, 
since there's no next stage that would consume it.







[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-02-09 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-73509087
  
One more issue. In the LogisticRegressionWithLBFGS class there's this line:
```scala
this.setFeatureScaling(true)
```

I have feature scaling as part of the pipeline, to produce new columns based on the 
scaled columns. But I can't tell the LogisticRegression class from the new API to set 
feature scaling to false.





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-27 Thread petro-rudenko
Github user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-71636977
  
It would also be nice to be able to get/set the model state:
```scala
// Run cross-validation, and choose the best set of parameters.
val cvModel = crossval.fit(training)
val modelState = cvModel.bestModel.getModelState
// Map(weights -> Vector(0.2, 0.3, 0.5, ...), regParam -> 0.1, ...)
// Save this state, pass it to another prediction frontend, etc.

val lr = new LogisticRegression()
val lrModel = lr.setModelState(modelState)
// LogisticRegressionModel

lrModel.transform(...).predict(...)
```

