[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...
Github user petro-rudenko closed the pull request at: https://github.com/apache/spark/pull/5510 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5510#issuecomment-94418041 In my case I can live with the default behaviour. It's just not intuitive that an empty ParamGridBuilder returns an array of size 1, and it's also not clear how to handle just one parameter. E.g. if there's only one param, just set it explicitly and don't use cross-validation.
[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5510#issuecomment-94419332 For my case it means:
```scala
(new ParamGridBuilder).addGrid(lr.regParam, Array(0.1)).build()
// is effectively the same as setting the param directly and building an empty grid:
lr.setRegParam(0.1)
(new ParamGridBuilder).build()
```
So if there's only one param, just overwrite the default value and again run as with an empty param map.
[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5510#issuecomment-93412249 Ideally the cross-validator should handle the following cases: 1) No parameters at all: just run est.fit(dataset, new ParamMap). 2) One param: set this param on the estimator (assume it's a weird way to override a default param) and again do step 1. 3) 2+ params: do cross-validation.
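The three cases described above could be sketched like this (plain Scala; the function name and return values here are illustrative, not Spark's actual API):

```scala
// Hypothetical dispatch on the number of candidate param maps; a sketch of the
// behaviour proposed above, not CrossValidator's actual implementation.
def fitStrategy(numParamMaps: Int): String = numParamMaps match {
  case 0 => "fit with defaults"        // est.fit(dataset, new ParamMap)
  case 1 => "override param, then fit" // set the single param, fit once
  case _ => "cross-validate"           // full grid search over 2+ candidates
}
```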
[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5510#issuecomment-93373411 Maybe handle an empty estimatorParamMaps in CrossValidator?
```scala
/** @group setParam */
def setEstimatorParamMaps(value: Array[ParamMap]): this.type = {
  if (value.isEmpty) {
    set(estimatorParamMaps, Array(new ParamMap))
  } else {
    set(estimatorParamMaps, value)
  }
}
```
?
[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...
GitHub user petro-rudenko opened a pull request: https://github.com/apache/spark/pull/5510 [SPARK-6901][Ml] ParamGridBuilder.build with no grids should return an empty array. Currently, ParamGridBuilder.build with no grids returns an array containing one empty param map:
```scala
assert((new ParamGridBuilder).build().size == 1)
```
I have logic that skips CrossValidator when the ParamGridBuilder is empty. This is confusing, because a ParamGridBuilder with one grid point will also return an array of size 1:
```scala
assert((new ParamGridBuilder).addGrid(lr.regParam, Array(0.1)).build().size == 1)
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/petro-rudenko/spark SPARK-6901 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5510.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5510 commit 742bd209c7fd5fc82c65a86a1b28de2470db018b Author: Peter Rudenko petro.rude...@gmail.com Date: 2015-04-14T15:12:22Z [SPARK-6901][Ml]ParamGridBuilder.build with no grids should return an empty array
[GitHub] spark pull request: [SPARK-6901][Ml]ParamGridBuilder.build with no...
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/5510#discussion_r28339279 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamGridBuilder.scala --- @@ -100,10 +100,11 @@ class ParamGridBuilder { * Builds and returns all combinations of parameters specified by the param grid. */ def build(): Array[ParamMap] = { -var paramMaps = Array(new ParamMap) +var paramMaps = Array.empty[ParamMap] --- End diff -- Do you mean like this:
```scala
def build(): Array[ParamMap] = {
  if (paramGrid.isEmpty) Array.empty[ParamMap]
  else {
    var paramMaps = Array(new ParamMap)
    paramGrid.foreach { case (param, values) =>
      val newParamMaps = values.flatMap { v =>
        paramMaps.map(_.copy.put(param.asInstanceOf[Param[Any]], v))
      }
      paramMaps = newParamMaps.toArray
    }
    paramMaps
  }
}
```
?
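The cartesian expansion that build() performs can be mirrored with plain Scala collections. This is only a sketch of the same fold-over-grids idea (param names as Strings, candidate values as Doubles), independent of Spark's ParamMap type:

```scala
// Expand a param grid (param name -> candidate values) into all combinations.
// Returns an empty sequence when the grid is empty, matching the behaviour
// this PR proposes for ParamGridBuilder.build.
def expand(grid: Map[String, Seq[Double]]): Seq[Map[String, Double]] =
  if (grid.isEmpty) Seq.empty
  else grid.foldLeft(Seq(Map.empty[String, Double])) {
    case (maps, (param, values)) =>
      // Each candidate value forks every partially built map.
      values.flatMap(v => maps.map(_ + (param -> v)))
  }
```

For example, a grid with two values for one param and two for another expands to four maps, while an empty grid expands to none.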
[GitHub] spark pull request: [SPARK-2991] Implement RDD lazy transforms for...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/1909#issuecomment-90063723 +1 for this. A useful feature for calculating a distributed cumulative sum.
[GitHub] spark pull request: [SPARK-5885][MLLIB] Add VectorAssembler as a f...
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/5196#discussion_r27739585 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.collection.mutable.ArrayBuilder + +import org.apache.spark.SparkException +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{HasInputCols, HasOutputCol, ParamMap} +import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors} +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute +import org.apache.spark.sql.catalyst.expressions.CreateStruct +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: AlphaComponent :: + * A feature transformer that merges multiple columns into a vector column. + */ +@AlphaComponent +class VectorAssembler extends Transformer with HasInputCols with HasOutputCol { + + /** @group setParam */ + def setInputCols(value: Array[String]): this.type = set(inputCols, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + override def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame = { +val map = this.paramMap ++ paramMap +val assembleFunc = udf { r: Row => + VectorAssembler.assemble(r.toSeq: _*) +} +val args = map(inputCols).map(c => UnresolvedAttribute(c)) +dataset.select(col("*"), assembleFunc(new Column(CreateStruct(args))).as(map(outputCol))) + } + + override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = { +val map = this.paramMap ++ paramMap +val inputColNames = map(inputCols) +val outputColName = map(outputCol) +val inputDataTypes = inputColNames.map(name => schema(name).dataType) +for (dataType <- inputDataTypes) { + if (!(dataType == DoubleType || dataType.isInstanceOf[VectorUDT])) { +throw new IllegalArgumentException(s"Data type $dataType is not supported.") + } +} +if (schema.fieldNames.contains(outputColName)) { + throw new IllegalArgumentException(s"Output column $outputColName already exists.") +} +StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, false)) + } +} + +@AlphaComponent +object VectorAssembler { + + private[feature] def assemble(vv: Any*): Vector = { +val indices = ArrayBuilder.make[Int] +val values = ArrayBuilder.make[Double] +var cur = 0 +vv.foreach { + case v: Double => --- End diff -- It would be good to also support Integers and just convert them to Double.
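The suggestion to also accept integral inputs could look roughly like the following widening helper (a hypothetical sketch in plain Scala, not the actual VectorAssembler code):

```scala
// Hypothetical widening of numeric inputs before assembling a vector:
// Int and Long are converted to Double, anything else is rejected.
def toDouble(v: Any): Double = v match {
  case d: Double => d
  case i: Int    => i.toDouble
  case l: Long   => l.toDouble
  case other     => throw new IllegalArgumentException(
    s"Unsupported value type: ${other.getClass.getName}")
}
```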
[GitHub] spark pull request: [SPARK-5885][MLLIB] Add VectorAssembler as a f...
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/5196#discussion_r27645880 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.collection.mutable.ArrayBuilder + +import org.apache.spark.SparkException +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{HasInputCols, HasOutputCol, ParamMap} +import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors} +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute +import org.apache.spark.sql.catalyst.expressions.CreateStruct +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: AlphaComponent :: + * A feature transformer that merges multiple columns into a vector column. + */ +@AlphaComponent +class VectorAssembler extends Transformer with HasInputCols with HasOutputCol { --- End diff -- Maybe call it FeatureUnion to keep the same semantics as [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)
[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/4735#discussion_r27486767 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala --- @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkException +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.attribute.NominalAttribute +import org.apache.spark.ml.param._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{StringType, StructType} +import org.apache.spark.util.collection.OpenHashMap + +/** + * Base trait for [[LabelIndexer]] and [[LabelIndexerModel]]. + */ +private[feature] trait LabelIndexerBase extends Params with HasLabelCol with HasOutputCol { + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = { +val map = this.paramMap ++ paramMap +val labelType = schema(map(labelCol)).dataType +require(labelType == StringType, s"The label column must be string-typed but got $labelType.") +val inputFields = schema.fields +val outputColName = map(outputCol) +require(inputFields.forall(_.name != outputColName), + s"Output column $outputColName already exists.") +val attr = NominalAttribute.defaultAttr.withName(map(outputCol)) +val outputFields = inputFields :+ attr.toStructField() +StructType(outputFields) + } +} + +/** + * :: AlphaComponent :: + * A label indexer that maps a string column of labels to an ML column of label indices. + * The indices are in [0, numLabels), ordered by label frequencies. + * So the most frequent label gets index 0. + */ +@AlphaComponent +class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase { + + /** @group setParam */ + def setLabelCol(value: String): this.type = set(labelCol, value) --- End diff -- If it's supposed to be a general indexer, not just for the label column, maybe it makes sense to call it ColumnIndexer and use setInputCol instead.
[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/4735#discussion_r27510186 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala --- @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkException +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.attribute.NominalAttribute +import org.apache.spark.ml.param._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{StringType, StructType} +import org.apache.spark.util.collection.OpenHashMap + +/** + * Base trait for [[LabelIndexer]] and [[LabelIndexerModel]]. + */ +private[feature] trait LabelIndexerBase extends Params with HasLabelCol with HasOutputCol { + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = { +val map = this.paramMap ++ paramMap +val labelType = schema(map(labelCol)).dataType +require(labelType == StringType, s"The label column must be string-typed but got $labelType.") +val inputFields = schema.fields +val outputColName = map(outputCol) +require(inputFields.forall(_.name != outputColName), + s"Output column $outputColName already exists.") +val attr = NominalAttribute.defaultAttr.withName(map(outputCol)) +val outputFields = inputFields :+ attr.toStructField() +StructType(outputFields) + } +} + +/** + * :: AlphaComponent :: + * A label indexer that maps a string column of labels to an ML column of label indices. + * The indices are in [0, numLabels), ordered by label frequencies. + * So the most frequent label gets index 0. + */ +@AlphaComponent +class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase { + + /** @group setParam */ + def setLabelCol(value: String): this.type = set(labelCol, value) --- End diff -- Yes, but the example in the JIRA uses setInputCol rather than setFeatureCol:
```scala
val i = new LabelIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
```
which makes more sense to me.
[GitHub] spark pull request: [SPARK-6608] [SQL] Makes DataFrame.rdd a lazy ...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/5265#issuecomment-87670835 +1 for this, since for example [the caching logic from the ml package](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L64) doesn't work properly.
[GitHub] spark pull request: [SPARK-5886][ML] Add label indexer
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/4735#discussion_r27399968 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LabelIndexer.scala --- @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkException +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.attribute.NominalAttribute +import org.apache.spark.ml.param._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{StringType, StructType} +import org.apache.spark.util.collection.OpenHashMap + +/** + * Base trait for [[LabelIndexer]] and [[LabelIndexerModel]]. + */ +private[feature] trait LabelIndexerBase extends Params with HasLabelCol with HasOutputCol { + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = { +val map = this.paramMap ++ paramMap +val labelType = schema(map(labelCol)).dataType +require(labelType == StringType, s"The label column must be string-typed but got $labelType.") +val inputFields = schema.fields +val outputColName = map(outputCol) +require(inputFields.forall(_.name != outputColName), + s"Output column $outputColName already exists.") +val attr = NominalAttribute.defaultAttr.withName(map(outputCol)) +val outputFields = inputFields :+ attr.toStructField() +StructType(outputFields) + } +} + +/** + * :: AlphaComponent :: + * A label indexer that maps a string column of labels to an ML column of label indices. + * The indices are in [0, numLabels), ordered by label frequencies. + * So the most frequent label gets index 0. + */ +@AlphaComponent +class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase { + + /** @group setParam */ + def setLabelCol(value: String): this.type = set(labelCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + // TODO: handle unseen labels + + override def fit(dataset: DataFrame, paramMap: ParamMap): LabelIndexerModel = { +val map = this.paramMap ++ paramMap +val counts = dataset.select(map(labelCol)).map(_.getString(0)).countByValue() +val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray --- End diff -- Maybe it makes sense to use the implementation from [DatasetIndexer](https://github.com/apache/spark/pull/3000/), or vice versa, to keep the transformation logic in one place, so that if there is a need to optimize performance (e.g. to take advantage of columnar storage, since some storages can provide column cardinality metadata) it can be changed in one place.
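The frequency-ordered index assignment discussed here amounts to the following, shown with plain Scala collections (a sketch of the countByValue/sortBy logic, not the mllib implementation):

```scala
// Count label occurrences, sort by descending frequency, and assign indices,
// so the most frequent label gets index 0.
def indexByFrequency(labels: Seq[String]): Map[String, Int] =
  labels.groupBy(identity)
    .map { case (label, occurrences) => (label, occurrences.size) }
    .toSeq
    .sortBy(-_._2)   // most frequent first
    .map(_._1)
    .zipWithIndex
    .toMap
```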
[GitHub] spark pull request: [ML][docs][minor] Define LabeledDocument/Docum...
Github user petro-rudenko commented on a diff in the pull request: https://github.com/apache/spark/pull/5135#discussion_r27043852 --- Diff: docs/ml-guide.md --- @@ -655,6 +660,36 @@ import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.SQLContext; +// Labeled and unlabeled instance types. --- End diff -- Yes, it's annoying when copy/pasting a bunch of code into the spark shell and it fails because these classes are not declared.
[GitHub] spark pull request: [ML][docs][minor] Define LabeledDocument/Docum...
GitHub user petro-rudenko opened a pull request: https://github.com/apache/spark/pull/5135 [ML][docs][minor] Define LabeledDocument/Document classes in CV example. To make it easier to copy/paste the Cross-Validation example code snippet, we need to define LabeledDocument/Document in it, since they are defined in a previous example. You can merge this pull request into a Git repository by running: $ git pull https://github.com/petro-rudenko/spark patch-3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5135.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5135 commit 1d35383bf893aa7c14fb4750d730b3bf6c92cfe7 Author: Peter Rudenko petro.rude...@gmail.com Date: 2015-03-23T11:28:19Z [SQL][docs][minor] Define LabeledDocument/Document classes in CV example To make it easier to copy/paste the Cross-Validation example code snippet, we need to define LabeledDocument/Document in it, since they are defined in a previous example.
[GitHub] spark pull request: SPARK-4682 [CORE] Consolidate various 'Clock' ...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/4514#issuecomment-75989874 Having a problem compiling Spark with sbt due to the following error:
```
$ build/sbt -Phadoop-2.4 compile
[error] /home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:127: type mismatch;
[error]  found   : org.apache.spark.util.SystemClock
[error]  required: org.apache.spark.Clock
[error]   private var clock: Clock = new SystemClock()
[error]                              ^
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Resolving org.objenesis#objenesis;1.2 ...
[info] Updating {file:/home/peter/soft/spark_src/}streaming-mqtt...
[info] Resolving org.apache.hadoop#hadoop-mapreduce-client-common;2.4.0 ...
[error] /home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:66: reference to Clock is ambiguous;
[error] it is imported twice in the same scope by
[error] import org.apache.spark.util._
[error] and import org.apache.spark._
[error]     clock: Clock = new SystemClock())
[error]            ^
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Resolving org.apache.hadoop#hadoop-annotations;2.4.0 ...
[info] Updating {file:/home/peter/soft/spark_src/}streaming-twitter...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Resolving org.apache.spark#spark-network-shuffle_2.10;1.3.0-SNAPSHOT ...
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn]  * com.google.guava:guava:(14.0.1, 11.0.2) -> 15.0
[warn] Run 'evicted' to see detailed eviction warnings
[info] Resolving org.objenesis#objenesis;1.2 ...
[info] Updating {file:/home/peter/soft/spark_src/}streaming-flume...
[info] Resolving commons-net#commons-net;3.1 ...
[info] Updating {file:/home/peter/soft/spark_src/}tools...
[info] Resolving net.sf.py4j#py4j;0.8.2.1 ...
[warn] /home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:34: imported `Clock' is permanently hidden by definition of trait Clock in package worker
[warn] import org.apache.spark.util.{Clock, SystemClock}
[warn]                               ^
[info] Resolving org.twitter4j#twitter4j-core;3.0.3 ...
[error] /home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:61: type mismatch;
[error]  found   : org.apache.spark.util.SystemClock
[error]  required: org.apache.spark.deploy.worker.Clock
[error]   private var clock: Clock = new SystemClock()
[error]                              ^
[error] /home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:190: value getTimeMillis is not a member of org.apache.spark.deploy.worker.Clock
[error]     val processStart = clock.getTimeMillis()
[error]                              ^
[error] /home/peter/soft/spark_src/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:192: value getTimeMillis is not a member of org.apache.spark.deploy.worker.Clock
[error]     if (clock.getTimeMillis() - processStart > successfulRunDuration * 1000) {
[error]         ^
```
[GitHub] spark pull request: SPARK-4682 [CORE] Consolidate various 'Clock' ...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/4514#issuecomment-75994711 Thanks, works now.
[GitHub] spark pull request: [SPARK-5802][MLLIB] cache transformed data in ...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/4593#issuecomment-75550855 @dbtsai, @joshdevins here's an issue I have. I'm using the new ml pipeline with hyperparameter grid search. Because the folds don't depend on the hyperparameters, I've reimplemented LogisticRegression a bit so that it doesn't unpersist the data:

```scala
class CustomLogisticRegression extends LogisticRegression {

  var oldInstances: RDD[LabeledPoint] = null

  override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
    println(s"Fitting dataset ${dataset.id} with ParamMap $paramMap.")
    transformSchema(dataset.schema, paramMap, logging = true)
    import dataset.sqlContext._
    val map = this.paramMap ++ paramMap
    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
      .map { case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
      }

    // For parallel grid search
    this.synchronized({
      if (oldInstances == null || oldInstances.id != instances.id) {
        if (oldInstances != null) {
          oldInstances.unpersist()
        }
        oldInstances = instances
        instances.setName(s"Instances for LR with ParamMap $paramMap and RDD ${dataset.id}")
        instances.persist(StorageLevel.MEMORY_AND_DISK)
      }
    })

    val lr = (new LogisticRegressionWithLBFGS)
      .setValidateData(false)
    lr.optimizer
      .setRegParam(map(regParam))
      .setNumIterations(map(maxIter))
    val lrOldModel = lr.run(instances)
    val lrm = new LogisticRegressionModel(this, map, lrOldModel.weights)
    // instances.unpersist()
    // copy model params
    Params.inheritValues(map, this, lrm)
    lrm
  }
}
```

Then for 3 folds in cross-validation and 3 hyperparameter values for LogisticRegression I get something like this:

```
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.5 }
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.1 }
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.01 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.5 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.1 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.01 }
```

So persistence at the model level is needed to cache the folds across the hyperparameter grid search, while persistence at the GLM level is needed to speed up the StandardScaler transformation, etc. I don't know yet how to do this efficiently without double caching.
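The reuse pattern in the snippet above (transform and cache the instances once per dataset, then share them across hyperparameter values) can be shown without Spark. A hypothetical sketch where a plain function stands in for the select/map transformation and a map stands in for `persist`:

```scala
import scala.collection.mutable

// Stand-in for the persisted `instances` RDD: transform each dataset id
// once, then reuse the cached result for every hyperparameter value.
var transformCalls = 0
val cache = mutable.Map.empty[Int, Vector[Double]]

def instancesFor(datasetId: Int, raw: Vector[Double]): Vector[Double] =
  cache.synchronized { // same guard as in the snippet, for parallel grid search
    cache.getOrElseUpdate(datasetId, { transformCalls += 1; raw.map(_ * 2.0) })
  }

val fold = Vector(1.0, 2.0, 3.0)
for (regParam <- Seq(0.5, 0.1, 0.01)) {
  val instances = instancesFor(11, fold) // three fits, one transformation
}
```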
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-74563955 @jkbradley I can call setValidateData on a GLM, but not on the LogisticRegression class from the new API. For my case I found a trick to customize anything I want (add an org.apache.spark.ml package to my project and extend any class). When this API becomes public, it will be easier to customize in the user's own namespace (e.g. use LogisticRegressionWithSGD instead of LRWithLBFGS).
[GitHub] spark pull request: [Ml] SPARK-5804 Explicitly manage cache in Cro...
GitHub user petro-rudenko opened a pull request: https://github.com/apache/spark/pull/4595 [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop On a big dataset, explicitly unpersisting the train and validation folds allows more data to be loaded into memory in the next loop iteration. On my environment (single node, 8 GB worker RAM, 2 GB dataset file, 3 folds for cross-validation) this saved more than 5 minutes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/petro-rudenko/spark patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4595.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4595 commit c5f3265a13c39c693d1fd13d46fadff89d2ab6da Author: Peter Rudenko petro.rude...@gmail.com Date: 2015-02-13T19:21:56Z [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop On a big dataset explicitly unpersist train and validation folds allows to load more data into memory in the next loop iteration. On my environment (single node 8Gb worker RAM, 2 GB dataset file, 3 folds for cross validation), saved more than 5 minutes.
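A toy version of the k-fold loop this PR describes, with `persist`/`unpersist` as simple stand-ins for the RDD calls, to show that each train/validation split is released before the next one is cached (so memory never holds more than one fold's splits):

```scala
import scala.collection.mutable

// Track what is currently "cached"; persist/unpersist stand in for
// RDD.persist()/RDD.unpersist() on the split datasets.
val persisted = mutable.Set.empty[String]
var peak = 0

def persist(name: String): Unit = {
  persisted += name
  peak = peak.max(persisted.size)
}
def unpersist(name: String): Unit = persisted -= name

val numFolds = 3
for (fold <- 0 until numFolds) {
  persist(s"train-$fold"); persist(s"validation-$fold")
  // ... fit the estimators on the train split, evaluate on validation ...
  unpersist(s"train-$fold"); unpersist(s"validation-$fold")
}
// At most one fold's two splits were ever held at a time.
```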
[GitHub] spark pull request: [Ml] SPARK-5796 Don't transform data on a last...
GitHub user petro-rudenko opened a pull request: https://github.com/apache/spark/pull/4590 [Ml] SPARK-5796 Don't transform data on a last estimator in Pipeline If it's the last estimator in a Pipeline, there's no need to transform the data, since there is no next stage that would consume it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/petro-rudenko/spark patch-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4590.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4590 commit d13ec3324429919dcea549b00bae2e83ba51073c Author: Peter Rudenko petro.rude...@gmail.com Date: 2015-02-13T12:41:44Z [Ml] SPARK-5796 Don't transform data on a last estimator in Pipeline If it's a last estimator in Pipeline there's no need to transform data, since there's no next stage that would consume this data.
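A hypothetical miniature of the change: every stage still gets fit, but the dataset is only transformed when a later stage will consume it, so the last estimator's model never runs a useless transform. `ToyEstimator`/`ToyModel` are illustrative stand-ins, not the ml API:

```scala
// Count how many times the dataset is actually transformed.
var transforms = 0

case class ToyModel(shift: Double) {
  def transform(data: Vector[Double]): Vector[Double] = {
    transforms += 1
    data.map(_ + shift)
  }
}
case class ToyEstimator(shift: Double) {
  def fit(data: Vector[Double]): ToyModel = ToyModel(shift)
}

val stages = Seq(ToyEstimator(1.0), ToyEstimator(2.0))
var cur = Vector(0.0, 1.0)
val models = stages.zipWithIndex.map { case (est, i) =>
  val model = est.fit(cur)
  if (i < stages.length - 1) cur = model.transform(cur) // skip on the last stage
  model
}
// With two estimators, only the first model transformed the data.
```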
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-73509087 One more issue: the LogisticRegressionWithLBFGS class contains the line:

```scala
this.setFeatureScaling(true)
```

I have feature scaling as a part of my pipeline, to produce new columns based on the scaled columns, but I can't tell the LogisticRegression class from the new API to set feature scaling to false.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user petro-rudenko commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-71636977 Also it would be nice to be able to get/set the model state:

```scala
// Run cross-validation, and choose the best set of parameters.
val cvModel = crossval.fit(training)
val modelState = cvModel.bestModel.getModelState
// Map(weights -> Vector(0.2, 0.3, 0.5, ...), regParam -> 0.1, ...)
// Save this state, pass it to another prediction frontend, etc.
val lr = new LogisticRegression()
val lrModel = lr.setModelState(modelState) // LogisticRegressionModel
lrModel.transform(...).predict(...)
```