[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186722#comment-15186722 ]

Wojciech Jurczyk commented on SPARK-13030:

I am not sure I understand you correctly. Are you against changing OHE to an Estimator? The fix you have proposed cannot be used, because OHE is a Transformer, not an Estimator. It is applied to the test data without any context from the training data.

> Change OneHotEncoder to Estimator
> ---------------------------------
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.6.0
> Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories
> differs between the training dataset and the test dataset.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
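To illustrate why the fit step matters here, the following is a minimal plain-Scala sketch. It does not use Spark: OneHotSketch, OneHotModel and fit are hypothetical names invented for this illustration, and unlike Spark's encoder it keeps one slot per category. The point is only that a fitted model remembers the categories seen in training, so test data with fewer (or unseen) categories still yields vectors of the same width.

```scala
// Plain-Scala sketch; these classes are hypothetical and are not Spark ML's API.
object OneHotSketch {
  // The fitted "model" remembers the category order learned at fit time.
  final case class OneHotModel(categories: Vector[String]) {
    // Encode one value; categories not seen during fit map to an all-zero vector.
    def transform(value: String): Vector[Double] =
      categories.map(c => if (c == value) 1.0 else 0.0)
  }

  // The "estimator" step: learn the distinct categories from the training column.
  def fit(trainingColumn: Seq[String]): OneHotModel =
    OneHotModel(trainingColumn.distinct.sorted.toVector)
}
```

A model fitted on a training column containing "a", "b" and "c" encodes a test column containing only "a" and "b" with three-element vectors, which a context-free Transformer cannot guarantee.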
[jira] [Commented] (SPARK-12874) ML StringIndexer does not protect itself from column name duplication
[ https://issues.apache.org/jira/browse/SPARK-12874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127934#comment-15127934 ]

Wojciech Jurczyk commented on SPARK-12874:

Thank you for the feedback and willingness to help, [~holdenk]. I'll prepare a PR with a fix in a few days.

> ML StringIndexer does not protect itself from column name duplication
> ---------------------------------------------------------------------
>
> Key: SPARK-12874
> URL: https://issues.apache.org/jira/browse/SPARK-12874
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Wojciech Jurczyk
>
> StringIndexerModel, when performing transform(), does not check the schema of
> the input DataFrame. Because of that, it is possible to create a DataFrame
> containing columns with duplicated names.
> This issue is similar to SPARK-12711. StringIndexer could make use of
> transformSchema to ensure that the input DataFrame schema is consistent with
> the parameter values.
> Please confirm. Then, I'll prepare a PR to resolve the bug.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147
[jira] [Created] (SPARK-13030) Change OneHotEncoder to Estimator
Wojciech Jurczyk created SPARK-13030:

Summary: Change OneHotEncoder to Estimator
Key: SPARK-13030
URL: https://issues.apache.org/jira/browse/SPARK-13030
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.6.0
Reporter: Wojciech Jurczyk

OneHotEncoder should be an Estimator, just like in scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). In its current form, it is impossible to use when the number of categories differs between the training dataset and the test dataset.
[jira] [Created] (SPARK-12877) TrainValidationSplit is missing in pyspark.ml.tuning
Wojciech Jurczyk created SPARK-12877:

Summary: TrainValidationSplit is missing in pyspark.ml.tuning
Key: SPARK-12877
URL: https://issues.apache.org/jira/browse/SPARK-12877
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.6.0
Reporter: Wojciech Jurczyk

I was investigating progress on SPARK-10759 and noticed that there is no TrainValidationSplit class in the pyspark.ml.tuning module. The Java/Scala examples in SPARK-10759 use org.apache.spark.ml.tuning.TrainValidationSplit, which is not available from Python, and this blocks SPARK-10759.

Does the class perhaps have a different name in PySpark? Also, I couldn't find any JIRA task saying that it needs to be implemented. Is it by design that the TrainValidationSplit estimator is not ported to PySpark? If not, that is, if the estimator needs porting, then I would like to contribute.
[jira] [Created] (SPARK-12874) ML StringIndexer does not protect itself from column name duplication
Wojciech Jurczyk created SPARK-12874:

Summary: ML StringIndexer does not protect itself from column name duplication
Key: SPARK-12874
URL: https://issues.apache.org/jira/browse/SPARK-12874
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 1.6.0, 1.5.2
Reporter: Wojciech Jurczyk

StringIndexerModel, when performing transform(), does not check the schema of the input DataFrame. Because of that, it is possible to create a DataFrame containing columns with duplicated names.

This issue is similar to SPARK-12711. StringIndexer could make use of transformSchema to ensure that the input DataFrame schema is consistent with the parameter values.

Please confirm. Then, I'll prepare a PR to resolve the bug.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147
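The kind of schema check described in this report could look roughly like the plain-Scala sketch below. It is an illustration under simplifying assumptions, not the eventual patch: SchemaCheckSketch is a hypothetical name, and a schema is modelled as a bare sequence of column names rather than Spark's StructType.

```scala
// Plain-Scala sketch; a schema is reduced to its column names (an assumption).
object SchemaCheckSketch {
  // Validate the column parameters against the schema, then append the output column.
  def transformSchema(fieldNames: Seq[String], inputCol: String, outputCol: String): Seq[String] = {
    require(fieldNames.contains(inputCol), s"Input column $inputCol does not exist.")
    require(!fieldNames.contains(outputCol), s"Output column $outputCol already exists.")
    fieldNames :+ outputCol
  }
}
```

Calling the same check at the start of transform() would reject a duplicated output column name before a DataFrame with two identically named columns can be built.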
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093932#comment-15093932 ]

Wojciech Jurczyk commented on SPARK-7146:

{quote}Cons: Users have to be careful since parameters can have different meanings for different algorithms.{quote}
I think users have to be careful even if the traits stay private. I mean, the getters/setters and the parameters themselves are visible anyway (users have to set the parameters somehow). Consider a parameter called threshold. Obviously, it can have multiple meanings depending on the context. Currently, threshold's meaning is hardcoded to binary classification and it can't be used in other cases.

{quote}Sharing the Param traits helps to encourage standardized Param names and documentation{quote}
but results in more specialized params (which restricts their use cases). On the other hand, inputCol/outputCol are good examples of parameters that are fully universal and generic. Having them in one trait would indeed result in some kind of standardization.

{quote}If the shared Params are public, then implementations could test for the traits.{quote}
A side note: this can be done anyway (by structural typing). And it's not always a bad thing (as long as one knows that the meaning of the parameters can differ).

{quote}It is unclear if we want users to rely on these traits, which are somewhat experimental.{quote}
As I mentioned in SPARK-12751, we want to rely on the traits (for now only input/output column, and obviously, only for Transformers that are not UnaryTransformers). As far as I know, the classes in ML that use sharedParams are experimental, too (like LinearRegressionModel). We depend on an experimental API anyway.

Maybe the parameters can be divided into groups? Parameters in the first group would be fully universal (like inputCol). Parameters in the second group would be less universal (but still shared, if used multiple times).

Additionally, I think some parameters should be removed from the shared params. Consider the threshold from the shared params once again. It's used only in Logistic Regression (if I'm correct). Other operations, like Binarizer, define their own threshold param.

Finally, I would vote for option (b). Overriding docs will do. And then, maybe it'd be possible to split the trait into two, one for internal and one for external use, to benefit from having both private and public traits?

> Should ML sharedParams be a public API?
> ---------------------------------------
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
> Issue Type: Brainstorming
> Components: ML
> Reporter: Joseph K. Bradley
>
> Discussion: Should the Param traits in sharedParams.scala be public?
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for
> different algorithms.
> * If the shared Params are public, then implementations could test for the
> traits. It is unclear if we want users to rely on these traits, which are
> somewhat experimental.
> Currently, the shared params are private.
> Proposal: Either
> (a) make the shared params private to encourage users to write specialized
> documentation and value checks for parameters, or
> (b) design a better way to encourage overriding documentation and parameter
> value checks
[jira] [Updated] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private
[ https://issues.apache.org/jira/browse/SPARK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wojciech Jurczyk updated SPARK-12751:

Description:
Many Estimators and Transformers mix in traits generated by [SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala]. These estimators and transformers (like StringIndexer, MinMaxScaler etc) are publicly accessible, while the traits generated by SharedParamsCodeGen are private\[ml\]. From user code it is possible to invoke the methods that the traits introduce, but it is illegal to use any trait explicitly. For example, you can call setInputCol(str) on StringIndexer but you are not allowed to assign StringIndexer to a variable of type HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
As another example, it is impossible to create a collection of transformers that have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol with HasOutputCol\]). We have to use structural typing and reflective calls like this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}
This seems easy to fix; exposing a couple of traits should not break anything. On the other hand, maybe it goes deeper than that.

was:
Many Estimators and Transformers mix in traits generated by SharedParamsCodeGen. These estimators and transformers (like StringIndexer, MinMaxScaler etc) are accessible publicly while traits generated by SharedParamsCodeGen are private\[ml\]. From user's code it is possible to invoke methods that the traits introduce but it is illegal to use any trait explicitly. For example, you can call setInputCol(str) on StringIndexer but you are not allowed to assign StringIndexer to a variable of type HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
For example, it is impossible to create a collection of transformers that have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol with HasOutputCol\]). We have to use structural typing and reflective calls like this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}
This seems easy to fix, exposing a couple of traits should not break anything. On the other hand, maybe it goes deeper than that.

> Traits generated by SharedParamsCodeGen should not be private
> -------------------------------------------------------------
>
> Key: SPARK-12751
> URL: https://issues.apache.org/jira/browse/SPARK-12751
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Wojciech Jurczyk
> Priority: Minor
>
> Many Estimators and Transformers mix in traits generated by
> [SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala].
> These estimators and transformers (like StringIndexer, MinMaxScaler etc) are
> publicly accessible, while the traits generated by SharedParamsCodeGen are
> private\[ml\]. From user code it is possible to invoke the methods that the
> traits introduce, but it is illegal to use any trait explicitly. For example,
> you can call setInputCol(str) on StringIndexer but you are not allowed to
> assign StringIndexer to a variable of type HasInputCol.
> {code:java}
> val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
> {code}
> As another example, it is impossible to create a collection of transformers
> that have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with
> HasInputCol with HasOutputCol\]). We have to use structural typing and
> reflective calls like this:
> {code}
> ml.Estimator[_] { val outputCol: ml.param.Param[String] }
> {code}
> This seems easy to fix; exposing a couple of traits should not break
> anything. On the other hand, maybe it goes deeper than that.
[jira] [Created] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private
Wojciech Jurczyk created SPARK-12751:

Summary: Traits generated by SharedParamsCodeGen should not be private
Key: SPARK-12751
URL: https://issues.apache.org/jira/browse/SPARK-12751
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.6.0, 1.5.2
Reporter: Wojciech Jurczyk

Many Estimators and Transformers mix in traits generated by SharedParamsCodeGen. These estimators and transformers (like StringIndexer, MinMaxScaler etc) are publicly accessible, while the traits generated by SharedParamsCodeGen are private\[ml\]. From user code it is possible to invoke the methods that the traits introduce, but it is illegal to use any trait explicitly. For example, you can call setInputCol(str) on StringIndexer but you are not allowed to assign StringIndexer to a variable of type HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
As another example, it is impossible to create a collection of transformers that have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol with HasOutputCol\]). We have to use structural typing and reflective calls like this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}
This seems easy to fix; exposing a couple of traits should not break anything. On the other hand, maybe it goes deeper than that.
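The difference between the two approaches mentioned in this report can be sketched in plain Scala. Everything below is a hypothetical stand-in (HasOutputColLike, IndexerLike, ScalerLike are invented for illustration and only mimic the shape of Spark's classes), and the sketch assumes Scala 2.x semantics for structural types.

```scala
import scala.language.reflectiveCalls

// Plain-Scala sketch with hypothetical stand-ins; none of these are Spark classes.
object SharedParamsSketch {
  // Stand-in for a shared-param trait; in Spark 1.6 the real one is private[ml].
  trait HasOutputColLike { def getOutputCol: String }
  class IndexerLike extends HasOutputColLike { def getOutputCol: String = "indexed" }
  class ScalerLike extends HasOutputColLike { def getOutputCol: String = "scaled" }

  // With a visible trait, the member access is checked and dispatched statically.
  def viaTrait(stages: Seq[HasOutputColLike]): Seq[String] =
    stages.map(_.getOutputCol)

  // Without it, a structural type still works, but each call goes through reflection.
  type OutputColShape = { def getOutputCol: String }
  def viaStructuralType(stages: Seq[OutputColShape]): Seq[String] =
    stages.map(_.getOutputCol)
}
```

Both functions return the same column names; the structural version needs the reflectiveCalls import and pays for a reflective lookup on every call, which is the workaround the report describes.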
[jira] [Updated] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private
[ https://issues.apache.org/jira/browse/SPARK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wojciech Jurczyk updated SPARK-12751:

Priority: Minor (was: Major)

> Traits generated by SharedParamsCodeGen should not be private
> -------------------------------------------------------------
>
> Key: SPARK-12751
> URL: https://issues.apache.org/jira/browse/SPARK-12751
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Wojciech Jurczyk
> Priority: Minor
>
> Many Estimators and Transformers mix in traits generated by
> SharedParamsCodeGen. These estimators and transformers (like StringIndexer,
> MinMaxScaler etc) are accessible publicly while traits generated by
> SharedParamsCodeGen are private\[ml\]. From user's code it is possible to
> invoke methods that the traits introduce but it is illegal to use any trait
> explicitly. For example, you can call setInputCol(str) on StringIndexer but
> you are not allowed to assign StringIndexer to a variable of type HasInputCol.
> {code:java}
> val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
> {code}
> For example, it is impossible to create a collection of transformers that
> have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with
> HasInputCol with HasOutputCol\]). We have to use structural typing and
> reflective calls like this:
> {code}
> ml.Estimator[_] { val outputCol: ml.param.Param[String] }
> {code}
> This seems easy to fix, exposing a couple of traits should not break
> anything. On the other hand, maybe it goes deeper than that.
[jira] [Commented] (SPARK-12711) ML StopWordsRemover does not protect itself from column name duplication
[ https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090535#comment-15090535 ]

Wojciech Jurczyk commented on SPARK-12711:

[~josephkb] Is there any particular reason why StopWordsRemover is not a UnaryTransformer? As the docs say, the UnaryTransformer is an "Abstract class for transformers that take one input column, apply transformation, and output the result as a new column.", which is the case here. Moreover, the UnaryTransformer implementation checks whether the output column already exists. So making StopWordsRemover a UnaryTransformer would solve the issue.

Speaking of UnaryTransformer candidates, I think StringIndexer is a similar case (and there are probably other Transformers that could be UnaryTransformers). It doesn't check whether the output column exists in the input DataFrame (it has the same flaw). Making StringIndexer a UnaryTransformer would fix that flaw, too. What do you think?

> ML StopWordsRemover does not protect itself from column name duplication
> ------------------------------------------------------------------------
>
> Key: SPARK-12711
> URL: https://issues.apache.org/jira/browse/SPARK-12711
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 1.6.0
> Reporter: Grzegorz Chilkiewicz
> Priority: Trivial
> Labels: ml, mllib, newbie, suggestion
>
> At work we were 'taking a closer look' at ML transformers and I
> spotted that anomaly.
> At first look, the resolution looks simple:
> Add the following line to StopWordsRemover.transformSchema (as is done in e.g.
> PCA.transformSchema, StandardScaler.transformSchema,
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)), s"Output column ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If yes, I am willing to prepare an
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in
> StopWordsRemover (and possibly in all other places)?
> Links to the files on GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76
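The argument for a shared base class can be sketched in plain Scala as follows. This is a loose model of the idea only, not Spark's UnaryTransformer: UnaryLike and StopWordsLike are hypothetical names, and a schema is again reduced to a sequence of column names.

```scala
// Plain-Scala sketch; hypothetical classes that only mimic the idea behind UnaryTransformer.
object UnaryCheckSketch {
  abstract class UnaryLike(val inputCol: String, val outputCol: String) {
    // The duplicate-name check lives here once, so subclasses cannot forget it.
    final def transformSchema(fieldNames: Seq[String]): Seq[String] = {
      require(!fieldNames.contains(outputCol), s"Output column $outputCol already exists.")
      fieldNames :+ outputCol
    }

    // Subclasses only describe the per-value transformation.
    def transformValue(words: Seq[String]): Seq[String]
  }

  // A stop-word filter that inherits the schema check instead of reimplementing it.
  class StopWordsLike(in: String, out: String, stopWords: Set[String])
      extends UnaryLike(in, out) {
    def transformValue(words: Seq[String]): Seq[String] =
      words.filterNot(stopWords.contains)
  }
}
```

Because transformSchema is final in the base class, every subclass rejects an already existing output column without repeating the require line.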
[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema
[ https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060210#comment-15060210 ]

Wojciech Jurczyk commented on SPARK-11478:

Any progress on this, [~yanboliang]? I faced the same issue and I'm wondering if you're still working on this.

> ML StringIndexer return inconsistent schema
> -------------------------------------------
>
> Key: SPARK-11478
> URL: https://issues.apache.org/jira/browse/SPARK-11478
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> ML StringIndexer transform and transformSchema return inconsistent schema.
> {code}
> val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2)
> val df = sqlContext.createDataFrame(data).toDF("id", "label")
> val indexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndex")
>   .fit(df)
> val transformed = indexer.transform(df)
> println(transformed.schema.toString())
> println(indexer.transformSchema(df.schema))
> The nullable of "labelIndex" return inconsistent value:
> StructType(StructField(id,IntegerType,false), StructField(label,StringType,true), StructField(labelIndex,DoubleType,true))
> StructType(StructField(id,IntegerType,false), StructField(label,StringType,true), StructField(labelIndex,DoubleType,false))
> {code}
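A mismatch like the one in the quoted output can be detected mechanically. The plain-Scala sketch below is only an illustration: Field is a hypothetical, simplified stand-in for Spark's StructField, and nullabilityMismatches is an invented helper, not part of any Spark API.

```scala
// Plain-Scala sketch; Field is a simplified, hypothetical stand-in for StructField.
object SchemaDiffSketch {
  final case class Field(name: String, dataType: String, nullable: Boolean)

  // Names of fields whose nullability differs between the two schemas.
  def nullabilityMismatches(actual: Seq[Field], declared: Seq[Field]): Seq[String] = {
    val declaredByName = declared.map(f => f.name -> f.nullable).toMap
    actual.collect {
      case f if declaredByName.get(f.name).exists(_ != f.nullable) => f.name
    }
  }
}
```

Fed with the two schemas printed in the issue (the transform() result as actual, the transformSchema() result as declared), the helper reports labelIndex as the only field whose nullable flag disagrees.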