[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2016-03-09 Thread Wojciech Jurczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186722#comment-15186722
 ] 

Wojciech Jurczyk commented on SPARK-13030:
--

I am not sure I understand you correctly. Are you against changing OHE to an 
Estimator? The fix you have proposed cannot be used, because OHE is a 
Transformer, not an Estimator. It is applied to the test data without any 
context from the training data.
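
The difference between the two designs can be illustrated with a minimal plain-Python sketch. This is not Spark code: the class names OneHotEncoderSketch and OneHotEncoderModelSketch are hypothetical, and mapping unseen categories to an all-zero row is an assumption made only for this illustration. The point it shows is that an estimator learns the category set once, from the training data, so the encoded width stays fixed even when the test data contains fewer categories.

```python
from typing import Dict, Iterable, List


class OneHotEncoderSketch:
    """Hypothetical estimator: learns the category set once, at fit time."""

    def fit(self, values: Iterable[str]) -> "OneHotEncoderModelSketch":
        # Sort for a deterministic category -> index mapping.
        categories = sorted(set(values))
        return OneHotEncoderModelSketch({c: i for i, c in enumerate(categories)})


class OneHotEncoderModelSketch:
    """Hypothetical fitted model: always emits vectors of the fitted width."""

    def __init__(self, index: Dict[str, int]) -> None:
        self.index = index

    def transform(self, values: Iterable[str]) -> List[List[int]]:
        rows = []
        for value in values:
            row = [0] * len(self.index)
            # Assumption: a category unseen during fit becomes an all-zero row.
            if value in self.index:
                row[self.index[value]] = 1
            rows.append(row)
        return rows


train = ["a", "b", "c", "a"]
test = ["a", "c"]  # category "b" never occurs in the test data

model = OneHotEncoderSketch().fit(train)
encoded_test = model.transform(test)  # width 3, as learned from train
```

A stateless transformer that inferred the categories from `test` alone would produce vectors of width 2 here, which would not line up with the features seen during training.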

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12874) ML StringIndexer does not protect itself from column name duplication

2016-02-02 Thread Wojciech Jurczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127934#comment-15127934
 ] 

Wojciech Jurczyk commented on SPARK-12874:
--

Thank you for the feedback and your willingness to help, [~holdenk]. I'll 
prepare a PR with a fix in a few days.

> ML StringIndexer does not protect itself from column name duplication
> -
>
> Key: SPARK-12874
> URL: https://issues.apache.org/jira/browse/SPARK-12874
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Wojciech Jurczyk
>
> StringIndexerModel, when performing transform(), does not check the schema of 
> the input DataFrame. Because of that, it is possible to create a DataFrame 
> containing columns with duplicated names.
> This issue is similar to SPARK-12711. StringIndexer could make use of 
> transformSchema to ensure that the input DataFrame schema is valid with 
> respect to the parameter values.
> Please confirm. Then, I'll prepare a PR to resolve the bug.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147






[jira] [Created] (SPARK-13030) Change OneHotEncoder to Estimator

2016-01-26 Thread Wojciech Jurczyk (JIRA)
Wojciech Jurczyk created SPARK-13030:


 Summary: Change OneHotEncoder to Estimator
 Key: SPARK-13030
 URL: https://issues.apache.org/jira/browse/SPARK-13030
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.6.0
Reporter: Wojciech Jurczyk


OneHotEncoder should be an Estimator, just like in scikit-learn 
(http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
In its current form, it is impossible to use when the number of categories 
differs between the training dataset and the test dataset.






[jira] [Created] (SPARK-12877) TrainValidationSplit is missing in pyspark.ml.tuning

2016-01-18 Thread Wojciech Jurczyk (JIRA)
Wojciech Jurczyk created SPARK-12877:


 Summary: TrainValidationSplit is missing in pyspark.ml.tuning
 Key: SPARK-12877
 URL: https://issues.apache.org/jira/browse/SPARK-12877
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Wojciech Jurczyk


I was investigating the progress of SPARK-10759 and I noticed that there is no 
TrainValidationSplit class in the pyspark.ml.tuning module.
The Java/Scala examples in SPARK-10759 use 
org.apache.spark.ml.tuning.TrainValidationSplit, which is not available from 
Python, and this blocks SPARK-10759.

Does the class have a different name in PySpark, maybe? Also, I couldn't find 
any JIRA task saying it needs to be implemented. Is it by design that the 
TrainValidationSplit estimator is not ported to PySpark? If not, that is, if 
the estimator needs porting, then I would like to contribute.






[jira] [Created] (SPARK-12874) ML StringIndexer does not protect itself from column name duplication

2016-01-17 Thread Wojciech Jurczyk (JIRA)
Wojciech Jurczyk created SPARK-12874:


 Summary: ML StringIndexer does not protect itself from column name 
duplication
 Key: SPARK-12874
 URL: https://issues.apache.org/jira/browse/SPARK-12874
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.6.0, 1.5.2
Reporter: Wojciech Jurczyk


StringIndexerModel, when performing transform(), does not check the schema of 
the input DataFrame. Because of that, it is possible to create a DataFrame 
containing columns with duplicated names.

This issue is similar to SPARK-12711. StringIndexer could make use of 
transformSchema to ensure that the input DataFrame schema is valid with 
respect to the parameter values.

Please confirm. Then, I'll prepare a PR to resolve the bug.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147
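
The kind of schema check described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark implementation: the function name transform_schema mirrors the Spark method name, but the signature (a list of field names plus two column names) and the error messages are hypothetical.

```python
from typing import List


def transform_schema(field_names: List[str], input_col: str, output_col: str) -> List[str]:
    """Hypothetical schema check; returns the field names of the output schema."""
    if input_col not in field_names:
        raise ValueError(f"Input column {input_col} does not exist.")
    # Reject the call up front instead of producing two columns with one name.
    if output_col in field_names:
        raise ValueError(f"Output column {output_col} already exists.")
    return field_names + [output_col]


valid_schema = transform_schema(["id", "label"], "label", "labelIndex")

try:
    transform_schema(["id", "label"], "label", "id")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Running the check before any data is touched means a misconfigured output column fails fast rather than yielding a DataFrame with ambiguous column names.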






[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2016-01-12 Thread Wojciech Jurczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093932#comment-15093932
 ] 

Wojciech Jurczyk commented on SPARK-7146:
-

{quote}Cons:
Users have to be careful since parameters can have different meanings for 
different algorithms.{quote}

I think users have to be careful even if the traits stay private. I mean, 
getters/setters and the parameters themselves are visible anyway (users have to 
set the parameters somehow).

Consider a parameter called threshold. Obviously, it can have multiple meanings 
depending on the context. Currently, threshold's meaning is hardcoded to 
binary classification and it can't be used in other cases.
{quote}Sharing the Param traits helps to encourage standardized Param names and 
documentation{quote} but results in more specialized params (which restricts 
their use cases).

On the other hand, inputCol/outputCol are good examples of parameters that are 
fully universal and generic. Having them in one trait would indeed result in 
some kind of standardization.

{quote}If the shared Params are public, then implementations could test for the 
traits.{quote} A side note: this can be done anyway (by structural typing). And 
it's not always a bad thing (knowing that the meaning of the parameters can be 
different).
{quote}It is unclear if we want users to rely on these traits, which are 
somewhat experimental.{quote}
As I mentioned in SPARK-12751, we want to rely on the traits (for now only 
input/output column, and obviously, only for Transformers that are not 
UnaryTransformers). As far as I know, classes in ML that use sharedParams are 
experimental, too (like LinearRegressionModel). We depend on an experimental 
API anyway.

Maybe the parameters can be divided into groups? Parameters in the first one 
would be fully universal (like inputCol). In the second group, parameters would 
be less universal (but still shared, if used multiple times).
Additionally, I think some parameters should be removed from the shared 
params. Consider the threshold from the shared params once again. It's used 
only in Logistic Regression (if I'm correct). Other operations, like Binarizer, 
define their own threshold param.

Finally, I would vote for option (b). Overriding docs will do. And then, 
maybe it'd be possible to split the trait into two, one for internal and one 
for external use, to benefit from having both private and public traits?


> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Discussion: Should the Param traits in sharedParams.scala be public?
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> Proposal: Either
> (a) make the shared params private to encourage users to write specialized 
> documentation and value checks for parameters, or
> (b) design a better way to encourage overriding documentation and parameter 
> value checks






[jira] [Updated] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private

2016-01-11 Thread Wojciech Jurczyk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wojciech Jurczyk updated SPARK-12751:
-
Description: 
Many Estimators and Transformers mix in traits generated by 
[SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala].
These estimators and transformers (like StringIndexer, MinMaxScaler, etc.) are 
publicly accessible, while the traits generated by SharedParamsCodeGen are 
private\[ml\]. From user code it is possible to invoke the methods that the 
traits introduce, but it is illegal to use any trait explicitly. For example, 
you can call setInputCol(str) on StringIndexer but you are not allowed to 
assign StringIndexer to a variable of type HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
As a consequence, it is impossible to create a collection of transformers that 
have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol 
with HasOutputCol\]). We have to use structural typing and reflective calls 
like this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}

This seems easy to fix: exposing a couple of traits should not break anything. 
On the other hand, maybe it goes deeper than that.

  was:
Many Estimators and Transformers mix in traits generated by 
SharedParamsCodeGen. These estimators and transformers (like StringIndexer, 
MinMaxScaler etc) are accessible publicly while traits generated by 
SharedParamsCodeGen are private\[ml\]. From user's code it is possible to 
invoke methods that the traits introduce but it is illegal to use any trait 
explicitly. For example, you can call setInputCol(str) on StringIndexer but you 
are not allowed to assign StringIndexer to a variable of type HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
For example, it is impossible to create a collection of transformers that have 
both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol with 
HasOutputCol\]). We have to use structural typing and reflective calls like 
this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}

This seems easy to fix, exposing a couple of traits should not break anything. 
On the other hand, maybe it goes deeper than that.


> Traits generated by SharedParamsCodeGen should not be private
> -
>
> Key: SPARK-12751
> URL: https://issues.apache.org/jira/browse/SPARK-12751
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Wojciech Jurczyk
>Priority: Minor






[jira] [Created] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private

2016-01-11 Thread Wojciech Jurczyk (JIRA)
Wojciech Jurczyk created SPARK-12751:


 Summary: Traits generated by SharedParamsCodeGen should not be 
private
 Key: SPARK-12751
 URL: https://issues.apache.org/jira/browse/SPARK-12751
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.6.0, 1.5.2
Reporter: Wojciech Jurczyk


Many Estimators and Transformers mix in traits generated by 
SharedParamsCodeGen. These estimators and transformers (like StringIndexer, 
MinMaxScaler, etc.) are publicly accessible, while the traits generated by 
SharedParamsCodeGen are private\[ml\]. From user code it is possible to 
invoke the methods that the traits introduce, but it is illegal to use any 
trait explicitly. For example, you can call setInputCol(str) on StringIndexer 
but you are not allowed to assign StringIndexer to a variable of type 
HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
As a consequence, it is impossible to create a collection of transformers that 
have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol 
with HasOutputCol\]). We have to use structural typing and reflective calls 
like this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}

This seems easy to fix: exposing a couple of traits should not break anything. 
On the other hand, maybe it goes deeper than that.
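
As a rough, language-neutral analogy (plain Python, not Scala or Spark), the sketch below contrasts the two styles discussed above. The abstract base class HasOutputCol stands in for a publicly visible shared trait, and the getattr-based helper stands in for the structural-typing workaround with reflective calls. All class and function names here are hypothetical.

```python
from abc import ABC, abstractmethod


class HasOutputCol(ABC):
    """Stand-in for a public shared trait that user code may reference."""

    @abstractmethod
    def get_output_col(self) -> str: ...


class StringIndexerSketch(HasOutputCol):
    """Hypothetical stage that exposes the shared type."""

    def __init__(self, output_col: str) -> None:
        self._output_col = output_col

    def get_output_col(self) -> str:
        return self._output_col


class OpaqueStage:
    """Hypothetical stage that has the method but not the shared type."""

    def __init__(self, output_col: str) -> None:
        self._output_col = output_col

    def get_output_col(self) -> str:
        return self._output_col


def output_col_nominal(stage: HasOutputCol) -> str:
    # With a public shared type, the requirement is stated in the signature.
    return stage.get_output_col()


def output_col_reflective(stage: object) -> str:
    # Without it, the member has to be looked up reflectively at run time.
    getter = getattr(stage, "get_output_col", None)
    if not callable(getter):
        raise TypeError("stage has no get_output_col method")
    return getter()


nominal = output_col_nominal(StringIndexerSketch("labelIndex"))
reflective = output_col_reflective(OpaqueStage("features"))
```

The reflective version works, but mistakes surface only when the call is made, which is the cost the issue attributes to keeping the shared traits private.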






[jira] [Updated] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private

2016-01-11 Thread Wojciech Jurczyk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wojciech Jurczyk updated SPARK-12751:
-
Priority: Minor  (was: Major)

> Traits generated by SharedParamsCodeGen should not be private
> -
>
> Key: SPARK-12751
> URL: https://issues.apache.org/jira/browse/SPARK-12751
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Wojciech Jurczyk
>Priority: Minor






[jira] [Commented] (SPARK-12711) ML StopWordsRemover does not protect itself from column name duplication

2016-01-09 Thread Wojciech Jurczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090535#comment-15090535
 ] 

Wojciech Jurczyk commented on SPARK-12711:
--

[~josephkb] Is there any particular reason why StopWordsRemover is not a 
UnaryTransformer? As the docs say, the UnaryTransformer is an "Abstract class 
for transformers that take one input column, apply transformation, and output 
the result as a new column", which is the case here. Moreover, the 
UnaryTransformer implementation checks whether the output column already 
exists. So making StopWordsRemover a UnaryTransformer would solve the issue. 
Speaking of UnaryTransformer candidates, I think StringIndexer is a similar 
case (and there are probably other Transformers that could be 
UnaryTransformers). It doesn't check whether the output column exists in the 
input DataFrame (it has the same flaw). Making StringIndexer a 
UnaryTransformer would solve the flaw, too. What do you think?
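
The idea of inheriting the output-column check from a common base class can be sketched as follows. This is a plain-Python illustration, not the Spark UnaryTransformer: the class names, the list-of-dicts row representation, and the create_transform_func hook are assumptions chosen for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class UnaryTransformerSketch:
    """Hypothetical base class: one input column, one new output column."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col = input_col
        self.output_col = output_col

    def create_transform_func(self) -> Callable[[object], object]:
        raise NotImplementedError

    def transform_schema(self, field_names: List[str]) -> List[str]:
        # The check lives here once, so every subclass gets it for free.
        if self.output_col in field_names:
            raise ValueError(f"Output column {self.output_col} already exists.")
        return field_names + [self.output_col]

    def transform(self, rows: List[Row]) -> List[Row]:
        if rows:
            # Assumption: the schema is taken from the keys of the first row.
            self.transform_schema(list(rows[0].keys()))
        func = self.create_transform_func()
        return [{**row, self.output_col: func(row[self.input_col])} for row in rows]


class StopWordsRemoverSketch(UnaryTransformerSketch):
    """Hypothetical subclass: supplies only the per-value function."""

    def __init__(self, input_col: str, output_col: str, stop_words: List[str]) -> None:
        super().__init__(input_col, output_col)
        self.stop_words = stop_words

    def create_transform_func(self) -> Callable[[object], object]:
        stop = {word.lower() for word in self.stop_words}
        return lambda tokens: [t for t in tokens if t.lower() not in stop]


remover = StopWordsRemoverSketch("raw", "filtered", ["the", "a"])
result = remover.transform([{"raw": ["The", "quick", "fox"]}])

try:
    StopWordsRemoverSketch("raw", "raw", ["the"]).transform([{"raw": ["the"]}])
    clash_rejected = False
except ValueError:
    clash_rejected = True
```

The subclass contains no validation code of its own, which is the appeal of the approach: a transformer cannot forget the check if it never has to write it.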

> ML StopWordsRemover does not protect itself from column name duplication
> 
>
> Key: SPARK-12711
> URL: https://issues.apache.org/jira/browse/SPARK-12711
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>  Labels: ml, mllib, newbie, suggestion
>
> At work we were 'taking a closer look' at ML transformers and I 
> spotted that anomaly.
> At first glance, the resolution looks simple:
> Add the following line to StopWordsRemover.transformSchema (as is done in 
> e.g. PCA.transformSchema, StandardScaler.transformSchema, 
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)),
>   s"Output column ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If yes, I am willing to prepare an 
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in 
> StopWordsRemover (and possibly in all other places)?
> Links to files at github, mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76






[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-12-16 Thread Wojciech Jurczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060210#comment-15060210
 ] 

Wojciech Jurczyk commented on SPARK-11478:
--

Any progress on this, [~yanboliang]? I faced the same issue and I'm wondering 
if you're still working on this.

> ML StringIndexer return inconsistent schema
> ---
>
> Key: SPARK-11478
> URL: https://issues.apache.org/jira/browse/SPARK-11478
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> ML StringIndexer transform and transformSchema return inconsistent schema.
> {code}
> import org.apache.spark.ml.feature.StringIndexer
>
> // sc and sqlContext are the instances provided by spark-shell.
> val data = sc.parallelize(
>   Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2)
> val df = sqlContext.createDataFrame(data).toDF("id", "label")
> val indexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndex")
>   .fit(df)
> val transformed = indexer.transform(df)
> println(transformed.schema.toString())
> println(indexer.transformSchema(df.schema))
> // The nullable flag of "labelIndex" differs between the two outputs:
> // StructType(StructField(id,IntegerType,false),
> //   StructField(label,StringType,true), StructField(labelIndex,DoubleType,true))
> // StructType(StructField(id,IntegerType,false),
> //   StructField(label,StringType,true), StructField(labelIndex,DoubleType,false))
> {code}


