[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532324#comment-16532324 ]

Liang-Chi Hsieh commented on SPARK-24467:

An approach similar to the one used for the one-hot encoder sounds good to me.

> VectorAssemblerEstimator
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.4.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since
> I thought the latter option would break most workflows. However, I should
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the
> inputCols. This Param can be optional. If not given, then VectorAssembler
> will behave as it does now. If given, then VectorAssembler can use that info
> instead of figuring out the Vector sizes via metadata or examining Rows in
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and
> produces a VectorAssembler with the vector lengths Param specified.
>
> This will not break existing workflows. Migrating to
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it
> will not require users to manually input Vector lengths.
>
> Note: Even with this Estimator, VectorSizeHint might prove useful for other
> things in the future which require vector length metadata, so we could
> consider keeping it rather than deprecating it.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
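Editorial sketch: the two bullets in the quoted issue description (an optional vector-sizes Param on the assembler, plus an estimator that learns those sizes from data) can be illustrated with a small toy in plain Python. This is not Spark code and does not use the Spark API. The class names `ToyVectorAssembler` and `ToyVectorAssemblerEstimator`, the `vector_sizes` argument, and the list-of-dicts "dataset" are all assumptions made for illustration, and taking the sizes from the first row is a simplification of "gets the Vector lengths from data".

```python
from typing import Dict, List, Optional, Sequence

# Toy "dataset": each row maps a column name to a vector of floats.
Row = Dict[str, List[float]]


class ToyVectorAssembler:
    """Toy stand-in for an assembler with an optional vector-sizes parameter."""

    def __init__(self, input_cols: Sequence[str],
                 vector_sizes: Optional[Dict[str, int]] = None):
        self.input_cols = list(input_cols)
        # Hypothetical parameter: None means "not given", so no size checks.
        self.vector_sizes = vector_sizes

    def transform(self, rows: List[Row]) -> List[List[float]]:
        assembled_rows = []
        for row in rows:
            assembled: List[float] = []
            for col in self.input_cols:
                vec = row[col]
                # Consistency check, only when sizes were supplied.
                if self.vector_sizes is not None:
                    expected = self.vector_sizes[col]
                    if len(vec) != expected:
                        raise ValueError(
                            f"column {col!r}: expected size {expected}, "
                            f"got {len(vec)}")
                assembled.extend(vec)
            assembled_rows.append(assembled)
        return assembled_rows


class ToyVectorAssemblerEstimator:
    """Toy estimator: learns vector sizes from data, returns a configured assembler."""

    def __init__(self, input_cols: Sequence[str]):
        self.input_cols = list(input_cols)

    def fit(self, rows: List[Row]) -> ToyVectorAssembler:
        if not rows:
            raise ValueError("cannot infer vector sizes from an empty dataset")
        # Simplification: read the sizes off the first row only.
        sizes = {col: len(rows[0][col]) for col in self.input_cols}
        return ToyVectorAssembler(self.input_cols, vector_sizes=sizes)
```

In this toy, an assembler built without `vector_sizes` concatenates whatever it is given, while one returned by `fit` rejects rows whose vectors do not match the learned sizes, so the user never has to enter the lengths by hand.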
[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516861#comment-16516861 ]

Nick Pentreath commented on SPARK-24467:

One option is to do the same as we did for the one-hot encoder: we could create a new Estimator/Model pair and deprecate the old one for 2.4.0. Then, for 3.0, we could remove the old one.
[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511852#comment-16511852 ]

Joseph K. Bradley commented on SPARK-24467:

True, we would have to make the VectorAssembler inherit from Model. Since VectorAssembler has public constructors and is not a final class, that would technically be a breaking change. This might not work :( ...until Spark 3.0. That seems like a benign enough breaking change to put in a major Spark release.
[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507152#comment-16507152 ]

Apache Spark commented on SPARK-24467:

User 'tengpeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/21522
[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334 ]

Nick Pentreath commented on SPARK-24467:

Yeah, the estimator would return a {{Model}} from {{fit}}, right? So I don't think a new estimator could return the existing {{VectorAssembler}}; it would probably need to return a new {{VectorAssemblerModel}}.
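Editorial sketch: the typing constraint raised in this comment (an estimator's `fit` returns a `Model`, and the existing assembler is a plain `Transformer`) can be shown with a toy class hierarchy in plain Python. This is not the Spark API. `Transformer`, `Model`, and `Estimator` here are minimal stand-ins, `PlainAssembler`, `AssemblerModel`, and `AssemblerEstimator` are hypothetical names, and each row is assumed to be a list of vectors.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

Rows = List[List[List[float]]]  # dataset -> rows -> vectors -> floats


class Transformer(ABC):
    @abstractmethod
    def transform(self, rows: Rows) -> List[List[float]]:
        ...


class Model(Transformer):
    """A Transformer that was produced by fitting an Estimator."""


M = TypeVar("M", bound=Model)


class Estimator(ABC, Generic[M]):
    @abstractmethod
    def fit(self, rows: Rows) -> M:
        ...


def _concat(row: List[List[float]]) -> List[float]:
    # Flatten one row's vectors into a single vector.
    return [x for vec in row for x in vec]


class PlainAssembler(Transformer):
    """Stand-in for the existing assembler: a Transformer, not a Model."""

    def transform(self, rows: Rows) -> List[List[float]]:
        return [_concat(row) for row in rows]


class AssemblerModel(Model):
    """Stand-in for a new model type that carries the learned sizes."""

    def __init__(self, sizes: List[int]):
        self.sizes = sizes

    def transform(self, rows: Rows) -> List[List[float]]:
        for row in rows:
            if [len(vec) for vec in row] != self.sizes:
                raise ValueError("vector sizes differ from those seen in fit")
        return [_concat(row) for row in rows]


class AssemblerEstimator(Estimator[AssemblerModel]):
    def fit(self, rows: Rows) -> AssemblerModel:
        # Simplification: learn the sizes from the first row only.
        return AssemblerModel([len(vec) for vec in rows[0]])
```

Because `fit` is bounded to `Model`, returning `PlainAssembler` from it would fail a static type check in this toy, which mirrors the choice discussed in the thread: either introduce a new model class or change what the existing assembler inherits from.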
[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503002#comment-16503002 ]

Liang-Chi Hsieh commented on SPARK-24467:

[~josephkb] Does that mean {{VectorAssembler}} will change from a {{Transformer}} to a {{Model}}?