[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302738#comment-16302738 ]

Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

One more thought: Looking at existing PRs and docs for inputCols & outputCols, I'm worried it may be unclear to users how to use multi-column APIs. E.g., if OneHotEncoderEstimator (or any of the others) has docs talking about transforming a Numeric column to a Vector column, then users may be confused about whether each inputCol is treated independently, whether all are concatenated in the output, or something else.

I'm commenting on the OHE PR but thought this was relevant to all of these PRs.

> Add single- and multi-value support to ML Transformers
> ------------------------------------------------------
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
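To make the ambiguity in the comment above concrete, here is a minimal plain-Python sketch of the two readings a user might have of a multi-column one-hot encoder. It does not use Spark, and every name in it (`one_hot`, `encode_independently`, `encode_concatenated`, the `sizes` mapping) is hypothetical and chosen only for illustration. It is also simplified: it assumes category indices are already known and ignores options such as dropping the last category.

```python
def one_hot(index, size):
    """Return a dense one-hot list of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_independently(row, input_cols, output_cols, sizes):
    """Reading A: each input column gets its own output vector column."""
    out = dict(row)
    for src, dst in zip(input_cols, output_cols):
        out[dst] = one_hot(row[src], sizes[src])
    return out


def encode_concatenated(row, input_cols, output_col, sizes):
    """Reading B: all encoded inputs are concatenated into one vector column."""
    out = dict(row)
    out[output_col] = [
        x for src in input_cols for x in one_hot(row[src], sizes[src])
    ]
    return out


# Hypothetical row of category indices and the number of categories per column.
row = {"color": 1, "shape": 0}
sizes = {"color": 3, "shape": 2}

independent = encode_independently(
    row, ["color", "shape"], ["color_vec", "shape_vec"], sizes
)
concatenated = encode_concatenated(row, ["color", "shape"], "features", sizes)

print(independent)
print(concatenated)
```

Documentation for a multi-column API would ideally state which of these two shapes the output takes, since both are plausible from a phrase like "transforms a Numeric column to a Vector column".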
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292320#comment-16292320 ]

Nick Pentreath commented on SPARK-8418:
---------------------------------------

Created SPARK-22796, SPARK-22797 and SPARK-22798 to track PySpark support for {{QuantileDiscretizer}}, {{Bucketizer}} and {{StringIndexer}}, respectively.

The in-progress PR for {{QuantileDiscretizer}} has been changed to throw an exception, as per the discussion above. I created SPARK-22799 to track that.
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282796#comment-16282796 ]

Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

Agreed; thanks!
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275723#comment-16275723 ]

yuhao yang commented on SPARK-8418:
-----------------------------------

I second Nick's comments.
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275426#comment-16275426 ]

Nick Pentreath commented on SPARK-8418:
---------------------------------------

*1.* I’m OK with throwing an exception. We can update the previous and in-progress PRs accordingly.

*2.* Where we are modifying an existing API, we obviously need to keep both. But I prefer only inputCols for new components. We can provide a convenience method to set a single (or a few) input columns. I did that for FeatureHasher, like {{setInputCol(col: String, others: String*)}}, but the param that gets set is inputCols under the hood. Java must still use setInputCols, as I think the above only works for Scala. We can also deprecate the single-column variants for 3.0 if we like?

*3.* Yes, we must thoroughly test this before the 2.3 release. I think it should be fine, as it’s just adding a few new parameters, which is nothing out of the ordinary.

*4.* I will create JIRAs for the Python APIs. Ideally we’d like them for 2.3. Fortunately, it should be pretty trivial to complete.
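The convenience setter described in point 2 above can be sketched in plain Python as follows. This is only an illustration of the idea (a single-column-looking setter that writes to the multi-column param), not the FeatureHasher implementation; the class name `ColumnParamsSketch` and its attributes are hypothetical.

```python
class ColumnParamsSketch:
    """Hypothetical component whose only stored param is a list of input columns."""

    def __init__(self):
        self.input_cols = []

    def set_input_cols(self, cols):
        """Canonical setter: store a copy of the given column names."""
        self.input_cols = list(cols)
        return self

    def set_input_col(self, col, *others):
        """Convenience setter: one required column plus optional extras.

        It delegates to set_input_cols, so no separate single-column
        param exists under the hood.
        """
        return self.set_input_cols([col, *others])


single = ColumnParamsSketch().set_input_col("age")
several = ColumnParamsSketch().set_input_col("age", "height", "weight")

print(single.input_cols)
print(several.input_cols)
```

Because both setters write the same field, there is no way for a caller to end up with conflicting single- and multi-column settings, which is the appeal of offering only inputCols on new components.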
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275049#comment-16275049 ]

Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

I just glanced through the various PRs adding multi-column support and wanted to get consensus about a few items to make sure we have consistent APIs. CC [~mlnick], [~yuhaoyan], [~yanboliang], [~WeichenXu123], [~huaxing], [~viirya] Let me know what you think!

*1. When both inputCol and inputCols are specified, what should we do?*
* [SPARK-20542]: Bucketizer: logWarning
* [SPARK-13030]: OneHotEncoder: n/a (no single-column support)
* [SPARK-11215]: StringIndexer: throw exception
* [SPARK-22397]: QuantileDiscretizer: logWarning
* my vote: throw exception (safer since it's easier for users to recognize their error)

*2. Should we have single- and multi-column support or just multi-column? E.g., should we have (a) inputCol and inputCols or (b) only inputCols?*
Currently, [SPARK-13030] only has multi-column support for the new OneHotEncoderEstimator. The other PRs have both single- and multi-column support since they are modifying existing APIs.
*Q*: Should we add single-column support to OneHotEncoderEstimator for consistency or not bother? I'm ambivalent.

*3. Backwards compatibility for ML persistence*
We'll have to be aware of whether we're breaking compatibility. I don't see problems in most PRs but have not tested it manually. The only PR with an issue is [SPARK-13030] for OneHotEncoder; however, it's pretty reasonable to break persistence compatibility there.

*4. Python APIs*
I don't see follow-ups for Python APIs yet. Are those planned for 2.3?
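As an illustration of the "throw exception" option voted for in item 1 above, here is a minimal plain-Python sketch of a check that rejects conflicting settings. The function name `resolve_input_columns`, its arguments, and the error messages are assumptions made for this example; they are not taken from any of the PRs listed.

```python
def resolve_input_columns(input_col=None, input_cols=None):
    """Return the list of input columns, failing loudly on conflicting settings.

    Hypothetical helper: `input_col` models the single-column param and
    `input_cols` models the multi-column param.
    """
    if input_col is not None and input_cols is not None:
        # Raising here, instead of logging a warning and picking one,
        # makes the user's mistake visible immediately.
        raise ValueError(
            "Both inputCol and inputCols are set; please set only one of them."
        )
    if input_col is not None:
        return [input_col]
    if input_cols:
        return list(input_cols)
    raise ValueError("Neither inputCol nor inputCols is set.")


print(resolve_input_columns(input_col="a"))
print(resolve_input_columns(input_cols=["a", "b"]))

try:
    resolve_input_columns(input_col="a", input_cols=["b", "c"])
except ValueError as err:
    print("rejected:", err)
```

With the logWarning alternative, the same call would silently proceed with one of the two settings, and the user would only notice if they happened to read the logs.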
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224454#comment-16224454 ]

Nick Pentreath commented on SPARK-8418:
---------------------------------------

Adding SPARK-13030, since the new version of {{OneHotEncoder}} will also support transforming multiple columns.
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965268#comment-14965268 ]

Yanbo Liang commented on SPARK-8418:
------------------------------------

[~josephkb] I have implemented multi-column support for StringIndexer in SPARK-11215. Could you please review and comment on the PR?
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961739#comment-14961739 ]

Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

{quote}I vote for strategy 2 that Nick Buroojy proposed. But I don't think we need to reimplement all transformers with a multi-value implementation, because some feature transformers don't need it.{quote}
* This sounds like a good way to start. I'd prefer just doing strategy 2 (not 1) since it's a bit deceptive to provide the multi-value API if it is not optimized underneath. +1 for only adding support where needed. Starting with StringIndexer and OneHotEncoder sounds good to me.

{quote}I don't think RFormula is the best way to resolve this issue, because it still chains single-column transformers in a pipeline and applies them one by one to encode multiple columns, which performs poorly.{quote}
* That's currently true, but it could be optimized. Ideally, it would call these multi-value implementations when available, and would convert to a single Vector as early as possible in the transformations to be efficient.
* I guess RFormula is really a separate discussion, so I won't discuss it further here.

@yanboliang I'm fine if we skip a design doc for this task. It seems pretty straightforward given the discussion above.
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959978#comment-14959978 ]

Yanbo Liang commented on SPARK-8418:
------------------------------------

[~josephkb] I don't think RFormula is the best way to resolve this issue, because it still chains single-column transformers in a pipeline and applies them one by one to encode multiple columns, which performs poorly.

I vote for strategy 2 that [~nburoojy] proposed. But I don't think we need to reimplement all transformers with a multi-value implementation, because some feature transformers don't need it. I will start with OneHotEncoder, which is the most commonly used.
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903350#comment-14903350 ]

Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

New idea: We could allow transformers to leverage RFormula. That might be the nicest way to specify a bunch of columns and leverage existing code for assembling them.
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744508#comment-14744508 ]

Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

Apologies for being AWOL! I'd definitely appreciate help with designing this improvement.

For the API (Vector vs. Map): I prefer sticking with a Vector API. I see the appeal of keeping columns separate, but DataFrames are not yet meant to handle too many columns (hundreds at most, I'd say). We can still keep feature names and metadata using ML attributes (which describe each feature in Vector columns in DataFrames).

For sharing code, we should definitely do option 2. For backwards compatibility, we should not modify current Params, but we could add a new one for multiple inputs (and check for conflicting settings when running). I would hope we could share code in this multi-value transformation so that each transformer only needs to specify how to transform a single value. I hope we can do this, rather than implementing option 1 as the default.

Would you mind sketching up a quick design doc? That should help clarify the different options and help us choose a simple but flexible API. If you'd like to follow existing examples, here are some you could look at:
* Classification threshold (shorter doc): [https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing]
* R-like stats for model (long doc): [https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]

The items we've discussed can be sketched out in the doc. After you link it from this JIRA, others can give you feedback on this JIRA (better than on the doc, since some people have trouble viewing Google docs).

Thanks very much!
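The code-sharing goal in the comment above (each transformer only specifies how to transform a single value, while shared code handles the multi-column plumbing) might look roughly like the following plain-Python sketch. Rows are modeled as dicts rather than DataFrames, and `transform_columns` and `binarize` are hypothetical names introduced only for this example.

```python
def transform_columns(rows, input_cols, output_cols, transform_value):
    """Shared multi-column driver.

    Applies `transform_value` to every listed input column of every row and
    writes each result to the matching output column. A concrete transformer
    only has to supply `transform_value`.
    """
    if len(input_cols) != len(output_cols):
        raise ValueError("input_cols and output_cols must have the same length")
    result = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for src, dst in zip(input_cols, output_cols):
            new_row[dst] = transform_value(row[src])
        result.append(new_row)
    return result


def binarize(threshold):
    """Single-value logic for a hypothetical binarizer."""
    return lambda value: 1.0 if value > threshold else 0.0


rows = [{"a": 0.2, "b": 0.9}, {"a": 0.7, "b": 0.1}]
out = transform_columns(rows, ["a", "b"], ["a_bin", "b_bin"], binarize(0.5))
print(out)
```

The single-column case is then just a call with one-element lists, which matches the "option 2" direction of making the single-value interface a thin wrapper over the multi-value code.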
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632061#comment-14632061 ]

Nick Buroojy commented on SPARK-8418:
-------------------------------------

I like this idea a lot, and think it would solve one of our main performance issues with the ML API.

Our data set has hundreds of string features that we need to convert into binary vectors. We have found the latency overhead of processing the features one at a time with a StringVectorizer (SPARK-7290) to be unbearable. We wrote a custom Estimator to vectorize all string columns with only a couple of passes over the data set and found significant performance gains.

I suspect that we aren't the only users with many columns, so we would love to fix this issue upstream with some sort of multi-column interface to transformers and estimators. I suppose we could make do with the Vector or Array interface using the VectorAssembler as described in this ticket; however, I think the cleanest interface for us would be a Map from source column to dest column.

As far as sharing code, there are at least two strategies:
1) Use the single-value implementation as it is today, and add a multi-value view on top of it. For example, StringVectorizer.setInputCols(Array[A, B]) would return a pipeline of [StringVectorizer.setInputCol(A), StringVectorizer.setInputCol(B)]
2) Reimplement each transformer to support a multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. For example, StringVectorizer.setInputCol(A) would invoke StringVectorizer.setInputCols(Array[A])

The obvious downside of 1 is that it wouldn't address the performance issues we ran into with hundreds of columns. The upsides are minimal implementation effort and simpler code to maintain.

The main downside of 2 is more upfront effort to implement multi-value transformations, but the upside is reasonable performance with wide data sets.

I don't think 1 and 2 are mutually exclusive. Maybe the multi-value interface could be solidified first with the strategy 1 implementation, then over time the key transformers, like StringVectorizer, could be rewritten to use strategy 2?

You mentioned that this would require a short design doc. Can I help with that?
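The performance difference between the two strategies above comes down to how many passes over the data a fit requires. The following plain-Python sketch models that with a scan counter; it is not Spark code, and `CountingDataset`, `fit_vocab_single`, `fit_vocab_chained`, and `fit_vocab_multi` are hypothetical names used only to illustrate the argument.

```python
class CountingDataset:
    """Wraps a list of dict rows and counts how many full scans are made."""

    def __init__(self, rows):
        self._rows = rows
        self.scans = 0

    def scan(self):
        self.scans += 1
        return iter(self._rows)


def fit_vocab_single(data, col):
    """Single-column fit: one full scan to collect the column's distinct values."""
    return sorted({row[col] for row in data.scan()})


def fit_vocab_chained(data, cols):
    """Strategy 1: a multi-column view built by chaining single-column fits."""
    return {col: fit_vocab_single(data, col) for col in cols}


def fit_vocab_multi(data, cols):
    """Strategy 2: a native multi-column fit that scans the data once."""
    seen = {col: set() for col in cols}
    for row in data.scan():
        for col in cols:
            seen[col].add(row[col])
    return {col: sorted(values) for col, values in seen.items()}


rows = [
    {"city": "Oslo", "os": "linux", "plan": "free"},
    {"city": "Lima", "os": "mac", "plan": "pro"},
    {"city": "Oslo", "os": "linux", "plan": "pro"},
]
cols = ["city", "os", "plan"]

chained_data = CountingDataset(rows)
multi_data = CountingDataset(rows)

chained = fit_vocab_chained(chained_data, cols)
multi = fit_vocab_multi(multi_data, cols)

print("same result:", chained == multi)
print("scans with strategy 1:", chained_data.scans)
print("scans with strategy 2:", multi_data.scans)
```

In this model the chained version scans the data once per column while the multi-column version scans it once in total, so the gap grows linearly with the number of columns. Real Spark jobs have additional costs (scheduling, shuffles, caching) that this sketch does not capture.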