[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-23 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302738#comment-16302738
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

One more thought: Looking at existing PRs and docs for inputCols & outputCols, 
I'm worried it may be unclear to users how to use multi-column APIs.  E.g., if 
OneHotEncoderEstimator (or any of the others) has docs talking about 
transforming a Numeric column to a Vector column, then users may be confused 
about whether each inputCol is treated independently, whether all inputCols are 
concatenated in the output, or something else.  I'm commenting on the OHE PR 
but thought this was relevant to all of these PRs.
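
To make the ambiguity concrete, here is a plain-Python sketch (not Spark code) of the two readings a user might have of a multi-column encoder. The function names, the dict-based rows and the `sizes` map are all hypothetical, and the one-hot encoding is simplified (no dropped last category); it only illustrates how different the outputs are.

```python
from typing import Dict, List


def one_hot(index: int, size: int) -> List[float]:
    """Return a dense one-hot list with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_independently(
    row: Dict[str, int],
    input_cols: List[str],
    output_cols: List[str],
    sizes: Dict[str, int],
) -> Dict[str, object]:
    """Reading A: inputCols[i] is encoded on its own into outputCols[i]."""
    if len(input_cols) != len(output_cols):
        raise ValueError("inputCols and outputCols must have the same length")
    out: Dict[str, object] = dict(row)
    for in_col, out_col in zip(input_cols, output_cols):
        out[out_col] = one_hot(row[in_col], sizes[in_col])
    return out


def encode_concatenated(
    row: Dict[str, int],
    input_cols: List[str],
    output_col: str,
    sizes: Dict[str, int],
) -> Dict[str, object]:
    """Reading B: all inputCols are encoded and concatenated into one output."""
    out: Dict[str, object] = dict(row)
    merged: List[float] = []
    for in_col in input_cols:
        merged.extend(one_hot(row[in_col], sizes[in_col]))
    out[output_col] = merged
    return out
```

With the same input row, reading A produces one vector per input column while reading B produces a single longer vector, which is why the docs should say explicitly which behavior applies.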

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-15 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292320#comment-16292320
 ] 

Nick Pentreath commented on SPARK-8418:
---

Created SPARK-22796, SPARK-22797 and SPARK-22798 to track PySpark support for 
{{QuantileDiscretizer}}, {{Bucketizer}} and {{StringIndexer}}, respectively.

The in-progress PR for QD has been changed to throw an exception, as per the 
discussion above. I created SPARK-22799 to track that.


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-07 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282796#comment-16282796
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

Agreed; thanks!


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-02 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275723#comment-16275723
 ] 

yuhao yang commented on SPARK-8418:
---

I second Nick's comments.


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-01 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275426#comment-16275426
 ] 

Nick Pentreath commented on SPARK-8418:
---

*1 I’m OK with throwing an exception. We can update the previous and
in-progress PRs accordingly.

*2 Where we are modifying an existing API, obviously we need to keep both.

But I prefer only inputCols for new components. We can provide a convenience
method to set a single (or a few) input columns - I did that for
FeatureHasher.

Like setInputCol(col: String, others: String*). But the param that gets set
is inputCols under the hood.

Java still must use setInputCols as the above only works for Scala I think.
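
As a rough illustration of that convenience setter, here is a plain-Python sketch. It is not FeatureHasher's actual code: the class name `FeatureTransformerParams` and its methods are hypothetical stand-ins, and Python's `*others` plays the role of the Scala varargs in `setInputCol(col: String, others: String*)`. The only stored parameter is the list of input columns.

```python
from typing import List


class FeatureTransformerParams:
    """Hypothetical parameter holder that stores only inputCols."""

    def __init__(self) -> None:
        self._input_cols: List[str] = []

    def set_input_cols(self, cols: List[str]) -> "FeatureTransformerParams":
        """Primary setter: takes the full list of input column names."""
        if not cols:
            raise ValueError("inputCols must contain at least one column")
        self._input_cols = list(cols)
        return self

    def set_input_col(self, col: str, *others: str) -> "FeatureTransformerParams":
        """Convenience setter: one required column plus optional extras.

        It only forwards to set_input_cols, so no separate inputCol
        parameter exists.
        """
        return self.set_input_cols([col, *others])

    def get_input_cols(self) -> List[str]:
        return list(self._input_cols)
```

Calling `set_input_col("a")` and `set_input_cols(["a"])` leave the object in the same state, so there is no second parameter that could conflict with inputCols.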

We can also deprecate the single column variants for 3.0 if we like?

*3 Yes, we must thoroughly test this before the 2.3 release. I think it should
be fine, as it’s just adding a few new parameters, which is nothing out of
the ordinary.

*4 I will create JIRAs for Python APIs - ideally we’d like them for 2.3.
Fortunately it should be pretty trivial to complete.

[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-01 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275049#comment-16275049
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

I just glanced through the various PRs adding multi-column support and wanted 
to get consensus about a few items to make sure we have consistent APIs.  CC 
[~mlnick], [~yuhaoyan], [~yanboliang], [~WeichenXu123], [~huaxing], [~viirya]  
Let me know what you think!

*1. When both inputCol and inputCols are specified, what should we do?*

* [SPARK-20542]: Bucketizer: logWarning
* [SPARK-13030]: OneHotEncoder: n/a (no single-column support)
* [SPARK-11215]: StringIndexer: throw exception
* [SPARK-22397]: QuantileDiscretizer: logWarning
* my vote: throw exception (safer since it's easier for users to recognize 
their error)
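
For reference, the "throw exception" option could look roughly like the plain-Python sketch below. The helper name `resolve_input_cols` and the error messages are hypothetical; this is not code from any of the PRs listed above.

```python
from typing import List, Optional


def resolve_input_cols(
    input_col: Optional[str], input_cols: Optional[List[str]]
) -> List[str]:
    """Return the columns to transform, failing fast on ambiguous settings."""
    if input_col is not None and input_cols is not None:
        # Raise instead of logging a warning so the mistake cannot go unnoticed.
        raise ValueError("Both inputCol and inputCols are set; set only one.")
    if input_col is not None:
        return [input_col]
    if input_cols:
        return list(input_cols)
    raise ValueError("Neither inputCol nor inputCols is set.")
```

With a logWarning instead, the transformer would have to pick one of the two settings silently, and the user might only notice from unexpected output.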

*2. Should we have single- and multi-column support or just multi-column?  
E.g., should we have (a) inputCol and inputCols or (b) only inputCols?*

Currently, [SPARK-13030] only has multi-column support for the new 
OneHotEncoderEstimator.  The other PRs have both single- and multi-column 
support since they are modifying existing APIs.
*Q*: Should we add single-column support to OneHotEncoderEstimator for 
consistency or not bother?  I'm ambivalent.

*3. Backwards compatibility for ML persistence*

We'll have to be aware of whether we're breaking compatibility.  I don't see 
problems in most PRs but have not tested it manually.  The only PR with an 
issue is [SPARK-13030] for OneHotEncoder; however, it's pretty reasonable to 
break compatibility for persistence there.

*4. Python APIs*

I don't see follow-ups for Python APIs yet.  Are those planned for 2.3?


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-10-30 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224454#comment-16224454
 ] 

Nick Pentreath commented on SPARK-8418:
---

Adding SPARK-13030, since the new version of {{OneHotEncoder}} will also 
support transforming multiple columns.


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-10-20 Thread Yanbo Liang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965268#comment-14965268
 ] 

Yanbo Liang commented on SPARK-8418:


[~josephkb] I have implemented multi-column support for StringIndexer in 
SPARK-11215. Could you please review and comment on the PR?


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-10-16 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961739#comment-14961739
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

{quote}I vote for the strategy 2 that Nick Buroojy proposed. But I think we 
don't need to reimplement all transformers to support a multi-value 
implementation, because some feature transformers don't need it.{quote}
* This sounds like a good way to start.  I'd prefer just doing strategy 2 (not 
1) since it's a bit deceptive to provide the multi-value API if it is not 
optimized underneath.  +1 for only adding support where needed.

Starting with StringIndexer and OneHotEncoder sounds good to me.

{quote}I don't think RFormula is the best way to resolve this issue because it 
still uses a pipeline of chained transformers, one by one, to encode multiple 
columns, which performs poorly.{quote}
* That's currently true, but it could be optimized.  Ideally, it would call 
these multi-value implementations when available---and would convert to a 
single Vector as soon as possible in the transformations to be efficient.
* I guess RFormula is really a separate discussion, so I won't discuss it here 
more.

@yanboliang  I'm fine if we skip a design doc for this task.  It seems pretty 
straightforward given the discussion above.


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-10-15 Thread Yanbo Liang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959978#comment-14959978
 ] 

Yanbo Liang commented on SPARK-8418:


[~josephkb] I don't think RFormula is the best way to resolve this issue 
because it still uses a pipeline of chained transformers, one by one, to encode 
multiple columns, which performs poorly.
I vote for the strategy 2 that [~nburoojy] proposed. But I think we don't need 
to reimplement all transformers to support a multi-value implementation, 
because some feature transformers don't need it.
I will start with OneHotEncoder, which is the most commonly used.
 


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-09-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903350#comment-14903350
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

New idea: We could allow transformers to leverage RFormula. That might be the 
nicest way to specify a bunch of columns and leverage existing code for 
assembling them.


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-09-14 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744508#comment-14744508
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

Apologies for being AWOL!  I'd definitely appreciate help with designing this 
improvement.

For API (Vector vs. Map): I prefer sticking with a Vector API.  I see the 
appeal of keeping columns separate, but DataFrames are not yet meant to handle 
too many columns (hundreds at most, I'd say).  We can still keep feature names 
and metadata using ML attributes (which describe each feature in Vector columns 
in DataFrames).

For sharing code, we should definitely do option 2.  For backwards 
compatibility, we should not modify current Params, but we could add a new one 
for multiple inputs (and check for conflicting settings when running).  I would 
hope we could share code in this multi-value transformation so that each 
transformer only needs to specify how to transform a single value.  I hope we 
can do this, rather than implementing option 1 as the default.
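
A minimal plain-Python sketch of that kind of code sharing follows. Everything in it is hypothetical (the `MultiColumnTransformer` base class, the dict-based rows, and the `ThresholdBinarizer` example); it only shows the shape of the idea: the shared base class loops over rows and column pairs, and a concrete transformer supplies nothing but the single-value function.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, float]


class MultiColumnTransformer(ABC):
    """Hypothetical shared base: handles column pairing and row iteration."""

    def __init__(self, input_cols: List[str], output_cols: List[str]) -> None:
        if len(input_cols) != len(output_cols):
            raise ValueError("inputCols and outputCols must have the same length")
        self.input_cols = list(input_cols)
        self.output_cols = list(output_cols)

    @abstractmethod
    def transform_value(self, value: float) -> float:
        """The only piece a concrete transformer has to implement."""

    def transform(self, rows: List[Row]) -> List[Row]:
        """Apply transform_value to every configured column in one pass."""
        result: List[Row] = []
        for row in rows:
            new_row = dict(row)
            for in_col, out_col in zip(self.input_cols, self.output_cols):
                new_row[out_col] = self.transform_value(row[in_col])
            result.append(new_row)
        return result


class ThresholdBinarizer(MultiColumnTransformer):
    """Example transformer: maps values above the threshold to 1.0, else 0.0."""

    def __init__(
        self, input_cols: List[str], output_cols: List[str], threshold: float = 0.0
    ) -> None:
        super().__init__(input_cols, output_cols)
        self.threshold = threshold

    def transform_value(self, value: float) -> float:
        return 1.0 if value > self.threshold else 0.0
```

A single-column call is then just the special case of one-element column lists, so existing single-column Params could delegate to the same code path.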

Would you mind sketching up a quick design doc?  That should help clarify the 
different options and help us choose a simple but flexible API.  If you'd like 
to follow existing examples, here are some you could look at:
* Classification threshold (shorter doc): 
[https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing]
* R-like stats for model (long doc): 
[https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]

These items we've discussed can be sketched out in the doc.

After you link it from this JIRA, others can give you feedback on this JIRA 
(better than on the doc since some people have trouble viewing Google docs).

Thanks very much!


[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-07-17 Thread Nick Buroojy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632061#comment-14632061
 ] 

Nick Buroojy commented on SPARK-8418:
-

I like this idea a lot, and think it would solve one of our main performance 
issues with the ml api.

Our data set has hundreds of string features that we need to convert into 
binary vectors. We have found the latency overhead of processing the features 
one at a time with a StringVectorizer (SPARK-7290) to be unbearable. We wrote a 
custom Estimator to vectorize all string columns with only a couple of passes 
over the data set and found significant performance gains.

I suspect that we aren't the only users with many columns, so we would love to 
fix this issue upstream with some sort of multi-column interface to 
transformers and estimators.

I suppose we could make do with the Vector or Array interface using the 
VectorAssembler as described in this ticket; however, I think the cleanest 
interface for us would be a Map from source column to dest column.

As far as sharing code, there are at least two strategies:
1) Use the single-value implementation as it is today, and add a multi-value 
view on top of it. For example, StringVectorizer.setInputCols(Array(A, B)) 
would return a pipeline of [StringVectorizer.setInputCol(A), 
StringVectorizer.setInputCol(B)]
2) Reimplement each transformer to support a multi-value implementation and 
make the single-value interface a trivial invocation of the multi-value code. 
For example, StringVectorizer.setInputCol(A) would invoke 
StringVectorizer.setInputCols(Array(A))

The obvious downside of 1 is that it wouldn't address the performance issues we 
ran into with hundreds of columns. The upsides are minimal implementation 
effort and simpler code to maintain.

The main downside of 2 is more upfront effort to implement multi-value 
transformations, but the upside is reasonable performance with wide data sets.
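
The difference in passes over the data can be sketched in plain Python as follows. This is a toy model, not StringVectorizer or any Spark code: `CountingDataset` and the `fit_*` functions are hypothetical, and labels are indexed alphabetically purely to keep the example small.

```python
from typing import Dict, Iterator, List, Set

Row = Dict[str, str]
Index = Dict[str, int]


class CountingDataset:
    """Toy data set that counts how many full scans were requested."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.passes = 0

    def scan(self) -> Iterator[Row]:
        self.passes += 1
        return iter(self._rows)


def fit_single_column(data: CountingDataset, col: str) -> Index:
    """Single-value fit: one scan to collect the labels of one column."""
    labels = sorted({row[col] for row in data.scan()})
    return {label: i for i, label in enumerate(labels)}


def fit_strategy_1(data: CountingDataset, cols: List[str]) -> Dict[str, Index]:
    """Strategy 1: a multi-value view that chains single-column fits."""
    return {col: fit_single_column(data, col) for col in cols}


def fit_strategy_2(data: CountingDataset, cols: List[str]) -> Dict[str, Index]:
    """Strategy 2: native multi-value fit that scans the data once."""
    seen: Dict[str, Set[str]] = {col: set() for col in cols}
    for row in data.scan():
        for col in cols:
            seen[col].add(row[col])
    return {
        col: {label: i for i, label in enumerate(sorted(labels))}
        for col, labels in seen.items()
    }


def fit_single_via_multi(data: CountingDataset, col: str) -> Index:
    """Under strategy 2 the single-value API is a trivial wrapper."""
    return fit_strategy_2(data, [col])[col]
```

Both strategies produce the same indices, but in this toy model strategy 1 scans the data once per column while strategy 2 scans it once in total, which is the gap that matters with hundreds of columns.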

I don't think 1 and 2 are mutually exclusive. Maybe the multi-value interface 
could be solidified first with the 1 implementation, then over time the key 
transformers, like StringVectorizer, could be rewritten to 2?

You mentioned that this would require a short design doc. Can I help with that?

