[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-03-06 Thread srowen
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/4460

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-22 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75461553 I went to add AttributeGroup, but then I can't figure out how this isn't already covered by Attribute's dimension? It's 1 for a scalar, >1 for a vector-valued feature.

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-22 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75458264 To me, that means that the idea of a string-valued categorical column still has a place in the representation since it exists at some stage of a pipeline. It's just

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-21 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75366539 @mengxr Understood about being told the number of distinct values for a column by the caller/schema. I thought you were saying this was a difference between string /

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-21 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75384291 @jkbradley The issue with algorithms handling munging and indexing is the increased complexity. For example, if `DecisionTree` takes string columns, there will be some

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75370525 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-21 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75370501 [Test build #27813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27813/consoleFull) for PR 4460 at commit

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-21 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75370524 [Test build #27813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27813/consoleFull) for PR 4460 at commit

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-21 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75370475 - (Any support for making Metadata .get methods return `Option`?) - I created a `FeatureType` hierarchy. It's a little tricky and involves `trait`s because `Binary`

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-20 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75223695 @srowen If we mark a string column categorical, it may be hard to answer how many categories it has without looking at the data. If a column is marked categorical, it

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-20 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75225746 Don't you always have to look at the data to determine how many unique values a column has, regardless of type? String and int are encodings, but attribute types like

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-20 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75352674 +1 for the feature assembler or some other algorithm handling munging and indexing as needed. * Note that the behavior of the assembler may depend on the algorithm

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-20 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75304366 "Don't you always have to look at the data to determine how many unique values a column has, regardless of type?" Not if we already have ML attributes saved

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-19 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75129379 I'm OK with a type hierarchy as long as it stays simple (and doesn't turn into a type system parallel to the DataFrame system). To support any type of

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-19 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75106841 Rename FeatureType? and what's its value for AttributeGroup? GROUP or null? I wish we could use `type`, but it is already taken by Scala. `DataType` is taken by

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-19 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75170753 Call it `AttributeType` maybe? So if an `AttributeGroup` contains both `Attribute`s and vector-valued columns, which sound like `AttributeGroup`s within

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-17 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-74648170 So is the idea that `FeatureAttributes` becomes `AttributeGroup`, and that it continues to contain many `Attribute`s? I didn't realize that we intended the vector-valued

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-16 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-74623712 Btw, for the feature type, besides continuous and categorical, do we want to make binary special? It could be treated as both continuous and categorical.

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-16 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-74609769 I like the current sketch but also want to think about it more. A few thoughts: I'm not quite clear on how the Array of Attributes in FeatureAttributes

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-16 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-74614834 This is perhaps contained in @jkbradley 's question, but how does this work with features that are represented with multiple entries in the feature vector - e.g. when

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-16 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-74623603 There are two types of `Attribute(s)`: describing a feature group (a vector column) or describing a single feature (a scalar column). For a feature group, the column name

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-11 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-73927224 @mengxr Great, that helps me. I took another shot at implementing the above ideas. - Is package `org.apache.spark.ml.attribute` reasonable? - `FeatureType`

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-11 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-73927632 [Test build #27294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27294/consoleFull) for PR 4460 at commit

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-11 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-73927817 [Test build #27294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27294/consoleFull) for PR 4460 at commit

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-73927823 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-08 Thread srowen
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/4460 SPARK-4588 [MLLIB] [WIP] Add API for feature attributes @mengxr This is an early checkin to see if this is in the right direction at all. - Should this have a counterpart builder

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-08 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-73435580 [Test build #27053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27053/consoleFull) for PR 4460 at commit

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-73439415 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

2015-02-08 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-73439410 [Test build #27053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27053/consoleFull) for PR 4460 at commit