Github user srowen closed the pull request at:
https://github.com/apache/spark/pull/4460
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75461553
I went to add AttributeGroup, but then I can't figure out how this isn't
already covered by Attribute's dimension? It's 1 for a scalar, 1 for a
vector-valued feature.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75458264
To me, that means that the idea of a string-valued categorical column
still has a place in the representation since it exists at some stage of a
pipeline. It's just
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75366539
@mengxr Understood about being told the number of distinct values for a
column by the caller/schema. I thought you were saying this was a difference
between string /
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75384291
@jkbradley The issue with algorithms handling munging and indexing is the
increased complexity. For example, if `DecisionTree` takes string columns,
there will be some
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75370525
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75370501
[Test build #27813 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27813/consoleFull)
for PR 4460 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75370524
[Test build #27813 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27813/consoleFull)
for PR 4460 at commit
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75370475
- (Any support for making Metadata .get methods return `Option`?)
- I created a `FeatureType` hierarchy. It's a little tricky and involves
`trait`s because `Binary`
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75223695
@srowen If we mark a string column categorical, it may be hard to answer
how many categories it has without looking at the data. If a column is marked
categorical, it
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75225746
Don't you always have to look at the data to determine how many unique
values a column has, regardless of type? String and int are encodings, but
attribute types like
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75352674
+1 for the feature assembler or some other algorithm handling munging and
indexing as needed.
* Note that the behavior of the assembler may depend on the algorithm
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75304366
> Don't you always have to look at the data to determine how many unique
> values a column has, regardless of type?
No if we already have ML attributes saved
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75129379
I'm OK with a type hierarchy as long as it stays simple (and doesn't turn
into a type system parallel to the DataFrame system).
To support any type of
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75106841
Rename FeatureType? and what's its value for AttributeGroup? GROUP or
null?
I wish we could use `type`, but it is already taken by Scala. `DataType` is
taken by
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75170753
Call it `AttributeType` maybe?
So if an `AttributeGroup` contains both `Attribute`s but also vector-valued
columns, which sound like `AttributeGroup`s within
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-74648170
So is the idea that `FeatureAttributes` becomes `AttributeGroup`, and that
it continues to contain many `Attribute`s? I didn't realize that we intended
the vector-valued
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-74623712
Btw, for the feature type, besides continuous and categorical, do we want to
make binary special? It could be treated as both continuous and categorical.
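A minimal sketch of how a binary type could count as both continuous and categorical, assuming a trait-based `FeatureType` hierarchy like the one mentioned elsewhere in this thread. This is not code from the PR: `FeatureTypeSketch`, `Numeric`, `Nominal`, `isContinuous`, and `isCategorical` are hypothetical names chosen only for illustration.

```scala
// Illustrative only: hypothetical names, not the API proposed in the PR.
object FeatureTypeSketch {
  sealed trait FeatureType { def name: String }

  // Families are traits so that a single type can belong to more than one.
  sealed trait Continuous extends FeatureType
  sealed trait Categorical extends FeatureType

  case object Numeric extends Continuous { val name = "numeric" }
  case object Nominal extends Categorical { val name = "nominal" }

  // Binary mixes in both families, so it satisfies either check below.
  case object Binary extends Continuous with Categorical { val name = "binary" }

  def isContinuous(t: FeatureType): Boolean = t match {
    case _: Continuous => true
    case _             => false
  }

  def isCategorical(t: FeatureType): Boolean = t match {
    case _: Categorical => true
    case _              => false
  }
}
```

With this layout an algorithm that only understands continuous features and one that only understands categorical features can both accept a binary column without a special case.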
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-74609769
I like the current sketch but also want to think about it more. A few
thoughts:
I'm not quite clear on how the Array of Attributes in FeatureAttributes
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-74614834
This is perhaps contained in @jkbradley's question, but how does this work
with features that are represented with multiple entries in the feature vector
- e.g. when
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-74623603
There are two types of `Attribute(s)`: describing a feature group (a vector
column) or describing a single feature (a scalar column). For a feature group,
the column name
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-73927224
@mengxr Great, that helps me. I took another shot at implementing the above
ideas.
- Is package `org.apache.spark.ml.attribute` reasonable?
- `FeatureType`
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-73927632
[Test build #27294 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27294/consoleFull)
for PR 4460 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-73927817
[Test build #27294 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27294/consoleFull)
for PR 4460 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-73927823
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
GitHub user srowen opened a pull request:
https://github.com/apache/spark/pull/4460
SPARK-4588 [MLLIB] [WIP] Add API for feature attributes
@mengxr
This is an early checkin to see if this is in the right direction at all.
- Should this have a counterpart builder
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-73435580
[Test build #27053 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27053/consoleFull)
for PR 4460 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-73439415
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-73439410
[Test build #27053 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27053/consoleFull)
for PR 4460 at commit