Thanks Ted. Since SGD is a sequential method, the vector created for each
line could be very large and still wouldn't consume too much memory. Could I
assume that if we have a limited number of features, or if we use map-reduce
to pre-process the data and find out how many distinct values each category
could have, we could just create a long vector and put each feature value
into its own slot to avoid the possible feature collisions?
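To make the idea concrete, here is a rough sketch of what I mean
(SlotPerValueEncoder is just a hypothetical name, and I'm assuming the
map-reduce pass has already produced the feature-to-slot dictionary):

import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Hypothetical sketch: assumes a pre-computed dictionary mapping every
// distinct "field:value" string to its own slot.
public class SlotPerValueEncoder {
  private final Map<String, Integer> slotByFeature;
  private final int cardinality;

  public SlotPerValueEncoder(Map<String, Integer> slotByFeature) {
    this.slotByFeature = slotByFeature;
    this.cardinality = slotByFeature.size();
  }

  // One sparse vector per input line; every feature value owns its slot,
  // so no two features can collide the way they might under hashing.
  public Vector encode(Iterable<String> featuresOnLine) {
    Vector v = new RandomAccessSparseVector(cardinality);
    for (String feature : featuresOnLine) {
      Integer slot = slotByFeature.get(feature);
      if (slot != null) {  // values never seen in the pre-pass are dropped
        v.setQuick(slot, 1.0);
      }
    }
    return v;
  }
}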

Thanks,
Stanley



On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <[email protected]> wrote:

> Stanley,
>
> Yes.  What you say is correct.  Feature hashing can cause degradation.
>
> With multiple hashing, however, you do have a fairly strong guarantee that
> the feature hashing is very close to information preserving.  This is
> related to the fact that the feature hashing operation is a random linear
> transformation.  Since we are hashing to something that is still quite a
> high dimensional space, the information loss is likely to be minimal.
>
> On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <[email protected]> wrote:
>
> > Dear all,
> >
> > Per my understanding, what feature hashing does in SGD is compress the
> > feature dimensions into a fixed-length vector. Won't that make the
> > training result incorrect if a feature hashing collision happens? Won't
> > two features hashed to the same slot be treated as the same feature?
> > Even if we have multiple probes to reduce the total collisions, like a
> > Bloom filter, won't a slot that has a collision look like a combination
> > feature?
> >
> > Thanks.
> >
> > Best wishes,
> > Stanley Xu
> >
>
