Thanks Ted. Since SGD is a sequential method, the Vector created for each line could be very large and still wouldn't consume too much memory. Could I assume that, if we have a limited number of features, or if we could use map-reduce to pre-process the data and find out how many different values each category could have, we could just create a long vector and put different feature values into different slots to avoid possible feature collisions?
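To make the idea concrete, here is a minimal sketch of that pre-built dictionary approach; this is plain Java written just for illustration (not Mahout code, and all class and method names are made up): one dedicated slot per distinct "category=value" pair, so no collisions are possible.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: assign every distinct "category=value" feature its
 * own slot in a long vector.  The dictionary would normally be built by a
 * pre-pass (e.g. a map-reduce job counting distinct values per category);
 * here it is filled from an in-memory list for illustration.
 */
public class DictionaryEncoder {

    private final Map<String, Integer> slotByFeature = new HashMap<>();

    /** Pre-pass: give the next free slot to every distinct feature seen. */
    public void buildDictionary(List<String[]> lines) {
        for (String[] categoryValuePairs : lines) {
            for (String feature : categoryValuePairs) {
                if (!slotByFeature.containsKey(feature)) {
                    slotByFeature.put(feature, slotByFeature.size());
                }
            }
        }
    }

    /** Length of the resulting vector: one slot per distinct value. */
    public int cardinality() {
        return slotByFeature.size();
    }

    /** Encode one input line into a dense 0/1 vector of fixed length. */
    public double[] encode(String[] categoryValuePairs) {
        double[] vector = new double[cardinality()];
        for (String feature : categoryValuePairs) {
            Integer slot = slotByFeature.get(feature);
            if (slot != null) {       // values unseen in the pre-pass are dropped
                vector[slot] = 1.0;
            }
        }
        return vector;
    }
}

The pre-pass is exactly where the map-reduce counting job would come in; the trade-off is that the vector length grows with the number of distinct values, and values first seen at prediction time simply get dropped.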
Thanks,
Stanley

On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <[email protected]> wrote:

> Stanley,
>
> Yes. What you say is correct. Feature hashing can cause degradation.
>
> With multiple hashing, however, you do have a fairly strong guarantee that
> the feature hashing is very close to information preserving. This is
> related to the fact that the feature hashing operation is a random linear
> transformation. Since we are hashing to something that is still quite a
> high dimensional space, the information loss is likely to be minimal.
>
> On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <[email protected]> wrote:
>
> > Dear all,
> >
> > Per my understanding, what Feature Hashing does in SGD is compress the
> > Feature Dimensions to a fixed-length Vector. Won't that make the training
> > result incorrect if a Feature Hashing collision happened? Won't two
> > features hashed to the same slot be treated as the same feature? Even if
> > we have multiple probes to reduce the total collisions, like a Bloom
> > filter, won't that also make the slot that has a collision look like a
> > combination feature?
> >
> > Thanks.
> >
> > Best wishes,
> > Stanley Xu
> >
>
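For comparison, a similarly hedged sketch of the multiple-probe hashing Ted describes above; again this is plain Java made up for illustration, not the actual Mahout encoder API. Each feature is written into k slots of a fixed-length vector, each chosen by a differently-seeded hash, so two features only become indistinguishable if they collide on all k probes.

/**
 * Hypothetical sketch of feature hashing with multiple probes: every
 * feature lands in several slots of a fixed-length vector, one per probe.
 */
public class HashedEncoder {

    private final int cardinality;   // fixed vector length, e.g. 1 << 20
    private final int probes;        // number of hash probes per feature, e.g. 2 or 3

    public HashedEncoder(int cardinality, int probes) {
        this.cardinality = cardinality;
        this.probes = probes;
    }

    /** Add one categorical feature ("category=value") into the vector. */
    public void addToVector(String feature, double weight, double[] vector) {
        for (int probe = 0; probe < probes; probe++) {
            int slot = Math.floorMod(hash(feature, probe), cardinality);
            // each probe carries a fraction of the weight, so the total is unchanged
            vector[slot] += weight / probes;
        }
    }

    /** Simple seeded hash; a real implementation would use something like MurmurHash. */
    private static int hash(String feature, int seed) {
        int h = seed * 0x9E3779B1;
        for (int i = 0; i < feature.length(); i++) {
            h = 31 * h + feature.charAt(i);
            h ^= (h >>> 15);
        }
        return h;
    }
}

With a cardinality around 2^20 and two or three probes, the vector stays the same size no matter how many distinct values show up, which is the trade-off against the pre-built dictionary approach above.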
