Maybe there is indeed a place for an MR-based input conversion job as a command-line routine? I was thinking about the same thing. Maybe even along with standardisation of the values, and some formal definition of the inputs being fed to it.
apologies for brevity. Sent from my android.
-Dmitriy

On Apr 21, 2011 3:05 PM, "Ted Dunning" <[email protected]> wrote:
> It is definitely a reasonable idea to convert data to hashed feature
> vectors using map-reduce.
>
> And yes, you can pick a vector length that is long enough so that you
> don't have to worry about collisions. You need to examine your data to
> decide how large that needs to be, but it isn't hard to do. The encoding
> framework handles the placement of features in the vector for you. You
> don't have to worry about that.
>
> On Wed, Apr 20, 2011 at 8:03 PM, Stanley Xu <[email protected]> wrote:
>
>> Thanks Ted. Since SGD is a sequential method, the vector created for
>> each line could be very large without consuming too much memory. Could
>> I assume that if we have a limited number of features, or could use
>> map-reduce to pre-process the data and learn how many different values
>> each category could have, we could just create a long vector and put
>> different feature values into different slots to avoid possible feature
>> collisions?
>>
>> Thanks,
>> Stanley
>>
>> On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <[email protected]> wrote:
>>
>>> Stanley,
>>>
>>> Yes. What you say is correct. Feature hashing can cause degradation.
>>>
>>> With multiple hashing, however, you do have a fairly strong guarantee
>>> that the feature hashing is very close to information preserving. This
>>> is related to the fact that the feature hashing operation is a random
>>> linear transformation. Since we are hashing to something that is still
>>> quite a high-dimensional space, the information loss is likely to be
>>> minimal.
>>>
>>> On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <[email protected]> wrote:
>>>
>>>> Dear all,
>>>>
>>>> Per my understanding, what feature hashing does in SGD is compress
>>>> the feature dimensions into a fixed-length vector. Won't that make
>>>> the training result incorrect if a feature-hashing collision happens?
>>>> Won't two features hashed to the same slot be treated as the same
>>>> feature, even if we have multiple probes to reduce total collisions,
>>>> like a Bloom filter? Won't a slot that has a collision also look like
>>>> a combination feature?
>>>>
>>>> Thanks.
>>>>
>>>> Best wishes,
>>>> Stanley Xu
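To make the multiple-probe idea in the thread concrete, here is a minimal Python sketch (a toy illustration only — the hash function, probe count, dimension, and feature names are all assumptions of mine, and this is not Mahout's actual encoding framework). Each feature is hashed with several independently seeded hashes and its weight is split across the resulting slots, so two features only become fully indistinguishable if they collide on every probe; a collision on a single probe perturbs just a fraction of the weight:

```python
import hashlib

def probe_slots(feature, dim, num_probes):
    """Return one slot index per seeded hash of a feature string."""
    slots = []
    for seed in range(num_probes):
        digest = hashlib.md5(("%d:%s" % (seed, feature)).encode("utf-8")).hexdigest()
        slots.append(int(digest, 16) % dim)
    return slots

def encode(features, dim=2 ** 16, num_probes=3):
    """Hash a list of (already standardised) feature strings into a sparse
    fixed-length vector, represented here as a dict of slot -> weight."""
    vec = {}
    for feature in features:
        for slot in probe_slots(feature, dim, num_probes):
            # Split the feature's unit weight across its probes, so a
            # collision on one probe only affects 1/num_probes of it.
            vec[slot] = vec.get(slot, 0.0) + 1.0 / num_probes
    return vec

# One record's categorical features, e.g. as emitted by the kind of
# map-side conversion job Dmitriy suggests above (names hypothetical)
v = encode(["country=US", "device=android", "hour=15"])
```

A routine like this is map-side only (no reducer state), which is what makes the "hashed feature vectors via map-reduce" conversion job straightforward: each mapper can encode its lines independently into vectors of the same fixed length.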
