Maybe there is indeed room for an MR-based input conversion job as a
command line routine? I was thinking along the same lines. Maybe even
along with standardisation of the values, and some formal definition of
the inputs being fed to it.

Apologies for brevity.

Sent from my android.
-Dmitriy
On Apr 21, 2011 3:05 PM, "Ted Dunning" <[email protected]> wrote:
> It is definitely a reasonable idea to convert data to hashed feature
> vectors using map-reduce.
>
> And yes, you can pick a vector length that is long enough that you don't
> have to worry about collisions. You need to examine your data to decide
> how large that needs to be, but it isn't hard to do. The encoding
> framework handles the placement of features in the vector for you; you
> don't have to worry about that.
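The hashed encoding Ted describes can be sketched roughly as follows. This is a simplified illustration, not Mahout's actual encoding framework; the hash function (`zlib.crc32`), the feature names, and the vector length are all illustrative choices.

```python
import zlib

# A minimal sketch of hashed feature encoding: each feature name is
# hashed to a slot in a fixed-length vector, and collisions simply add
# into the same slot. A dict stands in for a sparse vector type.
def hash_features(features, vector_length=1 << 20):
    """Hash (name, value) pairs into a sparse slot -> value mapping."""
    vec = {}
    for name, value in features:
        slot = zlib.crc32(name.encode("utf-8")) % vector_length
        vec[slot] = vec.get(slot, 0.0) + value
    return vec

# With a large enough vector_length, distinct names rarely share a
# slot, which is why Ted suggests sizing the vector from the data.
v = hash_features([("age", 37.0), ("country=US", 1.0)])
```

Because the mapping is deterministic, the same feature name always lands in the same slot, so no dictionary has to be stored or shipped with the model.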
>
> On Wed, Apr 20, 2011 at 8:03 PM, Stanley Xu <[email protected]> wrote:
>
>> Thanks Ted. Since SGD is a sequential method, the Vector created for
>> each line can be fairly large without consuming too much memory. Could
>> I assume that if we have a limited number of features, or use
>> map-reduce to pre-process the data and find out how many different
>> values each category can have, we could just create a long vector and
>> put different feature values into different slots to avoid possible
>> feature collisions?
>>
>> Thanks,
>> Stanley
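The collision-free alternative Stanley describes, a counting pre-pass followed by one dedicated slot per distinct value, might look like this in outline. The field and value names are invented for illustration; in practice the first pass would be a map-reduce job over the full data set.

```python
# First pass (in practice a map-reduce job): enumerate every distinct
# (field, value) pair and give each its own slot, so two features can
# never collide. Second pass: encode each record against that dictionary.
def build_dictionary(records):
    slots = {}
    for record in records:
        for field, value in record.items():
            slots.setdefault((field, value), len(slots))
    return slots

def encode(record, slots):
    vec = [0.0] * len(slots)
    for field, value in record.items():
        vec[slots[(field, value)]] = 1.0
    return vec

records = [{"country": "US", "device": "android"},
           {"country": "CN", "device": "android"}]
slots = build_dictionary(records)
```

The trade-off versus hashing is the extra pass over the data and a dictionary that has to be shipped to every worker and kept in sync with the data.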
>>
>>
>>
>> On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <[email protected]>
>> wrote:
>>
>> > Stanley,
>> >
>> > Yes. What you say is correct. Feature hashing can cause degradation.
>> >
>> > With multiple hashing, however, you do have a fairly strong guarantee
>> > that the feature hashing is very close to information preserving. This
>> > is related to the fact that the feature hashing operation is a random
>> > linear transformation. Since we are hashing to something that is still
>> > quite a high-dimensional space, the information loss is likely to be
>> > minimal.
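The multiple-probe scheme Ted refers to can be sketched like this; the probe count, hash function, and vector length are illustrative choices, not what Mahout actually uses.

```python
import zlib

# Each feature is hashed with several independent seeds and its value
# is spread over the resulting slots. Two features are fully confused
# only if *all* of their probes collide, so the transform stays much
# closer to information-preserving than a single hash would -- the
# Bloom-filter analogy from the question below.
def hash_with_probes(features, vector_length=1 << 18, probes=2):
    vec = {}
    for name, value in features:
        for seed in range(probes):
            data = ("%d:%s" % (seed, name)).encode("utf-8")
            slot = zlib.crc32(data) % vector_length
            vec[slot] = vec.get(slot, 0.0) + value / probes
    return vec
```

Note that the total weight contributed by a feature is unchanged; it is merely split across probes, which is what makes the encoding a linear transformation.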
>> >
>> > On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <[email protected]> wrote:
>> >
>> > > Dear all,
>> > >
>> > > As I understand it, Feature Hashing in SGD compresses the feature
>> > > dimensions into a fixed-length Vector. Won't that make the training
>> > > result incorrect if a feature-hashing collision happens? Wouldn't
>> > > two features hashed to the same slot be treated as the same
>> > > feature? And even if we use multiple probes to reduce the total
>> > > collisions, like a Bloom filter, won't a slot that has a collision
>> > > look like a combination feature?
>> > >
>> > > Thanks.
>> > >
>> > > Best wishes,
>> > > Stanley Xu
>> > >
>> >
>>
