Mahout uses 1-of-n encoding (aka Zach's bitmap) but stores these encodings
all together in double vectors for consistency.
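
To make that concrete, here is a minimal sketch of 1-of-n encoding into a plain double array. It is written in bare Java rather than against the Mahout vector classes, and the names (OneOfN, encode, distance) are made up for illustration only. It also bears on Don's question below: once encoded this way, any two distinct values are the same distance apart.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout.
public class OneOfN {
    // Encode `value` as a double vector with a single 1.0 at the position
    // of the value in `categories`; every other slot stays 0.0.
    public static double[] encode(List<String> categories, String value) {
        int index = categories.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("unknown category: " + value);
        }
        double[] v = new double[categories.size()];
        v[index] = 1.0;
        return v;
    }

    // Plain Euclidean distance between two equal-length vectors.
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("A", "B", "C");
        double[] a = encode(categories, "A");
        double[] b = encode(categories, "B");
        double[] c = encode(categories, "C");
        System.out.println(Arrays.toString(a));
        System.out.println(distance(a, b));
        System.out.println(distance(a, c));
    }
}
```

Both distances come out as sqrt(2), so A is no closer to B than it is to C, unlike the 1.0, 2.0, 3.0 coding.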

In the hashed encoding, we do the same thing, but all of the encoded
variables live on top of each other in one vector: each value is hashed to
multiple pseudo-random locations.  This sounds crazy, but works quite well.
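
A rough sketch of the hashed idea, again in bare Java and not the Mahout implementation. The class name HashedEncoder, the use of String.hashCode, and the way the probe number is salted into the key are all assumptions made for illustration; the real encoders use their own hash function and vector types.

```java
import java.util.Arrays;

// Hypothetical illustration of hashed feature encoding, not Mahout code.
public class HashedEncoder {
    private final int probes;

    // `probes` is how many locations each (variable, value) pair is written to.
    public HashedEncoder(int probes) {
        this.probes = probes;
    }

    // Map a (variable name, value, probe number) triple to a slot in a
    // vector of the given size. floorMod keeps the index non-negative.
    public static int location(String name, String value, int probe, int size) {
        int h = (name + ":" + value + "#" + probe).hashCode();
        return Math.floorMod(h, size);
    }

    // Add 1.0 at each probe location. Different variables share the same
    // vector, so their locations may collide and simply add up.
    public void addToVector(String name, String value, double[] vector) {
        for (int probe = 0; probe < probes; probe++) {
            vector[location(name, value, probe, vector.length)] += 1.0;
        }
    }

    public static void main(String[] args) {
        double[] vector = new double[20];
        HashedEncoder encoder = new HashedEncoder(2);
        encoder.addToVector("color", "red", vector);
        encoder.addToVector("shape", "square", vector);
        System.out.println(Arrays.toString(vector));
    }
}
```

The vector size is fixed up front and no dictionary of values is needed. Using more than one probe per value means that a collision at one location rarely hides a value completely.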

On Sun, Dec 25, 2011 at 9:18 PM, Zach Richardson <[email protected]> wrote:

> In a way yes.
>
> Generally you want to convert nominal attributes to a "bitmap" (this has a
> fancier name that is slipping my mind at the moment), where each "name" in
> the nominal feature has a spot in the vector for being on or off.  In most
> cases this should be set to one.  I am not aware of anything like that in
> mahout for regular vector encoding.  You could reasonably easily write your
> own.
>
> For instance if you have A, B, and C as the three possible values in your
> nominal feature, you would encode
>
> A B C
> 1 0 0 for A
> 0 1 0 for B etc.
>
> However, if you are planning on using the SGD classifiers you can use the
> Hash based encoding for Categorical / Nominal features through the
> WordValueEncoder.
>
> Hope this helps.
>
> Zach
>
> On Sun, Dec 25, 2011 at 10:18 PM, Donald A. Smith
> <[email protected]> wrote:
>
> > I believe that vectorized attributes are stored as doubles in mahout.  Are
> > some attributes "nominal"? That is, for some attributes is the distance
> > function such that any two unequal values are at distance 1?
> >
> > Looking at MapBackedARFFModel.java, I see that weka nominal attributes get
> > converted to integer-valued doubles (1.0, 2.0, 3.0, ...).   Will the
> > nominal with value 1.0 be closer to the nominal with value 2.0 than to
> > the nominal with value 3.0?  Or is the distance between 1.0 and 3.0 also 1?
> >
> >
> >
> >  Thanks, Don
>
>
>
>
> --
> Zach Richardson
> Ravel, Co-founder
> Austin, TX
> [email protected]
> 512.825.6031
>
