I've been working on a feature extraction problem with Mahout and was
wondering if others have dealt with a similar issue and/or have some
ideas. This is more of a feature extraction problem than a Mahout
problem, but the implementation is in Mahout. For this example, I'm
putting together a logistic regression to classify whether a person is
a pet lover or not.

Let's say I have a collection of data on things people own. One of the
features would be "pets". Each person could have multiple types of
pets, for example "Cats", or "Dogs". For each cat or dog they own, we
have the type of cat, for example "Siamese". And for a particular cat,
we have items such as weight, age, name, etc.

For example, user A has a Siamese cat named Toby with an age of 10,
and another Siamese cat named Tabby with an age of 6. User B has no
cats, and user C has a calico cat named Florence with an age of 10.
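
To make the shape of the data concrete, here is a rough Python sketch
of the three users above. This is not Mahout code; the Pet class and
its field names are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pet:
    kind: str     # type of pet, e.g. "cat" or "dog" (known dictionary)
    species: str  # e.g. "siamese" (open-ended, not known in advance)
    name: str     # open-ended
    age: int


# Hypothetical records for the three example users.
users = {
    "A": [Pet("cat", "siamese", "Toby", 10), Pet("cat", "siamese", "Tabby", 6)],
    "B": [],  # no pets at all
    "C": [Pet("cat", "calico", "Florence", 10)],
}
```

The point is that each user has a variable-length list of pets, and
that list has to end up as one fixed-length feature vector per user.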

Now, there could be maybe 1000 different types of pets, like "cat",
"dog", "penguin". So while the number is a decent size, we do know the
dictionary of possible pets. However, for a specific type such as
"cat", let's say there are 1 million different species, and we can't
possibly know what all of those are. The same issue applies to the pet
name, where we don't know the entire space of pet names.

I would like to be able to compare all users against each other,
including users that have 6 cats and users that have no cats.

What I was thinking in terms of features would be:

Categorical features (value )

NumberOfPets
NumberOfCats
NumberOfDogs etc for all types of pets.
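
As a rough sketch of what I mean by these count features (plain
Python, not Mahout; KNOWN_PET_TYPES and count_features are names I
made up, and the three-entry list stands in for the full dictionary of
roughly 1000 pet types):

```python
from collections import Counter

# Hypothetical stand-in for the known dictionary of pet types.
KNOWN_PET_TYPES = ["cat", "dog", "penguin"]


def count_features(pets):
    """Build per-type counts from a list of (pet_type, species) tuples."""
    counts = Counter(pet_type for pet_type, _ in pets)
    features = {"NumberOfPets": len(pets)}
    for pet_type in KNOWN_PET_TYPES:
        # Every user gets every key, so users with no pets get zeros.
        features["NumberOf" + pet_type.capitalize() + "s"] = counts.get(pet_type, 0)
    return features
```

Since every user gets the same set of keys, a user with 6 cats and a
user with no cats are still directly comparable.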

Then we get to species. One issue with species is that, due to the
number of possible species, there may be a large number of species
that only 1 or 2 users have. So I thought about having categorical
features for the most popular species, like:

NumberOfSiamese

and then a feature to catch all of the species that are not popular:

NumberOfRareSpecies
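
Here is a minimal sketch of the popular/rare split, again in plain
Python rather than Mahout. The function names and the min_users
threshold are assumptions of mine; where to draw the "popular" line is
exactly the kind of thing I'm unsure about.

```python
from collections import Counter


def popular_species(all_users_species, min_users=2):
    """Return the species owned by at least min_users distinct users."""
    owners = Counter()
    for species_list in all_users_species:
        # set() so a user with two Siamese cats counts once as an owner.
        owners.update(set(species_list))
    return {s for s, n in owners.items() if n >= min_users}


def species_features(species_list, popular):
    """Count popular species individually; lump the rest into one bucket."""
    feats = Counter()
    for s in species_list:
        if s in popular:
            feats["NumberOf" + s.capitalize()] += 1
        else:
            feats["NumberOfRareSpecies"] += 1
    return dict(feats)
```

With this, the set of popular species has to be computed in a first
pass over the training data before any user can be encoded.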

I've also thought about encoding species as a text field, so you would
have a "CatSpecies" text feature, with the value being all of the cat
species a user owns.
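
One way such a text field could be turned into a fixed-size vector
without knowing the vocabulary up front is feature hashing. The sketch
below is plain Python and only illustrates the idea; it is not how any
particular Mahout encoder is implemented, and hashed_text_features and
num_buckets are names I made up.

```python
import hashlib


def hashed_text_features(field_name, text, num_buckets=1000):
    """Map each token of a text field to a bucket index and count hits."""
    vec = {}
    for token in text.lower().split():
        # md5 is used only as a stable hash; Python's built-in hash() of
        # a str changes between runs, which would break reproducibility.
        key = (field_name + ":" + token).encode("utf-8")
        idx = int(hashlib.md5(key).hexdigest(), 16) % num_buckets
        vec[idx] = vec.get(idx, 0) + 1
    return vec
```

The trade-off is that two different species can collide in the same
bucket, so the choice of num_buckets matters when there may be a
million species.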

And now I'm left with the features of a specific cat, such as
name/age. If I keep species as a categorical feature, I could have
features like SiameseCatAge and SiameseCatName. If I use a CatSpecies
text field, I could possibly have features like CatAge and CatName
that do not factor in the species type.

In the case of the SiameseCatAge, a user with multiple siamese cats
would have multiple values added to the categorical feature, while a
user with no siamese cats would have no features added.
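
To spell out what that would look like, here is a small Python sketch
(not Mahout; the function name and the tuple layout are made up for
illustration) that emits one age feature and one name feature per cat:

```python
def per_species_cat_features(cats):
    """Emit (feature_name, value) pairs from (species, name, age) tuples."""
    features = []
    for species, name, age in cats:
        prefix = species.capitalize() + "Cat"
        features.append((prefix + "Age", age))
        features.append((prefix + "Name", name))
    return features
```

For user A this yields two SiameseCatAge entries (10 and 6), and for
user B it yields an empty list, which is the asymmetry I'm not sure how
best to handle.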

I'd be interested to hear whether people think this is a reasonable
approach to feature extraction for this data, or whether there are
other feature extraction techniques I should consider for this
problem. I'm sure there are some holes in this approach, and I would
really appreciate any help in identifying what those are and how I
could improve on it.
