I've been working on a feature problem with Mahout and was wondering if others have dealt with a similar issue or have some ideas. This is more of a feature extraction problem than a Mahout problem, but the implementation is in Mahout. For this example, I'm putting together a logistic regression to classify whether a person is a pet lover or not.
Let's say I have a collection of data on things people own. One of the features would be "pets". Each person could have multiple types of pets, for example "Cats" or "Dogs". For each cat or dog they own, we have the type of cat, for example "Siamese". And for a particular cat, we have items such as weight, age, name, etc. For example, user A has a Siamese cat named Toby with an age of 10, and another Siamese cat named Tabby with an age of 6. User B has no cats, and user C has a calico cat named Florence with an age of 10.

Now, there could be maybe 1000 different types of pets, like "cat", "dog", "penguin". So while the number is a decent size, we do know the dictionary of possible pets. However, for the specific type of "cat", let's say there are 1 million different types, and we can't possibly know what all of those are. The same issue applies to the pet name, where we don't know the entire space of pet names. I would like to be able to compare all users against each other, including users that have 6 cats and users that have no cats.

What I was thinking in terms of features would be:

Categorical features (value )
NumberOfPets
NumberOfCats
NumberOfDogs
etc. for all types of pets.

Then we get to species. One issue with species is that, due to the number of possible species, there may be a large number of species that only 1 or 2 users have. So I thought about having categorical features for the most popular species, like:
NumberOfSiamese
and then a feature to catch all of the species that are not popular:
NumberOfRareSpecies

I've also thought about encoding species as a text field, so you would have a "CatSpecies" text feature, with the value being all of the cat species a user owns.

And now I'm left with the features of a specific cat, such as name and age. If I keep species as a categorical feature, I could have features like: SiameseCatAge, SiameseCatName.
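
To make the count features above concrete, here is a rough sketch in plain Python rather than Mahout. It only illustrates the per-type counts and the rare-species bucket. The input layout, the function name `count_features`, and the choice of which species count as "popular" are all assumptions made up for this illustration, not anything Mahout provides.

```python
from collections import Counter

# Hypothetical input: each user maps to a list of (pet_type, species) pairs.
# Users A, B and C follow the example in the text.
USERS = {
    "A": [("cat", "siamese"), ("cat", "siamese")],
    "B": [],
    "C": [("cat", "calico")],
}

# Assumed to be chosen up front, e.g. by a frequency threshold over all users.
POPULAR_SPECIES = {"siamese"}


def count_features(pets, popular_species):
    """Build sparse count features for one user's pets."""
    features = Counter()
    features["NumberOfPets"] = len(pets)
    for pet_type, species in pets:
        # One count per known pet type, e.g. NumberOfCats.
        features["NumberOf" + pet_type.capitalize() + "s"] += 1
        if species in popular_species:
            # Popular species get their own count, e.g. NumberOfSiamese.
            features["NumberOf" + species.capitalize()] += 1
        else:
            # Everything else falls into a single catch-all bucket.
            features["NumberOfRareSpecies"] += 1
    return dict(features)


if __name__ == "__main__":
    for user, pets in USERS.items():
        print(user, count_features(pets, POPULAR_SPECIES))
```

A user with no pets still gets a feature row (NumberOfPets is 0), so users with 6 cats and users with none remain comparable.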
If I use a CatSpecies text field, I could possibly have features like CatAge and CatName that do not factor in the species. In the case of SiameseCatAge, a user with multiple Siamese cats would have multiple values added to the categorical feature, while a user with no Siamese cats would have no features added.

I'd be interested to hear whether people think this is a reasonable approach to feature extraction for this data, or whether there are other feature extraction techniques I should consider. I'm sure there are some holes in this approach, and I would really appreciate any help in identifying them and improving on it.
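
For reference, the two per-cat options can be sketched side by side in plain Python (again not Mahout code). The function names and the "Feature=value" key format are hypothetical, and treating age as a categorical value rather than a number is an assumption taken from the description above.

```python
from collections import Counter

# Cats per user as (species, name, age), following the example in the text.
USER_A = [("siamese", "toby", 10), ("siamese", "tabby", 6)]
USER_B = []
USER_C = [("calico", "florence", 10)]


def species_specific(cats):
    """Option 1: attribute features are keyed by species, e.g. SiameseCatAge=10."""
    features = Counter()
    for species, name, age in cats:
        prefix = species.capitalize() + "Cat"
        features[f"{prefix}Age={age}"] += 1
        features[f"{prefix}Name={name}"] += 1
    return dict(features)


def species_agnostic(cats):
    """Option 2: species is a token in a CatSpecies field; age and name ignore species."""
    features = Counter()
    for species, name, age in cats:
        features[f"CatSpecies={species}"] += 1
        features[f"CatAge={age}"] += 1
        features[f"CatName={name}"] += 1
    return dict(features)


if __name__ == "__main__":
    for label, cats in [("A", USER_A), ("B", USER_B), ("C", USER_C)]:
        print(label, species_specific(cats), species_agnostic(cats))
```

In this sketch, users A and C share the feature CatAge=10 under the second option, while under the first option they share nothing because the age is tied to the species. A user with no cats produces an empty feature set in both cases.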
