Hi Joscha,

If you have some money left, I'd recommend getting a copy of Mahout in Action. It has a very readable, detailed introduction to classification with Mahout, including strategies for feature selection.

--sebastian

On 10.06.2011 17:28, Hector Yee wrote:
Oh, you have a very strange feature: you are using the label as a feature. My
bad, I thought the words were the labels.
Usually a feature is something like weight or height, something meaningful. If
it's just the label, as you have it, you might as well use a hash map; there is
no feature to learn! But if you want, try making it an indicator vector. Set
FEATURES to the number of animals, and set the vector to 1 at the index of the
animal in the array and 0 otherwise. E.g. for ant the vector is 0, 1, 0, 0, 0, ...
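
The indicator-vector idea above can be sketched in plain Java without any Mahout types. The class name, the helper name and the shortened animal list are assumptions made for illustration only:

```java
import java.util.Arrays;
import java.util.List;

public class IndicatorVectorSketch {

    // Shortened, hypothetical animal list; the real array has one entry per animal.
    static final List<String> ANIMALS =
            Arrays.asList("alligator", "ant", "bear", "bee", "bird");

    // Returns a vector with one slot per known animal: 1.0 at the index of
    // the given word, 0.0 elsewhere. Unknown words yield an all-zero vector.
    static double[] indicatorVector(String word) {
        double[] v = new double[ANIMALS.size()];
        int index = ANIMALS.indexOf(word);
        if (index >= 0) {
            v[index] = 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(indicatorVector("ant")));
        System.out.println(Arrays.toString(indicatorVector("xyz")));
    }
}
```

With Mahout, the same values could presumably be written into a RandomAccessSparseVector sized to the number of animals using Vector.set(index, 1.0), with FEATURES set to that same number.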


On Jun 10, 2011, at 12:54 AM, Joscha Feth wrote:

Hello fellow Mahouts,

I am trying to grasp Mahout and wrote a very simple (but obviously wrong)
example, which I hoped would help me understand how everything works:

-- 8<  --
// Package names as in Mahout 0.5; they may differ in other versions.
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;
import org.apache.mahout.vectorizer.encoders.WordValueEncoder;

public class OLRTest {

    private static final int FEATURES = 1;
    private static final int CATEGORIES = 2;

    private static final WordValueEncoder ANIMAL_ENCODER =
            new AdaptiveWordValueEncoder("animal");

    private static final String[] animals = new String[] {
            "alligator", "ant", "bear", "bee", "bird", "camel", "cat",
            "cheetah", "chicken", "chimpanzee", "cow", "crocodile", "deer",
            "dog", "dolphin", "duck", "eagle", "elephant", "fish", "fly",
            "fox", "frog", "giraffe", "goat", "goldfish", "hamster",
            "hippopotamus", "horse", "kangaroo", "kitten", "lion", "lobster",
            "monkey", "octopus", "owl", "panda", "pig", "puppy", "rabbit",
            "rat", "scorpion", "seal", "shark", "sheep", "snail", "snake",
            "spider", "squirrel", "tiger", "turtle", "wolf", "zebra" };

    public static void main(String[] args) {
        final OnlineLogisticRegression algorithm =
                new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());

        // Every animal is trained with target category 0.
        for (String animal : animals) {
            algorithm.train(0, generateVector(animal));
        }

        algorithm.close();

        testClassify(algorithm, "lion");
        testClassify(algorithm, "rabbit");
        testClassify(algorithm, "xyz");
        testClassify(algorithm, "something");
    }

    // Prints the scalar score for a word, scaled to a percentage.
    private static void testClassify(final OnlineLogisticRegression algorithm,
            final String allegedAnimal) {
        System.out.println(allegedAnimal
                + " is an animal with a probability of "
                + algorithm.classifyScalar(generateVector(allegedAnimal)) * 100
                + "%");
    }

    // Encodes the word into a sparse vector of size FEATURES.
    private static Vector generateVector(String animal) {
        final Vector v = new RandomAccessSparseVector(FEATURES);
        ANIMAL_ENCODER.addToVector(animal, v);
        return v;
    }
}
-- 8<  --

The output of running this sample code is:
-- 8<  --
lion is an animal with a probability of 0.12008121418417145%
rabbit is an animal with a probability of 0.11720244687895641%
xyz is an animal with a probability of 0.04192879358244322%
something is an animal with a probability of 0.04047790610981663%
-- 8<  --

Several things surprised me:
* I would have expected the probability of "lion" and "rabbit" to be close to
100%
* I would have expected the probability of "xyz" and "something" to be close to
0%
* I would have expected the probability of "lion" to be the same as the one
for "rabbit"
* I would have expected the probability of "xyz" to be the same as the one
for "something"

I know that the animal sample provided is extremely small, but even training
with multiple passes (100, 1000, 10000) changed the probabilities only
marginally.
What am I missing here?

Thanks very much!
Joscha Feth
