Re: Classification beginner questions

Joscha Feth Sat, 11 Jun 2011 00:04:11 -0700

Hello Sebastian,

Thanks for the hint. I already got the MEAP edition of the ebook through
Manning, but I am struggling to translate the newsgroup and Wikipedia
examples to my use case. In particular, I can't find any code examples that
show how to generate my model without using the Mahout command-line
options.


Kind regards,
Joscha Feth

On Fri, Jun 10, 2011 at 22:23, Sebastian Schelter <[email protected]> wrote:

> Hi Joscha,
>
> If you have some money left, I'd recommend getting a copy of Mahout in
> Action, which features a very readable, detailed introduction to
> classification with Mahout, including strategies for feature selection.
>
> --sebastian
>
>
> On 10.06.2011 17:28, Hector Yee wrote:
>
>> Oh, you have a very strange feature: you are using the label as a feature.
>> My bad, I thought the words were the labels.
>> Usually a feature is something like weight or height, something
>> meaningful. If it's just the label, as you have it, you might as well use a
>> hash map; there is no feature to learn! But if you want to try, make it an
>> indicator vector: set the number of features to the number of animals, and
>> set the vector to 1 at the index of the animal in the array and 0
>> otherwise. E.g. for "ant" the vector is 0, 1, 0, 0, 0, ...
>>
>> Sent from my iPad
>>
>> On Jun 10, 2011, at 12:54 AM, Joscha Feth<[email protected]>  wrote:
>>
>>> Hello fellow Mahouts,
>>>
>>> I am trying to grasp Mahout and wrote a very simple (but obviously
>>> wrong) example which I hoped would help me understand how everything
>>> works:
>>>
>>> -- 8<  --
>>> public class OLRTest {
>>>
>>>    private static final int FEATURES = 1;
>>>    private static final int CATEGORIES = 2;
>>>
>>>    private static final WordValueEncoder ANIMAL_ENCODER =
>>>            new AdaptiveWordValueEncoder("animal");
>>>
>>>    private static final String[] animals = new String[] {
>>>            "alligator", "ant", "bear", "bee", "bird", "camel", "cat",
>>>            "cheetah", "chicken", "chimpanzee", "cow", "crocodile", "deer",
>>>            "dog", "dolphin", "duck", "eagle", "elephant", "fish", "fly",
>>>            "fox", "frog", "giraffe", "goat", "goldfish", "hamster",
>>>            "hippopotamus", "horse", "kangaroo", "kitten", "lion",
>>>            "lobster", "monkey", "octopus", "owl", "panda", "pig", "puppy",
>>>            "rabbit", "rat", "scorpion", "seal", "shark", "sheep", "snail",
>>>            "snake", "spider", "squirrel", "tiger", "turtle", "wolf",
>>>            "zebra" };
>>>
>>>    public static void main(String[] args) {
>>>        final OnlineLogisticRegression algorithm =
>>>                new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());
>>>
>>>        for (String animal : animals) {
>>>            algorithm.train(0, generateVector(animal));
>>>        }
>>>
>>>        algorithm.close();
>>>
>>>        testClassify(algorithm, "lion");
>>>        testClassify(algorithm, "rabbit");
>>>        testClassify(algorithm, "xyz");
>>>        testClassify(algorithm, "something");
>>>    }
>>>
>>>    private static void testClassify(
>>>            final OnlineLogisticRegression algorithm,
>>>            final String allegedAnimal) {
>>>        System.out.println(allegedAnimal
>>>                + " is an animal with a probability of "
>>>                + algorithm.classifyScalar(generateVector(allegedAnimal))
>>>                        * 100
>>>                + "%");
>>>    }
>>>
>>>    private static Vector generateVector(String animal) {
>>>        final Vector v = new RandomAccessSparseVector(FEATURES);
>>>        ANIMAL_ENCODER.addToVector(animal, v);
>>>        return v;
>>>    }
>>> }
>>> -- 8<  --
>>>
>>> The output of running this sample code is:
>>> -- 8<  --
>>> lion is an animal with a probability of 0.12008121418417145%
>>> rabbit is an animal with a probability of 0.11720244687895641%
>>> xyz is an animal with a probability of 0.04192879358244322%
>>> something is an animal with a probability of 0.04047790610981663%
>>> -- 8<  --
>>>
>>> There were multiple surprising things for me:
>>> * I would have expected the probabilities of "lion" and "rabbit" to be
>>>   close to 100%
>>> * I would have expected the probabilities of "xyz" and "something" to be
>>>   close to 0%
>>> * I would have expected the probability of "lion" to be the same as the
>>>   one for "rabbit"
>>> * I would have expected the probability of "xyz" to be the same as the
>>>   one for "something"
>>>
>>> I know that the animal sample provided is extremely small, but even
>>> training with multiple passes (100, 1000, 10000) changed the
>>> probabilities only marginally.
>>> What am I missing here?
>>>
>>> Thanks very much!
>>> Joscha Feth
>>>
>>
>
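For reference, Hector's indicator-vector suggestion quoted above can be
sketched in plain Java. This is only an illustration under stated
assumptions, not Mahout code: the class IndicatorVectorSketch, its helper
methods, and the four-word vocabulary are all hypothetical, and the
hand-rolled logistic regression (no bias term, plain stochastic gradient
descent) merely stands in for OnlineLogisticRegression. Unlike the quoted
example, the hypothetical training data here contains both labels.

```java
import java.util.Arrays;
import java.util.List;

public class IndicatorVectorSketch {

    // Returns a vector with 1.0 at the word's index in the vocabulary and
    // 0.0 elsewhere. A word outside the vocabulary yields an all-zero vector.
    static double[] indicatorVector(List<String> vocabulary, String word) {
        double[] v = new double[vocabulary.size()];
        int index = vocabulary.indexOf(word);
        if (index >= 0) {
            v[index] = 1.0;
        }
        return v;
    }

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Probability of label 1 under a logistic model without a bias term.
    static double probability(double[] weights, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += weights[i] * x[i];
        }
        return sigmoid(dot);
    }

    // Plain stochastic gradient descent on the logistic loss
    // (an assumption for this sketch, not Mahout's training procedure).
    static double[] train(double[][] xs, int[] labels, int epochs,
            double learningRate) {
        double[] w = new double[xs[0].length];
        for (int e = 0; e < epochs; e++) {
            for (int n = 0; n < xs.length; n++) {
                double error = labels[n] - probability(w, xs[n]);
                for (int i = 0; i < w.length; i++) {
                    w[i] += learningRate * error * xs[n][i];
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Hypothetical data: two animals (label 1), two non-animals (label 0).
        List<String> vocabulary = Arrays.asList("ant", "lion", "table", "car");
        int[] labels = {1, 1, 0, 0};

        double[][] xs = new double[vocabulary.size()][];
        for (int n = 0; n < xs.length; n++) {
            xs[n] = indicatorVector(vocabulary, vocabulary.get(n));
        }

        double[] w = train(xs, labels, 200, 0.5);

        // "xyz" is not in the vocabulary, so it encodes to an all-zero vector.
        for (String word : Arrays.asList("lion", "table", "xyz")) {
            double p = probability(w, indicatorVector(vocabulary, word));
            System.out.println(word + " -> " + p);
        }
    }
}
```

With one-hot inputs each word only ever updates its own weight, so the model
memorizes the training labels much as a hash map would. A word outside the
vocabulary encodes to an all-zero vector and therefore scores sigmoid(0) =
0.5 in this sketch, which is the "no feature to learn" point made in the
thread.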
