Hello Sebastian,

Thanks for the hint. I already got the MEAP edition of the ebook through Manning, but I am struggling to translate the newsgroup and Wikipedia examples to my use case. In particular, I can't find any code examples that help me generate my model without using the Mahout command-line options.

Kind regards,
Joscha Feth

On Fri, Jun 10, 2011 at 22:23, Sebastian Schelter <[email protected]> wrote:

> Hi Joscha,
>
> If you have some money left, I'd recommend getting a copy of Mahout in
> Action, which features a very readable, detailed introduction to
> classification with Mahout, including strategies for feature selection.
>
> --sebastian
>
>
> On 10.06.2011 17:28, Hector Yee wrote:
>
>> Oh, you have a very strange feature: you are using the label as a feature.
>> My bad, I thought the words were the labels.
>> Usually it's something like weight, height, something meaningful. If it's
>> just the label like you have, you might as well use a hash map; there is no
>> feature to learn! But if you want, try making it an indicator vector. Set
>> features to the number of animals, and for the vector set it to 1 at the
>> index of the animal in the array, 0 otherwise. E.g. for ant the feature is
>> 0, 1, 0, 0, 0, 0, 0, ...
>>
>> Sent from my iPad
>>
>> On Jun 10, 2011, at 12:54 AM, Joscha Feth<[email protected]> wrote:
>>
>>> Hello fellow Mahouts,
>>>
>>> I am trying to grasp Mahout and wrote a very simple (but obviously
>>> wrong) example which I hoped would help me understand how everything
>>> works:
>>>
>>> -- 8< --
>>> public class OLRTest {
>>>
>>>   private static final int FEATURES = 1;
>>>   private static final int CATEGORIES = 2;
>>>
>>>   private static final WordValueEncoder ANIMAL_ENCODER =
>>>       new AdaptiveWordValueEncoder("animal");
>>>
>>>   private static final String[] animals = new String[] {
>>>       "alligator", "ant", "bear", "bee", "bird", "camel", "cat",
>>>       "cheetah", "chicken", "chimpanzee", "cow", "crocodile", "deer",
>>>       "dog", "dolphin", "duck", "eagle", "elephant", "fish", "fly",
>>>       "fox", "frog", "giraffe", "goat", "goldfish", "hamster",
>>>       "hippopotamus", "horse", "kangaroo", "kitten", "lion", "lobster",
>>>       "monkey", "octopus", "owl", "panda", "pig", "puppy", "rabbit",
>>>       "rat", "scorpion", "seal", "shark", "sheep", "snail", "snake",
>>>       "spider", "squirrel", "tiger", "turtle", "wolf", "zebra" };
>>>
>>>   public static void main(String[] args) {
>>>     final OnlineLogisticRegression algorithm =
>>>         new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());
>>>
>>>     for (String animal : animals) {
>>>       algorithm.train(0, generateVector(animal));
>>>     }
>>>
>>>     algorithm.close();
>>>
>>>     testClassify(algorithm, "lion");
>>>     testClassify(algorithm, "rabbit");
>>>     testClassify(algorithm, "xyz");
>>>     testClassify(algorithm, "something");
>>>   }
>>>
>>>   private static void testClassify(
>>>       final OnlineLogisticRegression algorithm,
>>>       final String allegedAnimal) {
>>>     System.out.println(allegedAnimal
>>>         + " is an animal with a probability of "
>>>         + algorithm.classifyScalar(generateVector(allegedAnimal)) * 100
>>>         + "%");
>>>   }
>>>
>>>   private static Vector generateVector(String animal) {
>>>     final Vector v = new RandomAccessSparseVector(FEATURES);
>>>     ANIMAL_ENCODER.addToVector(animal, v);
>>>     return v;
>>>   }
>>> }
>>> -- 8< --
>>>
>>> The output of running this sample code is:
>>> -- 8< --
>>> lion is an animal with a probability of 0.12008121418417145%
>>> rabbit is an animal with a probability of 0.11720244687895641%
>>> xyz is an animal with a probability of 0.04192879358244322%
>>> something is an animal with a probability of 0.04047790610981663%
>>> -- 8< --
>>>
>>> Several things surprised me:
>>> * I would have expected the probability of "lion" and "rabbit" to be
>>>   close to 100%
>>> * I would have expected the probability of "xyz" and "something" to be
>>>   close to 0%
>>> * I would have expected the probability of "lion" to be the same as the
>>>   one for "rabbit"
>>> * I would have expected the probability of "xyz" to be the same as the
>>>   one for "something"
>>>
>>> I know that the animals sample provided is extremely small, but even
>>> training with multiple passes (100, 1000, 10000) changed the
>>> probabilities only marginally.
>>>
>>> What am I missing here?
>>>
>>> Thanks very much!
>>> Joscha Feth
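
For reference, the indicator vector Hector describes in the quoted message could be sketched in plain Java as below. This is only an illustration under stated assumptions: it deliberately uses no Mahout classes, and the class name IndicatorVectorSketch, the method indicatorVector, and the shortened five-word list are all made up for the example. With Mahout, the same values would presumably be set on a Vector whose cardinality equals the number of animals.

```java
import java.util.Arrays;
import java.util.List;

public class IndicatorVectorSketch {

    // Hypothetical shortened vocabulary; the original message uses the full animal list.
    static final List<String> ANIMALS =
            Arrays.asList("alligator", "ant", "bear", "bee", "bird");

    // Returns a vector with 1.0 at the index of the word in ANIMALS and 0.0 elsewhere.
    // A word that is not in the list yields an all-zero vector.
    static double[] indicatorVector(String word) {
        double[] v = new double[ANIMALS.size()];
        int index = ANIMALS.indexOf(word);
        if (index >= 0) {
            v[index] = 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        // "ant" sits at index 1 of the list.
        System.out.println(Arrays.toString(indicatorVector("ant")));
        // An unknown word sets no position at all.
        System.out.println(Arrays.toString(indicatorVector("xyz")));
    }
}
```

As Hector points out, such a vector only encodes list membership, so a plain lookup would do the same job; the sketch is meant to show the encoding, not to suggest it is a meaningful feature.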
