Re: TrainNewsGroups source code

Ahmed Abdeen Hamed Fri, 06 Apr 2012 05:09:52 -0700

I didn't sent my response to the list so I sending it now. Please let me
know if you have a fix for me.


Thanks very much,
-Ahmed

On Wed, Apr 4, 2012 at 8:44 PM, Ahmed Abdeen Hamed
<[email protected]>wrote:

> Sorry my message wasn't specific enough. In ch14 of the MiA source code.
> there is a TrainNewsGroups.java file. Some parts of this class are
> commented out. However, these parts are actually needed to get the
> classifier to be trained. When uncommenting them back there are
> variables/objects that are used but not declared (e.g., onColon.split(line)).
> I need the complete example as the book describes so I can run it.
>
> The example code is below so you will see what I mean.
>
> Thanks very much,
>
> -Ahmed
>
>
> package mia.classifier.ch14;
>
> import java.io.BufferedReader;
> import java.io.File;
> import java.io.FileReader;
> import java.io.IOException;
> import java.io.Reader;
> import java.io.StringReader;
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.Collection;
> import java.util.Collections;
> import java.util.List;
> import java.util.Map;
> import java.util.Set;
> import java.util.TreeMap;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.util.Version;
> import org.apache.mahout.classifier.sgd.L1;
> import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
> import org.apache.mahout.math.DenseVector;
> import org.apache.mahout.math.RandomAccessSparseVector;
> import org.apache.mahout.math.Vector;
> import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
> import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
> import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
> import org.apache.mahout.vectorizer.encoders.Dictionary;
>
> import com.google.common.collect.ConcurrentHashMultiset;
> import com.google.common.collect.HashMultiset;
> import com.google.common.collect.Iterables;
> import com.google.common.collect.Multiset;
>
> public class TrainNewsGroups {
>     private static final int FEATURES = 10000;
>     private static Multiset<String> overallCounts;
>
>     public static void main(String[] args) {
>         File base = new File(args[0]);
>         overallCounts = HashMultiset.create();
>
>         Map<String, Set<Integer>> traceDictionary = new TreeMap<String,
> Set<Integer>>();
>         FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
>         encoder.setProbes(2);
>         encoder.setTraceDictionary(traceDictionary);
>         FeatureVectorEncoder bias = new ConstantValueEncoder("Intercept");
>         bias.setTraceDictionary(traceDictionary);
>         FeatureVectorEncoder lines = new ConstantValueEncoder("Lines");
>         lines.setTraceDictionary(traceDictionary);
>         Dictionary newsGroups = new Dictionary();
>
>         OnlineLogisticRegression learningAlgorithm =
>             new OnlineLogisticRegression(
>                   20, FEATURES, new L1())
>                 .alpha(1).stepOffset(1000)
>                 .decayExponent(0.9)
>                 .lambda(3.0e-5)
>                 .learningRate(20);
>
>         List<File> files = new ArrayList<File>();
>         for (File newsgroup : base.listFiles()) {
>           newsGroups.intern(newsgroup.getName());
>           files.addAll(Arrays.asList(newsgroup.listFiles()));
>         }
>
>         Collections.shuffle(files);
>         System.out.printf("%d training files\n", files.size());
>
>         double averageLL = 0.0;
>         double averageCorrect = 0.0;
>         double averageLineCount = 0.0;
>         int k = 0;
>         double step = 0.0;
>         int[] bumps = new int[]{1, 2, 5};
>         double lineCount = 0;
>
>         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
>
> /*        for (File file : files) {
>               BufferedReader reader = new BufferedReader(new
> FileReader(file));
>               String ng = file.getParentFile().getName();
>               int actual = newsGroups.intern(ng);
>               Multiset<String> words = ConcurrentHashMultiset.create();
>
>               String line = reader.readLine();
>               while (line != null && line.length() > 0) {
>                 if (line.startsWith("Lines:")) {
>                   String count = Iterables.get(onColon.split(line), 1);
>                   try {
>                     lineCount = Integer.parseInt(count);
>                     averageLineCount += (lineCount - averageLineCount)
>                         / Math.min(k + 1, 1000);
>                   } catch (NumberFormatException e) {
>                     lineCount = averageLineCount;
>                   }
>                 }
>                 boolean countHeader = (
>                     line.startsWith("From:") ||
> line.startsWith("Subject:")||
>                     line.startsWith("Keywords:")||
> line.startsWith("Summary:"));
>                 do {
>                   StringReader in = new StringReader(line);
>                   if (countHeader) {
>                     countWords(analyzer, words, in);
>                   }
>                   line = reader.readLine();
>                 } while (line.startsWith(" "));
>               }
>               countWords(analyzer, words, reader);
>               reader.close();
>             }
>
>         Vector v = new RandomAccessSparseVector(FEATURES);
>         bias.addToVector(null, 1, v);
>         lines.addToVector(null, lineCount / 30, v);
>         logLines.addToVector(null, Math.log(lineCount + 1), v);
>         for (String word : words.elementSet()) {
>           encoder.addToVector(word, Math.log(1 + words.count(word)), v);
>         }
>         */
>
>         /*double mu = Math.min(k + 1, 200);
>         double ll = learningAlgorithm.logLikelihood(actual, v);  #1
>         averageLL = averageLL + (ll - averageLL) / mu;
>
>         Vector p = new DenseVector(20);
>         learningAlgorithm.classifyFull(p, v);
>         int estimated = p.maxValueIndex();
>
>         int correct = (estimated == actual? 1 : 0);
>         averageCorrect = averageCorrect + (correct - averageCorrect) /
> mu;*/
>
> /*        learningAlgorithm.train(actual, v);
>         k++;
>         int bump = bumps[(int) Math.floor(step) % bumps.length];
>         int scale = (int) Math.pow(10, Math.floor(step / bumps.length));
>         if (k % (bump * scale) == 0) {
>           step += 0.25;
>           System.out.printf("%10d %10.3f %10.3f %10.2f %s %s\n",
>                k, ll, averageLL, averageCorrect * 100, ng,
>                newsGroups.values().get(estimated));
>         }
>         learningAlgorithm.close();*/
>
>     }
>
>     private static void countWords(Analyzer analyzer, Collection<String>
> words, Reader in) throws IOException {
>         TokenStream ts = analyzer.tokenStream("text", in);
>         ts.addAttribute(CharTermAttribute.class);
>         while (ts.incrementToken()) {
>           String s = ts.getAttribute(CharTermAttribute.class).toString();
>           words.add(s);
>         }
>         /*overallCounts.addAll(words);*/
>
>       }
> }
>
>
>
> On Wed, Apr 4, 2012 at 5:38 PM, Ted Dunning <[email protected]> wrote:
>
>> I am sorry, but I don't understand the question.
>>
>> All of the code in Mahout compiles.  This is verified several times a day
>> by the continuous integration testing.
>>
>> Can you say more specifically what you mean?  Line 95 of what?
>>
>>
>> On Wed, Apr 4, 2012 at 12:18 PM, Ahmed Abdeen Hamed <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> The source code for the TrainNewsGroups classification example has some
>>> issues. There are uncommented regions that they are still covered in the
>>>  MiA book. However, those regions can't compile. For instance, line 95
>>> onColon.split(line) doesn't declare onColon object before it was used. Is
>>> there an updated version of this class that I can use instead?
>>>
>>> I would appreciate any help.
>>>
>>> Thanks,
>>> -Ahmed
>>>
>>
>>
>

Re: TrainNewsGroups source code

Reply via email to