You win the prize.  Order is very important in stochastic gradient descent.

Randomizing once should be fine.
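
For example, a rough sketch (not in the original reply) using the reader
and reg variables from the training code quoted below; 100,000 vectors
should fit comfortably in memory:

    List<NamedVector> examples = new ArrayList<NamedVector>();
    while (reader.next(key, value)) {
        examples.add((NamedVector) value.get());
    }
    Collections.shuffle(examples, new Random(42)); // randomize the order once
    for (NamedVector v : examples) {
        reg.train("spam".equals(v.getName()) ? 1 : 0, v);
    }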

It should also be fine to do a random merge of the two classes.  Or an
alternating join.
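
For the alternating join, something like this (again just a sketch,
assuming the examples have already been split into two lists, spam and
ham):

    // Interleave the two classes so the learner never sees a long
    // single-class run.
    List<NamedVector> merged = new ArrayList<NamedVector>();
    for (int i = 0; i < Math.max(spam.size(), ham.size()); i++) {
        if (i < spam.size()) merged.add(spam.get(i));
        if (i < ham.size()) merged.add(ham.get(i));
    }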

On Thu, Feb 14, 2013 at 1:33 PM, Brian McCallister <[email protected]> wrote:

> So to answer my own question, the order of training matters. I had been
> doing all category 1 then all category 0. Apparently this breaks things
> badly.
>
>
> On Wed, Feb 13, 2013 at 4:29 PM, Brian McCallister <[email protected]>
> wrote:
>
> > I'm trying to build a basic two-category classifier on textual data. I am
> > working with a training set of only about 100,000 documents, and am using
> > an AdaptiveLogisticRegression with default settings.
> >
> > When I build the trainer it reports:
> >
> >
> > % correct:      0.9996315789473774
> > AUC:            0.75
> > log likelihood: -0.032966543010819874
> >
> > Which seems pretty good.
> >
> > When I then classify the *training data* everything lands in the first
> > category, when in fact they are split down the middle.
> >
> > Creation of vectors looks like:
> >
> >         FeatureVectorEncoder content_encoder = new AdaptiveWordValueEncoder("content");
> >         content_encoder.setProbes(2);
> >
> >         FeatureVectorEncoder type_encoder = new StaticWordValueEncoder("type");
> >         type_encoder.setProbes(2);
> >
> >         Vector v = new RandomAccessSparseVector(100);
> >         type_encoder.addToVector(type, v);
> >
> >         for (String word : data.getWords()) {
> >             content_encoder.addToVector(word, v);
> >         }
> >         return new NamedVector(v, label);
> >
> > where data.getWords() is the massaged content of the various documents
> > (tidied, characters extracted, then run through the Lucene standard
> > analyzer and lower-case filter).
> >
> > Training looks like:
> >
> >             Configuration hconf = new Configuration();
> >             FileSystem fs = FileSystem.get(path, hconf);
> >
> >             SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), hconf);
> >             LongWritable key = new LongWritable();
> >             VectorWritable value = new VectorWritable();
> >             AdaptiveLogisticRegression reg = new AdaptiveLogisticRegression(2, 100, new L1());
> >
> >             while (reader.next(key, value)) {
> >                 NamedVector v = (NamedVector) value.get();
> >                 System.out.println(v.getName());
> >                 reg.train("spam".equals(v.getName()) ? 1 : 0, v);
> >             }
> >             reader.close();
> >             reg.close();
> >             CrossFoldLearner best = reg.getBest().getPayload().getLearner();
> >             System.out.println(best.percentCorrect());
> >             System.out.println(best.auc());
> >             System.out.println(best.getLogLikelihood());
> >
> >             ModelSerializer.writeBinary(model.getPath(), reg.getBest().getPayload().getLearner());
> >
> >
> > And running through the test data looks like:
> >
> >             InputStream in = new FileInputStream(model);
> >             CrossFoldLearner best = ModelSerializer.readBinary(in, CrossFoldLearner.class);
> >             in.close();
> >
> >             Configuration hconf = new Configuration();
> >             FileSystem fs = FileSystem.get(path, hconf);
> >
> >             SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), hconf);
> >             LongWritable key = new LongWritable();
> >             VectorWritable value = new VectorWritable();
> >
> >             int correct = 0;
> >             int total = 0;
> >             while (reader.next(key, value)) {
> >                 total++;
> >                 NamedVector v = (NamedVector) value.get();
> >                 int expected = "spam".equals(v.getName()) ? 1 : 0;
> >                 Vector p = new DenseVector(2);
> >                 best.classifyFull(p, v);
> >                 int cat = p.maxValueIndex();
> >                 System.out.println(cat == 1 ? "SPAM" : "HAM");
> >                 if (cat == expected) { correct++;}
> >             }
> >             reader.close();
> >             best.close();
> >
> >             double cd = correct;
> >             double td = total;
> >
> >             System.out.println(cd / td);
> >
> > Can anyone help me figure out what I am doing wrong?
> >
> > Also, I'd love to try naive bayes or complementary naive bayes, but I am
> > unable to find any documentation on how to do so :-(
> >
>
