You win the prize. Order is very important in stochastic gradient descent.
Randomizing once should be fine. It should also be fine to do a random merge
of the two classes, or an alternating join. (Rough sketch of the
randomize-once fix below the quoted thread.)

On Thu, Feb 14, 2013 at 1:33 PM, Brian McCallister <[email protected]> wrote:

> So to answer my own question, the order of training matters. I had been
> doing all category 1 then all category 0. Apparently this breaks things
> badly.
>
> On Wed, Feb 13, 2013 at 4:29 PM, Brian McCallister <[email protected]>
> wrote:
>
> > I'm trying to do a basic two-category classifier on textual data. I am
> > working with a training set of only about 100,000 documents, and am
> > using an AdaptiveLogisticRegression with default settings.
> >
> > When I build the trainer it reports:
> >
> >     % correct:      0.9996315789473774
> >     AUC:            0.75
> >     log likelihood: -0.032966543010819874
> >
> > Which seems pretty good.
> >
> > When I then classify the *training data*, everything lands in the first
> > category, when in fact they are split down the middle.
> >
> > Creation of vectors looks like:
> >
> >     FeatureVectorEncoder content_encoder =
> >         new AdaptiveWordValueEncoder("content");
> >     content_encoder.setProbes(2);
> >
> >     FeatureVectorEncoder type_encoder =
> >         new StaticWordValueEncoder("type");
> >     type_encoder.setProbes(2);
> >
> >     Vector v = new RandomAccessSparseVector(100);
> >     type_encoder.addToVector(type, v);
> >
> >     for (String word : data.getWords()) {
> >         content_encoder.addToVector(word, v);
> >     }
> >     return new NamedVector(v, label);
> >
> > where data.getWords() is the massaged content of various documents
> > (tidied, characters extracted, then run through the Lucene standard
> > analyzer and lower-case filter).
> >
> > Training looks like:
> >
> >     Configuration hconf = new Configuration();
> >     FileSystem fs = FileSystem.get(path, hconf);
> >
> >     SequenceFile.Reader reader =
> >         new SequenceFile.Reader(fs, new Path(path), hconf);
> >     LongWritable key = new LongWritable();
> >     VectorWritable value = new VectorWritable();
> >     AdaptiveLogisticRegression reg =
> >         new AdaptiveLogisticRegression(2, 100, new L1());
> >
> >     while (reader.next(key, value)) {
> >         NamedVector v = (NamedVector) value.get();
> >         System.out.println(v.getName());
> >         reg.train("spam".equals(v.getName()) ? 1 : 0, v);
> >     }
> >     reader.close();
> >     reg.close();
> >
> >     CrossFoldLearner best = reg.getBest().getPayload().getLearner();
> >     System.out.println(best.percentCorrect());
> >     System.out.println(best.auc());
> >     System.out.println(best.getLogLikelihood());
> >
> >     ModelSerializer.writeBinary(model.getPath(),
> >                                 reg.getBest().getPayload().getLearner());
> >
> > And running through the test data looks like:
> >
> >     InputStream in = new FileInputStream(model);
> >     CrossFoldLearner best =
> >         ModelSerializer.readBinary(in, CrossFoldLearner.class);
> >     in.close();
> >
> >     Configuration hconf = new Configuration();
> >     FileSystem fs = FileSystem.get(path, hconf);
> >
> >     SequenceFile.Reader reader =
> >         new SequenceFile.Reader(fs, new Path(path), hconf);
> >     LongWritable key = new LongWritable();
> >     VectorWritable value = new VectorWritable();
> >
> >     int correct = 0;
> >     int total = 0;
> >     while (reader.next(key, value)) {
> >         total++;
> >         NamedVector v = (NamedVector) value.get();
> >         int expected = "spam".equals(v.getName()) ? 1 : 0;
> >         Vector p = new DenseVector(2);
> >         best.classifyFull(p, v);
> >         int cat = p.maxValueIndex();
> >         System.out.println(cat == 1 ? "SPAM" : "HAM");
> >         if (cat == expected) { correct++; }
> >     }
> >     reader.close();
> >     best.close();
> >
> >     double cd = correct;
> >     double td = total;
> >     System.out.println(cd / td);
> >
> > Can anyone help me figure out what I am doing wrong?
> >
> > Also, I'd love to try naive bayes or complementary naive bayes, but I am
> > unable to find any documentation on how to do so :-(

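On the naive bayes question at the end: I haven't seen much written
documentation either. The closest thing is the classify-20newsgroups.sh
script under examples/bin in the Mahout distribution, which drives the
trainnb/testnb jobs. A rough Java sketch along those lines follows; flags
are from memory of that script, the paths are hypothetical, and note that
trainnb expects a SequenceFile<Text, VectorWritable> whose keys carry the
label (e.g. /spam/doc123), not the LongWritable/NamedVector layout above,
so the vectors would need re-keying first.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.classifier.AbstractVectorClassifier;
    import org.apache.mahout.classifier.naivebayes.ComplementaryNaiveBayesClassifier;
    import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
    import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class NaiveBayesSketch {
      public static void main(String[] args) throws Exception {
        // Train: roughly the flags classify-20newsgroups.sh passes; -el
        // extracts the label from each key, -c = complementary naive bayes.
        ToolRunner.run(new TrainNaiveBayesJob(), new String[] {
            "-i", "train-vectors", "-o", "nb-model",
            "-li", "labelindex", "-el", "-ow", "-c"});

        // Classify: materialize the trained model and score a vector;
        // maxValueIndex() picks the winning label.
        NaiveBayesModel model =
            NaiveBayesModel.materialize(new Path("nb-model"), new Configuration());
        AbstractVectorClassifier nb = new ComplementaryNaiveBayesClassifier(model);
        Vector doc = new RandomAccessSparseVector(10000);  // your encoded document here
        Vector scores = nb.classifyFull(doc);
        System.out.println(scores.maxValueIndex());
      }
    }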