I'm trying to build a basic two-category classifier on textual data. I'm
working with a training set of only about 100,000 documents, and am using
an AdaptiveLogisticRegression with default settings.
When I build the trainer it reports:
% correct: 0.9996315789473774
AUC: 0.75
log likelihood: -0.032966543010819874
Which seems pretty good.
When I then classify the *training data*, everything lands in the first
category, when in fact the documents are split evenly between the two.
Creation of vectors looks like:

    FeatureVectorEncoder content_encoder = new AdaptiveWordValueEncoder("content");
    content_encoder.setProbes(2);

    FeatureVectorEncoder type_encoder = new StaticWordValueEncoder("type");
    type_encoder.setProbes(2);

    Vector v = new RandomAccessSparseVector(100);
    type_encoder.addToVector(type, v);
    for (String word : data.getWords()) {
        content_encoder.addToVector(word, v);
    }
    return new NamedVector(v, label);
where data.getWords() is the massaged content of the various documents
(tidied, characters extracted, then run through Lucene's StandardAnalyzer
and a lower-case filter).
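To make the post self-contained, here is a simplified stand-in for what that massaging step produces. This is only an illustration using the plain JDK (the real code goes through Lucene's StandardAnalyzer and a lower-case filter, which also handle stop characters and edge cases this sketch ignores):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SimpleTokenizer {
    // Rough approximation of the analyzer chain: keep runs of letters,
    // lower-case them, and drop everything else.
    public static List<String> getWords(String content) {
        List<String> words = new ArrayList<String>();
        for (String token : content.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(getWords("Hello, World! E-mail me."));
    }
}
```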
Training looks like:

    Configuration hconf = new Configuration();
    FileSystem fs = FileSystem.get(path, hconf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), hconf);

    LongWritable key = new LongWritable();
    VectorWritable value = new VectorWritable();

    AdaptiveLogisticRegression reg = new AdaptiveLogisticRegression(2, 100, new L1());
    while (reader.next(key, value)) {
        NamedVector v = (NamedVector) value.get();
        System.out.println(v.getName());
        reg.train("spam".equals(v.getName()) ? 1 : 0, v);
    }
    reader.close();
    reg.close();

    CrossFoldLearner best = reg.getBest().getPayload().getLearner();
    System.out.println(best.percentCorrect());
    System.out.println(best.auc());
    System.out.println(best.getLogLikelihood());

    ModelSerializer.writeBinary(model.getPath(), reg.getBest().getPayload().getLearner());
And running through the test data looks like:
    InputStream in = new FileInputStream(model);
    CrossFoldLearner best = ModelSerializer.readBinary(in, CrossFoldLearner.class);
    in.close();

    Configuration hconf = new Configuration();
    FileSystem fs = FileSystem.get(path, hconf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), hconf);

    LongWritable key = new LongWritable();
    VectorWritable value = new VectorWritable();

    int correct = 0;
    int total = 0;
    while (reader.next(key, value)) {
        total++;
        NamedVector v = (NamedVector) value.get();
        int expected = "spam".equals(v.getName()) ? 1 : 0;
        Vector p = new DenseVector(2);
        best.classifyFull(p, v);
        int cat = p.maxValueIndex();
        System.out.println(cat == 1 ? "SPAM" : "HAM");
        if (cat == expected) { correct++; }
    }
    reader.close();
    best.close();

    System.out.println((double) correct / total);
Can anyone help me figure out what I am doing wrong?
Also, I'd love to try naive Bayes or complementary naive Bayes, but I've
been unable to find any documentation on how to do so. :-(
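The closest I've pieced together, from the driver names in the Mahout distribution, is the command-line workflow below. This is only my guess, the paths are placeholders, and the flags may well be wrong for the current version, so please correct me:

```
# Convert a directory of raw text into sequence files, then into TF-IDF vectors
mahout seqdirectory -i raw/ -o seqs/
mahout seq2sparse -i seqs/ -o vectors/

# Train naive Bayes on the vectors (I believe -c switches to complementary NB)
mahout trainnb -i vectors/tfidf-vectors -o model/ -li labelindex -ow

# Evaluate the model against labeled vectors
mahout testnb -i vectors/tfidf-vectors -m model/ -l labelindex -o results/ -ow
```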