Hi,
I'm currently working on a text classification problem.
As my training datasets are rather small (dozens of entries), I'm
looking for a way to use a glossary in addition to the training
phase, to tune the model.
I know that text entries containing certain keywords have a good chance
of belonging to a specific category. With SGD, I've tried creating a new
vector from these keywords and adding it to the current category's
vector during training.
It works pretty well if I put a big weight on it (700).
My "hack" code looks like this:

    List<String> keywords = ... // keywords for the current category
    for (String keyword : keywords) {
        predictorEncoders.get(99).addToVector(keyword, 700, featureVector);
    }
Predictor 99 is a new predictor I've created just for these keywords:

    FeatureVectorEncoder keyWordEncoder =
        TYPE_DICTIONARY.get("text").getConstructor(String.class).newInstance("keywords");
    predictorEncoders.put(99, keyWordEncoder);
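
For anyone who wants to try the idea without Mahout, here is a minimal sketch in plain Java of what the keyword boost roughly amounts to. It assumes a very simple hashing scheme; the names KeywordBoostSketch, indexFor, encode and the vector size are made up for illustration and are not Mahout's API or its actual hashing.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a toy hashed feature vector, not Mahout's encoders.
class KeywordBoostSketch {
    // Hypothetical vector size, chosen arbitrarily for the example.
    static final int NUM_FEATURES = 1000;

    // Map an (encoder name, term) pair to a slot in the vector.
    // Prefixing with the encoder name keeps "keywords" slots apart
    // from "text" slots, barring hash collisions.
    static int indexFor(String encoderName, String term) {
        return Math.floorMod((encoderName + ":" + term).hashCode(), NUM_FEATURES);
    }

    // Add the given weight to the slot for this term.
    static void addToVector(String encoderName, String term, double weight, double[] vector) {
        vector[indexFor(encoderName, term)] += weight;
    }

    // Encode the text tokens with weight 1 and the glossary keywords
    // with a much larger weight, as in the hack described above.
    static double[] encode(List<String> tokens, List<String> keywords, double keywordWeight) {
        double[] vector = new double[NUM_FEATURES];
        for (String token : tokens) {
            addToVector("text", token, 1.0, vector);
        }
        for (String keyword : keywords) {
            addToVector("keywords", keyword, keywordWeight, vector);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("invoice", "overdue", "payment");
        List<String> keywords = Arrays.asList("invoice"); // glossary terms for the category
        double[] vector = encode(tokens, keywords, 700.0);
        int slot = indexFor("keywords", "invoice");
        System.out.println("keyword slot " + slot + " holds " + vector[slot]);
    }
}
```

With a weight of 700 against 1 for ordinary tokens, the keyword slot dominates the vector, which is probably why the classifier follows it so strongly.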
It works pretty well: my confusion matrix is better with this hack. But
it may not be optimal, because this attribute does not exist in the
train/test data.
Has anyone tried this kind of thing? Do you have any advice? Or is
it just a bad idea?
Thanks!
Loic