Hi,

I'm currently working on a text classification problem.
As my training datasets are rather small (dozens of entries), I'm looking for a way to use a glossary in addition to the training data, to tune the model.

I know that text entries containing certain keywords have a good chance of belonging to a specific category. With SGD, I've tried to create a new Vector from these keywords and add it to the current category's feature vector during training.

It works pretty well if I put a big weight on it (700).

My "hack" code looks like this:


    List<String> keywords = ... // keywords from the glossary for the current category

    for (String keyword : keywords) {
        predictorEncoders.get(99).addToVector(keyword, 700, featureVector);
    }

Predictor 99 is a new predictor I've created just for these keywords:

    // dedicated encoder for the glossary keywords
    FeatureVectorEncoder keyWordEncoder =
        TYPE_DICTIONARY.get("text").getConstructor(String.class).newInstance("keywords");
    predictorEncoders.put(99, keyWordEncoder);


It works pretty well, and my confusion matrix is better with this hack, but maybe it's not optimal because this attribute does not exist in the train/test data.
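To keep training and classification consistent, I'm thinking about moving the keyword boost out of CsvRecordFactory and into the encoding step itself, something like the stripped-down sketch below (encode(), the contains() check, CARDINALITY and the class name are placeholders, not my real code; 700 is just the weight that worked for me):

    import java.util.List;

    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

    public class KeywordBoostSketch {

        private static final int CARDINALITY = 10000;     // placeholder feature-vector size
        private static final double KEYWORD_WEIGHT = 700;  // the weight that worked for me

        private final FeatureVectorEncoder textEncoder = new TextValueEncoder("text");
        private final FeatureVectorEncoder keywordEncoder = new TextValueEncoder("keywords");

        // Same encoding used for training and classification,
        // so the keyword features exist on both sides.
        Vector encode(String text, List<String> glossary) {
            Vector v = new RandomAccessSparseVector(CARDINALITY);
            textEncoder.addToVector(text, v);
            for (String keyword : glossary) {
                if (text.toLowerCase().contains(keyword.toLowerCase())) {
                    keywordEncoder.addToVector(keyword, KEYWORD_WEIGHT, v);
                }
            }
            return v;
        }

        // Simplified SGD training step using the shared encoding.
        void train(OnlineLogisticRegression learner, int category,
                   String text, List<String> glossary) {
            learner.train(category, encode(text, glossary));
        }
    }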

Has anyone experienced this kind of thing? Do you have any advice? Or is it just a bad idea?

Thanks!

Loic
