Hi, I'm trying to use Document Categorization over a large set of texts and could use some help. I've only briefly looked into MaxEnt, so I'm unsure of the best approach.
For my project, the texts come pre-categorized, but some percentage of them (probably around 20%) are labeled incorrectly. Some could also legitimately fall under multiple categories. I want to correct the category for each text, but only when the model assigns a category with high probability; I've been treating "high" as > .5 for the best category, with no other category > .25.

My initial approach has been to try to bootstrap a training set by:

1) Randomly picking a text and adding it to the current training set.
2) Testing each of the texts in the training set with the new model.
3) Keeping the newest text only if every text in the training set still gets a high probability for the category it was originally given.
4) Repeating.

I've had limited success building up a training set this way, but training the model takes longer and longer as more records are added. Does this seem like a reasonable approach? Will the model still perform well if a few incorrectly categorized texts end up in the training set?

Thanks.
-boston
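
P.S. In case it helps make the loop concrete, here is a rough sketch of what I'm describing. It uses scikit-learn's LogisticRegression (a maximum-entropy classifier) with a TfidfVectorizer as a stand-in for whatever MaxEnt implementation you're using; the helper names, seed size, round count, and the .5/.25 thresholds are just placeholders I picked for illustration.

import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def is_high_confidence(probs, classes, expected):
    """True if `expected` is the best category with p > .5 and no rival > .25."""
    best_idx = probs.argmax()
    if classes[best_idx] != expected or probs[best_idx] <= 0.5:
        return False
    return all(p <= 0.25 for i, p in enumerate(probs) if i != best_idx)


def bootstrap_training_set(candidates, seed_size=20, rounds=200):
    """candidates: list of (text, provided_category) pairs.

    Grows a training set by trial-adding one random text at a time and
    keeping it only if every text already in the set still scores highly
    for the category it was originally given under the retrained model.
    """
    candidates = list(candidates)
    random.shuffle(candidates)
    training = candidates[:seed_size]   # seed must cover at least two categories
    pool = candidates[seed_size:]

    for _ in range(rounds):
        if not pool:
            break
        text, label = pool.pop(random.randrange(len(pool)))
        trial = training + [(text, label)]

        # retrain on the trial set (logistic regression is the MaxEnt stand-in here)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit([t for t, _ in trial], [c for _, c in trial])

        # keep the new text only if every text in the trial set is still
        # confidently assigned its original category
        probs = model.predict_proba([t for t, _ in trial])
        if all(is_high_confidence(p, model.classes_, c)
               for p, (_, c) in zip(probs, trial)):
            training = trial

    return training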
