Hi, I'm trying to use Document Categorization over a large set of texts and could use some help. I've only briefly looked into MaxEnt, so I'm unsure of the best approach.
For my project, the texts come pre-categorized, but some percentage of them (probably around 20%) are labeled incorrectly. Some could also legitimately fall under multiple categories. I want to correct the category for each text, but only when the model assigns a category with high probability; I've been treating "high" as > .5 for the best category, with no other category > .25.

My initial approach has been to try to bootstrap a training set by:

1) Randomly picking a text and adding it to the current training set.
2) Testing each of the texts in the training set with the new model.
3) Keeping the newest text only if every text in the training set still gets a high probability for the category it was originally given.
4) Repeating.

I've had limited success building up a training set this way, but training the model takes longer and longer as more records are added. Does this seem like a reasonable approach? Will the model still perform well if a few incorrectly categorized texts end up in the training set?

Thanks.
-boston
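
P.S. In case it helps make the loop concrete, here is a rough sketch of what I'm describing. It uses scikit-learn's LogisticRegression (a maximum-entropy classifier) with a TfidfVectorizer as a stand-in for whatever MaxEnt implementation you're using; the helper names, seed size, round count, and the .5/.25 thresholds are just placeholders I picked for illustration.

import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def is_high_confidence(probs, classes, expected):
    """True if `expected` is the best category with p > .5 and no rival > .25."""
    best_idx = probs.argmax()
    if classes[best_idx] != expected or probs[best_idx] <= 0.5:
        return False
    return all(p <= 0.25 for i, p in enumerate(probs) if i != best_idx)


def bootstrap_training_set(candidates, seed_size=20, rounds=200):
    """candidates: list of (text, provided_category) pairs.

    Grows a training set by trial-adding one random text at a time and
    keeping it only if every text already in the set still scores highly
    for the category it was originally given under the retrained model.
    """
    candidates = list(candidates)
    random.shuffle(candidates)
    training = candidates[:seed_size]   # seed must cover at least two categories
    pool = candidates[seed_size:]

    for _ in range(rounds):
        if not pool:
            break
        text, label = pool.pop(random.randrange(len(pool)))
        trial = training + [(text, label)]

        # retrain on the trial set (logistic regression is the MaxEnt stand-in here)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit([t for t, _ in trial], [c for _, c in trial])

        # keep the new text only if every text in the trial set is still
        # confidently assigned its original category
        probs = model.predict_proba([t for t, _ in trial])
        if all(is_high_confidence(p, model.classes_, c)
               for p, (_, c) in zip(probs, trial)):
            training = trial

    return training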
