Look at the Mahout project and the "Mahout In Action" and "Taming Text" books. These have a lot to say about categorizing documents well.

http://mahout.apache.org
http://www.manning.com/owen
http://www.manning.com/ingersoll/
----- Original Message -----
| From: "Jonathan Boston" <[email protected]>
| To: [email protected]
| Sent: Thursday, November 29, 2012 2:44:13 PM
| Subject: Best way to use the Document Categorization
|
| Hi,
|
| I'm trying to use the Document Categorization over a large set of
| text and could use some help. I've just briefly looked into MaxEnt
| so I'm unsure of the best approach.
|
| For my project, the texts are categorized, but some percentage
| (probably around 20%) of them are incorrect. Some of them could also
| legitimately fall under multiple categories. I want to correct the
| category for each text, but only if it has a high probability of
| being categorized correctly. I've considered "high" to be > .5 with
| no other categories being > .25.
|
| My initial approach has been to try and bootstrap a training set by:
|
| 1) Randomly picking a text adding it to the current training set
| 2) Testing each of the texts in the training set with the new model
| 3) Only keeping the newest text if each of the training texts have a
| high probability of matching the initial category that was provided.
| 4) Repeat
|
| I've had limited success building up a training set, but it takes a
| while to train the model as more records are added to the training
| set.
|
| Does this seem like a reasonable approach? Will the model perform
| well if there are a few incorrectly categorized texts in the
| training set?
|
| Thanks.
|
| -boston
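
For what it's worth, the "high probability" rule from the quoted message (top category > .5, no other category > .25) is easy to pin down in code. Below is a minimal sketch in Python, not tied to any particular library: it assumes you already have a per-category probability map for each text, however your categorizer exposes it. The names confident_category and corrected_label are made up for illustration.

```python
from typing import Dict, Optional


def confident_category(probs: Dict[str, float],
                       top_min: float = 0.5,
                       other_max: float = 0.25) -> Optional[str]:
    """Return the top category if it passes the confidence rule, else None.

    Rule (taken from the quoted message): the best category must score
    above top_min and no other category may score above other_max.
    """
    if not probs:
        return None
    best = max(probs, key=probs.get)
    if probs[best] <= top_min:
        return None
    # Reject if any competing category is still too plausible.
    if any(p > other_max for cat, p in probs.items() if cat != best):
        return None
    return best


def corrected_label(current: str, probs: Dict[str, float]) -> str:
    """Keep the current label unless the model confidently picks one."""
    picked = confident_category(probs)
    return picked if picked is not None else current
```

With this, a text labelled "b" whose scores are {"a": 0.7, "b": 0.2, "c": 0.1} would be relabelled "a", while one scoring {"a": 0.45, "b": 0.35, "c": 0.2} keeps its original label. Texts that legitimately belong to several categories will tend to fail the second check, so they are left alone rather than forced into one category.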
