Hello,

I am interested in using SpamBayes as the core classifier for a system I want to write that classifies document instances into categories. Instances might be formatted as Word, PDF, plain text, or HTML. Of course I don't expect SpamBayes to know how to read all these formats, so for the sake of discussion let's just say it can process the text of any document I throw at it.

Suppose I have five categories, with many document instances (training examples) from each:

Doc Category #1 - streaming media protocols
Doc Category #2 - media format conversion tools
Doc Category #3 - DirectShow
Doc Category #4 - media content management systems
Doc Category #5 - none of the above

In my proposed system I drag a document instance into a watch folder, which causes a text classifier to open it, analyze it, and "tag" it somehow to indicate which of the five categories it belongs to (say, by moving it into one of five directories).
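To make the workflow concrete, here is a rough sketch of the driver loop I'm imagining. All the names here are mine, and extract_text and classify_text are placeholders for the format conversion and the SpamBayes-based classifier I'm asking about:

    import os
    import shutil
    import time

    WATCH_DIR = "incoming"
    CATEGORIES = ["streaming-protocols", "format-conversion", "directshow",
                  "content-management", "none-of-the-above"]

    def extract_text(path):
        """Placeholder: run the appropriate Word/PDF/HTML-to-text converter."""
        with open(path, "rb") as f:
            return f.read().decode("latin-1")

    def classify_text(text):
        """Placeholder: the SpamBayes-based classifier discussed below."""
        return "none-of-the-above"

    def watch():
        # One destination directory per category.
        for name in CATEGORIES:
            os.makedirs(name, exist_ok=True)
        while True:
            for entry in os.listdir(WATCH_DIR):
                src = os.path.join(WATCH_DIR, entry)
                category = classify_text(extract_text(src))
                shutil.move(src, os.path.join(category, entry))
            time.sleep(5)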

Here are my five concerns.

1. Embedding the SpamBayes code into my app.

My first concern is whether the SpamBayes training and classifier code is structured so that it can be embedded in this kind of tool. I'm pretty comfortable with Python, but rewriting major pieces of SpamBayes for this app would be neither fun nor feasible.
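From a quick read of classifier.py, the core Classifier class looks usable standalone: learn() takes any iterable of token strings plus an is_spam flag, and spamprob() scores a token stream. Assuming I've read that right, embedding could be as simple as the following sketch (tokens() is my own naive stand-in for the real, email-oriented tokenizer):

    from spambayes.classifier import Classifier

    def tokens(text):
        # Naive stand-in for spambayes.tokenizer, which is email-oriented.
        return text.lower().split()

    bayes = Classifier()
    bayes.learn(tokens("rtsp rtp multicast streaming"), False)  # "ham" = in-category
    bayes.learn(tokens("ffmpeg batch transcoding gui"), True)   # "spam" = out-of-category
    print(bayes.spamprob(tokens("a document about rtsp streaming")))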
 
2. SpamBayes for Non-Email Classification.
 
Does it even make sense to start with SpamBayes, given that my problem domain has nothing like email headers, the presence of an attachment, etc., which SpamBayes probably uses in its core feature extraction?
 
3. Discriminative Training

My next concern relates to the lack of discriminative training between categories. As I understand it, training SpamBayes on a particular class, say class 1, builds a model that makes a single discrimination: is this document an instance of class 1 or not? When I train the model for class 1, do I include only positive instances of category 1 (the ham)? Or do I also include negative instances from the other categories (the spam)?
 
If the model for category 1 is trained only on positive instances from that category, then it is independent of the models for categories 2 through 5, and when it comes time to classify, the model that responds "loudest" is the one selected. But, and here's my concern, no probability model has ever been built that discriminates between the categories. Does what I'm describing make sense? I'm thinking of maximum-likelihood training of acoustic models in a speech recognition system, which has the same lack of discriminative training, and I'm wondering whether multi-class naive-Bayes classifiers share that shortcoming.
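To pin down what I mean, here is the one-vs-rest arrangement I'm picturing, with negatives included; all of the scaffolding names are mine, not SpamBayes':

    from spambayes.classifier import Classifier

    def tokens(text):
        return text.lower().split()

    def train_one_vs_rest(docs_by_category):
        """docs_by_category maps a category name to a list of document texts.
        Each category gets a binary model: its own docs are ham, every other
        category's docs are spam."""
        models = {}
        for name in docs_by_category:
            model = Classifier()
            for other, docs in docs_by_category.items():
                for text in docs:
                    model.learn(tokens(text), other != name)
            models[name] = model
        return models

    def classify(models, text):
        # The lowest spamprob is the "loudest" in-category response.
        return min(models, key=lambda name: models[name].spamprob(tokens(text)))

Even with the negatives included, each model's probability is calibrated only against its own ham/spam split, so comparing scores across models is exactly the step I'm unsure about.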
 
 
4. Adding a New Document Category.
 
Let's say I have trained the models on my five classes (as described above), everything is working fine, and I decide to add a sixth document category. Do the first five models need to be retrained from scratch (to include the negative instances from this new sixth category)? Or can SpamBayes models be trained "incrementally," just by training on the new category-6 examples as negatives?
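From reading classifier.py, learn() appears to do little more than bump per-token ham/spam counts, so I'm hoping the update could be as incremental as the sketch below (which assumes the one-vs-rest setup above, and that I keep the original corpora around to serve as negatives for the new model):

    from spambayes.classifier import Classifier

    def tokens(text):
        return text.lower().split()

    def add_category(models, docs_by_category, new_name, new_docs):
        # The existing models just absorb the new docs as negatives...
        for model in models.values():
            for text in new_docs:
                model.learn(tokens(text), True)
        # ...but the new category's model needs all the old docs as negatives,
        # which is why the original training documents must be kept on hand.
        model = Classifier()
        for text in new_docs:
            model.learn(tokens(text), False)
        for docs in docs_by_category.values():
            for text in docs:
                model.learn(tokens(text), True)
        models[new_name] = model
        docs_by_category[new_name] = new_docs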
 
5. Size of Training Sample

My final concern is the number of training documents I would need. I'm guessing that each of my documents, no matter how long or short, reduces to a single feature vector for training and classification. Is that correct? If so, it seems I would need at least hundreds of examples from each category, and probably thousands. Yes? No?
 
 
Thanks for any thoughts you might have on my concerns and questions.
 
~Michael.
_______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
