Hi,

I plan to use Mahout's classification feature. I have a lot of data on which I am planning to train my model, and I have a few questions:

1) Suppose I have two types of data: spam and non-spam (this is just an example, not my real use case, but it is similar). The amount of spam in the training data is far less than the amount of non-spam: about 2% spam (or maybe 1%) and 98% non-spam. If I build my model on this training set so that it outputs spam/non-spam, will I get non-spam all the time because non-spam dominates the training data? Will my model correctly identify spam?
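
To make the concern concrete, here is a minimal sketch in plain Python (not Mahout code). It assumes a hypothetical training set of 1000 examples with the 2% / 98% split described above, and a degenerate "model" that always answers non-spam. The variable names are made up for illustration.

```python
# Hypothetical data: 1000 examples, 2% spam and 98% non-spam (split taken from the question).
labels = ["spam"] * 20 + ["nonspam"] * 980

# A degenerate "model" that ignores its input and always predicts the majority class.
predictions = ["nonspam"] * len(labels)

# Overall accuracy: fraction of examples where the prediction matches the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Spam recall: fraction of actual spam examples that were predicted as spam.
spam_total = sum(y == "spam" for y in labels)
spam_caught = sum(p == "spam" and y == "spam" for p, y in zip(predictions, labels))
spam_recall = spam_caught / spam_total

print(f"accuracy = {accuracy:.2f}")
print(f"spam recall = {spam_recall:.2f}")
```

Such a model would look 98% accurate while never catching a single spam example, which is exactly the behaviour I am worried about.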
-- Regards, Damodar Shetyo
