Does Mahout classification depend on the amount of data in each category?

Damodar Shetyo Tue, 03 Jul 2012 03:57:55 -0700

Hi,
I plan to use Mahout's classification feature. I have a lot of data on which I
plan to train my model, and I have a few questions:
1) Suppose I have two types of data: spam and non-spam (this is just an
example, not my real use case, but it is similar). The training data contains
far less spam than non-spam: about 2% spam (or maybe 1%) and 98% non-spam.
If I build my model on this training data so that it outputs spam/non-spam,
will it output non-spam all the time because non-spam dominates the
training data?
Will my model correctly identify spam?
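
To make the concern concrete, here is a minimal sketch in plain Python (not
Mahout code; the function name, labels, and the 1000-example set are
hypothetical) of the worst case, a model that always answers nonspam on the
2% / 98% split described above. It only illustrates the arithmetic, it does
not show how any Mahout classifier actually behaves.

```python
# Hypothetical illustration of the 2% spam / 98% nonspam split.
# This is NOT Mahout code; it only works through the metric arithmetic.

def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, plus precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when nothing is predicted/labelled spam.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Assumed data set: 1000 examples, 20 spam (2%) and 980 nonspam (98%).
y_true = ["spam"] * 20 + ["nonspam"] * 980

# Degenerate model: predicts nonspam for every example.
y_pred = ["nonspam"] * len(y_true)

metrics = evaluate(y_true, y_pred)
print(metrics)
```

In this worst case the accuracy is 98% while the recall on spam is 0, so
accuracy alone would not reveal the problem; per-class precision and recall
would.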



-- 
Regards,
Damodar Shetyo
