(Please don't "ping" your questions on the list -- it's bad form and makes people less likely to answer.)
You do not have to have equal numbers of positive and negative examples. Class imbalance is exactly why the frequency of each class/label (the prior) is part of the calculation. I think you need to go back and read up on the basics of how Bayesian classification works before you dig into Mahout.

On Tue, Jul 3, 2012 at 4:54 PM, damodar shetyo <[email protected]> wrote:
> Can someone help me with this?
>
> Regards,
> Damodar
>
> On Tue, Jul 3, 2012 at 4:27 PM, damodar shetyo <[email protected]> wrote:
>
>> Hi,
>> I plan to use mahout classification feature. I have a lot of data on which
>> i am planning to train my model. Now i have few queries as follows:
>> 1) Suppose i have 2 types of data: Spam and not spam (this is just for
>> example and not real use case, but similar to my real use case). The
>> amount of spam data is far less than that of non spam data in training
>> data. I have 2% of spam (or may be 1%) and 98% of nonspam in training.
>> Now the question is, if i build my model on this training such that it
>> outputs spam/nonspam will i get nonspam all the time as non spam data is
>> more in training?
>> Will my model correctly identify spam?
>>
>> --
>> Regards,
>> Damodar Shetyo
>
> --
> Regards,
> Damodar Shetyo
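
To make the point about class frequency concrete, here is a toy sketch of a naive Bayes posterior with a 2%/98% prior. It is plain Python, not Mahout's API or its actual implementation, and the tokens, likelihood values, and the `posterior` helper are all made up for illustration.

```python
import math

def posterior(priors, likelihoods, tokens):
    """Return P(label | tokens) under a naive Bayes model.

    priors:      {label: P(label)}, i.e. the class frequency in training data
    likelihoods: {label: {token: P(token | label)}}
    """
    log_scores = {}
    for label, prior in priors.items():
        # The class frequency enters here, as the log prior.
        score = math.log(prior)
        for tok in tokens:
            score += math.log(likelihoods[label][tok])
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnorm.values())
    return {label: v / total for label, v in unnorm.items()}

# Hypothetical numbers: 2% spam, 98% nonspam, as in the question.
priors = {"spam": 0.02, "nonspam": 0.98}
likelihoods = {
    "spam":    {"lottery": 0.30,  "meeting": 0.01},
    "nonspam": {"lottery": 0.001, "meeting": 0.10},
}

spammy = posterior(priors, likelihoods, ["lottery"])
normal = posterior(priors, likelihoods, ["meeting"])
print(spammy)
print(normal)
```

With these invented numbers, the message containing "lottery" comes out at roughly 0.86 probability of spam despite the 2% prior, because the token is 300 times more likely under spam than under nonspam. The prior shifts the decision threshold; it does not force every prediction to the majority class.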
