---------- Forwarded message ----------
Date: Sun, 13 Mar 2005 21:58:09 +1100
From: oscar ng <[EMAIL PROTECTED]>
To: 'Danny Yoo' <[EMAIL PROTECTED]>
Subject: RE: [Tutor] Newbie in Python

Hi Danny,

Thanks for the reply. I wasn't sure how this works, so I am glad there is someone who might be able to help me. Because this is a university assignment I am not sure how much help you can provide, but here it goes.

I need to build a mail filtering system that sorts mail messages into appropriate categories, such as spam, job announcement and conference announcement. The assignment will be in two parts. In the first part you will try your own approaches to solving the problem, using the Natural Language Toolkit package for Python. In the second part, you will use the techniques learned in the classes on text classification, and compare their results with those of your own approach.

What Is Given

The target dataset consists of four types of documents, a list of spam mail messages and a list of messages sent to various newsgroups. The four types of documents are located in different directories. Each document is formatted as an email message with the main text and two email headers: From and Subject. All the HTML code has been removed. Below is an example of a message from the corpus:

From: [EMAIL PROTECTED]
Subject: YOUR APPLICATION HAS BEEN APPROVED

You Have Been APPROVED for 3 UNSECURED VISA and MASTERCARDS!
Are you at least 18 Years of age?
Have a Valid Social Security No?
Income of at Least $99 p/week?
YOU'RE APPROVED!
Our Banks offer: INSTANT FREE ONLINE APPROVAL!
Receive your cards in as little as TWO Weeks from Today!
Just in Time for Summer Vacation!
For more information on how you can get your Visa or Mastercards NOW, click on the link below:
MailTo:[EMAIL PROTECTED]
*******************************************************
If you are no longer interested in receiving information on Credit Cards or Financial Services, please click on the link below and you will be removed from our optin list.
MailTo:[EMAIL PROTECTED]

The four categories are as follows:

  spam
  job announcements (now available)
  conference announcements (now available)
  other emails

What your code should do for Part I, then, is to tokenise the files, classify the emails according to your own algorithm, and output the results of the classification. Your algorithm might specify, for example, that emails with greater than X% of capitalised words are spam. Your algorithm for this part can be quite simple; the main aim is to get the infrastructure built for Part II, and to get you thinking about what is involved in these sorts of systems.

The output of your code might look as follows:

  24 messages are SPAM (77% correct):
  msg-a-2
  msg-a-3
  ...
  11 messages are JOB ANN (63% correct):
  ja-4
  ja-6
  ...
  35 messages are CONF ANN (84% correct):
  ca-1
  ca-3
  msg-a-11
  ...
  9 messages are OTHER (22% correct):
  10000
  10001
  ...

---

I am stuck on how to open the folder (directory) that contains all the files I need to process for this assignment. The folder contains subfolders, and those in turn contain the email files that need to be processed.

Thanks for your time in reading this, and I hope to hear from you soon. If you need more info, there is a link:

http://www.comp.mq.edu.au/units/comp348/assignments/ass1.html

-----Original Message-----
From: Danny Yoo [mailto:[EMAIL PROTECTED]
Sent: Friday, 11 March 2005 6:04 AM
To: oscar ng
Cc: tutor@python.org
Subject: Re: [Tutor] Newbie in Python

On Thu, 10 Mar 2005, oscar ng wrote:

> Needing help on a mail filtering system that explores the headers and
> text and determines which category the email falls into.

[text cut]

Hi Oscar,

Ok. What help do you need? You have not told us what problems you're having, so we're stuck just twiddling our thumbs. *grin*

Are you already aware of projects that do this, or are you doing this for fun?
The SpamBayes project has quite a bit of source code that may interest you:

    http://spambayes.sourceforge.net/

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
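
On the directory question raised above: one common way to visit every file under a folder and its subfolders is `os.walk` from the standard library. The sketch below is only an illustration under several assumptions. It targets modern Python 3 and uses the standard library rather than the Natural Language Toolkit. All function names are made up for this example, the 20% threshold stands in for the assignment's unspecified "X%", and "capitalised" is read here as "written entirely in upper case", which may not be what the assignment intends.

```python
import os
import re


def iter_message_paths(root):
    """Yield the path of every file below root, at any depth."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order stable
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def split_message(text):
    """Split a message into a dict of From/Subject headers and the body."""
    headers = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        match = re.match(r"(From|Subject):\s*(.*)", lines[i])
        if not match:
            break  # first line that is not a header starts the body
        headers[match.group(1)] = match.group(2)
        i += 1
    return headers, "\n".join(lines[i:])


def capitalised_ratio(text):
    """Fraction of words written entirely in upper case (assumed meaning)."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    # Single letters such as "I" or "A" are not counted as capitalised.
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    return len(caps) / len(words)


def classify(text, threshold=0.2):
    """Label a message SPAM or OTHER; the threshold is an arbitrary placeholder."""
    headers, body = split_message(text)
    ratio = capitalised_ratio(headers.get("Subject", "") + " " + body)
    return "SPAM" if ratio > threshold else "OTHER"


def classify_tree(root, threshold=0.2):
    """Classify every file below root; return {label: [file names]}."""
    results = {}
    for path in iter_message_paths(root):
        # latin-1 maps every byte to a character, so decoding cannot fail
        with open(path, encoding="latin-1") as handle:
            label = classify(handle.read(), threshold)
        results.setdefault(label, []).append(os.path.basename(path))
    return results
```

Calling `classify_tree("corpus")`, where `corpus` is a placeholder for the top-level data directory, returns a dictionary mapping each label to the file names assigned to it. Since each category lives in its own directory, comparing a file's parent directory with its predicted label would give the "% correct" figures in the sample output; that step, and the job and conference categories, are left out here.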