Trouble with UnicodeEncodeError and email

2014-01-08 Thread Florian Lindner
Hello!

I've written some tiny script using Python 3 and it used to work perfectly. 
Then I realized it needs to run on my Debian Stable server too, which offers 
only Python 2. Ok, most backporting was a matter of minutes, but I'm becoming 
desperate on some Unicode error...

I use scikit-learn to train a filter on a set of email messages:

vectorizer = CountVectorizer(input='filename', decode_error='replace',
                             strip_accents='unicode',
                             preprocessor=self.mail_preprocessor,
                             stop_words='english')

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

The vectorizer gets a list of filenames, reads them and passes them to the 
preprocessor:


def mail_preprocessor(self, message):
    # Filter POPFile cruft by matching date string at the beginning.
    print("Type:", type(message))  # imported from __future__
    pop_reg = re.compile(r"^[0-9]{4}/[0-1][1-9]/[0-3]?[0-9]")
    message = [line for line in message.splitlines(True)
               if not pop_reg.match(line)]
    xxx = "".join(message)
    msg = email.message_from_string(xxx)  # <-- CRASH here

    msg_body = ""

    for part in msg.walk():
        if part.get_content_type() in ["text/plain", "text/html"]:
            body = part.get_payload(decode=True)
            soup = BeautifulSoup(body)
            msg_body += soup.get_text(" ", strip=True)

    if "-BEGIN PGP MESSAGE-" in msg_body:
        msg_body = ""

    msg_body += " ".join(email.utils.parseaddr(msg["From"]))
    try:
        msg_body += " " + msg["Subject"]
    except TypeError:  # Can't convert 'NoneType' object to str implicitly
        pass
    msg_body = msg_body.lower()
    return msg_body


Type: <type 'unicode'>

Traceback (most recent call last):
  File "flofify.py", line 182, in <module>
    main()
  File "flofify.py", line 161, in main
    model.train()
  File "flofify.py", line 73, in train
    vectors = vectorizer.fit_transform(data[:,1])
  File "/usr/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/usr/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "/usr/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "flofify.py", line 119, in mail_preprocessor
    msg = email.message_from_string(xxx)
  File "/usr/lib/python2.7/email/__init__.py", line 57, in message_from_string
    return Parser(*args, **kws).parsestr(s)
  File "/usr/lib/python2.7/email/parser.py", line 82, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 1624: ordinal not in range(128)
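
For reference, the u'\ufffd' named in the error is U+FFFD, the Unicode replacement character, which is what a decode with the 'replace' error handler (as requested by decode_error='replace') substitutes for undecodable bytes. The error message itself says that something then tried to encode the text as ASCII. The following is only a minimal sketch of that combination, written for Python 3 rather than the Python 2.7 in the traceback; the sample bytes are made up and nothing here is taken from flofify.py:

```python
# Hypothetical sample: a header containing a Latin-1 byte (0xE9) that is
# not valid UTF-8. The bytes are invented for illustration.
raw = b"Subject: caf\xe9\n\nbody\n"

# Decoding with errors="replace" never fails; each undecodable byte
# sequence becomes U+FFFD, the Unicode replacement character.
text = raw.decode("utf-8", errors="replace")

# Encoding that text as ASCII fails on the replacement character,
# which mirrors the last line of the traceback.
try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(text))
print(error)
```

If this is what happens in the script, the text was already altered by the lossy decode before the email parser saw it, so encoding or decoding the message argument afterwards cannot bring back the original bytes.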

I've tried various modifications like encoding/decoding the message argument to 
utf-8.

Any help?

Thanks!

Florian

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Trouble with UnicodeEncodeError and email

2014-01-08 Thread Chris Angelico
On Thu, Jan 9, 2014 at 12:14 AM, Florian Lindner <mailingli...@xgm.de> wrote:
> I've written some tiny script using Python 3 and it used to work perfectly.
> Then I realized it needs to run on my Debian Stable server too, which offers
> only Python 2. Ok, most backporting was a matter of minutes, but I'm becoming
> desperate on some Unicode error...

Are you sure it does? The current Debian stable is Wheezy, which comes
with a package 'python3' in the repository, which will install 3.2.3.
(The previous Debian stable, Squeeze, has 3.1.3 under the same name.)
You may need to change your shebang, but that's all you'd need to do.
Or are you unable to install new packages? If so, I strongly recommend
getting Python 3 added, as it's going to spare you a lot of Unicode
headaches.
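
A minimal sketch of why Python 3 helps here, assuming a python3 interpreter is on the PATH (the shebang line and the sample message below are illustrative, not taken from the original script). In Python 3, str is Unicode text and email.message_from_string accepts it directly, including the U+FFFD character from the traceback:

```python
#!/usr/bin/env python3
import email
import email.utils

# Hypothetical message text containing U+FFFD, the character that
# triggered the error under Python 2.
raw = (
    "From: Alice <alice@example.org>\n"
    "Subject: Test\n"
    "\n"
    "Body with a replacement character: \ufffd\n"
)

# Parsing Unicode text works without an implicit ASCII encode.
msg = email.message_from_string(raw)

# Header access and address parsing as used in mail_preprocessor.
sender_name, sender_addr = email.utils.parseaddr(msg["From"])
body = msg.get_payload()

print(sender_name, sender_addr, msg["Subject"])
print(repr(body))
```

This only shows that the parsing step accepts such text; it says nothing about whether the rest of the script behaves identically on 3.2.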

Mind you, I compile my own Py3 for Wheezy, since I like to be on the
bleeding edge. But that's not for everyone. :)

ChrisA


Re: Trouble with UnicodeEncodeError and email

2014-01-08 Thread Florian Lindner
On Thursday, 9 January 2014, 00:26:15, Chris Angelico wrote:
> On Thu, Jan 9, 2014 at 12:14 AM, Florian Lindner <mailingli...@xgm.de> wrote:
> > I've written some tiny script using Python 3 and it used to work perfectly.
> > Then I realized it needs to run on my Debian Stable server too, which
> > offers only Python 2. Ok, most backporting was a matter of minutes, but I'm
> > becoming desperate on some Unicode error...
>
> Are you sure it does? The current Debian stable is Wheezy, which comes
> with a package 'python3' in the repository, which will install 3.2.3.
> (The previous Debian stable, Squeeze, has 3.1.3 under the same name.)
> You may need to change your shebang, but that's all you'd need to do.
> Or are you unable to install new packages? If so, I strongly recommend
> getting Python 3 added, as it's going to spare you a lot of Unicode
> headaches.
>
> Mind you, I compile my own Py3 for Wheezy, since I like to be on the
> bleeding edge. But that's not for everyone. :)

Well, I thought I had scanned the repos but obviously... I had to install
BeautifulSoup and scikit-learn manually. Now some other Unicode issues have
arisen, but I first need to sort out how they are connected to my mail
delivery agent.

Thx a lot,

Florian