Jason R. Mastaler writes:

Stephen Warren <[EMAIL PROTECTED]> writes:

At present, my patch ensures that it only parses None/UTF-8
encoded data, because I don't know if/how it would work on other
types... Are there other types that are likely?

Goodness, yes. Actually, very few mail clients grok UTF-8 at this
point. Instead, they use what many call RFC 2047 "gobbledygook" to
support a whole slew of encodings.
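For illustration, here is a small sketch of what an RFC 2047 "encoded-word" looks like once it has been through decode_header. It is written against the modern Python 3 spelling of the module (email.header), which is an assumption on my part; the patch in this thread uses the older email.Header name. The sample subject and the to_text helper are made up for the example.

```python
from email.header import decode_header

# Hypothetical Subject: the first word is an RFC 2047 encoded-word
# (ISO-8859-1, quoted-printable), followed by plain ASCII text.
raw_subject = "=?iso-8859-1?q?caf=E9?= menu"

# decode_header returns a list of (data, charset) pairs.
# charset is None for parts that were not encoded.
parts = decode_header(raw_subject)


def to_text(decoded_parts):
    """Join decoded parts into one str (hypothetical helper)."""
    out = []
    for data, charset in decoded_parts:
        if isinstance(data, bytes):
            # Unencoded parts carry no charset; treat them as ASCII.
            data = data.decode(charset or "ascii")
        out.append(data)
    return "".join(out)


print(parts)
print(to_text(parts))
```

Each part keeps its own charset, so a single Subject can mix several encodings.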


I'm in need of a good overview of I18N/UTF-8/unicode/etc.!

The current situation is more complex than you'd ever imagine.

*whimper* Yes - it seems that way now!


Anyway, I've modified the patch so that it first converts the data it uses to Unicode, then back to ASCII. Whilst doing this, it jumps out of X-TMDA parsing if either conversion can't be applied, i.e. if the first two words of the subject aren't representable in ASCII. I'm assuming we don't want to offer internationalized text for "X-TMDA", "dated", etc.?
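The round trip described above can be sketched roughly like this. It is a minimal Python 3 illustration, not the patch itself; ascii_or_none is a hypothetical helper name, and treating a missing charset as ASCII is an assumption carried over from the patch.

```python
def ascii_or_none(data, charset):
    """Return data as an ASCII-only str, or None if it is not representable.

    data is one decoded header part (bytes or str); charset is its
    charset name, or None for an unencoded part (assumed to be ASCII).
    """
    try:
        if isinstance(data, bytes):
            # Step 1: bytes -> Unicode, using the part's own charset.
            data = data.decode(charset or "ascii")
        # Step 2: Unicode -> ASCII; raises if any character does not fit.
        data.encode("ascii")
        return data
    except (UnicodeError, LookupError):
        # UnicodeError covers both decode and encode failures;
        # LookupError covers an unknown charset name.
        return None


print(ascii_or_none(b"X-TMDA", None))
print(ascii_or_none(b"caf\xe9", "iso-8859-1"))
```

A None result is the "jump out of X-TMDA parsing" case: the word exists, but it cannot be one of the ASCII command words.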

One thing I wonder about: I'm calling split directly on the results that come from decode_header. The docs you pointed me at indicate that all current encodings match ASCII for codes 32-254, so I'm guessing this is safe, with the possible exception of multi-byte encoded data? Should I convert *everything* to Unicode before doing anything? With the current code, after I've stripped "X-TMDA" and the parameter, I can put the rest back with the exact same encoding it had before, but I'm not sure how I'd do that if I mapped everything into Unicode before splitting it and removing the command...
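For what it's worth, here is one possible shape for the "convert everything first" variant, sketched in Python 3 with email.header. It is only a sketch under assumptions: strip_subject_command is a hypothetical name, and the re-encoding rule (plain ASCII if possible, then the charsets seen in the original header, then UTF-8) is a design choice of this example rather than anything TMDA does. It does not preserve the original encoded-word boundaries byte for byte.

```python
from email.errors import HeaderParseError
from email.header import Header, decode_header, make_header


def strip_subject_command(raw_subject, keyword="x-tmda"):
    """Return (argument, new_subject) if the subject starts with keyword.

    Returns None when the subject carries no command or cannot be decoded.
    """
    try:
        parts = decode_header(raw_subject)
        # Decode the whole header to Unicode before any splitting.
        text = str(make_header(parts))
    except (UnicodeError, LookupError, HeaderParseError):
        return None

    words = text.split(None, 2)
    if len(words) < 2 or words[0].lower() != keyword:
        return None

    argument = words[1]
    try:
        argument.encode("ascii")  # the command argument must be ASCII
    except UnicodeError:
        return None

    rest = words[2] if len(words) > 2 else ""
    # Try plain ASCII, then the original charsets, then UTF-8.
    candidates = ["us-ascii"] + [cs for _, cs in parts if cs] + ["utf-8"]
    for cs in candidates:
        try:
            rest.encode(cs)
        except (UnicodeError, LookupError):
            continue
        return argument, Header(rest, cs).encode()
    return None


print(strip_subject_command("X-TMDA dated Hello world"))
print(strip_subject_command("X-TMDA dated =?iso-8859-1?q?menu_du_caf=E9?="))
```

Splitting after decoding sidesteps the multi-byte worry, at the cost of re-encoding the remainder instead of passing the original bytes through untouched.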

--- tmda-inject 2003-07-11 16:02:59.000000000 -0700
+++ tmda-inject-hacked 2003-08-20 14:49:02.000000000 -0700
@@ -126,6 +126,7 @@
 from TMDA import Util
 
 from email.Utils import formataddr, getaddresses, parseaddr
+from email.Header import decode_header, make_header
 import socket
 import string
 
@@ -396,17 +397,51 @@
 if (Defaults.X_TMDA_IN_SUBJECT and msgout.has_key('subject') and
     x_tmda_over is None):
     sub = msgout.get('subject')
-    subsplit = sub.split(None, 2)
-    if subsplit and subsplit[0].lower() == 'x-tmda':
-        x_tmda_over = 1
-        actions = { 'from' : FilterParser.splitaction(subsplit[1]) }
-        log_msg = '%s: %s' % ('X-TMDA', subsplit[1])
-        # Fixup Subject: before sending.
-        del msgout['Subject']
-        if subsplit[2:]:
-            msgout['Subject'] = subsplit[2:][0]
-        else:
-            msgout['Subject'] = ''
+
+    dh = decode_header(sub)
+
+    #sys.stderr.write('x-tmda testing in subject "%s"\n' % sub)
+    #for hp in dh:
+    #    sys.stderr.write('decoded: "%s" : "%s"\n' % (hp[0], hp[1]))
+
+    dhs = []
+    for i in (0, 1):
+        if len(dh) > i:
+            dhsi = dh[i][0].split(None)
+            dhs.extend(map(lambda x: [x, dh[i][1]], dhsi))
+    dhs.extend(dh[2:])
+
+    #for hp in dhs:
+    #    sys.stderr.write('decoded split: "%s" : "%s"\n' % (hp[0], hp[1]))
+
+    if len(dhs) >= 2:
+        try:
+            dhs0_c = dhs[0][1] or 'ascii'
+            dhs0_u = unicode(dhs[0][0], dhs0_c)
+            #sys.stderr.write('dhs0_u: %s\n' % dhs0_u)
+
+            dhs1_c = dhs[1][1] or 'ascii'
+            dhs1_u = unicode(dhs[1][0], dhs1_c)
+            #sys.stderr.write('dhs1_u: %s\n' % dhs1_u)
+
+            dhs0_a = dhs0_u.encode('ascii')
+            #sys.stderr.write('dhs0_a: %s\n' % dhs0_a)
+
+            dhs1_a = dhs1_u.encode('ascii')
+            #sys.stderr.write('dhs1_a: %s\n' % dhs1_a)
+
+            if dhs0_a.lower() == 'x-tmda':
+                x_tmda_over = 1
+                actions = { 'from' : FilterParser.splitaction(dhs1_a) }
+                log_msg = '%s: %s' % ('X-TMDA', dhs1_a)
+                # Fixup Subject: before sending.
+                del msgout['Subject']
+                if dhs[2:]:
+                    msgout['Subject'] = make_header(dhs[2:])
+                else:
+                    msgout['Subject'] = ''
+        except UnicodeError:
+            pass
 
 # If the address matches a line in the filter file, it is tagged
 # accordingly, otherwise it is tagged with the default cookie


--
Stephen Warren, Software Engineer, Parama Networks, San Jose, CA
[EMAIL PROTECTED] http://www.wwwdotorg.org/


_________________________________________________
tmda-workers mailing list ([EMAIL PROTECTED])
http://tmda.net/lists/listinfo/tmda-workers
