[Bug 7656] UTF8 rules, normalize_charset etc overhaul
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns changed:

    What    |Removed |Added
    Blocks  |        |7645

Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7645
[Bug 7645] Wide character in print at /usr/bin/sa-compile line 433

--
You are receiving this mail because:
You are the assignee for the bug.
[Bug 7645] Wide character in print at /usr/bin/sa-compile line 433
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7645

Henrik Krohns changed:

    What        |Removed |Added
    Depends on  |        |7656

Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656
[Bug 7656] UTF8 rules, normalize_charset etc overhaul
[Bug 7656] UTF8 rules, normalize_charset etc overhaul
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns changed:

    What    |Removed |Added
    Blocks  |        |7022

Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022
[Bug 7022] normalize_charset
[Bug 7022] normalize_charset
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022

Henrik Krohns changed:

    What        |Removed |Added
    Depends on  |        |7656
    CC          |        |h...@hege.li

Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656
[Bug 7656] UTF8 rules, normalize_charset etc overhaul
[Bug 7656] UTF8 rules, normalize_charset etc overhaul
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #3 from Henrik Krohns ---
(In reply to Henrik Krohns from comment #0)
> latin1 message, no ct     RULE_LATIN1 /
> latin1 message, utf8 ct   RULE_LATIN1 /
> latin1 message, no ct     RULE_UTF8 /
> latin1 message, utf8 ct   RULE_UTF8 /

OK, these should now be fixed. Basically, Encode::Detect::Detector thinks the body "päivää" is Windows-1255 (Hebrew!!).

dbg: message: failed decoding as declared charset UTF-8
dbg: message: decoded as detected charset windows-1255, declared UTF-8

Why are we using a module that hasn't been updated in 10 years anyway? Maybe look at Encode::Guess, which has been in core at least since 5.8.8?

I simply added Latin diacritic letters to SA's own basic Win-1252 detection. I borrowed the \xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe bit from textcat; looking at https://en.wikipedia.org/wiki/Windows-1252, it seems correct. Not sure if the missing ÿ (\xff) should be added here and to textcat.

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/Message/Node.pm
Sending        trunk/lib/Mail/SpamAssassin/Message/Node.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1846805.
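For illustration only, the fallback described in comment #3 might look roughly like the sketch below. It is written in Python rather than Perl and is not the code committed to Node.pm. The function name `decode_body` and the exact fallback order are assumptions; only the byte class is taken from the comment.

```python
import re

# Byte class quoted in comment #3 (Latin letters with diacritics in
# Windows-1252; \xd7, \xf7 and \xff are deliberately left out there).
LATIN_DIACRITICS = re.compile(rb"[\xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe]")

def decode_body(raw, declared="utf-8"):
    """Hypothetical helper: return (text, charset_used)."""
    # 1. Trust the declared charset if the bytes really decode with it.
    try:
        return raw.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        pass
    # 2. Otherwise, if the body contains Latin diacritic bytes, assume
    #    Windows-1252 instead of asking a statistical detector.
    if LATIN_DIACRITICS.search(raw):
        return raw.decode("cp1252", errors="replace"), "cp1252"
    # 3. Last resort: latin-1 maps every byte, so it cannot fail.
    return raw.decode("latin-1"), "latin-1"

latin1_bytes = "p\u00e4iv\u00e4\u00e4".encode("latin-1")  # "päivää" as latin1
text, used = decode_body(latin1_bytes, declared="utf-8")
```

With the latin1 bytes of "päivää" and a declared charset of UTF-8, step 1 fails and step 2 picks Windows-1252, which is the outcome the comment is after.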
[Bug 7656] UTF8 rules, normalize_charset etc overhaul
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #2 from Henrik Krohns ---
(In reply to Henrik Krohns from comment #0)
> Unless people want to use multiple rules to match non-utf8 and utf8
> messages, perhaps the only sane solution would be to "upgrade" all non-utf8
> rules to utf8 internally and do the matching to utf8 upgraded body. In such
> case the two rules above would actually be duplicates and work on any
> message.

Basically, what I mean by this is that normalize_charset should affect rule parsing too, and encode the rules (and the resulting regexes) as UTF8.

I don't think we can simply tell users to "convert all your rules/files to UTF8 if you want them to work". I don't use UTF8 in my editors or on my Linux systems anywhere. :-)
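A minimal sketch of the "upgrade" idea, in Python for illustration; it is not SpamAssassin code. The helper name `upgrade` is hypothetical, and treating anything that is not valid UTF-8 as Windows-1252 is an assumption made here. Both the rule source and the message body go through the same step before matching, so the two rules from comment #0 end up as the same pattern.

```python
import re

def upgrade(raw):
    """Hypothetical helper: decode bytes as UTF-8, else as Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as Win-1252.
        return raw.decode("cp1252", errors="replace")

# The same rule as saved by a latin1 editor and by a UTF-8 editor.
rule_latin1 = re.compile(upgrade(b"p\xe4iv\xe4\xe4"))
rule_utf8 = re.compile(upgrade(b"p\xc3\xa4iv\xc3\xa4\xc3\xa4"))

# The same message body in both encodings.
body_latin1 = upgrade(b"hyv\xe4\xe4 p\xe4iv\xe4\xe4")
body_utf8 = upgrade(b"hyv\xc3\xa4\xc3\xa4 p\xc3\xa4iv\xc3\xa4\xc3\xa4")

# After upgrading, either rule matches either body.
results = [bool(rule.search(body))
           for rule in (rule_latin1, rule_utf8)
           for body in (body_latin1, body_utf8)]
```

The trade-off is that a latin1 rule whose bytes happen to form valid UTF-8 would be decoded as UTF-8, so a real implementation would probably need an explicit way to declare the rule file's charset.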
[Bug 7656] UTF8 rules, normalize_charset etc overhaul
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns changed:

    What  |Removed |Added
    CC    |        |h...@hege.li

--- Comment #1 from Henrik Krohns ---
Lots of talk here too that I haven't digested yet:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022
[Bug 7656] New: UTF8 rules, normalize_charset etc overhaul
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

            Bug ID: 7656
           Summary: UTF8 rules, normalize_charset etc overhaul
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Hardware: All
                OS: All
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: Libraries
          Assignee: dev@spamassassin.apache.org
          Reporter: h...@hege.li
  Target Milestone: Undefined

There are a few related bugs, but I'm creating a new one to oversee this. I don't think we should release 4.0.0 before all UTF8-related functionality works adequately and is documented properly.

I made a few tests with a message that contains either latin1- or utf8-encoded text (or simple HTML without any encoding declarations). Each was tested in three variants: Content-Type charset missing, specified as utf8, or specified as latin1.

body RULE_LATIN1 /päivää/
body RULE_UTF8 /pÀivÀÀ/

TEXT/PLAIN                   normalize_charset 0 / 1
utf8 message, no ct          RULE_UTF8   / RULE_UTF8
utf8 message, utf8 ct        RULE_UTF8   / RULE_UTF8
utf8 message, latin1 ct      RULE_UTF8   / RULE_UTF8
latin1 message, no ct        RULE_LATIN1 /
latin1 message, utf8 ct      RULE_LATIN1 /
latin1 message, latin1 ct    RULE_LATIN1 / RULE_UTF8

TEXT/HTML                    normalize_charset 0 / 1
utf8 message, no ct          RULE_UTF8   / RULE_UTF8
utf8 message, utf8 ct        RULE_UTF8   / RULE_UTF8
utf8 message, latin1 ct      RULE_UTF8   / RULE_UTF8
latin1 message, no ct        RULE_UTF8   /
latin1 message, utf8 ct      RULE_UTF8   /
latin1 message, latin1 ct    RULE_UTF8   / RULE_UTF8

- For latin1 messages, normalize_charset 1 doesn't hit either rule unless the message's Content-Type declares ISO-8859-1??
- The HTML parser apparently assumes everything is UTF8. It only matches UTF8 rules?

One can't even use simple workarounds such as "body RULE_FOO /p.iv../" to match umlauts (diacritics?) in UTF8 messages, as each of them obviously eats up two characters (bytes).

Let's not even get into other things yet, like sa-compile (bug 7645), textcat etc., which all expect some correct encoding to work.

Unless people want to use multiple rules to match non-utf8 and utf8 messages, perhaps the only sane solution would be to "upgrade" all non-utf8 rules to utf8 internally and match against the utf8-upgraded body. In that case the two rules above would actually be duplicates and would work on any message.
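The point about the "/p.iv../" workaround can be reproduced outside SpamAssassin. The sketch below uses Python's `re` module purely as an illustration; the helper names are hypothetical. When matching raw bytes, `.` consumes one byte, and "ä" is one byte in latin1 but two bytes in UTF-8.

```python
import re

latin1_body = "p\u00e4iv\u00e4\u00e4".encode("latin-1")  # b"p\xe4iv\xe4\xe4"
utf8_body = "p\u00e4iv\u00e4\u00e4".encode("utf-8")      # each "ä" is b"\xc3\xa4"

byte_rule = re.compile(rb"p.iv..")  # "." matches a single byte
text_rule = re.compile(r"p.iv..")   # "." matches a single character

def hits_bytes(body):
    """Match the rule against the undecoded bytes."""
    return byte_rule.search(body) is not None

def hits_text(body, charset):
    """Decode first, then match against characters."""
    return text_rule.search(body.decode(charset)) is not None

byte_results = (hits_bytes(latin1_body), hits_bytes(utf8_body))
text_results = (hits_text(latin1_body, "latin-1"), hits_text(utf8_body, "utf-8"))
```

Matching on bytes hits only the latin1 body, while matching on decoded text hits both, which is the behaviour the proposed internal "upgrade" is meant to provide.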