[Bug 7656] UTF8 rules, normalize_charset etc overhaul

2018-11-17 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns  changed:

What       | Removed | Added
Blocks     |         | 7645


Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7645
[Bug 7645] Wide character in print at /usr/bin/sa-compile line 433
-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7645] Wide character in print at /usr/bin/sa-compile line 433

2018-11-17 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7645

Henrik Krohns  changed:

What       | Removed | Added
Depends on |         | 7656


Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656
[Bug 7656] UTF8 rules, normalize_charset etc overhaul

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

2018-11-17 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns  changed:

What       | Removed | Added
Blocks     |         | 7022


Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022
[Bug 7022] normalize_charset

[Bug 7022] normalize_charset

2018-11-17 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022

Henrik Krohns  changed:

What       | Removed | Added
Depends on |         | 7656
CC         |         | h...@hege.li


Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656
[Bug 7656] UTF8 rules, normalize_charset etc overhaul

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

2018-11-17 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #3 from Henrik Krohns  ---
(In reply to Henrik Krohns from comment #0)
> latin1 message, no ct       RULE_LATIN1 / (none)
> latin1 message, utf8 ct     RULE_LATIN1 / (none)
> latin1 message, no ct       RULE_UTF8   / (none)
> latin1 message, utf8 ct     RULE_UTF8   / (none)

OK, these should now be fixed.

Basically Encode::Detect::Detector thinks body "päivää" is Windows-1255
(Hebrew!!). 

dbg: message: failed decoding as declared charset UTF-8
dbg: message: decoded as detected charset windows-1255, declared UTF-8
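
The two debug lines describe a decode-then-fallback sequence. A minimal Python sketch of that behaviour, assuming the body is the Latin-1 byte string for "päivää": `decode_with_fallback` is a hypothetical helper, not SpamAssassin's code, and the "detected" charset is passed in by hand here, not produced by a real detector.

```python
def decode_with_fallback(raw, declared, detected):
    """Try the declared charset first; fall back to the detected one."""
    try:
        return raw.decode(declared), declared
    except UnicodeDecodeError:
        return raw.decode(detected), detected


raw = "päivää".encode("latin-1")  # b'p\xe4iv\xe4\xe4'

# The declared charset is UTF-8, but a lone 0xE4 byte is not valid UTF-8,
# so decoding falls through to the (wrongly) detected charset.
text, used = decode_with_fallback(raw, "utf-8", "cp1255")
print(used)                         # cp1255
print([hex(ord(c)) for c in text])  # 0xe4 became U+05D4, a Hebrew letter
```

The decode succeeds either way, which is why a wrong guess goes unnoticed: every byte of the Latin-1 body is also a valid Windows-1255 character.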

Why are we using a module that hasn't been updated in 10 years anyway? Maybe
look at Encode::Guess, which has been in core since at least 5.8.8?

I simply added Latin diacritic letters to SA's own basic Windows-1252 detection.
I borrowed the \xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe bit from textcat, and
looking at https://en.wikipedia.org/wiki/Windows-1252 it seems correct. Not
sure if the missing ÿ (\xff) should be added here and to textcat.
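
As an illustration of what that byte class covers, here is a small Python sketch. Only the ranges themselves come from the comment above; `has_latin_diacritics` is a hypothetical name, and this is not the code committed to Node.pm.

```python
import re

# Byte ranges quoted above. Within \xc0-\xff the class leaves out
# \xd7 (multiplication sign), \xdf (sharp s), \xf7 (division sign)
# and \xff (y with diaeresis).
LATIN_DIACRITICS = re.compile(rb"[\xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe]")


def has_latin_diacritics(raw):
    """Return True if raw contains a byte from the quoted ranges."""
    return LATIN_DIACRITICS.search(raw) is not None


print(has_latin_diacritics("päivää".encode("cp1252")))  # True
print(has_latin_diacritics("ÿ".encode("cp1252")))       # False: \xff is outside the class
print(has_latin_diacritics("×÷".encode("cp1252")))      # False: symbols, not letters
```

Note that UTF-8 lead bytes such as \xc3 also fall inside these ranges, so a check like this is only meaningful on bytes already known not to be valid UTF-8.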

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/Message/Node.pm
Sending        trunk/lib/Mail/SpamAssassin/Message/Node.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1846805.


[Bug 7656] UTF8 rules, normalize_charset etc overhaul

2018-11-17 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #2 from Henrik Krohns  ---
(In reply to Henrik Krohns from comment #0)
> Unless people want to use multiple rules to match non-utf8 and utf8
> messages, perhaps the only sane solution would be to "upgrade" all non-utf8
> rules to utf8 internally and do the matching to utf8 upgraded body. In such
> case the two rules above would actually be duplicates and work on any
> message.

What I mean by this: should normalize_charset also affect rule parsing and
encode the rules (and the resulting regexes) to UTF8? I don't think we can
simply tell users to "convert all your rules/files to UTF8 if you want them to
work". I don't use UTF8 in my editors or on my Linux systems anywhere. :-)
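
A rough Python sketch of the "upgrade the rules" idea, under the assumption that a rule pattern which is not valid UTF-8 can be treated as Latin-1. `upgrade_rule` is a hypothetical helper for illustration only; nothing here reflects how SpamAssassin parses rules.

```python
import re


def upgrade_rule(raw_pattern):
    """Decode a rule pattern to text: UTF-8 if valid, else Latin-1 (assumed)."""
    try:
        text = raw_pattern.decode("utf-8")
    except UnicodeDecodeError:
        text = raw_pattern.decode("latin-1")
    return re.compile(text)


# The same rule as it would sit in a Latin-1 file and in a UTF-8 file.
rule_latin1 = upgrade_rule("päivää".encode("latin-1"))
rule_utf8 = upgrade_rule("päivää".encode("utf-8"))

body = "Hyvää päivää"  # body already normalized to text

print(rule_latin1.pattern == rule_utf8.pattern)                      # True
print(bool(rule_latin1.search(body)), bool(rule_utf8.search(body)))  # True True
```

After the upgrade both rule files yield the same pattern, so the rule author's editor encoding stops mattering.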


[Bug 7656] UTF8 rules, normalize_charset etc overhaul

2018-11-17 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns  changed:

What       | Removed | Added
CC         |         | h...@hege.li

--- Comment #1 from Henrik Krohns  ---
There is lots of discussion here too that I haven't digested yet:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022


[Bug 7656] New: UTF8 rules, normalize_charset etc overhaul

2018-11-17 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

          Bug ID: 7656
         Summary: UTF8 rules, normalize_charset etc overhaul
         Product: Spamassassin
         Version: SVN Trunk (Latest Devel Version)
        Hardware: All
              OS: All
          Status: NEW
        Severity: blocker
        Priority: P2
       Component: Libraries
        Assignee: dev@spamassassin.apache.org
        Reporter: h...@hege.li
Target Milestone: Undefined

There are a few related bugs, but I'm creating a new one to oversee this.

I don't think we should release 4.0.0 before all UTF8 related functionality
works adequately and is documented properly.

I ran a few tests with a message that contains either latin1- or utf8-encoded
text (as plain text, or as simple HTML without any encoding declarations), in
three variants each: Content-Type charset missing, declared as utf8, or
declared as latin1.

body RULE_LATIN1 /päivää/
body RULE_UTF8 /pÀivÀÀ/

TEXT/PLAIN                  normalize_charset 0 / 1
utf8 message, no ct         RULE_UTF8   / RULE_UTF8
utf8 message, utf8 ct       RULE_UTF8   / RULE_UTF8
utf8 message, latin1 ct     RULE_UTF8   / RULE_UTF8
latin1 message, no ct       RULE_LATIN1 / (none)
latin1 message, utf8 ct     RULE_LATIN1 / (none)
latin1 message, latin1 ct   RULE_LATIN1 / RULE_UTF8

TEXT/HTML                   normalize_charset 0 / 1
utf8 message, no ct         RULE_UTF8   / RULE_UTF8
utf8 message, utf8 ct       RULE_UTF8   / RULE_UTF8
utf8 message, latin1 ct     RULE_UTF8   / RULE_UTF8
latin1 message, no ct       RULE_UTF8   / (none)
latin1 message, utf8 ct     RULE_UTF8   / (none)
latin1 message, latin1 ct   RULE_UTF8   / RULE_UTF8
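
The normalize_charset 0 column of the TEXT/PLAIN table is what plain byte-for-byte matching would produce. A small Python model of just that column, assuming both rules are compiled from their raw file bytes and applied to the raw body bytes (the `RULES` table and `hits` helper are hypothetical, and this does not explain the TEXT/HTML results):

```python
import re

# Each rule is compiled from the bytes it would have in its own encoding.
RULES = {
    "RULE_LATIN1": re.compile("päivää".encode("latin-1")),
    "RULE_UTF8": re.compile("päivää".encode("utf-8")),
}


def hits(body):
    """Return the names of the rules whose byte pattern occurs in body."""
    return [name for name, rx in RULES.items() if rx.search(body)]


print(hits("Hyvää päivää".encode("utf-8")))    # ['RULE_UTF8']
print(hits("Hyvää päivää".encode("latin-1")))  # ['RULE_LATIN1']
```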

- With normalize_charset 1, neither rule hits a latin1 message unless the
message declares ISO-8859-1 in its Content-Type??

- The HTML parser apparently assumes everything is UTF8, so only UTF8 rules
match?

One can't even use simple workarounds such as "body RULE_FOO /p.iv../" to match
umlauts (diacritics) in UTF8 messages, because each umlaut is encoded as two
bytes and "." matches only one.
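
The effect is easy to reproduce outside SpamAssassin. A short Python sketch, assuming the rule is matched against raw bytes; Python's `re` module stands in for the Perl regex engine here.

```python
import re

workaround = re.compile(rb"p.iv..")  # one "." per umlaut

latin1_body = "päivää".encode("latin-1")  # each ä is one byte (0xE4)
utf8_body = "päivää".encode("utf-8")      # each ä is two bytes (0xC3 0xA4)

print(bool(workaround.search(latin1_body)))  # True
print(bool(workaround.search(utf8_body)))    # False: "." consumes only 0xC3

# Doubling every "." matches the UTF-8 bytes but no longer the Latin-1 ones.
doubled = re.compile(rb"p..iv....")
print(bool(doubled.search(utf8_body)))       # True
print(bool(doubled.search(latin1_body)))     # False

# Matching decoded text, "." covers a whole character again.
print(bool(re.search(r"p.iv..", utf8_body.decode("utf-8"))))  # True
```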

Let's not even get into other things yet, like sa-compile (bug 7645), textcat
etc., which all expect a correct encoding in order to work.

Unless people want to use multiple rules to match non-utf8 and utf8 messages,
perhaps the only sane solution would be to "upgrade" all non-utf8 rules to utf8
internally and do the matching to utf8 upgraded body. In such case the two
rules above would actually be duplicates and work on any message.
