Re: [Help] bodyre in hashbl
On 18/05/2021 15:27, Henrik K wrote:

> Instead of \x{00E0}, you need to use \xC3\xA0 as you are matching _separate_
> raw bytes. (untested, but assuming so from the url, too busy to test)

Yes, it works. I was confused; the SpamAssassin documentation is right.
I really do have to use non-capturing groups in order to match the UTF-8 characters, which makes for a very long regexp!

/([àèìòù])/ -->
/([(?:\xE0|\xC3\xA0)(?:\xE8|\xC3\xA8)(?:\xEC|\xC3\xAC)(?:\xF2|\xC3\xB2)(?:\xF9|\xC3\xB9)(?:\xC0|\xC3\x80)(?:\xC8|\xC3\x88)(?:\xCC|\xC3\x8C)(?:\xD2|\xC3\x92)(?:\xD9|\xC3\x99)])/

Thank you very much
Kind Regards
Marco
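One detail about the final pattern above: inside a character class `[...]`, group syntax has no special meaning, so `(?:\xE0|\xC3\xA0)` simply adds the individual bytes `(`, `?`, `:`, `|`, `)`, `\xE0`, `\xC3` and `\xA0` to the set. The class therefore matches one byte at a time rather than a two-byte UTF-8 sequence. The following is a minimal sketch in Python, used here only for illustration (SpamAssassin rules are Perl regexes, and the shortened patterns and sample input are my own, not from the thread):

```python
import re

# Bytes regexes, mirroring the "plain byte strings" matching described in the thread.
# Shortened to two letters (à, è) to keep the example readable.
in_class = re.compile(rb'([(?:\xE0|\xC3\xA0)(?:\xE8|\xC3\xA8)])')
alternation = re.compile(rb'((?:\xE0|\xC3\xA0|\xE8|\xC3\xA8))')

utf8_a_grave = 'à'.encode('utf-8')  # two bytes: C3 A0

# The class matches exactly one byte, so only the lead byte is captured.
class_hit = in_class.search(utf8_a_grave).group(1)

# The class also accepts the punctuation that was used to write the groups.
class_matches_paren = in_class.search(b'(') is not None

# The alternation captures the whole two-byte sequence and rejects '('.
alt_hit = alternation.search(utf8_a_grave).group(1)
alt_matches_paren = alternation.search(b'(') is not None

print(class_hit, class_matches_paren, alt_hit, alt_matches_paren)
```

If the class is followed by `+`, as in the HashBL word pattern earlier in the thread, the one-byte-at-a-time behaviour is usually harmless because consecutive bytes are collected anyway; the alternation form is the stricter choice when only whole characters should match.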
Re: [Help] bodyre in hashbl
On Tue, May 18, 2021 at 03:04:12PM +0200, Marco wrote:
>
> Hello Henrik,
>
> thank you for the hints. I didn't realize that SA doesn't support UTF8
> regex. Well. As you suggest, I would like to write encoding-independent
> rules in order to avoid surprises. I tried, it doesn't work...
>
> I have normalize_charset 1.
> My text body is "Ciao, è proprio eccoci là si fa\nciao"
>
> With
> ([\d\S\x{00E0}\x{c3a0}\x{00E8}\x{c3a8}\x{00EC}\x{c3ac}\x{00F2}\x{c3b2}\x{00F9}\x{c3b9}\x{00C0}\x{c380}\x{00C8}\x{c388}\x{00CC}\x{c38c}\x{00D2}\x{c392}\x{00D9}\x{c399}]+)

This is still UTF8/Unicode format: \x{}

https://www.fileformat.info/info/unicode/char/00e0/index.htm

Instead of \x{00E0}, you need to use \xC3\xA0 as you are matching _separate_
raw bytes. (untested, but assuming so from the url, too busy to test)
Re: [Help] bodyre in hashbl
On 17/05/2021 18:12, Henrik K wrote:

> On Mon, May 17, 2021 at 03:02:57PM +0200, Marco wrote:
> > So I have to add the accented character literally.
> > I can't understand why. Are there any limitations in the HashBL plugin with UTF8?
> > Maybe I have misunderstood something.
>
> SA doesn't support UTF8 regex. It's just matching plain byte strings.
>
> Depends on normalize_charset setting too, for best compatibility you should
> match both latin and utf-8 raw byte variants:
>
> ü -> (?:\xfc|\xc3\xbc)
>
> https://cwiki.apache.org/confluence/display/SPAMASSASSIN/WritingRulesAdvanced

Hello Henrik,

thank you for the hints. I didn't realize that SA doesn't support UTF8 regex. Well. As you suggest, I would like to write encoding-independent rules in order to avoid surprises. I tried, it doesn't work...

I have normalize_charset 1.
My text body is "Ciao, è proprio eccoci là si fa\nciao"

With
([\d\S\x{00E0}\x{c3a0}\x{00E8}\x{c3a8}\x{00EC}\x{c3ac}\x{00F2}\x{c3b2}\x{00F9}\x{c3b9}\x{00C0}\x{c380}\x{00C8}\x{c388}\x{00CC}\x{c38c}\x{00D2}\x{c392}\x{00D9}\x{c399}]+)

I see:
dbg: HashBL: __HASHBL_III_SPAM3: matches found: 'ciao,', 'è', 'proprio', 'eccoci', 'l▒', 'si', 'fa', 'ciao'

'là' seems to have been badly encoded as 'l▒', so the hash doesn't match.

If I write the characters literally:
([\d\Sàèìòù]+)

I see:
dbg: HashBL: __HASHBL_III_SPAM3: matches found: 'ciao,', 'è', 'proprio', 'eccoci', 'là', 'si', 'fa', 'ciao'

Now 'là' is encoded correctly and the hash matches.

Thank you very much
Kind Regards
Marco
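A possible explanation for the truncated 'l▒' (my own reading, not confirmed anywhere in the thread): in Perl, a pattern that mentions a code point above 255, such as `\x{c3a0}`, is compiled with Unicode rules, and under those rules `\s` matches U+00A0 (no-break space). Byte 0xA0 is also the second byte of UTF-8 'à' (C3 A0), so `\S` would stop after 0xC3, whereas 'è' (C3 A8) is unaffected. The sketch below imitates that effect in Python, which is used only for illustration; the Latin-1 decoding step is an assumption standing in for Perl treating the raw bytes as code points:

```python
import re

body = 'là si fa'.encode('utf-8')  # b'l\xc3\xa0 si fa'

# Byte semantics: \S means "not ASCII whitespace", so 0xA0 stays in the word.
byte_words = re.findall(rb'[\d\S]+', body)

# Unicode semantics over the same bytes viewed as Latin-1 code points:
# U+00A0 counts as whitespace, so the first word is cut after 0xC3.
unicode_words = re.findall(r'[\d\S]+', body.decode('latin-1'))
unicode_bytes = [w.encode('latin-1') for w in unicode_words]

print(byte_words)
print(unicode_bytes)
```

If this reading is right, it would also explain why the literal `àèìòù` version works: the rule file supplies the raw bytes, including 0xA0, directly inside the class.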
Re: [Help] bodyre in hashbl
On Mon, May 17, 2021 at 07:12:47PM +0300, Henrik K wrote:
>
> Or check the replace_tags in 25_replace.cf, there are ready-made templates for
> characters (but they match some commonly obfuscated variants too).

And yeah sorry, these won't work with HashBL, they're just for basic rules.
Re: [Help] bodyre in hashbl
On Mon, May 17, 2021 at 03:02:57PM +0200, Marco wrote:
>
> So I have to add the accented character literally.
> I can't understand why. Are there any limitations in the HashBL plugin with UTF8?
> Maybe I have misunderstood something.

SA doesn't support UTF8 regex. It's just matching plain byte strings.

Depends on the normalize_charset setting too; for best compatibility you should
match both latin and utf-8 raw byte variants:

ü -> (?:\xfc|\xc3\xbc)

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/WritingRulesAdvanced

Or check the replace_tags in 25_replace.cf, there are ready-made templates for
characters (but they match some commonly obfuscated variants too).
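Writing the two-variant fragments by hand gets tedious for a whole alphabet, so they can be generated. This is a minimal sketch in Python; the helper name `both_encodings` is hypothetical and not part of SpamAssassin, and it only handles characters that exist in Latin-1:

```python
import re

def both_encodings(ch: str) -> str:
    """Return a regex fragment matching ch as Latin-1 or UTF-8 raw bytes.

    Raises UnicodeEncodeError for characters outside Latin-1.
    """
    def hexbytes(raw: bytes) -> str:
        # Spell each byte as a \xNN escape.
        return ''.join(f'\\x{b:02x}' for b in raw)

    latin1 = hexbytes(ch.encode('latin-1'))
    utf8 = hexbytes(ch.encode('utf-8'))
    return f'(?:{latin1}|{utf8})'

fragment = both_encodings('ü')
print(fragment)  # (?:\xfc|\xc3\xbc)

# Check the generated fragment against both raw encodings of 'à'.
pattern = re.compile(both_encodings('à').encode('ascii'))
latin1_ok = pattern.fullmatch('à'.encode('latin-1')) is not None
utf8_ok = pattern.fullmatch('à'.encode('utf-8')) is not None
print(latin1_ok, utf8_ok)
```

The printed fragment can be pasted into a rule as-is, since `\xNN` escapes mean the same thing in Perl regexes.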