Re: How does SpamAssassin processing languages other than English

2016-04-13 Thread Yu Qian
Cool, thanks guys. I think I have a good sense of how SpamAssassin works
now. We are working on a spam project, and it's great to have SpamAssassin.

---
Yu Qian
Ottawa Ontario
Phone: (514)-553-0198



On Wed, Apr 13, 2016 at 8:21 AM, RW  wrote:

> On Tue, 12 Apr 2016 14:15:50 -0400
> Dianne Skoll wrote:
>
> > On Tue, 12 Apr 2016 13:41:51 -0400
> > Yu Qian  wrote:
> >
> > > Yup, that's right, it becomes difficult if we want to support
> > > multiple languages in one spam detection solution. And it's true
> > > that there are some best practices for a single language, but I
> > > didn't see much support for multiple
> >
> > The only practical approach is to normalize everything into Unicode
> > and tokenize Unicode characters.  (We actually use UTF-8 as the
> > on-disk representation.)
> >
> > We have a custom Bayes engine that treats any character in the CJK
> > Unified Ideographs range as a word.  This is not strictly correct
> > because there are two-character (and longer) CJK words, but it's close
> > enough,
>
> What happens in mainstream SpamAssassin is that if a word is over 15
> bytes long then 3 and 4 byte UTF-8 characters are extracted as tokens in
> place of the original word. Everything can be normalized to UTF-8 with
> "normalize_charset 1"
>
> This will likely work fairly well for CJK, but won't work well for any 3
> or 4 byte UTF-8 alphabet that isn't composed of ideograms (unless
> it's only in spam). This includes most Asian and African languages.
>
> I think the best solution to this is simply to retain the original
> long-word as a token - or to allow it as an option.
>
> Setting normalize_charset also helps with custom rules if you edit them
> as UTF-8, but it's important to remember that SA sees a multibyte
> character as a sequence of bytes rather than a single character. For
> example, you can't put a non-ASCII character between square brackets.
>


Re: [OT] still configuring [Was: Disabling spamcop plugin]

2016-04-13 Thread Ian Zimmerman
On 2016-04-13 09:12 -0400, Michael Orlitzky wrote:

> package will be recompiled automatically as part of the updates. Any
> packages *depending on* that package (like, if they're statically linked
> to it) will also be recompiled.

But also _direct_ dependencies of the affected package, if the latest
version has new requirements.  And this is the heart of the problem.
With a dedicated security channel like debian has, the fixes are
recompiled targeted to the base release, so (for example) I'd never have
to update perl because of a fix in spamassassin.

In fact you can leave debian servers to update themselves unattended,
most of the time.  This is too huge a benefit for me to drop, even
weighed against the recent debian annoyances.

-- 
Please *no* private copies of mailing list or newsgroup messages.
Rule 420: All persons more than eight miles high to leave the court.


Re: Disabling spamcop plugin

2016-04-13 Thread Michael Orlitzky
On 04/13/2016 09:50 AM, Reindl Harald wrote:
> 
> enough problems by wasting time if you have to maintain 10, 20, 30 or 
> more servers and in case of problems need fast downgrades - especially 
> if you run virtual machines where all the compile jobs share hardware

emerge --buildpkg will create a binary package that you can instantly
downgrade to with emerge --usepkg


> besides that on a production server no compilers should be installed at 
> all - the generation of malware which compiles itself is only a question 
> of time

I'm not convinced that an attacker who can execute commands on your
server is more dangerous when one of those commands is `gcc`.


> 
> what gentoo would need to solve for professional environments is that 
> you have one machine which pulls the updates, compiles them and packages 
> them in a way all other machines in the network can pull and apply them 
> in precompiled form over ftp, http or whatever network protocol
> 

As you wish:

  https://wiki.gentoo.org/wiki/Binary_package_guide



Re: Disabling spamcop plugin

2016-04-13 Thread Alarig Le Lay
On Wed Apr 13 15:50:27 2016, Reindl Harald wrote:
> enough problems by wasting time if you have to maintain 10, 20, 30 or more
> servers and in case of problems need fast downgrades - especially if you run
> virtual machines where all the compile jobs share hardware
> 
> besides that on a production server no compilers should be installed at all
> - the generation of malware which compiles itself is only a question of time
> 
> what gentoo would need to solve for professional environments is that you
> have one machine which pulls the updates, compiles them and packages them in
> a way all other machines in the network can pull and apply them in
> precompiled form over ftp, http or whatever network protocol
> 
> we are doing the same even for Fedora servers where one machine which has
> all packages installed moves them from yum/dnf-cache to a repo folder, runs
> createrepo and all other machines have only this repo enabled and so can do
> a "yum -y upgrade" which can be triggered over SSH directly from the admin
> machine with a "distribute-updates.sh" script and its own SSH key for that
> task

Hi,

When you run several dozen servers, you should use an orchestrator.
That way, you don't spend time on each server.

Also, you can have one build host per Gentoo architecture that serves
binary packages to the other servers.

-- 
alarig




Re: Disabling spamcop plugin

2016-04-13 Thread Reindl Harald



On 13.04.2016 at 15:12, Michael Orlitzky wrote:

> On 04/13/2016 01:26 AM, Ian Zimmerman wrote:
> > On 2016-04-12 10:57 -0400, David Niklas wrote:
> >
> >> You could use Gentoo, you get to configure it all yourself!
> >
> > Funny you'd say that, I _am_ actually switching to it - on my
> > "workstation" role computers.  I'm already over 50% over the hump, I
> > think.
> >
> > But on "server type" computers, I just cannot spare a dedicated security
> > branch.  I really don't have the time, and more importantly the nerves,
> > to scramble and recompile the world when each new vulnerability is
> > announced.
>
> This shouldn't be worse on Gentoo than it is anywhere else. We have a
> mailing list, gentoo-announce [0], where security advisories get sent.
> But, they only get sent out once the vulnerability has been fixed and
> marked stable /everywhere/, so they often come a little late.
> Nevertheless, security issues are fixed ASAP:
>
>   1. Some vulnerability is found.
>
>   2. The security team opens a bug, and contacts the maintainer of the
>      affected package.
>
>   3. A fix is committed to the tree.
>
>   4. The arch teams scramble to stabilize the version with the fix.
>
>   5. The announcement is sent out.
>
> As long as you follow a semi-regular update cycle, you shouldn't have to
> do anything special, even if you run a stable system. The affected
> package will be recompiled automatically as part of the updates. Any
> packages *depending on* that package (like, if they're statically linked
> to it) will also be recompiled. No need to recompile @world.


enough problems by wasting time if you have to maintain 10, 20, 30 or 
more servers and in case of problems need fast downgrades - especially 
if you run virtual machines where all the compile jobs share hardware


besides that on a production server no compilers should be installed at 
all - the generation of malware which compiles itself is only a question 
of time


what gentoo would need to solve for professional environments is that 
you have one machine which pulls the updates, compiles them and packages 
them in a way all other machines in the network can pull and apply them 
in precompiled form over ftp, http or whatever network protocol


we are doing the same even for Fedora servers where one machine which 
has all packages installed moves them from yum/dnf-cache to a repo 
folder, runs createrepo and all other machines have only this repo 
enabled and so can do a "yum -y upgrade" which can be triggered over SSH 
directly from the admin machine with a "distribute-updates.sh" script 
and its own SSH key for that task






Re: [OT] still configuring [Was: Disabling spamcop plugin]

2016-04-13 Thread Michael Orlitzky
On 04/13/2016 01:26 AM, Ian Zimmerman wrote:
> On 2016-04-12 10:57 -0400, David Niklas wrote:
> 
>> You could use Gentoo, you get to configure it all yourself!
> 
> Funny you'd say that, I _am_ actually switching to it - on my
> "workstation" role computers.  I'm already over 50% over the hump, I
> think. 
> 
> But on "server type" computers, I just cannot spare a dedicated security
> branch.  I really don't have the time, and more importantly the nerves,
> to scramble and recompile the world when each new vulnerability is
> announced.
> 

This shouldn't be worse on Gentoo than it is anywhere else. We have a
mailing list, gentoo-announce [0], where security advisories get sent.
But, they only get sent out once the vulnerability has been fixed and
marked stable /everywhere/, so they often come a little late.
Nevertheless, security issues are fixed ASAP:

  1. Some vulnerability is found.

  2. The security team opens a bug, and contacts the maintainer of the
 affected package.

  3. A fix is committed to the tree.

  4. The arch teams scramble to stabilize the version with the fix.

  5. The announcement is sent out.

As long as you follow a semi-regular update cycle, you shouldn't have to
do anything special, even if you run a stable system. The affected
package will be recompiled automatically as part of the updates. Any
packages *depending on* that package (like, if they're statically linked
to it) will also be recompiled. No need to recompile @world.


[0] https://www.gentoo.org/get-involved/mailing-lists/



Re: How does SpamAssassin processing languages other than English

2016-04-13 Thread RW
On Tue, 12 Apr 2016 14:15:50 -0400
Dianne Skoll wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian  wrote:
> 
> > Yup, that's right, it becomes difficult if we want to support
> > multiple languages in one spam detection solution. And it's true
> > that there are some best practices for a single language, but I
> > didn't see much support for multiple
> 
> The only practical approach is to normalize everything into Unicode
> and tokenize Unicode characters.  (We actually use UTF-8 as the
> on-disk representation.)
> 
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word.  This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough,
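
A minimal Python sketch of the per-ideograph tokenizing described in the
quote above. It is not the custom engine itself: the range U+4E00..U+9FFF
is only the basic CJK Unified Ideographs block (the extension blocks are
left out), the input is assumed to be decoded to Unicode already, and the
name `tokenize` is purely illustrative.

```python
import re
from typing import List

# One token per CJK Unified Ideograph (basic block U+4E00..U+9FFF only),
# otherwise one token per run of non-whitespace, non-CJK characters.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")


def tokenize(text: str) -> List[str]:
    """Split already-decoded (Unicode) text into Bayes-style tokens."""
    return TOKEN_RE.findall(text)
```

Calling `tokenize("buy 便宜 now")` gives the two ideographs as separate
tokens alongside "buy" and "now", which is the "close enough"
approximation: a two-character CJK word ends up as two tokens.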

What happens in mainstream SpamAssassin is that if a word is over 15
bytes long then 3 and 4 byte UTF-8 characters are extracted as tokens in
place of the original word. Everything can be normalized to UTF-8 with 
"normalize_charset 1"

This will likely work fairly well for CJK, but won't work well for any 3
or 4 byte UTF-8 alphabet that isn't composed of ideograms (unless
it's only in spam). This includes most Asian and African languages. 

I think the best solution to this is simply to retain the original
long-word as a token - or to allow it as an option.
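
To make the difference concrete, here is a rough Python sketch of the
behaviour described above, plus the proposed option. It follows the
description in this message, not the actual SpamAssassin source.
`tokenize_word` and `keep_long_word` are made-up names, and what happens
to long words that contain no 3 or 4 byte characters is not covered
above, so the sketch simply emits nothing for them.

```python
from typing import List

MAX_TOKEN_BYTES = 15  # threshold mentioned above


def tokenize_word(word: str, keep_long_word: bool = False) -> List[str]:
    """Tokenize one whitespace-delimited word, measured in UTF-8 bytes."""
    if len(word.encode("utf-8")) <= MAX_TOKEN_BYTES:
        return [word]
    # Long word: extract every character that needs 3 or 4 bytes in UTF-8.
    tokens = [ch for ch in word if len(ch.encode("utf-8")) >= 3]
    if keep_long_word:
        # Hypothetical option: also retain the original word as a token.
        tokens.append(word)
    return tokens
```

With a Thai word such as "สวัสดีครับ" (3 bytes per character) the default
call returns only the individual letters, which say very little about
spamminess. With `keep_long_word=True` the word itself survives as a
token as well.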

Setting normalize_charset also helps with custom rules if you edit them
as UTF-8, but it's important to remember that SA sees a multibyte
character as a sequence of bytes rather than a single character. For
example, you can't put a non-ASCII character between square brackets.
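
A small illustration of the square-bracket problem, using Python
byte-string regexes as a stand-in for how SA sees the message. SA rules
are Perl regexes, so this is an analogy and not SA code, and the sample
text is made up.

```python
import re

euro = "€".encode("utf-8")  # three bytes: e2 82 ac
body = "prix 100€ seulement".encode("utf-8")

# Bracketed: the class contains three separate bytes, not one character.
byte_class = re.compile(b"[" + euro + b"]")
# Grouped: the three bytes must appear together, in order.
byte_group = re.compile(b"(?:" + re.escape(euro) + b")")

# U+201A (single low-9 quotation mark) encodes as e2 80 9a, so it shares
# its first byte with the euro sign.
other = "\u201a".encode("utf-8")

print(byte_class.findall(body))              # each byte matched on its own
print(byte_class.search(other) is not None)  # True: false positive
print(byte_group.search(other) is not None)  # False
```

So in a rule, a multibyte character is safer in a group or alternation
than inside a bracketed class.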