Re: would SA benefit from port to Java

2006-11-26 Thread Nix
On 26 Nov 2006, [EMAIL PROTECTED] told this:
 From: Nix [EMAIL PROTECTED]

 On 20 Nov 2006, Giampaolo Tomassoni spake thusly:

 That's not even mentioning the metaprogramming and higher-order
 programming techniques that we use extensively in SpamAssassin -- those
 are basically *just not possible* in C/C++. ;)

 Ops. What's this stuff? Let me know.
 eval and all that it implies. Compiling and executing code at runtime.
 Calling functions by name. That sort of thing.
 Of course you *can* do it in C/C++. The traditional method is to write
 an interpreter or JIT-compiler for another language and do whatever-it-
 is in that other language.

 Of course, you can do that from fast C/C++ code using exec() or, perhaps
 better, a spamc/spamd trick to call these fancy perl facilities. But that's
 cheating, isn't it?

Yeah, you could chatter with perl over a pipe, or just embed perl. Or just,
y'know, use perl :)

 Language wars get boring, ya know.

Never! Rewrite SpamAssassin in Objective Caml, Curry, and Cayenne now!
It'll be so much more efficient afterwards, and the hair our users
lose trying to install the myriad loony interpreters will *help* them
in the long run!

-- 
`The main high-level difference between Emacs and (say) UNIX, Windows,
 or BeOS... is that Emacs boots quicker.' --- PdS


Re: would SA benefit from port to Java

2006-11-26 Thread Nix
On 26 Nov 2006, Tom Allison uttered the following:
 I could see doing something in C/C++ but definitely not Java...
 Similarly, for performance reasons I would stay away from Ruby.

The performance that matters for SA is the performance of the regular
expression matcher. That's the only part that comes close to loading the
CPU, and even that makes little difference to scan time unless net tests
are disabled.



RE: would SA benefit from port to Java

2006-11-26 Thread Giampaolo Tomassoni
From: Nix [mailto:[EMAIL PROTECTED]
 
 On 26 Nov 2006, [EMAIL PROTECTED] told this:
  From: Nix [EMAIL PROTECTED]
 
  On 20 Nov 2006, Giampaolo Tomassoni spake thusly:
 
  That's not even mentioning the metaprogramming and higher-order
  programming techniques that we use extensively in SpamAssassin -- those
  are basically *just not possible* in C/C++. ;)
 
  Ops. What's this stuff? Let me know.
  eval and all that it implies. Compiling and executing code at runtime.
  Calling functions by name. That sort of thing.
  Of course you *can* do it in C/C++. The traditional method is to write
  an interpreter or JIT-compiler for another language and do whatever-it-
  is in that other language.
 
  Of course, you can do that from fast C/C++ code using exec() or, perhaps
  better, a spamc/spamd trick to call these fancy perl facilities. But that's
  cheating, isn't it?
 
 Yeah, you could chatter with perl over a pipe, or just embed perl. Or just,
 y'know, use perl :)

OK. Just a note: I didn't mean writing a perl interpreter, just supplying
a simple SOA parser and ExpressionBTree builder to a hypothetical rule
compiler for an even more hypothetical SAC++ daemon. Most of the rules in my
rule body follow a simple SOA pattern. I didn't mean to embed perl into it,
nor to exec SA or otherwise use actual SA (apart from the rules): that would be
patently stupid.

giampaolo 

 
  Language wars get boring, ya know.
 
 Never! Rewrite SpamAssassin in Objective Caml, Curry, and Cayenne now!
 It'll be so much more efficient afterwards, and the hair our users
 lose trying to install the myriad loony interpreters will *help* them
 in the long run!
 



Re: would SA benefit from port to Java

2006-11-25 Thread Nix
On 20 Nov 2006, Giampaolo Tomassoni spake thusly:

 That's not even mentioning the metaprogramming and higher-order
 programming techniques that we use extensively in SpamAssassin -- those
 are basically *just not possible* in C/C++. ;)

 Ops. What's this stuff? Let me know.

eval and all that it implies. Compiling and executing code at runtime.
Calling functions by name. That sort of thing.

Of course you *can* do it in C/C++. The traditional method is to write
an interpreter or JIT-compiler for another language and do whatever-it-
is in that other language.

Conveniently Larry Wall et al have done that for us: the `other
language' is of course perl.
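
As a rough illustration of the two facilities named above (compiling code at runtime and calling functions by name), here is a minimal sketch. It is written in Python rather than perl purely for illustration; the rule names, the rule expressions and the Checks class are made up and are not SpamAssassin code.

```python
# Hypothetical rule table: names and expressions are invented for this sketch.
RULE_SOURCES = {
    "has_viagra": "'viagra' in text.lower()",
    "all_caps": "text.isupper()",
}


def compile_rules(sources):
    """Turn rule source strings into callables at runtime (the 'eval' part).

    Never do this with untrusted input: eval runs arbitrary code.
    """
    rules = {}
    for name, expr in sources.items():
        code = compile(f"lambda text: {expr}", f"<rule {name}>", "eval")
        rules[name] = eval(code)
    return rules


class Checks:
    """Methods that are looked up by name at runtime."""

    def check_short(self, text):
        return len(text) < 10


def call_by_name(obj, name, *args):
    # getattr resolves the method from a string, like calling a sub by name
    return getattr(obj, name)(*args)


rules = compile_rules(RULE_SOURCES)
print(rules["has_viagra"]("Buy VIAGRA now"))
print(call_by_name(Checks(), "check_short", "hi"))
```

In C or C++ the same effect needs an embedded interpreter or a hand-built dispatch table, which is the point being made above.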



Re: would SA benefit from port to Java

2006-11-25 Thread jdow

From: Nix [EMAIL PROTECTED]


On 20 Nov 2006, Giampaolo Tomassoni spake thusly:


That's not even mentioning the metaprogramming and higher-order
programming techniques that we use extensively in SpamAssassin -- those
are basically *just not possible* in C/C++. ;)


Ops. What's this stuff? Let me know.


eval and all that it implies. Compiling and executing code at runtime.
Calling functions by name. That sort of thing.

Of course you *can* do it in C/C++. The traditional method is to write
an interpreter or JIT-compiler for another language and do whatever-it-
is in that other language.


Of course, you can do that from fast C/C++ code using exec() or, perhaps
better, a spamc/spamd trick to call these fancy perl facilities. But that's
cheating, isn't it?

Language wars get boring, ya know.
{^_^}


Re: would SA benefit from port to Java

2006-11-25 Thread Tom Allison

Nix wrote:

On 20 Nov 2006, Giampaolo Tomassoni spake thusly:


That's not even mentioning the metaprogramming and higher-order
programming techniques that we use extensively in SpamAssassin -- those
are basically *just not possible* in C/C++. ;)

Ops. What's this stuff? Let me know.


eval and all that it implies. Compiling and executing code at runtime.
Calling functions by name. That sort of thing.

Of course you *can* do it in C/C++. The traditional method is to write
an interpreter or JIT-compiler for another language and do whatever-it-
is in that other language.

Conveniently Larry Wall et al have done that for us: the `other
language' is of course perl.



Hmmm...

Java's a memory pig.

I could see doing something in C/C++ but definitely not Java...
Similarly, for performance reasons I would stay away from Ruby.


Re: would SA benefit from port to Java

2006-11-21 Thread Justin Mason

Giampaolo Tomassoni writes:
Recently in the perl blead code, one of the perl hackers has
added a trie-based regexp matcher (with Aho-Corasick
optimisations) to efficiently match multiple regular expressions
in parallel, to the perl core regexp matching code.  That's pretty
much what you're describing,
 
 Just to know, do you mean this?
 
   http://search.cpan.org/~dankogai/Regexp-Trie-0.02/lib/Regexp/Trie.pm
 
 Else, what's the perl blead code?

Blead is what the perl developers call the main development branch of
perl5, which you can rsync live from the perl perforce server; cf:

http://www.opensubscriber.com/message/dev@spamassassin.apache.org/712879.html

see also: http://taint.org/tag/tries , http://taint.org/tag/aho-corasick

You were also asking:

  That's not even mentioning the metaprogramming and higher-order
  programming techniques that we use extensively in SpamAssassin -- those
  are basically *just not possible* in C/C++. ;)
 
 Ops. What's this stuff? Let me know.

http://en.wikipedia.org/wiki/Metaprogramming
http://en.wikipedia.org/wiki/Higher-order_programming
http://hop.perl.plover.com/ (which I haven't actually read yet to
be quite honest ;)

--j.


RE: would SA benefit from port to Java

2006-11-21 Thread Giampaolo Tomassoni
From: Matt Kettler [mailto:[EMAIL PROTECTED]
 Giampaolo Tomassoni wrote:
 
  ...omissis
 
  But if we are speaking of a /10 mem*cpu factor, well, it could 
  easily be interesting, isn't it?
  
 No. I think it would be patently stupid because of the massive effort
 involved and loss of mind-power. But if you like, by all means, go for
 it, prove us all wrong..

It isn't going to be that encouraging... :)

giampaolo



Re: would SA benefit from port to Java

2006-11-21 Thread Justin Mason

That's not even mentioning the metaprogramming and higher-order
programming techniques that we use extensively in SpamAssassin -- those
are basically *just not possible* in C/C++. ;)

--j.

Matt Kettler writes:
 Giampaolo Tomassoni wrote:
  From: Matt Kettler [mailto:[EMAIL PROTECTED]

 
  That said, I agree, trying to implement SA in C++ would be a NIGHTMARE.
 
  C++ is NOT an optimal language for apps that are string-parsing intensive.
  
 
  I don't agree with this: I think there are good ways to handle strings in C++
  which are good enough for the purposes of SA and the security constraints 
  which would need to be enforced.

 I did not say there were no secure string handling methods. I said C++
 was not an optimal language for string parsing. Sure you can use STL's
 string library and gain some security.  However writing string parsing
 in C++ is a pain in the tail and results in a lot of very long and
 hard-to-maintain code. Writing string parsing in perl is easy and
 results in very compact easy-to-maintain code.
 
 I know. I write C/C++ for a living. String parsing in C++ sucks. Period.
 
 Let's see here.. let's find the last , in a string and extract all the
 characters after it as a new string..
 
 c++: Urgh.. Make a loop, compare each character, storing the most recent
 match, then do an ugly substring call using that index and length-index.
 perl: an easy-to-write regex will do this. There are probably better
 ways I don't know of.
 
 The perl code is slower, but the C++ code is hard to write and hard to
 maintain. I'm sure there's another way to do the perl code that's faster
 and comparable to C++ here. However, I've yet to see anyone do this
 operation repeatedly in C++ without ever making an off-by-one error
 somewhere.
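
For comparison, the task described above (take everything after the last comma) might look like this in a scripting language. A minimal Python sketch, not anyone's production code; the function names are hypothetical, and both variants return None when the string has no comma.

```python
import re


def after_last_comma_partition(s):
    # rpartition splits on the LAST occurrence of the separator; when there
    # is no comma the separator slot comes back empty
    _head, sep, tail = s.rpartition(",")
    return tail if sep else None


# A comma, then any run of non-comma characters up to the very end of the string
_AFTER_LAST_COMMA = re.compile(r",([^,]*)\Z")


def after_last_comma_regex(s):
    m = _AFTER_LAST_COMMA.search(s)
    return m.group(1) if m else None


for sample in ("a,b,c", "no commas here", "trailing,"):
    print(repr(sample), after_last_comma_partition(sample),
          after_last_comma_regex(sample))
```

Neither variant does any index arithmetic, which is where the off-by-one errors mentioned above tend to creep in.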
 
 

  Drawbacks to C/C++:
  - regex is not language native, added by PCRE library.
  
 
  Which is opensource as well, so it may be used. A lot of things are not 
  language-native in C/C++. That's because C/C++ is designed. It can't be 
  regarded as a language limit, however: you can easily use external 
  libraries for all the natively unsupported features.

 True, but regexes in perl are NATIVE. You can use them ANYWHERE. Even as
 a parameter to a function call. To do regexes in C++ you have to make an
 external call to a library. Have you ever used PCRE? It's a pain. You
 have to call multiple functions, one to set up the regex, and another to
 do the match. That's not so bad for the rules, but do you know how many
 little regexes are scattered around the SA code that would have to be
 broken out? Urgh.
 
 

  - Too many folks write C/C++ badly, failing to watch their memory.
  
 
  That's a problem which may afflict even perl or python programs and
  programmers. You're right: under C++, writing bad code often has
  sharper effects. But of course if you want to squeeze out more performance you
  need to trade something off. In the C/C++ case, some ease of coding would be
  traded for higher performance.
 
 

  This is substantially more likely in anything involving string handling,
  which is everything SA does.
  
 
 
 

  - C/C++ does not have many of the very nice libraries that perl has
  for DNS, SPF, IP::Country, Base64, etc, etc.
  
 
  Well, DNS and Base64 are base services which are provided anyway. They come
  in a different shape, but are still present.

 As is SPF. But I would not call any of these libraries nice.
  SPF and IP::Country would need to be somehow rewritten, of course. These
  fall under the plugin problem. It probably wouldn't be easy to replicate
  the (good) behaviour of these perl modules, but I don't think it
  would be impossible, or not worth trying.
 
  Worse, most of the Mail:: modules would need to be somehow rewritten or 
  otherwise implemented.
 
  Of course, an SA recode in C/C++ wouldn't come gratis.
 
 

  -Again, the development team is perl programmers, unless you've got
  a set of equivalent spam experts, or can prove the existing devs all
  know your proposed language, even suggesting ANY port to ANY other
  language is inane. You may as well suggest changing the spoken language
  of the documentation to something other than English. Thus far, all the
  writers speak English. Many know other spoken languages besides
  English,  but I doubt you'd find another one that they ALL speak.
  
 
  I agree with you that this would be a great problem, but it is not going to 
  be the main problem, isn't it?

 I would suggest it would be.
  Most programmers on this list seem to be very versatile about programming
  languages. Also, if you know perl, the next language you know is often 
  C/C++. That's just because C/C++ is often the first serious language you 
  learn.

 Yes, but many of the SA team do not have a programming background. They
 have a sysadmin background and learned perl to support CGI's and
 

Re: would SA benefit from port to Java

2006-11-19 Thread Justin Mason

Mark Martinec writes:
 On Friday November 17 2006 21:24, Giampaolo Tomassoni wrote:
  Besides, if there weren't SA plugins, I would prefer a C/C++ version of SA.
  Wouldn't it run better? Wouldn't it be faster, wouldn't have a smaller
  memory footprint, better reclamation, better hooks for plugins etc? :)
 
 ...and buffer overruns, dangling pointers, poor maintainability,
 playground for security holes. If SA were written in C,
 I wouldn't let it examine mail being received from 'the wild'.

+1.

having perl's taint mode, as well, makes a big difference.

--j.


Re: would SA benefit from port to Java

2006-11-19 Thread Matt Kettler
Mark Martinec wrote:
 On Friday November 17 2006 21:24, Giampaolo Tomassoni wrote:
   
 Besides, if there weren't SA plugins, I would prefer a C/C++ version of SA.
 Wouldn't it run better? Wouldn't it be faster, wouldn't have a smaller
 memory footprint, better reclamation, better hooks for plugins etc? :)
 

 ...and buffer overruns, dangling pointers, poor maintainability,
 playground for security holes. If SA were written in C,
 I wouldn't let it examine mail being received from 'the wild'.

   Mark

   

And postfix, your MTA, is written in ???

That said, I agree, trying to implement SA in C++ would be a NIGHTMARE.

C++ is NOT an optimal language for apps that are string-parsing intensive.

Drawbacks to C/C++:
- regex is not language native, added by PCRE library.
- Too many folks write C/C++ badly, failing to watch their memory.
This is substantially more likely in anything involving string handling,
which is everything SA does.
- C/C++ does not have many of the very nice libraries that perl has
for DNS, SPF, IP::Country, Base64, etc, etc.
- Again, the development team is perl programmers, unless you've got
a set of equivalent spam experts, or can prove the existing devs all
know your proposed language, even suggesting ANY port to ANY other
language is inane. You may as well suggest changing the spoken language
of the documentation to something other than English. Thus far, all the
writers speak English. Many know other spoken languages besides
English,  but I doubt you'd find another one that they ALL speak.







Re: would SA benefit from port to Java

2006-11-19 Thread Charlie Clark


On 17.11.2006 at 20:36, Eric A. Hall wrote:



Thinking about the GPL Java announcement some, and trying to imagine the
kinds of opportunities this allows for, it occurs to me that SpamAssassin
might be a natural fit for Java.


Why on earth do you come to that conclusion, and what does Java going
GPL have to do with it?


I'm just thinking out loud here, not advocating anything...



At best you are speculating rather than thinking.

Would it run better? Would it be faster, have smaller memory footprint,
better reclamation, better hooks for plugins etc? OTOH, would it be harder
to build, given the dependence of SA on perl modules?


Please do some research on programming languages and domains because one
size almost never fits all. While I personally very much dislike
perl, it is extremely well-suited to this task: text-centric, rapidly
changing. SA was the first out there, has a large body of active
developers and is extensible by rules.


Charlie
--
Charlie Clark
Helmholtzstr. 20
Düsseldorf
D- 40215
Tel: +49-211-938-5360
GSM: +49-178-782-6226





Re: would SA benefit from port to Java

2006-11-18 Thread Mark Martinec
On Saturday November 18 2006 02:05, Matt Kettler wrote:
 I also expect a lot of the memory usage is the annotation tables and
 such for regexes. It would be interesting to compare the size of spamd
 without any rules loaded against one with a stock ruleset. The
 difference between the two can't really be improved by any means other
 than using a slower regex interpreter that doesn't use tables as
 extensively.

It just happens I measured memory sizes a couple of days ago.
This was with amavisd-new, but should not be much different than
spamd, except for somewhat smaller daemon main program in clamd.
Here it goes...



From: Mark Martinec [EMAIL PROTECTED]
To: amavis-user@lists.sourceforge.net
Date: Thu, 16 Nov 2006 19:06:00 +0100
Subject: Re: [AMaViS-user] Forward mail directly to an MDA

  How limited is the RAM?
  64MB PC100 SDRAM currently.
 
  I'll try that out. If the box runs out of RAM I'll simply buy some more.
  Those old DIMMs are not very expensive anymore.

 If the machine architecture will allow it, you really need to.
 Performance goes through the floor when you start swap thrashing.
 A message that might normally take 50 seconds to process could end
 up timing out instead, ending up in the deferred queue, only to
 have the same thing happen again. If it's an old machine, you will
 probably need low density (16 chip) DIMMs. An additional 128MB might
 get you closer to a functional system.

I collected some figures, concentrating on low memory usage.
All this with amavisd-new-2.4.4, SA 3.1.7, perl 5.8.8 on FreeBSD,
although it should pretty much apply to nearby versions.

 VSZ

  2.6 MB bare-bones Perl (interactive, no program);

 18.9 MB (increase by 16.3 - basic amavisd)
 as above + running amavisd with AV i/f,
   with disabled decoders ($bypass_decode_parts=1),
   disabled cache and nanny db ($enable_db=0),
   and disabled SA: @bypass_spam_checks_maps=(1);

 38.3 MB (increase by 19.4 - barebones SA code)
 as above + SA standard code, NO RULES;

 45.5 MB (increase by 7.2 - standard SA rules)
 as above + standard SA 3.1.7 rules updated with sa-update;

 120  MB (increase by 75 - de-luxe SA rules and plugins)
 as above + mid-range SARE rules, FuzzyOCR,
   [EMAIL PROTECTED], Razor2, ...

The shown figure is virtual memory size per amavisd process.
Resident memory size depends on how tight memory is,
but could go down to maybe 60% of VSZ if truly necessary.
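
The increments and the 60% resident-size guess above can be rechecked with a few lines of arithmetic. A small Python sketch using only the figures quoted in this message; the labels are shortened, and the 0.6 factor is the rough estimate given above, not a measurement.

```python
# Cumulative VSZ figures in MB, copied from the list above
VSZ_STEPS = [
    ("bare-bones Perl", 2.6),
    ("basic amavisd", 18.9),
    ("SA code, no rules", 38.3),
    ("standard SA rules", 45.5),
    ("SARE rules and plugins", 120.0),
]


def increments(steps):
    """Return (label, MB added by this step) for each cumulative figure."""
    out = []
    prev = 0.0
    for label, total in steps:
        out.append((label, round(total - prev, 1)))
        prev = total
    return out


def resident_estimate(vsz_mb, fraction=0.6):
    """Lower-bound guess at resident size, using the 'maybe 60%' figure."""
    return round(vsz_mb * fraction, 1)


for (label, added), (_, total) in zip(increments(VSZ_STEPS), VSZ_STEPS):
    print(f"{label:24s} +{added:5.1f} MB  total {total:6.1f} MB  "
          f"resident >= ~{resident_estimate(total):5.1f} MB")
```

The last increment works out to 74.5 MB, which the list above rounds to 75.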

In anticipated setup with clamd and SA, I suggest shedding
decoders (ClamAV can do most decoding) which saves about 3.1 MB,
and BerkeleyDB cache+nanny+statistics, which saves about 2.6 MB.

With little memory one can not afford more than one or perhaps
two amavisd processes. The pre-forking environment becomes
pretty much useless, as you always pay the memory footprint price
for one parent process. It is possible to tell Net::Server to
do it all with only one process, no forking. To do so, find
the following two lines in file amavisd:

# @ISA = qw(Net::Server);
@ISA = qw(Net::Server::PreForkSimple);

and swap the commenting-out:

@ISA = qw(Net::Server);
# @ISA = qw(Net::Server::PreForkSimple);

This yields exactly one amavisd process, regardless of the
$max_servers setting (which makes it suitable also for hardcore
debugging). Make sure to adjust the maxproc in Postfix master.cf
to 1 for a smtp service that feeds mail to amavisd (README.postfix).
This hasn't been tested extensively, but appears to work.
(I'm not sure what happens after $max_requests tasks,
just in case set it to a high value).

So I would think it is possible to run on a 64 MB host
one amavisd process (no pre-forking), with all standard
SA rules and network tests (possibly with Bayes, I'm not sure),
clamd, Postfix and a basic Unix OS.
128 MB would be more advisable, and 256 MB can get quite
comfortable with two amavisd processes and some SARE rules,
and there would even be room for X11 and emacs.

Running the whole setup on a virtual host (qemu or some virtualizer)
can make adjusting memory size very easy, so it may be worth
trying the setup on a virtual host first.

  Mark


Re: would SA benefit from port to Java

2006-11-18 Thread Mark Martinec
 This was with amavisd-new, but should not be much different than
 spamd, except for somewhat smaller daemon main program in clamd.

s/clamd/spamd/


RE: would SA benefit from port to Java

2006-11-18 Thread Giampaolo Tomassoni
From: Matt Kettler [mailto:[EMAIL PROTECTED]
 1) perl has a substantial base of text parsing and utility libraries
 that no other language can match.. Java does have native regex support,
 so it has a leg up over the others,

Right, but neither language is particularly well suited to scoring a message:
both apply every rule, one after another, to the very same piece of text.

It would be interesting, instead, to invert this approach by designing a
finite state machine which is basically a pre-compiled version of the whole rule
body. You feed the message in once, and you get the results (i.e. fired rules
and/or message score).

I believe that this approach would reduce memory consumption as well as 
execution time a lot.

It would not be suitable for custom plugins, however. But all the standard 
rules (even the expensive ones in terms of computational power and memory 
footprint) would probably perform better this way.

The basic idea in the FSM model is that the pre-compiler runs only
occasionally, for instance when a rule is changed, added to or deleted from the
rule body. The pre-compiler could even optimize the resulting FSM, perhaps by
merging paths shared by different rules. The .cf file syntax would
not even need to change, and this method could even allow injecting a
new, pre-compiled rule body version into a live spamassassin.
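
A toy version of this idea, restricted to literal phrases, might look like the following Python sketch. Everything in it is hypothetical (the class names, the rule names, the phrases); it only shows rules being compiled once into a shared structure, paths being merged where rules share a prefix, and a freshly compiled rule body being swapped into a running scanner.

```python
class CompiledRules:
    """Literal rules merged into one trie; shared prefixes share nodes."""

    def __init__(self, rules):
        # rules: {rule_name: literal_phrase}
        self.root = {}
        self.node_count = 1
        for name, phrase in rules.items():
            node = self.root
            for ch in phrase.lower():
                if ch not in node:
                    node[ch] = {}
                    self.node_count += 1
                node = node[ch]
            # the None key marks "a rule ends here" and lists the rule names
            node.setdefault(None, []).append(name)

    def scan(self, text):
        """Return the set of rule names whose phrase occurs in text.

        Naive walk: restarts at every offset. A real FSM would add failure
        links so that the text is consumed exactly once.
        """
        text = text.lower()
        fired = set()
        for start in range(len(text)):
            node = self.root
            for i in range(start, len(text)):
                node = node.get(text[i])
                if node is None:
                    break
                fired.update(node.get(None, ()))
        return fired


class Scanner:
    """Holds the current compiled rule body and allows replacing it live."""

    def __init__(self, rules):
        self.compiled = CompiledRules(rules)

    def reload(self, rules):
        # Build the new rule body completely, then swap the reference
        self.compiled = CompiledRules(rules)

    def scan(self, text):
        return self.compiled.scan(text)


scanner = Scanner({"FREE_MONEY": "free money", "FREE_OFFER": "free offer"})
print(scanner.scan("Get FREE money today"))
```

Here the two phrases share the five-character prefix "free ", so the trie needs 16 nodes instead of the 21 that two separate chains would take.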

Optionally, the FSM approach could be implemented in the well-appreciated, actual
perl by means of an external perl module.

Has anybody heard or thought of something like this?

Do you believe that an FSM would really improve SA performance?

What's your point?

giampaolo



Re: would SA benefit from port to Java

2006-11-18 Thread Matt Kettler
Giampaolo Tomassoni wrote:
 From: Matt Kettler [mailto:[EMAIL PROTECTED]
   
 1) perl has a substantial base of text parsing and utility libraries
 that no other language can match.. Java does have native regex support,
 so it has a leg up over the others,
 

 Right, but neither language is particularly well suited to scoring a message:
 both apply every rule, one after another, to the very same piece of text.

 It would be interesting, instead, to invert this approach by designing a
 finite state machine which is basically a pre-compiled version of the whole
 rule body. You feed the message in once, and you get the results (i.e. fired
 rules and/or message score).

 I believe that this approach would reduce memory consumption as well as
 execution time a lot.

 It would not be suitable for custom plugins, however. But all the standard
 rules (even the expensive ones in terms of computational power and memory
 footprint) would probably perform better this way.

 The basic idea in the FSM model is that the pre-compiler runs only
 occasionally, for instance when a rule is changed, added to or deleted from
 the rule body. The pre-compiler could even optimize the resulting FSM,
 perhaps by merging paths shared by different rules. The .cf file syntax
 would not even need to change, and this method could even allow injecting
 a new, pre-compiled rule body version into a live spamassassin.

 Optionally, the FSM approach could be implemented in the well-appreciated,
 actual perl by means of an external perl module.

 Has anybody heard or thought of something like this?
   
Nope..
 Do you believe that an FSM would really improve SA performance?
   
Maybe, maybe not.. It could definitely lead to some cross-regex
optimizations, but I don't know that there are enough of them
that it would make a substantial (10%) difference.
 What's your point?
   
I am pointless :)
 giampaolo


   



Re: would SA benefit from port to Java

2006-11-18 Thread Justin Mason

Giampaolo Tomassoni writes:
 From: Matt Kettler [mailto:[EMAIL PROTECTED]
  1) perl has a substantial base of text parsing and utility libraries
  that no other language can match.. Java does have native regex
  support, so it has a leg up over the others,
 
 Right, but neither language is particularly well suited to scoring a message:
 both apply every rule, one after another, to the very same piece of text.
 
 It would be interesting, instead, to invert this approach by designing
 a finite state machine which is basically a pre-compiled version of the
 whole rule body. You feed the message in once, and you get the results
 (i.e. fired rules and/or message score).
 
 I believe that this approach would reduce memory consumption as well as
 execution time a lot.
 
 It would not be suitable for custom plugins, however. But all the
 standard rules (even the expensive ones in terms of computational
 power and memory footprint) would probably perform better this way.
 
 The basic idea in the FSM model is that the pre-compiler runs only
 occasionally, for instance when a rule is changed, added to or deleted
 from the rule body. The pre-compiler could even optimize the resulting
 FSM, perhaps by merging paths shared by different rules. The .cf file
 syntax would not even need to change, and this method could even allow
 injecting a new, pre-compiled rule body version into a live
 spamassassin.
 
 Optionally, the FSM approach could be implemented in the well-appreciated,
 actual perl by means of an external perl module.
 
 Has anybody heard or thought of something like this?
 
 Do you believe that an FSM would really improve SA performance?

Recently in the perl blead code, one of the perl hackers has added a
trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently
match multiple regular expressions in parallel, to the perl core regexp
matching code.  That's pretty much what you're describing, and I'm looking
into rewriting bits of SpamAssassin to take advantage of that (in the
jm_re2c_hacks branch).

Hopefully it will run faster than the current regexp matching system,
which is actually quite fast as it stands!  (The perl regular expression
matching engine is _very_ efficient.)

There's also an re2c-based version, which already outperforms basic
SpamAssassin by 15-20%, btw.

They almost definitely will not reduce memory usage, though. ;)
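
To make the name concrete: a minimal sketch of the classic Aho-Corasick automaton (a trie plus failure links) that reports every occurrence of several literal strings in one pass over the text. This is textbook Python for illustration only; it is not the perl core implementation and not the jm_re2c_hacks code, and it handles non-empty literal strings, not full regular expressions.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a list of non-empty literal patterns."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(set())
            state = nxt
        out[state].add(pat)

    # Breadth-first pass: depth-1 states fail to the root (state 0);
    # deeper states reuse the failure link of their parent.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # inherit matches that end at the failure target
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(sorted(find_all("ushers", automaton)))
```

The tables (one dictionary per state, plus failure and output sets) hint at why an approach like this tends to trade memory for speed.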

--j.


RE: would SA benefit from port to Java

2006-11-18 Thread Giampaolo Tomassoni
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]

 ...omissis

 Recently in the perl blead code, one of the perl hackers has added a
 trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently
 match multiple regular expressions in parallel, to the perl core regexp
 matching code.  That's pretty much what you're describing,

Yes, I think so too. I didn't know the name of such a beast. Aho-Corasick. It
should definitely work. How could something named Aho-Corasick not work? :)

Thank you for naming it.


 and I'm looking
 into rewriting bits of SpamAssassin to take advantage of that (in the
 jm_re2c_hacks branch).
 
 Hopefully it will run faster than the current regexp matching system,
 which is actually quite fast as it stands!  (The perl regular expression
 matching engine is _very_ efficient.)
 
 There's also an re2c-based version, which already outperforms basic
 SpamAssassin by 15-20%, btw.
 
 They almost definitely will not reduce memory usage, though. ;)

Mmmmh, I had the impression that all those strings being created, cloned, used,
merged and the like in order to be fed to the regexes would be one of the
reasons for the big memory usage. So, I'm wrong about this...

What's the memory-hungry piece of code, then?

giampaolo


 
 --j.



Re: would SA benefit from port to Java

2006-11-18 Thread Justin Mason

well...

I spent several years writing Java in the '90s, and am quite certain that
SpamAssassin would perform a *lot* worse if written in Java.

SpamAssassin is heavy on regular expressions, and *very* optimised for
Perl's VM.  
  
On top of that, I'm pretty sure it would be quite hard to get faster
performance out of Java *anyway*.

First off, the perl VM uses a much more CISC-like strategy than Java's,
with opcodes to implement operations like regexp matches, string
modifications, hash lookups, arrays, and so on, in a single VM opcode,
implemented in C.  That means that those operations in perl will be nearly
as fast as the equivalent C (at least if you choose the right
operations of course!).

Java, OTOH, uses opcodes that are more RISC-like, and implements much of
its core library in pure Java -- operations like HashMap lookups or regexp
matches, for example -- resulting in quite a few more pure Java ops being
required to perform them.  (At least this was the situation last time I
looked, which admittedly was JDK 1.2 or so ;).  Maybe this has changed
since then.)

For what it's worth, in my experience, Perl's performance is often as fast
as anything I could write in any other language -- at least, except for
specific, low-level bit-twiddling like the Rabin-Karp fast parallel
string matching algorithm I just hacked recently.  Perl is a *really*
nice language for performance, in my opinion.
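
For readers unfamiliar with it, Rabin-Karp matches several patterns at once by comparing a rolling hash of each text window against the patterns' hashes. A minimal Python sketch follows; it is not the implementation mentioned above, and the equal-length restriction, the base and the modulus are simplifying assumptions.

```python
def rabin_karp_multi(text, patterns, base=256, mod=1_000_000_007):
    """Find all occurrences of equal-length patterns using one rolling hash."""
    if not patterns:
        return []
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("this sketch assumes equal-length patterns")
    if m == 0 or m > len(text):
        return []

    def full_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Group patterns by hash so one lookup per window covers all of them
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(full_hash(p), set()).add(p)

    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    h = full_hash(text[:m])
    hits = []
    for i in range(len(text) - m + 1):
        candidates = by_hash.get(h)
        if candidates:
            window = text[i:i + m]
            if window in candidates:  # confirm, since hashes can collide
                hits.append((i, window))
        if i + m < len(text):
            # slide the window: drop text[i], append text[i + m]
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits


print(rabin_karp_multi("abracadabra", ["abr", "cad"]))
```

The inner loop is all integer arithmetic on single characters, which is the kind of low-level work the paragraph above singles out as the exception.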

Java's memory consumption, too, is frankly horrific compared to perl's.
Perl's garbage collection, for example, is quite deterministic -- when an
object's refcount hits zero, it is immediately freed.  Java's, OTOH,
relies on occasional GC runs -- and in my experience that can go quite
awry resulting in weird hangs at odd times.  Virtually every large Java
project I've worked on has had the odd invocation of System.gc(); thrown
in odd places because of this!  This bug has been a problem in java since
1.0, and talking to java hackers recently, they still complain about it in
current releases.
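
The refcounting behaviour described above can be observed in any refcounted runtime. A small sketch in CPython, which, like perl, frees an object as soon as its reference count reaches zero; the Message class is a hypothetical stand-in, and the cycle case is added only to show where pure refcounting stops being enough.

```python
import gc
import weakref


class Message:
    """Stand-in for a per-message object (hypothetical)."""


def freed_immediately():
    gc.disable()  # rule out the cycle collector; only refcounts act here
    try:
        msg = Message()
        probe = weakref.ref(msg)  # weak reference: does not keep msg alive
        del msg                   # last strong reference gone
        return probe() is None    # True when the object was freed on the spot
    finally:
        gc.enable()


def cycle_needs_collector():
    gc.disable()
    try:
        a, b = Message(), Message()
        a.other, b.other = b, a   # reference cycle: counts never reach zero
        probe = weakref.ref(a)
        del a, b
        survived = probe() is not None
        gc.collect()              # an explicit tracing pass reclaims the cycle
        return survived and probe() is None
    finally:
        gc.enable()


print(freed_immediately(), cycle_needs_collector())
```

This relies on CPython's implementation; other Python runtimes with a purely tracing collector may behave differently.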

so, in conclusion: go perl. ;)

--j.

Matt Kettler writes:
 Eric A. Hall wrote:
  Thinking about the GPL Java announcement some, and trying to imagine the
  kinds of opportunities this allows for, it occurs to me that SpamAssassin
  might be a natural fit for Java.
 
  I'm just thinking out loud here, not advocating anything...
 
  Would it run better? Would it be faster, have smaller memory footprint,
  better reclamation, better hooks for plugins etc? OTOH, would it be harder
  to build, given the dependence of SA on perl modules?

 There have been about 3 dozen other folks who have asked about porting SA
 to C/C++/Java/Python/Insert any other language here.
 
 In general, SA would suffer severely from a conversion to Java, or any
 other language.
 
 It all fundamentally boils down to two things:
 
 1) perl has a substantial base of text parsing and utility libraries
 that no other language can match.. Java does have native regex support,
 so it has a leg up over the others, but it still lacks many of the
 libraries that SA is so heavily entrenched in. Do you know of any
 equivalent to IP::Country::Fast, for *ANY* other language? Admittedly
 that one is not used by everyone, but the MIME parsers, base64 decoders,
 HTML parser, Net::DNS, etc would be tough to find good matches for
 without having to write/maintain your own. This kind of text
 manipulation is what perl is actually very good at, and has lots of
 support libraries for.
 
 2) Most importantly, consider that all of the existing devels that
 maintain the code are perl developers, and not all of them are Java
 developers. Poof, there goes at least some, if not all, of your
 development team down the tubes. This is by far the most significant
 hurdle. Who would we lose here, and can we afford to lose the
 spam-fighting expertise these people have?
 
 That said, I'm a C/C++/assembly developer myself, and my own personal
 reaction is why would you want to convert from one lumbering hulk of a
 language with an expensive interpreter to another lumbering hulk of a
 language with an expensive VM. And yes, I know java is JIT compiled
 not interpreted, but AFAIK this is not as different from how perl works
 as you might think. Perl code isn't strictly interpreted from scratch
 every time you pass through the same code. Perl is really compiled and
 optimized at load time into bytecode, then interpreted from that. This
 makes perl's startup much slower, but runtime isn't as slow as an
 interpreted language. As for size, perl interpreters and java VMs are
 both large.
 
 And yes, you can native compile java to machine code, but I doubt your
 gains here will be significant.
 
 My bets are on SA spending 99% of its time in regex evaluation or
 network lookups. Regex execution is VERY well optimized in both
 languages even without native compilation, so that won't be helped much,
 if at all. Network lookups are basically spending their time waiting..
 you can't wait any faster 

Re: would SA benefit from port to Java

2006-11-18 Thread Justin Mason

Giampaolo Tomassoni writes:
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 
  ...omissis
 
  Recently in the perl blead code, one of the perl hackers has added a
  trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently
  match multiple regular expressions in parallel, to the perl core regexp
  matching code.  That's pretty much what you're describing,
 
 Yes, I think so too. I didn't know the name of such a beast. Aho-Corasick. It 
 should definitely work. How could something named Aho-Corasick not work? :)
 
 Thank you for naming it.

here's more info: http://en.wikipedia.org/wiki/Aho-Corasick .
it's a nice algorithm ;)
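
For anyone who wants to see the idea, here is a rough sketch of Aho-Corasick in Python. It is not the perl core code and not what SpamAssassin does; it only handles literal strings (no regexp syntax), and the names `build_automaton` and `find_all` are made up. The point is that the text is scanned once, however many patterns there are.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie over the patterns, then add failure links breadth-first."""
    goto = [{}]      # goto[node][char] -> child node
    fail = [0]       # fail[node] -> node for the longest proper suffix still in the trie
    out = [[]]       # out[node] -> patterns that end at this node
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            node = goto[node][ch]
        out[node].append(pat)

    queue = deque(goto[0].values())   # depth-1 nodes keep fail == 0 (the root)
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix node as well.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

For example, `find_all("ushers", ["he", "she", "his", "hers"])` returns `[(1, "she"), (2, "he"), (2, "hers")]`.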

  and I'm looking
  into rewriting bits of SpamAssassin to take advantage of that (in the
  jm_re2c_hacks branch).
  
  Hopefully it will run faster than the current regexp matching system,
  which is actually quite fast as it stands!  (The perl regular expression
  matching engine is _very_ efficient.)
  
  There's also an re2c-based version, which already outperforms basic
  SpamAssassin by 15-20%, btw.
  
  They almost definitely will not reduce memory usage, though. ;)
 
 Mmmmh, I had the impression that all those strings being created, cloned, 
 used, merged and the like in spite of being fed to the regexes would be one 
 of the reasons for the big memory usage. So, I'm wrong in this...
 
 What's the memory-hungry piece of code, then?

The perl interpreter -- I think the compiled code itself is quite
memory-hungry, as far as I can see.

--j.


Re: would SA benefit from port to Java

2006-11-18 Thread Mark Martinec
On Friday November 17 2006 21:24, Giampaolo Tomassoni wrote:
 Besides, if there weren't SA plugins, I would prefer a C/C++ version of SA.
 Wouldn't it run better? Wouldn't it be faster, wouldn't have a smaller
 memory footprint, better reclamation, better hooks for plugins etc? :)

...and buffer overruns, dangling pointers, poor maintainability,
playground for security holes. If SA were written in C,
I wouldn't let it examine mail being received from 'the wild'.

  Mark


RE: would SA benefit from port to Java

2006-11-17 Thread Giampaolo Tomassoni
 Thinking about the GPL Java announcement some, and trying to imagine the
 kinds of opportunities this allows for, it occurs to me that SpamAssassin
 might be a natural fit for Java.
 
 I'm just thinking out loud here, not advocating anything...
 
 Would it run better? Would it be faster, have smaller memory footprint,
 better reclamation, better hooks for plugins etc?

It would probably run better. I wouldn't say it would work faster. I know for 
sure it would have a much bigger memory footprint... :)


 OTOH, would it be harder to build, given the dependence of SA on perl modules?

This is the main reason for not just starting with it.

Besides, if there weren't SA plugins, I would prefer a C/C++ version of SA. 
Wouldn't it run better? Wouldn't it be faster, wouldn't have a smaller memory 
footprint, better reclamation, better hooks for plugins etc? :)

giampaolo


 
 Thoughts?
 
 -- 
 Eric A. Hall                                        http://www.ehsco.com/
 Internet Core Protocols  http://www.oreilly.com/catalog/coreprot/



Re: would SA benefit from port to Java

2006-11-17 Thread Stuart Johnston

Giampaolo Tomassoni wrote:

Thinking about the GPL Java announcement some, and trying to imagine the
kinds of opportunities this allows for, it occurs to me that SpamAssassin
might be a natural fit for Java.

I'm just thinking out loud here, not advocating anything...

Would it run better?


What does that even mean?  Run better?


Re: would SA benefit from port to Java

2006-11-17 Thread Matt Kettler
Eric A. Hall wrote:
 Thinking about the GPL Java announcement some, and trying to imagine the
 kinds of opportunities this allows for, it occurs to me that SpamAssassin
 might be a natural fit for Java.

 I'm just thinking out loud here, not advocating anything...

 Would it run better? Would it be faster, have smaller memory footprint,
 better reclamation, better hooks for plugins etc? OTOH, would it be harder
 to build, given the dependence of SA on perl modules?
   
There have been about 3 dozen other folks who have asked about porting SA
to C/C++/Java/Python/Insert any other language here.

In general, SA would suffer severely from a conversion to Java, or any
other language.

It all fundamentally boils down to two things:

1) perl has a substantial base of text parsing and utility libraries
that no other language can match. Java does have native regex support,
so it has a leg up over the others, but it still lacks many of the
libraries that SA is so heavily entrenched in. Do you know of any
equivalent to IP::Country::Fast, for *ANY* other language? Admittedly
that one is not used by everyone, but the MIME parsers, base64 decoders,
HTML parser, Net::DNS, etc would be tough to find good matches for
without having to write/maintain your own. This kind of text
manipulation is what perl is actually very good at, and has lots of
support libraries for.

2) Most importantly, consider that all of the existing devels that
maintain the code are perl developers, and not all of them are Java
developers. Poof, there goes at least some, if not all, of your
development team down the tubes. This is by far the most significant
hurdle. Who would we lose here, and can we afford to lose the
spam-fighting expertise these people have?

That said, I'm a C/C++/assembly developer myself, and my own personal
reaction is why would you want to convert from one lumbering hulk of a
language with an expensive interpreter to another lumbering hulk of a
language with an expensive VM. And yes, I know java is JIT compiled
not interpreted, but AFAIK this is not as different from how perl works
as you might think. Perl code isn't strictly interpreted from scratch
every time you pass through the same code. Perl is really compiled and
optimized at load time into bytecode, then interpreted from that. This
makes perl's startup much slower, but runtime isn't as slow as an
interpreted language. As for size, perl interpreters and java VMs are
both large.

And yes, you can native compile java to machine code, but I doubt your
gains here will be significant.

My bets are on SA spending 99% of its time in regex evaluation or
network lookups. Regex execution is VERY well optimized in both
languages even without native compilation, so that won't be helped much,
if at all. Network lookups are basically spending their time waiting..
you can't wait any faster in machine code than a semi-interpreted
application.
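
A toy illustration of that last point, sketched in Python with made-up names and made-up rules. It is not how SpamAssassin measures anything: `time.sleep` merely stands in for waiting on a network reply, and the 50 ms delay is an assumption. The waiting takes the same wall-clock time no matter how fast the surrounding code runs.

```python
import re
import time

# Hypothetical rules, purely for illustration.
RULES = [
    re.compile(r"viagra", re.I),
    re.compile(r"free\s+money", re.I),
    re.compile(r"\bunsubscribe\b", re.I),
]


def scan(message, rules=RULES, lookup_delay=0.05):
    """Run the regex rules, then 'wait' for a simulated network lookup.

    Returns (matched rule sources, seconds spent matching, seconds spent waiting).
    """
    t0 = time.perf_counter()
    hits = [rule.pattern for rule in rules if rule.search(message)]
    t1 = time.perf_counter()
    time.sleep(lookup_delay)      # stand-in for a DNS blocklist round trip
    t2 = time.perf_counter()
    return hits, t1 - t0, t2 - t1


if __name__ == "__main__":
    hits, cpu, wait = scan("Get FREE  money now, click to unsubscribe")
    print(hits, f"matching: {cpu:.6f}s", f"waiting: {wait:.6f}s")
```

Making the matching step faster barely moves the total here, since nearly all of the elapsed time is the wait.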

I also expect a lot of the memory usage is the annotation tables and
such for regexes. It would be interesting to compare the size of spamd
without any rules loaded against one with a stock ruleset. The
difference between the two can't really be improved by any means other
than using a slower regex interpreter that doesn't use tables as
extensively.