Re: would SA benefit from port to Java
On 26 Nov 2006, [EMAIL PROTECTED] told this: From: Nix [EMAIL PROTECTED] On 20 Nov 2006, Giampaolo Tomassoni spake thusly: That's not even mentioning the metaprogramming and higher-order programming techniques that we use extensively in SpamAssassin -- those are basically *just not possible* in C/C++. ;) Ops. What's this stuff? Let me know. eval and all that it implies. Compiling and executing code at runtime. Calling functions by name. That sort of thing. Of course you *can* do it in C/C++. The traditional method is to write an interpreter or JIT-compiler for another language and do whatever-it- is in that other language. Of course, you can do that from fast C/C++ code using exec() or a perhaps better a spamc/spamd trick to call these fancy perl facilities. But that's cheating, isn't it? Yeah, you could chatter with perl over a pipe, or just embed perl. Or just, y'know, use perl :) Language wars get boring, ya know. Never! Rewrite SpamAssassin in Objective Caml, Curry, and Cayenne now! It'll be so much more efficient afterwards, and the hair our users lose trying to install the myriad loony interpreters will *help* them in the long run! -- `The main high-level difference between Emacs and (say) UNIX, Windows, or BeOS... is that Emacs boots quicker.' --- PdS
Re: would SA benefit from port to Java
On 26 Nov 2006, Tom Allison uttered the following:

I could see doing something in C/C++ but definitely not Java... Similarly, for performance reasons I would stay away from Ruby.

The performance that matters for SA is the performance of the regular expression matcher. That's the only part that comes close to loading the CPU, and even that makes little difference to scan time unless net tests are disabled.

-- `The main high-level difference between Emacs and (say) UNIX, Windows, or BeOS... is that Emacs boots quicker.' --- PdS
RE: would SA benefit from port to Java
From: Nix [mailto:[EMAIL PROTECTED] On 26 Nov 2006, [EMAIL PROTECTED] told this: From: Nix [EMAIL PROTECTED] On 20 Nov 2006, Giampaolo Tomassoni spake thusly: That's not even mentioning the metaprogramming and higher-order programming techniques that we use extensively in SpamAssassin -- those are basically *just not possible* in C/C++. ;) Ops. What's this stuff? Let me know. eval and all that it implies. Compiling and executing code at runtime. Calling functions by name. That sort of thing. Of course you *can* do it in C/C++. The traditional method is to write an interpreter or JIT-compiler for another language and do whatever-it-is in that other language. Of course, you can do that from fast C/C++ code using exec() or a perhaps better a spamc/spamd trick to call these fancy perl facilities. But that's cheating, isn't it? Yeah, you could chatter with perl over a pipe, or just embed perl. Or just, y'know, use perl :)

Ok. Just a note: I didn't mean to write a perl interpreter, but just to supply a simple SOA parser and ExpressionBTree builder to a hypothetical rule compiler for an even more hypothetical SAC++ daemon. Most of the rules in my rule body follow a simple SOA pattern. I didn't mean to embed perl into it, nor to exec SA or otherwise use actual SA (apart from the rules): that would be patently stupid.

giampaolo

Language wars get boring, ya know. Never! Rewrite SpamAssassin in Objective Caml, Curry, and Cayenne now! It'll be so much more efficient afterwards, and the hair our users lose trying to install the myriad loony interpreters will *help* them in the long run! -- `The main high-level difference between Emacs and (say) UNIX, Windows, or BeOS... is that Emacs boots quicker.' --- PdS
Re: would SA benefit from port to Java
On 20 Nov 2006, Giampaolo Tomassoni spake thusly: That's not even mentioning the metaprogramming and higher-order programming techniques that we use extensively in SpamAssassin -- those are basically *just not possible* in C/C++. ;) Ops. What's this stuff? Let me know. eval and all that it implies. Compiling and executing code at runtime. Calling functions by name. That sort of thing. Of course you *can* do it in C/C++. The traditional method is to write an interpreter or JIT-compiler for another language and do whatever-it- is in that other language. Conveniently Larry Wall et al have done that for us: the `other language' is of course perl. -- `The main high-level difference between Emacs and (say) UNIX, Windows, or BeOS... is that Emacs boots quicker.' --- PdS
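For readers who have not met the techniques Nix lists (compiling and executing code at runtime, calling functions by name), here is a minimal sketch. It is written in Python rather than perl purely as an illustration, and the rule names and rule bodies are invented for the example; it is not SpamAssassin code and makes no claim about how SpamAssassin uses eval internally.

```python
# Hypothetical rule table: the names and bodies are made up for illustration.
RULE_SOURCES = {
    "has_viagra": "return 'viagra' in text.lower()",
    "all_caps_subject": "return text.isupper()",
}


def compile_rule(name, body):
    """Build a function from source text at runtime and return it."""
    source = f"def {name}(text):\n    {body}\n"
    namespace = {}
    # compile() turns the string into a code object; exec() runs the 'def'.
    exec(compile(source, f"<rule {name}>", "exec"), namespace)
    return namespace[name]


# Every rule is compiled once, when the table is loaded.
RULES = {name: compile_rule(name, body) for name, body in RULE_SOURCES.items()}


def run_rule(name, text):
    """Call a rule by its name (a string) instead of a fixed reference."""
    return RULES[name](text)
```

A caller would write something like `run_rule("has_viagra", subject)`, where the rule name can come from a configuration file. Doing the same in C or C++ needs an embedded interpreter or a hand-written dispatch table, which is the point made in the message above.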
Re: would SA benefit from port to Java
From: Nix [EMAIL PROTECTED] On 20 Nov 2006, Giampaolo Tomassoni spake thusly: That's not even mentioning the metaprogramming and higher-order programming techniques that we use extensively in SpamAssassin -- those are basically *just not possible* in C/C++. ;) Ops. What's this stuff? Let me know. eval and all that it implies. Compiling and executing code at runtime. Calling functions by name. That sort of thing. Of course you *can* do it in C/C++. The traditional method is to write an interpreter or JIT-compiler for another language and do whatever-it- is in that other language. Of course, you can do that from fast C/C++ code using exec() or a perhaps better a spamc/spamd trick to call these fancy perl facilities. But that's cheating, isn't it? Language wars get boring, ya know. {^_^}
Re: would SA benefit from port to Java
Nix wrote: On 20 Nov 2006, Giampaolo Tomassoni spake thusly: That's not even mentioning the metaprogramming and higher-order programming techniques that we use extensively in SpamAssassin -- those are basically *just not possible* in C/C++. ;) Ops. What's this stuff? Let me know. eval and all that it implies. Compiling and executing code at runtime. Calling functions by name. That sort of thing. Of course you *can* do it in C/C++. The traditional method is to write an interpreter or JIT-compiler for another language and do whatever-it-is in that other language. Conveniently Larry Wall et al have done that for us: the `other language' is of course perl.

Hmmm... Java's a memory pig. I could see doing something in C/C++ but definitely not Java... Similarly, for performance reasons I would stay away from Ruby.
Re: would SA benefit from port to Java
Giampaolo Tomassoni writes: Recently in the perl blead code, one of the perl hackers has added a trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently match multiple regular expressions in parallel, to the perl core regexp matching code. That's pretty much what you're describing, Just to know, do you mean this? http://search.cpan.org/~dankogai/Regexp-Trie-0.02/lib/Regexp/Trie.pm Else, what's the perl blead code? Blead is what the perl developers call the main development branch of perl5, which you can rsync live from the perl perforce server; cf: http://www.opensubscriber.com/message/dev@spamassassin.apache.org/712879.html see also: http://taint.org/tag/tries , http://taint.org/tag/aho-corasick You were also asking: That's not even mentioning the metaprogramming and higher-order programming techniques that we use extensively in SpamAssassin -- those are basically *just not possible* in C/C++. ;) Ops. What's this stuff? Let me know. http://en.wikipedia.org/wiki/Metaprogramming http://en.wikipedia.org/wiki/Higher-order_programming http://hop.perl.plover.com/ (which I haven't actually read yet to be quite honest ;) --j.
RE: would SA benefit from port to Java
From: Matt Kettler [mailto:[EMAIL PROTECTED] Giampaolo Tomassoni wrote: ...omissis But if we are speaking of a /10 mem*cpu factor, well, it could easily be interesting, isn't it? No. I think it would be patently stupid because of the massive effort involved and loss of mind-power. But if you like, by all means, go for it, prove us all wrong.. It isn't going to be that encouraging... :) giampaolo
Re: would SA benefit from port to Java
That's not even mentioning the metaprogramming and higher-order programming techniques that we use extensively in SpamAssassin -- those are basically *just not possible* in C/C++. ;) --j. Matt Kettler writes: Giampaolo Tomassoni wrote: From: Matt Kettler [mailto:[EMAIL PROTECTED] That said, I agree, trying to implement SA in C++ would be a NIGHTMARE. C++ is NOT an optimal language for apps that are string-parsing intensive. I don't agree in this: I think there are good ways to handle strings in C++ which are good enough for the purposes of SA and the security constraints which would need to be enforced. I did not say there were no secure string handling methods. I said C++ was not an optimal language for string parsing. Sure you can use STL's string library and gain some security. However writing string parsing in C++ is a pain in the tail and results in a lot of very long and hard-to-maintain code. Writing string parsing in perl is easy and results in very compact easy-to-maintain code. I know. I write C/C++ for a living. String parsing in C++ sucks. Period. Let's see here.. let's find the last , in a string and extract all the characters after it as a new string.. c++: Urgh.. Make a loop, compare each character, storing the most recent match, then do an ugly substring call using that index and length-index. perl: an easy-to-write regex will do this. There are probably better ways I don't know of. The perl code is slower, but the C++ code is hard to write and hard to maintain. I'm sure there's another way to do the perl code that's faster and comparable to C++ here. However, I've yet to see anyone do this operation repeatedly in C++ without ever making an off-by-one error somewhere. Drawbacks to C/C++: - regex is not language native, added by PCRE library. Which is opensource as well, so it may be used. A lot of things are not language-native in C/C++. That's because C/C++ is designed. 
It can't be regarded as a language limit, however: you can easily use external libraries for all the natively unsupported features. True, but regexes in perl are NATIVE. You can use them ANYWHERE. Even as a parameter to a function call. To do regexes in C++ you have to make an external call to a library. Have you ever used PCRE? It's a pain. You have to call multiple functions, one to set up the regex, and another to do the match. That's not so bad for the rules, but do you know how many little regexes are scattered around the SA code that would have to be broken out? Urgh. - Too many folks write C/C++ badly, failing to watch their memory. That's a problem which may afflict even perl or python programs and programmers. You're right: under C++ writing bad code often results in sharper effects. But of course if you want to squeeze more performances you need to trade off something. In the C/C++ case, ease of coding would be traded a bit off in spite of higher performances. This is substantially more likely in anything involving string handling, which is everything SA does. - C/C++ does not have many of the very nice libraries that perl has for DNS, SPF, IP:Country, Base64, etc, etc. Well, DNS and Base64 are base services which are provided anyway. They came in a different shape, but still present. As is SPF. But I would not call any of these libraries nice. SPF and IP::Country would need to be somehow rewritten, of course. These falls under the plugin problem. It wouldn't be probably easy to replicate the (good) behaviour of these perl modules, but I don't even think it wouldn't be possible or even not worth to try it. Worse, most of the Mail:: modules would need to be somehow rewritten or otherwise implemented. Of course, a SA recode in C/C++ wouldn't came gratis. 
-Again, the development team is perl programmers, unless you've got a set of equivalent spam experts, or can prove the existing devs all know your proposed language, even suggesting ANY port to ANY other language is inane. You may as well suggest changing the spoken language of the documentation to something other than English. Thus far, all the writers speak English. Many know other spoken languages besides English, but I doubt you'd find another one that they ALL speak. I agree with you that this would be a great problem, but it is not going to be the main problem, isn't it? I would suggest it would be. Most programmers in this list seems to be very versatile about programming languages. Also, if you know perl, the next language you know is often C/C++. That's just because C/C++ is often the first serious language you learn. Yes, but many of the SA team do not have a programming background. They have a sysadmin background and learned perl to support CGI's and
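The task Matt describes earlier in this message (find the last comma in a string and return everything after it) can be sketched both ways. Python stands in here for both perl and C++, which neither poster used for this; the function names are hypothetical. The first function is the regex approach, the second is the hand-written loop where off-by-one errors tend to creep in.

```python
import re


def after_last_comma_regex(s):
    """Regex approach: a comma followed by non-comma characters up to the end."""
    m = re.search(r",([^,]*)$", s)
    return m.group(1) if m else None


def after_last_comma_loop(s):
    """Manual approach: scan, remember the most recent comma, then slice."""
    last = -1
    for i, ch in enumerate(s):
        if ch == ",":
            last = i
    if last == -1:
        return None
    # Slice from the character after the comma; using 'last' here instead of
    # 'last + 1' is exactly the off-by-one slip described in the message.
    return s[last + 1:]
```

Both return `None` when the string has no comma and an empty string when the comma is the final character.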
Re: would SA benefit from port to Java
Mark Martinec writes: On Friday November 17 2006 21:24, Giampaolo Tomassoni wrote: Besides, if there wasn't SA pluging, I would prefer a C/C++ version of SA. Wouldn't it run better? Wouldn't it be faster, wouldn't have a smaller memory footprint, better reclamation, better hooks for plugins etc? :) ...and buffer overruns, dangling pointers, poor maintainability, playground for security holes. If SA were written in C, I wouldn't let it examine mail being received from 'the wild'. +1. having perl's taint mode, as well, makes a big difference. --j.
Re: would SA benefit from port to Java
Mark Martinec wrote: On Friday November 17 2006 21:24, Giampaolo Tomassoni wrote: Besides, if there wasn't SA pluging, I would prefer a C/C++ version of SA. Wouldn't it run better? Wouldn't it be faster, wouldn't have a smaller memory footprint, better reclamation, better hooks for plugins etc? :) ...and buffer overruns, dangling pointers, poor maintainability, playground for security holes. If SA were written in C, I wouldn't let it examine mail being received from 'the wild'. Mark And postfix, your MTA, is written in ??? That said, I agree, trying to implement SA in C++ would be a NIGHTMARE. C++ is NOT an optimal language for apps that are string-parsing intensive. Drawbacks to C/C++: - regex is not language native, added by PCRE library. - Too many folks write C/C++ badly, failing to watch their memory. This is substantially more likely in anything involving string handling, which is everything SA does. - C/C++ does not have many of the very nice libraries that perl has for DNS, SPF, IP:Country, Base64, etc, etc. -Again, the development team is perl programmers, unless you've got a set of equivalent spam experts, or can prove the existing devs all know your proposed language, even suggesting ANY port to ANY other language is inane. You may as well suggest changing the spoken language of the documentation to something other than English. Thus far, all the writers speak English. Many know other spoken languages besides English, but I doubt you'd find another one that they ALL speak.
Re: would SA benefit from port to Java
On 17.11.2006 at 20:36, Eric A. Hall wrote:

Thinking about the GPL Java announcement some, and trying to imagine the kinds of opportunities this allows for, it occurs to me that SpamAssassin might be a natural fit for Java.

Why on earth do you come to that conclusion, and what does Java going GPL have to do with it?

I'm just thinking out loud here, not advocating anything...

At best you are speculating rather than thinking.

Would it run better? Would it be faster, have smaller memory footprint, better reclamation, better hooks for plugins etc? OTOH, would it be harder to build, given the dependence of SA on perl modules?

Please do some research on programming languages and domains, because one size almost never fits all. While I personally very much dislike perl, it is extremely well-suited to this task: text-centric, rapidly changing. SA was the first out there, has a large body of active developers and is extensible by rules.

Charlie -- Charlie Clark Helmholtzstr. 20 Düsseldorf D- 40215 Tel: +49-211-938-5360 GSM: +49-178-782-6226
Re: would SA benefit from port to Java
On Saturday November 18 2006 02:05, Matt Kettler wrote: I also expect a lot of the memory usage is the annotation tables and such for regexes. It would be interesting to compare the size of spamd without any rules loaded against one with a stock ruleset. The difference between the two can't really be improved by any means other than using a slower regex interpreter that doesn't use tables as extensively. It just happens I measured memory sizes a couple of days ago. This was with amavisd-new, but should not be much different than spamd, except for somewhat smaller daemon main program in clamd. Here it goes... From: Mark Martinec [EMAIL PROTECTED] To: amavis-user@lists.sourceforge.net Date: Thu, 16 Nov 2006 19:06:00 +0100 Subject: Re: [AMaViS-user] Forward mail directly to an MDA How limited is the RAM? 64MB PC100 SDRAM currently. I'll try that out. If the box runs out of RAM I'll simply buy some more. Those old DIMMs are not very expensive anymore. If the machine architecture will allow it, you really need to. Performance goes through the floor when you start swap thrashing. A message that might normally take 50 seconds to process could end up timing out instead, ending up in the deferred queue, only to have the same thing happen again. If it's an old machine, you will probably need low density (16 chip) DIMMs. An additional 128MB might get you closer to a functional system. I collected some figures, concentrating on low memory usage. All this with amavisd-new-2.4.4, SA 3.1.7, perl 5.8.8 on FreeBSD, although it should pretty much apply to nearby versions. 
VSZ:

  2.6 MB  bare-bones Perl (interactive, no program);
 18.9 MB  (increase by 16.3 - basic amavisd) as above + running amavisd with AV i/f, with disabled decoders ($bypass_decode_parts=1), disabled cache and nanny db ($enable_db=0), and disabled SA: @bypass_spam_checks_maps=(1);
 38.3 MB  (increase by 19.4 - barebones SA code) as above + SA standard code, NO RULES;
 45.5 MB  (increase by 7.2 - standard SA rules) as above + standard SA 3.1.7 rules updated with sa-update;
120 MB    (increase by 75 - de-luxe SA rules and plugins) as above + mid-range SARE rules, FuzzyOCR, [EMAIL PROTECTED], Razor2, ...

The shown figure is virtual memory size per amavisd process. Resident memory size depends on how tight memory is, but could go down to maybe 60% of VSZ if truly necessary.

In the anticipated setup with clamd and SA, I suggest shedding decoders (ClamAV can do most decoding), which saves about 3.1 MB, and BerkeleyDB cache+nanny+statistics, which saves about 2.6 MB.

With little memory one cannot afford more than one or perhaps two amavisd processes. The pre-forking environment becomes pretty much useless, as you always pay the memory footprint price for one parent process. It is possible to tell Net::Server to do it all with only one process, no forking. To do so, find the following two lines in file amavisd:

  # @ISA = qw(Net::Server);
  @ISA = qw(Net::Server::PreForkSimple);

and swap the commenting-out:

  @ISA = qw(Net::Server);
  # @ISA = qw(Net::Server::PreForkSimple);

This yields exactly one amavisd process, regardless of the $max_servers setting (which makes it suitable also for hardcore debugging). Make sure to adjust the maxproc in Postfix master.cf to 1 for the smtp service that feeds mail to amavisd (README.postfix). This hasn't been tested extensively, but appears to work. (I'm not sure what happens after $max_requests tasks; just in case, set it to a high value.)
So I would think it is possible to run on a 64 MB host one amavisd process (no pre-forking), with all standard SA rules and network tests (possibly with Bayes, I'm not sure), clamd, Postfix and a basic Unix OS. 128 MB would be more advisable, and 256 MB can get quite comfortable with two amavisd processes and some SARE rules, and there would even be room for X11 and emacs. Running the whole setup on a virtual host (qemu or some virtualizer) can make adjusting memory size very easy, so it may be worth trying the setup on a virtual host first. Mark
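As a quick check on the figures in this message, the increments can be added up and the "maybe 60% of VSZ" resident estimate applied. This is only arithmetic on the numbers quoted above, sketched in Python; the variable and function names are made up, and the 0.6 factor is Mark's rough guess rather than a measurement.

```python
# (component, VSZ increase in MB), copied from the figures in the message.
STEPS = [
    ("bare-bones Perl", 2.6),
    ("basic amavisd", 16.3),
    ("barebones SA code, no rules", 19.4),
    ("standard SA rules", 7.2),
    ("de-luxe SA rules and plugins", 75.0),
]


def cumulative_vsz(steps):
    """Return (component, running total in MB) for each step."""
    total = 0.0
    result = []
    for label, increase in steps:
        total += increase
        result.append((label, round(total, 1)))
    return result


def resident_estimate(vsz_mb, fraction=0.6):
    """Rough resident size, assuming the 'maybe 60% of VSZ' guess holds."""
    return round(vsz_mb * fraction, 1)
```

The running totals reproduce the table (the last step sums to 120.5, which the message rounds to 120 MB). With standard rules only, 45.5 MB of VSZ gives a resident estimate of about 27 MB per amavisd process, which is consistent with the claim that a single process can fit on a 64 MB host.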
Re: would SA benefit from port to Java
This was with amavisd-new, but should not be much different than spamd, except for somewhat smaller daemon main program in clamd. s/clamd/spamd/
RE: would SA benefit from port to Java
From: Matt Kettler [mailto:[EMAIL PROTECTED] 1) perl has a substantial base of text parsing and utility libraries that no other language can match.. Java does have native regex support, so it has a leg up over the others,

Right, but neither language is that well suited to scoring a message: they apply all the rules, one by one, to the very same piece of text. It would be interesting, instead, to invert this approach by designing a finite state machine which is basically a pre-compiled version of the whole rule body. You feed the message in once, and you get the results (i.e.: fired rules and/or message score).

I believe that this approach would reduce memory consumption as well as execution time a lot. It would not be suitable for custom plugins, however. But all the standard rules (even the expensive ones in terms of computational power and memory footprint) would probably perform better this way.

The basic idea in the FSM model is that the pre-compiler only runs occasionally, for example when a rule is changed, added to or deleted from the rule body. The pre-compiler could eventually even optimize the resulting FSM, perhaps by merging together paths shared by different rules. The .cf file syntax would not even need to be changed, and this method could even allow for injecting a new, pre-compiled rule body version into a live spamassassin.

Optionally, the FSM approach could be implemented in the well-appreciated, actual perl by use of an external perl module.

Has anybody heard or thought of something like this? Do you believe that an FSM would really improve SA performance? What's your point?

giampaolo
Re: would SA benefit from port to Java
Giampaolo Tomassoni wrote: From: Matt Kettler [mailto:[EMAIL PROTECTED] 1) perl has a substantial base of text parsing and utility libraries that no other language can match.. Java does have native regex support, so it has a leg up over the others, Right, but neither language is that well suited to scoring a message: they apply all the rules, one by one, to the very same piece of text. It would be interesting, instead, to invert this approach by designing a finite state machine which is basically a pre-compiled version of the whole rule body. You feed the message in once, and you get the results (i.e.: fired rules and/or message score). I believe that this approach would reduce memory consumption as well as execution time a lot. It would not be suitable for custom plugins, however. But all the standard rules (even the expensive ones in terms of computational power and memory footprint) would probably perform better this way. The basic idea in the FSM model is that the pre-compiler only runs occasionally, for example when a rule is changed, added to or deleted from the rule body. The pre-compiler could eventually even optimize the resulting FSM, perhaps by merging together paths shared by different rules. The .cf file syntax would not even need to be changed, and this method could even allow for injecting a new, pre-compiled rule body version into a live spamassassin. Optionally, the FSM approach could be implemented in the well-appreciated, actual perl by use of an external perl module. Has anybody heard or thought of something like this?

Nope..

Do you believe that an FSM would really improve SA performance?

Maybe, maybe not.. It could definitely lead to some cross-regex optimizations, but I don't know that there are enough of them that it would make a substantial (10%) difference.

What's your point?

I am pointless :)

giampaolo
Re: would SA benefit from port to Java
Giampaolo Tomassoni writes: From: Matt Kettler [mailto:[EMAIL PROTECTED] 1) perl has a substantial base of text parsing and utility libraries that no other language can match.. Java does have native regex support, so it has a leg up over the others, Right, but both langs are not that much suited for scoring a message: they apply all the rules to the very same piece of text. It would be interesting, instead, to invert this approach by designing a finite state machine which is basicly a pre-compiled version of the whole rule body. You feed once the message in, and you get the results (i.e.: fired rules and/or message score). I believe that this approach would reduce memory consumption as well as execution time a lot. It would not be suitable for custom plugins, however. But all the standard rules (even the expensive ones in terms of computational power and memory footprint) would probably perform better this way. The basic idea in the FSM model is that the pre-compiler is going to run just sometimes, maybe when a rule gets changed, added or deleted to the rule body. The pre-compiler could eventually even optimize the resulting FSM, perhaps by merging together paths shared by different rules. The .cf files syntax would not even need to be changed and this method could even allow for injecting a new, pre-compiled rule body version into an alive spamassassin. Optionally, the FSM approach could be implemented the well-appreciated, actual perl by use of an external perl module. Did anybody heard or thought of something like this? Do you believe that an FSM would really improve SA performances? Recently in the perl blead code, one of the perl hackers has added a trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently match multiple regular expressions in parallel, to the perl core regexp matching code. That's pretty much what you're describing, and I'm looking into rewriting bits of SpamAssassin to take advantage of that (in the jm_re2c_hacks branch). 
Hopefully it will run faster than the current regexp matching system, which is actually quite fast as it stands! (The perl regular expression matching engine is _very_ efficient.) There's also an re2c-based version, which already outperforms basic SpamAssassin by 15-20%, btw. They almost definitely will not reduce memory usage, though. ;) --j.
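To make the "match many patterns in one pass" idea concrete, here is a minimal sketch of the Aho-Corasick automaton mentioned above. It is written in Python for illustration, handles literal strings only, and is not the code in perl blead, re2c, or the jm_re2c_hacks branch; all function names are hypothetical. Real rule bodies are regular expressions, which need considerably more machinery than this.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output sets for literal patterns."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    out = [set()]    # out[state] -> patterns that end at this state

    # Phase 1: insert every pattern into a shared trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(set())
            state = nxt
        out[state].add(pat)

    # Phase 2: breadth-first walk to compute failure links.
    queue = deque(goto[0].values())   # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the fallback state.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Scan text once and return (start_index, pattern) for every match."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

The automaton is built once, when the pattern set changes, and each message is then scanned a single time regardless of how many patterns there are. The tables hold a state for every distinct prefix of every pattern, which fits the remark above that this kind of approach is unlikely to reduce memory usage.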
RE: would SA benefit from port to Java
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] ...omissis Recently in the perl blead code, one of the perl hackers has added a trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently match multiple regular expressions in parallel, to the perl core regexp matching code. That's pretty much what you're describing, Yes, I think so too. I didn't know the name of such a beast. Aho-Corasick. It should definitely work. How could something named Aho-Corasick not to work? :) Thank you for naming it. and I'm looking into rewriting bits of SpamAssassin to take advantage of that (in the jm_re2c_hacks branch). Hopefully it will run faster than the current regexp matching system, which is actually quite fast as it stands! (The perl regular expression matching engine is _very_ efficient.) There's also an re2c-based version, which already outperforms basic SpamAssassin by 15-20%, btw. They almost definitely will not reduce memory usage, though. ;) Mmmmh, I had the impression that all that strings being created, cloned, used, merged and the like in spite of being fed to the regexes would be one of the reasons of big memory usage. So, I'm wrong in this... What's the memory-hungry piece of code, then? giampaolo --j.
Re: would SA benefit from port to Java
well... I spent several years writing Java in the '90s, and am quite certain that SpamAssassin would perform a *lot* worse if written in Java. SpamAssassin is heavy on regular expressions, and *very* optimised for Perl's VM.

On top of that, I'm pretty sure it would be quite hard to get faster performance out of Java *anyway*. First off, the perl VM uses a much more CISC-like strategy than Java's, with opcodes to implement operations like regexp matches, string modifications, hash lookups, arrays, and so on, in a single VM opcode, implemented in C. That means that those operations in perl will be nearly as fast as the equivalent C (at least if you choose the right operations of course!).

Java, OTOH, uses opcodes that are more RISC-like, and implements much of its core library in pure Java -- operations like HashMap lookups or regexp matches, for example -- resulting in quite a few more pure Java ops being required to perform them. (At least this was the situation last time I looked, which admittedly was JDK 1.2 or so ;). Maybe this has changed since then.)

For what it's worth, in my experience, Perl's performance is often as fast as anything I could write in any other language -- at least, except for specific, low-level bit-twiddling like the Rabin-Karp fast parallel string matching algorithm I just hacked recently. Perl is a *really* nice language for performance, in my opinion.

Java's memory consumption, too, is frankly horrific compared to perl's. Perl's garbage collection, for example, is quite deterministic -- when an object's refcount hits zero, it is immediately freed. Java's, OTOH, relies on occasional GC runs -- and in my experience that can go quite awry, resulting in weird hangs at odd times. Virtually every large Java project I've worked on has had the odd invocation of System.gc(); thrown in odd places because of this! This bug has been a problem in java since 1.0, and talking to java hackers recently, they still complain about it in current releases.
so, in conclusion: go perl. ;)

--j.

Matt Kettler writes: Eric A. Hall wrote: Thinking about the GPL Java announcement some, and trying to imagine the kinds of opportunities this allows for, it occurs to me that SpamAssassin might be a natural fit for Java. I'm just thinking out loud here, not advocating anything... Would it run better? Would it be faster, have smaller memory footprint, better reclamation, better hooks for plugins etc? OTOH, would it be harder to build, given the dependence of SA on perl modules?

There have been about 3 dozen other folks who have asked about porting SA to C/C++/Java/Python/Insert any other language here. In general, SA would suffer severely from a conversion to Java, or any other language. It all fundamentally boils down to two things:

1) perl has a substantial base of text parsing and utility libraries that no other language can match.. Java does have native regex support, so it has a leg up over the others, but it still lacks many of the libraries that SA is so heavily entrenched in. Do you know of any equivalent to IP::Country::Fast, for *ANY* other language? Admittedly that one is not used by everyone, but the MIME parsers, base64 decoders, HTML parser, Net::DNS, etc would be tough to find good matches for without having to write/maintain your own. This kind of text manipulation is what perl is actually very good at, and has lots of support libraries for.

2) Most importantly, consider that all of the existing devels that maintain the code are perl developers, and not all of them are Java developers. Poof, there goes at least some, if not all, of your development team down the tubes. This is by far the most significant hurdle. Who would we lose here, and can we afford to lose the spam-fighting expertise these people have?
That said, I'm a C/C++/assembly developer myself, and my own personal reaction is: why would you want to convert from one lumbering hulk of a language with an expensive interpreter to another lumbering hulk of a language with an expensive VM?

And yes, I know java is JIT compiled, not interpreted, but AFAIK this is not as different from how perl works as you might think. Perl code isn't strictly interpreted from scratch every time you pass through the same code. Perl is really compiled and optimized at load time into bytecode, then interpreted from that. This makes perl's startup much slower, but runtime isn't as slow as a purely interpreted language. As for size, perl interpreters and java VMs are both large.

And yes, you can native compile java to machine code, but I doubt your gains here will be significant. My bets are on SA spending 99% of its time in regex evaluation or network lookups. Regex execution is VERY well optimized in both languages even without native compilation, so that won't be helped much, if at all. Network lookups are basically spending their time waiting.. you can't wait any faster
Re: would SA benefit from port to Java
Giampaolo Tomassoni writes: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] ...snip Recently in the perl blead code, one of the perl hackers has added a trie-based regexp matcher (with Aho-Corasick optimisations) to efficiently match multiple regular expressions in parallel, to the perl core regexp matching code. That's pretty much what you're describing, Yes, I think so too. I didn't know the name of such a beast. Aho-Corasick. It should definitely work. How could something named Aho-Corasick not work? :) Thank you for naming it. here's more info: http://en.wikipedia.org/wiki/Aho-Corasick . it's a nice algorithm ;) and I'm looking into rewriting bits of SpamAssassin to take advantage of that (in the jm_re2c_hacks branch). Hopefully it will run faster than the current regexp matching system, which is actually quite fast as it stands! (The perl regular expression matching engine is _very_ efficient.) There's also an re2c-based version, which already outperforms basic SpamAssassin by 15-20%, btw. They almost definitely will not reduce memory usage, though. ;) Mmmmh, I had the impression that all those strings being created, cloned, used, merged and the like in spite of being fed to the regexes would be one of the reasons for the big memory usage. So, I'm wrong in this... What's the memory-hungry piece of code, then? The perl interpreter -- I think the compiled code itself is quite memory-hungry, as far as I can see. --j.
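For readers who have not met Aho-Corasick before, here is a minimal sketch of the idea in Python. It is not the perl core implementation and not the code in the jm_re2c_hacks branch; the function names build_automaton and find_all are made up for illustration, and it only handles literal strings, not full regular expressions. What it shows is how one pass over the text can report every occurrence of every pattern at once.

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links (hypothetical helper, literals only)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pat)
    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes
    # derive their failure link from their parent's.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix node as well.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern) for every match, in a single scan."""
    goto, fail, out = automaton
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

The text is walked exactly once no matter how many patterns are loaded, which is why the approach is attractive for a large rule set. The price is the automaton itself, which has to sit in memory.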
Re: would SA benefit from port to Java
On Friday November 17 2006 21:24, Giampaolo Tomassoni wrote: Besides, if there weren't SA plugins, I would prefer a C/C++ version of SA. Wouldn't it run better? Wouldn't it be faster, wouldn't it have a smaller memory footprint, better reclamation, better hooks for plugins etc? :) ...and buffer overruns, dangling pointers, poor maintainability, playground for security holes. If SA were written in C, I wouldn't let it examine mail being received from 'the wild'. Mark
RE: would SA benefit from port to Java
Thinking about the GPL Java announcement some, and trying to imagine the kinds of opportunities this allows for, it occurs to me that SpamAssassin might be a natural fit for Java. I'm just thinking out loud here, not advocating anything... Would it run better? Would it be faster, have smaller memory footprint, better reclamation, better hooks for plugins etc? It would probably run better. I wouldn't say it would work faster. I know for sure it would have a much bigger memory footprint... :) OTOH, would it be harder to build, given the dependence of SA on perl modules? This is the main reason for not just starting with it. Besides, if there weren't SA plugins, I would prefer a C/C++ version of SA. Wouldn't it run better? Wouldn't it be faster, wouldn't it have a smaller memory footprint, better reclamation, better hooks for plugins etc? :) giampaolo Thoughts? -- Eric A. Hall http://www.ehsco.com/ Internet Core Protocols http://www.oreilly.com/catalog/coreprot/
Re: would SA benefit from port to Java
Giampaolo Tomassoni wrote: Thinking about the GPL Java announcement some, and trying to imagine the kinds of opportunities this allows for, it occurs to me that SpamAssassin might be a natural fit for Java. I'm just thinking out loud here, not advocating anything... Would it run better? What does that even mean? Run better?
Re: would SA benefit from port to Java
Eric A. Hall wrote: Thinking about the GPL Java announcement some, and trying to imagine the kinds of opportunities this allows for, it occurs to me that SpamAssassin might be a natural fit for Java. I'm just thinking out loud here, not advocating anything... Would it run better? Would it be faster, have smaller memory footprint, better reclamation, better hooks for plugins etc? OTOH, would it be harder to build, given the dependence of SA on perl modules? There have been about three dozen other folks who have asked about porting SA to C/C++/Java/Python/Insert any other language here. In general, SA would suffer severely from a conversion to Java, or any other language. It all fundamentally boils down to two things: 1) perl has a substantial base of text parsing and utility libraries that no other language can match. Java does have native regex support, so it has a leg up over the others, but it still lacks many of the libraries that SA is so heavily entrenched in. Do you know of any equivalent to IP::Country::Fast, for *ANY* other language? Admittedly that one is not used by everyone, but the MIME parsers, base64 decoders, HTML parser, Net::DNS, etc would be tough to find good matches for without having to write/maintain your own. This kind of text manipulation is what perl is actually very good at, and has lots of support libraries for. 2) Most importantly, consider that all of the existing devels that maintain the code are perl developers, and not all of them are Java developers. Poof, there goes at least some, if not all, of your development team down the tubes. This is by far the most significant hurdle. Who would we lose here, and can we afford to lose the spam-fighting expertise these people have? That said, I'm a C/C++/assembly developer myself, and my own personal reaction is: why would you want to convert from one lumbering hulk of a language with an expensive interpreter to another lumbering hulk of a language with an expensive VM?
And yes, I know java is JIT compiled not interpreted, but AFAIK this is not as different from how perl works as you might think. Perl code isn't strictly interpreted from scratch every time you pass through the same code. Perl is really compiled and optimized at load time into bytecode, then interpreted from that. This makes perl's startup much slower, but runtime isn't as slow as in a purely interpreted language. As for size, perl interpreters and java VMs are both large. And yes, you can native compile java to machine code, but I doubt your gains here will be significant. My bets are on SA spending 99% of its time in regex evaluation or network lookups. Regex execution is VERY well optimized in both languages even without native compilation, so that won't be helped much, if at all. Network lookups are basically spending their time waiting; you can't wait any faster in machine code than in a semi-interpreted application. I also expect a lot of the memory usage is the annotation tables and such for regexes. It would be interesting to compare the size of spamd without any rules loaded against one with a stock ruleset. The difference between the two can't really be improved by any means other than using a slower regex interpreter that doesn't use tables as extensively.
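The "with rules versus without rules" comparison suggested above can be tried on a small scale. The sketch below is only an analogy: it uses Python's re module and tracemalloc rather than perl and spamd, the rule patterns are invented, and the helper name compiled_rules_memory is hypothetical. tracemalloc only sees allocations made through Python's allocator, so treat the figures as a rough indication and not as a measurement of process size.

```python
import re
import tracemalloc

def compiled_rules_memory(patterns):
    """Compile patterns and return (rules, bytes still allocated afterwards)."""
    re.purge()  # empty re's internal cache so earlier compiles do not hide the cost
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    rules = [re.compile(p) for p in patterns]  # kept alive, like a loaded ruleset
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return rules, after - before

# Invented stand-ins for a ruleset; real SpamAssassin rules look different.
fake_rules = [r"offer%d\s+(now|today)" % i for i in range(200)]

_, empty_cost = compiled_rules_memory([])
rules, loaded_cost = compiled_rules_memory(fake_rules)
print("no rules:   %d bytes" % empty_cost)
print("%d rules: %d bytes" % (len(rules), loaded_cost))
```

The gap between the two numbers is the part that belongs to the compiled patterns themselves, which is the portion a port to another language would not obviously shrink.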