On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
> On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
> > On tor, 2008-10-16 at 12:02 +0300, Henrik K wrote:
> > 
> > > Optimizing 1000 x "www.foo.bar/<randomstuff>" into a _single_
> > > "www.foobar.com/(r(egex|and(om)?)|fuba[rz])" regex is nowhere near linear.
> > > Even if it's all random servers, there are only ~30 characters from which
> > > branches are created.
> > 
> > Right. 
> > 
> > Would be interesting to see how 50K dstdomain compares to 50k host
> > patterns merged into a single dstdomain_regex pattern in terms of CPU
> > usage. Probably a little tweaking of Squid is needed to support such
> > large patterns, but that's trivial. (squid.conf parser is limited to
> > 4096 characters per line, including folding)
> 
> Not sure what the splay code does in Squid, didn't have time to grab it.
> But a simple test with Perl:
> 
> - Grepped some hostnames from wwwlogs etc
> - Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
> - Ran 100000 hostnames against it in 4 seconds (25000 hosts/sec on a 2.8 GHz CPU)
> 
> It's pretty powerful stuff.

Oops, I even did it slightly wrong.

Done correctly, anchoring each entry as ^hostname$ instead of a plain hostname
in the regex, the run takes 1.2 seconds, i.e. 80000+ hosts/sec.
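
For the archive, the test goes something along these lines (the file names
are just placeholders, not the exact script I used):

  use strict;
  use warnings;
  use Time::HiRes qw(time);
  use Regexp::Assemble;

  # Build one big regex out of the hostname list, anchoring each entry
  # as ^hostname$ so the assembled regex only matches complete hostnames.
  my $ra = Regexp::Assemble->new;
  open my $in, '<', 'hostnames.txt' or die "hostnames.txt: $!";
  while (my $host = <$in>) {
      chomp $host;
      $ra->add('^' . quotemeta($host) . '$');
  }
  close $in;
  my $re = $ra->re;    # compiled qr// of the assembled pattern

  # Run a test set against it and report throughput.
  open my $t, '<', 'testhosts.txt' or die "testhosts.txt: $!";
  chomp(my @tests = <$t>);
  close $t;

  my $start   = time;
  my $hits    = grep { /$re/ } @tests;
  my $elapsed = time - $start;
  printf "%d/%d matched in %.2f sec (%.0f hosts/sec)\n",
      $hits, scalar @tests, $elapsed, @tests / $elapsed;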
