On 06/23/2010 12:39 PM, Brian Lavender wrote:
> What started this is that a hash ideally provides O(1) lookups. So, I
> say that to achieve this, one would ideally want to have a large array
> to store your IP addresses and a hashing function that provides good
> distribution. To which, Tim pointed to using a smaller array and
> chaining, because he initially thought that arrays were limited to
> smaller sizes than was observed by doing some simple tests. To which I
> replied, using a fixed size array and chaining could cause the hash to
> decay into a linked list, which has O(n) lookups.

Indeed, the classic memory vs. access-time trade-off for hashes.
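To put some C behind that point, here's a minimal sketch of my own (not
Brian's or Tim's code): a fixed bucket array with chaining. With
NBUCKETS comfortably larger than the number of distinct IPs and a
well-distributed hash, the chain walk almost never iterates and lookups
average O(1); shrink NBUCKETS and those chains are exactly where it
decays toward O(n).

    #include <stdint.h>
    #include <stdlib.h>

    #define NBUCKETS 65536   /* fixed table size: bigger => shorter chains */

    struct node {
        uint32_t ip;              /* IPv4 address as a 32-bit key */
        unsigned long count;
        struct node *next;        /* collision chain */
    };

    static struct node *table[NBUCKETS];

    /* Multiplicative hash; any well-distributed function will do. */
    static unsigned hash_ip(uint32_t ip)
    {
        return (unsigned)((ip * 2654435761u) % NBUCKETS);
    }

    /* Find the node for ip, inserting it if absent. */
    static struct node *lookup(uint32_t ip)
    {
        struct node *n;
        unsigned h = hash_ip(ip);

        /* This walk is the decay: with few buckets and many keys,
         * every chain is long and each lookup is effectively O(n). */
        for (n = table[h]; n; n = n->next)
            if (n->ip == ip)
                return n;

        n = malloc(sizeof *n);    /* not seen before: add to the chain */
        n->ip = ip;
        n->count = 0;
        n->next = table[h];
        table[h] = n;
        return n;
    }

Counting a packet is then just lookup(ip)->count++; the whole question
in this thread is how big to make table[] before that for loop starts
to hurt.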
> And to which Tim has noted using a large fixed size array may just
> take up all your architectural limits of the program.

I imagine that having such a large array would push the lower bound of
the stack up in memory, or, if allocated at runtime, would push the
heap way down from the top.

I had what sounds like a similar problem with web logging. I wanted to
record each hit, which contained an agent, IP, URL, and referrer. I
posted the source on lugod quite a while ago. The pseudocode was:

    for hits:
        agent_id = getid(agent, "agent string");
        ip_id    = getid(ip, "ip string");
        url_id   = getid(url, "url string");
        refer_id = getid(refer, "refer string");
        append_log(agent_id, ip_id, url_id, refer_id, datestamp, returncode);

Of course, getid would look up the value, and if it didn't exist it
would add that string to the table and assign it a new id (a
self-contained sketch of getid follows at the end of this message).
This resulted in several advantages:

 * It made it much easier to extract info from the logs, e.g. how many
   hits a subset of the website had last year vs. this year, so we
   could evaluate advertising effectiveness.
 * The logs were denser (less disk).
 * Running top-10 reports for things like hits, bandwidth, IPs, agents,
   and 404s was easy.

While load testing on a rather old machine (a PII-400, I believe) I
could record 1000 hits/sec even when using MySQL as the backend. With a
memory-resident database, or something lighter like SQLite, I'd expect
much better. Of course, hardware has come a long way since the PII-400
as well.

Such a setup might work well for an IDS too. With some care it should
be relatively straightforward to make it parallel friendly.

With all that said, I used the big hash method in a simple little tool
I wrote to instantly tell you where your bandwidth is going:

    http://broadley.org/bill/tcpdump.pl

The output:

    tcpdump -n -c 1000 | ./tcpdump.pl | head -5
    1000 packets captured
    1000 packets received by filter
    0 packets dropped by kernel
    lines=1000 lengths=760
    613884 128.120.46.32.8649 -> 128.120.246.1.48876
    153463 128.120.46.33.8649 -> 128.120.246.1.48609
    153462 128.120.46.33.8649 -> 128.120.246.52.54912
      2684 131.215.74.22.61733 -> 128.120.246.102.9996
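And, as promised, a minimal in-memory sketch of the getid() idea. This
is a hypothetical reconstruction for the list, not the code I posted
back then (which wrote to MySQL): one chained string hash per field,
insert on miss, ids handed out sequentially. The same structure is
essentially the "big hash" that tcpdump.pl uses to tally traffic.

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 8192

    struct entry {
        char *str;                /* the interned string */
        unsigned id;              /* its small integer id */
        struct entry *next;       /* collision chain */
    };

    struct strtab {               /* one per field: agent, ip, url, refer */
        struct entry *bucket[NBUCKETS];
        unsigned next_id;
    };

    static unsigned hash_str(const char *s)
    {
        unsigned h = 5381;        /* djb2: simple, decent distribution */
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Return the id for s in table t, assigning one on first sight. */
    static unsigned getid(struct strtab *t, const char *s)
    {
        unsigned h = hash_str(s);
        struct entry *e;

        for (e = t->bucket[h]; e; e = e->next)
            if (strcmp(e->str, s) == 0)
                return e->id;     /* seen before: reuse its id */

        e = malloc(sizeof *e);    /* first sighting: intern and number it */
        e->str = strdup(s);
        e->id = ++t->next_id;
        e->next = t->bucket[h];
        t->bucket[h] = e;
        return e->id;
    }

The logging loop then becomes roughly:

    struct strtab agents = {0}, ips = {0}, urls = {0}, refers = {0};
    ...
    append_log(getid(&agents, agent), getid(&ips, ip),
               getid(&urls, url), getid(&refers, refer),
               datestamp, returncode);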