David,

Very clever indeed!

It's amazing how many different optimizations we can find based on the type of data. There's the baseline brute force; there's mine, which takes a constant factor out (by shrinking the text) but would fail if the text had many different words with the same suffixes (as odd as that may seem). There are a couple of methods that fly through redundant text. And there's your latest, which really whoops it up on large texts with certain-length search strings.

I wonder, on yours: could there be some advantage to be gleaned from the fact that longer searches can be broken down into shorter sub-searches (e.g. 6 words = 3 words + 3 words)? Could be interesting if there's a slick way to utilize that.
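Just to make the idea concrete, here's a rough sketch in Python (not Revolution, and the names build_trigram_index and find_six_words are mine, purely hypothetical): if an index maps each 3-word run to its starting word positions, then a 6-word query is just the positions where the first half starts at p and the second half starts at p + 3.

def build_trigram_index(words):
    # map each 3-word run to the list of word positions where it starts
    index = {}
    for i in range(len(words) - 2):
        key = " ".join(words[i:i + 3])
        index.setdefault(key, []).append(i)
    return index

def find_six_words(index, query_words):
    # intersect two 3-word sub-searches: the second half must start
    # exactly 3 words after some start position of the first half
    first = " ".join(query_words[:3])
    second = " ".join(query_words[3:6])
    starts = set(index.get(first, []))
    return sorted(q - 3 for q in index.get(second, []) if q - 3 in starts)

# e.g. with idx built from "the quick brown fox jumps over the lazy dog":
#   find_six_words(idx, "quick brown fox jumps over the".split())  ->  [1]

No idea whether that pencils out in Revolution, but the offset check is cheap, so the 6-word case costs little more than two 3-word lookups.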

From what I can tell, the algorithms roughly look like:

Brute force: ~500 ticks baseline on a dual G4; runs faster when there are more matches
Redundancy-array methods: as little as a few ticks, but slower on pseudo-random data
Suffix-indexing method: about 60-80% faster than brute force at eliminating matches, and 20-40% slower at finding them
3-word indexing method: about 10% overhead, with a 99% speedup on 3-word queries and the same speed on others


I took Chris' data and examined it. My solution below exploits that knowledge of the data, but if it is genuinely a typical search set (most search material being three words or so) then the following does the trick.

Using the algorithm on Chris' web page I process the material in 530 ticks on my machine (Chris does it in 494). With the code below this is cut to 51 ticks, so it should run in <50 ticks on Chris' machine, an all-time record.

The trick is to exploit "intersect" (look it up), which is amazingly fast at, er, intersecting. The intersection itself executes in around 50 milliseconds, or about 3 ticks; it is all the preparatory work that takes the other 48 ticks.
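For anyone following along without the code, the shape of the approach is roughly the following (a Python sketch of the idea only; the real code is Revolution, whose "intersect" command culls one array's keys against another's, and the record-level indexing here is my own assumption). Do the per-word prep up front, then let the intersection do the heavy lifting:

def build_word_index(records):
    # the prep work: map each word to the set of record numbers
    # that contain it
    index = {}
    for recnum, record in enumerate(records):
        for word in set(record.split()):
            index.setdefault(word, set()).add(recnum)
    return index

def search(index, query):
    # records containing every query word = intersection of their sets
    sets = [index.get(word, set()) for word in query.split()]
    if not sets:
        return set()
    result = set(sets[0])   # copy so we don't mutate the index
    for s in sets[1:]:
        result &= s         # the fast "intersect" step
    return result

# e.g. search(build_word_index(["the quick brown fox",
#                               "lazy dog sleeps",
#                               "quick dog"]), "quick dog")  ->  {2}

Which matches the timing above: the intersection itself is trivial, and building the per-word arrays is where the other 48 ticks go.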
