Hi,

responding to myself...

Ulrik Petersen wrote:
onemind wrote:
Thanks,

The thing is, I am going to need to use different letters each time to
search through over 200,000 words in a database, and it needs to be fast.

What technology would be best suited for this task? I just assumed that a database would be ideal. Why do you say SQL isn't suited for this, and what is?

Thanks again.

[Snip]

Having said all this, the fastest way would probably be to use an in-memory data structure, and simply query that in memory. One possible -- and very simple -- solution would be to have a hash-map for every character you wished to be able to search, then store pointers to the strings of the words in each hash-map. That would make your lookup times O(m), where m is the number of letters to search for, rather than O(n), where n is the number of words.

What I said above is complete and utter BS.  Sorry.

You might want to look into "bitsets", or "finger-prints", as Information Retrieval specialists like to call them.

The basic idea is that you make a bitset out of each word, with one bit for each of the "features" you want to be on or off for that word. For your purposes, you probably want each bit in the bitset to represent the presence or absence of one letter. If you only target the 26 letters of the English alphabet, a 32-bit integer will suffice.
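To make the idea concrete, here is a minimal sketch (in Python; the function name `fingerprint` is my own choice) of folding a word into such a 26-bit integer:

```python
def fingerprint(word):
    """Return a 26-bit set: bit i is set iff the i-th letter
    of the alphabet ('a' = bit 0) occurs in the word."""
    fp = 0
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            fp |= 1 << (ord(ch) - ord('a'))
    return fp

# fingerprint("cab") == 0b111 == 7  (bits for a, b, c)
```

Note that the fingerprint only records presence or absence, not letter counts or order, so "cab" and "cabbage" share the letters a, b, c.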

You can store such a bitset in a column in SQLite, say, "fingerprint". Compute this as you insert the word.

Then use the "&" operator (bitwise-and) of SQLite's language to filter out those that you don't want.

Say you are interested in those words which contain "a" and "b" and "c". Say that "a" is bit 0 (value 1), "b" is bit 1 (value 2), and "c" is bit 2 (value 4). Then you OR these together (1|2|4 = 7) to get a mask, and require that all masked bits are set:

SELECT * FROM words WHERE (fingerprint & 7) = 7;

(The comparison with 7 matters: "fingerprint & 7" alone is true whenever *any* of the three letters is present, not all of them.)
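Putting the two pieces together, here is a small end-to-end sketch using Python's sqlite3 module (table and column names are illustrative). The key point is comparing the masked fingerprint against the full mask, which requires *all* of the letters to be present:

```python
import sqlite3

def fingerprint(word):
    # Bit i is set iff the i-th letter ('a' = bit 0) occurs in the word.
    fp = 0
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            fp |= 1 << (ord(ch) - ord('a'))
    return fp

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, fingerprint INTEGER)")
for w in ["cab", "bad", "dog", "cabbage"]:
    conn.execute("INSERT INTO words VALUES (?, ?)", (w, fingerprint(w)))

mask = fingerprint("abc")  # words that contain a, b, AND c
rows = conn.execute(
    "SELECT word FROM words WHERE (fingerprint & ?) = ?", (mask, mask)
).fetchall()
# rows -> [('cab',), ('cabbage',)]  ("bad" lacks c, "dog" lacks all three)
```

This still scans all rows, but each check is a single integer AND and compare, which is very cheap even over 200,000 words.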


HTH


Ulrik Petersen
