onemind wrote:
Thanks,
The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.
What technology would be best suited for this task? I just assumed that a
databse would be ideal, why do you say sql isn't suited for this and what
is?
Thanks again.
Derrel Lipman's recent post may answer your question better, but here's
a sketch of a solution that involves SQLite.
1) Find a suitable regular expression library, say, PCRE (Perl
Compatible Regular Expressions) -- www.prce.org
2) Write a C function to be used from within SQLite, using the
instructions found at:
http://www.sqlite.org/capi3ref.html#sqlite3_create_function
The C function might be a custom one that, given a string of letters,
searched for all letters (AND) or any letters (OR), possibly using the
RegEx library.
3) Recompile SQLite with said regex library added into the SQLite code,
as well as your C function.
4) Register your C function with SQLite using the above API
5) Use the function with the regex '[spqd]' to search for words
containing the letters "s", "p", "q", OR "d". Doing it for all letters
(AND) may be doable with a single regex, but if not, you can always, in
your custom function, search for all the letters, mark them off one by
one as you find them, and return the appropriate value when all have
been found, otherwise, if you get to the end of the string, then return
another appropriate value.
Another poster mentioned that you should really test the
straightforward, simple-minded approach that he mentioned, first. If it
is fast enough, then why bother doing it the hard way. The above
probably also won't use an index, so it is also an O(n) approach, like
the simple-minded approach of doing several LIKE's probably is.
200,000 words does not sound like a whole lot. The first query might be
a little slow, but if your table fits in memory, then your operating
system's cache will probably make subsequent queries rather fast.
Having said all this, the fastest way would probably be to use an
in-memory datastructure, and simply query that in-memory. One possible
-- and very simple -- solution would be to have a hash-map for every
character you wished to be able to search, then store pointers to the
strings of the words in each hash-map. That would make your
lookup-times be O(m), where m is the number of letters to search for,
rather than O(n), where n is the number of words.
HTH
Ulrik Petersen