> My current planned approach is to have each file added read and keywords
> extracted (i.e. variable names, libraries imported etc.) and used as
> keys in a dictionary, and each key points to a list of files containing
> it (files will be stored in a zip or similar).
>
> i.e.
>
> words = {
>     "imaplib": ["X.py"],
>     "smtplib": ["X.py", "Y.py"]
> }
>
> And if I do a search for "smtplib" or "imaplib" it'll return X.py
> first, as it contains both, and Y.py second.
>
> Is there a more efficient approach? I don't think a relational DB really
> suits this sort of task at present, but I'm wondering about search
> times.
Hi Liam,
That sounds fine, and it's a popular approach. What you have there is
known as an "inverted index":
http://www.nist.gov/dads/HTML/invertedIndex.html
and it's how many search engines link keywords back to the documents
that contain them.
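Just to make that concrete, here's a minimal sketch of the inverted index
and ranked lookup you describe, using made-up sample data (the file names
and keyword lists are placeholders; in your program they'd come from
tokenizing each file as it's added):

```python
from collections import defaultdict

# Hypothetical sample data: file name -> keywords extracted from it.
files = {
    "X.py": ["imaplib", "smtplib"],
    "Y.py": ["smtplib"],
}

# Build the inverted index: keyword -> list of files containing it.
index = defaultdict(list)
for name, keywords in files.items():
    for word in keywords:
        index[word].append(name)

def search(*terms):
    """Return file names ranked by how many query terms each contains."""
    hits = defaultdict(int)
    for term in terms:
        for name in index.get(term, []):
            hits[name] += 1
    # Most matches first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda name: (-hits[name], name))

print(search("smtplib", "imaplib"))  # X.py matches both terms, so it
                                     # ranks ahead of Y.py
```

Lookups are just dictionary accesses, so search time grows with the number
of matching files per term rather than the total number of files, which is
exactly why inverted indexes scale well.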
> I've googled for search theory, and hit c2's wiki up, but I'm having no
> luck at finding applicable, understandable theory... a lot of stuff
> about trees and nodes and paths through graphs, but I can't correlate
> that to my problem.
There's a really short-and-sweet book called "Understanding Search
Engines: Mathematical Modeling and Text Retrieval" that covers some core
ideas:
http://www.cs.utk.edu/~berry/usebook/
I'd heartily recommend that one; they do talk about the theory, but from a
practitioner's point of view, so it's quite readable. And it is very thin
and easy to carry. *grin*
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor