I built a word search engine in Python.  It isn't yet to the point of
being really usable, although I have gotten some interesting results
searching my local documents.

I thought reiserfs would be a reasonable way to keep my term indices
on disk.  The indexer spends about two-thirds of its wall-clock time
in userland, so reiserfs isn't costing *that* much, but it writes data
to reiserfs at only about 50-150 kilobytes per second, which is
disappointing, because my disk seems capable of sequential writes at
about 10 000 kilobytes per second.  However, this may be my fault: I'm
using the reiserfs in the standard 2.4.13 kernel, which I've heard may
be slower than a patched or newer kernel, and I have
CONFIG_REISERFS_CHECK set to 'y', which is documented to slow reiserfs
down significantly.

The index also uses more space than I would like, essentially due to
the file format.

Some notes on indexing document types other than plain text:

    A trial on 10 random PostScript documents produced the following
    rough results using ps2ascii from gs 5.10:
    - ps2ascii can spew out about 2000 words per second on my PII-366
    - words were about 6 chars long
    - ps2ascii fails on 20% of input documents for unknown reasons
    - ascii files were typically 20% of the size of the PostScript originals
    - putting these numbers together, ps2ascii converts PostScript to text
      at about 60 kilobytes of PostScript per second

    ps2ascii's output should generally be good enough for extracting
    keywords for searches, though it could be improved for that purpose,
    and it will be ugly for displaying excerpts.  On the plus side,
    ps2ascii does try to decolumnize multi-column text.  Its major
    problems are guessing wrong about fonts (inserting Greek letters or
    ligatures in place of bullets or mathematical symbols), joining words
    it shouldn't, and splitting words it shouldn't.

    A similar test with the pdftotext program from xpdf 1.00 produced the
    following results:
    - pdftotext can spew out about 10 000 words per second
    - words were about 6 chars long
    - pdftotext failed on one input document that was encrypted (an application
      for a Danish entry visa)
    - sizes of output files varied widely; about the same size as input was
      common, as was half the size of input, a tenth of the size of input,
      and 10x the size of input

    pdftotext didn't handle multicolumn input in a fashion desirable for
    text searching, and otherwise suffered the same problems as ps2ascii,
    but not as severely.

    pdftops (from xpdf) worked much better than pdf2ps (from gs).

    On extracting text from HTML: lynx -dump -nolinks does a reasonably
    good job, though it should be easy to improve on.  It can spew out
    somewhere between 10 000 and 100 000 words per second and is quite
    reliable.
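
To make the conversion step concrete, here is a rough sketch of a word
extractor that shells out to the converters discussed above, picking
one by file extension.  The extract_words function and the converters
table are made up for illustration; the command lines are plausible
defaults rather than tested configurations, and filenames are passed
to the shell unquoted, so treat this as a sketch only:

#!/usr/bin/python
# Sketch: extract words from non-plain-text documents by shelling out
# to an external converter chosen by file extension.
import os, re, sys

converters = {
    '.pdf':  'pdftotext %s -',          # from xpdf
    '.ps':   'ps2ascii %s',             # from ghostscript
    '.html': 'lynx -dump -nolinks %s',
}

def extract_words(filename):
    "Return a list of the words in filename, converting it if necessary."
    for extension, command in converters.items():
        if filename[-len(extension):] == extension:
            pipe = os.popen(command % filename)
            text = pipe.read()
            pipe.close()
            break
    else:
        text = open(filename).read()    # assume plain text
    return filter(None, re.split(r'\W+', text))

if __name__ == '__main__':
    for word in extract_words(sys.argv[1]):
        print word

A fuller version would also need to notice converter failures, since
ps2ascii failed on 20% of the PostScript sample above.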



Here's the program that builds or updates indices; after it is the
program to search them:

#!/usr/bin/python
import sys, socket, time, os, errno, re, random, profile, posix
usage = "index indexdir filename [filename ...]"

# Builds a full-text index using the filesystem.
#
# Not surprisingly, this will be slow and space-hungry unless your
# filesystem rocks.  I recommend running it on ReiserFS.
#
# Initially this indexes at about 28-36 kilobytes per second on my
# PII-366, about 2/3 of which is user time.
# http://www.searchtools.com/tools/lucene.html describes Lucene as
# having "very fast indexing, over 200 MB per hour", which works out
# to 55K per second, so I guess that's not too shabby.  It takes about
# 20% longer to complete a small job on ext2 than on reiserfs (due to
# taking nearly 3x as much system time), and I expect that the size of
# the difference should increase as the job size increases.  (It also
# used 20 times as much disk space.)
#
# Profiling reveals that 17.7 seconds of a 52.0-second run are in
# 137980 calls to text_word_iterator.next, while another 4.2 seconds
# are in 137801 calls to indexer.stem and another 5.6 seconds are in
# 137801 calls to indexer.wanted.  So it should be possible to essentially
# double the speed of this indexer with a little C code, making it as fast
# as Lucene on plain text.
# 
# Space is another matter.  On reiserfs, it has rather poor space
# usage; in one instance, 213 mail messages totaling 909kB used
# 1572kiB for their index.  In another, 12 documents totaling 8.8MB
# used 8388kiB for their index.  This could be prohibitive overhead,
# although as mentioned above, it's 20 times better than ext2fs.
# 
# An indexdir contains two subdirectories: "docs" and "words".
# Files in "docs" are documents that have been indexed.
# This dir normally contains symlinks to the actual files.
# Some of them may be dangling.
# "words" contains one file per word; it lists the names of files in "docs"
# that contain that word, along with some other data (presently the number
# of occurrences).  
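# For example (hypothetical contents), the file "words/reiserfs" might
# end up containing:
#
#     1: 3
#     4: 1
#
# meaning that the documents linked as docs/1 and docs/4 contain the
# word "reiserfs" three times and once, respectively.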
# "words" might contain index entries that point at no-longer-existing files
# (or files the creation of which failed) in "docs", and some things in "docs"
# might not be fully indexed.
# We need a way to update the index for an already-existing file!

# XXX backwards-compatibility hack
try: StopIteration
except: StopIteration = "StopIteration"

jumper = random.Random()
jumper.seed()
serial = 0
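
# linkfileinto() picks a fresh serial-number name for a file in dir by
# symlinking to it; when the name is already taken (EEXIST, e.g. because
# another indexing job is running), it jumps the serial number ahead by
# 1, 10, or 100 at random and tries again.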
def linkfileinto(dir, file):
    while 1:
        global serial
        serial = serial + 1
        filename = str(serial)
        try:
            # XXX assumes symlink() is atomic
            os.symlink(file, os.path.join(dir, filename))
            return filename
        except OSError, err:
            err_errno, msg = err
            if err_errno == errno.EEXIST:
                serial = serial + 10**jumper.randrange(3)
            else:
                raise

class text_word_iterator:
    "Iterate over the words of a text file, reading one line at a time."
    def __init__(self, filename):
        self.file = open(filename)
        self.readline()
    def readline(self):
        self.line = self.file.readline()
        if self.line == '': self.words = None
        else: self.words = filter(None, re.split(r'\W+', self.line))
    def next(self):
        while 1:
            if self.words is None: raise StopIteration
            elif self.words: return self.words.pop()
            else:
                self.readline()

class texttype:
    def getwords(self, filename):
        "Return a word iterator."
        return text_word_iterator(filename)

def makedirs(dirname):
    if not os.path.exists(dirname): os.makedirs(dirname)

class indexer:
    def docsdir(self): return os.path.join(self.indexdir, "docs")
    def wordsdir(self): return os.path.join(self.indexdir, "words")
    
    def __init__(self, indexdir):
        self.indexdir = indexdir
        self.words = {}

    texttypeinstance = texttype()
    def guesstype(self, origname, linkedname):
        # extensions we have no word extractor for yet, so skip them
        for compressed_extension in ['.zip', '.gz', '.pdf', '.PDF']:
            if origname[-len(compressed_extension):] == compressed_extension:
                return None
        return self.texttypeinstance
    def stem(self, word): return word.lower()
    def wanted(self, word):
        # length limit of 60 should cut down on uuencode and base64 "words"
        return (len(word) > 1 and len(word) < 61 and
                word not in ['the', 'and'])

    def indexfile(self, origname, linkedname):
        """Add a file to the full-text index.

        origname --- the file's original name (used for guessing type)
        linkedname --- the name the file is known as in this system

        This routine figures out what kind of file the file is,
        extracts words from it, stems them, and remembers them for
        addition to the index.

        """
        filetype = self.guesstype(origname, linkedname)
        if filetype is None: return  # XXX should report as failure!
        filewords = filetype.getwords(os.path.join(self.docsdir(),
                                                   linkedname))
        words = {}
        try:
            while 1:
                word = self.stem(filewords.next())
                if self.wanted(word):
                    if not words.has_key(word): words[word] = 0
                    words[word] = words[word] + 1
        except StopIteration:
            pass
        for word in words.keys():
            if not self.words.has_key(word): self.words[word] = []
            self.words[word].append((linkedname, words[word]))

    def index(self, filename):
        filename = os.path.abspath(filename)
        docsdir = self.docsdir()
        makedirs(docsdir)
        linkedname = linkfileinto(docsdir, filename)
        self.indexfile(filename, linkedname)

    def commit(self):
        wordsdir = self.wordsdir()
        makedirs(wordsdir)
        for word in self.words.keys():
            wordfile = open(os.path.join(wordsdir, word), "a")
            for file, data in self.words[word]:
                # FIXME: with really big concurrent indexing jobs, one
                # job's appends could be split across stdio buffer
                # flushes and interleave with another job's
                wordfile.write("\n%s: %s" % (file, data))
            posix.fdatasync(wordfile.fileno())
            wordfile.close()
        self.words.clear()

def main(argv):
    if len(argv) < 3:
        sys.stderr.write(usage + "\n")
        return 1
    indexdir = argv[1]
    files = argv[2:]
    myindexer = indexer(indexdir)
    try:
        for file in files:
            print "indexing", file,
            sys.stdout.flush()
            myindexer.index(file)
            print "done."
    finally:
        print "committing",
        sys.stdout.flush()
        myindexer.commit()
        print "done."
    return 0

#if __name__ == "__main__": profile.run('sys.exit(main(sys.argv))')
if __name__ == "__main__": sys.exit(main(sys.argv))
# bug log:
# - forgot about argv[0], tried to use argv[0] for the index dir
# - assumed os.makedirs() would succeed if dir already existed
# - forgot to declare hostname in uniqname global --- "local variable
#   referenced before assignment"
# - had local variable "time" shadowing global "time" module --- same
#   problem
# - used "symlink" instead of "os.symlink"
# - wrote foo(os.path.join(x), y) instead of foo(os.path.join(x, y))
# - didn't think about the case of an empty word (now the text word iterator
#   avoids them)
# - didn't think about lowercasing words
# - when I moved the rename code from function to function, I didn't rename
#   'linkedname' to 'file' as I should have
# - forgot to makedirs(todir)
# - didn't provide for "stop words"
# and then it worked.
# - forgot to declare 'serial' global
# - tried to catch IOError from symlink(), but it's OSError
# - forgot to put "return filename" inside loop
# - tried to use symlink()'s atomicity to provide concurrency control for
#   concurrent indexing jobs, but forgot I was symlinking into a different
#   directory than the finished files were in.  Gave up on not making
#   docs visible in "docs" until indexing them was finished.
# - the symlink logic was stupidly wrong --- when the symlink already
#   existed, symlink() would fail, so it would pick a new number --- and
#   then return the number it had tried in the first place instead of
#   trying again.




And here's the 'search' program:

#!/usr/bin/python
import sys, socket, time, os, errno, re, string
usage = "search indexdir searchterm [searchterm...]"

# Search a full-text index created by the "index" program.

# for scoring, see lucene faq sec. 3 q. 31

def set(list):
    rv = {}
    for item in list: rv[item] = 1
    return rv

def intersect(somelists):
    "Return the items that appear in every one of the given lists."
    rvdict = set(somelists[0])
    for eachlist in somelists[1:]:
        listdict = set(eachlist)
        for item in rvdict.keys():
            if not listdict.has_key(item): del rvdict[item]
    return rvdict.keys()

class indexer:
    def docsdir(self): return os.path.join(self.indexdir, "docs")
    def docstmpdir(self): return os.path.join(self.indexdir, "docstmp")
    def wordsdir(self): return os.path.join(self.indexdir, "words")
    
    def __init__(self, indexdir):
        self.indexdir = indexdir
    def docscontaining(self, term):
        "Return the names of the documents listed in term's word file."
        try: index = open(os.path.join(self.wordsdir(), term))
        except IOError, err:
            err_errno, msg = err
            if err_errno == errno.ENOENT: return []
            else: raise
        try:
            rv = []
            while 1:
                line = index.readline()
                if line == '': return rv
                colonpos = string.find(line, ':')
                if colonpos != -1:
                    rv.append(line[:colonpos])
        finally:
            index.close()
    def realname(self, filename):
        return os.readlink(os.path.join(self.docsdir(), filename))

def main(argv):
    if len(argv) < 3:
        sys.stderr.write(usage + "\n")
        return 1
    if not os.path.exists(argv[1]):
        sys.stderr.write("No such indexdir: " + argv[1] + "\n")
        return 2
    myindex = indexer(argv[1])
    doclists = map(myindex.docscontaining, argv[2:])
    doclist = intersect(doclists)
    for doc in doclist:
        print myindex.realname(doc)
    return 0

if __name__ == "__main__": sys.exit(main(sys.argv))
# bug log:
# - called "indexer" "index" --- oops.
# - did except OSError, but open was raising IOError
# - said readlink instead of os.readlink
# - tried to do [:colonpos-1] instead of [:colonpos], duh
# and then it worked.
# - didn't error on misspelled indexdir name
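
The search program ignores the occurrence counts the index stores after
each colon.  As a first step toward the scoring the comment at the top
of the program alludes to, here is a rough sketch of ranking matches by
summed occurrence counts; this is not the Lucene-style scoring the FAQ
describes, and the counts_for_term and ranked functions below are
hypothetical, not part of the program above:

#!/usr/bin/python
# Sketch: rank matching documents by total occurrences of the search
# terms, using the counts stored after the ':' in each word file.
import os, string

def counts_for_term(wordsdir, term):
    "Return a dict mapping document name to occurrence count for term."
    counts = {}
    try: index = open(os.path.join(wordsdir, term))
    except IOError: return counts
    for line in index.readlines():
        colonpos = string.find(line, ':')
        if colonpos != -1:
            counts[line[:colonpos]] = int(line[colonpos+1:])
    index.close()
    return counts

def ranked(wordsdir, terms):
    "Return (score, docname) pairs for docs containing all terms, best first."
    if not terms: return []
    percounts = []
    for term in terms:
        percounts.append(counts_for_term(wordsdir, term))
    scores = []
    for doc in percounts[0].keys():
        total = 0
        for counts in percounts:
            if not counts.has_key(doc): break
            total = total + counts[doc]
        else:
            scores.append((total, doc))
    scores.sort()
    scores.reverse()
    return scores

Hooking this into main() would just mean passing it the wordsdir and
argv[2:] and printing realname(doc) for each (score, doc) pair.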


-- 
/* By Kragen Sitaker, http://pobox.com/~kragen/puzzle2.html */
char a[99]="  KJ",d[999][16];main(){int s=socket(2,1,0),n=0,z,l,i;*(short*)a=2;
if(!bind(s,a,16))for(;;){z=16;if((l=recvfrom(s,a,99,0,d[n],&z))>0){for(i=0;i<n;
i++){z=(memcmp(d[i],d[n],8))?z:0;while(sendto(s,a,l,0,d[i],16)&0);}z?n++:0;}}}

