Re: Indexing a largish collection of mail and usenet messages?

2007-01-02 Thread Christophe Ollier

John L wrote:

I have a collection of archives of mailing list and news messages. The 
largest collection is pretty big, about 150,000 messages which means 
about 200 megabytes of text, shortly to be migrated to a FreeBSD 
server.  The lists are all active so archives typically add a few 
messages each day. I want to provide a full text search of each 
archive.  What software should I use?  I have been using the sturdy but 
ancient lqtext package. It's OK, but it has a few bugs I have yet to 
pick and I'm wondering if something better is available.


You could have a look at Lucene (http://lucene.apache.org/), a text 
search engine library written in Java. I don't know lqtext, but Lucene 
seems to work in a similar way: one program builds and updates an 
index, and a second program queries it.
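
To illustrate that build-then-query split, here is a toy inverted index in Python. It is only a sketch of the general idea, not Lucene's API or its on-disk format; the names tokenize, add_document and search are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_document(index, doc_id, text):
    """Indexing step: record doc_id under every term found in the text."""
    for term in tokenize(text):
        index[term].add(doc_id)


def search(index, query):
    """Query step: return the ids of documents containing all query terms."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# The "first program": build the index from two stand-in messages.
index = defaultdict(set)
add_document(index, "msg1", "Indexing mail archives on FreeBSD")
add_document(index, "msg2", "Usenet archives and full text search")

# The "second program": query it.
print(sorted(search(index, "archives")))          # ['msg1', 'msg2']
print(sorted(search(index, "archives freebsd")))  # ['msg1']
```

A real engine would also store the index on disk, rank the hits and handle stemming; this only shows why the two steps can be separate programs sharing one index.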


It's only a library, so you have to program the interfaces yourself, 
both for indexing and for your users' queries. There are numerous ports 
to other languages: C, Perl, Python and PHP (through Zend Framework) are 
in the ports tree.


First, I am NOT, repeat NOT, asking about web spiders.  The messages are 
directly available to indexing software as files on my server, so 
there's no advantage to running them through Apache on the way to the 
indexer. Also, the messages in the archive never change and I know what 
files are new each day, so it would be pointless for a package to 
re-spider the whole archive to look for the new messages.  I am not 
unalterably opposed to something that spiders if it is otherwise 
wonderful, but that approach hasn't been fruitful in the past.


Lucene can update an existing index with new documents.
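
Since you already know which files are new each day, the update step only has to touch those. Here is a rough Python sketch of that idea, again not Lucene's API; update_index and the seen set are hypothetical names, and a real setup would keep both the index and the list of indexed files on disk between runs.

```python
import os
import re
import tempfile


def update_index(index, seen, paths):
    """Add only files that are not indexed yet; return how many were added."""
    added = 0
    for path in paths:
        if path in seen:
            continue  # archived messages never change, so skip them
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(term, set()).add(path)
        seen.add(path)
        added += 1
    return added


# Demo with two throwaway files standing in for archived messages.
with tempfile.TemporaryDirectory() as d:
    day1 = os.path.join(d, "0001.txt")
    day2 = os.path.join(d, "0002.txt")
    with open(day1, "w", encoding="utf-8") as f:
        f.write("first message about lqtext")
    with open(day2, "w", encoding="utf-8") as f:
        f.write("second message about lucene")

    index, seen = {}, set()
    first_run = update_index(index, seen, [day1])         # indexes day1
    second_run = update_index(index, seen, [day1, day2])  # only day2 is new
    print(first_run, second_run)  # 1 1
```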

What I want ideally is something that knows enough about the structure 
of mail messages to deal intelligently with headers vs. body, that can 
do something reasonable with MIME and HTML bodies (not urgent, I can 
always run them through demime on the way to the index), and most 
importantly that actually works with 150,000 messages.  I've seen lots 
of packages that look promising but that fall over dead once they get 
past 10,000 messages or so.


I don't think Lucene can do this out of the box, but you can associate 
any keyword with your indexed documents (e.g. mail headers).
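
As a sketch of the preprocessing this implies, Python's standard email package can split a message into header fields and a text body before anything is handed to an indexer. The sample message and the message_to_fields helper below are made up for illustration, and how the resulting fields map onto a given Lucene port is left open.

```python
from email import message_from_string
from email.policy import default

# A made-up message standing in for one file from the archive.
RAW = """\
From: alice@example.org
To: list@example.org
Subject: Full text search
Content-Type: text/plain; charset=utf-8

Has anyone tried indexing 150,000 messages?
"""


def message_to_fields(raw):
    """Split a raw message into separately indexable header and body fields."""
    msg = message_from_string(raw, policy=default)
    # Prefer a plain-text part and fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg["From"] or ""),
        "subject": str(msg["Subject"] or ""),
        "body": body.get_content() if body is not None else "",
    }


fields = message_to_fields(RAW)
print(fields["subject"])  # Full text search
```

Each field can then be indexed on its own, so a query can be restricted to the subject or sender instead of the whole text.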


As for performance, I'm personally satisfied. I use the PHP port with 
20k documents: the full index takes about an hour to build, and queries 
take about 100 to 1000 ms. Lucene seems fit for millions of documents.



[...]


--
Christophe
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Indexing a largish collection of mail and usenet messages?

2007-01-02 Thread John L
You could have a look at Lucene (http://lucene.apache.org/), a text search 
engine library written in Java. I don't know lqtext, but Lucene seems to work 
in a similar way: one program builds and updates an index, and a second 
program queries it.


Thanks.  Using Java on a BSD box is a pain, but I see there is Ferret, a 
port to C that can be glued into Ruby.


Regards,
John Levine, [EMAIL PROTECTED], Primary Perpetrator of The Internet for Dummies,
Information Superhighwayman wanna-be, http://johnlevine.com, Mayor
I dropped the toothpaste, said Tom, crestfallenly.