Hi Murray,
Thanks for the info.
You're right, the only reason to store plain text is to permit searching.
I think your approach will work for me. I don't know anything about Lucene, so I have a lot to read and investigate. I'll come back soon with more questions... :)
It's not my approach, it's how all search engines operate. You might want to consider joining the Lucene users list, or at least reading their documentation, as this isn't really the correct place to ask for support on search engines, and I certainly won't be able to answer your detailed implementation questions. There may be other search engine implementations on SourceForge as well; I just haven't looked.
Murray
2005/4/22, Murray Altheim <[EMAIL PROTECTED]>:
Xoan,
All searches happen this way, but that process of indexing goes on *before* the user does the search, which is why it seems fast. I've integrated Lucene into my Xindice collections, with a listener that notes when a document is created, changed or deleted. There's an initial cost of indexing the whole collection (if the database is populated all at once), but the cost is incremental and almost unnoticeable otherwise.
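As a rough illustration of that listener-driven, incremental indexing, here is a minimal sketch in Python. It does not use the Lucene or Xindice APIs; the class name IncrementalIndex and the callbacks on_created, on_changed and on_deleted are hypothetical names chosen for this example, and the "index" is just an in-memory dictionary.

```python
import re
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index kept current by document lifecycle callbacks.

    Hypothetical stand-in for a real indexer; not the Lucene API.
    """

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # doc id -> terms indexed for that document

    def _remove(self, doc_id):
        # Drop only this document's entries; the rest of the index is untouched.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def on_created(self, doc_id, text):
        terms = set(re.findall(r"\w+", text.lower()))
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def on_changed(self, doc_id, text):
        # Re-index just the one document that changed.
        self._remove(doc_id)
        self.on_created(doc_id, text)

    def on_deleted(self, doc_id):
        self._remove(doc_id)

    def search(self, word):
        # Lookup only; all indexing work was done before the query arrived.
        return sorted(self.postings.get(word.lower(), set()))


if __name__ == "__main__":
    index = IncrementalIndex()
    index.on_created("doc1", "Xindice stores XML documents")
    index.on_created("doc2", "Lucene indexes documents")
    index.on_changed("doc1", "Xindice stores XML")
    index.on_deleted("doc2")
    print(index.search("xml"))
```

Each callback touches only the document that triggered it, which is why the per-change cost stays small once the initial bulk indexing is done.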
Because Lucene uses a model whereby you feed documents to various indexers depending on their type (so a text document goes to a different one than an HTML document, which needs a text stripper to remove the markup), you don't need a separate text document stored for each HTML document, if the only reason you're doing that is having the text available for searching. You only create the text temporarily for the indexer to function, then dump it.
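To make the "extract temporarily, index, then dump" idea concrete, here is a small sketch in Python using only the standard library. It is not Lucene code: the EXTRACTORS table, the function names and the dictionary used as an index are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data and ignores tags (a very simple text stripper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_from_html(source):
    collector = _TextCollector()
    collector.feed(source)
    collector.close()
    return " ".join(collector.chunks)


def text_from_plain(source):
    return source


# Hypothetical dispatch table: one extractor per document type.
EXTRACTORS = {
    "text/plain": text_from_plain,
    "text/html": text_from_html,
}


def index_document(index, doc_id, content_type, source):
    """Add doc_id to `index` (a dict of term -> set of doc ids)."""
    text = EXTRACTORS[content_type](source)  # temporary plain text
    for term in set(text.lower().split()):
        index.setdefault(term, set()).add(doc_id)
    # `text` is discarded on return; only the index entries are kept.


if __name__ == "__main__":
    index = {}
    index_document(index, "page1", "text/html", "<p>Hello <b>world</b></p>")
    index_document(index, "note1", "text/plain", "hello again")
    print(sorted(index["hello"]))
```

Only the HTML document itself is stored; the stripped text exists just long enough to be tokenised. A real stripper would also need to skip script and style content, which this sketch does not attempt.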
Murray
--
......................................................................
Murray Altheim                    http://www.altheim.com/murray/
Strategic and Services Development
The Open University Library
The Open University, Milton Keynes, Bucks, MK7 6AA, UK
MORE swift, more fleet, than the sun-stained feet of the Dawns that trample the night--
More fleet, more swift, than the gleams that lift in the wake of a wild star's flight--
Through the unpathed deeps of a sea that sweeps unplumbed, unsailed, unknown,
Where the forces untamed, unseen, unnamed, have ruled from the First, alone,
Now the Ghosts of Thought, with a message caught from the tales of the dreaming past,
Unheard, unseen, with nor sound nor sheen, speed through the ultimate vast.
excerpt from "Wireless Telegraph." by Don Marquis. http://donmarquis.org/wireless.htm