Hello,

I've been trying for a while to build a web search engine that spiders a small number of websites (around 1000). Before even considering Lucene I used a DBMS and tried crawling a site, storing every keyword from the HTML files (after filtering out stopwords etc.). Unfortunately this simplistic approach produced so much data that the whole project became impractical. A friend then suggested Lucene, because it stores indexes of this kind more efficiently.

Since most websites nowadays are generated dynamically from templates, much of the page content repeats from page to page, so the same words are re-added to the index, making it larger without adding any useful information. I came up with the idea of finding, approximately, which keywords stay the same across a site and indexing them only once, in a document I call the "base". Every page from the same website is then compared to the base document, and only the differences are stored as a separate document with a field containing the "link" to the base document. This works as expected, i.e. it substantially decreases the index size, but it introduces another problem: how do I search?
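The base/difference idea described above can be sketched roughly as follows. This is plain Python rather than Lucene code, and it is only an illustration of the approach, not the actual implementation: the function name `split_site`, the `threshold` parameter and the sample pages are all made up for the example.

```python
from collections import Counter


def split_site(pages, threshold=0.8):
    """Split one site's pages into a shared base and per-page differences.

    pages:     dict mapping a page URL to the set of keywords found on it
    threshold: hypothetical cut-off; a keyword goes into the base when it
               appears on at least this fraction of the site's pages
    """
    # Count on how many pages each keyword occurs (sets, so once per page).
    counts = Counter()
    for terms in pages.values():
        counts.update(terms)

    min_pages = threshold * len(pages)
    base = {term for term, n in counts.items() if n >= min_pages}

    # Each page keeps only the keywords that are not already in the base.
    # Because the base is approximate, a page that lacks a base keyword
    # is still treated as if it contained it.
    deltas = {url: terms - base for url, terms in pages.items()}
    return base, deltas


# Invented sample data for a single site.
pages = {
    "/a": {"home", "contact", "test"},
    "/b": {"home", "contact", "test", "news"},
    "/c": {"home", "contact", "about"},
}
base, deltas = split_site(pages)
```

With this sample data, "home" and "contact" end up in the base, and each page stores only its remaining keywords.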
Say I want to run a query with two terms combined with the AND operator, for example "home" AND "test". Suppose that "home" is in the base document and "test" appears in a couple of documents from the same website but not in the base document. The correct result is those two documents. How do I get Lucene to do this for me?

I have no previous experience with search engine programming, so I may be going about this all wrong; I'd be glad if anyone could point me in the right direction. I look forward to your suggestions or comments.

Thanks in advance,
Kyriakos Ktorides
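
P.S. To make the expected result concrete, here is a small plain-Python sketch of the behaviour I am after. It is not Lucene code and not a proposed solution; `search_and` and the sample data are invented purely for illustration. A page should match when every query term is found either in the page's own difference document or in the base document it links to.

```python
def search_and(terms, base, deltas):
    """Return the pages that match ALL query terms.

    base:   set of keywords shared by the whole site
    deltas: dict mapping a page URL to the keywords stored only for it
    """
    hits = []
    for url, delta in deltas.items():
        # A term counts as present if it is in the page's own keywords
        # or in the base document the page links to.
        if all(term in delta or term in base for term in terms):
            hits.append(url)
    return sorted(hits)


# Invented sample data: "home" lives in the base, "test" only in two pages.
base = {"home", "contact"}
deltas = {
    "/a": {"test"},
    "/b": {"test", "news"},
    "/c": {"about"},
}
result = search_and(["home", "test"], base, deltas)
```

Here `result` contains only "/a" and "/b", which is the answer I would like Lucene to return for "home" AND "test".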