Re: Removing similar documents from search results

2005-03-14 Thread David Spencer
Miles Barr wrote: Has anyone tried to remove similar documents from their search results? It looks like Google does some on-the-fly filtering of the results, hiding pages which it thinks are too similar, i.e. when you see: "In order to show you the most relevant results, we have omitted some entrie

Re: Removing similar documents from search results

2005-03-14 Thread David Spencer
to be returned it might be used to normalize the scores however... Otis --- David Spencer <[EMAIL PROTECTED]> wrote: Miles Barr wrote: Has anyone tried to remove similar documents from their search results? It looks like Google does some on the fly filtering of the results, hiding pages whic

Re: Removing similar documents from search results

2005-03-15 Thread David Spencer
Miles Barr wrote: On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote: I think what they do at Google is a fancy heuristic -- as David Spencer mentioned, suburls of a given page, identical snippets, or titles... My idea was more towards providing a 'realistic overview' of subjects in

Re: Alert function (aka "profiled alerting")

2005-03-17 Thread David Spencer
Robert Watkins wrote: The reason your suggestion is not practical is scalability. In a production environment you might have, for example, 10,000 stored queries and 10 new documents a minute. That's a fair bit of load on the system for only one aspect of a much larger search application. Fun, inter

Re: Alert function (aka "profiled alerting")

2005-03-17 Thread David Spencer
ut you have certainly given me some good ideas. Answers to your questions are below. -- Robert On Thu, 17 Mar 2005, David Spencer wrote: Fun, interesting question - maybe you could elaborate on the requirements a bit. We deliver on-line content -- journals, reference works and the like. Users ca

Re: Search performance under high load

2005-04-06 Thread David Spencer
Daniel Herlitz wrote: Hi everybody, We have been using Lucene for about one year now with great success. Recently though the index has grown noticeably and so has the number of searches. I was wondering if anyone would like to comment on these figures and say if it works for them? Index size: ~

Re: Search performance under high load

2005-04-07 Thread David Spencer
Yura Smolsky wrote: Hello, mark. mh> 2) My app uses long queries, some of which include mh> very common terms. Using the "MoreLikeThis" query to mh> drop common terms drastically improved performance. If mh> your "killer queries" are long ones you could spot mh> them and service them with a MoreLik
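A minimal sketch of the idea in the quoted tip, dropping very common terms from a long query before running it. This is not Lucene's MoreLikeThis; the function name `drop_common_terms`, the `max_doc_fraction` threshold, and the sample documents are assumptions made for illustration.

```python
from collections import Counter


def drop_common_terms(query_terms, docs, max_doc_fraction=0.5):
    """Keep only query terms that occur in at most max_doc_fraction of docs."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    limit = max_doc_fraction * len(docs)
    return [t for t in query_terms if doc_freq[t.lower()] <= limit]


# Hypothetical corpus; "the" occurs in every document and is dropped.
docs = [
    "the index has grown",
    "the number of searches has grown",
    "lucene handles the load",
    "the query is long",
]
pruned = drop_common_terms(["the", "lucene", "index", "has"], docs)
```

Fewer high-frequency terms means fewer long posting lists to walk, which is where the performance gain described above would come from.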

Re: Scoring, cosine measure

2005-04-20 Thread David Spencer
Daniel Naber wrote: On Wednesday 20 April 2005 18:22, Paul Elschot wrote: Has anyone tried an index based on n-grams? Nutch has bigrams for phrases with frequently occurring words. Also the spell checker in SVN uses n-grams I think. SVN here: http://svn.apache.org/repos/asf/lucene/java/trunk/co
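As a rough illustration of how character n-grams can drive spelling suggestions, here is a small sketch that ranks candidate words by bigram overlap. It is not the code of the spell checker mentioned above; the helper names, the `^`/`$` boundary markers, and the Jaccard scoring are assumptions.

```python
def ngrams(word, n=2):
    """Return the set of character n-grams of word, with boundary markers."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def suggest(misspelled, vocabulary, n=2):
    """Pick the vocabulary word whose n-gram set overlaps most (Jaccard)."""
    target = ngrams(misspelled, n)

    def score(candidate):
        grams = ngrams(candidate, n)
        return len(target & grams) / len(target | grams)

    return max(vocabulary, key=score)


# Hypothetical vocabulary for illustration.
best = suggest("lucnee", ["lucene", "license", "nutch"])
```

In an index-backed version the n-grams would be indexed as terms, so candidate words are found by an ordinary term query rather than by scanning the vocabulary.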

Re: indexing synonyms / reducing the index size

2005-05-04 Thread David Spencer
Pablo Gomes Ludermir wrote: Hello all, I know that we can expand a word to get its synonyms with Wordnet. I was wondering if we could reduce the index size by including a synonym instead of a word on the synonym list. For instance, if "screen" shows up, I would like to replace it by "monitor" (it i
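The replacement Pablo describes can be sketched as a token filter that maps each word to one canonical synonym before indexing. The mapping below is hypothetical (only "screen" to "monitor" comes from the question), and a real setup would derive it from WordNet and apply the same mapping to query terms.

```python
# Hypothetical canonical-synonym table; "display" is an added example.
CANONICAL = {"screen": "monitor", "display": "monitor"}


def canonicalize(tokens, canonical=CANONICAL):
    """Replace each token with its canonical synonym, if one is defined."""
    return [canonical.get(token, token) for token in tokens]


# Two distinct source words collapse into a single index term.
index_terms = set(canonicalize("the screen and the display flicker".split()))
```

The index shrinks only by the number of distinct terms merged, and queries must pass through the same mapping or "screen" will no longer match anything.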

Re: I need 100 most frequently used words in different languages.

2005-05-11 Thread David Spencer
You could try downloading a copy of Wikipedia and processing the entries yourself. I don't know how well represented other languages are but there's a lot of English. Ahmet Aksoy wrote: Hi, I have a project which will be used in order to supply automatic dictionary helps in different language
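Once the text of a dump has been extracted, counting the most frequent words is short. A minimal sketch, assuming the text is already plain (no wiki markup) and that words can be approximated as runs of Unicode letters; the function name `top_words` is made up for this example.

```python
import re
from collections import Counter


def top_words(text, n=100):
    """Return the n most frequent words in text, most frequent first."""
    # [^\W\d_]+ matches runs of letters in any script, excluding digits.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]


# Tiny hypothetical sample standing in for a language's corpus.
sample = "the cat and the dog and the bird"
most_frequent = top_words(sample, 2)
```

Running this once per language dump would give one list per language; languages without whitespace-separated words would need a different tokenizer.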

Re: Search Theory Book

2005-05-12 Thread David Spencer
Anna Bing wrote: Firstly the Lucene in Action Book is great. It really helped me with implementing search for a project. Sorry if this is the wrong forum, but as you are all search people, I wondered if you could recommend any good books about search theory/algorithms, readable if that is possible

Re: SF.net search system

2005-06-29 Thread David Spencer
Chris Conrad wrote: I know I've been asked before for a description of how SourceForge.net is using Lucene. I wrote a blog entry about it and thought people might be interested in seeing at a high level how it was designed. Take a look at http://blog.dev.sf.net. Any comments are welcome.

Re: search caching

2005-08-03 Thread David Spencer
Chris Fraschetti wrote: I've got an application that performs millions of searches against a If the results are not, say, "personalized", then I suggest some kind of web container cache - I use and like OSCache - and it can even cache page fragments. http://www.opensymphony.com/oscache/
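The caching suggestion applies whenever the same query always yields the same results. The sketch below shows the principle with in-process memoization; it is not OSCache (a Java web-tier cache), and `run_search` is a hypothetical stand-in for the real search call.

```python
from functools import lru_cache

calls = 0  # counts how often the underlying search actually runs


def run_search(query):
    """Hypothetical stand-in for an expensive index search."""
    global calls
    calls += 1
    terms = ("lucene", "lucid", "nutch")
    return tuple(sorted(t for t in terms if t.startswith(query)))


@lru_cache(maxsize=1024)
def cached_search(query):
    """Serve repeated, non-personalized queries from memory."""
    return run_search(query)


first = cached_search("luc")
second = cached_search("luc")  # answered from the cache
```

A real cache would also need an expiry or invalidation rule so results are refreshed after the index changes.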

Re: New Site Live Using Lucene

2005-08-08 Thread David Spencer
Otis Gospodnetic wrote: --- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote: Open Source C/C++ only? When are you going to include Open Source Java? We demand fair treatment ;) There are several related sites: http://www.searchmorph.com/ Thanks for the ref, Otis. I run this site, and primarily inde

Re: Announcement: Lucene powering CNET.com Product Category Listings

2005-08-31 Thread David Spencer
Nice write-up. One other nice thing I noticed is you seem to sort numeric attributes numerically instead of alphabetically e.g. here: http://reviews.cnet.com/4566-3156_7-0.html?filter=500193_5314692_ see the 3rd col, "Find by max speed", and note that it has choices in this order: < 2
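The difference between the two orderings is easy to see on a small list. The values below are hypothetical, not the actual choices on the CNET page, and the `x` suffix is an assumed format.

```python
# Hypothetical attribute values stored as strings.
speeds = ["10x", "2x", "24x", "4x"]

# Plain string sort compares character by character.
alphabetical = sorted(speeds)

# Numeric sort parses the leading number before comparing.
numeric = sorted(speeds, key=lambda s: int(s.rstrip("x")))
```

Here `alphabetical` comes out as 10x, 24x, 2x, 4x, while `numeric` gives 2x, 4x, 10x, 24x, which is the order a shopper would expect.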