cost of proximity question
Hi all,

Does anyone know the performance cost of a Nutch-like proximity query that looks like this:

(+Hello +World +"Hello world"~p^a)x ?

Or, more generally, how much processing does proximity add to a query?

Thanks,
Anson

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene vs. MySQL Full-Text
Depending on what MySQL full-text search supports, you will probably lose some of the advanced things you get for free from Lucene, such as proximity search, wildcard search, search term and search field boosting, scoring of the documents, etc.

After all, it depends on what you need to do. In our dev team we are currently having a mini debate over whether to use Lucene for our project or write something from scratch that's based on a DB. We need really good performance. I feel Lucene can do our job very well; some of our guys feel a DB-based search can give us greater performance on the type of search we do.

Anson

-----Original Message-----
From: Florian Sauvin [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 21, 2004 8:55 AM
To: Lucene Users List
Subject: Re: Lucene vs. MySQL Full-Text

On Jul 20, 2004, at 12:29 PM, Tim Brennan wrote:
> Someone came into my office today and asked me about the project I am trying to use Lucene for -- "why aren't you just using a MySQL full-text index to do that" -- after thinking about it for a few minutes, I realized I don't have a great answer.
>
> MySQL builds inverted indexes for (in theory) doing the same type of lookup that Lucene does. You'd maybe have to build some kind of a layer on the front to mimic Lucene's analyzers, but that wouldn't be too hard.
>
> My only experience with MySQL fulltext is trivial test apps -- but the MySQL world does have some significant advantages (it's a known quantity from an operations perspective, etc).
>
> Does anyone out there have anything more concrete they can add?
>
> --tim

I'd say that MySQL full text is much slower if you have a lot of data... that is one of the reasons we started using Lucene (we had a MySQL db to do the search), it's way faster!

-- Florian
RE: speeding up lucene search
Has anyone tried splitting up an index into smaller chunks, without putting the different indices on a different physical disk/box? What sort of performance gain do you get from it?

Anson

-----Original Message-----
From: John Wang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 21, 2004 7:43 PM
To: Lucene Users List
Subject: Re: speeding up lucene search

In general, yes. By splitting up a large index into smaller indices, you are linearizing the search time. Furthermore, that allows you to make your search distributable.

-John

On Wed, 21 Jul 2004 13:00:28 +1000, Anson Lau [EMAIL PROTECTED] wrote:
> Hello guys,
>
> What are some general techniques to make Lucene search faster?
>
> I'm thinking about splitting up the index. My current index has approx 1.8 million documents (small documents) and index size is about 550MB. Am I likely to get much gain out of splitting it up and using a ParallelMultiSearcher?
>
> Most of my search queries search on 5-10 fields.
>
> Are there other things I should look at?
>
> Thanks to all,
> Anson
RE: Weighting database fields
Erik,

Is there any benefit to setting the boost during indexing rather than at query time? I usually set it at query time because you can change the boost values easily without having to re-index.

Thanks,
Anson

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 22, 2004 12:52 AM
To: Lucene Users List
Subject: Re: Weighting database fields

On Jul 21, 2004, at 10:09 AM, Anson Lau wrote:
> Apply boost factor to fields when you do a lucene search.

Or... set the boost on the Field during indexing.

Erik

> Anson
>
> -----Original Message-----
> From: John Patterson [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 22, 2004 12:07 AM
> To: [EMAIL PROTECTED]
> Subject: Weighting database fields
>
> Hi,
>
> What is the best way to get Lucene to assign weightings to certain fields from a database? For example, the 'name' field should be weighted higher than the 'description' field.
>
> Thanks,
> John.
speeding up lucene search
Hello guys,

What are some general techniques to make Lucene search faster?

I'm thinking about splitting up the index. My current index has approx 1.8 million documents (small documents) and index size is about 550MB. Am I likely to get much gain out of splitting it up and using a ParallelMultiSearcher?

Most of my search queries search on 5-10 fields.

Are there other things I should look at?

Thanks to all,
Anson
RE: Scoring without normalization!
If you don't mind hacking the source:

In Hits.java, in method getMoreDocs():

    // Comment out the following
    //float scoreNorm = 1.0f;
    //if (length > 0 && scoreDocs[0].score > 1.0f) {
    //  scoreNorm = 1.0f / scoreDocs[0].score;
    //}

    // And just set scoreNorm to 1.
    float scoreNorm = 1.0f;

I don't know if you can do it without going to the source.

Anson

-----Original Message-----
From: Jones G [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 6:52 AM
To: [EMAIL PROTECTED]
Subject: Scoring without normalization!

How do I remove document normalization from scoring in Lucene? I just want to stick to TF IDF.

Thanks.
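To see what the commented-out lines in getMoreDocs() do to the numbers, here is a minimal standalone sketch of the same scaling step. It does not use Lucene at all: the class name ScoreNormDemo, the normalize() helper and the raw scores are made up for illustration, and only the scaling logic follows the snippet above.

```java
/** Standalone illustration of the score scaling discussed above (not Lucene code). */
final class ScoreNormDemo {

    /**
     * If the top raw score is above 1.0, scale every score by 1/top;
     * otherwise leave the scores untouched. Scores are assumed to be
     * sorted best-first, as in a hit list.
     */
    static float[] normalize(float[] rawScores) {
        float scoreNorm = 1.0f;
        if (rawScores.length > 0 && rawScores[0] > 1.0f) {
            scoreNorm = 1.0f / rawScores[0];
        }
        float[] scaled = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            scaled[i] = rawScores[i] * scoreNorm;
        }
        return scaled;
    }

    public static void main(String[] args) {
        float[] raw = {4.0f, 2.0f, 0.5f}; // hypothetical raw scores, best hit first
        float[] scaled = normalize(raw);
        for (int i = 0; i < raw.length; i++) {
            System.out.println(raw[i] + " -> " + scaled[i]);
        }
    }
}
```

With the scaling removed (scoreNorm fixed at 1), the raw values would be reported unchanged. Note that this only concerns the final scaling of the hit list; the per-field length norms stored in the index are a separate matter.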
Re: Pool of IndexReaders or Pool of Searchers?
Hi,

When I did some load testing on a Lucene-powered search app, using a pool of index searchers didn't give me any more searches per second than just using a singleton index searcher.

Anson

Quoting [EMAIL PROTECTED]:
> Hi,
>
> I have multiple threads reading an index. Should they all be using the same IndexReader and using a pool of IndexSearchers? Or should they be using a pool of IndexReaders? Basically, one reader or many?
>
> Thanks.
RE: best ways of using IndexSearcher
Otis,

Thanks for the advice.

When you say "This stuff is not really CPU intensive", are you referring to the search itself or something else? In my experience the search tends to be ultimately bounded by CPU.

Anson

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 29, 2004 2:51 PM
To: Lucene Users List
Subject: Re: best ways of using IndexSearcher

Anson,

Use a single instance of IndexSearcher and, if you want to always 'see' even the latest index changes (deletes and adds since you opened the IndexSearcher), make sure to re-create the IndexSearcher when you detect that the index version has changed (see http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getCurrentVersion(org.apache.lucene.store.Directory))

When you get the new IndexSearcher, leave the old instance alone - let the GC take care of it, and don't call close() on it, in case something in your application is still using that instance.

This stuff is not really CPU intensive. Disk I/O tends to be the bottleneck. If you are working with multiple indices, spread them over multiple disks (not just partitions, real disks), if you can.

Otis

--- Anson Lau [EMAIL PROTECTED] wrote:
> Hi Guys,
>
> What's the recommended way of using IndexSearcher? Should IndexSearcher be a singleton or pooled? Would pooling provide a more scalable solution by allowing you to decide how many IndexSearchers to use based on, say, how many CPUs you have on your server?
>
> Thanks,
> Anson
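The pattern Otis describes (one shared searcher, swapped for a new one when the index version changes, with the old one left for the GC) could be sketched roughly as below. This is only an illustration under assumptions: SearcherHolder is a made-up class, the type parameter S stands in for IndexSearcher, and the two suppliers are hypothetical hooks that in a real application would wrap IndexReader.getCurrentVersion(directory) and the opening of a new IndexSearcher.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Holds one shared searcher and replaces it when the index version changes.
 * S stands in for IndexSearcher; nothing here depends on Lucene itself.
 */
final class SearcherHolder<S> {
    private final LongSupplier currentVersion; // hypothetical hook: reads the current index version
    private final Supplier<S> opener;          // hypothetical hook: opens a fresh searcher
    private S searcher;
    private long openedVersion;

    SearcherHolder(LongSupplier currentVersion, Supplier<S> opener) {
        this.currentVersion = currentVersion;
        this.opener = opener;
        this.openedVersion = currentVersion.getAsLong();
        this.searcher = opener.get();
    }

    /** Returns the shared instance, re-creating it only if the version has moved on. */
    synchronized S get() {
        long version = currentVersion.getAsLong();
        if (version != openedVersion) {
            // The old instance is deliberately not closed here:
            // another thread may still be searching with it.
            searcher = opener.get();
            openedVersion = version;
        }
        return searcher;
    }
}
```

All request threads would call get() and search with whatever instance comes back. The version check on every call is itself a small cost, so checking on a timer instead may be preferable under heavy load.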
best ways of using IndexSearcher
Hi Guys,

What's the recommended way of using IndexSearcher? Should IndexSearcher be a singleton or pooled? Would pooling provide a more scalable solution by allowing you to decide how many IndexSearchers to use based on, say, how many CPUs you have on your server?

Thanks,
Anson
RE: using boost factor
Hi guys,

It seems that to really customise the scoring in Lucene, one will have to go into the Lucene source. I spent a fair bit of time looking into this and it seems to me that not all of the scoring API is exposed. The formula documented on the Similarity class seems to explain how a term is scored, but not, for example, how the final score on a Boolean query is computed from each individual component. (Please correct me if I'm wrong.) Normalisation is another part where the API is not exposed.

Anson

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 23, 2004 3:51 AM
To: Lucene Users List
Subject: Re: using boost factor

Hello Anson,

I would look at IndexSearcher's explain method:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query,%20int)

This should give you insight into what's contributing to the high/low scores, thus telling you what you can tweak. Perhaps it's just the boost, perhaps some other similarity factors.

Using explain should provide you information such as this, for example:
http://www.mozdex.com/explain.jsp?idx=2&id=2067257&query=goober

I hope this helps. Somebody else will probably be able to give more information, but this should get you started while you wait.

Otis

--- Anson Lau [EMAIL PROTECTED] wrote:
> Hi guys,
>
> Let's say I want to search the term hello world over 3 fields with different boosts:
>
> ((hello:field1 world:field1)^0.001 (hello:field2 world:field2)^100 (hello:field3 world:field3)^2))
>
> Note I've given field1 a really low boost, a heavy boost to field2 and a REALLY heavy boost to field3.
>
> What is happening to me is that a document that matches both field1 and field2 will have a higher score than a document that matches field3 only, even though field3's boost is WAY higher.
>
> Can I change this behaviour such that the match in field3 only will actually have a higher score because of the boost?
>
> Thanks,
> Anson
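One likely contributor to the behaviour described in the quoted question is the coordination factor that Lucene's default Similarity applies to Boolean queries: the summed clause scores are multiplied by (matching clauses / total clauses). The sketch below is a deliberately simplified model of just that step, with each field group treated as a single clause whose raw score is 1, so tf, idf, length norms and query normalisation are all ignored. CoordSketch and its score() method are made-up names, and the boosts are the ones from the query as posted.

```java
/** Simplified model of Boolean-query scoring: coord * sum of matching clause boosts. */
final class CoordSketch {

    /**
     * boosts[i] is the boost of clause i; matches[i] says whether the document
     * matched that clause. Every matching clause is assumed to have a raw score
     * of 1, which is a simplification made for this illustration only.
     */
    static double score(double[] boosts, boolean[] matches) {
        int overlap = 0;
        double sum = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            if (matches[i]) {
                overlap++;
                sum += boosts[i];
            }
        }
        double coord = (double) overlap / boosts.length; // fraction of clauses matched
        return coord * sum;
    }

    public static void main(String[] args) {
        double[] boosts = {0.001, 100.0, 2.0}; // field1, field2, field3 as in the posted query
        double fields1And2 = score(boosts, new boolean[] {true, true, false});
        double field3Only = score(boosts, new boolean[] {false, false, true});
        System.out.println("matches field1+field2: " + fields1And2);
        System.out.println("matches field3 only:   " + field3Only);
    }
}
```

In this model a field3-only match can only win if its boost exceeds twice the combined boosts of field1 and field2, since the two-clause match also gets double the coordination factor. IndexSearcher's explain output, as suggested above, remains the way to check what actually happens for a given document.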
RE: a list of matching search term
Thanks Erik, I'll give that a try.

Anson

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 02, 2004 7:28 PM
To: Lucene Users List
Subject: Re: a list of matching search term

On Jun 1, 2004, at 9:19 PM, Anson Lau wrote:
> Further to my previous email:
>
> The highlighter package should be able to pick up the matching search terms. Can some experienced highlighter package users tell me if I should look down that line?

Yes, Highlighter (available in the sandbox) picks out matching terms. If you used a custom Formatter with Highlighter, you could pick out matching terms and have a list of them. This would not be something you do for every hit, though, as it would take a little time to do for each document.

Erik
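For a rough idea of what "a list of matching terms per hit" means, here is a naive standalone sketch. It is not the Highlighter approach Erik describes: it simply re-tokenizes a document's text on non-word characters and checks which query terms occur, whereas a real implementation would have to use the same analyzer as the index. MatchedTerms and matched() are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Naive illustration: which of the query terms occur in a document's text? */
final class MatchedTerms {

    /** Returns the query terms found in docText, in query order, ignoring case. */
    static List<String> matched(List<String> queryTerms, String docText) {
        // Crude tokenization for illustration only; a real index uses an Analyzer.
        Set<String> tokens = new HashSet<>(Arrays.asList(docText.toLowerCase().split("\\W+")));
        List<String> hits = new ArrayList<>();
        for (String term : queryTerms) {
            if (tokens.contains(term.toLowerCase())) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("apple", "orange", "banana");
        System.out.println(matched(query, "An apple and an orange."));
    }
}
```

Like the Highlighter route, this needs the document text for each hit, so it is best limited to the hits actually being displayed.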
a list of matching search term
Hi All,

E.g. let's say someone does a search on the terms 'apple orange banana'. In the search results, is it possible to find out, for each hit, which of those terms matched?

I.e. the document with the highest score has all three words, so the matching terms are all of those words. A lesser document may only have 'apple' and 'orange' in it, so the matching terms will be 'apple' and 'orange'.

Thanks,
Anson
RE: a list of matching search term
Further to my previous email:

The highlighter package should be able to pick up the matching search terms. Can some experienced highlighter package users tell me if I should look down that line?

Thanks a lot.
Anson

-----Original Message-----
From: Anson Lau [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 01, 2004 5:20 PM
To: 'Lucene Users List'
Subject: a list of matching search term

Hi All,

E.g. let's say someone does a search on the terms 'apple orange banana'. In the search results, is it possible to find out, for each hit, which of those terms matched?

I.e. the document with the highest score has all three words, so the matching terms are all of those words. A lesser document may only have 'apple' and 'orange' in it, so the matching terms will be 'apple' and 'orange'.

Thanks,
Anson
field boost factor
Hi all,

Is it possible to set a different boost factor for different fields when you do a search, rather than when you index?

Thanks,
Anson
RE: field boost factor
I think I found it in the Query API...

Thanks,
Anson

-----Original Message-----
From: Anson Lau [mailto:[EMAIL PROTECTED]
Sent: Friday, May 14, 2004 4:27 PM
To: [EMAIL PROTECTED]
Subject: field boost factor

Hi all,

Is it possible to set a different boost factor for different fields when you do a search, rather than when you index?

Thanks,
Anson
looking for developer
Hi All,

Our company is looking for 2 Java developers with strong Lucene experience to do some contract work. We're in Sydney, Australia. If anyone is interested, please email me directly ([EMAIL PROTECTED]).

Thanks,
Anson
RE: looking for developer
Esmond,

Thanks a lot for your email. Will certainly consider that - do you know another one or two people who also know Lucene very well, who you can work with in case we need another person? No point having one developer in Sydney and one in Melbourne.

To give you a bit more background - we'll be indexing and searching a database of approx 1.5 million records. Do you have experience with this sort of scale? I don't, and I am specifically looking for people with that.

Thanks,
Anson

-----Original Message-----
From: Esmond Pitt [mailto:[EMAIL PROTECTED]
Sent: Monday, March 29, 2004 12:41 PM
To: Lucene Users List
Subject: Re: looking for developer

Anson

I have very strong Lucene experience, covering 1.2, 1.3, and the current 1.4-almost-an-RC, having built search engines for several web sites with it. I'm located in Melbourne and not very able to change that, but if you don't get other bites maybe we can talk about telecommuting and the occasional day trip?

Esmond Pitt FACS
0400 139 869

----- Original Message -----
From: Anson Lau [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, March 29, 2004 12:23 AM
Subject: looking for developer

Hi All,

Our company is looking for 2 Java developers with strong Lucene experience to do some contract work. We're in Sydney, Australia. If anyone is interested, please email me directly ([EMAIL PROTECTED]).

Thanks,
Anson
RE: looking for developer
Oops - sorry, I should have replied to Esmond directly. Please ignore the previous message.

Anson

-----Original Message-----
From: Anson Lau [mailto:[EMAIL PROTECTED]
Sent: Monday, March 29, 2004 1:29 PM
To: 'Lucene Users List'
Subject: RE: looking for developer

Esmond,

Thanks a lot for your email. Will certainly consider that - do you know another one or two people who also know Lucene very well, who you can work with in case we need another person? No point having one developer in Sydney and one in Melbourne.

To give you a bit more background - we'll be indexing and searching a database of approx 1.5 million records. Do you have experience with this sort of scale? I don't, and I am specifically looking for people with that.

Thanks,
Anson

-----Original Message-----
From: Esmond Pitt [mailto:[EMAIL PROTECTED]
Sent: Monday, March 29, 2004 12:41 PM
To: Lucene Users List
Subject: Re: looking for developer

Anson

I have very strong Lucene experience, covering 1.2, 1.3, and the current 1.4-almost-an-RC, having built search engines for several web sites with it. I'm located in Melbourne and not very able to change that, but if you don't get other bites maybe we can talk about telecommuting and the occasional day trip?

Esmond Pitt FACS
0400 139 869

----- Original Message -----
From: Anson Lau [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, March 29, 2004 12:23 AM
Subject: looking for developer

Hi All,

Our company is looking for 2 Java developers with strong Lucene experience to do some contract work. We're in Sydney, Australia. If anyone is interested, please email me directly ([EMAIL PROTECTED]).

Thanks,
Anson
RE : Lucene scalability/clustering
RBP,

I'm implementing a search engine for a project at work. It's going to index approx 1.5 million rows in a database.

I am trying to get a feel for what my options are when scalability becomes an issue. I also want to know if those options require me to implement my app in a different way right from the start.

Anson

-----Original Message-----
From: Rasik Pandey [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 24, 2004 9:34 PM
To: 'Lucene Users List'
Subject: RE : RE : Lucene scalability/clustering

> I'm trying to see what are some common ways to scale lucene onto multiple boxes. Is RMI based search and using a MultiSearcher the general approach?

More details about what you are attempting would be helpful.

RBP
RE: RE : Lucene scalability/clustering
I'm trying to see what some common ways are to scale Lucene onto multiple boxes. Is RMI-based search using a MultiSearcher the general approach?

There don't seem to be many articles on the web on how to implement a Lucene search cluster. If anyone knows a good article, can you please post it here?

Thanks,
Anson

-----Original Message-----
From: Rasik Pandey [mailto:[EMAIL PROTECTED]
Sent: Monday, February 23, 2004 9:46 PM
To: 'Lucene Users List'
Subject: RE : Lucene scalability/clustering

> Further on this topic - has anyone tried implementing a distributed search with Lucene? How does it work and does it work well?

I assume you are referring to RMI based search? It works well, as does MultiSearcher.

RBP
RE: Lucene scalability/clustering
Further on this topic - has anyone tried implementing a distributed search with Lucene? How does it work and does it work well?

Anson

-----Original Message-----
From: Hamish Carpenter [mailto:[EMAIL PROTECTED]
Sent: Monday, February 23, 2004 5:24 AM
To: Lucene Users List
Subject: Re: Lucene scalability/clustering

Hi All,

I'm Hamish Carpenter who contributed the benchmarks with the comment about the IndexSearcherCache. Using this solved our issues with too many files open under Linux.

The original IndexSearcherCache email is here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg01967.html

See here for a copy of the above message and a download link:
http://www.geocities.com/haytona/lucene/

The mailing list doesn't like attachments. The source is 10K in size.

HTH
Hamish Carpenter.

[EMAIL PROTECTED] wrote:
> BTW, where can I get Peter Halacsy's IndexSearcherCache?
number of fields, size of fields
Hi All,

I'm a beginner with Lucene. I would like to know if there are general guidelines on:

1. the number of fields a document can have
2. size of unindexed fields
3. size of a stored text field

I just want to get a feel for what the good practices are.

Thanks,
Anson Lau