RE: Limiting Hits with a score threshold
I would not recommend doing this, because absolute score values in Lucene are not meaningful (scores are not directly comparable across searches). The ratio of a score to the highest score returned is meaningful, but there is no absolute calibration for the highest score returned, at least at present, so there is no way to determine from the scores what the quality of the result set is overall.

Various approaches to improving this have been discussed: making the scores more directly comparable by encoding additional information into the score and using that for normalization, or, probably better, generalizing the score to an object that contains multiple pieces of information (e.g., the total number of query terms matched by the top result, which would be quite useful if you are using the default OR). None of these ideas are implemented yet as far as I know.

Chuck

-----Original Message-----
From: Jay Hill [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 14, 2005 11:08 AM
To: lucene-user@jakarta.apache.org
Subject: Limiting Hits with a score threshold

Does anyone have an example of limiting results returned based on a score threshold? For example, if I'm only interested in documents with a score > 0.05.

Thanks,
-Jay

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
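Following Chuck's point that only the ratio to the top score is meaningful, a relative cutoff can still be applied client-side. A minimal standalone sketch (plain Java; it operates on a hypothetical array of scores already sorted in descending Hits order, rather than on the Hits API itself):

```java
import java.util.ArrayList;
import java.util.List;

public class RelativeScoreCutoff {
    // Keep only scores within `ratio` of the top score; e.g. ratio = 0.5f
    // keeps hits scoring at least half as well as the best hit.
    static List<Float> cutoff(float[] scores, float ratio) {
        List<Float> kept = new ArrayList<>();
        if (scores.length == 0) return kept;
        float threshold = scores[0] * ratio; // scores assumed sorted descending
        for (float s : scores) {
            if (s >= threshold) kept.add(s);
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.6f, 0.44f, 0.1f};
        // threshold = 0.45, so only the first two scores survive
        System.out.println(cutoff(scores, 0.5f).size()); // prints 2
    }
}
```

This sidesteps the absolute-threshold problem Chuck describes, since the cutoff adapts to whatever the top score happens to be for each search.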
RE: Similarity coord,lengthNorm
Hi Michael,

I'd suggest first using the explain() mechanism to figure out what's going on. Besides lengthNorm(), another factor that is likely skewing your results, in my experience, is idf(), which Lucene typically makes very large by squaring the intrinsic value. I've found it helpful to flatten lengthNorm(), tf() and idf() relative to what is used in DefaultSimilarity.

There is a comparative evaluation of Similarity implementations going on now. Bug 32674 has a WikipediaSimilarity posted that you might want to try. You might want to flatten lengthNorm() even further (e.g., all the way to 1.0), but I'd suggest trying it as-is first. If you try it, please post your assessment. Here's the link: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674

You also might find it interesting to read the thread entitled "RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq()?" on lucene-dev, as it contains a discussion of many of the issues.

Good luck,
Chuck

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 07, 2005 6:51 AM
To: Lucene Users List
Subject: Re: Similarity coord,lengthNorm

On Feb 7, 2005, at 8:53 AM, Michael Celona wrote:

> Would fixing the lengthNorm to 1 fix this problem?

Yes, it would eliminate the length of a field as a factor. Your best bet is to set up a test harness where you can try out various tweaks to Similarity, but setting the length normalization factor to 1.0 may be all you need to do, as coord() takes care of the other factor you're after.

Erik

-----Original Message-----
From: Michael Celona [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 07, 2005 8:48 AM
To: Lucene Users List
Subject: Similarity coord,lengthNorm

I have varying-length text fields which I am searching on. I would like relevancy to be dictated predominantly by the number of terms in my query that match.
Right now I am seeing high relevancy for a single word matching in a small document, even though all the terms in my query don't match. Does anyone have an example of a custom Similarity subclass which overrides the coord() and lengthNorm() methods?

Thanks,
Michael
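A standalone sketch of the two method bodies being discussed (plain Java; it deliberately does not extend Lucene's DefaultSimilarity so it runs on its own, but the formulas are the ones from this thread: coord() as overlap/maxOverlap, and lengthNorm() flattened to 1.0 so field length no longer matters):

```java
public class FlatSimilaritySketch {
    // coord(): fraction of the query terms that matched, as in DefaultSimilarity.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // lengthNorm() flattened to a constant: field length no longer affects scores.
    // (DefaultSimilarity returns 1/sqrt(numTerms) here instead.)
    static float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }

    public static void main(String[] args) {
        System.out.println(coord(4, 5));                     // prints 0.8
        System.out.println(lengthNorm("contents", 1000000)); // prints 1.0
    }
}
```

In real code these bodies would override the corresponding methods of a DefaultSimilarity subclass, and as Erik notes in the thread, the index must then be rebuilt, because lengthNorm is applied at indexing time.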
RE: which HTML parser is better?
I think that depends on what you want to do. The Lucene demo parser does a simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (it uses the same API and will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well; based on its UI, it appears to be focused primarily on HTML validation and error detection/correction.

I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far.

Chuck

-----Original Message-----
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?

Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function?
RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Like any other field, A.I. is only elusive until you master it. There are plenty of companies using A.I. techniques successfully in various IR applications. LSI in particular has been around a long time and is well understood.

Chuck

-----Original Message-----
From: jian chen [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 20, 2005 2:10 PM
To: Lucene Users List
Subject: Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

Hi,

One thing to point out: I think Lucene is not using LSI as the underlying retrieval model. It uses the vector space model and also proximity-based retrieval. Personally, I don't know much about LSI, and I don't think fancy stuff like LSI is workable in industry. I believe we are far away from the era of artificial intelligence and from using any elusive way to do information retrieval.

Cheers,
Jian

On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore [EMAIL PROTECTED] wrote:

Hi all. I'm new to the list, so forgive a dumb question or two as I get started. We're in the midst of converting a small collection (1200-1500 documents currently) of scientific literature to be easily searchable/navigable. We'll likely provide both a text query interface as well as a graphical way to search and discover. Our initial approach will be vector based, looking at Latent Semantic Indexing (LSI) as a potential tool, although if that's not needed, we'll stop at reasonably simple stemming with a weighted document term matrix (DTM). (Bear in mind I couldn't even pronounce most of these concepts last week, so go easy if I'm incoherent!)

It looks to me like Lucene has a quite well-factored architecture. I should at the very least be able to use the analyzer and stemmer to create a good starting point for the project. I'd also like to leave a nice architecture behind in case we or others end up experimenting with, or extending, the system. So a couple of questions:

1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems, i.e.
not really human readable. (Example: generate, generates, generated, generating -> generat.) Although in typical queries this is not important, because the result of the search is a document list, it *would* be important if we use the stems within a graphical navigation interface. So the question is: is there a way to have the stemmer produce English base forms of the words being stemmed?

2 - We're probably using Lucene in ways it was not designed for, such as DTM/LSI and graphical clustering and navigation. Naturally we'll provide code for the parts that are not in Lucene. But the question arises: is this kinda dumb?! Has anyone stretched Lucene's design center with positive results? Are we barking up the wrong tree?

3 - A nit on hyphenation: our collection is scientific, so it has many hyphenated words. In our collection, things like self-organization, power-law, space-time, small-world, agent-based, etc. occur often. So the question is: do folks break up hyphenated words? If not, do you stem the parts and glue them back together? Do you apply stoplists to the parts?

Thanks for any help and pointers you can fling along,

Owen
http://backspaces.net/ http://redfish.com/
RE: QUERYPARSIN BOOSTING
Google has natural results on the left and sponsored results on the right. I do not believe the natural results are affected by paid keywords at all. What you seem to be describing is the behavior of the sponsored results, which I believe are explicitly attached to certain keywords.

The same approach would work in Lucene. Create a field to hold purchased keywords (any keywords you want to associate with the result). Then you can include this field in your search with a high boost (see DistributingMultiFieldQueryParser, http://issues.apache.org/bugzilla/show_bug.cgi?id=32674).

Google prefers certain results over others for certain keywords based on various factors of the keyword purchase and the site (amount paid for the keyword, PageRank of the site, tenure of the listing, popularity of the listing, etc.). You could emulate this in various ways, using a combination of document/field boosting and perhaps replication of the term in the field (to increase its tf), or even multiple fields boosted at different levels. I'm not sure of the best approach to this part; you could experiment a little.

Chuck

-----Original Message-----
From: Karthik N S [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 12, 2005 2:30 AM
To: Lucene Users List
Subject: RE: QUERYPARSIN BOOSTING

Hi Guys,

Apologies... If somebody has been closely watching Google, it boosts websites for paid-category sites based on search words. Can this [boosting the full website] be achieved in Lucene's search based on a search word? If so, please explain / give examples.

With regards,
Karthik

-----Original Message-----
From: Chuck Williams [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 11, 2005 2:00 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: RE: QUERYPARSIN BOOSTING

Karthik,

I don't think the boost in your example does much, since you are using an AND query, i.e. all hits will have to contain both vendor:nike and contents:shoes.
If you used an OR, then the boost would put Nike products above (non-Nike) shoes, unless some other factor caused the score of contents:shoes to be 10x greater than that of vendor:nike. It's a good idea to look at the results of explain() when analyzing what's happening with scoring and when tuning your boosts and your Similarity.

Chuck

-----Original Message-----
From: Nader Henein [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 11, 2005 12:21 AM
To: Lucene Users List
Subject: Re: QUERYPARSIN BOOSTING

From the text on the Lucene Jakarta site, http://jakarta.apache.org/lucene/docs/queryparsersyntax.html :

Lucene provides the relevance level of matching documents based on the terms found. To boost a term, use the caret, ^, symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be. Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for "jakarta apache" and you want the term "jakarta" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type: jakarta^4 apache. This will make documents with the term jakarta appear more relevant. You can also boost phrase terms, as in the example: "jakarta apache"^4 "jakarta lucene". By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g., 0.2).

Regards,
Nader Henein

Karthik N S wrote:

Hi Guys,

Apologies... This question may have been asked a million times on this forum; I need some clarifications.

1) FieldType = keyword, name = vendor
2) FieldType = text, name = contents

Questions:
1) How to construct a query which would allow hits available for the VENDOR to appear first?
2) If boosting is to be applied, how to?
3) Is the query constructed below correct?

+Contents:shoes +((vendor:nike)^10)

Please advise. Thx in advance.
WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK ]
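Chuck's OR-query behavior (a boosted vendor:nike clause putting Nike products above plain shoe matches) can be seen with toy arithmetic. A standalone sketch (plain Java; the per-clause scores and the simple additive combination are made-up illustrative numbers, not Lucene's actual scoring formula):

```java
public class BoostSketch {
    // Toy OR-query score: sum of clause scores, with the vendor clause boosted.
    // The clause scores passed in below are illustrative numbers only.
    static float score(float contentsShoes, float vendorNike, float boost) {
        return contentsShoes + boost * vendorNike;
    }

    public static void main(String[] args) {
        float boost = 10f;
        // Doc A matches contents:shoes only; Doc B matches both clauses.
        float docA = score(0.6f, 0.0f, boost); // 0.6
        float docB = score(0.3f, 0.2f, boost); // 0.3 + 10 * 0.2 = 2.3
        System.out.println(docB > docA);       // prints true: boosted vendor match wins
    }
}
```

Under the AND form (+contents:shoes +vendor:nike), every surviving hit matches both clauses, which is why Chuck says the boost does little there: it cannot promote Nike documents over non-Nike ones that were already excluded.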
RE: SQL Distinct sintax in Lucen
If I understand what you are trying to do, you don't have a problem. You can OR to your heart's content and Lucene will properly create the union of the results, i.e., there will be no duplicates.

There is built-in support for this kind of thing. See MultiFieldQueryParser, and for better results, consider http://issues.apache.org/bugzilla/show_bug.cgi?id=32674.

Chuck

-----Original Message-----
From: Carlos Franco Robles [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 11, 2005 2:05 PM
To: lucene-user@jakarta.apache.org
Subject: SQL Distinct sintax in Lucen

Hi all. I'm starting to use Lucene, and I wonder if it is possible to write a query that asks for one string which can be in either of two different fields, filtering duplicated results like DISTINCT in SQL syntax. Something like:

distinct (+string OR OtherField:(+string))

Thanks a lot
RE: Parsing issue
I use it and have yet to have a problem with it. It uses the Xerces API, so you parse and access HTML files just like XML files. Very cool.

Chuck

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 04, 2005 2:05 PM
To: Lucene Users List
Subject: Re: Parsing issue

That's the correct place to look, and it includes code samples. Yes, it's a jar file that you add to the CLASSPATH and use... hm, normally programmatically, yes :).

Otis

--- Hetan Shah [EMAIL PROTECTED] wrote:

Has anyone used NekoHTML? If so, how do I use it? Is it a standalone jar file that I include in my classpath and start using, just like IndexHTML? Can someone share syntax and/or code if it is supposed to be used programmatically? I am looking at http://www.apache.org/~andyc/neko/doc/html/ for more information; is that the correct place to look?

Thanks,
-H

Erik Hatcher wrote:

Sure... clean up your HTML and it'll parse fine :) Perhaps use JTidy to clean up the HTML. Or switch to using a more forgiving parser like NekoHTML.

Erik

On Jan 4, 2005, at 3:59 PM, Hetan Shah wrote:

Hello All,

Does anyone know how to handle the following parsing error? Thanks for pointers/code snippets.

-H

While trying to parse an HTML file using IndexHTML I get:

Parse Aborted: Encountered \ at line 8, column 1162. Was expecting one of: ArgName ... = ... TagEnd ...
RE: Asking Questions in a Search
Verity acquired Native Minds; Verity Response appears to be that technology. It is not search technology at all; rather, it is a programmed question-answer script knowledge base. IMO, there are much better commercial solutions to this problem; e.g., see www.inquira.com, which integrates automated natural-language search (i.e., finding specific answers to natural-language questions within a text corpus) with question/answer scripting capabilities.

I believe Lucene would be an excellent foundation for a system like this, but it would need to be extended with a natural-language query parser / search-query generator and, if desired, some form of scripting knowledge base. Somebody may have gone down this path, but I'm not aware of it.

Chuck

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 28, 2004 7:52 PM
To: lucene-user@jakarta.apache.org
Subject: Asking Questions in a Search

Hi,

Is it possible to do something like this with Lucene: http://www.verity.com/products/response/index.html

Thanks
RE: Poor Lucene Ranking for Short Text
I think you are confusing lengthNorm and the overall normalization of the score. For overall normalization (prior to a final forced normalization in Hits), Lucene uses the formula you cite, except that it never sums tf_d*idf_t, using instead tf_q*idf_t again, because the former is computationally intractable: changing even a single document changes the idf values, which means either that all document norms would have to be recomputed or that the sum over the document would need to happen at query time. The former is unacceptable at indexing time with large indices, and the latter is unacceptable at query time with large documents.

lengthNorm is by default 1/sqrt(number_terms_in_document). It is not 1.0f by default because 1.0f is in general not a good value; e.g., a single occurrence of a term in a 1 MB document is not as significant as a single occurrence of the same term in a 1 KB document. However, I find that the default value needs additional damping, because it affects the score too much, especially for small documents. So I use something like:

3.0f / log10(1000 + number_terms_in_document)

Chuck

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 24, 2004 8:24 AM
To: 'Lucene Users List'
Subject: AW: Poor Lucene Ranking for Short Text

Hi Kevin,

You seem to have some knowledge about the lengthNorm value in Lucene. Comparing it to the formula in Modern Information Retrieval: does it sum up the denominator sqrt(sum((tf_d*idf_t)^2)) * sqrt(sum((tf_q*idf_t)^2))? Just a quick note is ok. Besides that, could you invite me to Rojo? Their beta status seems to be taking quite long.

Thanks,
Michael

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Kevin A. Burton
Sent: Wednesday, October 27, 2004 10:48 PM
To: Lucene Users List
Subject: Re: Poor Lucene Ranking for Short Text

Daniel Naber wrote (Kevin complains about shorter documents being ranked higher):

> This is something that can easily be fixed. Just use a Similarity implementation that extends DefaultSimilarity and that overwrites lengthNorm: just return 1.0f there. You need to use that Similarity for indexing and searching, i.e. it requires reindexing.

What happens when I do this with an existing index? I don't want to have to rewrite this index, as it will take FOREVER. If the current behavior is all that happens, this is fine; this way I can just get this behavior for new documents that are added. Also... why isn't this the default?

Kevin

--
Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat.
Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc., then you should work for Rojo! If you recommend someone and we hire them, you'll get a free iPod!
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
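The damping Chuck describes can be compared numerically with the default. A standalone sketch (plain Java; 1/sqrt(n) is the default lengthNorm and 3.0/log10(1000+n) is the damped variant quoted above, while the 100-term and 10,000-term document sizes are illustrative assumptions):

```java
public class LengthNormDamping {
    // DefaultSimilarity-style norm: 1/sqrt(numTerms).
    static double defaultNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Chuck's damped variant: 3.0 / log10(1000 + numTerms).
    static double dampedNorm(int numTerms) {
        return 3.0 / Math.log10(1000 + numTerms);
    }

    public static void main(String[] args) {
        // Compare a 100-term document against a 10,000-term document:
        double defRatio = defaultNorm(100) / defaultNorm(10000);
        double dampRatio = dampedNorm(100) / dampedNorm(10000);
        System.out.println(defRatio);  // ~10: the short doc is favored 10x by default
        System.out.println(dampRatio); // ~1.33: only a mild preference after damping
    }
}
```

This makes Chuck's complaint concrete: under the default, a 100x length difference translates into a 10x score factor, while the damped formula shrinks it to roughly 1.3x.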
RE: I though I understood, but obviously I missed something.
All of your Document.add calls need to be doc.add calls. You are adding the fields to the document instance, not to the class.

Chuck

-----Original Message-----
From: Jim Lynch [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 24, 2004 8:30 AM
To: Lucene Users List
Subject: I though I understood, but obviously I missed something.

A snippet from my program:

Document doc = new Document();
Field fContent = new Field("content", content.toString(), false, true, true);
Field fTitle = new Field("title", title, true, true, true);
Field fDate = new Field("date", date, true, true, false);
Document.add(fContent);
Document.add(fTitle);
Document.add(fDate);

generates this (and other errors like it):

method add(org.apache.lucene.document.Field) cannot be referenced from a static context
[javac] Document.add(fContent);

Where did I go wrong?

Thanks,
Jim.
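The compile error is a general Java one: an instance method invoked through the class name. A minimal dependency-free illustration (the Doc class below is a hypothetical stand-in for Lucene's Document, just to show the pattern):

```java
import java.util.ArrayList;
import java.util.List;

public class StaticVsInstance {
    // Hypothetical stand-in for Lucene's Document: add() is an instance method.
    static class Doc {
        List<String> fields = new ArrayList<>();
        void add(String field) { fields.add(field); }
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        // Doc.add("title");  // would not compile: add() needs an instance
        doc.add("content");   // correct: call add() on the instance you created
        doc.add("title");
        System.out.println(doc.fields.size()); // prints 2
    }
}
```

The same change fixes Jim's snippet: each Document.add(f...) becomes doc.add(f...), since doc is the instance being built.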
RE: Relevance percentage
Gururaja,

If you want to score based solely on coord(), then Paul's approach below looks best. However, based on your earlier messages, it looks to me like you want to score based on all factors (with coord boosted as Paul suggested, or lengthNorm flattened as I suggested; either will get the order you want in the example you posted), but print the unboosted coord percentage along with each result in the result list. If that is the case, then since the number of results per page is presumably small, I think you are best off replicating the explain() mechanism. I don't have source code, but you can look at IndexSearcher.explain(), which recreates the weight with Query.weight() and then calls what in this case will be BooleanQuery.BooleanWeight.explain(), which has the code to recompute coord for a result (specifically, it computes overlap and maxOverlap and then calls Similarity.coord()). You could cut and paste this code to compute just coord for your top-level BooleanQuery's. Sorry I don't have source code to do this, but the approach should work.

Good luck,
Chuck

-----Original Message-----
From: Paul Elschot [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 22, 2004 11:59 PM
To: lucene-user@jakarta.apache.org
Subject: Re: Relevance percentage

On Thursday 23 December 2004 08:13, Gururaja H wrote:

> Hi Chuck Williams,
>
> Thanks much for the reply.
>
> > If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is > 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain(), as it does this (along with generating the rest of the explanation). If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined).
>
> We are supporting the full Lucene query language.
> My request is: assuming the queries are all BooleanQuery's, please post the implementation source code for the same, i.e. to calculate the coord() method's input parameters, overlap and maxOverlap.

I don't have the code, but I can give an overview of possible steps. First, inherit from BooleanScorer to implement a score() method that returns only the coord() value (preferably a precomputed one). Then inherit from BooleanQuery.BooleanWeight to return the above Scorer. Then inherit from BooleanQuery to use the above Weight in createWeight(). Then inherit from QueryParser to use the above Query in getBooleanQuery(). Finally, use such a query in a search: the document scores will be the coord() values.

Regards,
Paul Elschot
RE: Lucene index files from two different applications.
Depending on what you are doing, there are some problems with MultiSearcher. See http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 for a description of the issues and possible patch(es) to fix them.

Chuck

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 21, 2004 3:09 AM
To: Lucene Users List
Subject: Re: Lucene index files from two different applications.

On Dec 21, 2004, at 5:51 AM, Gururaja H wrote:

> 1. Can two applications write index files, in the same directory, at the same time?

If you mean to the same Lucene index, the answer is no. Only a single IndexWriter instance may be writing to an index at one time.

> 2. If two applications cannot write index files, in the same directory, at the same time, how should we resolve this? Would appreciate any solutions.

You might consider writing a queuing system, so that the two applications queue up documents to index and a single indexer application reads from the queue. Or the applications could wait until the index is available for writing. Or...

> 3. My thought is to write the index files in two different directories and read both indexes (as though they form a single index; search results should consider the documents in both indexes) from the web application. How to go about implementing this using the Lucene API? Need inputs on which of the Lucene APIs to use.

Lucene can easily search multiple indexes using MultiSearcher. This merges the results together as you'd expect.

Erik
RE: Relevance percentage
The coord() value is not saved anywhere, so you would need to recompute it. You could either call explain() and parse the result string or, better, look at explain() and implement what it does more efficiently just for coord(). If your queries are all BooleanQuery's of TermQuery's, then this is very simple: iterate down the list of BooleanClause's and count the number whose score is > 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain(), as it does this (along with generating the rest of the explanation). If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined).

I'm on the West Coast of the U.S., so evidently in a very different time zone from you; I will look at your other message next.

Chuck

-----Original Message-----
From: Gururaja H [mailto:[EMAIL PROTECTED]]
Sent: Monday, December 20, 2004 6:10 AM
To: Lucene Users List; Mike Snare
Subject: Re: Relevance percentage

Hi,

But how to calculate the coord() fraction? I know that by default, in DefaultSimilarity, the coord() fraction is defined as below:

/** Implemented as <code>overlap / maxOverlap</code>. */
public float coord(int overlap, int maxOverlap) {
    return overlap / (float) maxOverlap;
}

How to get the overlap and maxOverlap values for each of the matched document(s)?

Thanks,
Gururaja

Mike Snare [EMAIL PROTECTED] wrote:

I'm still new to Lucene, but wouldn't that be the coord()? My understanding is that coord() is the fraction of the boolean query that matched a given document. Again, I'm new, so somebody else will have to confirm or deny...

-Mike

On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H wrote:

How to find out the percentage of matched terms in the document(s) using Lucene?
Here is an example of what I am trying to do. The search query has 5 terms (ibm, risc, tape, drive, manual), and there are 4 matching documents with the following attributes:

Doc#1: contains terms (ibm, drive)
Doc#2: contains terms (ibm, risc, tape, drive)
Doc#3: contains terms (ibm, risc, tape, drive)
Doc#4: contains terms (ibm, risc, tape, drive, manual)

The percentages displayed would be 100% (Doc#4), 80% (Doc#2), 80% (Doc#3) and 40% (Doc#1). Any help on how to go about doing this?

Thanks,
Gururaja
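The requested percentages are just coord() = overlap/maxOverlap expressed as a percent. A dependency-free sketch that reproduces the numbers in the example (plain Java; the term sets are the hypothetical Doc#1..Doc#4 above, not read from a real index):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CoordPercent {
    // coord() as a percentage: matched query terms / total query terms.
    static int percent(Set<String> docTerms, List<String> queryTerms) {
        int overlap = 0;
        for (String t : queryTerms) {
            if (docTerms.contains(t)) overlap++;
        }
        return Math.round(100f * overlap / queryTerms.size());
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("ibm", "risc", "tape", "drive", "manual");
        Set<String> doc1 = new HashSet<>(Arrays.asList("ibm", "drive"));
        Set<String> doc4 = new HashSet<>(Arrays.asList("ibm", "risc", "tape", "drive", "manual"));
        System.out.println(percent(doc1, query)); // prints 40
        System.out.println(percent(doc4, query)); // prints 100
    }
}
```

In a real implementation, as Chuck describes, overlap and maxOverlap would come from walking the top-level BooleanQuery's clauses for each hit rather than from precomputed term sets.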
RE: Relevance and ranking ...
I believe your sole problem is that you need to tone down your lengthNorm. Because doc4 is 10 times longer than doc2, its lengthNorm is less than 1/3 of doc2's (1/sqrt(10) of it, to be precise). This is a larger effect than the higher coord factor (1/0.8 = 1.25x) and the extra matching term in doc4. In your original description, it sounds like you want coord() to dominate lengthNorm(), with lengthNorm() just being used as a tie-breaker among documents with the same coord(). To achieve this, you need to reduce the impact of lengthNorm() differences by changing the sqrt() in the computation of lengthNorm to something much flatter. E.g., you might use:

public float lengthNorm(String fieldName, int numTerms) {
    return (float) (1.0 / Math.log10(1000 + numTerms));
}

I'm not sure whether that specific formula will work, but you can find one that will by adjusting the base of the logarithm and the additive constant (1000 in the example).

Some general things:
1. You need to reindex when you change the Similarity (it is used for both indexing and searching; e.g., the lengthNorm's are computed at index time).
2. Be careful not to overtune your scoring for just one example. Try many examples. You won't be able to get it perfect; the idea is to get close to your subjective judgments as frequently as possible.
3. The idea here is to find a lengthNorm() that doesn't override coord, but still provides the tie-breaking you are looking for (doc2 ahead of doc3).

Chuck

-----Original Message-----
From: Gururaja H [mailto:[EMAIL PROTECTED]]
Sent: Sunday, December 19, 2004 10:10 PM
To: Lucene Users List
Subject: RE: Relevance and ranking ...

Chuck Williams,

Thanks for the reply. Source code and output are below. Please give me your inputs.

The default document order I am getting is: Doc#2, Doc#4, Doc#3, Doc#1.
The document order needed is: Doc#4, Doc#2, Doc#3, Doc#1.

Let me know if you need more information.

NOTE: Using a Lucene Query object, not BooleanQuery.
Here is the source code: Searcher searcher = new IndexSearcher(index); Analyzer analyzer = new StandardAnalyzer(); BufferedReader in = new BufferedReader(new InputStreamReader(System.in)); System.out.print("Query: "); String line = in.readLine(); Query query = QueryParser.parse(line, "contents", analyzer); System.out.println("Searching for: " + query.toString("contents")); Hits hits = searcher.search(query); System.out.println(hits.length() + " total matching documents"); for (int i = start; i < hits.length(); i++) { Document doc = hits.doc(i); System.out.print("Score is: " + hits.score(i)); // Use whatever your fields are here: System.out.print(" title:"); System.out.print(doc.get("title")); System.out.print(" description:"); System.out.println(doc.get("description")); // End of fields System.out.println(searcher.explain(query, hits.id(i))); //System.out.println("Score of the document is: " + hits.score(i)); String path = doc.get("path"); if (path != null) { System.out.println(i + ". " + path); System.out.println("--"); } } --- Here is the output from the program: Query: ibm risc tape drive manual Searching for: ibm risc tape drive manual 4 total matching documents Score is: 0.16266039 title:null description:null 0.16266039 = product of: 0.20332548 = sum of: 0.03826245 = weight(contents:ibm in 1), product of: 0.31521872 = queryWeight(contents:ibm), product of: 0.7768564 = idf(docFreq=4) 0.40576187 = queryNorm 0.121383816 = fieldWeight(contents:ibm in 1), product of: 1.0 = tf(termFreq(contents:ibm)=1) 0.7768564 = idf(docFreq=4) 0.15625 = fieldNorm(field=contents, doc=1) 0.06340029 = weight(contents:risc in 1), product of: 0.40576187 = queryWeight(contents:risc), product of: 1.0 = idf(docFreq=3) 0.40576187 = queryNorm 0.15625 = fieldWeight(contents:risc in 1), product of: 1.0 = tf(termFreq(contents:risc)=1) 1.0 = idf(docFreq=3) 0.15625 = fieldNorm(field=contents, doc=1) 0.06340029 = weight(contents:tape in 1), product of: 0.40576187 = queryWeight(contents:tape), product of: 1.0 = idf(docFreq=3)
0.40576187 = queryNorm 0.15625 = fieldWeight(contents:tape in 1), product of: 1.0 = tf(termFreq(contents:tape)=1) 1.0 = idf(docFreq=3) 0.15625 = fieldNorm(field=contents, doc=1) 0.03826245 = weight(contents:drive in 1), product of: 0.31521872 = queryWeight(contents:drive), product of: 0.7768564 = idf(docFreq=4) 0.40576187 = queryNorm 0.121383816 = fieldWeight(contents:drive in 1), product of: 1.0 = tf(termFreq(contents:drive)=1) 0.7768564
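To see why Chuck's inverse-log suggestion acts as a tie-breaker rather than a dominant factor, compare how the norms of a 30-term and a 300-term document differ under the default 1/sqrt(numTerms) versus the flattened 1/log10(1000+numTerms). This is self-contained arithmetic mirroring those formulas, not the Lucene Similarity API itself:

```java
public class LengthNormCompare {
    // DefaultSimilarity's formula: short documents get a much larger norm.
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // The flattened variant suggested in the thread.
    static float flatLengthNorm(int numTerms) {
        return (float) (1.0 / Math.log10(1000 + numTerms));
    }

    public static void main(String[] args) {
        // Default: the 30-term doc's norm is sqrt(10) ~ 3.16x that of the 300-term doc,
        // enough to swamp the coord() difference between doc2 and doc4.
        System.out.println(defaultLengthNorm(30) / defaultLengthNorm(300));
        // Flattened: the ratio shrinks to roughly 1.03 -- a tie-breaker, not a dominator.
        System.out.println(flatLengthNorm(30) / flatLengthNorm(300));
    }
}
```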
RE: determination of matching hits
This is not the official recommendation, but I'd suggest you at least consider: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 If you're not using Java 1.5 and you decide you want to use it, you'd need to take out those dependencies. If you improve it, please share. Chuck -Original Message- From: Christiaan Fluit [mailto:[EMAIL PROTECTED] Sent: Monday, December 20, 2004 2:51 PM To: Lucene Users List Subject: Re: determination of matching hits ok, I feel a bit stupid now ;) Turns out this issue has been discussed a while ago on both mailing lists and I even participated in one of them... shame on me. The problem is indeed in how MFQP parses my query: the query A -B becomes: (text:A -text:B) (title:A -title:B) (path:A -path:B) (summary:A -summary:B) (agent:A -agent:B) whereas I intuitively expected it to be evaluated as A in any field and not B in any field. When I use a normal QueryParser and let it use a single field only, everything works as expected. Browsing the list archives I see that there were some efforts from different people in solving this issue, but I'm a bit confused about the final outcome. Was this solved in the MFQP in 1.4.3? If not, what alternative implementation of MFQP can I currently use best? Kind regards, Chris -- Erik Hatcher wrote: Christiaan, Please simplify your situation. Use a plain TermQuery for B and see what is returned. Then use a simple BooleanQuery for A -B. I suspect MultiFieldQueryParser is the culprit. What does the toString of the generated Query return? MFQP is known to be trouble, and an overhaul to it has been contributed recently.
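One workaround for the MFQP behavior above is to expand the query yourself so required and prohibited terms each span all fields, i.e. produce +(text:A title:A ...) -(text:B title:B ...) instead of per-field pairs. A hedged sketch that just builds the corrected query string for QueryParser; the helper name and field list are hypothetical:

```java
public class CrossFieldQuery {
    // Builds: +(f1:A f2:A ...) -(f1:B f2:B ...)
    // so the prohibited term is excluded if it appears in ANY field.
    static String acrossFields(String[] fields, String required, String prohibited) {
        StringBuilder req = new StringBuilder();
        StringBuilder pro = new StringBuilder();
        for (String f : fields) {
            if (req.length() > 0) {
                req.append(' ');
                pro.append(' ');
            }
            req.append(f).append(':').append(required);
            pro.append(f).append(':').append(prohibited);
        }
        return "+(" + req + ") -(" + pro + ")";
    }

    public static void main(String[] args) {
        String[] fields = {"text", "title"};
        // prints: +(text:A title:A) -(text:B title:B)
        System.out.println(acrossFields(fields, "A", "B"));
    }
}
```

The same structure can of course be built directly as a BooleanQuery of two nested BooleanQuery's (one required, one prohibited) rather than going back through the parser.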
RE: Relevance and ranking ...
The coord is the fraction of clauses matched in a BooleanQuery, so with your example of a 5-word BooleanQuery, the coord factors should be .4, .8, .8, 1.0 respectively for doc1, doc2, doc3 and doc4. One big issue you've got here is lengthNorm. Doc2 is 1/10 the size of doc4, so its lengthNorm is over 3x larger (sqrt(10)). This more than makes up for the difference in coord. In your original post you indicated a desire for a linear lengthNorm, which would actually make this problem much worse. You probably need to tone down the lengthNorm instead (I turn mine off entirely, at least so far, by fixing it at 1.0; this is not good in general, but got me past similar problems until I can find a good formula). You might try an inverse-log lengthNorm with a high base (like the formula for idf I posted earlier). The other thing that can bite you is the tf and idf computations. E.g., if manual is a more common term than the others, this could cause the tf*idf scores on doc2 to more than compensate for the difference in coord, even if you set lengthNorm to be 1.0. What is happening will be apparent from the explanations. If you print these out and post them, I'd be happy to suggest specific formulas. Just use code like this: IndexSearcher searcher = new IndexSearcher(directory); System.out.println(query); Hits hits = searcher.search(query); for (int i = 0; i < hits.length(); i++) { Document doc = hits.doc(i); System.out.print(hits.score(i)); // Use whatever your fields are here: System.out.print(" title:"); System.out.print(doc.get("title")); System.out.print(" description:"); System.out.println(doc.get("description")); // End of fields System.out.println(searcher.explain(query, hits.id(i))); System.out.println("--"); } Chuck -Original Message- From: Gururaja H [mailto:[EMAIL PROTECTED] Sent: Saturday, December 18, 2004 4:56 AM To: Lucene Users List Subject: Re: Relevance and ranking ... Hi Erik, Created my own subclass of Similarity.
When I printed the values for the coord() factor I am getting the same value for all 4 documents, so the value is not getting boosted. I want to do this as I want the document that has, e.g., all three terms in a three-word query ranked over those that contain just two of the words. Please let me know how I go about doing this, and please explain the coordination factor. The default order of documents that I get for my example given in this thread is as follows: Doc#2 Doc#4 Doc#3 Doc#1 Any inputs would be helpful. Thanks, Gururaja Erik Hatcher [EMAIL PROTECTED] wrote: On Dec 17, 2004, at 6:09 AM, Gururaja H wrote: Thanks for the reply. Is there any sample code which tells me how to change the coord() factor, overlapping, length normalization etc.? If there is any, please provide it. Have a look at Lucene's DefaultSimilarity code itself. Use that as a starting point - in fact you should subclass it and only override the one or two methods you want to tweak. There are probably some other examples in Lucene's test cases, or that have been posted to the list, but I don't have handy pointers to them. Erik Thanks, Gururaja Erik Hatcher wrote: The coord() factor of Similarity is what controls a multiplier factor for overlapping query terms in a document. The DefaultSimilarity already contains factors that allow documents with overlapping terms to get boosted. Is this not working for you? You may also need to adjust length normalization factors. Check the javadocs on Similarity for details on implementing your own formulas. Also become familiar with IndexSearcher.explain() and the Explanation so that you can see how adjusting things affects the details. Erik On Dec 17, 2004, at 3:42 AM, Gururaja H wrote: Hi, How to implement the following? Please provide inputs. For example, if the search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 matching documents with the following attributes, then the order should be as described below.
Doc#1: contains terms (ibm, drive) and has a total of 100 terms in the document. Doc#2: contains terms (ibm, risc, tape, drive) and has a total of 30 terms in the document. Doc#3: contains terms (ibm, risc, tape, drive) and has a total of 100 terms in the document. Doc#4: contains terms (ibm, risc, tape, drive, manual) and has a total of 300 terms in the document. The search results should include all four documents since each has one or more of the search terms; however, the order should be returned
RE: Relevance and ranking ...
Another issue will likely be the tf() and idf() computations. I have a similar desired relevance ranking and was not getting what I wanted due to the idf() term dominating the score. Lucene squares the contribution of this term, which is not considered best practice in IR. To address these issues, I increased the base of the log for both tf() and idf() (tones them down) and took a final square root on idf(). FYI, here are the definitions I'm using for these methods -- similar definitions should give you the ordering you want. You might want to adjust lengthNorm if you really want it to be linear (square root by default). You should not have to touch coord(). public float tf(float freq) { return 1.0f + (float)Math.log10(freq); } public float idf(int docFreq, int numDocs) { return (float)Math.sqrt(1.0 + Math.log10(numDocs/(double)(docFreq+1))); } Chuck -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, December 17, 2004 4:06 AM To: Lucene Users List Subject: Re: Relevance and ranking ... On Dec 17, 2004, at 6:09 AM, Gururaja H wrote: Thanks for the reply. Is there any sample code which tells me how to change the coord() factor, overlapping, length normalization etc.? If there is any, please provide it. Have a look at Lucene's DefaultSimilarity code itself. Use that as a starting point - in fact you should subclass it and only override the one or two methods you want to tweak. There are probably some other examples in Lucene's test cases, or that have been posted to the list, but I don't have handy pointers to them. Erik Thanks, Gururaja Erik Hatcher [EMAIL PROTECTED] wrote: The coord() factor of Similarity is what controls a multiplier factor for overlapping query terms in a document. The DefaultSimilarity already contains factors that allow documents with overlapping terms to get boosted. Is this not working for you? You may also need to adjust length normalization factors.
Check the javadocs on Similarity for details on implementing your own formulas. Also become familiar with IndexSearcher.explain() and the Explanation so that you can see how adjusting things affects the details. Erik On Dec 17, 2004, at 3:42 AM, Gururaja H wrote: Hi, How to implement the following? Please provide inputs. For example, if the search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 matching documents with the following attributes, then the order should be as described below. Doc#1: contains terms (ibm, drive) and has a total of 100 terms in the document. Doc#2: contains terms (ibm, risc, tape, drive) and has a total of 30 terms in the document. Doc#3: contains terms (ibm, risc, tape, drive) and has a total of 100 terms in the document. Doc#4: contains terms (ibm, risc, tape, drive, manual) and has a total of 300 terms in the document. The search results should include all four documents since each has one or more of the search terms; however, the order should be returned as: Doc#4 Doc#2 Doc#3 Doc#1 Doc#4 should be first, since of the 5 search terms, it contains all 5. Doc#2 should be second, since it has 4 of the 5 search terms and, relative to the number of terms in the document, its ratio is higher than Doc#3 (4/30). Doc#3 has 4 of the 5 terms, but its ratio is 4/100. Doc#1 is last since it only has 2 of the 5 terms. Thanks, Gururaja
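To see how flat Chuck's posted tf() and idf() definitions really are, here are the same two bodies evaluated standalone. This is plain arithmetic with no Lucene dependency; the wrapper class name is invented for the example:

```java
public class FlatScoring {
    // tf from the message above: base-10 log keeps repeated occurrences from dominating.
    static float tf(float freq) {
        return 1.0f + (float) Math.log10(freq);
    }

    // idf from the message above: base-10 log plus a final square root,
    // so the idf*idf product in the score ends up roughly linear in idf.
    static float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(1.0 + Math.log10(numDocs / (double) (docFreq + 1)));
    }

    public static void main(String[] args) {
        System.out.println(tf(1));        // 1.0 -- a single occurrence contributes its base weight
        System.out.println(tf(100));      // 3.0 -- 100x the frequency only triples the weight
        System.out.println(idf(9, 1000)); // sqrt(1 + log10(100)) ~ 1.73
    }
}
```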
RE: Indexing with Lucene 1.4.3
That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work. Chuck -Original Message- From: Hetan Shah [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16, 2004 11:00 AM To: Lucene Users List Subject: Indexing with Lucene 1.4.3 Hello, I have been trying to index around 6000 documents using IndexHTML from 1.4.3, and at the end of indexing in my index directory I only have 3 files: segments, deletable and _5en.cfs. Can someone tell me what is going on and where are the actual index files? How can I resolve this issue? Thanks. -H
RE: NUMERIC RANGE BOOLEAN
Karthik, RangeQuery expands into a BooleanQuery containing all of the terms in the index that fall within the range. By default, BooleanQuery's can have at most 1,024 terms. So, if your index has more than 1,024 different prices that fall within your range then you will hit this exception. What matters is distinct prices, not multiple items. E.g., it's ok to have 10,000 items at $5 -- that's just one price. But more than 1,024 distinct prices is a problem. You can fix this in at least a couple different ways. 1. Increase the maximum number of clauses allowed in a BooleanQuery (see BooleanQuery.maxClauseCount). Note that this comes at a cost in performance. 2. Restructure your indexed prices and range query to reduce the number of clauses. E.g., index dollars and cents as two different fields. Then, for a range like $1.33 to $5.27, construct an or of 3 queries: a. $1 and [33 to 99 cents] b. [$2 to $4] c. $5 and [0 to 27 cents] I don't know about RangeFilter, but look at QueryFilter. You can use it with a RangeQuery to implement a range filter. However, I think you'll hit the same issue, so Erik may be referring to a new mechanism that is not in 1.4.3. Chuck -Original Message- From: Karthik N S [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16, 2004 9:38 PM To: Lucene Users List Subject: RE: NUMERIC RANGE BOOLEAN Hi Erik Apologies.. Sometimes I find it hard to understand the answers you reply. 1) I looked at the Wiki and similarly padded '0' [ Total Length = 8 ] at the time of indexing, so before the index process the values will be $ 10.25 , $ 0.50 , $ 15.50.
After padding and indexing, finally [ used Luke to monitor ] the values were 0010.25, 0000.50, 0015.50. 2) I did not find the RangeFilter API in Lucene 1.4.3 [ is it recently added? If so, how do I use it -- some code snippets please ] with regards Karthik -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16, 2004 6:55 PM To: Lucene Users List Subject: Re: NUMERIC RANGE BOOLEAN On Dec 16, 2004, at 7:17 AM, Karthik N S wrote: We have to get all the hits in the range, so 0.99 cents IS ALWAYS 0.99 cents, on which we do the price comparison from the consumer point of view. I hope I have answered your question. No, in fact, you have not. If you want to continue to receive my help here, you need to provide *details*. You pose often ambiguous and hard-to-decipher questions. Please help us help you by answering the questions we ask precisely. What are the values (exact string values) in that field? Please also read the wiki page on indexing numeric values. Look at using the new RangeFilter rather than a RangeQuery due to the noted issues with doing a RangeQuery. Erik With regards Karthik -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, December 16, 2004 5:24 PM To: Lucene Users List Subject: Re: NUMERIC RANGE BOOLEAN On Dec 16, 2004, at 5:03 AM, Morus Walter wrote: Erik Hatcher writes: TooManyClauses exception occurs when a query such as a RangeQuery expands to more than 1024 terms. I don't see how this could be the case in the query you provided - are you certain that is the query that generated the error? Why not: the terms might be 0003 0003.1 0003.11 ... So the question is, what do his terms look like... Ah, good point! So, Karthik - what are the values of those terms? Pragmatically, do you really need to do a range involving the cents of a price?
Erik
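Chuck's option 2 can be mechanized: split the range into a low-dollar clause, a whole-dollar middle clause, and a high-dollar clause, each over separate dollars/cents fields. A sketch that emits the three clauses as query strings; the field names and two-digit cent padding are assumptions for illustration, not a Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class PriceRange {
    // Decomposes [minDollars.minCents, maxDollars.maxCents] into three clauses
    // over separate "dollars" and "cents" fields, so the term expansion is tiny
    // regardless of how many distinct prices the index contains.
    static List<String> rangeClauses(int minDollars, int minCents, int maxDollars, int maxCents) {
        List<String> clauses = new ArrayList<>();
        // a. the partial low dollar: exact dollar, cents from minCents up
        clauses.add(String.format("+dollars:%d +cents:[%02d TO 99]", minDollars, minCents));
        // b. the whole dollars strictly between the endpoints
        if (maxDollars - minDollars > 1) {
            clauses.add(String.format("dollars:[%d TO %d]", minDollars + 1, maxDollars - 1));
        }
        // c. the partial high dollar: exact dollar, cents up to maxCents
        clauses.add(String.format("+dollars:%d +cents:[00 TO %02d]", maxDollars, maxCents));
        return clauses;
    }

    public static void main(String[] args) {
        // $1.33 to $5.27 -> three clauses instead of up to thousands of price terms.
        for (String c : rangeClauses(1, 33, 5, 27)) {
            System.out.println(c);
        }
    }
}
```

Note the middle clause for $1.33 to $5.27 is dollars:[2 TO 4], since $5 is covered by clause c.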
RE: NUMERIC RANGE BOOLEAN
Errata: b. [$2 to $4] Chuck
RE: A question about scoring function in Lucene
I'll try to address all the comments here. The normalization I proposed a while back on lucene-dev is specified. Its properties can be analyzed, so there is no reason to guess about them. Re. Hoss's example and analysis, yes, I believe it can be demonstrated that the proposed normalization would make certain absolute statements like x and y meaningful. However, it is not a panacea -- there would be some limitations in these statements. To see what could be said meaningfully, it is necessary to recall a couple detailed aspects of the proposal: 1. The normalization would not change the ranking order or the ratios among scores in a single result set from what they are now. Only two things change: the query normalization constant, and the ad hoc final normalization in Hits is eliminated because the scores are intrinsically between 0 and 1. Another way to look at this is that the sole purpose of the normalization is to set the score of the highest-scoring result. Once this score is set, all the other scores are determined since the ratios of their scores to that of the top-scoring result do not change from today. Put simply, Hoss's explanation is correct. 2. There are multiple ways to normalize and achieve property 1. One simple approach is to set the top score based on the boost-weighted percentage of query terms it matches (assuming, for simplicity, the query is an OR-type BooleanQuery). So if all boosts are the same, the top score is the percentage of query terms matched. If there are boosts, then these cause the terms to have a corresponding relative importance in the determination of this percentage. More complex normalization schemes would go further and allow the tf's and/or idf's to play a role in the determination of the top score -- I didn't specify details here and am not sure how good a thing that would be to do. So, for now, let's just consider the properties of the simple boost-weighted-query-term percentage normalization. 
Hoss's example could be interpreted as single-term phrases "Doug Cutting" and "Chris Hostetter", or as two-term BooleanQuery's. Considering both of these cases illustrates the absolute-statement properties and limitations of the proposed normalization. If single-term PhraseQuery's, then the top score will always be 1.0 assuming the phrase matches (while the other results have arbitrary fractional scores based on the tf*idf ratios as today). If the queries are BooleanQuery's with no boosts, then the top score would be 1.0 or 0.5 depending on whether one or two terms were matched. This is meaningful. In Lucene today, the top score is not meaningful. It will always be 1.0 if the highest intrinsic score is >= 1.0. I believe this could happen, for example, in a two-term BooleanQuery that matches only one term (if the tf on the matched document for that term is high enough). So, to be concrete, a score of 1.0 with the proposed normalization scheme would mean that all query terms are matched, while today a score of 1.0 doesn't really tell you anything. Certain absolute statements can therefore be made with the new scheme. This makes the absolute-threshold monitored search application possible, along with the segregating and filtering applications I've previously mentioned (call out good results and filter out bad results by using absolute thresholds). These analyses are simplified by using only BooleanQuery's, but I believe the properties carry over generally. Doug also asked about research results. I don't know of published research on this topic, but I can again repeat an experience from InQuira. We found that end users benefited from a search experience where good results were called out and bad results were downplayed or filtered out. And we managed to achieve this with absolute thresholding through careful normalization (of a much more complex scoring mechanism).
To get a better intuitive feel for this, think about how you react to a search where all the results suck, but there is no visual indication of this that is any different from a search that returns great results. Otis raised the patch I submitted for MultiSearcher. This addresses a related problem, in that the current MultiSearcher does not rank results equivalently to a single unified index -- specifically it fails Daniel Naber's test case. However, this is just a simple bug whose fix doesn't require the new normalization. I submitted a patch to fix that bug, along with a caveat that I'm not sure the patch is complete, or even consistent with the intentions of the author of this mechanism. I'm glad to see this topic is generating some interest, and apologize if anything I've said comes across as overly abrasive. I use and really like Lucene. I put a lot of focus on creating a great experience for the end user, and so am perhaps more concerned about quality of results and certain UI aspects than most other users. Chuck -Original Message- From: Doug Cutting [mailto:[EMAIL
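A toy rendering of the simple scheme described in this thread: set the top score to the (boost-weighted) fraction of query terms the best document matches, and scale the rest so their ratios to the top score are preserved. All names here are hypothetical; this sketches the proposal, not shipping Lucene code:

```java
public class ProposedNorm {
    // rawScores: Lucene's unnormalized scores, highest first.
    // topCoord:  fraction of query terms matched by the top-scoring document.
    static float[] normalize(float[] rawScores, float topCoord) {
        float[] out = new float[rawScores.length];
        float top = rawScores[0];
        for (int i = 0; i < rawScores.length; i++) {
            // Ratios to the top score are unchanged; only the top's absolute value is set,
            // so the ranking order is exactly what it is today.
            out[i] = topCoord * (rawScores[i] / top);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] raw = {2.4f, 1.2f, 0.6f};
        // Top doc matched only 1 of 2 query terms -> top score 0.5, a meaningful signal
        // that the whole result set is weak; others scale to 0.25 and 0.125.
        float[] norm = normalize(raw, 0.5f);
        System.out.println(norm[0] + " " + norm[1] + " " + norm[2]);
    }
}
```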
RE: A question about scoring function in Lucene
Nhan, You are correct that dropping the document norm does cause Lucene's scoring model to deviate from the pure vector space model. However, including norm_d would cause other problems -- e.g., with short queries, as are typical in reality, the resulting scores with norm_d would all be extremely small. You are also correct that since norm_q is invariant, it does not affect relevance ranking. Norm_q is simply part of the normalization of final scores. There are many different formulas for scoring and relevance ranking in IR. All of these have some intuitive justification, but in the end can only be evaluated empirically. There is no correct formula. I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query). Lucene does not have this property. Its formula produces scores of arbitrary magnitude depending on the query. The results cannot be compared meaningfully across queries; i.e., 0.8 means nothing intrinsically. To keep final scores between 0 and 1, Lucene introduces an ad hoc query-dependent final normalization in Hits: viz., it divides all scores by the highest score if the highest score happens to be greater than 1. This makes it impossible for an application to properly inform its users about the quality of the results, to cut off bad results, etc. Applications may do that, but in fact what they are doing is random, not what they think they are doing. I've proposed a fix for this -- there was a long thread on Lucene-dev. 
It is possible to revise Lucene's scoring to keep its efficiency, keep its current per-query relevance ranking, and yet intrinsically normalize its scores so that they are meaningful across queries. I posted a fairly detailed spec of how to do this in the Lucene-dev thread. I'm hoping to have time to build it and submit it as a proposed update to Lucene, but it is a large effort that would involve changing just about every scoring class in Lucene. I'm not sure it would be incorporated even if I did it, as that would take considerable work from a developer. There doesn't seem to be much concern about these various scoring and relevancy ranking issues among the general Lucene community. Chuck -Original Message- From: Nhan Nguyen Dang [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 15, 2004 1:18 AM To: Lucene Users List Subject: RE: A question about scoring function in Lucene Thanks for your answer. In the Lucene scoring function, they use only norm_q, but for one query, norm_q is the same for all documents. So norm_q actually does not affect the score. But norm_d is different: each document has a different norm_d; it affects the score of document d for query q. If you drop it, the score information is not correct anymore, or it is not the vector space model anymore. Could you explain it a little bit? I think that it's expensive to compute in incremental indexing because when one document is added, the idf of each term changes. But dropping it is not a good choice. What is the role of norm_d_t? Nhan. --- Chuck Williams [EMAIL PROTECTED] wrote: Nhan, Re. your two differences: 1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter if the sum is over just the numerator or over the entire fraction, the result is the same. 2 is a difference.
Lucene uses Norm_q instead of Norm_d because Norm_d is too expensive to compute, especially in the presence of incremental indexing. E.g., adding or deleting any document changes the idf's, so if Norm_d were used it would have to be recomputed for ALL documents. This is not feasible. Another point you did not mention is that the idf term is squared (in both of your formulas). Salton, the originator of the vector space model, dropped one idf factor from his formula as it improved results empirically. More recent theoretical justifications of tf*idf provide intuitive explanations of why idf should only be included linearly. tf is best thought of as the real vector entry, while idf is a weighting term on the components of the inner product. E.g., see the excellent paper by Robertson, Understanding inverse document frequency: on theoretical arguments for IDF, available here: http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval. It's easy to correct for idf^2
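To make the idf^2 point concrete: idf enters the score once through the query weight and once through the field weight, so their product carries idf squared; a custom idf() that returns the square root makes the product linear in the original idf. Plain arithmetic illustrating the effect (not the Lucene Similarity class itself; the classic default idf formula is assumed here):

```java
public class IdfSquared {
    // Classic Lucene default idf: ln(numDocs / (docFreq + 1)) + 1
    static double defaultIdf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        double idf = defaultIdf(9, 10000);
        // idf appears in both queryWeight and fieldWeight, so the score carries idf * idf:
        double squared = idf * idf;
        // With a custom Similarity returning sqrt(idf), the product collapses back to idf:
        double corrected = Math.sqrt(idf) * Math.sqrt(idf);
        System.out.println(squared);   // the default, quadratic contribution
        System.out.println(corrected); // the corrected, linear contribution
    }
}
```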
RE: A question about scoring function in Lucene
Nhan, Re. your two differences: 1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter if the sum is over just the numerator or over the entire fraction, the result is the same. 2 is a difference. Lucene uses Norm_q instead of Norm_d because Norm_d is too expensive to compute, especially in the presence of incremental indexing. E.g., adding or deleting any document changes the idf's, so if Norm_d were used it would have to be recomputed for ALL documents. This is not feasible. Another point you did not mention is that the idf term is squared (in both of your formulas). Salton, the originator of the vector space model, dropped one idf factor from his formula as it improved results empirically. More recent theoretical justifications of tf*idf provide intuitive explanations of why idf should only be included linearly. tf is best thought of as the real vector entry, while idf is a weighting term on the components of the inner product. E.g., see the excellent paper by Robertson, Understanding inverse document frequency: on theoretical arguments for IDF, available here: http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval. It's easy to correct for idf^2 by using a custom Similarity that takes a final square root. Chuck -Original Message- From: Vikas Gupta [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 9:32 PM To: Lucene Users List Subject: Re: A question about scoring function in Lucene Lucene uses the vector space model.
To understand that: -Read section 2.1 of the Space Optimizations for Total Ranking paper (linked here: http://lucene.sourceforge.net/publications.html) -Read sections 6 to 6.4 of http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf -Read section 1 of http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps Vikas On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote: Hi all, Lucene scores a document based on the correlation between the query q and document d (this is the raw function; I don't pay attention to the boost_t, coord_q_d factors): score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) (*) Could anybody explain it in detail? Or are there any papers or documents about this function? Because: I have also read the book Modern Information Retrieval, authors: Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley (hope you have read it too). In page 27, they also suggest a scoring function for the vector model based on the correlation between query q and document d as follows (I use different symbols): score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / (norm_d * norm_q) (**) where weight_t_d = tf_d * idf_t, weight_t_q = tf_q * idf_t, norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) ), norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) ). Substituting into (**): score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t ) / (norm_d * norm_q) (***) The two functions, (*) and (***), have 2 differences: 1. In (***), the sum_t is just for the numerator, but in (*), the sum_t is for everything. So, with norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is calculated twice. Is this right? Please explain. 2. There is no factor defining the norm of the document, norm_d, in function (*). Can you explain this? What is the role of the factor norm_d_t? One more question: could anybody give me documents or papers that explain this function in detail, so when I apply Lucene in my system I can adapt the documents and fields so that I still receive correct scoring information from Lucene.
Best regards, thanks everybody, = Đặng Nhân - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
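The idf^2 correction Chuck describes can be sketched as a custom Similarity. This is a sketch against the Lucene 1.4-era API (DefaultSimilarity and the idf(int, int) override point); the class name is hypothetical:

```java
import org.apache.lucene.search.DefaultSimilarity;

// Hypothetical Similarity that takes a final square root of idf, so the
// squared idf contribution in Lucene's score is flattened to roughly linear,
// as Salton's original formulation and Robertson's analysis suggest.
public class SqrtIdfSimilarity extends DefaultSimilarity {
    public float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
}
```

idf is computed at query time, so installing this with Searcher.setSimilarity() should be enough; no reindexing is required for an idf-only change.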
RE: A simple Query Language
You could support only terms with no operators at all, which will work in most search engines (except those that require combining operators). Using just terms and phrases embedded in quotes is pretty universal. After that, you might want to add +/- required/prohibited restrictions, which many engines support. After that, I think you're getting pretty specific. Lucene supports all of these and many more. Chuck -Original Message- From: Dongling Ding [mailto:[EMAIL PROTECTED] Sent: Friday, December 10, 2004 5:08 PM To: Lucene Users List Subject: A simple Query Language Hi, I am going to implement a search service and plan to use Lucene. Is there any simple query language that is independent of any particular search engine out there? Thanks Dongling If you have received this e-mail in error, please delete it and notify the sender as soon as possible. The contents of this e-mail may be confidential and the unauthorized use, copying, or dissemination of it and any attachments to it, is prohibited. Internet communications are not secure and Hyperion does not, therefore, accept legal responsibility for the contents of this message nor for any damage caused by viruses. The views expressed here do not necessarily represent those of Hyperion. For more information about Hyperion, please visit our Web site at www.hyperion.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
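The minimal language Chuck describes (bare terms, quoted phrases, and +/- required/prohibited prefixes) maps directly onto Lucene's QueryParser. A sketch assuming the 1.4-era static parse() method; the field name and query string are made up:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SimpleQueryDemo {
    public static void main(String[] args) throws Exception {
        // Terms, a quoted phrase, and +/- prefixes: the lowest common
        // denominator shared by most search engines' query languages.
        Query q = QueryParser.parse(
            "+\"open source\" search -commercial",
            "contents",
            new StandardAnalyzer());
        // Print the parsed query to see how the operators were interpreted.
        System.out.println(q.toString("contents"));
    }
}
```

Because only this small subset of the syntax is used, the same query strings should remain portable to other engines.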
RE: Coordination value
There is an easier way. You should use a custom Similarity, which allows you to define your own coord() method. Look at DefaultSimilarity (which specializes Similarity). I'd suggest analyzing your scores first with explain() to decide what you really want to tweak. Just a guess, but your issue might be that your idf()'s are dominating the score computation. I had this problem and changed the default idf() to take a final square root, since Lucene squares that contribution (which is one of its few areas that is generally not considered best practice). I also boost the base of the logarithms on both tf and idf to weight those factors lower. Good luck, Chuck -Original Message- From: Jason Haruska [mailto:[EMAIL PROTECTED] Sent: Thursday, December 09, 2004 1:36 PM To: Lucene Users List Subject: Coordination value I would like to adjust the score lucene is returning to use the coordination component more. For example, I have a BooleanQuery containing three TermQueries. I would like to adjust the score so that documents containing all three terms appear first, followed by docs that contain only two of the terms, followed by documents that contain only one of the terms. I understand that the coordination is a component of the overall document score currently, but I'd like to make it more absolute. I was wondering if someone on the list has done something similar. I have implemented a hack that works by adding a function to the BooleanWeight class but it is very slow. I believe it is inefficient because it uses the Explanation class to get the coordination value. There must be an easier way that I'm missing. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
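The coord() override Chuck suggests could look like this. A sketch against the Lucene 1.4-era DefaultSimilarity API; the class name and the cubing are illustrative choices, not a tested recommendation:

```java
import org.apache.lucene.search.DefaultSimilarity;

// Hypothetical Similarity emphasizing coordination: the default coord() is
// the linear ratio overlap/maxOverlap; cubing it makes documents matching
// all three terms strongly outrank those matching only one or two.
public class CoordHeavySimilarity extends DefaultSimilarity {
    public float coord(int overlap, int maxOverlap) {
        float ratio = overlap / (float) maxOverlap;
        return ratio * ratio * ratio;
    }
}
```

Install it with searcher.setSimilarity(new CoordHeavySimilarity()). coord() is applied purely at search time, so unlike a lengthNorm() change, no reindexing should be needed.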
RE: Lucene Vs Ixiasoft
Lucene contains a complete set of Boolean query operators, and it uses the vector space model to determine scores for relevance ranking. It's fast. It works. Chuck -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 08, 2004 7:13 PM To: Lucene Users List; Nicolas Maisonneuve Subject: Re: Lucene Vs Ixiasoft I thought Lucene implements the Boolean model. -John On Thu, 9 Dec 2004 00:19:21 +0100, Nicolas Maisonneuve [EMAIL PROTECTED] wrote: hi, think first of the relevance of the model in these 2 search engines for XML document retrieval. Lucene is a classic fulltext search engine using the vector space model. This model is efficient for indexing unstructured documents (like plain text files) and is not made for structured documents like XML. There is an XML demo in the Lucene sandbox but it's not really very efficient, because it doesn't take advantage of the document structure in the indexing and the ranking model, so it loses semantic information and relevance. I don't know Ixiasoft; check the information to see how it indexes and ranks XML documents. nicolas On Wed, 8 Dec 2004 14:20:45 -0500, Praveen Peddi [EMAIL PROTECTED] wrote: Does anyone know about the Ixiasoft server? It's an XML repository/search engine. If anyone knows about it, do they also know how it compares to Lucene? Which is faster? Praveen ** Praveen Peddi Sr Software Engg, Context Media, Inc. email:[EMAIL PROTECTED] Tel: 401.854.3475 Fax: 401.861.3596 web: http://www.contextmedia.com ** Context Media- The Leader in Enterprise Content Integration - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Sorting in Lucene
Since it's untokenized, are you searching with the exact string stored in the field? Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 3:29 PM To: 'Lucene Users List'; 'Chris Fraschetti' Subject: RE: Sorting in Lucene I also tried searching the said field on LIMO and I don't get a match. Thanks, Ramon -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 3:20 PM To: 'Lucene Users List'; 'Chris Fraschetti' Subject: RE: Sorting in Lucene Hi, I use LIMO to look into my index. Limo tells me that the field is untokenized but is indexed. Is it possible to search on an untokenized field? Thanks, Ramon -Original Message- From: Chris Fraschetti [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 3:14 PM To: Lucene Users List Subject: Re: Sorting in Lucene I would try 'luke' to look at your index and use its search functionality to make sure it's not your code that is the problem, as well as to ensure your document is appearing in the index as you intend it. It's been a lifesaver for me. http://www.getopt.org/luke/ On Tue, 7 Dec 2004 15:02:26 -0800, Ramon Aseniero [EMAIL PROTECTED] wrote: Hi All, Any idea why a Keyword field is not searchable? On my index I have a field of type Keyword but I could not somehow search on the field. Thanks in advance. Ramon -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.289 / Virus Database: 265.4.7 - Release Date: 12/7/2004 -- ___ Chris Fraschetti e [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Sorting in Lucene
Ramon, Field.Keyword fields are definitely searchable. I use them. I think I use every combination of tokenized/untokenized, indexed/unindexed, and stored/unstored. They all work. This seems unlikely given that you tried with Luke, but do you perhaps have an analyzer applied to the query so that the query string is transformed before it is applied to the index? I'd suggest printing the query after you parse it. Queries have a good toString() method. Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 4:14 PM To: 'Lucene Users List' Subject: RE: Sorting in Lucene Hi Chuck, Yes I tried to search with the exact string stored on the index but I don't get a match. I tried the search using LIMO and LUKE. It seems like untokenized fields are not searchable. Thanks, Ramon -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 4:04 PM To: Lucene Users List Subject: RE: Sorting in Lucene Since it's untokenized, are you searching with the exact string stored in the field? Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 3:29 PM To: 'Lucene Users List'; 'Chris Fraschetti' Subject: RE: Sorting in Lucene I also tried searching the said field on LIMO and I don't get a match. Thanks, Ramon -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 3:20 PM To: 'Lucene Users List'; 'Chris Fraschetti' Subject: RE: Sorting in Lucene Hi, I use LIMO to look into my index. Limo tells me that the field is untokenized but is indexed. Is it possible to search on an untokenized field?
Thanks, Ramon -Original Message- From: Chris Fraschetti [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 07, 2004 3:14 PM To: Lucene Users List Subject: Re: Sorting in Lucene I would try 'luke' to look at your index and use its search functionality to make sure it's not your code that is the problem, as well as to ensure your document is appearing in the index as you intend it. It's been a lifesaver for me. http://www.getopt.org/luke/ On Tue, 7 Dec 2004 15:02:26 -0800, Ramon Aseniero [EMAIL PROTECTED] wrote: Hi All, Any idea why a Keyword field is not searchable? On my index I have a field of type Keyword but I could not somehow search on the field. Thanks in advance. Ramon -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.289 / Virus Database: 265.4.7 - Release Date: 12/7/2004 -- ___ Chris Fraschetti e [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
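The symptom in this thread (an untokenized Keyword field that never matches) is typically caused by searching through QueryParser, whose analyzer lowercases or splits the query text so it no longer equals the stored term. A TermQuery bypasses analysis entirely. A sketch against the 1.4-era API; the index path, field name, and value are made up:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class KeywordSearchDemo {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        // Exact, un-analyzed match against a Field.Keyword: the value must
        // be byte-for-byte identical to what was indexed, including case
        // and embedded spaces.
        Hits hits = searcher.search(
            new TermQuery(new Term("category", "Press Release")));
        System.out.println(hits.length() + " matching documents");
        searcher.close();
    }
}
```

If the query must come in through QueryParser anyway, printing the parsed query's toString(), as Chuck suggests above, will show how the analyzer transformed it.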
Recommended values for mergeFactor, minMergeDocs, maxMergeDocs
I'm wondering what values of mergeFactor, minMergeDocs and maxMergeDocs people have found to yield the best performance for different configurations. Is there a repository of this information anywhere? I've got about 30k documents and have 3 indexing scenarios: 1. Full indexing and optimize 2. Incremental indexing and optimize 3. Parallel incremental indexing without optimize Search performance is critical. For both cases 1 and 2, I'd like the fastest possible indexing time. For case 3, I'd like minimal pauses and no noticeable degradation in search performance. Based on reading the code (including the javadocs comments), I'm thinking of values along these lines: mergeFactor: 1000 during Full indexing, and during optimize (for both cases 1 and 2); 10 during incremental indexing (cases 2 and 3) minMergeDocs: 1000 during Full indexing, 10 during incremental indexing maxMergeDocs: Integer.MAX_VALUE during full indexing, 1000 during incremental indexing Do these values seem reasonable? Are there better settings before I start experimenting? Since mergeFactor is used in both addDocument() and optimize(), I'm thinking of using two different values in case 2: 10 during the incremental indexing, and then 1000 during the optimize. Is changing the value like this going to cause a problem? Thanks for any advice, Chuck
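In the 1.4-era API these knobs are public fields on IndexWriter, so the two-phase scheme proposed for case 2 can be sketched like this. The values are the ones proposed above, not validated recommendations, and the index path is made up:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuningDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            "/path/to/index", new StandardAnalyzer(), false);

        // Incremental phase: small merges keep indexing pauses short and
        // limit disruption to concurrent searches.
        writer.mergeFactor = 10;
        writer.minMergeDocs = 10;
        writer.maxMergeDocs = 1000;
        // ... addDocument() calls for the incremental batch ...

        // Optimize phase: a large mergeFactor and unbounded segment size
        // let the final merge proceed with fewer intermediate passes.
        writer.mergeFactor = 1000;
        writer.maxMergeDocs = Integer.MAX_VALUE;
        writer.optimize();
        writer.close();
    }
}
```

Changing the fields between phases on the same open writer should be safe, since they are consulted as merges are triggered rather than fixed at construction time, though that is an assumption worth verifying against the IndexWriter source.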
RE: Search multiple Fields
If you want this to be efficient in your application, I'd suggest integrating at a lower level. E.g., take a look at TermScorer.explain() to see how it determines whether or not a term matches in a field of a document. Another approach might be to specialize BooleanQuery to keep track of which clauses matched. Chuck -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, December 02, 2004 12:13 PM To: Lucene Users List Subject: Re: Search multiple Fields On Dec 2, 2004, at 11:43 AM, Eric Louvard wrote: I'm searching, for example title:world OR contents:world OR author:world Is it possible to know where (in which Field) Lucene has found 'world' in each Document, without making 3 queries? Not in a straightforward way, but you can dig through the Explanation returned from IndexSearcher.explain() to see what factors are involved in the score, which does include info on what fields/terms were matched. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: boosting challenge
Try the explain() capability to see what factors are influencing the order of your results. Probably these other factors are overwhelming your boost. I had similar problems and resolved them by tweaking these other contributions, especially idf. You can do that in a custom Similarity. Chuck -Original Message- From: Frank Morton [mailto:[EMAIL PROTECTED] Sent: Monday, November 29, 2004 12:49 PM To: Lucene Users List Subject: Re: boosting challenge Thanks for the response. Using 4.0 did not work either. Additionally, I have also tried Field.setBoost(4.0) on the name field. That didn't work either. Still perplexed... I assume people are using boosting with 1.4 successfully. On Nov 29, 2004, at 3:36 PM, Otis Gospodnetic wrote: Try 4.0 instead of 4. That may be correct syntax (don't have QueryParser source to check), because the code takes boosts as float type values. Otis --- Frank Morton [EMAIL PROTECTED] wrote: I have an index of restaurants with two fields: the name of the restaurant and a description. I would like to search for the word bob in both fields, but if it occurs in the name, it would score higher. So, if Bob Evans is the name of the restaurant, but other restaurants refer to Bob in the description, the restaurant Bob Evans would score highest, but the others would also match the query. I thought you could boost the term with a query like: name:bob^4 description:bob and it would boost the word bob if found in the name property, but this is not working for me. I get the exact same results using the above query and a simple bob query. I am using lucene-1.4-final.jar. I am using the PorterStemAnalyzer. Am I missing something? Lucene seems very capable otherwise. Thanks.
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: modifying existing index
I haven't tried it but believe this should work: IndexReader reader; void delete(long id) { reader.delete(new Term("id", Long.toString(id))); } This also has the benefit that it does binary search rather than sequential search. You will want to pad your ids with leading zeroes if you are going to do incremental indexing (both when storing them and when looking them up). Sorting is by lexicographic order, not numerical order, and incremental indexing is much faster if the ids are kept sorted (as is done in IndexHTML). Chuck -Original Message- From: Santosh [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 24, 2004 9:54 AM To: Lucene Users List Subject: Re: modifying existing index I am able to delete the index now using the following: if(indexDir.exists()) { IndexReader reader = IndexReader.open( indexDir ); uidIter = reader.terms(new Term("id", "")); while (uidIter.term() != null && uidIter.term().field() == "id") { reader.delete(uidIter.term()); uidIter.next(); } reader.close(); } where "id" is the keyword field. But here also all the documents are deleted. How can I modify my code and delete a particular document with a given id? I am creating the index in the following way: Document doc = new Document(); doc.add(Field.Text("text", text)); doc.add(Field.Keyword("id", Long.toString(id))); doc.add(Field.Keyword("title", title)); doc.add(Field.Keyword("keywords", keywords)); doc.add(Field.Keyword("type", type)); writer.addDocument(doc); - Original Message - From: Chuck Williams [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, November 24, 2004 1:06 PM Subject: RE: modifying existing index A good way to do this is to add a keyword field with whatever unique id you have for the document. Then you can delete the term containing a unique id to delete the document from the index (look at IndexReader.delete(Term)). You can look at the demo class IndexHTML to see how it does incremental indexing for an example.
Chuck -Original Message- From: Santosh [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 23, 2004 11:34 PM To: Lucene Users List Subject: Re: modifying existing index I have gone through IndexReader; I found the method delete(int docNum), but where will I get the document number from? Is this predefined? Or do we have to give a number prior to indexing? - Original Message - From: Luke Francl [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, November 24, 2004 1:26 AM Subject: Re: modifying existing index On Tue, 2004-11-23 at 13:59, Santosh wrote: I am using lucene for indexing. When I am creating the index the documents are added, but when I want to modify a single existing document and reindex again, it is taken as a new document and added one more time, so I get the same document twice in the results. To overcome this I am deleting the existing index and recreating the whole index. But is it possible to index the modified document again and overwrite the existing document without deleting and recreating? Can I do this? If so, how? You do not need to recreate the whole index. Just mark the document as deleted using the IndexReader and then add it again with the IndexWriter. Remember to close your IndexReader and IndexWriter after doing this. The deleted document will be removed the next time you optimize your index. Luke Francl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
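The zero-padding Chuck recommends, so that the lexicographic order of the id terms agrees with numeric order, can be sketched in plain Java; the fixed width of 12 digits is an arbitrary choice:

```java
public class IdPad {
    // Left-pad a non-negative id to a fixed width so that the string sort
    // order used by Lucene's term index agrees with numeric order.
    public static String pad(long id) {
        String digits = Long.toString(id);
        StringBuffer buf = new StringBuffer(12);
        for (int i = digits.length(); i < 12; i++) {
            buf.append('0');
        }
        return buf.append(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println(pad(42L)); // 000000000042
        // Unpadded, "9" sorts after "10" lexicographically; padded, the
        // string order matches the numeric order.
        System.out.println(pad(9L).compareTo(pad(10L)) < 0);
    }
}
```

The same pad() would be applied both when storing the keyword field and when constructing the Term for deletion, so the two always agree.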
RE: URGENT: Help indexing large document set
Does keyIter return the keys in sorted order? This should reduce seeks, especially if the keys are dense. Also, you should be able to call localReader.delete(term) instead of iterating over the docs (of which I presume there is only one doc since keys are unique). This won't improve performance as IndexReader.delete(Term) does exactly what your code does, but it will be cleaner. A linear slowdown with number of docs doesn't make sense, so something else must be wrong. I'm not sure what the default buffer size is (it appears it used to be 128 but is dynamic now I think). You might find the slowdown stops after a certain point, especially if you increase your batch size. Chuck -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 24, 2004 12:21 PM To: Lucene Users List Subject: Re: URGENT: Help indexing large document set Thanks Paul! Using your suggestion, I have changed the update check code to use only the indexReader: try { localReader = IndexReader.open(path); while (keyIter.hasNext()) { key = (String) keyIter.next(); term = new Term("key", key); TermDocs tDocs = localReader.termDocs(term); if (tDocs != null) { try { while (tDocs.next()) { localReader.delete(tDocs.doc()); } } finally { tDocs.close(); } } } } finally { if (localReader != null) { localReader.close(); } } Unfortunately it didn't seem to make any dramatic difference. I also see the CPU is only 30-50% busy, so I am guessing it's spending a lot of time in IO. Any way of making the CPU work harder? Is a batch size of 500 too small for 1 million documents? Currently I am seeing a linear speed degradation of 0.3 milliseconds per document. Thanks -John On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot [EMAIL PROTECTED] wrote: On Wednesday 24 November 2004 00:37, John Wang wrote: Hi: I am trying to index 1M documents, with batches of 500 documents. Each document has a unique text key, which is added as a Field.Keyword(name, value).
For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and deleting the document currently in the index with the same key: while (keyIter.hasNext()) { String objectID = (String) keyIter.next(); term = new Term("key", objectID); int count = localSearcher.docFreq(term); To speed this up a bit, make sure that the iterator gives the terms in sorted order. I'd use an index reader instead of a searcher, but that will probably not make a difference. Adding the documents can be done with multiple threads. Last time I checked that, there was a moderate speed up using three threads instead of one on a single CPU machine. Tuning the values of minMergeDocs and maxMergeDocs may also help to increase performance of adding documents. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: URGENT: Help indexing large document set
Are you sure you have a performance problem with TermInfosReader.get(Term)? It looks to me like it scans sequentially only within a small buffer window (of size SegmentTermEnum.indexInterval) and that it uses binary search otherwise. See TermInfosReader.getIndexOffset(Term). Chuck -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 23, 2004 3:38 PM To: [EMAIL PROTECTED] Subject: URGENT: Help indexing large document set Hi: I am trying to index 1M documents, with batches of 500 documents. Each document has a unique text key, which is added as a Field.Keyword(name, value). For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and deleting the document currently in the index with the same key: while (keyIter.hasNext()) { String objectID = (String) keyIter.next(); term = new Term("key", objectID); int count = localSearcher.docFreq(term); if (count != 0) { localReader.delete(term); } } Then I proceed with adding the documents. This turns out to be extremely expensive. I looked into the code and see that in TermInfosReader.get(Term term) it is doing a linear lookup for each term. So as the index grows, the above operation degrades at a linear rate. So for each commit, we are doing a docFreq for 500 documents. I also tried to create a BooleanQuery composed of 500 TermQueries and do 1 search for each batch, and the performance didn't get better. And if the batch size increases to say 50,000, creating a BooleanQuery composed of 50,000 TermQuery instances may introduce huge memory costs. Is there a better way to do this? Can TermInfosReader.get(Term term) be optimized to do a binary lookup instead of a linear walk? Of course that depends on whether the terms are stored in sorted order, are they? This is very urgent, thanks in advance for all your help.
-John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: lucene Scorers
Hi Ken, I'm glad our replies were helpful. It sounds like you looked at the code in MaxDisjunctionQuery, so you probably noticed that it also implements skipTo(). Your suggestion sounds like a good thing to do. I thought about that when writing MaxDisjunctionQuery, but didn't need the generality, and it does make the code more complex. I think Lucene needs one of these mechanisms in it, at least to solve the problems associated with the current default use of BooleanQuery for multiple field expansions. Your proposal would generalize this to solve additional cases where different accrual operators are appropriate. You could write and submit the generalization, although there are no guarantees anybody would do anything with it. I didn't get anywhere in my attempt to submit MaxDisjunctionQuery. I think there is also a serious problem in scoring with the current score normalization (it does not provide meaningfully comparable scores across different searches, which means that absolute score numbers like 0.8 have no intrinsic meaning concerning how good a result is or is not). When I finally get back to tuning search in my app, that's the next one I'll try a submission on. Chuck -Original Message- From: Ken McCracken [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 23, 2004 4:31 PM To: Lucene Users List Subject: Re: lucene Scorers Hi, Thanks for the pointers in your replies. Would it be possible to include some sort of accrual scorer interface somewhere in the Lucene Query APIs? This could be passed into a query similar to MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc., according to the implementor's discretion, to compute the overall score for a document. -Ken On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] wrote: On Friday 12 November 2004 22:56, Chuck Williams wrote: I had a similar need and wrote MaxDisjunctionQuery and MaxDisjunctionScorer.
Unfortunately these are not available as a patch but I've included the original message below that has the code (modulo line breaks added by simple text email format). This code is functional -- I use it in my app. It is optimized for its stated use, which involves a small number of clauses. You'd want to improve the incremental sorting (e.g., using the bucket technique of BooleanQuery) if you need it for large numbers of clauses. If you're interested, you can also have a look here for yet another DisjunctionScorer: http://issues.apache.org/bugzilla/show_bug.cgi?id=31785 It has the advantage that it implements skipTo() so that it can be used as a subscorer of ConjunctionScorer, i.e., it can be faster in situations like this: aa AND (bb OR cc) where bb and cc are treated by the DisjunctionScorer. When aa is a filter this can also be used to implement a filtering query. Re. Paul's suggested steps below, I did not integrate this with the query parser as I didn't need that functionality (since I'm generating the multi-field expansions for which max is a much better scoring choice than sum). Chuck Included message: -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Monday, October 11, 2004 9:55 PM To: [EMAIL PROTECTED] Subject: Contribution: better multi-field searching The files included below (MaxDisjunctionQuery.java and MaxDisjunctionScorer.java) provide a new mechanism for searching across multiple fields. The maximum indeed works well, also when the fields differ a lot in length. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: modifying existing index
A good way to do this is to add a keyword field with whatever unique id you have for the document. Then you can delete the term containing a unique id to delete the document from the index (look at IndexReader.delete(Term)). You can look at the demo class IndexHTML to see how it does incremental indexing for an example. Chuck -Original Message- From: Santosh [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 23, 2004 11:34 PM To: Lucene Users List Subject: Re: modifying existing index I have gone through IndexReader; I found the method delete(int docNum), but where will I get the document number from? Is this predefined? Or do we have to give a number prior to indexing? - Original Message - From: Luke Francl [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, November 24, 2004 1:26 AM Subject: Re: modifying existing index On Tue, 2004-11-23 at 13:59, Santosh wrote: I am using lucene for indexing. When I am creating the index the documents are added, but when I want to modify a single existing document and reindex again, it is taken as a new document and added one more time, so I get the same document twice in the results. To overcome this I am deleting the existing index and recreating the whole index. But is it possible to index the modified document again and overwrite the existing document without deleting and recreating? Can I do this? If so, how? You do not need to recreate the whole index. Just mark the document as deleted using the IndexReader and then add it again with the IndexWriter. Remember to close your IndexReader and IndexWriter after doing this. The deleted document will be removed the next time you optimize your index. Luke Francl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
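The delete-then-re-add cycle described in this thread might look like the following. A sketch against the 1.4-era API; the field name "uid", the index path parameter, and the helper signature are made up for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateDemo {
    // Replace any existing document carrying this uid, then add the new one.
    public static void update(String indexDir, String uid, String text)
            throws Exception {
        IndexReader reader = IndexReader.open(indexDir);
        reader.delete(new Term("uid", uid)); // deletes every doc with the term
        reader.close();                      // must close before writing

        IndexWriter writer = new IndexWriter(
            indexDir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Keyword("uid", uid));  // un-analyzed unique id field
        doc.add(Field.Text("contents", text));
        writer.addDocument(doc);
        writer.close();
    }
}
```

As Luke notes above, the deleted document is only marked as deleted; the space is reclaimed at the next optimize().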
RE: fetching similar wordlist as given word
Lucene does support stemming, but that is not what your example requires (stemming equates roaming, roam, roamed, etc.). For stemming, look at PorterStemFilter or, better, the Snowball stemmers in the sandbox. For your similar word list, I think you are looking for the class FuzzyTermEnum. This should give you the terms you need, although perhaps only those with a common prefix of a specified length. Otherwise, you could develop your own algorithm to look for similar terms in the index. Chuck -Original Message- From: Santosh [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 23, 2004 11:15 PM To: Lucene Users List Subject: fetching similar wordlist as given word Can Lucene do stemming? If I search for roam, I know it can return results for foam using a fuzzy query. But my requirement is: if I search for roam, can I get a list of similar words as output, so that I can show the end user a column --- do you mean foam? How can I get a similar word list from the given content? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
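FuzzyTermEnum ranks candidate terms by an edit-distance-based similarity. As a self-contained illustration of that idea (this is not Lucene source; the class and method names are hypothetical), the sketch below computes the Levenshtein distance that underlies such a similarity and collects near matches from a list of index terms for a "do you mean ...?" suggestion:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative "similar word list" built on Levenshtein edit distance,
// the kind of measure FuzzyTermEnum uses to score candidate terms.
public class SimilarWords {

    // Classic two-row dynamic-programming edit distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Terms within maxDist edits of the query word, excluding the word itself.
    static List<String> similar(String word, List<String> terms, int maxDist) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (!t.equals(word) && distance(word, t) <= maxDist) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(similar("roam", List.of("foam", "roams", "lucene"), 1));
    }
}
```

In a real application the candidate terms would come from enumerating the index's term dictionary rather than a hard-coded list.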
RE: Question about multi-searching [re-post]
If you are going to compare scores across multiple indices, I'd suggest considering one of the patches here: http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 Chuck -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Monday, November 22, 2004 6:30 AM To: Lucene Users List Subject: Re: Question about multi-searching [re-post] On Nov 22, 2004, at 9:18 AM, Cocula Remi wrote: (First of all: what is the plural of index in English; indexes or indices?) We used indexes in Lucene in Action. It's a bit ambiguous in English, but indexes sounds less formal and is acceptable. For that, I parse a new query using QueryParser or MultiFieldQueryParser. Then I search my indexes using the MultiSearcher class. OK, but the problem comes when different analyzers are used for each index. QueryParser requires an analyzer to parse the query, but a query parsed with one analyzer is not suitable for searching an index that uses another analyzer. Does anyone know a trick to cope with this problem? Nothing built into Lucene solves this problem specifically. You'll have to come up with your own MultiSearcher-like facility that can apply different queries to different indexes and merge the results back together. This will be awkward when it comes to scoring though, since each index is using a different query. Eventually I could run a different query on each index to obtain several Hits objects. Then I could write some collector that collects Hits in the order of highest scores. I wonder if this could work and if it would be as efficient as the MultiSearcher. In this situation does it make sense to compare the scores of two different Hits? No, it won't make good sense to compare the scores between the queries, but I suspect your queries are pretty close to one another if all that varies is the analyzer. It still will be an awkward comparison though, but maybe good enough for your needs?
Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
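The collector idea discussed above can be sketched without any Lucene dependency. Assuming each index has been searched with its own analyzer-specific query and each result list is already sorted by descending score, the lists can be merged into one descending-score list; since the scores come from different queries they are only loosely comparable, so the merged order is approximate. Hit is a hypothetical stand-in for a (document id, score) pair:

```java
import java.util.ArrayList;
import java.util.List;

// Merge two per-index hit lists (each sorted by descending score) into one
// list ordered by descending score. A standard two-pointer merge.
public class MergeHits {
    static class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    static List<Hit> merge(List<Hit> a, List<Hit> b) {
        List<Hit> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            boolean takeA = j >= b.size()
                || (i < a.size() && a.get(i).score >= b.get(j).score);
            out.add(takeA ? a.get(i++) : b.get(j++));
        }
        return out;
    }
}
```

Generalizing to n indexes would use a priority queue over the heads of the n lists, which is essentially what a multi-searcher's result merging does.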
RE: Need help with filtering
It sounds like you need to pad your numbers with leading zeroes, i.e. use the same type of encoding as is required by RangeQuery's. If you query with 05 instead of 5 do you get what you expect? If all your document id's are fixed length, then string comparison will be isomorphic to integer comparison. Chuck -Original Message- From: Edwin Tang [mailto:[EMAIL PROTECTED] Sent: Monday, November 22, 2004 10:34 AM To: Lucene Users List Subject: Re: Need help with filtering Hello again, I've modified DateFilter to filter out document IDs as suggested. All seems to be running well until I tried a specific test case. All my documents have IDs in the 400,000 range. If I set my lower limit to 5, nothing comes back. After examining the code, I found the issue to be at the following line: TermEnum enumerator = reader.terms(new Term(field, start)); Is there a way to retrieve a set of documents with IDs using a Integer comparison versus a String comparison? If I set start to 0, I get everything, but that's not very efficient. Thanks in advance, Ed --- Paul Elschot [EMAIL PROTECTED] wrote: On Wednesday 17 November 2004 01:20, Edwin Tang wrote: Hello, I have been using DateFilter to limit my search results to a certain date range. I am now asked to replace this filter with one where my search results have document IDs greater than a given document ID. This document ID is assigned during indexing and is a Keyword field. I've browsed around the FAQs and archives and see that I can either use QueryFilter or BooleanQuery. I've tried both approaches to limit the document ID range, but am getting the BooleanQuery.TooManyClauses exception in both cases. I've also tried bumping max number of clauses via setMaxClauseCount(), but that number has gotten pretty big. Is there another approach to this? ... Recoding DateFilter to a DocumentIdFilter should be straightforward. The trick is to use only one document enumerator at a time for all terms. 
Document enumerators take buffer space, and that is the reason why BooleanQuery has an exception for too many clauses. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
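Chuck's padding suggestion above is the key to making string comparisons over document IDs behave like integer comparisons. A minimal sketch of the idea, with no Lucene dependency (the width of 6 is an assumption; use whatever fits your largest ID):

```java
// Zero-pad numeric ids so lexicographic (String) order matches numeric order.
public class IdPadding {
    static final int WIDTH = 6;  // assumed maximum number of digits

    static String pad(int id) {
        StringBuilder sb = new StringBuilder(Integer.toString(id));
        while (sb.length() < WIDTH) sb.insert(0, '0');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unpadded, "5" sorts after "400000"; padded, the order is numeric.
        System.out.println(pad(5) + " vs " + pad(400000));
    }
}
```

This is why a lower limit of 5 matched nothing against IDs in the 400,000 range: as a string, "5" sorts after "400000". Both the indexed field values and the query terms must be padded the same way.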
RE: Lucene - index fields design question
I do most of these same things and made these relevant design decisions: 1. Use a combination of query expansion to search across multiple fields and field concatenation to create document fields that combine separate object fields. I use multiple fields only when it is important to weight them differently. E.g., in my case the separate fields are combined into just title and body document fields for general term searching. I expand queries (with my own expander after parsing) by rewriting queries against the default field into an OR across title and body with title boosted higher than body. 2. One problem with the above concerns scoring (and this is also one of the reasons to use concatenation rather than query expansion as much as possible). Lucene's BooleanQuery use sum-based scoring for OR's that is further factored with the coord() adjustment (settable in the Similarity). This causes OR's to behave very poorly for the field-expansion case. E.g., if the query is foo bar, and you expand each term into title and body in the simplest way to produce title:foo^4 body:foo title:bar^4 body:bar, then a document with foo in title and bar in body will get the same score as one with foo in title and foo in body, clearly not desired. There are at least 3 different solutions to this problem discussed on this list. I wrote my own MaxDisjunctionQuery just to handle this case: it uses max instead of sum for this kind of OR query, and it does not use coord() (so use MaxDisjunctionQuery for the OR's of the same term or other query across multiple fields, and regular BooleanQuery to OR together the different terms or other queries). Paul Elschot wrote a more general DisjunctionQuery that can be configured to do the same thing. Doug Cutting came up with a solution that does not require a new Query class; his solution expands the query in a certain way and specializes certain existing methods. 
You should be able to find these solutions by searching the archive (e.g., search for MaxDisjunctionQuery and DisjunctionQuery and read the threads). Code is posted in one way or other. 3. RangeQuery's are the way to do your date ranges, or any other ranges. The encodings need to be lexicographic, not integer. E.g., 10 precedes 2, so pad with leading 0's (02 precedes 10). If you need negatives or floats, you need additional considerations to ensure consistency with lexicographic order (invert the order of negatives and use a sign representation such that the positive sign indicator follows the negative sign indicator; floats require nothing special so long as the integer portion is fixed length). Dates encode naturally. I add additional fields like those used to search Ranges onto the Lucene documents in addition to title and body. There are numerous messages on the list that discuss details of this, and there is a link to the web site that goes through a complete example, including showing how to specialize the query parser if you want users entering RangeQuery's in Lucene syntax (either way you have to lexicographically encode both queries and the document fields you index). If you have more specific questions or cannot find the references, please just ask. Good luck, Chuck -Original Message- From: Venkatraju [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 16, 2004 9:51 AM To: lucene-user Subject: Lucene - index fields design question Hi, I am a new user of Lucene. so please point me to documentation/archives if these issues have been covered before. 
I plan to use Lucene in an application with the following (fairly standard) requirements: - Index documents that contain a title, author, date and content - It is fairly common to search for some text across all the fields - Matches in the title field should be given more weight than matches in the content field - Provide an option to restrict search to documents within a date range Given these requirements, what is a good index design with search speed in mind? Documents will have fields title, author, date and content. Should I make title and author part of the content as well, so that a search across all fields just becomes a search in the content field? If so, how do I give more weight to matches in the title field? The other option would be to expand a simple query to include searches across all fields. Ex.: Expand abcd to title:abcd^4 OR content:abcd. Also, should the boost for the title field be applied in the query, or is it better to boost the title field during indexing (is that possible)? Which of these options will work and be more efficient? For date-range-limited search, can field values be integers? If not, encoding the date as MMDDHHMM and then using a filter or a RangeQuery - is that the way to do this? Thanks, Venkat
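Point 3 of Chuck's reply above notes that negatives need the order of magnitudes inverted and a sign indicator that sorts negatives before positives. A sketch of one such encoding, with no Lucene dependency (the fixed width of 6 digits and the '0'/'1' sign characters are assumptions of this illustration):

```java
// Lexicographic encoding for signed integers: a leading sign character that
// sorts negatives ('0') before positives ('1'), with negative magnitudes
// digit-complemented so that larger-magnitude negatives sort earlier.
public class SignedEncoding {
    static final int WIDTH = 6;  // assumed maximum number of digits

    static String encode(int n) {
        boolean neg = n < 0;
        StringBuilder digits = new StringBuilder(Long.toString(Math.abs((long) n)));
        while (digits.length() < WIDTH) digits.insert(0, '0');
        if (neg) {
            // Complement each digit: -99 must sort before -5.
            for (int i = 0; i < digits.length(); i++) {
                digits.setCharAt(i, (char) ('9' - (digits.charAt(i) - '0')));
            }
            return "0" + digits;
        }
        return "1" + digits;
    }
}
```

With this scheme, ordinary string comparison of the encoded values agrees with integer comparison across the full signed range, so RangeQuery-style filtering works unmodified.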
RE: setting Similarity at search time
Take a look at this: http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 Not my initial patch, but the latest patch from Wolf Siberski. I haven't used it yet, but it looks like what you are looking for, and something I want to use too. Chuck -Original Message- From: Ken McCracken [mailto:[EMAIL PROTECTED] Sent: Monday, November 15, 2004 11:31 AM To: Lucene Users List Subject: setting Similarity at search time Hi, Is there a way to set the Similarity at search(...) time, rather than just setting it on the (Index)Searcher object itself? I'd like to be able to specify different similarities in different threads searching concurrently, using the same IndexSearcher instance. In my use case, the choice of Similarity is a parameter of the search request, and hence may be different for each request. Can such a method be added to override the search(...) method? Thanks, -Ken - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: How to efficiently get # of search results, per attribute
My Lucene application includes multi-faceted navigation that does a more complex version of the below. I've got 5 different taxonomies into which every indexed item is classified. The largest of the taxonomies has over 15,000 entries while the other 4 are much smaller. For every search query, I determine the best small set of nodes from each taxonomy to present to the user as drill down options, and provide the counts regarding how many results fall under each of these nodes. At present I only have about 25,000 indexed objects and usually no more than 1,000 results from the initial query. To determine the drill-down options and counts, I scan up to 1,000 results computing the counts for all nodes into which these results classify. Then for each taxonomy I pick the best drill-down options available (orthogonal set with reasonable branching factor that covers all results) and present them with their counts. If there are more than 1,000 results, I extrapolate the computed counts to estimate the actual counts on the entire set of results. This is all done with a single index and a single search. The total time required for performing this computation for the one large taxonomy is under 10ms, running in full debug mode in my ide. The query response time overall is subjectively instantaneous at the UI (Google-speed or better). So, unless some dimension of the problem is much bigger than mine, I doubt performance will be an issue. 
Chuck -Original Message- From: Nader Henein [mailto:[EMAIL PROTECTED] Sent: Saturday, November 13, 2004 2:29 AM To: Lucene Users List Subject: Re: How to efficiently get # of search results, per attribute It depends on how many results they're looking through; here are two scenarios I see: 1] If you don't have that many records, you can fetch all the results and then do a post-parsing step to determine totals 2] If you have a lot of entries in each category and you're worried about fetching thousands of records every time, you can have separate indices per category and search them in parallel (not Lucene Parallel Search), fetching up to 100 hits from each one (for efficiency) while still getting the total from each search to display. Either way you can boost speed by using a RAMDirectory if you need more speed from the search, but whichever approach you choose I would recommend that you sit down and do some number crunching to figure out which way to go. Hope this helps Nader Henein Chris Lamprecht wrote: I'd like to implement a search across several types of entities, let's say classes, professors, and departments. I want the user to be able to enter a simple, single query and not have to specify what they're looking for. Then I want the search results to be something like this: Search results for: philosophy boyer Found: 121 classes - 5 professors - 2 departments search results here... I know I could iterate through every hit returned and count them up myself, but that seems inefficient if there are lots of results. Is there some other way to get this kind of information from the search result set? My other ideas are: doing a separate search for each result type, or storing different types in different indexes. Any suggestions? Thanks for your help!
-Chris - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
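Chuck's count-then-extrapolate approach above can be sketched without any Lucene machinery. This is a hypothetical simplification (real results would come from a Hits object and the taxonomy lookup would map documents to nodes): scan up to a cap of results, tally per-category counts, then scale the tallies to the full result count when more results exist.

```java
import java.util.HashMap;
import java.util.Map;

// Tally category counts over at most `cap` results; if the query matched
// more results than were scanned, extrapolate the sampled counts.
public class FacetCounts {
    static Map<String, Integer> count(String[] categories, int totalHits, int cap) {
        Map<String, Integer> counts = new HashMap<>();
        int scanned = Math.min(categories.length, cap);
        for (int i = 0; i < scanned; i++) {
            counts.merge(categories[i], 1, Integer::sum);
        }
        if (totalHits > scanned && scanned > 0) {
            // Estimate counts over the entire result set from the sample.
            double scale = (double) totalHits / scanned;
            counts.replaceAll((k, v) -> (int) Math.round(v * scale));
        }
        return counts;
    }
}
```

The extrapolated numbers are estimates, which matches the description above: exact counts for result sets under the cap, and scaled approximations beyond it.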
RE: Anyone implemented custom hit ranking?
I've done some customization of scoring/ranking and plan to do more. A good place to start is with your own Similarity, extending Lucene's DefaultSimilarity. Like you, I found the default length normalization to not work well with my dataset. I separately weight each indexed field according to a static relative importance (implemented as a query boost factor that is automatically applied) and then disable length normalization altogether by redefining lengthNorm() to always return 1.0f. I also had problems with tf and idf normalization, especially with idf dominating the ranking determination. To address that, my Similarity increases the base of the log for each, and adds a final square root to the idf computation since Lucene squares the idf in the score computations. Have you tried the explain() mechanism? It is a great way to see precisely how your results are being scored (but be warned there is a final normalization in Hits that explain() does not show -- this final normalization does not affect the ranking order, but it does affect the final scores). Chuck -Original Message- From: Sanyi [mailto:[EMAIL PROTECTED] Sent: Saturday, November 13, 2004 12:38 AM To: [EMAIL PROTECTED] Subject: Anyone implemented custom hit ranking? Hi! I have problems with short text ranking. I've read about same raking problems in the list archives, but found only hints and toughts (adjust DefaultSimilarity, Similarity, etc...), not complete solutions with source code. Anyone implemented a good solution for this problem? (example: my search application returns about 10-20 pages of 1-2 word hits for hello, and then it starts to list the longer texts) I've implemented a very simple solution: I boost documents shorter than 300 chars with 1/300*doclength at index time. Now it works a lot better. In fact, I can't see any problems now. Anyway, I think this is not the solution, this is a patch or workaround. 
So, I'd be interested in some kind of well designed complete solution for this problem. Regards, Sanyi - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
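The flattening Chuck describes above can be illustrated as plain functions. This is an assumption-laden sketch, not Lucene's DefaultSimilarity code: lengthNorm is flattened all the way to 1.0, and the illustrative idf uses a larger log base plus a final square root to offset idf being squared in the score computation. The exact base and constants are invented for illustration.

```java
// Illustrative flattened scoring curves (hypothetical constants, not Lucene's).
public class FlatSimilarity {
    // Disable length normalization entirely: every field length scores equally.
    static float lengthNorm(int numTerms) {
        return 1.0f;
    }

    // Damped idf: log base 10 instead of a natural log, then a square root
    // so that idf^2 in the score behaves like plain idf.
    static float idf(int docFreq, int numDocs) {
        double raw = Math.log10((double) numDocs / (docFreq + 1)) + 1.0;
        return (float) Math.sqrt(raw);
    }
}
```

The effect to verify with explain() is that rare terms still outrank common ones, but no longer by the overwhelming margins that a squared, steep idf produces.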
RE: lucene Scorers
I had a similar need and wrote MaxDisjunctionQuery and MaxDisjunctionScorer. Unfortunately these are not available as a patch but I've included the original message below that has the code (modulo line breaks added by simple text email format). This code is functional -- I use it in my app. It is optimized for its stated use, which involves a small number of clauses. You'd want to improve the incremental sorting (e.g., using the bucket technique of BooleanQuery) if you need it for large numbers of clauses. Re. Paul's suggested steps below, I did not integrate this with query parser as I didn't need that functionality (since I'm generating the multi-field expansions for which max is a much better scoring choice than sum). Chuck Included message: -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Monday, October 11, 2004 9:55 PM To: [EMAIL PROTECTED] Subject: Contribution: better multi-field searching The files included below (MaxDisjunctionQuery.java and MaxDisjunctionScorer.java) provide a new mechanism for searching across multiple fields. The issue is this. Imagine you have two fields, title and document, both of which you want to search with simple queries like: albino elephant. There are two general approaches, either a) create a combined field that concatenates the two individual fields, or b) expand the simple query into a BooleanQuery that searches for each term in both fields. With approach a), you lose the flexibility to set separate boost factors on the individual fields. I wanted title to be much more important than description for ranking results, and wanted to control this explicitly, as length norm was not always doing the right thing; e.g., descriptions are not always long. With approach b) you run into another problem. Suppose the example query is expanded into (title:albino description:albino title:elephant description:elephant). 
Then, assuming tf/idf doesn't affect ranking, a document with albino in both title and description will score the same as a document with albino in title and elephant in description. The latter document for most applications is much better since it matches both query terms. If albino is the more important term according to idf, then the less desirable documents (albino in both fields) will rank consistently ahead of the albino elephants (which is what was happening to me, yielding horrible results). MaxDisjunctionQuery solves this problem. The MaxDisjunctionQuery pretty prints as: (q1 | q2 | ... | qn)~tiebreaker The qi's are any subqueries. This generates the same results as an OR-type BooleanQuery but scores them differently. The score for any document d is the maximum value of the score that d receives for any subquery, plus the tiebreaker times the sum of the scores it receives for any other retrieving subqueries. In the simplest case, tiebreaker is 0.0f, and the score is simply the maximum score for any retrieving subquery. If tiebreaker is nonzero, it should be much smaller than the boosts being used (0.1 is working very well for me with title boost at 4.0 and description boost at 1.0). With this mechanism, the albino elephant query is expanded like this: ( (title^4.0:albino | description:albino)~0.1 (title^4.0:elephant | description:elephant)~0.1 ) I.e., a BooleanQuery is used to cover the distinct terms, while a MaxDisjunctionQuery is used to expand the fields. This query has the following properties: 1. Documents with two distinct terms score higher than documents with the same term in the two different fields. 2. Documents that contain a title match for a term score higher than documents containing only a description match for the same term. 3. 
If two documents contain the same query terms, and yet one of them contains one of the query terms in multiple fields while the other does not, the document containing the term in multiple fields scores higher (this is the purpose of the tiebreaker -- it breaks ties among documents that match the same terms in the same highest-scoring fields). Sorry if this is redundant, but I didn't find anything in Lucene already to do this. It has helped me considerably, so I'd like to submit it in case others are facing the same issues. As an aside, is there a reason that idf is squared in each Term and Phrase match (it is multiplied both into the query component and the field component)? To compensate for this, I'm taking the square root of the idf I really want in my Similarity, which seems strange. Thanks for any info on that and any feedback on the utility of MaxDisjunctionQuery. NOTE: The java files use generics and so require the 1.5 jdk, although it would be straightforward to back-port them to earlier jdk's. Chuck Williams *** MaxDisjunctionQuery.java /* * MaxDisjunctionQuery.java * * Created on October 9, 2004, 3:17 PM */ package org.apache.lucene.search; import java.io.IOException
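The score combination described above reduces to a small formula: the maximum score over the subqueries, plus the tiebreaker times the sum of the other matching subqueries' scores. A minimal sketch of just that arithmetic (the field scores here are illustrative values, not Lucene output):

```java
// MaxDisjunction-style score combination: max plus tiebreaker-weighted rest.
public class MaxDisjunctionScore {
    static float combine(float[] fieldScores, float tiebreaker) {
        float max = 0f, sum = 0f;
        for (float s : fieldScores) {
            sum += s;
            if (s > max) max = s;
        }
        return max + tiebreaker * (sum - max);
    }
}
```

With the boosts from the example above (title 4.0, description 1.0, tiebreaker 0.1), a term matching both fields scores only slightly above one matching title alone, so distinct-term matches dominate while same-term multi-field matches still break ties.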
RE: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?
Thanks Daniel and Justin for the suggestions! I have a fix and will record my experience here for the benefit of anybody else facing this problem: 1. .cvsignore did not work. CVS may ignore the Lucene index directory, but it still insists on creating the CVS subdirectory of the index directory. 2. I didn't try the suggestion of defining an alias module with a CVS directory exclude (!) restriction. This might have worked had I limited all my CVS operations to just work with the alias module, but this would limit flexibility and remove a lot of the nice CVS integration features in the Netbeans ide. 3. Bernhard's patch solves the problem! I had a couple minor glitches installing it. First, there is a missing throws IOException declaration on the list(FileFilter) method he has added. Second, the patch is based on a newer version of FSDirectory than the version in 1.4.2, so my attempt to apply the patch automatically failed. Applying the patch manually and adding the throws declaration fixed all problems. I would like to suggest that Bernhard's patch be integrated into the next version of Lucene. Chuck -Original Message- From: Daniel Naber [mailto:[EMAIL PROTECTED] Sent: Friday, November 05, 2004 10:00 AM To: Lucene Users List Subject: Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory? On Friday 05 November 2004 18:03, Chuck Williams wrote: The Lucene index is not in CVS -- neither the directory nor the files. But it is a subdirectory of a directory that is in CVS, Does this patch help? http://issues.apache.org/bugzilla/show_bug.cgi?id=31747 -- http://www.danielnaber.de - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?
Otis, thanks for looking at this. The stack trace of the exception is below. I looked at the code. It wants to delete every file in the index directory, but fails to delete the CVS subdirectory entry (presumably because it is marked read-only; the specific exception is swallowed). Even if it could delete the CVS subdirectory, this would just cause another problem with Netbeans/CVS, since it wouldn't know how to fix up the pointers in the parent CVS subdirectory. Is there a change I could make that would cause it to safely leave this alone? This problem only arises on a full index (incremental == false = create == true). Incremental indexes work fine in my app. Chuck java.io.IOException: Cannot delete CVS at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144) at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:128) at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102) at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:173) at [my app]... -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 1:54 PM To: Lucene Users List Subject: Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory? Hm, as far as I know, a CVS sub-directory in an index directory should not bother Lucene. As a matter of fact, I tested this (I used a file, not a directory) for Lucene in Action. What error are you getting? I know there is -I CVS option for ignoring files; perhaps it works with directories, too. Otis --- Chuck Williams [EMAIL PROTECTED] wrote: I have a Tomcat web module being developed with Netbeans 4.0 ide using CVS. One CVS repository holds the sources of my various web files in a directory structure that directly parallels the standard Tomcat webapp directory structure. This is well supported in a fully automated way within Netbeans. 
I have my search index directory as a subdirectory of WEB-INF, which seemed the natural place to put it. The index files themselves are not in the repository. I want to be able to do CVS Update for the web module directory tree as a whole. However, this places a CVS subdirectory within the index directory, which in turn causes Lucene indexing to blow up the next time I run it, since this is an unexpected entry in the index directory. To work around the problem I need to both delete the CVS subdirectory and find and delete the pointers to it in the Entries file and the Netbeans cache file within the CVS subdirectory of the parent directory. This is annoying to say the least. I've asked the Netbeans users if there is a way to avoid creation of the index's CVS subdirectory, but the same thing happened using WinCVS, so I expect this is not a Netbeans issue. It could be my relative ignorance of CVS. How do others avoid this problem? Any advice or suggestions would be appreciated. Thanks, Chuck - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Sorting in Lucene.
Yes, by one or multiple criteria. Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 6:21 PM To: 'Lucene Users List' Subject: Sorting in Lucene. Hi All, Does Lucene support sorting on the search results? Thanks in advance. Ramon - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Sorting in Lucene.
Ramon, I'm not sure where a guide or tutorial might be, but you should be able to see how to do it from the javadoc. Look at classes Sort, SortField, SortComparator. I've also included a recent message from this group below concerning sorting with multiple fields. FYI, a number of people have wanted to first sort by score and secondarily by another field. This is tricky since scores are frequently different in low-order decimal positions. Good luck, Chuck -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 1:33 AM To: Lucene Users List Subject: Re: sorting by score and an additional field On Nov 3, 2004, at 9:52 PM, Chris Fraschetti wrote: Has anyone had any luck using lucene's built in sort functions to sort first by the lucene hit score and secondarily by a Field in each document indexed as Keyword and in integer form? I get multiple sort fields to work, here's two examples: new Sort(new SortField[]{ new SortField(category), SortField.FIELD_SCORE, new SortField(pubmonth, SortField.INT, true) }); new Sort(new SortField[] {SortField.FIELD_SCORE, new SortField(category)}) Both of these, on a tiny dataset of only 10 documents, works exactly as expected. I can only get it to sort by one or the other... but when it does one, it does sort correctly, but together in {score, custom_field} only the first sort seems to apply. Any ideas? Are you using Lucene 1.4.2? How did you index your integer field? Are you simply using the .toString() of an Integer? Or zero padding the field somehow? You can use the .toString method, but you have to be sure that the sorting code does the right parsing of it - so you might need to specify SortField.INT as its type. It will do automatic detection if the type is not specified, but that assumes that the first document it encounters parses properly, otherwise it will fall back to using a String sort. 
Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 9:53 PM To: 'Lucene Users List' Subject: RE: Sorting in Lucene. Hi Chuck, Can you please point me to some articles or FAQ about Sorting in Lucene? Thanks a lot for your reply. Thanks, Ramon -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 9:44 PM To: Lucene Users List Subject: RE: Sorting in Lucene. Yes, by one or multiple criteria. Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 6:21 PM To: 'Lucene Users List' Subject: Sorting in Lucene. Hi All, Does Lucene supports sorting on the search results? Thanks in advance. Ramon - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Aliasing problem
Looks like you produced a PhraseQuery rather than a BooleanQuery. You want +GAME:(doom3 3 doom) Chuck -Original Message- From: Abhay Saswade [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 26, 2004 10:22 AM To: [EMAIL PROTECTED] Subject: Aliasing problem Hi, One document in my index contains the term 'doom 3' (indexed, tokenized, stored). How can I match the term doom3 with that document? I tried the following but no luck. I have written an alias filter which returns 2 more tokens for doom3: 3 and doom. I construct the query +GAME:doom3, and QueryParser returns +GAME:"doom3 3 doom". I am using StandardTokenizer. Is my approach correct? Or am I missing something? Any help highly appreciated. Thanks in advance, Abhay - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
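The fix Chuck suggests is to OR the alias tokens inside a single required clause rather than let them be parsed as a phrase. A minimal sketch of building that query string (plain Java; the class and helper names are hypothetical, and this covers only the query-string side, not the analyzer):

```java
import java.util.List;

class AliasQueries {
    // Build "+FIELD:(t1 t2 t3)" so the alias tokens are OR'd within one
    // required clause, instead of being treated as a phrase.
    static String requiredAnyOf(String field, List<String> tokens) {
        return "+" + field + ":(" + String.join(" ", tokens) + ")";
    }
}
```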
RE: Range Query
Karthik, It is all spelled out in a Lucene HowTo here: http://wiki.apache.org/jakarta-lucene/SearchNumericalFields Have fun with it, Chuck -Original Message- From: Karthik N S [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 20, 2004 12:15 AM To: Lucene Users List; Jonathan Hager Subject: RE: Range Query Hi Jonathan, "When searching I also pad the query term"??? When exactly are you handling this - during the indexing process as well, or only at search time? Can you please be specific? [If time permits and possible, please send me sample code for the same.] :) Thx in advance -Original Message- From: Jonathan Hager [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 20, 2004 3:31 AM To: Lucene Users List Subject: Re: Range Query That is exactly right. It is searching the ASCII. To solve it I pad my price using a method like this: /** * Pads the price so that all prices are the same number of characters and * can be compared lexicographically. * @param price * @return */ public static String formatPriceAsString(Double price) { if (price == null) { return null; } return PRICE_FORMATTER.format(price.doubleValue()); } where PRICE_FORMATTER contains enough digits for your largest number: private static final DecimalFormat PRICE_FORMATTER = new DecimalFormat("000.00"); When searching I also pad the query term. I looked into hooking into QueryParser, but since the lower/upper prices for my application are different inputs, I chose to handle them without hooking into the QueryParser. Jonathan On Tue, 19 Oct 2004 12:35:06 +0530, Karthik N S [EMAIL PROTECTED] wrote: Hi Guys Apologies. 
I have a field of type Text, 'ItemPrice', using it to store a numeric price factor such as 10, 25.25, 50.00. If I am supposed to find the range between 2 prices, e.g. Contents:shoes +ItemPrice:[10.00 TO 50.60], I get results outside the range that was executed [this may be due to the query comparing the ASCII values instead of the numeric values]. Am I missing something in the query syntax, or is this the wrong way to construct the query? Please somebody advise me ASAP. :( Thx in advance WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
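Jonathan's padding method can be sketched in full as follows (plain Java; the restored "000.00" pattern comes from his message, while pinning the symbols to the US locale is an added assumption so that '.' stays the decimal separator regardless of the default locale):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class PricePadding {
    // Fixed-width pattern: three integer digits, two decimals, so that
    // lexicographic order matches numeric order for prices under 1000.
    private static final DecimalFormat PRICE_FORMATTER =
            new DecimalFormat("000.00", DecimalFormatSymbols.getInstance(Locale.US));

    static String formatPriceAsString(Double price) {
        return price == null ? null : PRICE_FORMATTER.format(price.doubleValue());
    }
}
```

Padding both the indexed values and the query bounds this way makes ItemPrice:[010.00 TO 050.60] behave as a numeric range.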
RE: Range Query
Range queries use a lexicographic (dictionary) order. So, assuming all your values are positive, you need to ensure that the integer part of each number has a fixed number of digits (pad with leading 0's). The fractional part should be fine, although 1.0 will follow 1. If you have negative numbers, you need to pad an extra 0 on the left of the positives, start the negatives with -, and invert the magnitude of the negatives (so they sort in the other order). Your actual example below should work as is, except that 10 will not be in the range since 10.00 is strictly after 10. However, this won't work without the padding if you have any prices with an integer part of other than exactly two digits (e.g., 10 is before 6, but after 06). Chuck -Original Message- From: Karthik N S [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 19, 2004 12:05 AM To: LUCENE Subject: Range Query Hi Guys Apologies. I have a field of type Text, 'ItemPrice', using it to store a numeric price factor such as 10, 25.25, 50.00. If I am supposed to find the range between 2 prices, e.g. Contents:shoes +ItemPrice:[10.00 TO 50.60], I get results outside the range that was executed [this may be due to the query comparing the ASCII values instead of the numeric values]. Am I missing something in the query syntax, or is this the wrong way to construct the query? Please somebody advise me ASAP. :( Thx in advance WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
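Chuck's recipe (fixed-width integer part, a sign character, inverted negative magnitudes) can be sketched as a small encoder. This is a toy illustration under assumed bounds, not code from the thread: it handles values in (-1000, 1000) with two decimals, uses '=' for positives (as his later message in this digest suggests, since '=' sorts after '-'), and inverts negatives against 999.99.

```java
import java.util.Locale;

class LexicographicNumbers {
    // Encode v so that String order matches numeric order:
    //   - '=' prefixes positives ('=' sorts after '-', so all negatives
    //     come first),
    //   - "%06.2f" zero-pads the integer part to three digits,
    //   - negative magnitudes are inverted (999.99 + v) so that, e.g.,
    //     -1 sorts after -2.
    static String encode(double v) {
        if (v >= 0) {
            return "=" + String.format(Locale.US, "%06.2f", v);
        }
        return "-" + String.format(Locale.US, "%06.2f", 999.99 + v);
    }
}
```

For example, -2 encodes as "-997.99", -1 as "-998.99", and 5.5 as "=005.50", so plain string comparison reproduces the numeric order.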
RE: Index and Search Phrase Documents
You haven't provided enough information for anybody to help. Have you added indexed Fields to your document? If not, there is nothing to search. I don't think you are looking for a parameter to the IndexWriter constructor. I expect the advice from Aviran is best: you should read and understand the demo apps. That's how I got started -- the demo apps are quite illuminating about how to index, how to search, how to incrementally index, etc. They work, and they show the techniques that you can readily adapt to your app. Also, I've taken the liberty to move this thread to the more appropriate mailing list. Good luck, Chuck -Original Message- From: PROYECTA.Fernandez Garcia, Ivan [mailto:[EMAIL PROTECTED] Sent: Monday, October 18, 2004 8:13 AM To: Lucene Developers List Subject: RE: Index and Search Phrase Documents I'm looking for information about this question on that page but I cannot resolve my problem. After indexing a document, I search for text and no hits are returned when there are two or three to return. Why? -Original Message- From: Aviran [mailto:[EMAIL PROTECTED] Sent: Monday, October 18, 2004 17:08 To: 'Lucene Developers List' Subject: RE: Index and Search Phrase Documents Lucene comes with demo apps that you can learn from. You can read about them here: http://jakarta.apache.org/lucene/docs/demo.html Aviran http://aviran.mordos.com -Original Message- From: PROYECTA.Fernandez Garcia, Ivan [mailto:[EMAIL PROTECTED] Sent: Monday, October 18, 2004 10:18 AM To: [EMAIL PROTECTED] Subject: Index and Search Phrase Documents Hi everybody, I want to index a text document. I would like to know which parameter I must use (in the IndexWriter constructor) to index a document so that I can search its text afterward. If I want to search for phrases, what class must I use? I would be grateful if you could send me an example. Thanks very much. Iván Fernández García, Proyecta Sistemas de Información
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: index, reindexing problem
I had this same problem a while back. It should be resolved if you move the writer = new IndexWriter(...) until after the reader.close(). I.e., complete all the deletions and close the reader before creating the writer. Chuck -Original Message- From: MATL (Mats Lindberg) [mailto:[EMAIL PROTECTED] Sent: Sunday, October 17, 2004 5:36 AM To: [EMAIL PROTECTED] Subject: index, reindexing problem Hello. I have a problem when reindexing some documents after an index has been created. I get the following error: caught a class java.io.IOException with message: Lock obtain timed out: [EMAIL PROTECTED]:\DOCUME~1\..lucene-0b877c2d5472a608d6ec3ee6174018de-write.lock This is how I do it. 1. First, make the index (_indexDir is the location of the index): writer = new IndexWriter(_indexDir, new StandardAnalyzer(), true); ... do the indexing here ... writer.optimize(); writer.close(); This works fine. 2. This is where I get the error (reindexing an existing document): writer = new IndexWriter(_indexDir, new StandardAnalyzer(), false); Directory directory; IndexReader reader; // if the file is in the index already, remove it directory = FSDirectory.getDirectory(_indexDir, false); reader = IndexReader.open(directory); try { Term term = new Term("deleteid", deleteID.toLowerCase()); if (reader.docFreq(term) >= 1) { deletedItems = reader.delete(term); // <- this is where the error occurs; I get the locking error } } catch (Exception e) { System.out.println("caught a " + e.getClass() + "\n with message: " + e.getMessage()); } finally { reader.close(); directory.close(); } ... continue with reindexing the new document ... I hope anyone can help me with this problem. Best regards, Mats Lindberg - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
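The failure mode above can be modeled without Lucene at all: a deleting IndexReader and an IndexWriter both need the same exclusive write lock, so a writer created before reader.close() cannot obtain it. The toy below (plain Java, not Lucene code; class and file names are hypothetical) mimics a write.lock file with File.createNewFile():

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Toy model of an exclusive lock file like Lucene's write.lock.
class WriteLock {
    private final File lockFile;

    WriteLock(File dir) {
        this.lockFile = new File(dir, "write.lock");
    }

    // Succeeds only if no one else currently holds the lock:
    // createNewFile() returns false if the file already exists.
    boolean obtain() throws IOException {
        return lockFile.createNewFile();
    }

    void release() {
        lockFile.delete();
    }
}
```

In this model, the "reader" obtains the lock to perform deletions; a "writer" created before the reader releases it fails to obtain the lock, which is exactly the Lock obtain timed out situation, and Chuck's fix is to release (close the reader) first.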
RE: Filtering Results?
Ahh yes, that is a good article. I inadvertently missed the need to invert the magnitude of negative numbers in the recipe below (I don't have negatives in any of my fields). Fortunately that is also easy to do. FYI, you don't need a custom query parser for range queries. That's only required if you expect your users to type in range query syntax (so that you have to convert their numbers to your formatted representation). Rather than expect the user to type in that syntax, I provide text input fields for the range bounds in range-searchable fields. You can then either generate standard range query syntax (using the string-formatted encoding of numbers) or generate the RangeQuery objects directly, depending on how you are constructing your queries (with or without QueryParser). Chuck -Original Message- From: sam s [mailto:[EMAIL PROTECTED] Sent: Thursday, October 14, 2004 11:22 AM To: [EMAIL PROTECTED] Subject: RE: Filtering Results? Thanks Chuck. Meanwhile I was searching the net and found this link: http://wiki.apache.org/jakarta-lucene/SearchNumericalFields Thanks again From: Chuck Williams [EMAIL PROTECTED] Reply-To: Lucene Users List [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Subject: RE: Filtering Results? Date: Thu, 14 Oct 2004 09:55:07 -0700 Sam, You can pick any encoding such that lexicographic order (alphabetic order) is consistent with the numeric order you want. E.g., if a single field can contain positive or negative integers or floats, then the following should work: 1. The first character of every value represents the sign. You can't use + and - since + is alphabetically before - (which would make positives smaller than negatives), so pick a different character to represent + like maybe =. 2. Characters 2 through n are a fixed-length string that represents the integer part of the number, padded with leading zeroes. 3. You don't need padding on the right since longer strings alphabetically follow shorter strings. 
Just include the decimal point if the number is a float, and trail off with whatever remaining digits naturally print. 4. One other subtlety occurs if you need to ensure that 2 and 2.0 are equal. You need to transform one to the other (if you can have both integers and floats in a single field -- otherwise this is not an issue). You will lose information about the original type. I haven't tested the above, but think it should work. Chuck -Original Message- From: sam s [mailto:[EMAIL PROTECTED] Sent: Thursday, October 14, 2004 6:40 AM To: [EMAIL PROTECTED] Subject: RE: Filtering Results? Thanks Chuck. What is the workaround for filtering (preferably using RangeQuery) the following? 1. Float values. Do I have to pad those with zeros on both sides? 2. Negative numbers (integers as well as floats) Thanks From: Chuck Williams [EMAIL PROTECTED] Reply-To: Lucene Users List [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Subject: RE: Filtering Results? Date: Wed, 13 Oct 2004 21:49:30 -0700 RangeQuery is a good approach. Put fields on your documents like age. The only tricky thing is that the comparisons are all done lexicographically rather than numerically. Lucene has a built-in routine to convert dates into a monotonic lexicographic sequence (DateField.timeToString). For positive integer data types like age, it is sufficient to store them as fixed-length Strings, e.g.: 5 -- 005 18 -- 018 100 -- 100 Then just use range queries. E.g.: 1. age:[018 TO] 2. age:[TO 018] 3. age:[005 TO 018] Those are >= / <= queries. Use {} instead of [] for > / < queries. Good luck, Chuck -Original Message- From: sam s [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 13, 2004 12:55 PM To: [EMAIL PROTECTED] Subject: Filtering Results? Hi, I want to do filtering on the matched results of a query. For example: 1. age > 18 2. age < 18 3. age > 5 and age < 18 4. birthdate = [some date] What is the best approach? How can it be done with a range query? Can it be done without a range query? Also, 
Where can I find information on the meaning of the following classes and how to use them? FilteredQuery, QueryFilter (I didn't understand much looking at its test case), CachingWrapperFilter, etc. Thanks in advance - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
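Chuck's age examples above can be sketched as a small helper (plain Java; the class and method names are hypothetical, and the three-digit padding assumes values under 1000):

```java
class AgeRanges {
    // Pad positive integers to three digits so that lexicographic range
    // comparison matches numeric comparison.
    static String pad(int age) {
        return String.format("%03d", age);
    }

    // Build an inclusive ([ ] syntax, i.e. >= / <=) range query string
    // for a field, as in the examples above.
    static String inclusiveRange(String field, int lo, int hi) {
        return field + ":[" + pad(lo) + " TO " + pad(hi) + "]";
    }
}
```

For example, "age > 5 and age < 18" becomes the padded (exclusive, { } syntax) range over "005" and "018"; without the padding, "100" would sort before "18" and the range would return wrong results.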