Re: Internals question: BooleanQuery with many TermQuery children
On Tuesday 07 April 2009 05:04:44 Daniel Noll wrote:
> Hi all.
>
> This is something I have been wondering for a while but can't find a
> good answer by reading the code myself.
>
> If you have a query like this:
>
> ( field:Value1 OR
>   field:Value2 OR
>   field:Value3 OR
>   ... )
>
> How many TermEnum / TermDocs scans should this execute?
>
> (a) One per clause, or
> (b) One for the entire boolean query?

One per clause.

> I wonder because we use a lot of queries of this nature, and I can't
> find any direct evidence that they get logically merged, leading me to
> believe that it's one per clause at present (and thus this becomes a
> potential optimisation.)

The problem is not only in the scanning of the TermDocs, but also in the
merging by docId (on a heap) that has to take place when several of them are
used at the same time during the query search.

Some optimisations are already in place:
- By allowing docs to be scored out of order, most top-level OR queries can be
  merged with a faster algorithm (a distributive sort over docId ranges) that
  still uses the term frequencies (see BooleanQuery.setAllowDocsOutOfOrder()).
- Various Filters merge into a bitset, using a single TermDocs and ignoring
  term frequencies (see MultiTermQuery.getFilter()).
- The new TrieRangeFilter premerges ranges at indexing time, also ignoring
  term frequencies.

Using the TermDocs one by one has another advantage: it reduces disk seek
distances in the index. This is noticeable on disks whose heads take more time
to move over longer distances. SSDs don't have moving heads, so they show
smaller performance differences between merging into a bitset, by distributive
sort, and by a heap.

For the time being, Lucene does not have a low-level facility for key values
that occur at most once per document field, so for these it normally helps to
use a Filter.

Regards,
Paul Elschot
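If the individual clause scores don't matter, one way to act on this advice is
to collect all the values into a single filter and wrap it in a constant-score
query, so one bitset is built instead of a per-clause heap merge. A minimal
sketch, assuming the 2.4-era contrib/queries TermsFilter; the class, field and
value names below are illustrative, not part of the original thread:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TermsFilter; // contrib/queries

public class ManyTermsExample {
    public static Query manyValuesQuery(String field, String[] values) {
        // One TermsFilter walks a single TermDocs per term and ORs the
        // results into a bitset, ignoring term frequencies entirely.
        TermsFilter filter = new TermsFilter();
        for (String value : values) {
            filter.addTerm(new Term(field, value));
        }
        // ConstantScoreQuery gives every match the same score, which is
        // usually fine when the clauses are pure "membership" tests.
        return new ConstantScoreQuery(filter);
    }

    public static TopDocs run(IndexSearcher searcher, String field, String[] values)
            throws java.io.IOException {
        return searcher.search(manyValuesQuery(field, values), null, 10);
    }
}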
Re: boost and score doubt
Negative boosts are accepted, though rather "unusual". Also note that Lucene
by default filters out any hits with scores <= 0.0.

Normally you'd set the boost to something > 0.0 (0.1 should work).

What unexpected effect are you seeing?

If you omit norms, then indeed your per-doc boost (and per-field boost, if
used) are discarded (have no effect).

Mike

On Mon, Apr 6, 2009 at 4:01 PM, Marc Sturlese wrote:
>
> Hey there,
> Does the function doc.setBoost(x.y) accept negative values, or values lower
> than 1? I mean, it compiles and doesn't give errors, but the behaviour is
> not exactly what I was expecting.
> In my use case I have the field title. I want to give very, very low
> relevance to the documents whose title has fewer than 40 characters. I have
> tried setting the boost to negative values or to 0.1.
> Which is the best way to do that?
> Is there any range of values for setting the boost?
>
> And another thing that confuses me: if I omit norms in the scoring
> function, how does it affect the boosting I am setting? Does it lose
> the effect?
>
> Thanks in advance!
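For the use case in the quoted question, a minimal index-time sketch of Mike's
advice: keep norms on the field and use a small positive document boost. The
field name, length threshold and boost value are illustrative assumptions, not
from the thread:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TitleBoostExample {
    // Give documents with short titles a very low (but positive) boost.
    // The field keeps its norms (no Field.Index.NO_NORMS / setOmitNorms),
    // otherwise the document boost is silently discarded at search time.
    static void addBook(IndexWriter writer, String title) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        if (title.length() < 40) {
            doc.setBoost(0.1f); // > 0.0, so the hit is not filtered out
        }
        writer.addDocument(doc);
    }
}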
Re: boost and score doubt
That was my problem: I was omitting norms, so the boost I gave the document at
index time was not taking effect. Since I stopped omitting them, the results
have changed completely. Thanks!

Michael McCandless-2 wrote:
>
> If you omit norms, then indeed your per-doc boost (and per-field
> boost, if used) are discarded (have no effect).
How to customize score according to field value?
Hi,

I have the following situation where I need to customize the final score
according to a field value.

Suppose there are two docs in my query result, ordered by the default score
sort:

doc1(field1:bookA, field2:2000-01-01) -- score:0.80
doc2(field1:bookB, field2:2009-01-01) -- score:0.70

I want "doc2" to have a higher score since its publishing date is more recent,
and "doc1" to have a lower score:

doc2(field1:bookB, field2:2009-01-01) -- score:0.77
doc1(field1:bookA, field2:2000-01-01) -- score:0.73

I found this scenario is different from doc.setBoost() and field.setBoost().
Is there any way to influence the score calculated for "doc1" and "doc2"
according to the value of "field2"?

Thank you in advance!
Re: How to customize score according to field value?
Do you want the dates to *influence* or *determine* the order?

I don't have much help to offer if what you're after is something like "docs
that are more recent tend to rank higher", although I vaguely remember this
question coming up on the user list; a search of the archive might turn
something helpful up.

But if you want the date to completely determine the order, you can always
sort by date; see some of the IndexSearcher.search(...sort...) methods.

Best
Erick

On Tue, Apr 7, 2009 at 3:08 AM, Jinming Zhang wrote:
> Is there any way to impact the score calculated for "doc1" & "doc2"
> according to the value of "field2"?
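If the date should completely determine the order, a minimal sketch of the
sort-based route Erick mentions, assuming "field2" is indexed as a single
untokenized term per document (e.g. yyyy-MM-dd) so it is sortable:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopFieldDocs;

public class SortByDateExample {
    // Ignore relevance and return the most recent documents first.
    static TopFieldDocs searchNewestFirst(IndexSearcher searcher, Query query)
            throws java.io.IOException {
        // reverse = true, so "2009-01-01" sorts before "2000-01-01"
        Sort byDate = new Sort(new SortField("field2", SortField.STRING, true));
        return searcher.search(query, null, 10, byDate);
    }
}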
RE: Multiple Analyzer on Single field
Hi All,
Sorry for the confusing email.

Suppose I have a text field with the content below:

KeyWordAnalyzer is a class. this keyword is used in java.

Here I want "KeyWordAnalyzer" split into "Key Word Analyzer", and "class"
should be treated as a keyword, so that it is found if someone searches for
it. Apart from this I want "Key Word Analyzer" to be tokenized properly so
that search becomes better.

Regards,
Allahbaksh

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, April 06, 2009 9:31 PM
To: java-user@lucene.apache.org
Subject: Re: Multiple Analyzer on Single field

This really doesn't make sense. KeywordAnalyzer will NOT
tokenize the input stream. StandardAnalyzer WILL tokenize
the input stream. I can't imagine what it means to do both at
the same time.

Perhaps you could give us some examples of what your desired
inputs and outputs are so we could steer you in the right direction.

I suspect you're thinking more in terms of TokenFilters and/or
Tokenizers...

Best
Erick

On Mon, Apr 6, 2009 at 10:52 AM, Allahbaksh Mohammedali Asadullah <
allahbaksh_asadul...@infosys.com> wrote:

> Hi,
> I want to add multiple Analyzers on a single field. I want the properties
> of KeywordAnalyzer, SimpleAnalyzer, StandardAnalyzer, and WhitespaceAnalyzer.
> Is there any easy way to have all analyzers bundled on a single field?
> Regards,
> Allahbaksh
Re: How to customize score according to field value?
On Tue, Apr 7, 2009 at 3:08 AM, Jinming Zhang wrote:
> Is there any way to impact the score calculated for "doc1" & "doc2"
> according to the value of "field2"?

If you have access to the MEAP for Lucene in Action, 2nd Edition, it
demonstrates using a CustomScoreQuery [1] to boost a doc's score based on
recency.

--tim

[1] - http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/function/CustomScoreQuery.html
Re: Multiple Analyzer on Single field
Hmmm. There's nothing in Lucene that I know of that will do what you want;
you'll have to do one of two things.

In general, you'll have to break up your token stream yourself, either through
pre-processing or by building your own analyzers. There's nothing already
built that I know of that will break up, for instance, KeyWordAnalyzer into
three tokens.

Part of the confusion is the use of the phrase "keyword", as in "and class
should be a Key word". If I'm reading this right, you'll want "class" to be in
a separate field since it's special (in your context). Again, to accomplish
this you either need to pre-process the input stream, extract "class", and put
it in a separate field, or create your own analyzer that extracts only "class"
from the input stream. Then you'd feed the entire contents into *both* fields
(say "content" and "key"). The analyzer attached to the "content" field (see
PerFieldAnalyzerWrapper) would take care of breaking up things like
KeyWordAnalyzer, and the analyzer attached to the "key" field would throw away
everything except "class".

Hope this helps
Erick

On Tue, Apr 7, 2009 at 8:57 AM, Allahbaksh Mohammedali Asadullah <
allahbaksh_asadul...@infosys.com> wrote:

> Suppose I have a field text with content below
>
> KeyWordAnalyzer is a class. this keyword is used in java.
>
> Here I want "KeyWordAnalyzer" split into "Key Word Analyzer", and "class"
> should be treated as a keyword, so that it is found if someone searches
> for it.
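A minimal sketch of the two-field setup with PerFieldAnalyzerWrapper; the
keywordOnlyAnalyzer passed in stands for a hypothetical custom analyzer that
keeps only the terms you treat as keywords, and the field names are the
illustrative ones from Erick's reply:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PerFieldExample {
    // "content" gets a normal tokenizing analyzer; "key" gets the custom
    // analyzer that throws away everything except the keyword terms.
    static Analyzer buildAnalyzer(Analyzer keywordOnlyAnalyzer) {
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("key", keywordOnlyAnalyzer);
        return wrapper; // pass to the IndexWriter and to the query parser
    }
}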
Re: Lucene and Phrase Correction
Karl,

Thank you for your in-depth reply. This has given me good grounds to go on.

Regards
Glyn

2009/4/6 Karl Wettin :
> 6 apr 2009 kl. 14.59 skrev Glyn Darkin:
>
> Hi Glyn,
>
>> to be able to spell check phrases
>>
>> E.g
>>
>> "Harry Poter" is converted to "Harry Potter"
>>
>> We have a fixed dataset so can build indexes/dictionaries from our
>> own data.
>
> The most obvious solution is to index your contrib/spell checker with
> shingles. This will however probably only help out with exact phrases.
> Perhaps that is enough for you.
>
> If your example is a real one that you came up with by analyzing query
> logs, then you might want to consider creating an index "stemmed" to handle
> various problems associated with reading and writing disorders. Dyslexic
> people often miss out on vowels, those who suffer from dysgraphia have
> problems with q/p/d/b, others have problems with recurring characters, etc.
> A combination of these problems could end up in a secondary "fuzzy" index
> that contains weighted shingles like this for the document that points at
> "harry potter":
>
> "hary poter"^0.9
> "harry #otter"^0.8
> "hary #oter"^0.7
> "hrry pttr"^0.7
> "hry ptr"^0.5
>
> In order to get good precision/recall, your query to such an index would
> have to produce a boolean query containing all of the "stems" above if the
> input was spelled correctly.
>
> One alternative to the contrib/spell checker is Spelt:
> http://groups.google.com/group/spelt/ and I believe it is supposed to
> handle phrases.
>
> Note the difference between spell checking and suggestion schemes.
> Something can be wrong even though the spelling is correct. Consider the
> game "Heroes of Might and Magic": people might have forgotten what it was
> called and search for "Heroes of light and magic" instead. Hopefully your
> query would still yield a fairly good result for the correct document if
> the latter was entered, but if you require all terms or something similar
> then it might return no hits.
>
> More advanced strategies for contextual spell checking of phrases usually
> involve statistical models such as neural networks, hidden Markov models,
> etc. LingPipe contains such an implementation.
>
> You can also take a look at reinforcement learning: learning from the
> mistakes and corrections made by your users. It requires a lot of data
> (user query logs) in order to work, but will yield very cool results such
> as handling acronyms.
>
> LUCENE-626 is a multi-layered spell checker with reinforcement learning at
> the top, backed by an a priori corpus (that can be compiled from old user
> queries) used to find context. It also uses a refactored version of the
> contrib/spell checker as a second-level suggestion when there is nothing
> to pick up from previous user behaviour. I never deployed this in a real
> system; it does however seem to work great when trained with a few hundred
> thousand query sessions.
>
> Finally, I recommend that you take some time to analyze user query sessions
> to find the most common problems your users have and try to find a solution
> that best fits those problems. Too often features are implemented because
> they are listed in a specification and not because the users need them.
>
> I hope this helps.
>
> karl

--
Glyn Darkin
Darkin Systems Ltd
Mob: 07961815649
Fax: 08717145065
Web: www.darkinsystems.com
Company No: 6173001
VAT No: 906350835
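A minimal sketch of the "index the spell checker with shingles" idea Karl
mentions, assuming a hypothetical "phrase" field that already contains
two-word shingles stored as single terms ("harry potter", "potter and", ...);
the field name and setup are illustrative:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;

public class PhraseSpellExample {
    static String[] suggestPhrase(Directory spellDir, IndexReader reader, String input)
            throws IOException {
        SpellChecker spell = new SpellChecker(spellDir);
        // Each shingle becomes one "word" in the spell checker's dictionary,
        // so "harry poter" can be corrected to "harry potter" as a unit.
        spell.indexDictionary(new LuceneDictionary(reader, "phrase"));
        return spell.suggestSimilar(input, 5);
    }
}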
Re: How to customize score according to field value?
You might want to play with both boosting and multiple sorts.

You might want to look at something like Solr's boost queries or boost
functions:
http://wiki.apache.org/solr/DisMaxRequestHandler#head-6862070cf279d9a09bdab971309135c7aea22fb3

Or, if you want to go down the path of a custom score, most folks override the
customScore method of CustomScoreQuery:

// create a term query to search against all documents
Query tq = new TermQuery(new Term("metafile", "doc"));
FieldScoreQuery fsQuery = new FieldScoreQuery("geo_distance", Type.FLOAT);
CustomScoreQuery customScore = new CustomScoreQuery(tq, fsQuery) {
    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        // ...
        return myFunkyScore;
    }
};

You can see a quick version in
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestCartesian.java?revision=762801&view=markup

HTH
P

On Tue, Apr 7, 2009 at 9:01 AM, Tim Williams wrote:
> If you have access to the MEAP for Lucene in Action, 2nd Edition, it
> demonstrates using a CustomScoreQuery to boost a doc's score based on
> recency.
Re: How to search a phrase using quotes in a query ???
Here is my code for indexing:
[code]
public static void main(String[] args) throws IOException {
    if (args.length == 2) {
        String docsDirectory = args[0];
        String indexFilepath = args[1];
        int numIndexed = 0;
        IndexWriter writer;
        ArrayList arrayList = new ArrayList();
        try {
            Analyzer analyzer = new EnglishAnalyzer();
            writer = new IndexWriter(indexFilepath, analyzer, true);
            writer.setUseCompoundFile(true);
            File directory = new File(docsDirectory);
            String[] list = directory.list();
            for (int i = 0; i < list.length; i++) {
                File doc = new File(docsDirectory, list[i]);
                BufferedReader reader;
                try {
                    reader = new BufferedReader(new FileReader(doc));
                    String linea = reader.readLine();
                    StringBuffer texto = new StringBuffer();
                    while (linea != null) {
                        // whatever we need to do with the line can go here
                        texto.append(linea);
                        linea = reader.readLine();
                    }
                    System.out.println(i);
                    indexFile(writer, texto.toString(), doc.getAbsolutePath());
                    arrayList.add(new String(new byte[1000]));
                    reader.close();
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            numIndexed = writer.docCount();
            writer.optimize();
            writer.close();
        } catch (CorruptIndexException e1) {
            e1.printStackTrace();
        } catch (LockObtainFailedException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    } else {
        System.err.println("You need to provide arguments ");
    }
}

// method to actually index a file using Lucene
private static void indexFile(IndexWriter writer, String content, String title) throws IOException {
    long init = System.currentTimeMillis();
    Document doc = new Document();
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
    writer.addDocument(doc);
    long end = System.currentTimeMillis();
    System.out.println("ms " + (end - init));
}
[/code]

And for searching:
[code]
public static void main(String[] args) {
    String path = "C:\\index";
    try {
        IndexSearcher indexSearcher = new IndexSearcher(path);
        String[] fields = new String[]{"title", "content"};
        Analyzer analyzer = new EnglishAnalyzer();
        String[] textFields = new String[]{"\"The Bank of America\"", "\"The Bank of America\""};
        Query query = MultiFieldQueryParser.parse(textFields, fields, analyzer);
        Hits hits = indexSearcher.search(query);
        System.out.println("Founded: " + hits.length());

        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(scorer);
        Fragmenter fragmenter = new SimpleFragmenter(100);
        highlighter.setTextFragmenter(fragmenter);

        for (int i = 0; i < hits.length(); i++) {
            Document document = hits.doc(i);
            String body = hits.doc(i).get("content");
            System.out.println((i + 1) + " " + body.substring(0, 20));
            System.out.println(document.get("path"));
            if (body == null) body = "";
            TokenStream stream = analyzer.tokenStream("content", new StringReader(body));
            //System.out.println(highlighter.getBestFragment(stream, body));
            String[] fragment = highlighter.getBestFragments(stream, body, 3);
            if (fragment.length == 0) {
                fragment = new String[1];
                fragment[0] = "";
            }
            StringBuilder buffer = new StringBuilder();
            for (int j = 0; j < fragment.length; j++) {
                buffer.append(fragment[j] + "...\n");
            }
            String stringFragment = buffer.toString();
            System.out.println(stringFragment);
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ParseException e) {
Re: How to search a phrase using quotes in a query ???
Well, nothing jumps out at me, although I confess that I've not used
MultiFieldQueryParser. So here's what I'd do:

1> Drop back to a simpler way of doing things. Forget about
MultiFieldQueryParser for instance. Get the really simple case working, then
build back up. I'd also drop back to a very basic analyzer (perhaps
SimpleAnalyzer). Get that very simple case to work. Then substitute your
EnglishAnalyzer back in, etc. I'm guessing that one of these steps will
suddenly fail and you'll have a good place to start.

2> Print out query.toString() and paste the results into Luke and see what it
gives you. The Explain (in Luke) should help.

Sorry I can't be more help, but I've often found that getting the easy way of
doing things to work and then adding my complications back in produces one of
those "I didn't think *that* could possibly fail" moments.

Best
Erick

On Tue, Apr 7, 2009 at 5:12 PM, Ariel wrote:
> Here is my code for indexing:
> [quoted code from the previous message snipped]
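A minimal sketch of the stripped-down test Erick suggests: one field, one
basic analyzer, one phrase query, with query.toString() printed so the parsed
form can be pasted into Luke. The index path and field name are illustrative,
and the calls match the 2.x-era API already used in the thread:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class PhraseSanityCheck {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("C:\\index");
        // Step 1: single field, very basic analyzer, one phrase query.
        QueryParser parser = new QueryParser("content", new SimpleAnalyzer());
        Query query = parser.parse("\"The Bank of America\"");
        // Step 2: inspect what the parser/analyzer actually produced.
        System.out.println("Parsed query: " + query.toString());
        Hits hits = searcher.search(query);
        System.out.println("Hits: " + hits.length());
        searcher.close();
    }
}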
Re: How to customize score according to field value?
Jinming Zhang wrote:
> Is there any way to impact the score calculated for "doc1" & "doc2"
> according to the value of "field2"?

Hi,

If I were you, I would store the date information as a long (as far as I know,
Lucene stores date information as a long internally, so this is a natural fit;
you can convert between dates and longs very easily using Lucene's date APIs)
and define a linear function over that date. The input of the function is the
date value and the output is a simple float indicating how recent a book is.
Since more recent books get linearly larger function values, you can finely
adjust your score by weighting the output of the function according to your
ranking policy. Then simply modify each doc's score on the fly, using the
function's output, inside customScore(), which is mentioned in one of the
earlier replies to your question by Patrick O'Leary.

bye.
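A minimal sketch combining this idea with the CustomScoreQuery approach from
the earlier replies, assuming a hypothetical "pubdate" field indexed as a
single integer token such as 20090101; the field name and constants are
placeholders (a days-since-epoch value would make the linear weighting
cleaner):

import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

public class RecencyScoreExample {
    // Blend the text relevance score with a recency weight derived from
    // the numeric "pubdate" field.
    static Query recencyBoosted(Query textQuery) {
        FieldScoreQuery dateValues =
                new FieldScoreQuery("pubdate", FieldScoreQuery.Type.INT);
        return new CustomScoreQuery(textQuery, dateValues) {
            public float customScore(int doc, float subQueryScore, float valSrcScore) {
                // Linear recency weight: newer dates give a larger multiplier.
                // Constants are placeholders to be tuned for a real corpus.
                float recency = (valSrcScore - 20000101f) / (20091231f - 20000101f);
                return subQueryScore * (0.8f + 0.4f * recency);
            }
        };
    }
}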
test
hi

--
DigitalGlue, India
RE: test
Hi,

In a long-running process, Lucene crashes in my application. Is there any way
to diagnose this, or how can I turn on debug / trace logging for Lucene?

Thanks
Antony

--
DigitalGlue, India
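Lucene of this era has no general logging framework, but IndexWriter can write
low-level diagnostics (flushes, merges, and so on) to an infoStream, which
sometimes helps with long-running indexing problems. A minimal sketch; the
index path, analyzer and log file name are illustrative assumptions:

import java.io.IOException;
import java.io.PrintStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class InfoStreamExample {
    // Route IndexWriter's internal diagnostics to a file so there is
    // something to inspect after a long-running process fails.
    static IndexWriter openWithDiagnostics(String indexPath) throws IOException {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        writer.setInfoStream(new PrintStream("lucene-debug.log"));
        return writer;
    }
}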