Re: Building maven artifacts
Hi,

I don't know. I tried to set up something like this:

But the error is the same. Maybe there are other parameters?

2010/7/16 Zhang, Lisheng

> Hi,
>
> I have never done this kind of build before, but just from the error message
> I guess it could mean that two variables:
>
> ${project.artifactId}
> ${project.version}
>
> are not defined (otherwise the exact jar file name would be printed out)?
>
> Could it be some environment setup issue?
>
> Best regards, Lisheng
>
> -Original Message-
> From: Pavel Minchenkov [mailto:char...@gmail.com]
> Sent: Friday, July 16, 2010 8:35 AM
> To: java-user@lucene.apache.org; solr-u...@lucene.apache.org
> Subject: Building maven artifacts
API to retrieve search results without scoring or sorting
Hi,

Is there an API I can use to retrieve search results that are neither scored nor sorted (for performance reasons)? I just need the results and don't need any extra computation on them. Any suggestion will be very helpful.

--
Thanks
Naveen Kumar
Re: API to retrieve search results without scoring or sorting
On Mon, Jul 19, 2010 at 6:14 AM, Naveen Kumar wrote:
> Is there any API using which I can retrieve search results, such that they
> are neither scored nor sorted (for performance reasons). I just need the
> results, don't need any extra computation on that.

Use your own custom Collector class.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
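A minimal sketch of what such a non-scoring collector does, written without a Lucene dependency so that it runs on its own. The class name and the simplified method signatures are assumptions; they mirror the callbacks of Lucene 2.9's org.apache.lucene.search.Collector, where setNextReader receives an IndexReader plus a per-segment docBase and collect receives a segment-relative document id. In real code the class would extend Collector, also implement setScorer (and simply ignore the scorer), and be passed to IndexSearcher.search(Query, Collector).

```java
import java.util.BitSet;

/** Dependency-free stand-in; method names mirror Lucene 2.9 Collector callbacks (assumption). */
class DocIdCollectorSketch {
    private final BitSet hits = new BitSet();
    private int docBase = 0;

    /** Called once per index segment; docBase is the offset of that segment's first document. */
    void setNextReader(int docBase) {
        this.docBase = docBase;
    }

    /** Called once per matching document; the score is never asked for. */
    void collect(int doc) {
        hits.set(docBase + doc);
    }

    /** A bit set does not care about order, so out-of-order delivery is acceptable. */
    boolean acceptsDocsOutOfOrder() {
        return true;
    }

    /** All collected document ids, rebased to index-wide ids. */
    BitSet hits() {
        return hits;
    }
}
```

Because the collector only records ids, no sorting or priority queue is involved; whether scoring work is skipped entirely in a real index depends on the query and on never calling the scorer.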
Get lengthNorm of a field
Hi,

Is it possible to retrieve the lengthNorm for all fields (or for a specific field) in a specific document?

Regards,
Philippe
Re: Get lengthNorm of a field
On Mon, Jul 19, 2010 at 9:53 AM, Philippe wrote:
> is there a possibility to retrieve the lengthNorm for all (or a specific)
> fields in a specific document?

See IndexReader:

    public abstract byte[] norms(String field) throws IOException;

And Similarity:

    public float decodeNormValue(byte b) {

The byte[] is indexed by document id, and you can decode each byte into a float value with a Similarity.

-Yonik
http://www.lucidimagination.com
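To illustrate the "byte indexed by document id, decoded into a float" step, here is a stand-alone sketch of the one-byte float encoding. It is assumed to match the default encoding behind Similarity's norm decoding (the scheme Lucene's documentation describes as a 3-bit mantissa, 5-bit exponent, zero-exponent point 15); the class and method names are hypothetical, and with a real index the byte[] would come from IndexReader.norms(field) rather than from a literal array.

```java
/** Stand-alone sketch of the one-byte norm encoding (assumed default scheme; names are hypothetical). */
class NormByteSketch {

    /** Decodes one norm byte into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** Encodes a float into one byte; precision is lost on purpose. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Zero and negatives map to 0; tiny positive values map to the smallest non-zero byte.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: largest representable value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    /** norms[docId] holds the encoded norm of one field for that document. */
    static float normOf(byte[] norms, int docId) {
        return decode(norms[docId]);
    }
}
```

The encoding is lossy: under this scheme 1.0 and 0.5 survive a round trip, while 0.3 comes back as 0.25, which is why norms retrieved this way only approximate the value computed at index time.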
Re: Get lengthNorm of a field
Hi Yonik,

On 19.07.2010 16:21, Yonik Seeley wrote:
> See IndexReader:
>     public abstract byte[] norms(String field) throws IOException;
> And Similarity:
>     public float decodeNormValue(byte b) {
> The byte[] is indexed by document id, and you can decode that into a float
> value with a Similarity.

Thanks for the quick reply. I was searching for methods in IndexSearcher, which is why I did not find the norms method.

Cheers,
Philippe
Re: Scoring exact matches higher in a stemmed field
If your analyzer outputs b and b$ in the same position, then the query below is already what the QP outputs today.

If you want to incorporate boosting, I suggest that you extend QP and override newTermQuery, for example; if the term is a stemmed term, set the query's boost (Query.setBoost) accordingly. Would that work for you?

You'll need to decide whether you want to boost terms inside phrases, or entire phrases, and then override more methods from QP. But that approach will get you the native product of the engine, I think.

Alternatively, you can set a payload on the stemmed terms and incorporate that into Similarity, but that's more costly.

I don't follow what has been deprecated on Similarity that you cannot use anymore. All I see are 3 deprecated static methods which are related to norms ...

Shai

On Sat, Jul 17, 2010 at 9:04 PM, Itamar Syn-Hershko wrote:

> Shai, you got it right. I want to be able to send "b bb" through the QP
> with my custom analyzer, and get back "(b b$) (b bb$)" -- 2 terms with 2
> tokens in the same position for each.
>
> I want this to be a native product of the engine, as opposed to forcing
> this from the query end. I'm using different types of queries (Bool,
> DisMax), and I'm actually interested in using the QP itself. Instead of
> going through all sub-queries post-parsing and boosting terms ending with $,
> I want some sort of a plugin mechanism to do this for me per result. The
> easiest path would be subclassing Similarity, if only the relevant functions
> hadn't been deprecated...
>
> Are there any other ways to do so? For example, is this doable with
> function queries (since access to the actual term is required)?
>
> Itamar.
>
> On 16/7/2010 8:01 PM, Shai Erera wrote:
>
>> Depends on which query, no? ;)
>>
>> Sounds like you want to simulate the QP behavior
>> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html for
>> boosting. Meaning, if for the query "b" you want to simulate the query
>> "b OR b$^2" and have matches of b$ count more than b, then I'd follow
>> how QP does it - create the query programmatically or something (I'm
>> not near the code at the moment so I cannot give a more concrete
>> approach).
>>
>> If you want b and b$ to count the same, then that's already the
>> behavior - i.e., docs containing both will score higher.
>>
>> If I misunderstood your question, then please correct me.
>>
>> Shai
>>
>> On Friday, July 16, 2010, Itamar Syn-Hershko wrote:
>>
>>> Hi all,
>>>
>>> Consider the following string: "the buffalo buffaloes" [1].
>>>
>>> When passed through a stemming analyzer, the resulting tokens would be
>>> "buffalo buffalo" (assuming a good stemmer).
>>>
>>> To enable exact searches, say I mark the original term and index it at
>>> the same term position. So "the buffalo buffaloes" -> (buffalo buffalo$)
>>> (buffalo buffaloes$) - now exact searches are allowed on the same field
>>> without having 2 different fields [2].
>>>
>>> However, with this approach default scoring isn't working well. What is
>>> my best option for upgrading a match to an exact match of this sort, also
>>> when using the same stemming analyzer, without using payloads on the marked
>>> token?
>>>
>>> In other words - how do I make documents containing "the buffalo
>>> buffaloes" considered more relevant than docs containing the word "buffalo"
>>> only once?
>>>
>>> The trick here is to boost the marked token if found at search time.
>>> While this sounds easy to do, I can't find the best approach to implementing
>>> this - esp. since Similarity's float Idf(Index.Term term, Searcher searcher)
>>> seems to have been deprecated for some reason.
>>>
>>> Itamar.
>>>
>>> [1]
>>> http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo :)
>>>
>>> [2] Rationale:
>>> http://www.code972.com/blog/2010/07/more-flexible-hebrew-indexing-hebmorph/
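The suggestion above (extend the QP, override newTermQuery and call Query.setBoost) comes down to a per-term decision. A minimal, dependency-free sketch of that decision follows, boosting the $-marked exact form as in the quoted "b OR b$^2" example. The class name, the EXACT_MARKER constant and the boost factor of 2 are illustrative assumptions; in real code the value returned by boostFor would be handed to setBoost on the term query that the parser creates.

```java
/** Per-term boost decision for "$"-marked exact tokens (all names here are hypothetical). */
class ExactMatchBoostSketch {
    static final String EXACT_MARKER = "$";
    static final float EXACT_BOOST = 2.0f;

    /** Marked (exact) terms get the boost; plain stemmed terms keep the neutral boost 1. */
    static float boostFor(String termText) {
        return termText.endsWith(EXACT_MARKER) ? EXACT_BOOST : 1.0f;
    }

    /** Renders the same decision in query-parser syntax, using ^ for the boost. */
    static String asQueryString(String... termTexts) {
        StringBuilder sb = new StringBuilder();
        for (String termText : termTexts) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(termText);
            float boost = boostFor(termText);
            if (boost != 1.0f) {
                sb.append('^').append(boost);
            }
        }
        return sb.toString();
    }
}
```

For the terms b and b$ this produces the string "b b$^2.0", which has the same shape as the "b OR b$^2" query mentioned in the quoted message.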
RE: Building maven artifacts
Hi Pavel,

I have not done this build myself; my last message was based on my experience using ant on other projects. Maybe people who have worked on the maven artifacts could help?

Best regards, Lisheng

-Original Message-
From: Pavel Minchenkov [mailto:char...@gmail.com]
Sent: Monday, July 19, 2010 3:03 AM
To: java-user@lucene.apache.org
Subject: Re: Building maven artifacts

Hi, I don't know. I tried to set up something like this: But the error is the same. Maybe there are other parameters?
Re: Scoring exact matches higher in a stemmed field
On 19/7/2010 5:50 PM, Shai Erera wrote:
> If your analyzer outputs b and b$ in the same position, then the below
> query will already be what the QP output today
>
> If you want to incorporate boosting, I can suggest that you extend QP,
> override newTermQuery for example, and if the term is a stemmed term, then
> set the query's boost (Query.setBoost) accordingly. Would that work for you?

I want to avoid overriding the QP, and do this as a pluggable extension. What other options do I have besides the ones you've suggested?

Ideally, that would be a class or a function I can override or extend, so that each term hit while searching is examined. By checking its type and text (for the suffix), that interface could double its weight (or boost). The Similarity functions I mentioned could have provided this ability (see below). How can this be done without them?

> You'll need to check whether you want to boost terms inside phrases, or
> entire phrases, and then override more methods from QP. But that approach
> will get you the native product of the engine, I think.

Just to make sure we are on the same page, here's an example (assuming the default tf/idf implementation in Lucene). I want to make sure anyone searching for "song of songs" will find texts discussing the biblical book, and have them ranked the highest, instead of having short texts containing the single word "song" score higher.

So what I do is have my stemming analyzer save the string "song of songs" like this, where each parenthesis represents a token position: (song song$) (song songs$).

The part I'm missing is how to score terms with suffixes higher. The best approach seems to be looking at the term read by IndexReader and boosting this finding somehow. The assumption is that if IndexReader has read the term songs$, it has been looked for, and therefore this is the exact word that was queried. Which is the best part of Lucene to hijack for this mission?

> Alternatively, you can set a payload on the stemmed terms and incorporate
> that into Similarity, but that's more costly.

I had mentioned payloads. They would get me exactly what I want, but as you say they are quite costly when used for almost every term in the index. If I could replace the suffix with payloads I would have done so (byte vs. byte), but I'm using the suffix for one other thing.

> I don't follow that's been deprecated on Sim that you cannot use anymore?
> All I see are 3 deprecated static methods which are related to norms ...

In 2.3.2 there were these functions:

    public float idf(Term term, Searcher searcher)
    public float idf(Collection terms, Searcher searcher)

These were deprecated somewhere between that version and 2.9.2, and it seems I could have used them for what I'm trying to do.

Thanks,
Itamar.
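The indexing scheme described above, (song song$) (song songs$), can be sketched without Lucene as a list of tokens carrying position increments: the stem advances the position by one and the $-marked original is stacked on the same position with an increment of zero. Everything in this sketch is an assumption made for illustration: the class and field names, the stop-word list, and especially the toy stem method, which stands in for a real stemmer. Stop words are simply dropped here without leaving a position gap, matching the notation in the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of an analyzer that stacks the exact form on top of the stem (all names hypothetical). */
class MarkedTokenSketch {

    /** One emitted token: its text and its position increment relative to the previous token. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList("the", "of"));

    /** Toy stemmer: strips a plural ending. A real analyzer would use a proper stemmer. */
    static String stem(String word) {
        if (word.endsWith("oes")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<Token>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            out.add(new Token(stem(word), 1)); // the stem advances to a new position
            out.add(new Token(word + "$", 0)); // the exact form shares that position
        }
        return out;
    }
}
```

With this layout a query for the stem matches every inflection, while a query for songs$ matches only documents that contained that exact word, so a boost applied to $-terms favours exact matches.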
How to modify a document Field before the document is indexed?
Hey All,

I am using Apache Lucene (2.9.1); it's fast and it works great! I have a question in connection with Apache PDFBox.

The following command creates a Lucene Document from a PDF file:

    Document document = org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);

The Lucene Document, document, has a bunch of fields. Among those fields is a field named "content". I need to add some more data to that field. For example, I would like to add a description and some keywords. How do I go about doing that? Any pointers would be greatly welcome! :)

Thanks for your time!

Regards,
Joe
Re: How to modify a document Field before the document is indexed?
(10/07/20 7:31), Joe Hansen wrote:
> The Lucene Document, document, has a bunch of fields. Among those
> fields, is a field named, "content". I need to add some more data to
> that field. For example, I would like to add some description and
> keywords. How do I go about doing that?

Joe,

You can add your data to the document object:

    Document document = org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
    // Adding a second field with the same name appends another value to that field.
    document.add(new Field("content", "your data", Field.Store.YES, Field.Index.ANALYZED));

http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/document/Document.html#add%28org.apache.lucene.document.Fieldable%29

Koji

--
http://www.rondhuit.com/en/
Re: How to modify a document Field before the document is indexed?
Thanks for your reply, Koji! Your suggestion worked fine. I thought that adding a field named "contents" to a document that already contains a field named "contents" would NOT do anything. But it looks like I was wrong!

Thank you for your kind help! :)

Regards,
Joe
Re: How to modify a document Field before the document is indexed?
One subtlety you might be able to use to advantage...

This is where getPositionIncrementGap in your analyzer can be used to separate the two bits of data in the same field. Suppose you have your own analyzer (which could be a trivial override of an existing one) that returns, say, 10,000 from getPositionIncrementGap.

Now, if you wanted to ensure that proximity queries only matched within a particular add to your "content" field, you could specify that all the terms had to occur within fewer than 10,000 positions of each other...

FWIW
Erick
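The effect of the position increment gap can be sketched with plain arithmetic. This is a simplified model rather than Lucene's actual indexing code: it assumes whitespace tokenization, non-empty values, one position per token, and that the gap is added once at each boundary between values added under the same field name. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of token positions in a multi-valued field (names are hypothetical). */
class PositionGapSketch {

    /** Returns, for each added value, the positions assigned to its tokens. */
    static List<List<Integer>> positions(int gap, String... values) {
        List<List<Integer>> all = new ArrayList<List<Integer>>();
        int next = 0; // position the next token will receive
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                next += gap; // the gap is inserted between consecutive values
            }
            List<Integer> current = new ArrayList<Integer>();
            int tokenCount = values[v].trim().split("\\s+").length;
            for (int i = 0; i < tokenCount; i++) {
                current.add(next++);
            }
            all.add(current);
        }
        return all;
    }

    /** Distance from the last token of an earlier value to the first token of a later one. */
    static int distanceBetween(List<Integer> earlier, List<Integer> later) {
        return later.get(0) - earlier.get(earlier.size() - 1);
    }
}
```

In this model, with a gap of 10,000 the last token of the first value and the first token of the second value are more than 10,000 positions apart, so a phrase or proximity query with a smaller slop cannot match across the two adds; with a gap of 0 the two values are adjacent and such a query could span them.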