SpanNearQuery distance issue
Hello All, I have an issue with the distance measure of SpanNearQuery in Lucene. Say I have the following two documents:

DocID: 6, content: "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1001 1002 1003 1004 1005 1006 1007 1008 1009 1100"
DocID: 7, content: "a b c d e a b c f g h i j k l m l k j z z z"

a) If my span query is "3n(a,e)", it matches doc 7.
b) But "3n(1,5)" does not match doc 6.
c) And "4n(1,5)" does match doc 6.

I have no clue why a) works but not b). I tried to debug the code but couldn't figure it out. Any help?
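A possible explanation, assuming the surround parser's "Nn" operator translates to an unordered SpanNearQuery with slop N-1 (an assumption about the parser internals, not verified here): "1" and "5" sit at positions 0 and 4 of doc 6 with three terms between them, so slop 2 ("3n") misses and slop 3 ("4n") matches. A minimal sketch of the raw equivalents:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    SpanQuery[] oneAndFive = {
        new SpanTermQuery(new Term("content", "1")),
        new SpanTermQuery(new Term("content", "5"))
    };
    // "3n(1,5)" under the assumed mapping: slop 2, unordered -> no match,
    // since three positions ("2 3 4") lie between "1" and "5"
    SpanNearQuery threeN = new SpanNearQuery(oneAndFive, 2, false);
    // "4n(1,5)": slop 3, unordered -> matches doc 6
    SpanNearQuery fourN = new SpanNearQuery(oneAndFive, 3, false);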
Re: SpanNearQuery distance issue
Shoot me. Thanks, I did not notice that the doc has ".. e a .." in the content. Thanks again for the reply :)
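That is the key: "n" is an unordered operator, so the adjacent ".. e a .." at positions 4-5 of doc 7 satisfies "3n(a,e)" even though the query names a before e. A sketch under the same assumed slop mapping as above:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // inOrder = false lets the reversed, adjacent pair "e a" match
    SpanNearQuery q = new SpanNearQuery(
        new SpanQuery[] {
            new SpanTermQuery(new Term("content", "a")),
            new SpanTermQuery(new Term("content", "e"))
        },
        2,       // "3n" -> slop 2 under the assumed mapping
        false);  // unordered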
StandardTokenizer generation from JFlex grammar
Hello, I'm trying to regenerate the standard tokenizer from its JFlex specification (StandardTokenizerImpl.jflex), but I'm running into errors. (I would like to create my own JFlex file based on the standard tokenizer, which is why I'm first trying to regenerate it to get the hang of things.) I'm using JFlex 1.4.3 and I ran into the following error:

Error in file "" (line 64): Syntax error.
HangulEx = (!(!\p{Script:Hangul}|!\p{WB:ALetter})) ({Format} | {Extend})*

I also tried installing an Eclipse plugin from http://cup-lex-eclipse.sourceforge.net/, which I thought would provide options similar to JavaCC (http://eclipse-javacc.sourceforge.net/) for generating the classes within Eclipse, but had no luck. Any help would be much appreciated. Regards, Phani.
RE: StandardTokenizer generation from JFlex grammar
Thanks Steve for the pointers. I'll look into it.
To get Term Offsets of a term per document
Hello, Is there a way to get the term offsets of a given term per document without enabling term vectors? Is it correct that the Lucene index stores positions but not offsets by default? Thanks, Phani.
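As far as I know that is correct: the default postings store positions for tokenized fields but no offsets. Since Lucene 4.0 you can opt offsets into the postings at index time and then read them back per term per document without term vectors. A sketch (Lucene 4.x API; the field "content" and term "apple" are made-up examples, and reader is an already-open IndexReader):

    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.DocsAndPositionsEnum;
    import org.apache.lucene.index.FieldInfo.IndexOptions;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    // Index time: ask for offsets in the postings (the default stores
    // positions only)
    FieldType ft = new FieldType();
    ft.setIndexed(true);
    ft.setTokenized(true);
    ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

    // Search time: walk the postings of one term and read its offsets
    DocsAndPositionsEnum dpe = MultiFields.getTermPositionsEnum(
        reader, MultiFields.getLiveDocs(reader), "content", new BytesRef("apple"));
    if (dpe != null) {
      while (dpe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < dpe.freq(); i++) {
          dpe.nextPosition();
          System.out.println("doc " + dpe.docID() + " offsets "
              + dpe.startOffset() + "-" + dpe.endOffset());
        }
      }
    }

Note that startOffset()/endOffset() return -1 when offsets were not indexed.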
WeightedSpanTermsExtractor
Hi, I have multiple fields (name and name2 - content copied below). When I extract the weighted span terms based on a query (the query targets a specific field), why am I not getting the positions properly out of the WeightedSpanTerm across multiple fields? Is it because the query is specific to one field and not the others?

query: "Running Apple" (phrase query)

document content:
name: Running Apple 60 GB iPod with Video Playback Black - Apple
name2: Sample Running Apple 60 GB iPod with Video Playback Black - Apple

I'm getting the positions as 0,1 and 3,4, which I don't understand; it should be 0,1 and 1,2 for the fields name and name2 respectively. Am I doing something wrong in expecting that behavior? Help or pointers would be appreciated. Thanks.
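One thing worth trying (a sketch, not a confirmed diagnosis): WeightedSpanTermExtractor has an overload that takes a field name, so extracting each field separately with its own token stream should keep positions relative to that field. nameTokenStream and name2TokenStream below are hypothetical per-field streams:

    import java.util.Map;
    import org.apache.lucene.search.highlight.WeightedSpanTerm;
    import org.apache.lucene.search.highlight.WeightedSpanTermExtractor;

    WeightedSpanTermExtractor extractor = new WeightedSpanTermExtractor();
    // The fieldName argument restricts extraction to query terms on that field
    Map<String, WeightedSpanTerm> nameTerms =
        extractor.getWeightedSpanTerms(query, nameTokenStream, "name");
    Map<String, WeightedSpanTerm> name2Terms =
        extractor.getWeightedSpanTerms(query, name2TokenStream, "name2");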
Token Stream with Offsets (Token Sources class)
Hi, I have the following snippet where I'm trying to extract weighted span terms from a query (I do have term vectors enabled on the fields):

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.highlight.*;
    import org.apache.lucene.store.FSDirectory;

    File path = new File("");  // index path left blank in the original post
    FSDirectory directory = FSDirectory.open(path);
    IndexReader indexReader = DirectoryReader.open(directory);

    Map<String, WeightedSpanTerm> allWeightedSpanTerms = new HashMap<String, WeightedSpanTerm>();
    WeightedSpanTermExtractor extractor = new WeightedSpanTermExtractor();

    // q is the phrase query "Running Apple" on the "name" field (built elsewhere);
    // rebuild a token stream with offsets for doc 0's "name" field from its term vector
    TokenStream tokenStream = TokenSources.getTokenStreamWithOffsets(indexReader, 0, "name");
    allWeightedSpanTerms.putAll(extractor.getWeightedSpanTerms(q, tokenStream));

In the end, the map allWeightedSpanTerms contains no weighted span terms at all. When I debugged the code, I found that while building the TermContext the statement fields.terms(field); returns null, which I don't understand.

My query is "Running Apple" (a phrase query), and my doc content is:

name: Running Apple 60 GB iPod with Video Playback Black - Apple

Please let me know what I'm doing wrong. Thanks.
Re: Token Stream with Offsets (Token Sources class)
I apologize; I did not know where exactly I needed to post this - I'll remove the others. As for indexing, I'm using the Solr example docs script to post the documents, and then using the code above to get the token stream from that index. I have the following doc, ipod_video_1.xml:

    <add>
      <doc>
        <field name="id">MA147LL/A</field>
        <field name="name">Running Apple 60 GB iPod with Video Playback Black - Apple</field>
        <field name="name2">Sample Running Apple 60 GB iPod with Video Playback Black - Apple</field>
      </doc>
    </add>

And I indexed it using the post.sh script in example docs via "sh post.sh ipod_video_1.xml". Thanks. -- Phani
Re: Token Stream with Offsets (Token Sources class)
Well, I found the issue: maxDocCharsToAnalyze is 0 by default in WeightedSpanTermExtractor. It works fine if I change it there, or if I use QueryScorer, which has a default limit of 51200. Thanks. -- Phani
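For anyone who hits the same wall, a minimal sketch of the QueryScorer route (doc id 0, field "name", and query q are from the earlier posts; this assumes the Lucene 4.x highlighter API):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.TokenSources;
    import org.apache.lucene.search.highlight.WeightedSpanTerm;

    // QueryScorer wraps WeightedSpanTermExtractor and applies a non-zero
    // maxDocCharsToAnalyze default, so extraction is not silently truncated to 0
    QueryScorer scorer = new QueryScorer(q, "name");
    TokenStream ts = TokenSources.getTokenStreamWithOffsets(indexReader, 0, "name");
    ts = scorer.init(ts);  // extracts the weighted span terms internally
    WeightedSpanTerm running = scorer.getWeightedSpanTerm("running");  // lowercased term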