Re: java.io.IOException when trying to list terms in index (IndexReader)
hi, as per the error messages you listed below, please move the 'reader.close()' call to the bottom of the method. I think that if you invoke it first, the underlying stream is closed, so the exception is encountered.

ohaya wrote: Hi, I changed the beginning of the try to:

try {
  System.out.println("About to call .next()...");
  boolean foo = termsEnumerator.next();
  System.out.println("Finished calling first .next()");
  System.out.println("About to drop into while()...");
  . . .

and here's what I got when I ran the app:

Index in directory: [C:\lucene-devel\lucene-devel\index] was opened successfully!
About to call .next()...
** ERROR **: Exception while stepping through index: [java.io.IOException: The handle is invalid]
java.io.IOException: The handle is invalid
  at java.io.RandomAccessFile.seek(Native Method)
  at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:591)
  at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
  at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
  at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
  at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
  at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
  at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:64)
  at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:127)
  at ReadIndex.main(ReadIndex.java:29)

Jim

oh...@cox.net wrote: Hi, BTW, the next() method is an abstract method in the Javadocs. Does that mean that I'm supposed to have my own implementation? Jim

oh...@cox.net wrote: Phil, I posted in haste. Actually, from the output that I posted, doesn't it look like the .next() itself is throwing the exception? That is what has been puzzling me. It looks like it got through the open() and terms() with no problem, then blew up when calling next()?
Jim

oh...@cox.net wrote: Phil, Yes, that exception is not very helpful :)!! I'll try your suggestions and post back. Thanks, Jim

Phil Whelan phil...@gmail.com wrote: Hi Jim, I cannot see anything obvious, but both open() and terms() throw IOExceptions. You could try putting these in separate try..catch blocks to see which one it's coming from. Or using e.printStackTrace() in the catch block will give more info to help you debug what's happening.

On Sat, Aug 1, 2009 at 7:09 PM, oh...@cox.net wrote:

  reader = IndexReader.open(args[0]);
  Term term = new Term("path", "");
  termsEnumerator = reader.terms(term);

Cheers, Phil

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
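The failure mode discussed in this thread, closing a resource while something else still reads from it, can be reproduced with plain java.io, independent of Lucene. A minimal sketch (the class and method names here are mine, not from the thread):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class ClosedHandleDemo {

    // Returns true if reading after close() fails, mirroring how a
    // TermEnum blows up once its IndexReader has been closed.
    public static boolean readAfterCloseFails() throws IOException {
        File f = File.createTempFile("demo", ".txt");
        try (FileWriter w = new FileWriter(f)) {
            w.write("hello");
        }
        BufferedReader r = new BufferedReader(new FileReader(f));
        r.close();                    // wrong order: close before reading
        try {
            r.readLine();             // the underlying handle is gone
            return false;
        } catch (IOException e) {
            return true;              // "Stream closed" on the JDK side
        } finally {
            f.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read after close fails: " + readAfterCloseFails());
    }
}
```

The same ordering rule applies to the Lucene objects in the thread: iterate the TermEnum fully, close it, and only then close the IndexReader.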
Re: Group by in Lucene ?
Don't overlook Solr: http://lucene.apache.org/solr

Erik

On Aug 1, 2009, at 5:43 AM, mschipperheyn wrote: http://code.google.com/p/bobo-browse looks like it may be the ticket. Marc
Weird behaviour
Hi, I've indexed some 50 million documents. I've indexed the target URL of each document as a "url" field, using StandardAnalyzer with Index.ANALYZED. Suppose there is a wikipedia page with title "Rahul Dravid" and url http://en.wikipedia.org/wiki/Rahul_Dravid. But when I search for +title:"Rahul Dravid" +url:Wikipedia, I'm getting no results. I get the document(s) when I search for url:http://en.wikipedia.org/wiki/Rahul_Dravid or url:en.wikipedia.org/wiki/Rahul_Dravid. I get results even when I search for url:wiki/Rahul_Dravid. It'd be helpful if somebody could throw some light on this. -- Prashant.
Re: Weird behaviour
You write that you index the string under the "url" field. Do you also index it under "title"? If not, that can explain why title:Rahul Dravid does not work for you. Also, did you try to look at the index w/ Luke? It will show you what the terms in the index are. Another thing which is always good for debugging such things is to create a StandardAnalyzer, then request a tokenStream() from it, passing a StringReader w/ the text you want to parse, and just print the tokens returned. I've done that, using the version from trunk, w/ Version 2.4, and the tokens that are extracted are:

(http,0,4,type=ALPHANUM)
(en.wikipedia.org,7,23,type=HOST)
(wiki,24,28,type=ALPHANUM)
(rahul,29,34,type=ALPHANUM)
(dravid,35,41,type=ALPHANUM)

So:
1) You don't get results for title:Rahul Dravid since you index it under "url" and not "title".
2) url:wiki/Rahul_Dravid works, since it looks for a phrase that exists in the index (look at the last 3 tokens produced by the Analyzer, in the output above).
3) url:<the entire string> also works, since you index all of it under the "url" field.

Does this explain the behavior you see?

Shai

On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi prashullega...@gmail.com wrote: [...]
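The token list Shai prints can be approximated in plain Java. This is only a rough sketch of the shape of StandardAnalyzer's output for this URL (the real tokenizer is grammar-based; the method here is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UrlTokenSketch {

    // Rough approximation of StandardAnalyzer on a URL: the host part
    // stays one HOST-style token, the rest splits on non-alphanumerics,
    // everything lowercased.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        String rest = url.toLowerCase(Locale.ROOT);
        if (rest.startsWith("http://")) {
            tokens.add("http");                       // scheme becomes its own token
            rest = rest.substring("http://".length());
        }
        int slash = rest.indexOf('/');
        String host = slash < 0 ? rest : rest.substring(0, slash);
        tokens.add(host);                             // HOST kept whole, e.g. en.wikipedia.org
        if (slash >= 0) {
            for (String t : rest.substring(slash + 1).split("[^a-z0-9]+")) {
                if (!t.isEmpty()) tokens.add(t);      // wiki, rahul, dravid
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("http://en.wikipedia.org/wiki/Rahul_Dravid"));
    }
}
```

This mirrors why url:wikipedia finds nothing: "en.wikipedia.org" is indexed as a single term, so no term "wikipedia" exists on its own.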
Re: Weird behaviour
Firstly, I'm indexing the string in the "url" field only. I've never used Luke, so I don't know how to use it. What I'm trying to do is search for those documents which are from some particular site and have a given title.

On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera ser...@gmail.com wrote: [...]
Re: Weird discrepancy with term counts vs. terms (off by 1)
Hi, BTW, my indexer app is basically the same as the demo IndexFiles.java. Here's part of the main:

try {
  IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
  System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
  indexDocs(writer, docDir);
  System.out.println("Optimizing...");
  writer.optimize();
  writer.close();
  Date end = new Date();
  System.out.println(end.getTime() - start.getTime() + " total milliseconds");
} catch (IOException e) {
  System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
}

When I run the indexer, I can see it say it added the document that ends up being missing from the terms. Thanks, Jim

oh...@cox.net wrote: Hi, I've noticed a kind of strange problem with term counts and actual terms. Some background: I wrote an app that creates an index, including a "path" field. I am now working on an app (code was in the previous thread) that, as part of what it does, needs to get a list of all of the "path" fields for documents that were added. I first noticed the problem while working on this latter app. Basically, while I was adding 13 documents to the index, when I listed the "path" terms, there were only 12 of them. So then I reviewed the index using Luke, and what I saw was that there were indeed only 12 "path" terms (under "Term Count" on the left), but when I clicked "Show Top Terms" in Luke, there were 13 terms listed. At this point, I'm very puzzled about all of this :(... Can anyone explain the difference in Luke, and, more importantly, why I am only getting 12 (i.e., 1 less than the # of documents added) when I try to programmatically list the terms?

Thanks, Jim
score from spans
Hi, How can I get the score of a span that is the result of SpanQuery.getSpans()? The score can be the same for each document, but if it's unique per span, that's even better. I tried looking for a way to expose this functionality through the Spans class, but it looks too complicated. I'm not even sure that any score calculation is performed by default when using span queries. I've noticed that some calculations are made using payloads and BoostingTermQuery, but the score result is used internally and can't be accessed from the Spans results. I don't want to re-run the query again using a HitCollector, and since the reader is passed to getSpans, I think it should be possible to do what I want. Any help on the correct way to expose the span score will be appreciated. Thanks, Eran.
Re: Weird behaviour
How do you parse/convert the page to a Document object? Are you sure the title "Rahul Dravid" is extracted properly and put in the "title" field? You can read about Luke here: http://www.getopt.org/luke/. Can you do System.out.println(document.toString()) before you add it to the index, and paste the output here?

Shai

On Sun, Aug 2, 2009 at 4:47 PM, prashant ullegaddi prashullega...@gmail.com wrote: [...]
Re: Weird discrepancy with term counts vs. terms (off by 1)
Hi Jim,

On Sun, Aug 2, 2009 at 1:32 AM, oh...@cox.net wrote: I first noticed the problem that I'm seeing while working on this latter app. Basically, what I noticed was that while I was adding 13 documents to the index, when I listed the path terms, there were only 12 of them.

Field text (the whole path in your case) and terms (the tokens of the field text) are different. The StandardAnalyzer breaks up words like this:

Field text = /a/b/c.txt
Tokens = {a, b, c, txt}

So this 1 field of 1 document becomes 4 terms/tokens (not sure if there is a difference in terminology between terms and tokens, sorry). Therefore, you're going to have more terms than documents initially, but as the overlap in term usage increases, this changes. For instance, these 3 paths /a/b/c/d.txt, /b/c/d/a.txt, /c/d/a/b.txt are still only a total of 4 terms, since they share the same terms. In fact, StandardAnalyzer goes a bit further than that and removes stop-words, such as "a" (or "an", "the"), as it's designed for general text searching. That said, I think you have a point with the next part of your question...

So then, I reviewed the index using Luke, and what I saw with that was that there were indeed only 12 path terms (under Term Count on the left), but, when I clicked the Show Top Terms in Luke, there were 13 terms listed by Luke.

Yes, I just checked this and it seems to be a bug with Luke. It always shows 1 less in Term Count than it should. Well spotted.

Cheers, Phil
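Phil's arithmetic (3 paths, 4 distinct terms once the stop-word "a" is dropped) can be checked with a small sketch. This is plain string splitting standing in for the analyzer, and the single-entry stop list is an assumption for the example:

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class PathTermCount {

    // Split each path on non-alphanumerics, lowercase, drop stop-words,
    // and collect the distinct terms -- roughly what StandardAnalyzer
    // would leave in the index for these field values.
    static Set<String> distinctTerms(String[] paths) {
        Set<String> stop = Set.of("a");   // stand-in for the English stop-word list
        Set<String> terms = new TreeSet<>();
        for (String p : paths) {
            for (String t : p.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!t.isEmpty() && !stop.contains(t)) {
                    terms.add(t);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] paths = {"/a/b/c/d.txt", "/b/c/d/a.txt", "/c/d/a/b.txt"};
        System.out.println(distinctTerms(paths)); // [b, c, d, txt]
    }
}
```

Three documents, four shared terms: the term count and the document count are unrelated, which is the first half of Jim's puzzle.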
Re: Weird discrepancy with term counts vs. terms (off by 1)
Hi Jim,

On Sun, Aug 2, 2009 at 9:08 AM, Phil Whelan phil...@gmail.com wrote: Yes, I just checked this and it seems to be a bug with Luke. It always shows 1 less in Term Count than it should.

I was able to see why this was happening in the Luke source, and I've submitted the following patch to Andrzej, the author of Luke.

Thanks, Phil

--- luke.orig/src/org/getopt/luke/Luke.java	2009-03-19 22:41:34.0 -0700
+++ luke-src-0.9.2/src/org/getopt/luke/Luke.java	2009-08-02 09:33:24.0 -0700
@@ -813,23 +813,18 @@
     setString(iFields, "text", String.valueOf(idxFields.length));
     Object iTerms = find(pOver, "iTerms");
     termCounts.clear();
-    FieldTermCount ftc = new FieldTermCount();
+    FieldTermCount ftc = null;
     TermEnum te = ir.terms();
     numTerms = 0;
     while (te.next()) {
       Term currTerm = te.term();
-      if (ftc.fieldname == null) {
+      if (ftc == null || ftc.fieldname == null || ftc.fieldname != currTerm.field()) {
         // initialize
-        ftc.fieldname = currTerm.field();
-        termCounts.put(ftc.fieldname, ftc);
-      }
-      if (ftc.fieldname == currTerm.field()) {
-        ftc.termCount++;
-      } else {
         ftc = new FieldTermCount();
         ftc.fieldname = currTerm.field();
         termCounts.put(ftc.fieldname, ftc);
       }
+      ftc.termCount++;
       numTerms++;
     }
     te.close();
Re: Weird behaviour
Yes, I'm sure that the title "Rahul Dravid" is extracted properly, and there is a document relevant to this query as well. The following query and its results prove it:

Enter query:
Searching for: +title:"rahul dravid" +url:wiki
4 total matching documents
trec-id: clueweb09-enwp02-13-14368, URL: http://en.wikipedia.org/wiki/Rahul_Dravid
trec-id: clueweb09-enwp01-83-11378, URL: http://en.wikipedia.org/wiki/Rahul_S_Dravid
trec-id: clueweb09-en0011-08-22737, URL: http://www.reference.com/browse/wiki/Rahul_Dravid
trec-id: clueweb09-enwp01-69-13556, URL: http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid
Press (q)uit or enter number to jump to a page.

But see the following query:

Enter query: +title:"rahul dravid" +url:wikipedia
Searching for: +title:"rahul dravid" +url:wikipedia
0 total matching documents
Press (q)uit or enter number to jump to a page.

Isn't it weird?

-- Prashant.

On Sun, Aug 2, 2009 at 9:13 PM, Shai Erera ser...@gmail.com wrote: [...]
Re: Weird behaviour
Hi Prashant, I agree with Shai that using Luke, and printing out what the Document looks like before it goes into the index, are going to be your best bet for debugging this problem. The problem you're having is that StandardAnalyzer does not break up the hostname into separate terms, as it has a special case for hostnames and acronyms. This should work...

+title:"rahul dravid" +url:en.wikipedia.org

Thanks, Phil

On Sun, Aug 2, 2009 at 10:14 AM, prashant ullegaddi prashullega...@gmail.com wrote: [...]
Re: ThreadedIndexWriter vs. IndexWriter
Woops, sorry for the confusion! Mike

On Sat, Aug 1, 2009 at 1:03 PM, Phil Whelan phil...@gmail.com wrote: Hi Mike, It's Jibo, not me, having the problem. But thanks for the link. I was interested to look at the code. Will be buying the book soon. Phil

On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless luc...@mikemccandless.com wrote: (Please note that ThreadedIndexWriter is source code available with the upcoming revision to Lucene in Action.) Phil, is it possible you are using an older version of the book's source code? In particular, can you check whether your version of ThreadedIndexWriter.java has this:

public void close(boolean doWait) throws CorruptIndexException, IOException {
  finish();
  super.close(doWait);
}

(I vaguely remember that being missing from earlier releases, which could explain what you're seeing). If you are missing that, can you download the current code from http://www.manning.com/hatcher3 and try again? If that's not the problem... can you post the benchmark alg you are using in each case? Mike
Re: Weird behaviour
Hi Phil, The query you gave did work. Well, that proves StandardAnalyzer has a different way of tokenizing URLs. Thanks, Prashant.

On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan phil...@gmail.com wrote: [...]
Re: Weird discrepancy with term counts vs. terms (off by 1)
On Sun, Aug 2, 2009 at 10:58 AM, Andrzej Bialecki a...@getopt.org wrote: Thank you Phil for spotting this bug - this fix will be included in the next release of Luke.

Glad to help. Thanks for building this great tool! Phil
Re: Weird behaviour
You can always create your own Analyzer which creates a TokenStream just like StandardAnalyzer, but instead of using StandardFilter, write another TokenFilter which receives the HOST token type and breaks it further into its components (e.g., extract "en", "wikipedia" and "org"). You can also return the original HOST token as well as its components. I hope this helps.

Shai

On Sun, Aug 2, 2009 at 8:58 PM, prashant ullegaddi prashullega...@gmail.com wrote: [...]
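The core of the filter Shai describes, emitting the original HOST token followed by its dot-separated components, can be sketched in plain Java. In a real TokenFilter this logic would live in the token-production method; the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class HostTokenExpander {

    // Given a HOST-type token like "en.wikipedia.org", emit the original
    // token followed by its components, as suggested in the thread.
    static List<String> expandHost(String host) {
        List<String> out = new ArrayList<>();
        out.add(host);                           // keep the whole host so url:en.wikipedia.org still matches
        for (String part : host.split("\\.")) {
            if (!part.isEmpty()) {
                out.add(part);                   // en, wikipedia, org -> url:wikipedia now matches too
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expandHost("en.wikipedia.org"));
    }
}
```

Emitting both the whole host and its parts means existing queries keep working while the new single-word queries start matching, at the cost of a slightly larger index.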
Re: Weird behaviour
Thank you Phil and Shai. I will write a different Analyzer.

On Sun, Aug 2, 2009 at 11:50 PM, Shai Erera ser...@gmail.com wrote: [...]
Re: arabic analyzer
the fact is, plural (as an example) is not supported, and that is one of the most common things that a person doing some search will expect

Walid, I'm not sure this is true. Many plurals are supported (certainly not exceptional cases or broken plurals). This is no different from the other language analyzers in Lucene, even English stemmers: the most common forms are grouped together, and that's about all you can say :) Maybe in the future we can improve it; for your particular concern, we could add simple dictionary mappings for at least the most common broken plurals, something like that.

-- Robert Muir rcm...@gmail.com
Re: Weird discrepancy with term counts vs. terms (off by 1)
Hi Phil, For problem with my app, it wasn't what you suggested (about the tokens, etc.). For some later things, my indexer creates both a path field that is analyzed (and thus tokenized, etc.) and another field, fullpath, which is not analyzed (and thus, not tokenized). The problem with my app was that I was create a TermEnum: Term term = new Term(fullpath, ); termsEnumerator = reader.terms(term); and then going immediately into a while loop: while (termsEnumerator.next()) { . . } i.e., I was ignoring the 1st term in the TermEnum (since the .next() bumps the TermEnum to the 2nd term, initially). Anyway, so the code that I ended up with is: try { System.out.println(Outside while: About to get 1st termsEnumerator.term()...); currentTerm = termsEnumerator.term(); currentField = currentTerm.field(); termpathcount++; System.out.println(Outside while: 1st Field = [ + currentField + ] Term = [ + currentTerm.text() + ]); System.out.println(Outside while: About to drop into while()...); while (termsEnumerator.next()) { currentTerm = termsEnumerator.term(); currentField = currentTerm.field(); if (currentField.equalsIgnoreCase(fullpath)) { termpathcount++; System.out.println(Count= + termpathcount + Field = [ + currentField + ] Term = [ + currentTerm.text() + ]); } } // end while() termsEnumerator.close(); System.out.println(Matching terms count = + termpathcount); } catch (Exception e) { System.out.println(** ERROR **: Exception while stepping through index: [ + e + ]); e.printStackTrace(); } and, that seems to be working perfectly. Also, thanks for following up re. that Luke problem. That was one piece of this puzzle that was kind of driving me batty :)!! Jim Phil Whelan phil...@gmail.com wrote: Hi Jim, On Sun, Aug 2, 2009 at 1:32 AM, oh...@cox.net wrote: I first noticed the problem that I'm seeing while working on this latter app. Basically, what I noticed was that while I was adding 13 documents to the index, when I listed the path terms, there were only 12 of them. 
Field text (the whole path in your case) and terms (the tokens of the field text) are different. The StandardAnalyzer breaks up words like this...

    Field text = /a/b/c.txt
    Tokens = {a, b, c, txt}

So this 1 field of 1 document becomes 4 terms / tokens (not sure if there is a difference in this terminology between terms and tokens, sorry). Therefore, you're going to have more terms than documents initially, but as the overlap in term usage increases this changes. For instance, these 3 paths /a/b/c/d.txt, /b/c/d/a.txt, /c/d/a/b.txt still total only 4 terms, since they share the same terms. In fact, StandardAnalyzer goes a bit further than that and removes stop words, such as "a" (or "an", "the"), as it's designed for general text searching. That said, I think you have a point with the next part of your question... So then, I reviewed the index using Luke, and what I saw with that was that there were indeed only 12 path terms (under Term Count on the left), but when I clicked Show Top Terms in Luke, there were 13 terms listed by Luke. Yes, I just checked this and it seems to be a bug with Luke. It always shows 1 less in Term Count than it should. Well spotted. Cheers, Phil - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
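A rough sketch of the splitting Phil describes, in plain Java. This mimics how StandardAnalyzer would tokenize a path (split on non-alphanumerics, lowercase) and also drop the stop word "a" that Phil mentions; it is an illustration only, not the actual analyzer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PathTokens {
    // A tiny subset of the English stop-word list StandardAnalyzer uses.
    static final Set<String> STOP_WORDS =
            new LinkedHashSet<String>(Arrays.asList("a", "an", "the"));

    // Split on non-alphanumeric characters, lowercase, drop stop words --
    // roughly what StandardAnalyzer does to field text like "/a/b/c.txt".
    static List<String> tokenize(String fieldText) {
        List<String> tokens = new ArrayList<String>();
        for (String t : fieldText.toLowerCase().split("[^a-z0-9]+")) {
            if (t.length() > 0 && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "a" is removed as a stop word, leaving [b, c, txt]
        System.out.println(tokenize("/a/b/c.txt"));
    }
}
```

This also shows why the three paths above share terms: /a/b/c/d.txt, /b/c/d/a.txt, and /c/d/a/b.txt all tokenize to the same set.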
Re: java.io.IOException when trying to list terms in index (IndexReader)
Hi, I thought that, in the code that I posted, there was a close() in the finally? Or, are you saying that when an IndexReader is opened, it somehow persists in the system, even past my Java app terminating? FYI, I'm doing this testing on Windows, under Eclipse... Jim

se3g2011 se3g2...@gmail.com wrote: Hi, as per the error messages you listed below, please put the 'reader.close()' block at the bottom of the method. I think that if you invoke it first, the underlying stream is closed, so exceptions are encountered.

ohaya wrote: Hi, I changed the beginning of the try to:

    try {
        System.out.println("About to call .next()...");
        boolean foo = termsEnumerator.next();
        System.out.println("Finished calling first .next()");
        System.out.println("About to drop into while()...");
        . . .

and here's what I got when I ran the app:

    Index in directory :[C:\lucene-devel\lucene-devel\index] was opened successfully!
    About to call .next()...
    ** ERROR **: Exception while stepping through index: [java.io.IOException: The handle is invalid]
    java.io.IOException: The handle is invalid
        at java.io.RandomAccessFile.seek(Native Method)
        at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:591)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
        at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
        at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
        at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:64)
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:127)
        at ReadIndex.main(ReadIndex.java:29)

Jim

oh...@cox.net wrote: Hi, BTW, the next() method is an abstract method in the Javadocs. Does that mean that I'm supposed to have my own implementation? Jim

oh...@cox.net wrote: Phil, I posted in haste. 
Actually, from the output that I posted, doesn't it look like the .next() itself is throwing the exception? That is what has been puzzling me. It looks like it got through the open() and terms() with no problem, then it blew up when calling next()? Jim

oh...@cox.net wrote: Phil, Yes, that exception is not very helpful :)!! I'll try your suggestions and post back. Thanks, Jim

Phil Whelan phil...@gmail.com wrote: Hi Jim, I cannot see anything obvious, but both open() and terms() throw IOExceptions. You could try putting these in separate try..catch blocks to see which one it's coming from. Or using e.printStackTrace() in the catch block will give more info to help you debug what's happening. On Sat, Aug 1, 2009 at 7:09 PM, oh...@cox.net wrote:

    reader = IndexReader.open(args[0]);
    Term term = new Term("path", "");
    termsEnumerator = reader.terms(term);

Cheers, Phil -- View this message in context: http://www.nabble.com/java.io.IOException-when-trying-to-list-terms-in-index-%28IndexReader%29-tp24774351p24775753.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Weird discrepancy with term counts vs. terms (off by 1)
Hi Jim, On Sun, Aug 2, 2009 at 12:12 PM, oh...@cox.net wrote: i.e., I was ignoring the 1st term in the TermEnum (since the .next() bumps the TermEnum to the 2nd term, initially). Great! Glad you found the problem. I couldn't see it. Phil
Re: java.io.IOException when trying to list terms in index (IndexReader)
I've seen Eclipse get into weird states, but I don't think that's your problem. You open the IndexReader and set up a TermEnum on it. Then, no matter what, you close the underlying IndexReader in the finally block. Then later you use the TermEnum *even though the underlying reader has been closed*. You want something like:

    try {
        open reader
        set up TermEnum
        enumerate terms
        close TermEnum
    } catch () {
    } finally {
        close reader
    }

Getting an IO exception isn't at all strange in this situation, and exactly when you throw the exception is indeterminate. See below.

    public class ReadIndex {
        public static void main(String[] args) {
            IndexReader reader = null;
            TermEnum termsEnumerator = null;
            Term currentTerm = null;
            try {
                reader = IndexReader.open(args[0]);
                Term term = new Term("path", "");
                termsEnumerator = reader.terms(term);
            } catch (IOException e) {
                System.out.println("** ERROR **: Exception when opening IndexReader: [" + e + "]");
            } finally {
                // ** Why close the reader here? You need it later, I think. **
                try {
                    reader.close();
                } catch (IOException e) { /* suck it up */ }
            }
            System.out.println("Index in directory :[" + args[0] + "] was opened successfully!");
            try {
                System.out.println("About to drop into while()...");
                // ** This relies on the underlying reader that has been closed??? **
                while (termsEnumerator.next()) {
                    System.out.println("About to get termsEnumerator.term()...");
                    currentTerm = termsEnumerator.term();
                    System.out.println("Term = [" + currentTerm.text() + "]");
                }
                termsEnumerator.close();
            } catch (Exception e) {
                System.out.println("** ERROR **: Exception while stepping through index: [" + e + "]");
            }
        } // end main()
    } // end CLASS ReadIndex

On Sun, Aug 2, 2009 at 3:15 PM, oh...@cox.net wrote: Hi, I thought that, in the code that I posted, there was a close() in the finally? Or, are you saying that when an IndexReader is opened, that that somehow persists in the system, even past my Java app terminating? FYI, I'm doing this testing on Windows, under Eclipse... 
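For reference, here is a minimal corrected sketch of the program Erick describes, using the Lucene 2.x-era API of the thread (IndexReader.open(String), reader.terms(Term)); it is untested here and the class name is made up. The reader is closed only after the enumeration is done, and the first term is examined before the first call to next() (the off-by-one from the other thread):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class ReadIndexFixed {
    public static void main(String[] args) {
        IndexReader reader = null;
        try {
            reader = IndexReader.open(args[0]);
            // Position the enumeration at the first term of the "path" field.
            TermEnum termsEnumerator = reader.terms(new Term("path", ""));
            try {
                // terms(Term) already points at the first matching term,
                // so look at term() before the first call to next().
                Term currentTerm = termsEnumerator.term();
                while (currentTerm != null) {
                    System.out.println("Term = [" + currentTerm.text() + "]");
                    currentTerm = termsEnumerator.next() ? termsEnumerator.term() : null;
                }
            } finally {
                termsEnumerator.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the reader last, after the TermEnum is done with it.
            if (reader != null) {
                try { reader.close(); } catch (IOException e) { /* ignore */ }
            }
        }
    }
}
```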
question about indexing/searching using standardanalyzer for KEYWORD field that contains alphanumeric data
Hello, I have a question about the KEYWORD field type and searching/updating. I am getting strange behavior that I can't quite comprehend. My index is created using StandardAnalyzer, which is used for both writing and searching. It has three fields:

    userpin     - alphanumeric field which is stored as TEXT
    documentkey - alphanumeric field which is stored as TEXT
    contents    - text of document which is stored as TEXT

When I try to update a document, I create a Term to find the document by documentKey and I use org.apache.lucene.index.IndexWriter.updateDocument(term, pDocument); to do the update. Lucene fails to find the document by the term and I am getting duplicate documents in the index. When I changed the index to define documentKey as KEYWORD, the updates started to work fine. However, searching for documentKey using StandardAnalyzer stopped working. It appears that Lucene is using a KeywordAnalyzer when searching for the term during the update, even though the indexer is open with StandardAnalyzer. The sample values that are stored in documentKeys are: LFAHBHMF, LFAHBHAS. I noticed that if documentKey is a numeric value, both KeywordAnalyzer and StandardAnalyzer can find the documents by it without any problem, so the reader can find and the indexer can update without any problems. With alphanumeric values I can't get both to work. Any help is appreciated. Thanks Leonard
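The likely explanation is that the Term passed to updateDocument() is never analyzed at all -- it must match an indexed token exactly. With an analyzed TEXT field, StandardAnalyzer lowercases "LFAHBHMF" to "lfahbhmf" at index time, so Term("documentkey", "LFAHBHMF") misses; purely numeric values are left unchanged by the analyzer, which is why those work either way. A common fix is to index the key as a not-analyzed field and query it with a TermQuery, using PerFieldAnalyzerWrapper so the rest of the index still uses StandardAnalyzer. The sketch below assumes a Lucene 2.4-era API and is untested here; the index path and values are hypothetical:

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class KeywordFieldSketch {
    static void sketch() throws Exception {
        // Use StandardAnalyzer everywhere except the documentkey field.
        PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("documentkey", new KeywordAnalyzer());

        IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);

        Document doc = new Document();
        // KEYWORD-style field: stored and indexed, but not run through
        // the analyzer, so the token is exactly "LFAHBHMF".
        doc.add(new Field("documentkey", "LFAHBHMF",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // ... add userpin and contents as analyzed TEXT fields ...

        // The update Term is never analyzed; it now matches the
        // indexed token exactly.
        writer.updateDocument(new Term("documentkey", "LFAHBHMF"), doc);
        writer.close();

        // When searching, bypass the query parser for this field and
        // use a TermQuery directly:
        TermQuery byKey = new TermQuery(new Term("documentkey", "LFAHBHMF"));
    }
}
```

If the key field must also be searchable through QueryParser, parsing with the same PerFieldAnalyzerWrapper keeps index-time and query-time treatment of the field consistent.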
Re: Boosting Search Results
Thanks for all the replies. They helped me to understand the problem better, but is it possible to create a query that will give an additional boost to a result if and only if both of the words are found in it? This would make sure that those results end up higher in the list. Can this type of query be created? -- View this message in context: http://www.nabble.com/Boosting-Search-Results-tp24753954p24784708.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
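One common way to get this effect is a BooleanQuery where either word may match (SHOULD), plus a nested boosted clause that only fires when both words are present. A sketch under assumed names -- the field "contents", the words "foo"/"bar", and the boost value are all hypothetical, and this targets the Lucene 2.x API of the thread (untested here):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BothWordsBoost {
    static Query build() {
        // Sub-query that matches only when BOTH words occur in the document.
        BooleanQuery both = new BooleanQuery();
        both.add(new TermQuery(new Term("contents", "foo")), BooleanClause.Occur.MUST);
        both.add(new TermQuery(new Term("contents", "bar")), BooleanClause.Occur.MUST);
        both.setBoost(5.0f); // hypothetical boost value

        // Overall query: either word matches, but documents containing
        // both also match the boosted clause and score higher.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("contents", "foo")), BooleanClause.Occur.SHOULD);
        query.add(new TermQuery(new Term("contents", "bar")), BooleanClause.Occur.SHOULD);
        query.add(both, BooleanClause.Occur.SHOULD);
        return query;
    }
}
```

Note that Lucene's coord factor already rewards documents matching more SHOULD clauses; the boosted both-words clause just makes that preference much stronger.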
Re: Lucene for dynamic data retrieval
Hi Satish, Lucene doesn't enforce an index schema, so each document can have a different set of fields. It sounds like you need to write a custom indexer that follows your custom rules and creates Lucene Documents with different Fields, depending on what you want indexed. You also mention searching and retrieval of data from DB. This, too, sounds like a custom search application - there is nothing in Lucene that uses a (R)DBMS to retrieve field values. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Findsatish findsat...@gmail.com To: java-user@lucene.apache.org Sent: Friday, July 31, 2009 7:13:47 AM Subject: Lucene for dynamic data retrieval Hi All, I am new to Lucene and I am working on a search application. My application needs dynamic data retrieval from the database. That means, based on my previous step output, I need to retrieve entries from the DB for the next step. For example, if my search query contains Name field entry, I need to retrieve the Designations from the DB that are matched with the identified Name in the query. if there is no Name identified in the query, then I need to retrieve ALL the Designations from the DB. In the next step, if Designation is also identified in the query, then I need to retrieve the Departments from the DB that are matched with this Designation. if there is no Designation identified, then I need to retrieve ALL the Departments from the DB. Like this, there are around 6-7 steps, all are dependent on the previous step output. In this scenario, I would like to know whether I can use Lucene for creating the index? If so, How can I use it? Any help is highly appreciated. Thanks, Satish -- View this message in context: http://www.nabble.com/Lucene-for-dynamic-data-retrieval-tp24754777p24754777.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. 
How to improve search time?
Hi, I have a single index of size 87GB containing around 50M documents. When I search for any query, the best search time I observed was 8 seconds. And when the query is expanded with synonyms, the search takes minutes (~2-3 min). Is there a better way to search so that the overall search time is reduced? Thanks, Prashant.
Re: How to improve search time?
Hi Prashant, Take a look at this... http://wiki.apache.org/lucene-java/ImproveSearchingSpeed Cheers, Phil On Sun, Aug 2, 2009 at 9:33 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi, I've a single index of size 87GB containing around 50M documents. When I search for any query, best search time I observed was 8sec. And when query is expanded with synonyms, search takes minutes (~ 2-3min). Is there a better way to search so that overall search time reduces? Thanks, Prashant.
Re: Boosting Search Results
Hello there, I would like to know more about this boosting of search results. Thanks. --- On Sun, 8/2/09, bourne71 gary...@live.com wrote: From: bourne71 gary...@live.com Subject: Re: Boosting Search Results To: java-user@lucene.apache.org Date: Sunday, August 2, 2009, 8:14 PM Thanks for all the reply. It help me to understand problem better, but is it possible to create a query that will give additional boost to the results if and only if both of the word is found inside the results. This will definitely make sure that the results will be in the higher up of the list. Can this type of query be created? -- View this message in context: http://www.nabble.com/Boosting-Search-Results-tp24753954p24784708.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.