Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-02 Thread se3g2011

Hi, judging from the error messages you listed below, please move the
'reader.close()' call to the bottom of the method.
I think that if you invoke it first, the underlying stream is closed, so
the exception is encountered.
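
Something along these lines (a minimal sketch, assuming the Lucene 2.4-era
API; the class name is made up): enumerate the terms first, and close the
reader only in the finally block.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public class ListTerms {
    public static void main(String[] args) throws Exception {
        IndexReader reader = null;
        try {
            reader = IndexReader.open(args[0]);
            TermEnum terms = reader.terms();
            while (terms.next()) {          // enumerate while the reader is open
                System.out.println(terms.term());
            }
            terms.close();
        } finally {
            if (reader != null) {
                reader.close();             // close the reader last
            }
        }
    }
}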


ohaya wrote:
 
 Hi,
 
 I changed the beginning of the try to:
 
   try {
     System.out.println("About to call .next()...");
     boolean foo = termsEnumerator.next();
     System.out.println("Finished calling first .next()");
     System.out.println("About to drop into while()...");
 .
 .
 .
 
 and here's what I got when I ran the app:
 
 Index in directory :[C:\lucene-devel\lucene-devel\index] was opened
 successfully!
 About to call .next()...
 ** ERROR **: Exception while stepping through index: [java.io.IOException:
 The handle is invalid]
 java.io.IOException: The handle is invalid
   at java.io.RandomAccessFile.seek(Native Method)
   at
 org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:591)
   at
 org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
   at
 org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
   at
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
   at
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
   at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:64)
   at 
 org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:127)
   at ReadIndex.main(ReadIndex.java:29)
 
 Jim
 
  oh...@cox.net wrote: 
 Hi,
 
 BTW, the next() method is an abstract method in the Javadocs.  Does that
 mean that I'm supposed to have my own implementation?
 
 Jim
 
 
  oh...@cox.net wrote: 
  Phil,
  
  I posted in haste.  Actually, from the output that I posted, doesn't it
 look like the .next() itself is throwing the exception?
  
  That is what has been puzzling me.  It looks like it got through the
 open() and terms() with no problem, then blew up when calling
 next()?
  
  Jim
  
  
   oh...@cox.net wrote: 
   Phil,
   
   Yes, that exception is not very helpful :)!!
   
   I'll try your suggestions and post back.
   
   Thanks,
   Jim
   
   
    Phil Whelan phil...@gmail.com wrote: 
Hi Jim,

I cannot see anything obvious, but both open() and terms() throw
IOException's. You could try putting these in separate try..catch
blocks to see which one it's coming from. Or using
 e.printStackTrace()
in the catch block will give more info to help you debug what's
happening.

On Sat, Aug 1, 2009 at 7:09 PM, oh...@cox.net wrote:
                        reader = IndexReader.open(args[0]);
                        Term term = new Term("path", "");
                        termsEnumerator = reader.terms(term);

Cheers,
Phil

   

-- 
View this message in context: 
http://www.nabble.com/java.io.IOException-when-trying-to-list-terms-in-index-%28IndexReader%29-tp24774351p24775753.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Group by in Lucene ?

2009-08-02 Thread Erik Hatcher

Don't overlook Solr: http://lucene.apache.org/solr

Erik

On Aug 1, 2009, at 5:43 AM, mschipperheyn wrote:



http://code.google.com/p/bobo-browse

looks like it may be the ticket.

Marc

--
View this message in context: 
http://www.nabble.com/Group-by-in-Lucene---tp13581760p24767693.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Weird behaviour

2009-08-02 Thread prashant ullegaddi
Hi,

I've indexed some 50 million documents. I've indexed the target URL of each
document in a "url" field, using StandardAnalyzer with Index.ANALYZED.
Suppose there is a Wikipedia page with title "Rahul Dravid" and
url: http://en.wikipedia.org/wiki/Rahul_Dravid.

But when I search for +title:"Rahul Dravid" +url:Wikipedia, I'm getting no
results. I get the document(s) when
I search for url:"http://en.wikipedia.org/wiki/Rahul_Dravid" or
url:"en.wikipedia.org/wiki/Rahul_Dravid". I get
results even when I search for url:"wiki/Rahul_Dravid".

It'd be helpful if somebody can throw some light on this.

-- Prashant.


Re: Weird behaviour

2009-08-02 Thread Shai Erera
You write that you index the string under the "url" field. Do you also index
it under "title"? If not, that can explain why title:"Rahul Dravid" does not
work for you.

Also, did you try to look at the index w/ Luke? It will show you what the
terms in the index are.

Another thing which is always good for debugging such things is to create a
StandardAnalyzer, then request a tokenStream() from it, passing a
StringReader w/ the text you want to parse. Then just print the tokens
returned.
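
For example, a rough sketch of that debugging step (this assumes the Lucene
2.9-era attribute API from trunk; adjust for your version):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class PrintTokens {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("url",
                new StringReader("http://en.wikipedia.org/wiki/Rahul_Dravid"));
        TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
        OffsetAttribute offset = (OffsetAttribute) ts.addAttribute(OffsetAttribute.class);
        TypeAttribute type = (TypeAttribute) ts.addAttribute(TypeAttribute.class);
        while (ts.incrementToken()) {   // print each token the analyzer produces
            System.out.println("(" + term.term() + "," + offset.startOffset()
                    + "," + offset.endOffset() + ",type=" + type.type() + ")");
        }
    }
}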

I've done that, using the version from trunk, w/ Version.2_4, and the tokens
that are extracted are:
(http,0,4,type=ALPHANUM)
(en.wikipedia.org,7,23,type=HOST)
(wiki,24,28,type=ALPHANUM)
(rahul,29,34,type=ALPHANUM)
(dravid,35,41,type=ALPHANUM)

So:
1) You don't get results for title:"Rahul Dravid" since you index it under
"url" and not "title".
2) url:"wiki/Rahul_Dravid" works, since it looks for a phrase that exists in
the index (look at the last 3 tokens produced by the Analyzer in the output
above).
3) url:"the entire string" also works, since you index all of it under the
"url" field.

Does this explain the behavior you see?

Shai




Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Firstly, I'm indexing the string in the "url" field only.

I've never used Luke; I don't know how to use it.

What I'm trying to do is search for those documents that are from
some particular site and have a given title.





Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread ohaya
Hi,

BTW, my indexer app is basically the same as the demo IndexFiles.java.  Here's 
part of the main:

try {
  IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(),
      true, IndexWriter.MaxFieldLength.LIMITED);
  System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
  indexDocs(writer, docDir);
  System.out.println("Optimizing...");
  writer.optimize();
  writer.close();

  Date end = new Date();
  System.out.println(end.getTime() - start.getTime() + " total milliseconds");

} catch (IOException e) {
  System.out.println(" caught a " + e.getClass() +
      "\n with message: " + e.getMessage());
}

When I run the indexer, I can see it report that it added the document that
later ends up missing from the terms.

Thanks,
Jim


 oh...@cox.net wrote: 
 Hi,
 
 I've noticed a kind of strange problem with term counts and actual terms.
 
 Some background: I wrote an app that creates an index, including a "path"
 field.
 
 I am now working on an app (code was in the previous thread) that, as part of
 what it does, needs to get a list of all of the "path" fields for documents
 that were added.
 
 I first noticed the problem that I'm seeing while working on this latter app.
 Basically, what I noticed was that while I was adding 13 documents to the
 index, when I listed the "path" terms, there were only 12 of them.
 
 So then, I reviewed the index using Luke, and what I saw with that was that
 there were indeed only 12 "path" terms (under "Term Count" on the left), but,
 when I clicked "Show Top Terms" in Luke, there were 13 terms listed by
 Luke.
 
 At this point, I'm very puzzled about all of this :(...
 
 Can anyone explain the difference in Luke, and, more importantly, why I am
 only getting 12 (i.e., 1 less than the # of documents added) when I try to
 programmatically list the terms?
 
 Thanks,
 Jim
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



score from spans

2009-08-02 Thread Eran Sevi
Hi,

How can I get the score of a span that is the result of SpanQuery.getSpans()?
The score can be the same for each document, but if it's unique per span,
it's even better.

I tried looking for a way to expose this functionality through the Spans
class but it looks too complicated.
I'm not even sure that by default some score calculation is even performed
when using span queries.

I've noticed that some calculations are made using payloads and
BoostingTermQuery but the score result is used internally and can't be
accessed from the Spans results.
I don't want to re-run the query again using a HitCollector and since the
reader is passed to getSpans, I think it should be possible to do what I
want.

Any help on the correct way to expose the span score will be appreciated.

Thanks,
Eran.


Re: Weird behaviour

2009-08-02 Thread Shai Erera
How do you parse/convert the page to a Document object? Are you sure the
title "Rahul Dravid" is extracted properly and put in the "title" field?

You can read about Luke here: http://www.getopt.org/luke/.

Can you do System.out.println(document.toString()) before you add it to the
index, and paste the output here?

Shai




Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim,

On Sun, Aug 2, 2009 at 1:32 AM, oh...@cox.net wrote:
 I first noticed the problem that I'm seeing while working on this latter app.
 Basically, what I noticed was that while I was adding 13 documents to the
 index, when I listed the "path" terms, there were only 12 of them.

Field text (the whole path in your case) and terms (the tokens of
the field text) are different.

The StandardAnalyzer breaks up words like this...
Field text = "/a/b/c.txt"
Tokens = {"a", "b", "c", "txt"}

So this 1 field of 1 document becomes 4 terms / tokens (not sure if
there is a difference in this terminology between "terms" and "tokens",
sorry).
Therefore, you're going to have more terms than documents initially,
but as the overlap in term usage increases this changes.

For instance, these 3 paths,
"/a/b/c/d.txt", "/b/c/d/a.txt", "/c/d/a/b.txt", are still only a total of
4 terms, since they share the same terms.

In fact, StandardAnalyzer goes a bit further than that and removes
stop-words, such as "a" (or "an", "the"), as it's designed for
general text searching.

That said, I think you have a point with the next part of your question...

 So then, I reviewed the index using Luke, and what I saw with that was that
 there were indeed only 12 "path" terms (under "Term Count" on the left), but,
 when I clicked "Show Top Terms" in Luke, there were 13 terms listed by
 Luke.

Yes, I just checked this and it seems to be a bug with Luke. It
always shows 1 less in the Term Count than it should. Well spotted.

Cheers,
Phil

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim,

On Sun, Aug 2, 2009 at 9:08 AM, Phil Whelan phil...@gmail.com wrote:

 So then, I reviewed the index using Luke, and what I saw with that was that
 there were indeed only 12 "path" terms (under "Term Count" on the left),
 but, when I clicked "Show Top Terms" in Luke, there were 13 terms listed
 by Luke.

 Yes, I just checked this and it seems to be a bug with Luke. It
 always shows 1 less in the Term Count than it should. Well spotted.

I was able to see why this was happening in the Luke source, and I've
submitted the following patch to Andrzej, the author of Luke.

Thanks,
Phil

--- luke.orig/src/org/getopt/luke/Luke.java  2009-03-19 22:41:34.0 -0700
+++ luke-src-0.9.2/src/org/getopt/luke/Luke.java  2009-08-02 09:33:24.0 -0700
@@ -813,23 +813,18 @@
     setString(iFields, "text", String.valueOf(idxFields.length));
     Object iTerms = find(pOver, "iTerms");
     termCounts.clear();
-    FieldTermCount ftc = new FieldTermCount();
+    FieldTermCount ftc = null;
     TermEnum te = ir.terms();
     numTerms = 0;
     while (te.next()) {
       Term currTerm = te.term();
-      if (ftc.fieldname == null) {
+      if (ftc == null || ftc.fieldname == null || ftc.fieldname != currTerm.field()) {
         // initialize
-        ftc.fieldname = currTerm.field();
-        termCounts.put(ftc.fieldname, ftc);
-      }
-      if (ftc.fieldname == currTerm.field()) {
-        ftc.termCount++;
-      } else {
         ftc = new FieldTermCount();
         ftc.fieldname = currTerm.field();
         termCounts.put(ftc.fieldname, ftc);
       }
+      ftc.termCount++;
       numTerms++;
     }
     te.close();

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Yes, I'm sure that the title "Rahul Dravid" is extracted properly, and there
is a document relevant to this query as well.
The following query and its results prove it:

Enter query:
Searching for: +title:"rahul dravid" +url:wiki
4 total matching documents
   trec-id: clueweb09-enwp02-13-14368, URL:
http://en.wikipedia.org/wiki/Rahul_Dravid
   trec-id: clueweb09-enwp01-83-11378, URL:
http://en.wikipedia.org/wiki/Rahul_S_Dravid
   trec-id: clueweb09-en0011-08-22737, URL:
http://www.reference.com/browse/wiki/Rahul_Dravid
   trec-id: clueweb09-enwp01-69-13556, URL:
http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid
Press (q)uit or enter number to jump to a page.

But see the following query:

Enter query:
+title:"rahul dravid" +url:wikipedia
Searching for: +title:"rahul dravid" +url:wikipedia
0 total matching documents
Press (q)uit or enter number to jump to a page.

Isn't it weird?

-- Prashant.




Re: Weird behaviour

2009-08-02 Thread Phil Whelan
Hi Prashant,

I agree with Shai, that using Luke and printing out what the Document
looks like before it goes into the index, are going to be your best
bet for debugging this problem.

The problem you're having is that StandardAnalyzer does not break up
the hostname into separate terms, as it has a special case for
hostnames and acronyms.

This should work...
+title:"rahul dravid" +url:en.wikipedia.org

Thanks,
Phil


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ThreadedIndexWriter vs. IndexWriter

2009-08-02 Thread Michael McCandless
Woops sorry for the confusion!

Mike

 On Sat, Aug 1, 2009 at 1:03 PM, Phil Whelan phil...@gmail.com wrote:
 Hi Mike,

 It's Jibo, not me, having the problem. But thanks for the link. I was
 interested to look at the code. Will be buying the book soon.

 Phil

 On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 (Please note that ThreadedIndexWriter is source code available with
 the upcoming revision to Lucene in Action.)

 Phil, is it possible you are using an older version of the book's
 source code?  In particular, can you check whether your version of
 ThreadedIndexWriter.java has this:

  public void close(boolean doWait) throws CorruptIndexException, IOException 
 {
    finish();
    super.close(doWait);
  }

 (I vaguely remember that being missing from earlier releases, which
 could explain what you're seeing).  If you are missing that, can you
 download the current code from http://www.manning.com/hatcher3 and try
 again?

 If that's not the problem... can you post the benchmark alg you are
 using in each case?

 Mike

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Hi Phil,

The query you gave did work. Well, that proves StandardAnalyzer has a
different way of tokenizing URLs.

Thanks,
Prashant.


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
On Sun, Aug 2, 2009 at 10:58 AM, Andrzej Bialecki a...@getopt.org wrote:
 Thank you Phil for spotting this bug - this fix will be included in the next
 release of Luke.

Glad to help. Thanks for building this great tool!

Phil

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Weird behaviour

2009-08-02 Thread Shai Erera
You can always create your own Analyzer which creates a TokenStream just
like StandardAnalyzer, but instead of using only StandardFilter, add another
TokenFilter which receives the HOST token type and breaks it further into its
components (e.g., extract "en", "wikipedia" and "org"). You can also return
the original HOST token along with its components.
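
A rough sketch of the idea (this assumes the Lucene 2.4-era Token API; the
class name is made up, and it is an untested illustration, not Shai's code):

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class HostSplittingFilter extends TokenFilter {
    private final LinkedList<Token> pending = new LinkedList<Token>();

    public HostSplittingFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return pending.removeFirst();   // emit the buffered components
        }
        Token token = input.next();
        if (token != null && "<HOST>".equals(token.type())) {
            // queue each dot-separated part: en.wikipedia.org -> en, wikipedia, org
            for (String part : token.term().split("\\.")) {
                Token component = new Token(token.startOffset(), token.endOffset());
                component.setTermBuffer(part);
                pending.add(component);
            }
        }
        return token;                       // the original HOST token comes first
    }
}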

I hope this helps.

Shai




Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Thank you Phil and Shai.

I will write a different Analyzer.




Re: arabic analyzer

2009-08-02 Thread Robert Muir
 the fact is, plural (as an example) is not supported, and that is one of
 the most common things that a person doing some search will expect to

Walid, I'm not sure this is true. Many plurals are supported
(certainly not exceptional cases or broken plurals).
This is no different from the other language analyzers in Lucene, even
English stemmers: the most common forms are grouped together, and that's
about all you can say :)

Maybe in the future we can improve it for your particular concern, though:
add simple dictionary mappings for at least the most common broken
plurals, something like that.

-- 
Robert Muir
rcm...@gmail.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread ohaya
Hi Phil,

For the problem with my app, it wasn't what you suggested (about the tokens,
etc.).

For some later things, my indexer creates both a "path" field that is
analyzed (and thus tokenized, etc.) and another field, "fullpath", which is
not analyzed (and thus not tokenized).

The problem with my app was that I was creating a TermEnum:

Term term = new Term("fullpath", "");
termsEnumerator = reader.terms(term);

and then going immediately into a while loop:

while (termsEnumerator.next()) {
.
.
}

i.e., I was ignoring the 1st term in the TermEnum (since the .next() bumps the 
TermEnum to the 2nd term, initially).

Anyway, so the code that I ended up with is:

try {
    System.out.println("Outside while: About to get 1st termsEnumerator.term()...");
    currentTerm = termsEnumerator.term();
    currentField = currentTerm.field();
    termpathcount++;
    System.out.println("Outside while: 1st Field = [" + currentField
        + "] Term = [" + currentTerm.text() + "]");
    System.out.println("Outside while: About to drop into while()...");
    while (termsEnumerator.next()) {
        currentTerm = termsEnumerator.term();
        currentField = currentTerm.field();
        if (currentField.equalsIgnoreCase("fullpath")) {
            termpathcount++;
            System.out.println("Count=" + termpathcount + " Field = ["
                + currentField + "] Term = [" + currentTerm.text() + "]");
        }
    } // end while()

    termsEnumerator.close();
    System.out.println("Matching terms count = " + termpathcount);
} catch (Exception e) {
    System.out.println("** ERROR **: Exception while stepping through "
        + "index: [" + e + "]");
    e.printStackTrace();
}

and, that seems to be working perfectly.

Also, thanks for following up re. that Luke problem.  That was one piece of 
this puzzle that was kind of driving me batty :)!!

Jim





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-02 Thread ohaya
Hi,

I thought that, in the code that I posted, there was a close() in the finally?

Or, are you saying that when an IndexReader is opened, it somehow
persists in the system, even past my Java app terminating?

FYI, I'm doing this testing on Windows, under Eclipse...

Jim



 se3g2011 se3g2...@gmail.com wrote: 
 
 Hi, judging from the error messages you listed below, please move the
 'reader.close()' call to the bottom of the method.
 I think that if you invoke it first, the underlying stream is closed, so
 the exception is encountered.
 
 



Re: Weird discrepancy with term counts vs. terms (off by 1)

2009-08-02 Thread Phil Whelan
Hi Jim,

On Sun, Aug 2, 2009 at 12:12 PM, oh...@cox.net wrote:
 i.e., I was ignoring the 1st term in the TermEnum (since the .next() bumps 
 the TermEnum to the 2nd term, initially).

Great! Glad you found the problem. I couldn't see it.

Phil

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: java.io.IOException when trying to list terms in index (IndexReader)

2009-08-02 Thread Erick Erickson
I've seen Eclipse get into weird states, but I don't think that's your
problem.

You open the IndexReader and set up a TermEnum on it. Then, no matter
what, you close the underlying IndexReader in the finally block. Then later
you use the TermEnum *even though the underlying reader has been closed*.
You want something like:
try {
    open reader
    set up TermEnum
    enumerate terms
    close termEnum
} catch () {
} finally {
    close reader
}

Getting an IO exception isn't at all strange in this situation, and exactly
when you throw the exception is indeterminate.

See below.
public class ReadIndex {

    public static void main(String[] args) {
        IndexReader reader = null;
        TermEnum termsEnumerator = null;
        Term currentTerm = null;

        try {
            reader = IndexReader.open(args[0]);
            Term term = new Term("path", "");
            termsEnumerator = reader.terms(term);
        } catch (IOException e) {
            System.out.println("** ERROR **: Exception when opening "
                + "IndexReader: [" + e + "]");
        } finally {
**
Why close the reader here? You need it later I think.
**
            try { reader.close(); } catch (IOException e) { /* suck it up */ }
        }

        System.out.println("Index in directory :[" + args[0] + "] was "
            + "opened successfully!");

        try {
            System.out.println("About to drop into while()...");
**
This relies on the underlying reader that has been closed???
**

            while (termsEnumerator.next()) {
                System.out.println("About to get termsEnumerator.term()...");
                currentTerm = termsEnumerator.term();
                System.out.println("Term = [" + currentTerm.text() + "]");
            }
            termsEnumerator.close();
        } catch (Exception e) {
            System.out.println("** ERROR **: Exception while stepping "
                + "through index: [" + e + "]");
        }
    } // end main()

} // end CLASS ReadIndex

On Sun, Aug 2, 2009 at 3:15 PM, oh...@cox.net wrote:

 Hi,

 I thought that, in the code that I posted, there was a close() in the
 finally?

 Or, are you saying that when an IndexReader is opened, that that somehow
 persists in the system, even past my Java app terminating?

 FYI, I'm doing this testing on Windows, under Eclipse...

 Jim



  se3g2011 se3g2...@gmail.com wrote:
 
  hi,as you the error messages you listed below,pls put the
 'reader.close()'
  block to the bottom of method.
  i think,if you invoke it first,the infrastructure stream is closed ,so
  exceptions is encountered.
 
 
  ohaya wrote:
  
   Hi,
  
   I changed the beginning of the try to:
  
   try {
   System.out.println(About to call .next()...);
   boolean foo = termsEnumerator.next();
   System.out.println(Finished calling first
 .next());
   System.out.println(About to drop into
 while()...);
   .
   .
   .
  
   and here's what I got when I ran the app:
  
   Index in directory :[C:\lucene-devel\lucene-devel\index] was opened
   successfully!
   About to call .next()...
   ** ERROR **: Exception while stepping through index:
 [java.io.IOException:
   The handle is invalid]
   java.io.IOException: The handle is invalid
   at java.io.RandomAccessFile.seek(Native Method)
   at
  
 org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:591)
   at
  
 org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
   at
  
 org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
   at
  
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
   at
  
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
   at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:64)
   at
 org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:127)
   at ReadIndex.main(ReadIndex.java:29)
  
   Jim
  
    oh...@cox.net wrote:
   Hi,
  
   BTW, the next() method is an abstract method in the Javadocs.  Does
 that
   mean that I'm suppose to have my own implementation?
  
   Jim
  
  
    oh...@cox.net wrote:
Phil,
   
I posted in haste.  Actually, from the output that I posted, doesn't
 it
   it look like the .next() itself is throwing the exception?
   
That is what has been puzzling me.  It looks like it got through the
   open() and terms() with no problem, then it blew up when calling the
   next()?
   
Jim
   
   
 oh...@cox.net wrote:
 Phil,

 Yes, that exception is not very helpful :)!!
 


question about indexing/searching using standardanalyzer for KEYWORD field that contains alphanumeric data

2009-08-02 Thread Leonard Gestrin

Hello,
I have a question about the KEYWORD field type and searching/updating. I am
getting strange behavior that I can't quite comprehend.
My index is created using StandardAnalyzer, which is used for both writing
and searching. It has three fields:

userpin - alphanumeric field which is stored as TEXT
documentkey - alphanumeric field which is stored as TEXT
contents - text of the document, which is stored as TEXT

When I try to update a document, I am creating a Term to find the document
by documentKey, and I am using

 org.apache.lucene.index.IndexWriter.updateDocument(term, pDocument);

to do the update. Lucene fails to find the document by the term, and I am
getting duplicate documents in the index.
When I changed the index to define documentKey as KEYWORD, the updates
started to work fine.
However, searching for documentKey using StandardAnalyzer stopped working.

It appears that Lucene is using KeywordAnalyzer to search for the term
during the update, even though the indexer is open with StandardAnalyzer.

The sample values that are stored in documentKeys are: "LFAHBHMF",
"LFAHBHAS".
I noticed that if documentKey is a numeric value, both KeywordAnalyzer and
StandardAnalyzer can find the documents by it without any problem; thus the
reader can find and the indexer can update without any problems. With
alphanumeric values I can't get both to work.
Any help is appreciated.
Thanks
Leonard
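
For reference, a minimal sketch of the usual pattern here (assuming the
Lucene 2.4-era API; the class name is made up and the value is taken from
the question): updateDocument matches its Term exactly, without analysis,
so the key field is typically indexed NOT_ANALYZED and searched with a
TermQuery rather than through StandardAnalyzer.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class UpdateByKey {
    public static void main(String[] args) throws Exception {
        String indexDir = args[0];

        // KEYWORD-style field: stored and indexed, but not run through the analyzer.
        Document doc = new Document();
        doc.add(new Field("documentkey", "LFAHBHMF",
                Field.Store.YES, Field.Index.NOT_ANALYZED));

        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.LIMITED);
        // updateDocument deletes by the exact, un-analyzed term, then adds the doc.
        writer.updateDocument(new Term("documentkey", "LFAHBHMF"), doc);
        writer.close();

        // Search with a TermQuery; running the key through StandardAnalyzer
        // would lowercase it ("lfahbhmf") and never match the stored token.
        IndexSearcher searcher = new IndexSearcher(indexDir);
        TopDocs hits = searcher.search(
                new TermQuery(new Term("documentkey", "LFAHBHMF")), null, 10);
        System.out.println(hits.totalHits + " match(es)");
        searcher.close();
    }
}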

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boosting Search Results

2009-08-02 Thread bourne71

Thanks for all the replies. They helped me understand the problem better.
But is it possible to create a query that gives an additional boost to a
result if and only if both of the words are found in it? This would make
sure that those results end up higher in the list.

Can this type of query be created?
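
One way to sketch this (not from the thread; it assumes the 2.4-era API,
and the class name is made up): wrap the base query with an optional,
boosted sub-query that only matches when both words occur.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class BoostBothWords {
    public static BooleanQuery build(String field, String word1, String word2) {
        // Matches documents containing either word...
        BooleanQuery base = new BooleanQuery();
        base.add(new TermQuery(new Term(field, word1)), BooleanClause.Occur.SHOULD);
        base.add(new TermQuery(new Term(field, word2)), BooleanClause.Occur.SHOULD);

        // ...while only documents containing both words get this extra boost.
        BooleanQuery both = new BooleanQuery();
        both.add(new TermQuery(new Term(field, word1)), BooleanClause.Occur.MUST);
        both.add(new TermQuery(new Term(field, word2)), BooleanClause.Occur.MUST);
        both.setBoost(5.0f);

        BooleanQuery combined = new BooleanQuery();
        combined.add(base, BooleanClause.Occur.MUST);
        combined.add(both, BooleanClause.Occur.SHOULD);
        return combined;
    }
}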
-- 
View this message in context: 
http://www.nabble.com/Boosting-Search-Results-tp24753954p24784708.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene for dynamic data retrieval

2009-08-02 Thread Otis Gospodnetic
Hi Satish,

Lucene doesn't enforce an index schema, so each document can have a different 
set of fields.  It sounds like you need to write a custom indexer that follows 
your custom rules and creates Lucene Documents with different Fields, depending 
on what you want indexed.
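
A minimal illustration of the schema-less point (the names here are made
up): two documents in the same index with entirely different sets of fields.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SchemalessDocs {
    public static Document person(String name, String designation) {
        Document doc = new Document();
        doc.add(new Field("name", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("designation", designation, Field.Store.YES,
                Field.Index.ANALYZED));
        return doc;
    }

    public static Document department(String dept) {
        // A different "shape" of document: no name or designation fields at all.
        Document doc = new Document();
        doc.add(new Field("department", dept, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }
}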

You also mention searching and retrieval of data from DB.  This, too, sounds 
like a custom search application - there is nothing in Lucene that uses a 
(R)DBMS to retrieve field values.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Findsatish findsat...@gmail.com
 To: java-user@lucene.apache.org
 Sent: Friday, July 31, 2009 7:13:47 AM
 Subject: Lucene for dynamic data retrieval
 
 
 Hi All,
 I am new to Lucene and I am working on a search application.
 
 My application needs dynamic data retrieval from the database. That means,
 based on my previous step output, I need to retrieve entries from the DB for
 the next step.
 
 For example, if my search query contains a Name field entry, I need to
 retrieve the Designations from the DB that are matched with the identified
 Name in the query.
 If there is no Name identified in the query, then I
 need to retrieve ALL the Designations from the DB.
 
 In the next step, if a Designation is also identified in the query, then I
 need to retrieve the Departments from the DB that are matched with this
 Designation.
 If there is no Designation identified, then I need
 to retrieve ALL the Departments from the DB.
 
 Like this, there are around 6-7 steps, all are dependent on the previous
 step output.
 
 In this scenario, I would like to know whether I can use Lucene for creating
 the index? If so, How can I use it?
 
 Any help is highly appreciated.
 
 Thanks,
 Satish
 -- 
 View this message in context: 
 http://www.nabble.com/Lucene-for-dynamic-data-retrieval-tp24754777p24754777.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



How to improve search time?

2009-08-02 Thread prashant ullegaddi
Hi,

I have a single index of size 87GB containing around 50M documents. When I
search for any query, the best search time I observed was 8 seconds. And
when the query is expanded with synonyms, the search takes minutes (~2-3
min). Is there a better way to search so that overall search time is
reduced?

Thanks,
Prashant.


Re: How to improve search time?

2009-08-02 Thread Phil Whelan
Hi Prashant,

Take a look at this...
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
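
For reference, one of the first tips on that page is to share a single
IndexSearcher across all searches. A tiny sketch (the class name is made up):

import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private static IndexSearcher searcher;  // shared; thread-safe for searching

    public static synchronized IndexSearcher get(String indexDir) throws Exception {
        if (searcher == null) {
            searcher = new IndexSearcher(indexDir);  // expensive, so do it once
        }
        return searcher;
    }
}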

Cheers,
Phil



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boosting Search Results

2009-08-02 Thread henok sahilu
Hello there,
I'd like to know about this boosting search results thing.
Thanks




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org