Re: Closing IndexWriter object after each file causes NullPointerException?

2004-04-14 Thread jitender ahuja
Hi,

OK, but what is the use of the writeLock if the directory is
modified anyway?
If the writeLock were the issue, the index directory should hold
index information for the first file only.

Thanks,
Jitender

- Original Message - 
From: Brisbart Franck [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; Lucene Users List
[EMAIL PROTECTED]
Sent: Tuesday, April 13, 2004 10:15 PM
Subject: Re: Closing IndexWriter object after each file causes
NullPointerException?


 If you close an IndexWriter more than once, releasing the writeLock
 throws a NullPointerException.
 You should clean up your code and close your writer only once. Anyway, I
 don't know why there's no null check on 'writeLock' as there is in the
 'finalize' method.
 I think it's a small bug, so I suggest the attached patch to fix it.

 Franck Brisbart


 jitender ahuja wrote:
  Hi,
   Can anyone tell me the cause of the error below? As far as I can tell,
  the source is neither of the following:
  a) The index directory being closed after each file of the directory (to
  be indexed): ruled out, because the directory size changes with the
  number of files indexed.
  b) The IndexWriter object having been closed: checked by confirming that
  the IndexWriter object (here, writ) is non-null, using the line
  System.out.println(writ != null); in the attached code.
 
 
  Error output:
   java.lang.NullPointerException
  at org.apache.lucene.index.IndexWriter.close(Unknown Source)
  at IndexDatanew.indexDocs(IndexDatanew.java:89)
  at IndexDatanew.indexDocs(IndexDatanew.java:50)
  at IndexDatanew.main(IndexDatanew.java:25)
 
  The code that causes this error otherwise works fine (i.e. for
  indexing purposes) and is attached; the detailed output for a
  directory of 2 files is also attached:
 
  Thanks
  Jitender
 
 
  
 
  C:\lucroche>java IndexDatanew E:\freebooks\books\whole\jiten
  Index Directory: E:\freebooks\books\whole\jiten
  2
  E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
  adding: E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
  File contents from buffer:
  E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
  false
  E:\freebooks\books\whole\jiten\TIJ3_c.htm
  adding: E:\freebooks\books\whole\jiten\TIJ3_c.htm
  File contents from buffer:
  E:\freebooks\books\whole\jiten\TIJ3_c.htm
  false
  java.lang.NullPointerException
  at org.apache.lucene.index.IndexWriter.close(Unknown Source)
  at IndexDatanew.indexDocs(IndexDatanew.java:89)
  at IndexDatanew.indexDocs(IndexDatanew.java:50)
  at IndexDatanew.main(IndexDatanew.java:25)
 
 
 
  
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]


 -- 
 Franck Brisbart
 R&D
 http://www.kelkoo.com







Index: IndexWriter.java
===
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
retrieving revision 1.28
diff -u -r1.28 IndexWriter.java
--- IndexWriter.java 25 Mar 2004 19:34:53 - 1.28
+++ IndexWriter.java 13 Apr 2004 16:39:56 -
@@ -235,8 +235,10 @@
   public synchronized void close() throws IOException {
     flushRamSegments();
     ramDirectory.close();
-    writeLock.release();  // release write lock
-    writeLock = null;
+    if (writeLock != null) {
+      writeLock.release();  // release write lock
+      writeLock = null;
+    }
     directory.close();
   }
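
For illustration, the null guard in the patch can be sketched outside Lucene. The GuardedWriter class and its Lock interface below are hypothetical stand-ins, not Lucene's IndexWriter, and close() is simplified so that it declares no IOException. The sketch only shows that a guarded close() tolerates a second call, and that closing once after the loop avoids the problem in the first place.

```java
/** Hypothetical stand-in for a writer that holds a lock; this is NOT Lucene's IndexWriter. */
class GuardedWriter {
    /** Minimal lock abstraction, assumed for this sketch only. */
    interface Lock {
        void release();
    }

    private Lock writeLock;
    private int releases = 0;

    GuardedWriter() {
        // Count releases so the effect of the guard can be observed.
        this.writeLock = () -> releases++;
    }

    /** Safe to call more than once: the lock is released on the first call only. */
    public synchronized void close() {
        if (writeLock != null) {
            writeLock.release();
            writeLock = null;
        }
    }

    int releaseCount() {
        return releases;
    }

    public static void main(String[] args) {
        GuardedWriter writer = new GuardedWriter();
        try {
            // Index every file with the same writer (file names are made up).
            for (String file : new String[] {"a.htm", "b.htm"}) {
                System.out.println("adding: " + file);
            }
        } finally {
            writer.close(); // close once, after the whole directory is indexed
        }
        writer.close();     // an accidental second close is a no-op with the guard
        System.out.println("releases: " + writer.releaseCount());
    }
}
```

Without the null check, the second close() would dereference the already cleared lock field, which matches the stack trace reported above.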







Re: How to retrieve the terms that matched

2004-04-14 Thread Erik Hatcher
Have a look at the Highlighter code that lives in the Lucene sandbox.
It is a new addition there, but has been available for some time from
the creator's website.  I'm not sure if this will give you the
information you need directly, but it would be a start.

	Erik

On Apr 14, 2004, at 8:27 AM, David Thibau wrote:

Perhaps a silly question, but ...

I cannot find a way to retrieve the matched terms of a found document.
We construct a Lucene query that searches different fields with OR
clauses, and we want to display to the user, for each result, the
term(s) that matched.
Is this possible with the Lucene API?

Thanks in advance
David THIBAU




Re: Closing IndexWriter object after each file causes NullPointerException?

2004-04-14 Thread Brisbart Franck
I'm not sure I understand your problem.
Anyway, the writeLock is used to prevent two different writers (or a
reader, if you use 'delete') from modifying the same index at once.
What do you mean by "first file"?

Franck
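
As a rough illustration of the mutual exclusion described above, here is a minimal lock-file sketch. It assumes a lock can be represented by a file that is created atomically in the index directory. The class SimpleWriteLock, the file name write.lock and the demo helper are all hypothetical and are not taken from Lucene's source.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical lock-file sketch; not Lucene's actual lock implementation. */
class SimpleWriteLock {
    private final Path lockFile;
    private boolean held;

    SimpleWriteLock(Path indexDir) {
        // The file name is an assumption made for this example.
        this.lockFile = indexDir.resolve("write.lock");
    }

    /** Tries to take the lock; returns false if the lock file already exists. */
    synchronized boolean obtain() {
        try {
            Files.createFile(lockFile); // atomic: fails if the file is already there
            held = true;
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock if this instance holds it; otherwise does nothing. */
    synchronized void release() {
        if (!held) {
            return;
        }
        try {
            Files.deleteIfExists(lockFile);
            held = false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Two would-be writers compete for one temporary index directory. */
    static boolean[] demo() {
        try {
            Path dir = Files.createTempDirectory("index");
            SimpleWriteLock first = new SimpleWriteLock(dir);
            SimpleWriteLock second = new SimpleWriteLock(dir);
            boolean firstGotIt = first.obtain();
            boolean secondGotIt = second.obtain();  // refused while first holds the lock
            first.release();
            boolean secondRetry = second.obtain();  // succeeds once the lock is free
            second.release();
            Files.delete(dir);
            return new boolean[] {firstGotIt, secondGotIt, secondRetry};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("first: " + r[0] + ", second: " + r[1] + ", second after release: " + r[2]);
    }
}
```

In this sketch a single writer that stays open can add as many files as it likes, because it is the holder of the lock. The lock only gets in the way of a second writer opened on the same directory.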


--
Franck Brisbart
R&D
http://www.kelkoo.com


Result scoring question

2004-04-14 Thread Armbrust, Daniel C.
I know that the Lucene scoring algorithm is pretty complicated, and I know I don't
understand all the pieces.  But given these documents:

A) - preferred_designation left renal calculus
B) - other_designation renal calculus

Should a query of 

other_designation:(renal calculus) OR preferred_designation:(renal calculus)

Score document B higher than document A?

Those documents are a made up example.  Here are the documents and scores I am getting 
back from the query on my real index:

Score 1.0 - Document<Text<first_word:left> Text<preferred_designation:left renal
calculus in calyceal diverticulum> Unindexed<frequency:4>
Text<codeTokenized:M4001> Keyword<code:M4001>
Keyword<UNIQUE_DOCUMENT_IDENTIFIER_FIELD:48270>>

Score 0.85714287 - Document<Keyword<UNIQUE_DOCUMENT_IDENTIFIER_FIELD:514631>
Keyword<code:M00035214> Text<codeTokenized:M00035214> Unindexed<frequency:4>
Text<preferred_designation:left renal calculus in a solitary left kidney>
Text<first_word:left>>

Score 0.7409672 - Document<Text<first_word:renal> Text<other_designation:renal
calculus> Unindexed<frequency:3> Text<codeTokenized:M00032753> Keyword<code:M00032753>
Keyword<UNIQUE_DOCUMENT_IDENTIFIER_FIELD:481129>>


Am I just making a dumb mistake somewhere?

Thanks, 

Dan




Re: Result scoring question

2004-04-14 Thread Erik Hatcher
Try using IndexSearcher.explain (and then a toString on the resulting 
Explanation object) to see the details of why things are scoring how 
they are.  This can be most enlightening!

	Erik



Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)

2004-04-14 Thread Kevin A. Burton
petite_abeille wrote:

On Apr 13, 2004, at 02:45, Kevin A. Burton wrote:

He mentioned that I might be able to squeeze 5-10% out of index 
merges this way.


Talking of which... what strategy(ies) do people use to minimize 
downtime when updating an index?

This should probably be a wiki page.

Anyway... two thoughts I had on the subject a while back:

You maintain two disks (not RAID; you get reliability through software).

Searches are load balanced between disks for performance reasons.  If 
one fails you just stop using it.

When you want to do an index merge you read from disk0 and write to 
disk1.  Then you take disk0 out of search rotation and add disk1 and 
copy the contents of disk1 to disk two.  Users shouldn't notice much of 
a performance issue during the merge because it will be VERY fast and 
it's just reads from disk0.

Kevin
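
One possible reading of the rotation described above, as a small sketch. It tracks only which disks are serving searches and moves no data. The class and method names are hypothetical. It assumes that disk1 is taken out of rotation while the merge writes to it, and that the final copy goes from disk1 back to disk0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical model of the two-disk rotation; it tracks names only and moves no data. */
class IndexRotation {
    // Disks currently serving searches.
    private final List<String> rotation = new CopyOnWriteArrayList<>();

    IndexRotation() {
        rotation.add("disk0");
        rotation.add("disk1");
    }

    /** Snapshot of the disks currently in search rotation. */
    List<String> inRotation() {
        return new ArrayList<>(rotation);
    }

    /**
     * Runs one index update. The merge callback stands for "read disk0, write disk1";
     * the copyBack callback stands for copying the merged index back to disk0.
     */
    void update(Runnable merge, Runnable copyBack) {
        rotation.remove("disk1"); // assumption: do not search the disk being written
        merge.run();              // searches are served from disk0 only
        rotation.add("disk1");    // the merged index goes into service
        rotation.remove("disk0"); // the stale index leaves the rotation
        copyBack.run();           // searches are served from disk1 only
        rotation.add("disk0");    // both disks serve searches again
    }

    public static void main(String[] args) {
        IndexRotation r = new IndexRotation();
        r.update(
            () -> System.out.println("merging, searches served by " + r.inRotation()),
            () -> System.out.println("copying back, searches served by " + r.inRotation()));
        System.out.println("done, searches served by " + r.inRotation());
    }
}
```

At every step at least one disk with a complete index stays in rotation, which is the point of the scheme.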

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



RE: Result scoring question

2004-04-14 Thread Armbrust, Daniel C.
I should have remembered that.

Here are the 3 explanations for the top 3 documents returned (contents below)

3.3513687 = product of:
  6.7027373 = weight(preferred_designation:renal calculus in 48270), product of:
    0.8114604 = queryWeight(preferred_designation:renal calculus), product of:
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.04297941 = queryNorm
    8.260092 = fieldWeight(preferred_designation:renal calculus in 48270), product of:
      1.0 = tf(phraseFreq=1.0)
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.4375 = fieldNorm(field=preferred_designation, doc=48270)
  0.5 = coord(1/2)

2.8726017 = product of:
  5.7452035 = weight(preferred_designation:renal calculus in 514631), product of:
    0.8114604 = queryWeight(preferred_designation:renal calculus), product of:
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.04297941 = queryNorm
    7.080079 = fieldWeight(preferred_designation:renal calculus in 514631), product of:
      1.0 = tf(phraseFreq=1.0)
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.375 = fieldNorm(field=preferred_designation, doc=514631)
  0.5 = coord(1/2)

2.4832542 = product of:
  4.9665084 = weight(other_designation:renal calculus in 481129), product of:
    0.58440757 = queryWeight(other_designation:renal calculus), product of:
      13.5973835 = idf(other_designation: renal=8560 calculus=971)
      0.04297941 = queryNorm
    8.498364 = fieldWeight(other_designation:renal calculus in 481129), product of:
      1.0 = tf(phraseFreq=1.0)
      13.5973835 = idf(other_designation: renal=8560 calculus=971)
      0.625 = fieldNorm(field=other_designation, doc=481129)
  0.5 = coord(1/2)


Is there anything that I can do in my query construction, to ensure that if a query 
exactly matches a document, it will be the top result?

Thanks, 

Dan
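
The explanations above can be checked by hand. The sketch below assumes, as the explain output lays out, that each raw score is coord * queryWeight * fieldWeight, with queryWeight = idf * queryNorm and fieldWeight = tf * idf * fieldNorm. The class and method names are made up for illustration; only the numbers come from the output above.

```java
/** Recomputes the raw scores from the factors printed in the explain output. */
class ScoreBreakdown {
    /** Raw score of a single matching clause: coord * queryWeight * fieldWeight. */
    static double rawScore(double tf, double idf, double queryNorm, double fieldNorm, double coord) {
        double queryWeight = idf * queryNorm;       // depends on the query clause only
        double fieldWeight = tf * idf * fieldNorm;  // depends on the document's field
        return coord * queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double queryNorm = 0.04297941;
        double coord = 0.5; // one of the two OR clauses matched

        // Docs 48270 and 514631 match on preferred_designation, doc 481129 on other_designation.
        double doc48270 = rawScore(1.0, 18.88021, queryNorm, 0.4375, coord);
        double doc514631 = rawScore(1.0, 18.88021, queryNorm, 0.375, coord);
        double doc481129 = rawScore(1.0, 13.5973835, queryNorm, 0.625, coord);

        System.out.printf("raw scores: %.4f %.4f %.4f%n", doc48270, doc514631, doc481129);
        System.out.printf("relative to the top hit: %.4f %.4f%n",
                doc514631 / doc48270, doc481129 / doc48270);
    }
}
```

Under these assumptions the exact-match document (481129) has the largest fieldWeight, 8.498364, thanks to its short field and fieldNorm of 0.625. It still ranks last because idf enters the product twice, and idf for other_designation (13.5973835) is lower than for preferred_designation (18.88021). Dividing each raw score by the top one gives values close to the 1.0, 0.857 and 0.741 reported earlier.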





Presentation in Mtl

2004-04-14 Thread Stephane James Vaucher
Hi everyone,

I did a presentation tonight in Montreal at a java users group metting.
I've got to say that they were maybe 4 companies present that use Lucene
and find it very useful and simple to use. It lead to the longuest
discussion (positive that is) I having at the users' group.

So I've got to tell the Lucene contributors GOOD JOB!

I'll probably upload my ppt presentation (heavily based on existing
tutorials) to the wiki, so you can comment it.

cheers,
sv





Re: Presentation in Mtl

2004-04-14 Thread Stephane James Vaucher
Wow discussion Lucene in French for 2 1/2 hours has affected my english.
Please ignore spelling mistakes ;), but don't ignore the spirit of the
message.

sv




Re: Presentation in Mtl

2004-04-14 Thread Matt Quail
I too gave a Lucene presentation to my local JUG (Canberra, Australia)
last night.
It also went over very well. Lucene totally rocks!

=Matt



Re: suitability of lucene for project

2004-04-14 Thread Sebastian Ho
I will be searching webpages (URLs given by the user) for keywords (in
clinical records). Will that be structured or unstructured? The records
might be in a table or in a list of URLs pointing to individual record
webpages.

thks

sebastian


On Tue, 2004-04-13 at 11:15, Stephane James Vaucher wrote:
 It could be part of your solution, but I don't think so. Let me explain:
 
 I've done something similar to what you describe a few times. I often
 use HttpUnit to get the information. How you process it is up to you.
 If you want it to be indexed (searchable), you can use Lucene. If you
 want to extract structured (or semi-structured) information, use
 wrapper induction techniques (not Lucene).
 
 cheers,
 sv
 
 On 13 Apr 2004, Sebastian Ho wrote:
 
  hi all
  
  i am investigating technologies to use for a project which basically
  retrieves html pages on a regular basis(or whenever there are changes)
  and allow html parsing to extract specific information, and presenting
  them as links in a webpage. Note that this is not a general search
  engine kind of project but we are extracting clinical information from
  various website and consolidating them.
  
  Pls advise me whether Lucene can do the above and in areas where it
  cannot, suggestions to solutions will be appreciated.
  
  Thanks
  
  Sebastian Ho
  Bioinformatics Institute
  
  