Re: suitability of lucene for project

2004-04-14 Thread Sebastian Ho
I will be searching web pages (URLs supplied by the user) for keywords
that appear in clinical records. Would that count as structured or
unstructured? The records might be in a table, or in a list of URLs
pointing to individual record pages.

Thanks,

sebastian


On Tue, 2004-04-13 at 11:15, Stephane James Vaucher wrote:
> Lucene could be part of your solution, but I don't think it is the
> whole answer. Let me explain:
> 
> I've done something similar to what you describe a few times. I often
> use HttpUnit to fetch the pages. How you process them is up to you. If
> you want the content to be indexed (searchable), you can use Lucene. If
> you want to extract structured (or semi-structured) information, use
> wrapper induction techniques (not Lucene).
> 
> cheers,
> sv
> 
> On 13 Apr 2004, Sebastian Ho wrote:
> 
> > Hi all,
> > 
> > I am investigating technologies for a project that retrieves HTML
> > pages on a regular basis (or whenever they change), parses the HTML
> > to extract specific information, and presents the results as links on
> > a web page. Note that this is not a general search-engine project; we
> > are extracting clinical information from various websites and
> > consolidating it.
> > 
> > Please advise me whether Lucene can do the above. For the areas where
> > it cannot, suggestions for alternatives would be appreciated.
> > 
> > Thanks
> > 
> > Sebastian Ho
> > Bioinformatics Institute
> > 
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]

Re: Presentation in Mtl

2004-04-14 Thread Matt Quail
I too gave a Lucene presentation to my local JUG (Canberra, Australia)
last night.
It also went over very well. Lucene totally rocks!

=Matt

Stephane James Vaucher wrote:

Hi everyone,

I did a presentation tonight in Montreal at a Java users group meeting.
I've got to say that there were maybe 4 companies present that use Lucene
and find it very useful and simple to use. It led to the longest
discussion (positive, that is) I have had at the users' group.
So I've got to tell the Lucene contributors GOOD JOB!

I'll probably upload my ppt presentation (heavily based on existing
tutorials) to the wiki, so you can comment on it.
cheers,
sv

Re: Presentation in Mtl

2004-04-14 Thread Stephane James Vaucher
Wow, discussing Lucene in French for 2 1/2 hours has affected my English.
Please ignore spelling mistakes ;), but don't ignore the spirit of the
message.

sv

On Thu, 15 Apr 2004, Stephane James Vaucher wrote:

> Hi everyone,
>
> I did a presentation tonight in Montreal at a Java users group meeting.
> I've got to say that there were maybe 4 companies present that use Lucene
> and find it very useful and simple to use. It led to the longest
> discussion (positive, that is) I have had at the users' group.
>
> So I've got to tell the Lucene contributors GOOD JOB!
>
> I'll probably upload my ppt presentation (heavily based on existing
> tutorials) to the wiki, so you can comment on it.
>
> cheers,
> sv

Presentation in Mtl

2004-04-14 Thread Stephane James Vaucher
Hi everyone,

I did a presentation tonight in Montreal at a Java users group meeting.
I've got to say that there were maybe 4 companies present that use Lucene
and find it very useful and simple to use. It led to the longest
discussion (positive, that is) I have had at the users' group.

So I've got to tell the Lucene contributors GOOD JOB!

I'll probably upload my ppt presentation (heavily based on existing
tutorials) to the wiki, so you can comment on it.

cheers,
sv




RE: Result scoring question

2004-04-14 Thread Armbrust, Daniel C.
I should have remembered that.

Here are the 3 explanations for the top 3 documents returned (contents below)

3.3513687 = product of:
  6.7027373 = weight(preferred_designation:"renal calculus" in 48270), product of:
    0.8114604 = queryWeight(preferred_designation:"renal calculus"), product of:
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.04297941 = queryNorm
    8.260092 = fieldWeight(preferred_designation:"renal calculus" in 48270), product of:
      1.0 = tf(phraseFreq=1.0)
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.4375 = fieldNorm(field=preferred_designation, doc=48270)
  0.5 = coord(1/2)

2.8726017 = product of:
  5.7452035 = weight(preferred_designation:"renal calculus" in 514631), product of:
    0.8114604 = queryWeight(preferred_designation:"renal calculus"), product of:
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.04297941 = queryNorm
    7.080079 = fieldWeight(preferred_designation:"renal calculus" in 514631), product of:
      1.0 = tf(phraseFreq=1.0)
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.375 = fieldNorm(field=preferred_designation, doc=514631)
  0.5 = coord(1/2)

2.4832542 = product of:
  4.9665084 = weight(other_designation:"renal calculus" in 481129), product of:
    0.58440757 = queryWeight(other_designation:"renal calculus"), product of:
      13.5973835 = idf(other_designation: renal=8560 calculus=971)
      0.04297941 = queryNorm
    8.498364 = fieldWeight(other_designation:"renal calculus" in 481129), product of:
      1.0 = tf(phraseFreq=1.0)
      13.5973835 = idf(other_designation: renal=8560 calculus=971)
      0.625 = fieldNorm(field=other_designation, doc=481129)
  0.5 = coord(1/2)
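
The arithmetic in these explanations can be checked by hand. The sketch below is plain Java with no Lucene dependency; the class and method names are made up for illustration, and the formula is only what the explain output itself shows: score = coord * (idf * queryNorm) * (tf * idf * fieldNorm). The input numbers are copied from the first and third explanations.

```java
// Hypothetical helper, not part of Lucene: recomputes the products
// printed in the explain output for two of the three documents.
final class ExplainArithmetic {

    /** queryWeight = idf * queryNorm */
    static double queryWeight(double idf, double queryNorm) {
        return idf * queryNorm;
    }

    /** fieldWeight = tf * idf * fieldNorm */
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    /** score = coord * queryWeight * fieldWeight */
    static double score(double tf, double idf, double queryNorm,
                        double fieldNorm, double coord) {
        return coord * queryWeight(idf, queryNorm) * fieldWeight(tf, idf, fieldNorm);
    }

    public static void main(String[] args) {
        double queryNorm = 0.04297941;
        double coord = 0.5; // coord(1/2): one of the two OR clauses matched

        // Document 48270, matched in preferred_designation, fieldNorm 0.4375
        double first = score(1.0, 18.88021, queryNorm, 0.4375, coord);
        // Document 481129, matched in other_designation, fieldNorm 0.625
        double third = score(1.0, 13.5973835, queryNorm, 0.625, coord);

        System.out.println("doc 48270:  " + first);
        System.out.println("doc 481129: " + third);
    }
}
```

Document 481129 has the largest fieldNorm of the three (0.625) and still scores lowest, because the idf for other_designation (13.5973835) is lower than the idf for preferred_designation (18.88021), and idf enters the product twice, once in queryWeight and once in fieldWeight.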


Is there anything that I can do in my query construction, to ensure that if a query 
exactly matches a document, it will be the top result?

Thanks, 

Dan


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 14, 2004 12:17 PM
To: Lucene Users List
Subject: Re: Result scoring question

Try using IndexSearcher.explain (and then a toString on the resulting 
Explanation object) to see the details of why things are scoring how 
they are.  This can be most enlightening!

Erik


On Apr 14, 2004, at 12:16 PM, Armbrust, Daniel C. wrote:

> I know that the Lucene scoring algorithm is pretty complicated, and I 
> know I don't understand all the pieces.  But given these documents:
>
> A) -  left renal calculus
> B) -  renal calculus
>
> Should a query of
>
> other_designation:("renal calculus") OR preferred_designation:("renal 
> calculus")
>
> Score document B higher than document A?
>
> Those documents are a made up example.  Here are the documents and 
> scores I am getting back from the query on my real index:
>
> Score 1.0 - Document 
> Text diverticulum> Unindexed Text 
> Keyword 
> Keyword>
>
> Score 0.85714287 - 
> Document 
> Keyword Text 
> Unindexed Text in a solitary left kidney> Text>
>
> Score 0.7409672 - Document 
> Text Unindexed 
> Text Keyword 
> Keyword>
>
>
> Am I just making a dumb mistake somewhere?
>
> Thanks,
>
> Dan

Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)

2004-04-14 Thread Kevin A. Burton
petite_abeille wrote:

On Apr 13, 2004, at 02:45, Kevin A. Burton wrote:

He mentioned that I might be able to squeeze 5-10% out of index 
merges this way.


Talking of which... what strategy(ies) do people use to minimize 
downtime when updating an index?

This should probably be a wiki page.

Anyway... two thoughts I had on the subject a while back:

You maintain two disks (not RAID ... you get reliability through software).

Searches are load balanced between disks for performance reasons.  If 
one fails you just stop using it.

When you want to do an index merge you read from disk0 and write to 
disk1.  Then you take disk0 out of search rotation and add disk1 and 
copy the contents of disk1 to disk two.  Users shouldn't notice much of 
a performance issue during the merge because it will be VERY fast and 
it's just reads from disk0.
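
The rotation described above can be sketched in plain Java. This is only an illustration under assumptions: `IndexRotation` and the disk names are hypothetical, the merge and copy steps are left as comments, and keeping disk1 out of rotation while it is being written is an assumed reading of the description rather than something it states.

```java
import java.util.Arrays;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: tracks which index copies may serve searches.
final class IndexRotation {
    private final CopyOnWriteArrayList<String> active = new CopyOnWriteArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    IndexRotation(String... disks) {
        active.addAll(Arrays.asList(disks));
    }

    /** Round-robin choice among the copies currently in rotation. */
    String pickForSearch() {
        Object[] snapshot = active.toArray(); // stable view even if rotation changes
        if (snapshot.length == 0) {
            throw new IllegalStateException("no index copy in rotation");
        }
        int i = Math.floorMod(counter.getAndIncrement(), snapshot.length);
        return (String) snapshot[i];
    }

    void takeOut(String disk) {
        active.remove(disk);
    }

    void putBack(String disk) {
        active.addIfAbsent(disk);
    }

    public static void main(String[] args) {
        IndexRotation rotation = new IndexRotation("disk0", "disk1");

        rotation.takeOut("disk1");   // assumption: disk1 is not searched while rewritten
        // ... merge: read from disk0, write the merged index to disk1 ...
        System.out.println("during merge: " + rotation.pickForSearch());

        rotation.putBack("disk1");   // swap: the merged copy starts serving searches
        rotation.takeOut("disk0");
        // ... copy the merged index from disk1 onto the other disk ...
        System.out.println("after swap: " + rotation.pickForSearch());

        rotation.putBack("disk0");   // both copies serve searches again
    }
}
```

Adding disk1 back before removing disk0 means there is never a moment with no copy in rotation, which is the point of the scheme: searches keep running while the merge happens elsewhere.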

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


Re: Result scoring question

2004-04-14 Thread Erik Hatcher
Try using IndexSearcher.explain (and then a toString on the resulting 
Explanation object) to see the details of why things are scoring how 
they are.  This can be most enlightening!

	Erik

On Apr 14, 2004, at 12:16 PM, Armbrust, Daniel C. wrote:

I know that the Lucene scoring algorithm is pretty complicated, and I 
know I don't understand all the pieces.  But given these documents:

A) -  left renal calculus
B) -  renal calculus
Should a query of

other_designation:("renal calculus") OR preferred_designation:("renal 
calculus")

Score document B higher than document A?

Those documents are a made up example.  Here are the documents and 
scores I am getting back from the query on my real index:

Score 1.0 - Document 
Text Unindexed Text 
Keyword 
Keyword>

Score 0.85714287 - 
Document 
Keyword Text 
Unindexed Text Text>

Score 0.7409672 - Document 
Text Unindexed 
Text Keyword 
Keyword>

Am I just making a dumb mistake somewhere?

Thanks,

Dan


Result scoring question

2004-04-14 Thread Armbrust, Daniel C.
I know that the Lucene scoring algorithm is pretty complicated, and I know I don't 
understand all the pieces.  But given these documents:

A) -  left renal calculus
B) -  renal calculus

Should a query of 

other_designation:("renal calculus") OR preferred_designation:("renal calculus")

Score document B higher than document A?

Those documents are a made up example.  Here are the documents and scores I am getting 
back from the query on my real index:

Score 1.0 - Document Text Unindexed 
Text Keyword 
Keyword>

Score 0.85714287 - Document 
Keyword Text Unindexed 
Text 
Text>

Score 0.7409672 - Document Text Unindexed Text Keyword 
Keyword>


Am I just making a dumb mistake somewhere?

Thanks, 

Dan


Re: Closing IndexWriter object after each file causes NullPointerException?

2004-04-14 Thread Brisbart Franck
I'm not sure I understand what your problem is.
Anyway, the writeLock is used to prevent two different writers (or a
reader, if you use 'delete') from modifying the same index.
What do you mean by "first file"?

Franck

jitender ahuja wrote:
Hi,

OK, but what is the use of the writeLock, given that the directory is
modified anyway?
If the writeLock were the issue, then the index directory should contain
index information only for the first file.
Thanks,
Jitender
- Original Message - 
From: "Brisbart Franck" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; "Lucene Users List"
<[EMAIL PROTECTED]>
Sent: Tuesday, April 13, 2004 10:15 PM
Subject: Re: Closing IndexWriter object after each file causes
NullPointerException?



If you close an IndexWriter more than once, releasing the writeLock
throws a NullPointerException.
You should clean up your code and close your writer only once. Anyway, I
don't know why there's no null test on the 'writeLock' as there is in the
'finalize' method.
I think it's a minor bug, so I suggest the attached patch to fix it.
Franck Brisbart

jitender ahuja wrote:

Hi,
Can anyone tell me the cause of the following error? I have ruled out
both of the following:
a) The index directory being closed after each file of the directory (to
be indexed): verified by the index directory size changing with the
number of files indexed
b) The IndexWriter object having been closed: verified by checking that
the IndexWriter object (here, writ) is non-null, with the line:
   System.out.println(writ != null); in the attached code
Error output:
java.lang.NullPointerException
   at org.apache.lucene.index.IndexWriter.close(Unknown Source)
   at IndexDatanew.indexDocs(IndexDatanew.java:89)
   at IndexDatanew.indexDocs(IndexDatanew.java:50)
   at IndexDatanew.main(IndexDatanew.java:25)
The code that causes this error is working fine otherwise (i.e. for
indexing purposes) and is "attached"; the output in detail for a
directory of 2 files is also attached.:
Thanks
Jitender


C:\lucroche>java IndexDatanew E:\freebooks\books\whole\jiten
Index Directory: E:\freebooks\books\whole\jiten
2
E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
adding: E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
File contents from buffer:
E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
false
E:\freebooks\books\whole\jiten\TIJ3_c.htm
adding: E:\freebooks\books\whole\jiten\TIJ3_c.htm
File contents from buffer:
E:\freebooks\books\whole\jiten\TIJ3_c.htm
false
java.lang.NullPointerException
   at org.apache.lucene.index.IndexWriter.close(Unknown Source)
   at IndexDatanew.indexDocs(IndexDatanew.java:89)
   at IndexDatanew.indexDocs(IndexDatanew.java:50)
   at IndexDatanew.main(IndexDatanew.java:25)





--
Franck Brisbart
R&D
http://www.kelkoo.com






Index: IndexWriter.java
===
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
retrieving revision 1.28
diff -u -r1.28 IndexWriter.java
--- IndexWriter.java 25 Mar 2004 19:34:53 - 1.28
+++ IndexWriter.java 13 Apr 2004 16:39:56 -
@@ -235,8 +235,10 @@
   public synchronized void close() throws IOException {
     flushRamSegments();
     ramDirectory.close();
-    writeLock.release();  // release write lock
-    writeLock = null;
+    if (writeLock != null) {
+      writeLock.release();  // release write lock
+      writeLock = null;
+    }
     directory.close();
   }
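
The effect of the null check in the patch can be reproduced without Lucene. The classes below are stand-ins invented for this illustration (they are not Lucene's IndexWriter or Lock); the sketch only shows that a guarded close() tolerates being called twice, whereas the unguarded form would dereference a null lock on the second call.

```java
import java.io.Closeable;

// Hypothetical stand-ins, not Lucene classes.
final class GuardedCloseDemo {

    /** Minimal lock placeholder that only counts releases. */
    static final class DemoLock {
        private int releases = 0;

        void release() {
            releases++;
        }

        int releases() {
            return releases;
        }
    }

    /** Writer placeholder whose close() mirrors the patched logic. */
    static final class DemoWriter implements Closeable {
        private final DemoLock lockForInspection = new DemoLock();
        private DemoLock writeLock = lockForInspection;

        @Override
        public synchronized void close() {
            if (writeLock != null) {   // the guard added by the patch
                writeLock.release();   // release write lock
                writeLock = null;      // later calls skip this block
            }
        }

        int lockReleases() {
            return lockForInspection.releases();
        }
    }

    public static void main(String[] args) {
        DemoWriter writer = new DemoWriter();
        writer.close();
        writer.close(); // would throw NullPointerException without the guard
        System.out.println("lock released " + writer.lockReleases() + " time(s)");
    }
}
```

The guard makes a repeated close() harmless, but the advice to open the writer once, add every file, and close it once at the end still applies.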




--
Franck Brisbart
R&D
http://www.kelkoo.com

Re: How to retrieve the terms that matched

2004-04-14 Thread Erik Hatcher
Have a look at the Highlighter code that lives in the Lucene sandbox.  
It is a new addition there, but has been available for some time from 
the creator's website.  I'm not sure if this will give you the 
information you need directly, but it would be a start.
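
Separately from the Highlighter, the underlying idea (reporting which query terms occur in which field of a hit) can be illustrated with plain Java. This sketch does not use any Lucene API; the class name, the field names and the naive lowercase-and-split tokenization are assumptions made for the example, and a real solution would have to tokenize the same way the index does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not Lucene code.
final class MatchedTermsSketch {

    /**
     * For each field, returns the (lowercase) query terms that occur in it.
     * Fields without any matching term are left out of the result.
     */
    static Map<String, Set<String>> matchedTerms(Map<String, String> fields,
                                                 Set<String> queryTerms) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Naive tokenization: lowercase, split on non-word characters.
            String[] tokens = field.getValue().toLowerCase(Locale.ROOT).split("\\W+");
            Set<String> hits = new TreeSet<>(queryTerms);
            hits.retainAll(new HashSet<>(Arrays.asList(tokens)));
            if (!hits.isEmpty()) {
                result.put(field.getKey(), hits);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> hit = new LinkedHashMap<>();
        hit.put("title", "Lucene in Action");
        hit.put("body", "Searching and indexing with Lucene");

        Set<String> query = new HashSet<>(Arrays.asList("lucene", "indexing", "solr"));
        System.out.println(matchedTerms(hit, query));
    }
}
```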

	Erik

On Apr 14, 2004, at 8:27 AM, David Thibau wrote:

Perhaps a silly question, but ...

I cannot find a way to retrieve the matched terms of a found document.
We construct a Lucene query that searches different fields with OR
clauses, and we want to display to the user, for each result, the
term(s) that matched.
Is this possible with the Lucene API?

Thanks in advance
David THIBAU



How to retrieve the terms that matched

2004-04-14 Thread David Thibau
Perhaps a silly question, but ...

I cannot find a way to retrieve the matched terms of a found document.
We construct a Lucene query that searches different fields with OR
clauses, and we want to display to the user, for each result, the
term(s) that matched.
Is this possible with the Lucene API?

Thanks in advance
David THIBAU



Re: Closing IndexWriter object after each file causes NullPointerException?

2004-04-14 Thread jitender ahuja
Hi,

OK, but what is the use of the writeLock, given that the directory is
modified anyway?
If the writeLock were the issue, then the index directory should contain
index information only for the first file.

Thanks,
Jitender

- Original Message - 
From: "Brisbart Franck" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; "Lucene Users List"
<[EMAIL PROTECTED]>
Sent: Tuesday, April 13, 2004 10:15 PM
Subject: Re: Closing IndexWriter object after each file causes
NullPointerException?


> If you close an IndexWriter more than once, releasing the writeLock
> throws a NullPointerException.
> You should clean up your code and close your writer only once. Anyway, I
> don't know why there's no null test on the 'writeLock' as there is in the
> 'finalize' method.
> I think it's a minor bug, so I suggest the attached patch to fix it.
>
> Franck Brisbart
>
>
> jitender ahuja wrote:
> > Hi,
> > Can anyone tell me the cause of the following error? I have ruled out
> > both of the following:
> > a) The index directory being closed after each file of the directory (to
> > be indexed): verified by the index directory size changing with the
> > number of files indexed
> > b) The IndexWriter object having been closed: verified by checking that
> > the IndexWriter object (here, writ) is non-null, with the line:
> > System.out.println(writ != null); in the attached code
> >
> >
> > Error output:
> >  java.lang.NullPointerException
> > at org.apache.lucene.index.IndexWriter.close(Unknown Source)
> > at IndexDatanew.indexDocs(IndexDatanew.java:89)
> > at IndexDatanew.indexDocs(IndexDatanew.java:50)
> > at IndexDatanew.main(IndexDatanew.java:25)
> >
> > The code that causes this error is working fine otherwise (i.e. for
> > indexing purposes) and is "attached"; the output in detail for a
> > directory of 2 files is also attached.:
> >
> > Thanks
> > Jitender
> >
> >
> > 
> >
> > C:\lucroche>java IndexDatanew E:\freebooks\books\whole\jiten
> > Index Directory: E:\freebooks\books\whole\jiten
> > 2
> > E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > adding: E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > File contents from buffer:
> > E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > false
> > E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > adding: E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > File contents from buffer:
> > E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > false
> > java.lang.NullPointerException
> > at org.apache.lucene.index.IndexWriter.close(Unknown Source)
> > at IndexDatanew.indexDocs(IndexDatanew.java:89)
> > at IndexDatanew.indexDocs(IndexDatanew.java:50)
> > at IndexDatanew.main(IndexDatanew.java:25)
> >
> -- 
> Franck Brisbart
> R&D
> http://www.kelkoo.com
>






> Index: IndexWriter.java
> ===
> RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
> retrieving revision 1.28
> diff -u -r1.28 IndexWriter.java
> --- IndexWriter.java 25 Mar 2004 19:34:53 - 1.28
> +++ IndexWriter.java 13 Apr 2004 16:39:56 -
> @@ -235,8 +235,10 @@
>    public synchronized void close() throws IOException {
>      flushRamSegments();
>      ramDirectory.close();
> -    writeLock.release();  // release write lock
> -    writeLock = null;
> +    if (writeLock != null) {
> +      writeLock.release();  // release write lock
> +      writeLock = null;
> +    }
>      directory.close();
>    }
>
>

