Re: suitability of lucene for project
I will be searching webpages (URL given by the user) for a keyword (in a clinical record). Would that count as structured or unstructured? The records might be in a table, or in a list of URLs pointing to individual record webpages.

Thanks,
Sebastian

On Tue, 2004-04-13 at 11:15, Stephane James Vaucher wrote:
> It could be part of your solution, but I don't think it covers everything.
> Let me explain:
>
> I've done something similar to what you describe a few times. I often use
> HttpUnit to get the information. How you process it is up to you. If you
> want it to be indexed (searchable), you can use Lucene. If you want to
> extract structured (or semi-structured) information, use wrapper induction
> techniques (not Lucene).
>
> cheers,
> sv
>
> On 13 Apr 2004, Sebastian Ho wrote:
>
> > hi all
> >
> > I am investigating technologies to use for a project which basically
> > retrieves HTML pages on a regular basis (or whenever there are changes),
> > parses the HTML to extract specific information, and presents the
> > results as links in a webpage. Note that this is not a general search
> > engine kind of project; we are extracting clinical information from
> > various websites and consolidating it.
> >
> > Please advise me whether Lucene can do the above and, in areas where it
> > cannot, suggestions for solutions will be appreciated.
> >
> > Thanks
> >
> > Sebastian Ho
> > Bioinformatics Institute

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
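As a rough illustration of the split suggested in the quoted reply (fetch and extract first, index separately), here is a minimal Java sketch that pulls out the links whose anchor text mentions a keyword from an already fetched HTML string. The class `LinkScan`, the method `linksMentioning`, and the sample HTML are invented for this example. The regular expression is only a stand-in for a real HTML parser or for HttpUnit, and no Lucene is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene or HttpUnit.
final class LinkScan {
    // Matches simple anchors of the form <a href="URL" ...>text</a>.
    // Real-world HTML needs a proper parser; this pattern is an assumption
    // that only holds for tidy, double-quoted markup.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href of every anchor whose text contains the keyword
    // (case-insensitive substring match).
    static List<String> linksMentioning(String html, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> urls = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String text = m.group(2).toLowerCase(Locale.ROOT);
            if (text.contains(needle)) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up page listing two record links.
        String html = "<ul>"
            + "<li><a href=\"/rec/1\">Renal calculus, left</a></li>"
            + "<li><a href=\"/rec/2\">Fracture of femur</a></li>"
            + "</ul>";
        System.out.println(linksMentioning(html, "renal"));
    }
}
```

The list of URLs returned here is what would be presented as links; whether the page text is additionally fed to Lucene for searching is a separate decision.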
Re: Presentation in Mtl
I too gave a Lucene presentation to my local JUG (Canberra, Australia) last night. It also went over very well.

Lucene totally rocks!

=Matt

Stephane James Vaucher wrote:
> Hi everyone,
>
> I did a presentation tonight in Montreal at a Java users group meeting.
> I've got to say that there were maybe 4 companies present that use Lucene
> and find it very useful and simple to use. It led to the longest
> discussion (positive, that is) I have had at the users' group.
>
> So I've got to tell the Lucene contributors: GOOD JOB!
>
> I'll probably upload my ppt presentation (heavily based on existing
> tutorials) to the wiki, so you can comment on it.
>
> cheers,
> sv
Re: Presentation in Mtl
Wow, discussing Lucene in French for 2 1/2 hours has affected my English. Please ignore the spelling mistakes ;), but don't ignore the spirit of the message.

sv

On Thu, 15 Apr 2004, Stephane James Vaucher wrote:
> Hi everyone,
>
> I did a presentation tonight in Montreal at a Java users group meeting.
> I've got to say that there were maybe 4 companies present that use Lucene
> and find it very useful and simple to use. It led to the longest
> discussion (positive, that is) I have had at the users' group.
>
> So I've got to tell the Lucene contributors: GOOD JOB!
>
> I'll probably upload my ppt presentation (heavily based on existing
> tutorials) to the wiki, so you can comment on it.
>
> cheers,
> sv
Presentation in Mtl
Hi everyone,

I did a presentation tonight in Montreal at a Java users group meeting. I've got to say that there were maybe 4 companies present that use Lucene and find it very useful and simple to use. It led to the longest discussion (positive, that is) I have had at the users' group.

So I've got to tell the Lucene contributors: GOOD JOB!

I'll probably upload my ppt presentation (heavily based on existing tutorials) to the wiki, so you can comment on it.

cheers,
sv
RE: Result scoring question
I should have remembered that. Here are the 3 explanations for the top 3 documents returned (contents below):

3.3513687 = product of:
  6.7027373 = weight(preferred_designation:"renal calculus" in 48270), product of:
    0.8114604 = queryWeight(preferred_designation:"renal calculus"), product of:
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.04297941 = queryNorm
    8.260092 = fieldWeight(preferred_designation:"renal calculus" in 48270), product of:
      1.0 = tf(phraseFreq=1.0)
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.4375 = fieldNorm(field=preferred_designation, doc=48270)
  0.5 = coord(1/2)

2.8726017 = product of:
  5.7452035 = weight(preferred_designation:"renal calculus" in 514631), product of:
    0.8114604 = queryWeight(preferred_designation:"renal calculus"), product of:
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.04297941 = queryNorm
    7.080079 = fieldWeight(preferred_designation:"renal calculus" in 514631), product of:
      1.0 = tf(phraseFreq=1.0)
      18.88021 = idf(preferred_designation: renal= calculus=37)
      0.375 = fieldNorm(field=preferred_designation, doc=514631)
  0.5 = coord(1/2)

2.4832542 = product of:
  4.9665084 = weight(other_designation:"renal calculus" in 481129), product of:
    0.58440757 = queryWeight(other_designation:"renal calculus"), product of:
      13.5973835 = idf(other_designation: renal=8560 calculus=971)
      0.04297941 = queryNorm
    8.498364 = fieldWeight(other_designation:"renal calculus" in 481129), product of:
      1.0 = tf(phraseFreq=1.0)
      13.5973835 = idf(other_designation: renal=8560 calculus=971)
      0.625 = fieldNorm(field=other_designation, doc=481129)
  0.5 = coord(1/2)

Is there anything that I can do in my query construction to ensure that if a query exactly matches a document, it will be the top result?

Thanks,

Dan

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 14, 2004 12:17 PM
To: Lucene Users List
Subject: Re: Result scoring question

Try using IndexSearcher.explain (and then a toString on the resulting Explanation object) to see the details of why things are scoring how they are. This can be most enlightening!

Erik

On Apr 14, 2004, at 12:16 PM, Armbrust, Daniel C. wrote:

> I know that the Lucene scoring algorithm is pretty complicated, and I
> know I don't understand all the pieces. But given these documents:
>
> A) - left renal calculus
> B) - renal calculus
>
> Should a query of
>
> other_designation:("renal calculus") OR preferred_designation:("renal
> calculus")
>
> score document B higher than document A?
>
> Those documents are a made-up example. Here are the documents and
> scores I am getting back from the query on my real index:
>
> Score 1.0 - Document
> Text diverticulum> Unindexed Text
> Keyword
> Keyword>
>
> Score 0.85714287 -
> Document
> Keyword Text
> Unindexed Text in a solitary left kidney> Text>
>
> Score 0.7409672 - Document
> Text Unindexed
> Text Keyword
> Keyword>
>
> Am I just making a dumb mistake somewhere?
>
> Thanks,
>
> Dan
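To make the three explanations easier to compare, the arithmetic they print can be redone by hand. The sketch below is plain Java with no Lucene dependency; the class `ExplainMath` and its method names are invented for this example, and the formulas are simply the "product of" relationships shown in the output (queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm, score = queryWeight * fieldWeight * coord).

```java
// Hypothetical helper that re-does the arithmetic printed by the
// Explanation output quoted in this thread. It does not call Lucene.
final class ExplainMath {
    // queryWeight = idf * queryNorm
    static double queryWeight(double idf, double queryNorm) {
        return idf * queryNorm;
    }

    // fieldWeight = tf * idf * fieldNorm
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    // score = queryWeight * fieldWeight * coord
    static double score(double tf, double idf, double queryNorm,
                        double fieldNorm, double coord) {
        return queryWeight(idf, queryNorm)
             * fieldWeight(tf, idf, fieldNorm)
             * coord;
    }

    public static void main(String[] args) {
        double queryNorm = 0.04297941; // same for all three hits
        double coord = 0.5;            // 1 of 2 clauses matched

        // Values copied from the three explanations above.
        System.out.println(score(1.0, 18.88021, queryNorm, 0.4375, coord));   // doc 48270
        System.out.println(score(1.0, 18.88021, queryNorm, 0.375, coord));    // doc 514631
        System.out.println(score(1.0, 13.5973835, queryNorm, 0.625, coord));  // doc 481129
    }
}
```

Read this way, tf, queryNorm and coord are identical for all three hits, so the ranking comes down to idf and fieldNorm. Document 481129 has the largest fieldNorm (0.625), but its match is in other_designation, where idf is 13.5973835 instead of 18.88021, and idf enters the product twice.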
Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)
petite_abeille wrote:
> On Apr 13, 2004, at 02:45, Kevin A. Burton wrote:
>
> > He mentioned that I might be able to squeeze 5-10% out of index merges
> > this way.
>
> Talking of which... what strategy(ies) do people use to minimize downtime
> when updating an index?

This should probably be a wiki page.

Anyway... two thoughts I had on the subject a while back:

You maintain two disks (not RAID ... you get reliability through software).

Searches are load balanced between the disks for performance reasons. If one fails you just stop using it.

When you want to do an index merge you read from disk0 and write to disk1. Then you take disk0 out of search rotation, add disk1, and copy the contents of disk1 back to disk0. Users shouldn't notice much of a performance issue during the merge because it will be VERY fast and it's just reads from disk0.

Kevin

--
Please reply using PGP. http://peerfear.org/pubkey.asc
NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
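The rotation described above can be sketched in a few lines of Java. This is a simplification under stated assumptions: the class `IndexRotation` and its methods are invented for this example, index locations are plain strings, only one copy serves searches at a time (the load balancing across both disks is left out), and the actual merge and copy steps are the caller's job.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical bookkeeping for a two-copy index; not a Lucene class.
final class IndexRotation {
    // Location that searches should currently read from.
    private final AtomicReference<String> active;
    // Location that the next merge should write to.
    private String standby;

    IndexRotation(String activeDir, String standbyDir) {
        this.active = new AtomicReference<>(activeDir);
        this.standby = standbyDir;
    }

    // Searchers call this each time they open the index.
    String activeDir() {
        return active.get();
    }

    // The merge job writes the new index here while searches keep
    // reading from activeDir().
    synchronized String standbyDir() {
        return standby;
    }

    // Called once the merge into standbyDir() has finished: the freshly
    // written copy starts serving searches and the old copy becomes the
    // target for the next merge (or for a copy of the new index).
    synchronized void swap() {
        String previous = active.getAndSet(standby);
        standby = previous;
    }

    public static void main(String[] args) {
        IndexRotation rotation = new IndexRotation("disk0/index", "disk1/index");
        // ... merge from disk0 into disk1 happens here ...
        rotation.swap();
        System.out.println("searching: " + rotation.activeDir());
    }
}
```

Searches never see a half-written index because the switch is a single reference update made only after the merge has completed.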
Re: Result scoring question
Try using IndexSearcher.explain (and then a toString on the resulting Explanation object) to see the details of why things are scoring how they are. This can be most enlightening!

Erik

On Apr 14, 2004, at 12:16 PM, Armbrust, Daniel C. wrote:

> I know that the Lucene scoring algorithm is pretty complicated, and I
> know I don't understand all the pieces. But given these documents:
>
> A) - left renal calculus
> B) - renal calculus
>
> Should a query of
>
> other_designation:("renal calculus") OR preferred_designation:("renal calculus")
>
> score document B higher than document A?
>
> Those documents are a made-up example. Here are the documents and
> scores I am getting back from the query on my real index:
>
> Score 1.0 - Document Text Unindexed Text Keyword Keyword>
>
> Score 0.85714287 - Document Keyword Text Unindexed Text Text>
>
> Score 0.7409672 - Document Text Unindexed Text Keyword Keyword>
>
> Am I just making a dumb mistake somewhere?
>
> Thanks,
>
> Dan
Result scoring question
I know that the Lucene scoring algorithm is pretty complicated, and I know I don't understand all the pieces. But given these documents:

A) - left renal calculus
B) - renal calculus

Should a query of

other_designation:("renal calculus") OR preferred_designation:("renal calculus")

score document B higher than document A?

Those documents are a made-up example. Here are the documents and scores I am getting back from the query on my real index:

Score 1.0 - Document Text Unindexed Text Keyword Keyword>

Score 0.85714287 - Document Keyword Text Unindexed Text Text>

Score 0.7409672 - Document Text Unindexed Text Keyword Keyword>

Am I just making a dumb mistake somewhere?

Thanks,

Dan
Re: Closing IndexWriter object after each file causes NullPointerException?
I'm not sure I understand what your problem is.
Anyway, the writeLock is used to prevent two different writers (or readers, if you use 'delete') from modifying the same index.
What do you mean by "first file"?

Franck

jitender ahuja wrote:
> Hi,
> OK, but what is the use of the writeLock, as the directory is modified
> anyway? If the writeLock were the issue, the index directory should
> contain index information only for the first file.
>
> Thanks,
> Jitender
>
> ----- Original Message -----
> From: "Brisbart Franck" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>; "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, April 13, 2004 10:15 PM
> Subject: Re: Closing IndexWriter object after each file causes NullPointerException?
>
> > If you close an IndexWriter more than once, the release of the writeLock
> > causes a NullPointerException.
> > You should clean up your code and close your writer only once. Anyway, I
> > don't know why there's no test on the 'writeLock' as in the 'finalize'
> > method.
> > I think it's a minor bug, so I suggest the attached patch to fix it.
> >
> > Franck Brisbart
> >
> > jitender ahuja wrote:
> > > Hi,
> > > Can anyone tell me what causes the following error? The cause is
> > > none of the following:
> > > a) The index directory being closed after each file of the directory
> > > (to be indexed): verified by the directory size changing with the
> > > number of files to be indexed
> > > b) The IndexWriter object being closed out: verified by checking that
> > > the IndexWriter object (here, writ) is non-null, with the line
> > > System.out.println(writ != null); in the attached code
> > >
> > > Error output:
> > > java.lang.NullPointerException
> > >         at org.apache.lucene.index.IndexWriter.close(Unknown Source)
> > >         at IndexDatanew.indexDocs(IndexDatanew.java:89)
> > >         at IndexDatanew.indexDocs(IndexDatanew.java:50)
> > >         at IndexDatanew.main(IndexDatanew.java:25)
> > >
> > > The code that causes this error otherwise works fine (i.e. for
> > > indexing purposes) and is attached; the detailed output for a
> > > directory of 2 files is also attached:
> > >
> > > Thanks
> > > Jitender
> > >
> > > C:\lucroche>java IndexDatanew E:\freebooks\books\whole\jiten
> > > Index Directory: E:\freebooks\books\whole\jiten
> > > 2
> > > E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > > adding: E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > > File contents from buffer:
> > > E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > > false
> > > E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > > adding: E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > > File contents from buffer:
> > > E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > > false
> > > java.lang.NullPointerException
> > >         at org.apache.lucene.index.IndexWriter.close(Unknown Source)
> > >         at IndexDatanew.indexDocs(IndexDatanew.java:89)
> > >         at IndexDatanew.indexDocs(IndexDatanew.java:50)
> > >         at IndexDatanew.main(IndexDatanew.java:25)
> >
> > --
> > Franck Brisbart
> > R&D
> > http://www.kelkoo.com
> >
> > Index: IndexWriter.java
> > ===
> > RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
> > retrieving revision 1.28
> > diff -u -r1.28 IndexWriter.java
> > --- IndexWriter.java 25 Mar 2004 19:34:53 - 1.28
> > +++ IndexWriter.java 13 Apr 2004 16:39:56 -
> > @@ -235,8 +235,10 @@
> >    public synchronized void close() throws IOException {
> >      flushRamSegments();
> >      ramDirectory.close();
> > -    writeLock.release(); // release write lock
> > -    writeLock = null;
> > +    if (writeLock != null) {
> > +      writeLock.release(); // release write lock
> > +      writeLock = null;
> > +    }
> >      directory.close();
> >    }

--
Franck Brisbart
R&D
http://www.kelkoo.com
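The effect of the null guard in the patch above can be shown in isolation. The following is a minimal Java sketch, not Lucene code: `GuardedWriter` and `FakeLock` are stand-ins invented for this example, and only the lock handling of close() is modelled.

```java
import java.io.Closeable;

// Hypothetical stand-in for a writer that holds a write lock.
final class GuardedWriter implements Closeable {
    // Stand-in for the write lock; it just counts how often it is released.
    static final class FakeLock {
        int releases;

        void release() {
            releases++;
        }
    }

    private FakeLock writeLock;

    GuardedWriter(FakeLock lock) {
        this.writeLock = lock;
    }

    // Safe to call more than once: the lock is released on the first call
    // only, and later calls find writeLock == null and skip the release.
    @Override
    public synchronized void close() {
        if (writeLock != null) {
            writeLock.release();
            writeLock = null;
        }
    }

    public static void main(String[] args) {
        FakeLock lock = new FakeLock();
        GuardedWriter writer = new GuardedWriter(lock);
        writer.close();
        writer.close(); // without the guard this call would hit a null lock
        System.out.println("releases: " + lock.releases);
    }
}
```

Independently of the guard, the advice given in the thread still applies on the caller's side: open one writer, add all the files, and close it once after the loop instead of once per file.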
Re: How to retrieve the terms that matched
Have a look at the Highlighter code that lives in the Lucene sandbox. It is a new addition there, but has been available for some time from the creator's website. I'm not sure if this will give you the information you need directly, but it would be a start.

Erik

On Apr 14, 2004, at 8:27 AM, David Thibau wrote:

> Perhaps a silly question, but ...
> I cannot find a way to retrieve the matched terms of a found document.
>
> We construct a Lucene query searching on different fields with an OR
> clause, and we want to display to the user, for each result, the term(s)
> that matched.
>
> Is it possible with the Lucene API?
>
> Thanks in advance
>
> David THIBAU
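For comparison, the crudest way to answer "which terms matched in which field" is to re-check the stored field text against the query terms after the search. The Java sketch below does that with plain string handling. It is not the Highlighter API: the class `MatchedTerms`, the method `find`, and the sample fields are invented for this example, and the whitespace/punctuation split only approximates whatever analyzer was used at index time.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical post-processing step; it does not call Lucene.
final class MatchedTerms {
    // For each field of a hit (field name -> stored text), returns the
    // query terms that occur in that field. Fields without a match are
    // left out of the result.
    static Map<String, Set<String>> find(Map<String, String> fields,
                                         List<String> queryTerms) {
        Map<String, Set<String>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Naive tokenization: lower-case, split on non-word characters.
            Set<String> tokens = new LinkedHashSet<>(Arrays.asList(
                field.getValue().toLowerCase(Locale.ROOT).split("\\W+")));
            Set<String> matched = new LinkedHashSet<>();
            for (String term : queryTerms) {
                if (tokens.contains(term.toLowerCase(Locale.ROOT))) {
                    matched.add(term);
                }
            }
            if (!matched.isEmpty()) {
                hits.put(field.getKey(), matched);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up stored fields of one search hit.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Left renal calculus");
        doc.put("body", "No stones seen");
        System.out.println(find(doc, Arrays.asList("renal", "stones", "kidney")));
    }
}
```

This only works for fields whose text is stored and for plain term queries; wildcard, fuzzy or stemmed matches would need the query to be expanded first, which is part of what the Highlighter code deals with.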
How to retrieve the terms that matched
Perhaps a silly question, but ...
I cannot find a way to retrieve the matched terms of a found document.

We construct a Lucene query searching on different fields with an OR clause, and we want to display to the user, for each result, the term(s) that matched.

Is it possible with the Lucene API?

Thanks in advance

David THIBAU
Re: Closing IndexWriter object after each file causes NullPointerException?
Hi,
OK, but what is the use of the writeLock, as the directory is modified anyway? If the writeLock were the issue, the index directory should contain index information only for the first file.

Thanks,
Jitender

----- Original Message -----
From: "Brisbart Franck" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, April 13, 2004 10:15 PM
Subject: Re: Closing IndexWriter object after each file causes NullPointerException?

> If you close an IndexWriter more than once, the release of the writeLock
> causes a NullPointerException.
> You should clean up your code and close your writer only once. Anyway, I
> don't know why there's no test on the 'writeLock' as in the 'finalize'
> method.
> I think it's a minor bug, so I suggest the attached patch to fix it.
>
> Franck Brisbart
>
> jitender ahuja wrote:
> > Hi,
> > Can anyone tell me what causes the following error? The cause is
> > none of the following:
> > a) The index directory being closed after each file of the directory
> > (to be indexed): verified by the directory size changing with the
> > number of files to be indexed
> > b) The IndexWriter object being closed out: verified by checking that
> > the IndexWriter object (here, writ) is non-null, with the line
> > System.out.println(writ != null); in the attached code
> >
> > Error output:
> > java.lang.NullPointerException
> >         at org.apache.lucene.index.IndexWriter.close(Unknown Source)
> >         at IndexDatanew.indexDocs(IndexDatanew.java:89)
> >         at IndexDatanew.indexDocs(IndexDatanew.java:50)
> >         at IndexDatanew.main(IndexDatanew.java:25)
> >
> > The code that causes this error otherwise works fine (i.e. for
> > indexing purposes) and is attached; the detailed output for a
> > directory of 2 files is also attached:
> >
> > Thanks
> > Jitender
> >
> > C:\lucroche>java IndexDatanew E:\freebooks\books\whole\jiten
> > Index Directory: E:\freebooks\books\whole\jiten
> > 2
> > E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > adding: E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > File contents from buffer:
> > E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
> > false
> > E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > adding: E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > File contents from buffer:
> > E:\freebooks\books\whole\jiten\TIJ3_c.htm
> > false
> > java.lang.NullPointerException
> >         at org.apache.lucene.index.IndexWriter.close(Unknown Source)
> >         at IndexDatanew.indexDocs(IndexDatanew.java:89)
> >         at IndexDatanew.indexDocs(IndexDatanew.java:50)
> >         at IndexDatanew.main(IndexDatanew.java:25)
>
> --
> Franck Brisbart
> R&D
> http://www.kelkoo.com
>
> Index: IndexWriter.java
> ===
> RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
> retrieving revision 1.28
> diff -u -r1.28 IndexWriter.java
> --- IndexWriter.java 25 Mar 2004 19:34:53 - 1.28
> +++ IndexWriter.java 13 Apr 2004 16:39:56 -
> @@ -235,8 +235,10 @@
>    public synchronized void close() throws IOException {
>      flushRamSegments();
>      ramDirectory.close();
> -    writeLock.release(); // release write lock
> -    writeLock = null;
> +    if (writeLock != null) {
> +      writeLock.release(); // release write lock
> +      writeLock = null;
> +    }
>      directory.close();
>    }