Re: file open handles?
Hi Jake

> You were indexing but not searching? So you are never calling getReader() in the first place?

Of course, the call exists; it's just that during testing we did not execute any searches at all.

> How have you been doing search in a realtime fashion with Lucene before 2.9's introduction of IndexWriter.getReader()?

Nope. I previously used to open and close the reader on each search. When I noticed the getReader() functionality was available, I jumped at it. It immediately offered significant performance increases...

We are now attempting to analyze Lucene using JPicus to try to get a picture of what is happening here. See: http://wiki.sdn.sap.com/wiki/display/Java/JPicus
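For reference, the near-real-time pattern being described, sketched against the 2.9 API (the variable names and reopen cadence are illustrative):

    // Near-real-time reader obtained straight from the writer (Lucene 2.9+).
    IndexReader reader = writer.getReader();

    // Later, refresh cheaply instead of opening a reader from scratch:
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
        reader.close();     // release the old reader's files
        reader = newReader; // sees recent adds without a full commit/open cycle
    }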
Re: file open handles?
On Wed, Jan 27, 2010 at 12:17 AM, Jamie wrote:
> Hi Jake
>
>> You were indexing but not searching? So you are never calling getReader() in the first place?
>
> Of course, the call exists; it's just that during testing we did not execute any searches at all.

Oh! Re-reading your initial post - you're just seeing lots of files which haven't quite yet been cleaned up during indexing, it looks like, yes? There are threads going on in the background which are merging segments and deleting old files; these should go away over time. Do you see that they are still around after a very long period? How high does the file count grow?

>> How have you been doing search in a realtime fashion with Lucene before 2.9's introduction of IndexWriter.getReader()?
>
> Nope. I previously used to open and close the reader on each search. When I noticed the getReader() functionality was available, I jumped at it. It immediately offered significant performance increases...

Gah! You must have a pretty small index for that to be performant. Opening a new IndexReader per request has historically been a really good way to kill your search performance. "Significant performance increases" in comparison to opening a new IndexReader per request in the pre-2.9 days, indeed!

-jake
Re: file open handles?
Hi Jake

Ok. The number of file handles left open is increasing rapidly. For instance, 4200 file handles were left open by Lucene 2.9.1 over a period of 16 min. You can see in the attached snapshot a picture from JPicus showing the file handles that are left open. These index files are deleted, but the OS still holds references to them. Could it be that Lucene merge threads are not closing files correctly before they are deleted? More than likely it is an error with our code, but where? Our LuceneIndex wrapper class is attached. If I set the OS max file count to a low figure, my application stops in its tracks, so this is definitely a critical issue that must be resolved.

Jamie

On 2010/01/27 10:24 AM, Jake Mannix wrote:
> Oh! Re-reading your initial post - you're just seeing lots of files which haven't quite yet been cleaned up during indexing, it looks like, yes? There are threads going on in the background which are merging segments and deleting old files; these should go away over time.

Yes, but they do not. They just keep growing over time until the file handle count is exhausted. I can see from the JPicus utility that although these files are deleted, the handles remain open.

> Do you see that they are still around after a very long period? How high does the file count grow?

    package com.stimulus.archiva.index;

    import com.stimulus.util.*;
    import java.io.File;
    import java.io.IOException;
    import java.io.PrintStream;
    import org.apache.commons.logging.*;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.FSDirectory;
    import com.stimulus.archiva.domain.Config;
    import com.stimulus.archiva.domain.Indexer;
    import com.stimulus.archiva.domain.Volume;
    import com.stimulus.archiva.exception.*;
    import com.stimulus.archiva.language.AnalyzerFactory;
    import com.stimulus.archiva.search.*;
    import java.util.*;
    import org.apache.lucene.store.LockObtainFailedException;
    import org.apache.lucene.store.AlreadyClosedException;
    import java.util.concurrent.locks.ReentrantLock;
    import java.util.concurrent.*;

    public class LuceneIndex extends Thread {

        protected ArrayBlockingQueue<LuceneDocument> queue;
        protected static final Log logger = LogFactory.getLog(LuceneIndex.class.getName());
        protected static final Log indexLog = LogFactory.getLog("indexlog");
        IndexWriter writer = null;
        protected static ScheduledExecutorService scheduler;
        protected static ScheduledFuture<?> scheduledTask;
        protected LuceneDocument EXIT_REQ = null;
        ReentrantLock indexLock = new ReentrantLock();
        ArchivaAnalyzer analyzer = new ArchivaAnalyzer();
        File indexLogFile;
        PrintStream indexLogOut;
        IndexProcessor indexProcessor;
        String friendlyName;
        String indexPath;
        int maxSimultaneousDocs;
        int indexThreads;

        public LuceneIndex(int queueSize, LuceneDocument exitReq, String friendlyName,
                           String indexPath, int maxSimultaneousDocs, int indexThreads) {
            this.queue = new ArrayBlockingQueue<LuceneDocument>(queueSize);
            this.EXIT_REQ = exitReq;
            this.friendlyName = friendlyName;
            this.indexPath = indexPath;
            this.maxSimultaneousDocs = maxSimultaneousDocs;
            this.indexThreads = indexThreads;
            setLog(friendlyName);
        }

        public int getMaxSimultaneousDocs() {
            return maxSimultaneousDocs;
        }

        public void setMaxSimultaneousDocs(int maxSimultaneousDocs) {
            this.maxSimultaneousDocs = maxSimultaneousDocs;
        }

        public ReentrantLock getIndexLock() {
            return indexLock;
        }

        protected void setLog(String logName) {
            try {
                indexLogFile = getIndexLogFile(logName);
                if (indexLogFile != null) {
                    if (indexLogFile.length() > 10485760)
                        indexLogFile.delete();
                    indexLogOut = new PrintStream(indexLogFile);
                }
                logger.debug("set index log file path {path='" + indexLogFile.getCanonicalPath() + "'}");
            } catch (Exception e) {
                logger.error("failed to open index log file:" + e.getMessage(), e);
            }
Re: file open handles?
Hi Jake

We got to the bottom of it. It turned out to be a status page that was opening the reader to obtain docCount but not closing it. Thanks for your help!

Jamie

On 2010/01/27 10:48 AM, Jamie wrote:
> Ok. The number of file handles left open is increasing rapidly. For instance, 4200 file handles were left open by Lucene 2.9.1 over a period of 16 min. These index files are deleted, but the OS still holds references to them.
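The leak pattern and its fix, sketched; the status-page handler and the directory variable are illustrative, not the original code:

    // Status page: open the reader, read the count, and always close it
    // in finally so the underlying file handles are released.
    IndexReader reader = IndexReader.open(directory);
    try {
        int docCount = reader.numDocs(); // what the status page wanted
        // ... render the page ...
    } finally {
        reader.close(); // without this, every page hit leaks handles
    }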
Index searching problem
I build an index to store 100 docs, each with fields author, title and abstract:

    for (i = 0; i < 100; i++) {
        writer = new IndexWriter("index", new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
        doc.add(new Field("author", cfcDoc.getAu(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("title", cfcDoc.getTi(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("abstract", cfcDoc.getAb(), Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }

But when I perform a search, it returns zero results, even though the query string exists in one of the fields of a document. Why is that?

    Hits hits = se.performSearch("Hotel");
    System.out.println("hits length = " + hits.length());

It creates the index folder in the file system, but when I open the file _0.fdt or _0.fdx with Luke, it shows nothing... it also deletes the file from the file system.

Asif
Re: Index searching problem
Do you close your index writer or commit it before you open your searcher?

One more thing: if you search for "Hotel" you might not find anything if the query string is not passed through the StandardAnalyzer you use for indexing (well, or another analyzer that does lowercasing).

BTW, your email is hard to read - I don't see a single newline.

simon

On Wed, Jan 27, 2010 at 10:40 AM, Asif Nawaz wrote:
> i build an index to store 100 docs, each with fields author, title and abstract.
> But when I perform a search, it returns zero results, even though the query string exists in one of the fields of a document. Why is that?
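A sketch of the indexing loop with the usual fixes applied, using the same pre-3.0 API as the original post (cfcDoc and its getters are the poster's own classes): the writer is created once (create=true wipes the index on every construction), a fresh Document is built per iteration, and close() at the end commits and releases the write lock.

    IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(),
            true, IndexWriter.MaxFieldLength.UNLIMITED); // create once, not per document
    for (int i = 0; i < 100; i++) {
        Document doc = new Document(); // fresh document each iteration
        doc.add(new Field("author", cfcDoc.getAu(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("title", cfcDoc.getTi(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("abstract", cfcDoc.getAb(), Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }
    writer.close(); // without this the index looks empty to Luke and to searchers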
RE: Index searching problem
OK, it works when I add commit and close the index. When I open the index with Luke, it shows me the list of documents that were matched. But in my program it returns number of hits = 0. Why???

    Hits hits = se.performSearch("significance");
    System.out.println("hits length = " + hits.length());

> Date: Wed, 27 Jan 2010 10:45:27 +0100
> From: simon.willna...@googlemail.com
>
> Do you close your index writer or commit it before you open your searcher?
>
> One more thing: if you search for "Hotel" you might not find anything if the query string is not passed through the StandardAnalyzer you use for indexing.
Re: Index searching problem
Do you open the searcher / reader after you call commit on the writer?

simon

On Wed, Jan 27, 2010 at 12:40 PM, Asif Nawaz wrote:
> OK, it works when I add commit and close the index. When I open the index with Luke, it shows me the list of documents that were matched. But in my program it returns number of hits = 0. Why???
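The ordering Simon is asking about, sketched with the pre-2.9-style API used elsewhere in this thread:

    writer.commit(); // or writer.close()

    // Open the reader only after the commit, so it sees the new segments.
    IndexReader reader = IndexReader.open("index");
    IndexSearcher searcher = new IndexSearcher(reader);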
Re: Index searching problem
Lots of other things to check are listed in the FAQ:

http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F

--
Ian.

On Wed, Jan 27, 2010 at 11:47 AM, Simon Willnauer wrote:
> Do you open the searcher / reader after you call commit on the writer?
Re: file open handles?
On Wed, Jan 27, 2010 at 4:25 AM, Jamie wrote:
> We got to the bottom of it.

Thanks for bringing closure!

> It turned out to be a status page that was opening the reader to obtain docCount but not closing it. Thanks for your help!

If you only need the docCount in the index, it's much faster to use oal.index.SegmentInfos (public since 2.9). That simply reads the latest segments_N file, which internally records the docCount & deletion count per segment, which you can then sum up.

Mike
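A rough sketch of that approach against the 2.9 API; the SegmentInfo field and method names are from memory, so treat them as assumptions and check them against your Lucene version:

    // Counts live docs by reading only the segments_N file (no IndexReader).
    static int docCount(Directory directory) throws IOException {
        SegmentInfos infos = new SegmentInfos();
        infos.read(directory); // parses the latest segments_N
        int docs = 0;
        for (int i = 0; i < infos.size(); i++) {
            SegmentInfo si = infos.info(i);
            docs += si.docCount - si.getDelCount(); // live docs = total minus deleted
        }
        return docs;
    }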
RE: Index searching problem
In the demo example for hotel database searching, I am confused about how to open the index and where that code should fit. In SearchEngine.java I opened the index this way:

    IndexSearcher is = new IndexSearcher(IndexReader.open("index"));

but it's not working and still returns 0 hits :(

> Date: Wed, 27 Jan 2010 12:47:57 +0100
> From: simon.willna...@googlemail.com
>
> Do you open the searcher / reader after you call commit on the writer?
RE: Index searching problem
    IndexSearcher is = new IndexSearcher("index");
    IndexReader ir = is.getIndexReader().open("index");
    System.out.println("No of documents in index = " + ir.numDocs());

The last statement shows number of documents = 167. That means the IndexReader is reading from the index, which is open. I think the problem may exist in the query parser. I am using the following code:

    QueryParser parser = new QueryParser("content", analyzer);
    Query query = parser.parse(queryString);
    Hits hits = is.search(query);

> Date: Wed, 27 Jan 2010 12:47:57 +0100
> From: simon.willna...@googlemail.com
>
> Do you open the searcher / reader after you call commit on the writer?
Problem with „AND“ operator to search Chinese text
Hello,

I could successfully implement the Chinese analyzer (CJKAnalyzer) and search Chinese text. However, I have a problem when I use the Boolean operator AND: I always get 0 hits. Searching for the two Chinese terms without the "AND" operator is no problem, but when I want to count only the hits where both terms exist in the same document, the result is always zero. How can I use the "AND" operator when I search Chinese text, and why is it a problem to use it when searching Chinese text?

Thanks in advance
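For reference, the conjunction can be expressed through the parser's default operator; a sketch assuming the contrib CJKAnalyzer and a pre-2.9-style QueryParser constructor (the field name and query terms are illustrative):

    Analyzer analyzer = new CJKAnalyzer(); // org.apache.lucene.analysis.cjk (contrib)
    QueryParser parser = new QueryParser("content", analyzer);
    parser.setDefaultOperator(QueryParser.AND_OPERATOR); // every clause must match

    // Note: CJKAnalyzer emits overlapping bigrams, so each "term" may itself
    // become a phrase of bigrams; AND then applies between the two phrases.
    Query query = parser.parse("term1 term2"); // substitute the two Chinese terms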
Re: Index searching problem
On Wed, Jan 27, 2010 at 4:53 PM, Asif Nawaz wrote:
>     IndexSearcher is = new IndexSearcher("index");
>     IndexReader ir = is.getIndexReader().open("index");
>     System.out.println("No of documents in index = " + ir.numDocs());
>
> The last statement shows number of documents = 167. That means the IndexReader is reading from the index, which is open. I think the problem may exist in the query parser.

I don't see the field "content" in the document you build in your first mail. What do you search for? Remember, if you do not specify a field in your query string, the parser will use the default field, which is "content". Could that cause your problem?

simon
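A sketch of searching the fields that were actually indexed (author/title/abstract from the first mail), using the pre-3.0 MultiFieldQueryParser API to match the rest of the thread; `is` is the IndexSearcher from the post above:

    // Parse the query against all three indexed fields instead of "content".
    QueryParser parser = new MultiFieldQueryParser(
            new String[] { "author", "title", "abstract" },
            new StandardAnalyzer()); // same analyzer as at index time
    Query query = parser.parse("Hotel"); // lowercased by the analyzer, so it matches
    Hits hits = is.search(query);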
Analyze java camelcase words ?
Can anyone suggest a solution for tokenizing camelCase words in Java? Examples of camelCase words are getXmlRule and setTokenizeAnalyzer. They should be tokenized to get, Xml, Rule, set, Tokenize, Analyzer.

Thank you very much!
Re: Average Precision - TREC-3
On Jan 26, 2010, at 8:28 AM, Ivan Provalov wrote:
> We are looking into making some improvements to relevance ranking of our search platform based on Lucene. We started by running the Ad Hoc TREC task on the TREC-3 data using "out-of-the-box" Lucene. The reason to run this old TREC-3 (TIPSTER Disk 1 and Disk 2; topics 151-200) data was that the content matches the content of our production system.
>
> We are currently getting average precision of 0.14. We found some format issues with the TREC-3 data which were causing an even lower score. For example, the initial average precision number was 0.09. We discovered that the topics included the word "Topic:" in the <title> tag. For example, "<title> Topic: Coping with overcrowded prisons". By removing this term from the queries, we bumped the average precision to 0.14.

There's usually a lot of this involved in running TREC. I've also seen a good deal of improvement from things like using phrase queries and the Dismax Query Parser in Solr (which uses DisjunctionMaxQuery in Lucene, amongst other things) and by playing around with length normalization.

> Our query is based on the title tag of the topic and the index field is based on the <TEXT> tag of the document.
>
>     QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");
>
> Is there an average precision number which "out-of-the-box" Lucene should be close to? For example, this IBM 2007 TREC paper mentions 0.154: http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf

Hard to say. I can't say I've run TREC 3. You might ask over on the Open Relevance list too (http://lucene.apache.org/openrelevance). I know Robert Muir's done a lot of experiments with Lucene on standard collections like TREC.

I guess the bigger question back to you is what is your goal? Is it to get better at TREC or to actually tune your system?

-Grant

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
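To illustrate the length-normalization knob mentioned above, a minimal sketch of overriding DefaultSimilarity; the flattened curve is illustrative, not a recommendation:

    // Gentler than the default 1/sqrt(numTerms), so long TREC docs are
    // penalized less; purely illustrative numbers.
    public class FlatterLengthNormSimilarity extends DefaultSimilarity {
        @Override
        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(Math.sqrt(numTerms)));
        }
    }

    // Install at both index and search time:
    // writer.setSimilarity(new FlatterLengthNormSimilarity());
    // searcher.setSimilarity(new FlatterLengthNormSimilarity());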
Re: Analyze java camelcase words ?
WordDelimiterFilter has a splitOnCaseChange option that should be useful for this:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

From the example: PowerShot -> Power, Shot

On Wed, Jan 27, 2010 at 11:01 AM, Phan The Dai wrote:
> Can anyone suggest a solution for tokenizing camelCase words in Java? Examples of camelCase words are getXmlRule and setTokenizeAnalyzer. They should be tokenized to get, Xml, Rule, set, Tokenize, Analyzer.

--
Robert Muir
rcm...@gmail.com
Re: Average Precision - TREC-3
Hello, forgive my ignorance here (I have not worked with these English TREC collections), but is the TREC-3 test collection the same as the test collection used in the 2007 paper you referenced?

It looks like that is a different collection; it's not really possible to compare these relevance scores across different collections.

On Wed, Jan 27, 2010 at 11:06 AM, Grant Ingersoll wrote:
> Hard to say. I can't say I've run TREC 3. You might ask over on the Open Relevance list too (http://lucene.apache.org/openrelevance).
>
> I guess the bigger question back to you is what is your goal? Is it to get better at TREC or to actually tune your system?

--
Robert Muir
rcm...@gmail.com
Re: Analyze java camelcase words ?
Robert:

Is this in Lucene yet? According to what I could find in JIRA, it's still open. And it's not in the Javadocs on a quick scan.

Erick

On Wed, Jan 27, 2010 at 11:08 AM, Robert Muir wrote:
> WordDelimiterFilter has a splitOnCaseChange option that should be useful for this:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
Re: Analyze java camelcase words ?
No, but you can take the token filter itself and simply use it in your Lucene application.

It uses the old TokenStream API, so if you want to use Lucene 3.0 or 3.1 you will need a version that works with the new TokenStream API. There is a patch available here for that: https://issues.apache.org/jira/browse/SOLR-1710

On Wed, Jan 27, 2010 at 11:17 AM, Erick Erickson wrote:
> Is this in Lucene yet? According to what I could find in JIRA, it's still open. And it's not in the Javadocs on a quick scan.

--
Robert Muir
rcm...@gmail.com
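For the case-change rule on its own (outside any analyzer), a plain-Java sketch with a lookaround regex; this is only the splitting logic, not the actual WordDelimiterFilter:

    // Split before every uppercase letter that follows a lowercase letter.
    String[] parts = "getXmlRule".split("(?<=[a-z])(?=[A-Z])");
    // parts == { "get", "Xml", "Rule" }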
Re: Analyze java camelcase words ?
Thank you very much. I will study your comments; they are useful. I am new to this and using Lucene 3.0. Hope it works well.

On Thu, Jan 28, 2010 at 1:21 AM, Robert Muir wrote:
> No, but you can take the token filter itself and simply use it in your Lucene application.
>
> It uses the old TokenStream API, so if you want to use Lucene 3.0 or 3.1 you will need a version that works with the new TokenStream API. There is a patch available here for that: https://issues.apache.org/jira/browse/SOLR-1710
Re: Average Precision - TREC-3
Robert, Grant:

Thank you for your replies.

Our goal is to fine-tune our existing system to perform better on relevance.

I agree with Robert's comment that these collections are not completely compatible. Yes, it is possible that the results will vary some depending on the collection differences. The reason for us picking the TREC-3 TIPSTER collection is that our production content overlaps with some TIPSTER documents.

Any suggestions on how to obtain Lucene's TREC-3 compatible results, or how to select a better approach, would be appreciated.

We are doing this project in three stages:

1. Test Lucene's "vanilla" performance to establish the baseline. We want to iron out issues such as topic or document formats. For example, we had to add a different parser and clean up the topic title. This will give us confidence that we are using the data and the methodology correctly.

2. Fine-tune Lucene based on the latest research findings (TREC by E. Voorhees, conference proceedings, etc.).

3. Repeat these steps with our production system, which runs on Lucene. The reason we are doing this step last is to ensure that our overall system doesn't introduce relevance issues (content pre-processing steps, query parsing steps, etc.).

Thank you,

Ivan Provalov

--- On Wed, 1/27/10, Robert Muir wrote:
> Hello, forgive my ignorance here (I have not worked with these English TREC collections), but is the TREC-3 test collection the same as the test collection used in the 2007 paper you referenced?
>
> It looks like that is a different collection; it's not really possible to compare these relevance scores across different collections.
Re: Average Precision - TREC-3
Hi Ivan,

You might want to use the Lucene BM25 implementation; results should be better after changing the ranking function. Another option is the language-model implementation for Lucene:

http://nlp.uned.es/~jperezi/Lucene-BM25/
http://ilps.science.uva.nl/resources/lm-lucene

The main problem with these implementations is that they don't support every different kind of Lucene query, but if you don't need that, these alternative implementations are a good choice.

best

jose

On Wed, Jan 27, 2010 at 1:36 PM, Ivan Provalov wrote:
> Our goal is to fine-tune our existing system to perform better on relevance.
>
> Any suggestions on how to obtain Lucene's TREC-3 compatible results, or how to select a better approach, would be appreciated.
RE: Average Precision - TREC-3
Thank you, Jose.

-----Original Message-----
From: José Ramón Pérez Agüera [mailto:jose.agu...@gmail.com]
Sent: Wednesday, January 27, 2010 1:42 PM
To: java-user@lucene.apache.org
Subject: Re: Average Precision - TREC-3

Hi Ivan,

You might want to use the Lucene BM25 implementation; results should be better after changing the ranking function. Another option is the language-model implementation for Lucene:

http://nlp.uned.es/~jperezi/Lucene-BM25/
http://ilps.science.uva.nl/resources/lm-lucene
Re: Average Precision - TREC-3
Hi Ivan, it sounds to me like you are going about it the right way. I too have complained about different document/topic formats before, at least with non-TREC test collections that claim to be in TREC format.

Here is a description of what I do, for what it's worth.

1. If you use the trunk benchmark code, it will now parse Descriptions and Narratives in addition to Titles. This way you can run TD and TDN queries. While I think Title-only (T) queries are generally the only interesting value, as users typically type only a few short words in their search, the TD and TDN queries are sometimes useful for comparisons. To do this you will have to either change SimpleQQParser or make your own that simply creates a BooleanQuery of Topic + Description + Narrative or whatever.

2. Another thing I usually test with is query expansion with MoreLikeThis, all defaults, from the top 5 returned docs. I do this with T, TD, and TDN, for 6 different MAP measures. You can see a recent example where I applied all 6 measures here: https://issues.apache.org/jira/browse/LUCENE-2234 . I feel these 6 measures give me a better overall idea of any relative relevance improvement; look in that example where the unexpanded T is improved 75%, but for the other 5 it's only a 40-50% improvement. While unexpanded T is theoretically the most realistic to me, I feel it's a bit fragile and sensitive, and there's a good example. (I could post the code for these two things if you think it would be useful, just haven't gotten around to it.)

3. I don't even bother with the 'summary output' that the Lucene benchmark package prints out; instead I simply use the benchmark package to run the queries and generate the trec_top_file (submission.txt), which I hand to trec_eval.

On Wed, Jan 27, 2010 at 1:36 PM, Ivan Provalov wrote:
> Our goal is to fine-tune our existing system to perform better on relevance.
>
> Any suggestions on how to obtain Lucene's TREC-3 compatible results, or how to select a better approach, would be appreciated.

--
Robert Muir
rcm...@gmail.com
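A sketch of the expansion step in item 2 (contrib-queries MoreLikeThis with defaults); the field name, searcher/reader variables, and top-5 cutoff are illustrative:

    // Expand the original query from the top 5 hits, MoreLikeThis defaults.
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] { "TEXT" }); // field the docs were indexed with

    BooleanQuery expanded = new BooleanQuery();
    expanded.add(originalQuery, BooleanClause.Occur.SHOULD);

    TopDocs top = searcher.search(originalQuery, 5);
    for (ScoreDoc sd : top.scoreDocs) {
        expanded.add(mlt.like(sd.doc), BooleanClause.Occur.SHOULD); // terms from each top doc
    }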
Search for more than one term
Hello:

I'm working with Lucene for my thesis. Please, I need answers to these questions:

1. How can I tell Lucene to search for more than one term? (For example: the query "house garden computer" should return documents in which at least one of the terms appears.) What classes do I need to use?

2. Does Lucene work well on Windows, Mac OS X, Linux and Unix? What other platforms?

Thanks in advance,
Carmen
Re: Search for more than one term
ctorresl wrote:
> 2. Does Lucene work well on Windows, Mac OS X, Linux and Unix? What other platforms?

I've seen it run nicely on AIX. If you can call running on AIX nice.

--
- Mark
Re: Search for more than one term
Have you looked at the query syntax? See...

http://lucene.apache.org/java/3_0_0/queryparsersyntax.html

And the book Lucene in Action has many examples.

HTH
Erick

On Wed, Jan 27, 2010 at 6:55 PM, ctorresl wrote:
> 1. How can I tell Lucene to search for more than one term? (For example: the query "house garden computer" should return documents in which at least one of the terms appears.) What classes do I need to use?
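Since QueryParser's default operator is OR, the multi-term case in the question works out of the box; a minimal sketch (pre-2.9-style constructor - 2.9+ adds a Version argument; the field name is illustrative):

    // Default operator is OR, so a document matching any term is returned.
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Query query = parser.parse("house garden computer");
    // Equivalent to: content:house OR content:garden OR content:computer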
Re: Average Precision - TREC-3
Robert,

Thank you for this great information. Let me look into these suggestions.

Ivan

--- On Wed, 1/27/10, Robert Muir wrote:
> Hi Ivan, it sounds to me like you are going about it the right way. Here is a description of what I do, for what it's worth.
Re: Search for more than one term
Hello ctorresl,

You can use QueryParser, which automatically creates the query from the query syntax (as Erick showed), or use the BooleanQuery class directly:

    BooleanQuery query = new BooleanQuery();
    query.add(a_termquery, Occur.SHOULD);
    query.add(other_termquery, Occur.SHOULD);

On Thu, Jan 28, 2010 at 11:15 AM, Erick Erickson wrote:
> Have you looked at the query syntax? See...
> http://lucene.apache.org/java/3_0_0/queryparsersyntax.html