Re: file open handles?

2010-01-27 Thread Jamie

Hi Jake


> You were indexing but not searching?  So you are never calling getReader()
> in the first place?

Of course, the call exists; it's just that during testing we did not
execute any searches at all.

> How have you been doing search in a realtime fashion with Lucene before
> 2.9's introduction of IndexWriter.getReader()?

Nope. I previously opened and closed the reader on each search. When I
noticed the getReader() functionality was available, I jumped at it. It
immediately offered significant performance increases...
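
[For reference, a minimal sketch of that getReader() pattern, including the
reopen/close step that keeps old readers from pinning deleted files. The
directory path and the 2.9-era constructor choices are assumptions, not
Jamie's actual code:]

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class NrtSketch {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("index"));
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Near-real-time reader: sees the writer's changes without a commit.
        IndexReader reader = writer.getReader();
        IndexSearcher searcher = new IndexSearcher(reader);

        // ... add documents via the writer, search via the searcher ...

        // To refresh before the next search: reopen() returns a new reader
        // only if the index changed, and the OLD reader must still be
        // closed, or the (possibly deleted) files it references stay open.
        IndexReader newReader = reader.reopen();
        if (newReader != reader) {
            reader.close();            // releases the old file handles
            reader = newReader;
            searcher = new IndexSearcher(reader);
        }

        searcher.close();
        reader.close();
        writer.close();
        dir.close();
    }
}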


We are now attempting to analyze Lucene using JPicus to try and get a 
picture of what is happening here.


See: http://wiki.sdn.sap.com/wiki/display/Java/JPicus




Re: file open handles?

2010-01-27 Thread Jake Mannix
On Wed, Jan 27, 2010 at 12:17 AM, Jamie  wrote:

> Hi Jake
>
>
>> You were indexing but not searching?  So you are never calling getReader()
>> in the first place?
>
> Of course, the call exists; it's just that during testing we did not execute
> any searches at all.


Oh!  Re-reading your initial post - you're just seeing lots of files which
haven't quite yet been cleaned up during indexing, it looks like, yes?
There are threads going on in the background which are merging segments
and deleting old files; these should go away over time.

Do you see that they are still around after a very long period?  How high
does the file count grow?


>
>> How have you been doing search in a realtime fashion with Lucene before
>> 2.9's introduction of IndexWriter.getReader()?
>
> Nope. I previously opened and closed the reader on each search. When I
> noticed the getReader() functionality was available, I jumped at it. It
> immediately offered significant performance increases...
>

Gah!  You must have a pretty small index for that to be performant.  That's
historically been a really good way to kill your search performance.
"Significant performance increases" in comparison to opening a new
IndexReader per request in the pre-2.9 days indeed!

  -jake


Re: file open handles?

2010-01-27 Thread Jamie

Hi Jake

Ok. The number of file handles left open is increasing rapidly. For
instance, 4200 file handles were left open by Lucene 2.9.1 over a period
of 16 minutes. The attached JPicus snapshot shows the file handles that
are left open. These index files are deleted, but the OS still holds
references to them. Could it be that Lucene merge threads are not closing
files correctly before they are deleted? More than likely it is an error
with our code, but where? Our LuceneIndex wrapper class is attached. If I
set the OS max file count to a low figure, my application stops in its
tracks, so this is definitely a critical issue that must be resolved.


Jamie


On 2010/01/27 10:24 AM, Jake Mannix wrote:
> Oh!  Re-reading your initial post - you're just seeing lots of files which
> haven't quite yet been cleaned up during indexing, it looks like, yes?
> There are threads going on in the background which are merging segments
> and deleting old files; these should go away over time.

Yes, but they do not. They just keep growing over time until the file
handle count is exhausted. I can see from the JPicus utility that although
these

> Do you see that they are still around after a very long period?  How high
> does the file count grow?


package com.stimulus.archiva.index;

import com.stimulus.util.*;
import java.io.File;
import java.io.IOException;
import java.io.PrintStream;
import org.apache.commons.logging.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;
import com.stimulus.archiva.domain.Config;
import com.stimulus.archiva.domain.Indexer;
import com.stimulus.archiva.domain.Volume;
import com.stimulus.archiva.exception.*;
import com.stimulus.archiva.language.AnalyzerFactory;
import com.stimulus.archiva.search.*;
import java.util.*;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.AlreadyClosedException;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.*;

public class LuceneIndex extends Thread {

    protected ArrayBlockingQueue<LuceneDocument> queue;
    protected static final Log logger =
            LogFactory.getLog(LuceneIndex.class.getName());
    protected static final Log indexLog = LogFactory.getLog("indexlog");
    IndexWriter writer = null;
    protected static ScheduledExecutorService scheduler;
    protected static ScheduledFuture<?> scheduledTask;
    protected LuceneDocument EXIT_REQ = null;
    ReentrantLock indexLock = new ReentrantLock();
    ArchivaAnalyzer analyzer = new ArchivaAnalyzer();
    File indexLogFile;
    PrintStream indexLogOut;
    IndexProcessor indexProcessor;
    String friendlyName;
    String indexPath;
    int maxSimultaneousDocs;
    int indexThreads;

    public LuceneIndex(int queueSize, LuceneDocument exitReq,
            String friendlyName, String indexPath, int maxSimultaneousDocs,
            int indexThreads) {
        this.queue = new ArrayBlockingQueue<LuceneDocument>(queueSize);
        this.EXIT_REQ = exitReq;
        this.friendlyName = friendlyName;
        this.indexPath = indexPath;
        this.maxSimultaneousDocs = maxSimultaneousDocs;
        this.indexThreads = indexThreads;
        setLog(friendlyName);
    }

    public int getMaxSimultaneousDocs() {
        return maxSimultaneousDocs;
    }

    public void setMaxSimultaneousDocs(int maxSimultaneousDocs) {
        this.maxSimultaneousDocs = maxSimultaneousDocs;
    }

    public ReentrantLock getIndexLock() {
        return indexLock;
    }

    protected void setLog(String logName) {
        try {
            indexLogFile = getIndexLogFile(logName);
            if (indexLogFile != null) {
                if (indexLogFile.length() > 10485760)
                    indexLogFile.delete();
                indexLogOut = new PrintStream(indexLogFile);
            }
            logger.debug("set index log file path {path='"
                    + indexLogFile.getCanonicalPath() + "'}");
        } catch (Exception e) {
            logger.error("failed to open index log file:" + e.getMessage(), e);
        }

Re: file open handles?

2010-01-27 Thread Jamie

Hi Jake

We got to the bottom of it. It turned out to be a status page that was
opening the reader to obtain the docCount but not closing it. Thanks for
your help!


Jamie



Index searching problem

2010-01-27 Thread Asif Nawaz

I build an index to store 100 docs, each with fields author, title and
abstract:

for (i = 0; i < 100; i++) {
    writer = new IndexWriter("index", new StandardAnalyzer(), true,
            IndexWriter.MaxFieldLength.UNLIMITED);
    doc.add(new Field("author", cfcDoc.getAu(), Field.Store.YES,
            Field.Index.TOKENIZED));
    doc.add(new Field("title", cfcDoc.getTi(), Field.Store.YES,
            Field.Index.TOKENIZED));
    doc.add(new Field("abstract", cfcDoc.getAb(), Field.Store.YES,
            Field.Index.TOKENIZED));
    writer.addDocument(doc);
}

But when I perform a search, it returns zero results, even though the query
string exists in one of the fields of a document. Why is that?

Hits hits = se.performSearch("Hotel");
System.out.println("hits length = " + hits.length());

It creates the index folder in the file system, but when I open the file
_0.fdt or _0.fdx with Luke, it shows nothing... it also deletes the file
from the file system.






Asif

  

Re: Index searching problem

2010-01-27 Thread Simon Willnauer
do you close your index writer or commit it before you open your searcher?

One more thing: if you search for "Hotel" you might not find anything if
the query string is not passed through the StandardAnalyzer you use for
indexing (well, or another analyzer that does lowercasing).
BTW, your email is hard to read - I don't see a single newline.
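
[A minimal sketch of both points together, using the 2.9-era API;
se.performSearch from the earlier mail is assumed to wrap something like
this:]

writer.commit();   // or writer.close() - makes the added docs visible

IndexSearcher is = new IndexSearcher(IndexReader.open("index"));

// Query with the SAME analyzer used at index time.  StandardAnalyzer
// lowercases, so the indexed term is "hotel"; a raw TermQuery for
// "Hotel" would miss it, but QueryParser analyzes the text for you.
QueryParser parser = new QueryParser("title", new StandardAnalyzer());
Query query = parser.parse("Hotel");   // becomes title:hotel
Hits hits = is.search(query);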

simon




RE: Index searching problem

2010-01-27 Thread Asif Nawaz

OK, it works when I add commit and close the index. When I open the index
with Luke, it shows me the list of documents that were matched. But in my
program it returns number of hits = 0. Why?

Hits hits = se.performSearch("significance");
System.out.println("hits length = " + hits.length());












Re: Index searching problem

2010-01-27 Thread Simon Willnauer
Do you open the searcher / reader after you call commit on the writer?

simon




Re: Index searching problem

2010-01-27 Thread Ian Lea
Lots of other things to check are listed in the FAQ:
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F


--
Ian.





Re: file open handles?

2010-01-27 Thread Michael McCandless
On Wed, Jan 27, 2010 at 4:25 AM, Jamie  wrote:

> We got to the bottom of it.

Thanks for bringing closure!

> Turned out to be a status page that was opening
> the reader to obtain docCount but not closing it. Thanks for your help!

If you only need the docCount in the index, it's much faster to use
oal.index.SegmentInfos (public since 2.9).  That simply reads the
latest segments_N file, which internally records the docCount &
deletion count per segment, which you can then sum up.
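
[For what it's worth, a sketch of that approach - the 2.9-era API from
memory, so verify the exact SegmentInfo accessors against your version:]

SegmentInfos infos = new SegmentInfos();
infos.read(directory);                          // reads the latest segments_N
int numDocs = 0;
for (int i = 0; i < infos.size(); i++) {
    SegmentInfo si = infos.info(i);
    numDocs += si.docCount - si.getDelCount();  // live docs in this segment
}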

Mike




RE: Index searching problem

2010-01-27 Thread Asif Nawaz

In the demo example for hotel database searching, I am confused about how
to open the index and where I should fit that code. In the SearchEngine.java
file I opened the index this way:

IndexSearcher is = new IndexSearcher(IndexReader.open("index"));

but it's not working and still returns 0 hits :(




RE: Index searching problem

2010-01-27 Thread Asif Nawaz


IndexSearcher is = new IndexSearcher("index");
IndexReader ir = is.getIndexReader().open("index");
System.out.println("No of documents in index = " + ir.numDocs());

The last statement shows number of documents = 167. That means the
IndexReader is reading from the index, which is open. I think the problem
may exist in the query parser. I am using the following code:

QueryParser parser = new QueryParser("content", analyzer);
Query query = parser.parse(queryString);
Hits hits = is.search(query);





Problem with „AND“ operator to search Chinese text

2010-01-27 Thread starz10de



Hello ,

I could successfully implement the Chinese analyzer (CJKAnalyzer) and search
Chinese text. However, I have a problem when I use the Boolean operator AND:
I always get 0 hits. When I search for the two Chinese terms without the
“AND” operator there is no problem, but when I want to count only the hits
where both terms exist in the same document, the result is always zero.

How can I use the “AND” operator when searching Chinese text, and why is it
a problem to use it there?
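
[For reference, the usual way to express the conjunction through QueryParser,
so both terms pass through the same analyzer as at index time. A sketch: the
field name is an assumption, and CJKAnalyzer here means the contrib
org.apache.lucene.analysis.cjk.CJKAnalyzer:]

Analyzer analyzer = new CJKAnalyzer();        // same analyzer as at index time
QueryParser parser = new QueryParser("contents", analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
// "term1 term2" now requires both terms.  Note that CJKAnalyzer splits a
// multi-character CJK term into overlapping bigrams, so each side of the
// conjunction becomes a phrase of bigrams rather than a single token.
Query query = parser.parse(term1 + " " + term2);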

Thanks in advance

-- 
View this message in context: 
http://old.nabble.com/Problem-with-%E2%80%9EAND%E2%80%9C-operator-to-search-Chinese-text-tp27341810p27341810.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Index searching problem

2010-01-27 Thread Simon Willnauer
On Wed, Jan 27, 2010 at 4:53 PM, Asif Nawaz  wrote:
>
> IndexSearcher is = new IndexSearcher("index");
> IndexReader ir = is.getIndexReader().open("index");
> System.out.println("No of documents in index = " + ir.numDocs());
>
> The last statement shows number of documents = 167. That means the
> IndexReader is reading from the index, which is open. I think the problem
> may exist in the query parser. I am using the following code:
>
> QueryParser parser = new QueryParser("content", analyzer);
> Query query = parser.parse(queryString);
> Hits hits = is.search(query);
>
I don't see the field "content" in the document you build in your first
mail. What do you search for? Remember, if you do not specify a field in
your query string, the parser will use the default field, which here is
"content". Could that cause your problem?

simon



Analyze java camelcase words ?

2010-01-27 Thread Phan The Dai
Can anyone suggest a solution for tokenizing camelCase words in Java?
Examples of camelCase words are getXmlRule and setTokenizeAnalyzer.
They should be tokenized to get, Xml, Rule, set, Tokenize, Analyzer.

Thank you very much!


Re: Average Precision - TREC-3

2010-01-27 Thread Grant Ingersoll

On Jan 26, 2010, at 8:28 AM, Ivan Provalov wrote:

> We are looking into making some improvements to relevance ranking of our 
> search platform based on Lucene.  We started by running the Ad Hoc TREC task 
> on the TREC-3 data using "out-of-the-box" Lucene.  The reason to run this old 
> TREC-3 (TIPSTER Disk 1 and Disk 2; topics 151-200) data was that the content 
> is matching the content of our production system.  
> 
> We are currently getting average precision of 0.14.  We found some format 
> issues with the TREC-3 data which were causing even lower score.  For 
> example, the initial average precision number was 0.09.  We discovered that
> the topics included the word "Topic:" in the  tag.  For example, 
> " Topic:  Coping with overcrowded prisons".  By removing this term 
> from the queries, we bumped the average precision to 0.14.

There's usually a lot of this involved in running TREC.  I've also seen a good
deal of improvement from things like using phrase queries and the Dismax Query
Parser in Solr (which uses DisjunctionMaxQuery in Lucene, amongst other things)
and by playing around with length normalization.


> 
> Our query is based on the title tag of the topic and the index field is based 
> on the  tag of the document.  
> 
> QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");
> 
> Is there an average precision number which "out-of-the-box" Lucene should be 
> close to?  For example, this IBM's 2007 TREC paper mentions 0.154:  
> http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf

Hard to say.  I can't say I've run TREC 3.  You might ask over on the Open 
Relevance list too (http://lucene.apache.org/openrelevance).  I know Robert 
Muir's done a lot of experiments with Lucene on standard collections like TREC.

I guess the bigger question back to you is what is your goal?  Is it to get 
better at TREC or to actually tune your system?

-Grant


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search





Re: Analyze java camelcase words ?

2010-01-27 Thread Robert Muir
WordDelimiterFilter has a splitOnCaseChange option that should be useful for
this:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

From the example: PowerShot -> Power, Shot
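
[A sketch of driving it from plain Lucene code through Solr's factory -
Solr 1.4-era API; the filter's own constructor arguments vary between
versions, so the factory route is shown, and all names should be checked
against your jars:]

Map<String, String> args = new HashMap<String, String>();
args.put("generateWordParts", "1");
args.put("splitOnCaseChange", "1");
WordDelimiterFilterFactory factory = new WordDelimiterFilterFactory();
factory.init(args);

TokenStream ts = factory.create(new WhitespaceTokenizer(
        new StringReader("getXmlRule setTokenizeAnalyzer")));
TermAttribute term = ts.addAttribute(TermAttribute.class);
while (ts.incrementToken()) {
    System.out.println(term.term());   // get, Xml, Rule, set, Tokenize, Analyzer
}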

On Wed, Jan 27, 2010 at 11:01 AM, Phan The Dai wrote:

> Can everyone suggest me a solution for tokenize the camelcase words in java
> ?
> Examples for camelcase words are: getXmlRule, setTokenizeAnalyzer.
> They should be tokenized to get, Xml, Rule, set, Tokenize, Analyzer.
>
> Thank you very much!
>



-- 
Robert Muir
rcm...@gmail.com


Re: Average Precision - TREC-3

2010-01-27 Thread Robert Muir
Hello, forgive my ignorance here (I have not worked with these english TREC
collections), but is the TREC-3 test collection the same as the test
collection used in the 2007 paper you referenced?

It looks like that is a different collection; it's not really possible to
compare these relevance scores across different collections.



-- 
Robert Muir
rcm...@gmail.com


Re: Analyze java camelcase words ?

2010-01-27 Thread Erick Erickson
Robert:

Is this in Lucene yet? According to what I could find in JIRA, it's
still open. And it's not in the Javadocs on a quick scan.

Erick



Re: Analyze java camelcase words ?

2010-01-27 Thread Robert Muir
no, but you can take the tokenfilter itself and simply use it in your lucene
application.

it uses the old tokenstream API so if you want to use Lucene 3.0 or 3.1, you
will need a version that works with the new tokenstream API.
There is a patch available here for that:
https://issues.apache.org/jira/browse/SOLR-1710




-- 
Robert Muir
rcm...@gmail.com


Re: Analyze java camelcase words ?

2010-01-27 Thread Phan The Dai
Thank you very much. I will study your comments; they are useful.
I am new to Lucene and using Lucene 3.0. Hope it works well.



Re: Average Precision - TREC-3

2010-01-27 Thread Ivan Provalov
Robert, Grant:

Thank you for your replies.  

Our goal is to fine-tune our existing system to perform better on relevance.

I agree with Robert's comment that these collections are not completely
comparable. Yes, it is possible that the results will vary some depending on
the differences between collections. The reason for us picking the TREC-3
TIPSTER collection is that our production content overlaps with some TIPSTER
documents.

Any suggestions on how to obtain comparable TREC-3 results with Lucene, or
on selecting a better approach, would be appreciated.

We are doing this project in three stages:

1. Test Lucene's "vanilla" performance to establish the baseline.  We want to 
iron out the issues such as topic or document formats.  For example, we had to 
add a different parser and clean up the topic title.  This will give us 
confidence that we are using the data and the methodology correctly.

2. Fine-tune Lucene based on the latest research findings (TREC by E. Voorhees, 
conference proceedings, etc...).

3. Repeat these steps with our production system which runs on Lucene.  The 
reason we are doing this step last is to ensure that our overall system doesn't 
introduce the relevance issues (content pre-processing steps, query parsing 
steps, etc...).

Thank you,

Ivan Provalov




Re: Average Precision - TREC-3

2010-01-27 Thread José Ramón Pérez Agüera
Hi Ivan,

you might want to use the Lucene BM25 implementation. Results should be
better after changing the ranking function. Another option is the language
model implementation for Lucene:

http://nlp.uned.es/~jperezi/Lucene-BM25/
http://ilps.science.uva.nl/resources/lm-lucene

The main problem with these implementations is that not every kind of
Lucene query is supported, but if you don't need that, these alternative
implementations are a good choice.

best jose


RE: Average Precision - TREC-3

2010-01-27 Thread Provalov, Ivan (Gale)
Thank you, Jose.


Re: Average Precision - TREC-3

2010-01-27 Thread Robert Muir
Hi Ivan, it sounds to me like you are going about it the right way.
I too have complained about different document/topic formats before, at
least with non-TREC test collections that claim to be in TREC format.

Here is a description of what I do, for what its worth.

1. if you use the trunk benchmark code, it will now parse Descriptions and
Narratives in addition to Titles. This way you can run TD and TDN queries.
While I think Topic-only (T) queries are generally the only interesting
value, as users typically type only a few short words in their search, the
TD and TDN queries are sometimes useful for comparisons. To do this you
will have to either change SimpleQQParser or make your own that simply
creates a BooleanQuery of Topic + Description + Narrative or whatever, as
in the sketch below.
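
[A sketch of such a parser, using the quality package in contrib/benchmark;
treat the method signature and the topic-part names as assumptions to check
against your checkout:]

public class TDNQQParser implements QualityQueryParser {
    public Query parse(QualityQuery qq) throws ParseException {
        QueryParser qp = new QueryParser("TEXT", new StandardAnalyzer());
        BooleanQuery bq = new BooleanQuery();
        // OR together whichever topic parts are present
        for (String name : new String[] { "title", "description", "narrative" }) {
            String text = qq.getValue(name);
            if (text != null) {
                bq.add(qp.parse(QueryParser.escape(text)),
                        BooleanClause.Occur.SHOULD);
            }
        }
        return bq;
    }
}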

2. another thing I usually test with is query expansion with MoreLikeThis,
all defaults, from the top 5 returned docs. I do this with T, TD, and TDN,
for 6 different MAP measures. You can see a recent example where I applied
all 6 measures here: https://issues.apache.org/jira/browse/LUCENE-2234 . I
feel these 6 measures give me a better overall idea of any relative
relevance improvement; look in that example where the unexpanded T is
improved 75%, but for the other 5 it's only a 40-50% improvement. While
unexpanded T is theoretically the most realistic to me, I feel it's a bit
fragile and sensitive, and that's a good example.
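
[One way to read that expansion recipe - a sketch; MoreLikeThis lives in
contrib/queries, and the field name is an assumption:]

MoreLikeThis mlt = new MoreLikeThis(reader);     // all defaults
mlt.setFieldNames(new String[] { "TEXT" });
TopDocs initial = searcher.search(titleQuery, 5);

BooleanQuery expanded = new BooleanQuery();
expanded.add(titleQuery, BooleanClause.Occur.SHOULD);
for (ScoreDoc sd : initial.scoreDocs) {
    expanded.add(mlt.like(sd.doc), BooleanClause.Occur.SHOULD);  // top-5 expansion
}
// re-run the search with the expanded query and compute MAP as before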



3. I don't even bother with the 'summary output' that the lucene benchmark
pkg prints out, but instead simply use the benchmark pkg to run the queries
and generate the trec_top_file (submission.txt), which I hand to trec_eval



Search for more than one term

2010-01-27 Thread ctorresl

Hello:
I'm working with Lucene for my thesis. Please, I need answers to
these questions:
1. How can I tell Lucene to search for more than one term? (For example:
the query "house garden computer" should return documents in which at
least one of the terms appears.) What classes do I need to use?
2. Does Lucene work well on Windows, Mac OS X, Linux and Unix? What other
platforms?

Thanks in advance,
Carmen
-- 
View this message in context: 
http://old.nabble.com/Search-for-more-than-one-term-tp27348933p27348933.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Search for more than one term

2010-01-27 Thread Mark Miller
ctorresl wrote:
> Hello:
> I'm working with Lucene for my thesis. Please, I need answers to
> these questions:
> 1. How can I tell Lucene to search for more than one term? (For example:
> the query "house garden computer" should return documents in which at
> least one of the terms appears.) What classes do I need to use?
> 2. Does Lucene work well on Windows, Mac OS X, Linux and Unix? What other
> platforms?
>
> Thanks in advance,
> Carmen
>
I've seen it run nicely on AIX. If you can call running on AIX nice.

-- 
- Mark








Re: Search for more than one term

2010-01-27 Thread Erick Erickson
Have you looked at the query syntax?

See...
http://lucene.apache.org/java/3_0_0/queryparsersyntax.html

And the book Lucene In Action has many examples

HTH
Erick




Re: Average Precision - TREC-3

2010-01-27 Thread Ivan Provalov
Robert,

Thank you for this great information.  Let me look into these suggestions.

Ivan


Re: Search for more than one term

2010-01-27 Thread Phan The Dai
Hello ctorresl,
You can use QueryParser, which automatically builds the query from the
query syntax (as Erick showed), or use the BooleanQuery class:

BooleanQuery query = new BooleanQuery();
query.add(a_termquery, BooleanClause.Occur.SHOULD);
query.add(other_termquery, BooleanClause.Occur.SHOULD);
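
[A self-contained version of both approaches, using the 3.0 API; the field
name "contents" is an assumption:]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.Version;

public class MultiTermQueries {
    public static void main(String[] args) throws Exception {
        // 1. QueryParser: whitespace-separated terms are OR'ed by default
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_30));
        Query parsed = parser.parse("house garden computer");

        // 2. BooleanQuery by hand: SHOULD means "at least one must match"
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(new Term("contents", "house")), Occur.SHOULD);
        bq.add(new TermQuery(new Term("contents", "garden")), Occur.SHOULD);
        bq.add(new TermQuery(new Term("contents", "computer")), Occur.SHOULD);

        System.out.println(parsed + "\n" + bq);
    }
}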

