Re: Benchmarking my indexer

2008-11-02 Thread Grant Ingersoll


On Nov 1, 2008, at 1:39 AM, Rafael Cunha de Almeida wrote:

> Hello,
>
> I wrote an indexer that parses some files and indexes them using Lucene. I
> want to benchmark the whole thing, so I'd like to count the tokens being
> indexed so I can calculate the average number of indexed tokens per second.
> Is there a way to count the number of tokens in a document?

I think you would have to add a "CountingTokenFilter" that you write and
manage as you add documents.  Or, you could just take the total number of
tokens divided by the number of docs and use the average.  That can be
obtained w/o writing a new TokenFilter.
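
(As an illustration only -- this is not a class that ships with Lucene -- such
a filter could look roughly like the following, assuming the Lucene 2.x
TokenStream API where next() returns a Token or null:)

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Illustrative sketch only -- not an existing Lucene class.
    public final class CountingTokenFilter extends TokenFilter {
      private long tokenCount = 0;

      public CountingTokenFilter(TokenStream input) {
        super(input);
      }

      // Pass every token through unchanged, counting it on the way.
      public Token next() throws IOException {
        Token t = input.next();
        if (t != null) {
          tokenCount++;
        }
        return t;
      }

      public long getTokenCount() {
        return tokenCount;
      }
    }

You would put it at the end of your analyzer's token stream and read
getTokenCount() after each addDocument() call.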





> While I'm at it, I will also need to calculate the amount of memory my
> Java program used (peak, avg, etc). What Java tool would you suggest
> to figure that out?

Would JConsole (http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html)
help?  I'm not sure what people use here.





Re: Exact Phrase Query

2008-11-02 Thread semelak ss
I was in a hurry when copying and pasting the code. What I've actually been
using is only writer; ramWriter was never used, as it never really worked
(thanks to you, I now understand the reason).

The above is not really related to the problem I was facing. I modified my
code so that an IndexReader/IndexWriter is opened right before the word
comparison takes place and is closed right after. (I am currently not using
a RAMDirectory because of the problems faced earlier.)

Considering that the program is basically a loop that does thousands and
thousands of comparisons, this is definitely not the most efficient way of
handling things.

I would appreciate any input on how to improve the efficiency.



--- On Sat, 11/1/08, Erick Erickson <[EMAIL PROTECTED]> wrote:

> From: Erick Erickson <[EMAIL PROTECTED]>
> Subject: Re: Exact Phrase Query
> To: java-user@lucene.apache.org, [EMAIL PROTECTED]
> Date: Saturday, November 1, 2008, 5:06 PM
> Ah, finally. I'm almost completely sure you can't *write* to a RAMDirectory
> and expect the underlying FSDir to be updated. The intent of RAMDirectorys
> is to *read* in an index from disk and keep it in memory. Essentially I
> believe that your RAMDirectory constructor is taking a snapshot of the
> underlying disk index, modifying that in-memory copy, and throwing it away
> without ever writing it to disk. I wouldn't expect opening the FSDirectory
> after writing to the RAMDirectory to find anything. Ever.
>
> If you really need the RAMDir, I suspect you'll have to open an FS-based
> writer as well as a RAM-based writer, and write to both when necessary.
> You'll probably also have to open/search your RAM-based index as the
> faster alternative to re-opening the FS-based index. Either way, reopening
> the index is probably expensive, are you sure you need to? Is there a way
> to keep your information in an internal data structure for some period of
> time?
>
> Best
> Erick
> 
> 
> 
> On Sat, Nov 1, 2008 at 6:31 PM, semelak ss
> <[EMAIL PROTECTED]> wrote:
> 
> > I am not entirely sure if this can be the cause, but here is something I
> > thought might be related:
> > The idea is to have an index containing documents where each document
> > has a combination of two words, word1 and word2, and a score for these
> > two words. The index would be searched first to see if the two words
> > exist, and if not the score would be computed on the fly and then added
> > to the index. This process would be repeated thousands of times for
> > thousands of words.
> >
> > Hence, I have an indexwriter and a searcher:
> >
> > RAMDirectory ramDir = new RAMDirectory(INDEX_DIR);
> > IndexWriter ramWriter = new IndexWriter(ramDir, new WhitespaceAnalyzer(),
> >     true, IndexWriter.MaxFieldLength.UNLIMITED);
> > writer = new IndexWriter(INDEX_DIR, new WhitespaceAnalyzer(), true,
> >     IndexWriter.MaxFieldLength.UNLIMITED);
> >
> > FSDirectory fsdir = FSDirectory.getDirectory(INDEX_DIR);
> > IndexReader ir = IndexReader.open(fsdir);
> > _searcher = new IndexSearcher(ir);
> >
> > The indexWriter is closed near the end of the program (it's open while
> > searching for word combinations).
> >
> > When using Luke, I was able to search successfully for exact phrases. My
> > guess is that the problem I am facing has something to do with the
> > indexWriter, but I cannot pinpoint the exact cause of the problem.
> >
> >
> > --- On Sat, 11/1/08, semelak ss <[EMAIL PROTECTED]> wrote:
> >
> > > From: semelak ss <[EMAIL PROTECTED]>
> > > Subject: Re: Exact Phrase Query
> > > To: java-user@lucene.apache.org
> > > Date: Saturday, November 1, 2008, 10:03 AM
> > > When using Luke, searching for the following gives me hits now:
> > > "insurer storm"
> > > The syntax of the query as parsed by Luke is:
> > > word:"insurer storm"
> > >
> > > The code I am using is as follows:
> > > --
> > > _searcher = new IndexSearcher(INDEX_DIR);
> > > _parser = new QueryParser("word", new WhitespaceAnalyzer());
> > > Query q = _parser.parse(query);
> > > System.out.println(q.toString()); // this outputs -> word:"insurer storm"
> > > TopDocs vv = _searcher.search(q, 1);
> > > Hits tmph = _searcher.search(q);
> > > -
> > >
> > > both vv and tmph give no results (their size is 0)
> > >
> > >
> > > --- On Fri, 10/31/08, semelak ss <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: semelak ss <[EMAIL PROTECTED]>
> > > > Subject: Re: Exact Phrase Query
> > > > To: java-user@lucene.apache.org
> > > > Date: Friday, October 31, 2008, 9:41 AM
> > > > For indexing, I use the following:
> > > > ===
> > > > writer = new IndexWriter(INDEX_DIR, new WhitespaceAnalyzer(), true,
> > > >     IndexWriter.MaxFieldLength.UNLIMITED);
> > > > Document doc = new Document();
> > > > String tmpword = this.getProperForm(word1, word2);
> > > > doc.add(new Field("WORDS", tmpwo

addDocument vs addIndexes

2008-11-02 Thread Hadi Forghani
Hi friends,
Is merging N documents into an existing index better than adding N documents
to it one at a time? In other words, does IndexWriter.addIndexesNoOptimize
cause less I/O than IndexWriter.addDocument?
Thanks
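
(For reference, the two approaches being compared look roughly like this under
the 2.4 API; docs, mainWriter, tempDir and analyzer are placeholder names, and
nothing here settles which causes less I/O:)

    // Option 1: add each document directly to the existing index.
    for (Document doc : docs) {
      mainWriter.addDocument(doc);
    }

    // Option 2: build a separate index first, then merge it in without
    // triggering an optimize of the target index.
    IndexWriter tempWriter = new IndexWriter(tempDir, analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    for (Document doc : docs) {
      tempWriter.addDocument(doc);
    }
    tempWriter.close();
    mainWriter.addIndexesNoOptimize(new Directory[] { tempDir });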


Re: Exact Phrase Query

2008-11-02 Thread semelak ss
Also, is there a way to pass a null (or no) tokenizer when writing the field
"words" to the index? I have no need to tokenize the words, and the exact
query will always be known.

To explain the problem better: we are performing word comparisons across a
large number of text documents. Each word in each sentence is compared with
the rest of the words in the other sentences. A similarity score is computed
for each pair and stored in the index for fast retrieval in the future
(computation of the score is resource intensive). What we used to do was
construct a matrix, store the words in alphabetical order (for binary
search), and then load the words when the program was launched. Due to the
size of the files generated, updating was a real struggle.

Thus, we decided to use Lucene and store a score for each pair of words.
Updates should be much easier and faster; however, improving the search is
something we're still looking into. We are new to Lucene and would
appreciate any input in this regard.

Knowing that each document would contain only two fields, score and words,
and that no tokenization is needed, what would be the most efficient way to
implement this index using Lucene?
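
(One minimal sketch of such a layout, assuming Lucene 2.x; it stores each pair
as a single untokenized term and looks it up with a TermQuery rather than the
query parser. Variable names are illustrative:)

    // Index one document per word pair; UN_TOKENIZED stores the pair as a
    // single term, so no analyzer/tokenizer is applied to it.
    Document doc = new Document();
    doc.add(new Field("words", word1 + " " + word2,
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("score", Float.toString(score),
        Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);

    // Look the pair up with an exact TermQuery -- no parsing, no phrase query.
    Query q = new TermQuery(new Term("words", word1 + " " + word2));
    TopDocs hits = searcher.search(q, 1);
    if (hits.totalHits > 0) {
      Document found = searcher.doc(hits.scoreDocs[0].doc);
      float storedScore = Float.parseFloat(found.get("score"));
    }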



Re: Exact Phrase Query

2008-11-02 Thread Erick Erickson
Sorry, but I've really run out of patience here. You have consistently
stated only
part of the problem, never posting enough information to allow me to answer
helpfully. You haven't even taken the time to proofread your posts, which
has wasted my (limited, volunteer) time.

In the future, please consider the fact that people trying to help with your
problem are volunteering their time, and respect that fact by making a
greater effort to make it easy and efficient for us to help with what is,
after all, *your* problem.

Best
Erick


Re: Exact Phrase Query

2008-11-02 Thread semelak ss
Hello Erick,

If it weren't for your help and kind responses, I would still be struggling
with the initial problem I had. The solution to that problem turned out to be
the one you mentioned in your response (an IndexWriter and IndexReader both
being open at the same time).

The problem I mentioned in my last message is different from the initial
question I posted. It's really a request for thoughts and people's input on
how to improve searching, given the structure of the data described in my
last message.

Again, I appreciate your help (and I am not saying this because I am looking
forward to your response).



Re: Benchmarking my indexer

2008-11-02 Thread Rafael Cunha de Almeida
On Sun, 2 Nov 2008 07:11:20 -0500
Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> 
> On Nov 1, 2008, at 1:39 AM, Rafael Cunha de Almeida wrote:
> 
> > Hello,
> >
> > I did an indexer that parses some files and indexes them using  
> > lucene. I
> > want to benchmark the whole thing, so I'd like to count the tokens
> > being indexed so I can calculate the average number of indexed tokens
> > per second. Is there a way to count the number of tokens on a  
> > document?
> 
> I think you would have to add a "CountingTokenFilter", that you write  
> and manage as you add documents.  Or, you could just take the total #  
> of tokens / by the number of docs and use the average.  That can be  
> obtained w/o writing a new TokenFilter.

How would I obtain the total number of tokens in an index? I couldn't
find that statistic anywhere. I looked for it in the IndexWriter,
IndexReader and IndexSearcher classes. Is there maybe some tool I could run
on an index, or something like that?
> 
> >
> > While I'm at it, I will also need to calculate the amount of memory my
> > java program used (peak, avg, etc), what java tool would you suggest  
> > me
> > to figure that out?
> 
> 
> Would JConsole (http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html)
> help?  I'm not sure what people use here
> 

Will look into it :-)




Searching over multiple fields using XML document

2008-11-02 Thread syedfa

Dear fellow Java/Lucene developers:

I am trying to search an XML document over multiple fields. I created the
index using the SAX method. I am trying to search Shakespeare's "Hamlet"
over the SPEAKER and LINES tags for words that the user is looking for.
I am thinking of using the MultiFieldQueryParser; however, I also read that
a better alternative would be to combine the various fields together. In the
book "Lucene in Action", the author writes:

"A synthetic 'contents' field in our test environment uses this scheme to
put author and subjects together:

doc.add(Field.UnStored("contents", author + " " + subjects));

We used a space (" ") between author and subjects to separate words for the
analyzer."

I am not sure I fully understand what the author is referring to here.  In
my situation I have the following:

public void endElement(String uri, String localName, String qName)
        throws SAXException {
    try {
        if (qName.equals("REFERENCE")) {
            Field reference = new Field(qName, elementBuffer.toString(),
                    Field.Store.YES, Field.Index.NO, Field.TermVector.NO);
            doc.add(reference);
        } else if (qName.equals("SPEAKER")) {
            Field speaker = new Field(qName, elementBuffer.toString(),
                    Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
            speaker.setBoost(2.0f);
            doc.add(speaker);
        } else if (qName.equals("LINES")) {
            Field lines = new Field(qName, elementBuffer.toString(),
                    Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
            lines.setBoost(1.0f);
            doc.add(lines);
            indexWriter.addDocument(doc);
        } else {
            return;
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}


How would I combine the fields together into one synthetic field here, so
that in my searcher code I would search over one field, yet retrieve results
from the several fields in which the keyword is found and show them to the
user?
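
(Not from the original thread, but one reading of the book's suggestion
applied to this handler: remember the speaker text when its element closes,
then add a catch-all field just before the document is added. speakerText is
an assumed field on the handler class:)

    // Inside endElement(), alongside the existing per-field code:
    if (qName.equals("SPEAKER")) {
        speakerText = elementBuffer.toString();   // keep for the catch-all field
    } else if (qName.equals("LINES")) {
        String linesText = elementBuffer.toString();
        // synthetic catch-all field: search one field instead of using
        // MultiFieldQueryParser
        doc.add(new Field("contents", speakerText + " " + linesText,
                Field.Store.NO, Field.Index.TOKENIZED));
        indexWriter.addDocument(doc);
    }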

All I want to do is allow a user to search an XML document over multiple
fields and return the results with the keywords they are searching for
highlighted in the results list, just as Google does when searching for
websites. At this point, I am able to do a simple/fuzzy/wildcard search over
one field in the XML document, but would like to extend this functionality
to multiple fields. Any ideas?

Thanks in advance to all who reply.
Sincerely;
Fayyaz





Performance of never optimizing

2008-11-02 Thread Justus Pendleton

Howdy,

I have a couple of questions regarding some Lucene benchmarking and  
what the results mean[3]. (Skip to the numbered list at the end if you  
don't want to read the lengthy exegesis :)


I'm a developer for JIRA[1]. We are currently trying to get a better
understanding of Lucene, and our use of it, to cope with the needs of
our larger customers. These "large" indexes are only a couple hundred
thousand documents, but our problem is compounded by the fact that they
have a relatively high rate of modification (= delete + insert of a new
document) and our users expect these modifications to show up in query
results pretty much instantly.


Our current default behaviour is a merge factor of 4. We perform an  
optimization on the index every 4000 additions. We also perform an  
optimize at midnight. Our fundamental problem is that these  
optimizations are locking the index for unacceptably long periods of  
time, something that we want to resolve for our next major release,  
hopefully without undermining search performance too badly.


In the Lucene javadoc there is a comment, and a link to a mailing list  
discussion[2], that suggests applications such as JIRA should never  
perform optimize but should instead set their merge factor very low.


In an attempt to understand the impact of a) lowering the merge factor  
from 4 to 2 and b) never, ever optimizing on an index (over the course  
of years and millions of additions/updates) I wanted to try to  
benchmark Lucene.


I used the contrib/benchmark framework and wrote a small algorithm  
that adds documents to an index (using the Reuters doc generator),  
does a search, does an optimize, then does another search. All the  
pretty pictures can be seen at:


  http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
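
(Roughly, the shape of that measurement expressed as plain Lucene 2.4 calls --
a sketch rather than the actual contrib/benchmark algorithm file; dir,
analyzer, query, numDocs and numSearches are placeholders, and
makeReutersLikeDoc stands in for the Reuters doc generator:)

    IndexWriter w = new IndexWriter(dir, analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    w.setMergeFactor(4);                      // or 2, for the other run
    for (int i = 0; i < numDocs; i++) {
      w.addDocument(makeReutersLikeDoc(i));   // stand-in for the Reuters docs
    }
    w.close();

    IndexSearcher s = new IndexSearcher(dir);
    long t0 = System.currentTimeMillis();
    for (int i = 0; i < numSearches; i++) {
      s.search(query, 10);                    // un-optimized search timing
    }
    long unoptimizedMillis = System.currentTimeMillis() - t0;
    s.close();

    IndexWriter w2 = new IndexWriter(dir, analyzer, false,
        IndexWriter.MaxFieldLength.UNLIMITED);
    w2.optimize();
    w2.close();

    IndexSearcher s2 = new IndexSearcher(dir);
    long t1 = System.currentTimeMillis();
    for (int i = 0; i < numSearches; i++) {
      s2.search(query, 10);                   // optimized search timing
    }
    long optimizedMillis = System.currentTimeMillis() - t1;
    s2.close();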

I have several questions, hopefully they aren't overwhelming in their  
quantity :-/


1. Why does the merge factor of 4 appear to be faster than the merge  
factor of 2?


2. Why does non-optimized searching appear to be faster than optimized  
searching once the index hits ~500,000 documents?


3. There appears to be a fairly sizable performance drop across the  
board around 450,000 documents. Why is that?


4. Searching performance appears to decrease towards a fairly  
pessimistic 20 searches per second (for a relatively simple search).  
Is this really what we should expect long-term from Lucene?


5. Does my benchmark even make sense? I am far from an expert on  
benchmarking so it is possible I'm not measuring what I think I am  
measuring.


Thanks in advance for any insight you can provide. This is an area  
that we very much want to understand better as Lucene is a key part of  
JIRA's success,


Cheers,
Justus
JIRA Developer

[1]: http://www.atlassian.com
[2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
[3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs




Re: Performance of never optimizing

2008-11-02 Thread Otis Gospodnetic
Hello,

 
Very quick comments.


- Original Message 
> From: Justus Pendleton <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Sunday, November 2, 2008 10:42:52 PM
> Subject: Performance of never optimizing
> 
> Howdy,
> 
> I have a couple of questions regarding some Lucene benchmarking and what the 
> results mean[3]. (Skip to the numbered list at the end if you don't want to 
> read 
> the lengthy exegesis :)
> 
> I'm a developer for JIRA[1]. We are currently trying to get a better 
> understanding of Lucene, and our use of it, to cope with the needs of our 
> larger 
> customers. These "large" indexes are only a couple hundred thousand documents 
> but our problem is compounded by the fact that they have a relatively high 
> rate 
> of modification (=delete+insert of new document) and our users expect these 
> modification to show up in query results pretty much instantly.


This will be a tough call with large indices - there is no real-time search in 
Lucene yet.

> Our current default behaviour is a merge factor of 4. We perform an 
> optimization 
> on the index every 4000 additions. We also perform an optimize at midnight. 
> Our 


I wouldn't optimize every 4000 additions - you are killing IO, rewriting the 
whole index, while trying to provide fast searches, plus you are locking the 
index for other modifications.

> fundamental problem is that these optimizations are locking the index for 
> unacceptably long periods of time, something that we want to resolve for our 
> next major release, hopefully without undermining search performance too 
> badly.


Why are you optimizing?  Trying to make the search faster?  I would try to 
avoid optimizing during high usage periods.

> In the Lucene javadoc there is a comment, and a link to a mailing list 
> discussion[2], that suggests applications such as JIRA should never perform 
> optimize but should instead set their merge factor very low.


Right, you can let Lucene merge segments.

> In an attempt to understand the impact of a) lowering the merge factor from 4 
> to 
> 2 and b) never, ever optimizing on an index (over the course of years and 
> millions of additions/updates) I wanted to try to benchmark Lucene.


One thing that you might not have tried is the constant re-opening of the 
IndexReader, which you'll need to do if you want to see index changes instantly.
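
(A minimal sketch of that reopen pattern, assuming Lucene 2.3+ where
IndexReader.reopen() is available; reader and searcher are whatever the
application already holds:)

    // reopen() returns the same instance if nothing changed, otherwise a new
    // reader that sees the latest committed changes.
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
      // In a live application, close the old reader only once in-flight
      // searches against it have finished.
      reader.close();
      reader = newReader;
      searcher = new IndexSearcher(reader);
    }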

> I used the contrib/benchmark framework and wrote a small algorithm that adds 
> documents to an index (using the Reuters doc generator), does a search, does 
> an 
> optimize, then does another search. All the pretty pictures can be seen at:


So you indexed once and then measured search performance?  Or did you measure 
indexing performance?  I can't quite tell from your email.
And in one case you optimized before searching and in the other you did not 
optimize?

>   http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
> 
> I have several questions, hopefully they aren't overwhelming in their 
> quantity 
> :-/
> 
> 1. Why does the merge factor of 4 appear to be faster than the merge factor 
> of 
> 2?


Faster for indexing or searching?  If indexing, then it's because 4 means fewer 
segment merges than 2.  If searching, then I don't know, unless you had 
indexing and searching happening in parallel, which then means less IO for 4.

Did your index fit in RAM, by the way?

> 2. Why does non-optimized searching appear to be faster than optimized 
> searching 
> once the index hits ~500,000 documents?


Not sure without seeing the index/machine.
It sounds like you were measuring search performance while at the same time 
increasing the index size by incrementally adding more docs?

> 3. There appears to be a fairly sizable performance drop across the board 
> around 
> 450,000 documents. Why is that?

Something to do with Lucene merging index segments around that point?  At this 
point I'm assuming you were measuring search speed while indexing.


> 4. Searching performance appears to decrease towards a fairly pessimistic 20 
> searches per second (for a relatively simple search). Is this really what we 
> should expect long-term from Lucene?


20 reqs/sec sounds very low.  How large is your index, how much RAM, and how 
about heap size?
What were your queries like? random?  from log?

> 5. Does my benchmark even make sense? I am far from an expert on benchmarking 
> so 
> it is possible I'm not measuring what I think I am measuring.


I'm confused by what exactly you did and measured, but it could just be that 
I'm tired.

> Thanks in advance for any insight you can provide. This is an area that we 
> very 
> much want to understand better as Lucene is a key part of JIRA's success,

>
> [1]: http://www.atlassian.com
> [2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
> [3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Performance of never optimizing

2008-11-02 Thread Justus Pendleton

On 03/11/2008, at 4:27 PM, Otis Gospodnetic wrote:

> Why are you optimizing?  Trying to make the search faster?  I would try to
> avoid optimizing during high usage periods.

I assume that the original, long-ago decision to optimize was made to
improve searching performance.

> One thing that you might not have tried is the constant re-opening of the
> IndexReader, which you'll need to do if you want to see index changes
> instantly.

We do keep track of when the index has been updated and re-open
IndexReaders so that they see the updates instantly.

> So you indexed once and then measured search performance?  Or did you
> measure indexing performance?  I can't quite tell from your email.
> And in one case you optimized before searching and in the other you did
> not optimize?

Yes, I indexed once and then measured search performance. (The actual
algorithm used can be seen at
http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs)
For my current purposes I don't care about indexing performance.

> > 1. Why does the merge factor of 4 appear to be faster than the merge
> > factor of 2?
>
> Faster for indexing or searching?  If indexing, then it's because 4 means
> fewer segment merges than 2.  If searching, then I don't know, unless you
> had indexing and searching happening in parallel, which then means less IO
> for 4.

For searching. The indexing and searching should not have been happening in
parallel. However, multiple searches are occurring in parallel.

> Did your index fit in RAM, by the way?

The machine has, I believe, 4 GB of RAM and the benchmark suite reports
that 700 MB were used, so it does appear to have fit into RAM.

> > 2. Why does non-optimized searching appear to be faster than optimized
> > searching once the index hits ~500,000 documents?
>
> Not sure without seeing the index/machine.

The machine is an 8-core Mac Pro. If you'd like, I can provide the indexes
online somewhere. Or if you can provide pointers on what to look for, I'm
more than happy to investigate this myself.

> It sounds like you were measuring search performance while at the same
> time increasing the index size by incrementally adding more docs?

No documents were being added to the index while the searching was being
performed. I was trying to measure only the search performance.

> 20 reqs/sec sounds very low.  How large is your index, how much RAM, and
> how about heap size?
> What were your queries like? random?  from log?

The queries were generated by the ReutersQueryMaker. I am not sure what the
heap size was at various stages. (I ran the benchmarks over the weekend;
they took several days.)

> I'm confused by what exactly you did and measured, but it could just be
> that I'm tired.

My apologies for not being clearer in my initial email. I appreciate the
help,

Cheers,
Justus





Re: Performance of never optimizing

2008-11-02 Thread Chris Lu

Hi, Justus,

I have run into very similar problems to JIRA's: a high rate of
modification on a large data volume. It's a pretty common use case for
Lucene.

The way I dealt with the high rate of modification is to create a secondary
in-memory index and only persist documents older than a certain age, so
searching needs to combine results from the two indexes. It's a bit more
complicated on the indexing side, but it's well worth it to avoid the
extra IO-heavy merging and to improve response time, especially the ability
to search newly added documents right away.
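
(A rough sketch of the combining step, assuming Lucene's MultiSearcher;
ramDir and fsDir stand in for the in-memory and on-disk indexes:)

    // Search the small in-memory index and the on-disk index as one
    // logical index.
    IndexSearcher ramSearcher = new IndexSearcher(ramDir);
    IndexSearcher fsSearcher = new IndexSearcher(fsDir);
    Searcher combined = new MultiSearcher(
        new Searchable[] { ramSearcher, fsSearcher });
    TopDocs hits = combined.search(query, 10);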


BTW: JIRA is great!

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 
Million Euro funding!

