Re: URGENT: Help indexing large document set

2004-11-27 Thread John Wang
 From: [EMAIL PROTECTED]
 Sent: Monday, November 22, 2004 12:35 PM
 To: Lucene Users List
 Subject: Re: Index in RAM - is it really worthy?
 
 In my test, I have 12900 documents. Each document is small: a few
 discrete fields (Keyword type) and one Text field containing only one
 sentence.
 
 With both mergeFactor and maxMergeDocs set to 1000:
 
 using RAMDirectory, the indexing job took about 9.2 seconds;
 
 not using RAMDirectory, the indexing job took about 122 seconds.
 
 I am not calling optimize.
 
 This is on Windows XP running Java 1.5.
 
 Is there something very wrong or different in my setup to cause such a
 big difference?
 
 Thanks
 
 -John
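
 For reference, here is a minimal sketch of the kind of timing comparison
 John describes, assuming the Lucene 1.4-era API in use on this list
 (public mergeFactor/maxMergeDocs fields, Field.Keyword/Field.Text
 factories); the field names, document text, and the /tmp index path are
 illustrative placeholders, not John's actual code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamVsFsTiming {

        // Index a fixed set of small documents into the given Directory and
        // report the elapsed wall-clock time.
        static long indexInto(Directory dir, int numDocs) throws Exception {
            long start = System.currentTimeMillis();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            writer.mergeFactor = 1000;   // settings mentioned in the post
            writer.maxMergeDocs = 1000;
            for (int i = 0; i < numDocs; i++) {
                Document doc = new Document();
                doc.add(Field.Keyword("id", "doc-" + i));             // small keyword field
                doc.add(Field.Text("body", "a single short sentence " + i));
                writer.addDocument(doc);
            }
            writer.close();              // no optimize(), as in the post
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws Exception {
            int numDocs = 12900;
            long ramMillis = indexInto(new RAMDirectory(), numDocs);
            long fsMillis  = indexInto(FSDirectory.getDirectory("/tmp/fs-index", true), numDocs);
            System.out.println("RAMDirectory: " + ramMillis + " ms, FSDirectory: " + fsMillis + " ms");
        }
    }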
 
 On Mon, 22 Nov 2004 09:23:40 -0800 (PST), Otis Gospodnetic
 [EMAIL PROTECTED] wrote:
  For the Lucene book I wrote some test cases that compare FSDirectory
  and RAMDirectory.  What I found was that with certain settings
  FSDirectory was almost as fast as RAMDirectory.  Personally, I would
  push FSDirectory and hope that the OS and the filesystem do their
  share of work and caching for me before looking for ways to optimize
  my code.
 
  Otis
 
 
 
  --- [EMAIL PROTECTED] wrote:
 
  
   I did the following test:
   I created a RAM folder on my Red Hat box and copied c. 1 GB of
   indexes there.
   I expected the queries to run much quicker.
   In reality it was sometimes even slower (sic!).
  
   Lucene has its own RAM disk functionality. If I implement it, would
   it bring any benefits?
  
   Thanks in advance
   J.
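
   Lucene's "own RAM disk functionality" here means RAMDirectory. Below is
   a rough sketch of loading an existing on-disk index into memory and
   searching it, assuming your Lucene release has a RAMDirectory
   constructor that copies another Directory (the 1.4 line should, if I
   recall correctly); the index path, field, and term are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class SearchFromRam {
        public static void main(String[] args) throws Exception {
            // Copy the on-disk index into an in-memory Directory, then search it.
            RAMDirectory ramDir =
                new RAMDirectory(FSDirectory.getDirectory("/path/to/index", false));
            IndexSearcher searcher = new IndexSearcher(ramDir);
            Hits hits = searcher.search(new TermQuery(new Term("id", "doc-42")));
            System.out.println("hits: " + hits.length());
            searcher.close();
        }
    }

   As the posts above and below suggest, though, the OS file cache often
   gives FSDirectory much of the same benefit for free.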
 
 
   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Saturday, November 27, 2004 11:50 AM
   To: Chuck Williams
   Subject: Re: URGENT: Help indexing large document set
  
    I found the reason for the degradation. It is because I was writing
    to a RAMDirectory and then adding to an FSDirectory-based writer. I
    guess it makes sense, since the addIndexes call slows down as the
    index grows.
   
    I guess it is not a good idea to use a RAMDirectory if there are many
    small batches. Are there some performance numbers that would tell me
    when to (and when not to) use a RAMDirectory?
  
   thanks
  
   -John
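
    For context, here is a bare-bones sketch of the pattern John describes
    (index each batch into a throwaway RAMDirectory, then merge it into the
    long-lived on-disk index with addIndexes), assuming the Lucene 1.4-era
    API; as far as I recall, addIndexes in that era optimizes against the
    whole target index, which would explain John's observation that each
    call gets slower as the index grows. Class and method names below are
    placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamBatchMerge {

        // Index one batch into a fresh RAMDirectory, then merge that small
        // in-memory index into the existing on-disk index.
        static void flushBatch(Directory fsDir, Document[] batch) throws Exception {
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
            for (int i = 0; i < batch.length; i++) {
                ramWriter.addDocument(batch[i]);
            }
            ramWriter.close();

            // Open the on-disk index for appending (create == false) and merge.
            IndexWriter fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), false);
            fsWriter.addIndexes(new Directory[] { ramDir });
            fsWriter.close();
        }
    }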
  
  
   On Wed, 24 Nov 2004 15:23:49 -0800, John Wang [EMAIL PROTECTED]
   wrote:
Hi Chuck:
   
  The reason I am not using localReader.delete(term) is because I
  have some logic to check whether to delete the term based on a flag.
    
  I am testing with the keys sorted.
    
  I am not doing anything weird, just committing 2000 batches of 500
  documents each to the index. I don't know why it is having this
  linear slowdown...
   
   
   
Thanks
   
-John
   
On Wed, 24 Nov 2004 12:32:52 -0800, Chuck Williams
 [EMAIL PROTECTED]
   wrote:
  Does keyIter return the keys in sorted order?  This should reduce
  seeks, especially if the keys are dense.
 
  Also, you should be able to call localReader.delete(term) instead of
  iterating over the docs (of which I presume there is only one doc,
  since keys are unique).  This won't improve performance, as
  IndexReader.delete(Term) does exactly what your code does, but it
  will be cleaner.
 
  A linear slowdown with the number of docs doesn't make sense, so
  something else must be wrong.  I'm not sure what the default buffer
  size is (it appears it used to be 128 but is dynamic now, I think).
  You might find the slowdown stops after a certain point, especially
  if you increase your batch size.



 Chuck

   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, November 24, 2004 12:21 PM
   To: Lucene Users List
   Subject: Re: URGENT: Help indexing large document set
  
   Thanks Paul!
  
    Using your suggestion, I have changed the update check code to use
    only the indexReader:
  
    try {
      localReader = IndexReader.open(path);

      while (keyIter.hasNext()) {
        key = (String) keyIter.next();
        term = new Term(key, key);
        TermDocs tDocs = localReader.termDocs(term);
        if (tDocs != null) {
          try {
            while (tDocs.next()) {
              localReader.delete(tDocs.doc());
            }
          } finally {
            tDocs.close();
          }
        }
      }
    } finally {
      if (localReader != null) {
        localReader.close();
      }
    }

Re: URGENT: Help indexing large document set

2004-11-24 Thread Paul Elschot
On Wednesday 24 November 2004 00:37, John Wang wrote:
 Hi:
 
 I am trying to index 1M documents, in batches of 500 documents.
  
 Each document has a unique text key, which is added as a
  Field.Keyword(name, value).
  
 For each batch of 500, I need to make sure I am not adding a
  document with a key that is already in the current index.
  
    To do this, I am calling IndexSearcher.docFreq for each document
  and deleting the document currently in the index with the same key:
  
while (keyIter.hasNext()) {
 String objectID = (String) keyIter.next();
 term = new Term(key, objectID);
 int count = localSearcher.docFreq(term);

To speed this up a bit, make sure that the iterator gives
the terms in sorted order. I'd use an index reader instead
of a searcher, but that will probably not make a difference.

Adding the documents can be done with multiple threads.
Last time I checked that, there was a moderate speed up
using three threads instead of one on a single CPU machine.
Tuning the values of minMergeDocs and maxMergeDocs
may also help to increase performance of adding documents.

Regards,
Paul Elschot
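
To make Paul's suggestions concrete, here is a rough sketch of sorting
the batch keys and feeding documents to one IndexWriter from a few
threads with minMergeDocs raised. It assumes the Lucene 1.4-era public
tuning fields and relies on IndexWriter.addDocument being safe to call
from multiple threads, as Paul implies; loadBatchKeys, the field name,
the index path, and the concrete numbers are placeholders:

    import java.util.Arrays;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ThreadedAdd {
        public static void main(String[] args) throws Exception {
            final IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
            writer.minMergeDocs = 1000;   // buffer more docs in RAM before flushing a segment
            writer.mergeFactor = 50;      // merge segments less often

            final String[] keys = loadBatchKeys();   // placeholder for the real key source
            Arrays.sort(keys);                       // sorted keys also help the delete pass

            // A few adder threads sharing the single writer.
            Thread[] workers = new Thread[3];
            for (int t = 0; t < workers.length; t++) {
                final int offset = t;
                workers[t] = new Thread() {
                    public void run() {
                        try {
                            for (int i = offset; i < keys.length; i += 3) {
                                Document doc = new Document();
                                doc.add(Field.Keyword("key", keys[i]));
                                writer.addDocument(doc);
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                };
                workers[t].start();
            }
            for (int t = 0; t < workers.length; t++) {
                workers[t].join();
            }
            writer.close();
        }

        static String[] loadBatchKeys() {
            return new String[0];   // placeholder
        }
    }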





Re: URGENT: Help indexing large document set

2004-11-24 Thread John Wang
Thanks Paul!

Using your suggestion, I have changed the update check code to use
only the indexReader:

try {
  localReader = IndexReader.open(path);

  while (keyIter.hasNext()) {
    key = (String) keyIter.next();
    term = new Term(key, key);
    TermDocs tDocs = localReader.termDocs(term);
    if (tDocs != null) {
      try {
        while (tDocs.next()) {
          localReader.delete(tDocs.doc());
        }
      } finally {
        tDocs.close();
      }
    }
  }
} finally {
  if (localReader != null) {
    localReader.close();
  }
}


Unfortunately it didn't seem to make any dramatic difference.

I also see the CPU is only 30-50% busy, so I am guessing it's spending
a lot of time in IO. Any way of making the CPU work harder?

Is a batch size of 500 too small for 1 million documents?

Currently I am seeing a linear speed degradation of 0.3 milliseconds
per document.

Thanks

-John


On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot [EMAIL PROTECTED] wrote:
 On Wednesday 24 November 2004 00:37, John Wang wrote:
 
 
  Hi:
 
 I am trying to index 1M documents, in batches of 500 documents.
  
 Each document has a unique text key, which is added as a
  Field.Keyword(name, value).
  
 For each batch of 500, I need to make sure I am not adding a
  document with a key that is already in the current index.
  
 To do this, I am calling IndexSearcher.docFreq for each document
  and deleting the document currently in the index with the same key:
 
 while (keyIter.hasNext()) {
  String objectID = (String) keyIter.next();
  term = new Term(key, objectID);
  int count = localSearcher.docFreq(term);
 
 To speed this up a bit make sure that the iterator gives
 the terms in sorted order. I'd use an index reader instead
 of a searcher, but that will probably not make a difference.
 
 Adding the documents can be done with multiple threads.
 Last time I checked that, there was a moderate speed up
 using three threads instead of one on a single CPU machine.
 Tuning the values of minMergeDocs and maxMergeDocs
 may also help to increase performance of adding documents.
 
 Regards,
 Paul Elschot
 
 





RE: URGENT: Help indexing large document set

2004-11-24 Thread Chuck Williams
Does keyIter return the keys in sorted order?  This should reduce seeks,
especially if the keys are dense.

Also, you should be able to call localReader.delete(term) instead of
iterating over the docs (of which I presume there is only one doc, since
keys are unique).  This won't improve performance, as
IndexReader.delete(Term) does exactly what your code does, but it will
be cleaner.

A linear slowdown with the number of docs doesn't make sense, so something
else must be wrong.  I'm not sure what the default buffer size is (it
appears it used to be 128 but is dynamic now, I think).  You might find
the slowdown stops after a certain point, especially if you increase
your batch size.

Chuck
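
A condensed sketch of what Chuck is suggesting, assuming the Lucene
1.4-era IndexReader API; the field name and the sorted key collection
are placeholders for John's actual data:

    import java.util.Iterator;
    import java.util.TreeSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class DeleteByTerm {

        // Delete any existing document for each key in the batch before re-adding.
        // IndexReader.delete(Term) walks the postings for the term itself, so the
        // explicit TermDocs loop is unnecessary; iterating the keys in sorted
        // order keeps the term lookups moving forward through the term dictionary.
        static void deleteBatch(String indexPath, String keyField, TreeSet sortedKeys)
                throws Exception {
            IndexReader reader = IndexReader.open(indexPath);
            try {
                for (Iterator it = sortedKeys.iterator(); it.hasNext();) {
                    String key = (String) it.next();
                    reader.delete(new Term(keyField, key));   // returns the number of docs deleted
                }
            } finally {
                reader.close();
            }
        }
    }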

   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, November 24, 2004 12:21 PM
   To: Lucene Users List
   Subject: Re: URGENT: Help indexing large document set
   
   Thanks Paul!
   
   Using your suggestion, I have changed the update check code to use
   only the indexReader:
   
    try {
      localReader = IndexReader.open(path);

      while (keyIter.hasNext()) {
        key = (String) keyIter.next();
        term = new Term(key, key);
        TermDocs tDocs = localReader.termDocs(term);
        if (tDocs != null) {
          try {
            while (tDocs.next()) {
              localReader.delete(tDocs.doc());
            }
          } finally {
            tDocs.close();
          }
        }
      }
    } finally {
      if (localReader != null) {
        localReader.close();
      }
    }
   
   
   Unfortunately it didn't seem to make any dramatic difference.
   
    I also see the CPU is only 30-50% busy, so I am guessing it's
    spending a lot of time in IO. Any way of making the CPU work harder?
    
    Is a batch size of 500 too small for 1 million documents?
    
    Currently I am seeing a linear speed degradation of 0.3 milliseconds
    per document.
   
   Thanks
   
   -John
   
   
   On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot
   [EMAIL PROTECTED] wrote:
On Wednesday 24 November 2004 00:37, John Wang wrote:
   
   
  Hi:
 
 I am trying to index 1M documents, in batches of 500 documents.
 
 Each document has a unique text key, which is added as a
  Field.Keyword(name, value).
 
 For each batch of 500, I need to make sure I am not adding a
  document with a key that is already in the current index.
 
    To do this, I am calling IndexSearcher.docFreq for each document
  and deleting the document currently in the index with the same key:
 
 while (keyIter.hasNext()) {
   String objectID = (String) keyIter.next();
   term = new Term(key, objectID);
   int count = localSearcher.docFreq(term);
   
To speed this up a bit make sure that the iterator gives
the terms in sorted order. I'd use an index reader instead
of a searcher, but that will probably not make a difference.
   
Adding the documents can be done with multiple threads.
Last time I checked that, there was a moderate speed up
using three threads instead of one on a single CPU machine.
Tuning the values of minMergeDocs and maxMergeDocs
may also help to increase performance of adding documents.
   
Regards,
Paul Elschot
   
   





URGENT: Help indexing large document set

2004-11-23 Thread John Wang
Hi:

   I am trying to index 1M documents, in batches of 500 documents.

   Each document has a unique text key, which is added as a
Field.Keyword(name, value).

   For each batch of 500, I need to make sure I am not adding a
document with a key that is already in the current index.

   To do this, I am calling IndexSearcher.docFreq for each document and
deleting the document currently in the index with the same key:

   while (keyIter.hasNext()) {
     String objectID = (String) keyIter.next();
     term = new Term(key, objectID);
     int count = localSearcher.docFreq(term);

     if (count != 0) {
       localReader.delete(term);
     }
   }

Then I proceed with adding the documents.

This turns out to be extremely expensive. I looked into the code, and in
TermInfosReader.get(Term term) I see it doing a linear lookup for each
term, so as the index grows the above operation degrades at a linear
rate, and for each commit we are doing a docFreq for 500 documents.

I also tried to create a BooleanQuery composed of 500 TermQueries and
do one search for each batch, and the performance didn't get better. And
if the batch size increases to, say, 50,000, creating a BooleanQuery
composed of 50,000 TermQuery instances may introduce huge memory
costs.
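
For what it's worth, the batched query John mentions would look roughly
like the sketch below, assuming the Lucene 1.4-era
BooleanQuery.add(Query, required, prohibited) signature. Note that
BooleanQuery also enforces a maximum clause count (1024 by default, if I
remember the 1.4 defaults correctly), so a 50,000-term batch would need
the limit raised or the batch split; the field name and helper are
placeholders:

    import java.util.Iterator;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class BatchExistenceCheck {

        // Build one disjunctive query over every key in the batch and run a
        // single search to find which keys already exist in the index.
        static Hits findExisting(IndexSearcher searcher, String keyField, Iterator keyIter)
                throws Exception {
            BooleanQuery query = new BooleanQuery();
            while (keyIter.hasNext()) {
                String key = (String) keyIter.next();
                // add(clause, required, prohibited): optional clause, i.e. a plain OR
                query.add(new TermQuery(new Term(keyField, key)), false, false);
            }
            return searcher.search(query);
        }
    }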

Is there a better way to do this?

Can TermInfosReader.get(Term term) be optimized to do a binary lookup
instead of a linear walk? Of course that depends on whether the terms
are stored in sorted order, are they?

This is very urgent, thanks in advance for all your help.

-John




RE: URGENT: Help indexing large document set

2004-11-23 Thread Chuck Williams
Are you sure you have a performance problem with
TermInfosReader.get(Term)?  It looks to me like it scans sequentially
only within a small buffer window (of size
SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
See TermInfosReader.getIndexOffset(Term).

Chuck
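
As a stand-alone illustration of the lookup strategy Chuck describes
(this is a simplified sketch, not Lucene's actual TermInfosReader code):
keep every Nth term of the sorted term list in memory, binary-search
that sparse index, then scan at most one interval sequentially.

    import java.util.Arrays;

    public class TermIndexLookupSketch {

        // allTerms is the full sorted term list; indexTerms holds every
        // interval-th entry of allTerms (indexTerms[k] == allTerms[k * interval]).
        static int lookup(String[] allTerms, String[] indexTerms, int interval, String target) {
            int pos = Arrays.binarySearch(indexTerms, target);
            if (pos < 0) {
                pos = -pos - 2;               // index entry at or before the target
            }
            if (pos < 0) {
                return -1;                    // target sorts before the first term
            }
            int start = pos * interval;       // position of that indexed term in allTerms
            int end = Math.min(start + interval, allTerms.length);
            for (int i = start; i < end; i++) {   // bounded scan, not a full-index walk
                if (allTerms[i].equals(target)) {
                    return i;
                }
            }
            return -1;                        // term not present
        }
    }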

   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 3:38 PM
   To: [EMAIL PROTECTED]
   Subject: URGENT: Help indexing large document set
   
   Hi:
   
  I am trying to index 1M documents, with batches of 500 documents.
   
  Each document has an unique text key, which is added as a
   Field.KeyWord(name,value).
   
  For each batch of 500, I need to make sure I am not adding a
   document with a key that is already in the current index.
   
  To do this, I am calling IndexSearcher.docFreq for each document
    and delete the document currently in the index with the same key:
   
   while (keyIter.hasNext()) {
     String objectID = (String) keyIter.next();
     term = new Term(key, objectID);
     int count = localSearcher.docFreq(term);

     if (count != 0) {
       localReader.delete(term);
     }
   }
   
   Then I proceed with adding the documents.
   
    This turns out to be extremely expensive, I looked into the code and
    I see in TermInfosReader.get(Term term) it is doing a linear look up
    for each term. So as the index grows, the above operation degrades
    at a linear rate. So for each commit, we are doing a docFreq for 500
    documents.
    
    I also tried to create a BooleanQuery composed of 500 TermQueries
    and do 1 search for each batch, and the performance didn't get
    better. And if the batch size increases to say 50,000, creating a
    BooleanQuery composed of 50,000 TermQuery instances may introduce
    huge memory costs.
    
    Is there a better way to do this?
    
    Can TermInfosReader.get(Term term) be optimized to do a binary
    lookup instead of a linear walk? Of course that depends on whether
    the terms are stored in sorted order, are they?
   
   This is very urgent, thanks in advance for all your help.
   
   -John
   
  





Re: URGENT: Help indexing large document set

2004-11-23 Thread John Wang
Thanks Chuck! I missed the getIndexOffset call.
I am profiling it again to pinpoint where the performance problem is.

-John

On Tue, 23 Nov 2004 16:13:22 -0800, Chuck Williams [EMAIL PROTECTED] wrote:
 Are you sure you have a performance problem with
 TermInfosReader.get(Term)?  It looks to me like it scans sequentially
 only within a small buffer window (of size
 SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
 See TermInfosReader.getIndexOffset(Term).
 
 Chuck
 
 
 
   -Original Message-
   From: John Wang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 3:38 PM
   To: [EMAIL PROTECTED]
   Subject: URGENT: Help indexing large document set
  
   Hi:
  
  I am trying to index 1M documents, with batches of 500 documents.
  
  Each document has an unique text key, which is added as a
   Field.KeyWord(name,value).
  
  For each batch of 500, I need to make sure I am not adding a
   document with a key that is already in the current index.
  
 To do this, I am calling IndexSearcher.docFreq for each document
   and delete the document currently in the index with the same key:
  
  while (keyIter.hasNext()) {
   String objectID = (String) keyIter.next();
   term = new Term(key, objectID);
   int count = localSearcher.docFreq(term);
  
   if (count != 0) {
   localReader.delete(term);
   }
 }
  
   Then I proceed with adding the documents.
  
    This turns out to be extremely expensive, I looked into the code and
    I see in TermInfosReader.get(Term term) it is doing a linear look up
    for each term. So as the index grows, the above operation degrades
    at a linear rate. So for each commit, we are doing a docFreq for 500
    documents.
   
    I also tried to create a BooleanQuery composed of 500 TermQueries
    and do 1 search for each batch, and the performance didn't get
    better. And if the batch size increases to say 50,000, creating a
    BooleanQuery composed of 50,000 TermQuery instances may introduce
    huge memory costs.
   
    Is there a better way to do this?
   
    Can TermInfosReader.get(Term term) be optimized to do a binary
    lookup instead of a linear walk? Of course that depends on whether
    the terms are stored in sorted order, are they?
  
   This is very urgent, thanks in advance for all your help.
  
   -John
  
  
 
 

