RE: Indexing a large number of DB records

2004-12-16 Thread Garrett Heaver
There were other reasons for my choice of going with a temp index, namely that
I was getting terrible write times to my live index because it was stored on a
different server. Also, while I was writing to the live index, people were
trying to search on it and were getting "file not found" exceptions, so rather
than spend hours or days trying to fix that, I took the easiest route: I
created a temp index on the server hosting the application and merged it into
the live index on the other server. This greatly increased my indexing speed.
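For anyone in the same situation, a minimal sketch of that merge step (C#, against a
dotLucene 1.3/1.4-style API; the paths are invented and the exact casing of the port's
method names is an assumption, so check your version's docs):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

class MergeTempIntoLive
{
    static void Main()
    {
        // The temp index lives on the application server; the live index sits on a
        // share exposed by the search server (both paths are invented here).
        Directory tempDir = FSDirectory.GetDirectory(@"C:\indexes\temp", false);
        Directory liveDir = FSDirectory.GetDirectory(@"\\searchsrv\indexes\live", false);

        // Open the live index for writing (create = false keeps the existing documents).
        IndexWriter live = new IndexWriter(liveDir, new StandardAnalyzer(), false);

        // AddIndexes merges the temp segments into the live index and optimizes it.
        live.AddIndexes(new Directory[] { tempDir });
        live.Close();

        // The temp index has been copied into the live one, so it can be dropped.
        System.IO.Directory.Delete(@"C:\indexes\temp", true);
    }
}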

Best of luck
Garrett

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 15 December 2004 18:43
To: Lucene Users List
Subject: RE: Indexing a large number of DB records

Note that this really includes some extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  No need to call addIndexes nor optimize
until the end.  Adding Documents to an index takes a constant amount of
time, regardless of the index size, because new segments are created as
documents are added, and existing segments don't need to be updated
(only when merges happen).  Again, I'd run your app under a profiler to
see where the time and memory are going.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi Homam
 
 I had a similar problem to yours in that I was indexing A LOT of data.
 
 Essentially how I got round it was to batch the indexing.
 
 What I was doing was to add 10,000 documents to a temporary index, use
 addIndexes() to merge the temporary index into the live index (which also
 optimizes the live index), then delete the temporary index. On the next loop
 I'd only query rows from the db with an id above the id stored in the last
 document (maxDoc) of the live index, and set the max rows of the query to
 10,000, i.e.
 
 SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
 Index.MaxDoc()} ORDER BY [id_field] ASC
 
 Ensuring that the documents go into the index sequentially, your problem is
 solved, and memory usage on mine (dotLucene 1.3) is low.
 
 Regards
 Garrett
 
 -Original Message-
 From: Homam S.A. [mailto:[EMAIL PROTECTED] 
 Sent: 15 December 2004 02:43
 To: Lucene Users List
 Subject: Indexing a large number of DB records
 
 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is that each document is holding
 references to the string objects returned from
 ToString() on the DB fields, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector isn't getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
     Document doc = new Document();
     for (int i = 0; i < BrowseFieldNames.Length; i++) {
         doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
     }
     iw.AddDocument(doc);
 }
 
 
 
 
   



Re: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Hello Homam,

The batches I was referring to were batches of DB rows.
Instead of SELECT * FROM table ..., do SELECT * FROM table ... LIMIT Y OFFSET X
(or your database's equivalent paging syntax).

Don't close IndexWriter - use the single instance.

There is no MakeStable()-like method in Lucene, but you can control the
number of in-memory Documents, the frequency of segment merges, and the
maximal size of index segments with three IndexWriter parameters,
described fairly verbosely in the javadocs.
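For reference, in the Java 1.4 API those three knobs are the public IndexWriter fields
minMergeDocs, mergeFactor and maxMergeDocs. A hedged C# sketch, assuming the dotLucene
port exposes them under the same names (verify against your port's source):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

class WriterTuning
{
    static IndexWriter OpenBulkWriter(string path)
    {
        // One writer for the whole run, tuned for bulk indexing.
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);

        // Public fields in the Java 1.4 API; names/casing in the port are assumed.
        writer.minMergeDocs = 1000;    // Documents buffered in RAM before a segment is written
        writer.mergeFactor = 20;       // how often segments are merged (higher = faster indexing, more files)
        writer.maxMergeDocs = 100000;  // upper bound on the number of docs merged into one segment

        return writer;
    }
}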

Since you are using the .NET version, you should really consult the
dotLucene guy(s).  Running under a profiler should also tell you
where the time and memory go.

Otis

--- Homam S.A. [EMAIL PROTECTED] wrote:

 Thanks Otis!
 
 What do you mean by building it in batches? Does it
 mean I should close the IndexWriter every 1000 rows
 and reopen it? Does that release references to the
 document objects so that they can be
 garbage-collected?
 
 I'm calling optimize() only at the end.
 
 I agree that 1500 documents is very small. I'm
 building the index on a PC with 512 megs, and the
 indexing process is quickly gobbling up around 400
 megs when I index around 1800 documents, and the whole
 machine is grinding to a virtual halt. I'm using the
 latest DotLucene .NET port, so maybe there's a memory
 leak in it.
 
 I have experience with AltaVista search (acquired by
 FastSearch), and I used to call MakeStable() every
 20,000 documents to flush memory structures to disk.
 There doesn't seem to be an equivalent in Lucene.
 
 -- Homam
 
 
 
 
 
 
 --- Otis Gospodnetic [EMAIL PROTECTED]
 wrote:
 
  Hello,
  
  There are a few things you can do:
  
  1) Don't just pull all rows from the DB at once.  Do that in batches.
  
  2) If you can get a Reader from your SqlDataReader, consider this:
  http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
  
  3) Give the JVM more memory to play with by using -Xms and -Xmx JVM
  parameters
  
  4) See IndexWriter's minMergeDocs parameter.
  
  5) Are you calling optimize() at some point by any chance?  Leave that
  call for the end.
  
  1500 documents with 30 columns of short String/number values is not a
  lot.  You may be doing something else not Lucene related that's slowing
  things down.
  
  Otis
  
  
  --- Homam S.A. [EMAIL PROTECTED] wrote:
  
   I'm trying to index a large number of records from the
   DB (a few millions). Each record will be stored as a
   document with about 30 fields, most of them are
   UnStored and represent small strings or numbers. No
   huge DB Text fields.
   
   But I'm running out of memory very fast, and the
   indexing is slowing down to a crawl once I hit around
   1500 records. The problem is that each document is holding
   references to the string objects returned from
   ToString() on the DB fields, and the IndexWriter is
   holding references to all these document objects in
   memory, so the garbage collector isn't getting a chance
   to clean these up.
   
   How do you guys go about indexing a large DB table?
   Here's a snippet of my code (this method is called for
   each record in the DB):
   
   private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
       Document doc = new Document();
       for (int i = 0; i < BrowseFieldNames.Length; i++) {
           doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
       }
       iw.AddDocument(doc);
   }
   
   
   
   
 



RE: Indexing a large number of DB records

2004-12-15 Thread Garrett Heaver
Hi Homam

I had a similar problem to yours in that I was indexing A LOT of data.

Essentially how I got round it was to batch the indexing.

What I was doing was to add 10,000 documents to a temporary index, use
addIndexes() to merge the temporary index into the live index (which also
optimizes the live index), then delete the temporary index. On the next loop
I'd only query rows from the db with an id above the id stored in the last
document (maxDoc) of the live index, and set the max rows of the query to
10,000, i.e.

SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
Index.MaxDoc()} ORDER BY [id_field] ASC

Ensuring that the documents go into the index sequentially, your problem is
solved, and memory usage on mine (dotLucene 1.3) is low.
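A rough sketch of that loop (C#; the connection string, table and column names are
invented for illustration, and the dotLucene method casing is an assumption). Each pass
builds a throwaway temp index, merges it, and deletes it, so per-batch memory stays
bounded:

using System.Data;
using System.Data.SqlClient;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

class BatchIndexer
{
    const string TempPath = @"C:\indexes\temp";
    const string LivePath = @"C:\indexes\live";  // assumed to exist already

    static void Main()
    {
        long lastId = 0;
        while (true)
        {
            // Build a fresh temporary index holding at most 10,000 documents.
            IndexWriter temp = new IndexWriter(TempPath, new StandardAnalyzer(), true);
            int added = 0;

            using (SqlConnection conn = new SqlConnection("...connection string..."))
            {
                conn.Open();
                SqlCommand cmd = new SqlCommand(
                    "SELECT TOP 10000 id_field, title, body FROM my_table " +
                    "WHERE id_field > @lastId ORDER BY id_field ASC", conn);
                cmd.Parameters.Add("@lastId", SqlDbType.BigInt).Value = lastId;

                using (SqlDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        Document doc = new Document();
                        doc.Add(Field.Keyword("id_field", rdr.GetInt64(0).ToString()));
                        doc.Add(Field.UnStored("title", rdr.GetString(1)));
                        doc.Add(Field.UnStored("body", rdr.GetString(2)));
                        temp.AddDocument(doc);
                        lastId = rdr.GetInt64(0);
                        added++;
                    }
                }
            }
            temp.Close();

            if (added == 0) break;  // no rows left

            // Merge the batch into the live index (AddIndexes also optimizes it),
            // then delete the temporary index before the next loop.
            IndexWriter live = new IndexWriter(LivePath, new StandardAnalyzer(), false);
            live.AddIndexes(new Directory[] { FSDirectory.GetDirectory(TempPath, false) });
            live.Close();
            System.IO.Directory.Delete(TempPath, true);
        }
    }
}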

Regards
Garrett

-Original Message-
From: Homam S.A. [mailto:[EMAIL PROTECTED] 
Sent: 15 December 2004 02:43
To: Lucene Users List
Subject: Indexing a large number of DB records

I'm trying to index a large number of records from the
DB (a few millions). Each record will be stored as a
document with about 30 fields, most of them are
UnStored and represent small strings or numbers. No
huge DB Text fields.

But I'm running out of memory very fast, and the
indexing is slowing down to a crawl once I hit around
1500 records. The problem is that each document is holding
references to the string objects returned from
ToString() on the DB fields, and the IndexWriter is
holding references to all these document objects in
memory, so the garbage collector isn't getting a chance
to clean these up.

How do you guys go about indexing a large DB table?
Here's a snippet of my code (this method is called for
each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}








RE: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Note that this really includes some extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  No need to call addIndexes nor optimize
until the end.  Adding Documents to an index takes a constant amount of
time, regardless of the index size, because new segments are created as
documents are added, and existing segments don't need to be updated
(only when merges happen).  Again, I'd run your app under a profiler to
see where the time and memory are going.
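A minimal sketch of that single-writer pattern (C#; the SQL, connection string and
field names are placeholders, and the dotLucene method casing is an assumption):

using System.Data.SqlClient;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

class SingleWriterIndexer
{
    static void Main()
    {
        // One IndexWriter instance for the whole run (create = true starts a new index).
        IndexWriter iw = new IndexWriter(@"C:\indexes\live", new StandardAnalyzer(), true);

        using (SqlConnection conn = new SqlConnection("...connection string..."))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand("SELECT title, body FROM my_table", conn);
            using (SqlDataReader rdr = cmd.ExecuteReader())
            {
                while (rdr.Read())
                {
                    Document doc = new Document();
                    doc.Add(Field.UnStored("title", rdr.GetValue(0).ToString()));
                    doc.Add(Field.UnStored("body", rdr.GetValue(1).ToString()));
                    iw.AddDocument(doc);  // new segments are written as needed; old ones aren't rewritten
                }
            }
        }

        // No AddIndexes and no intermediate Optimize(); call it once at the very end.
        iw.Optimize();
        iw.Close();
    }
}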

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi Homam
 
 I had a similar problem to yours in that I was indexing A LOT of data.
 
 Essentially how I got round it was to batch the indexing.
 
 What I was doing was to add 10,000 documents to a temporary index, use
 addIndexes() to merge the temporary index into the live index (which also
 optimizes the live index), then delete the temporary index. On the next loop
 I'd only query rows from the db with an id above the id stored in the last
 document (maxDoc) of the live index, and set the max rows of the query to
 10,000, i.e.
 
 SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
 Index.MaxDoc()} ORDER BY [id_field] ASC
 
 Ensuring that the documents go into the index sequentially, your problem is
 solved, and memory usage on mine (dotLucene 1.3) is low.
 
 Regards
 Garrett
 
 -Original Message-
 From: Homam S.A. [mailto:[EMAIL PROTECTED] 
 Sent: 15 December 2004 02:43
 To: Lucene Users List
 Subject: Indexing a large number of DB records
 
 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is that each document is holding
 references to the string objects returned from
 ToString() on the DB fields, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector isn't getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
     Document doc = new Document();
     for (int i = 0; i < BrowseFieldNames.Length; i++) {
         doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
     }
     iw.AddDocument(doc);
 }
 
 
 
 
   



Indexing a large number of DB records

2004-12-14 Thread Homam S.A.
I'm trying to index a large number of records from the
DB (a few millions). Each record will be stored as a
document with about 30 fields, most of them are
UnStored and represent small strings or numbers. No
huge DB Text fields.

But I'm running out of memory very fast, and the
indexing is slowing down to a crawl once I hit around
1500 records. The problem is that each document is holding
references to the string objects returned from
ToString() on the DB fields, and the IndexWriter is
holding references to all these document objects in
memory, so the garbage collector isn't getting a chance
to clean these up.

How do you guys go about indexing a large DB table?
Here's a snippet of my code (this method is called for
each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}








Re: Indexing a large number of DB records

2004-12-14 Thread Otis Gospodnetic
Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once.  Do that in batches.

2) If you can get a Reader from your SqlDataReader, consider this (see the
sketch after this list):
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)

3) Give the JVM more memory to play with by using -Xms and -Xmx JVM
parameters

4) See IndexWriter's minMergeDocs parameter.

5) Are you calling optimize() at some point by any chance?  Leave that
call for the end.
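
To make (2) concrete, a hedged C# fragment: the Field.Text(name, TextReader) overload
is assumed to mirror the Java Field.Text(String, Reader) linked above, and the file
path is purely illustrative; getting a true streaming reader out of a SqlDataReader
needs CommandBehavior.SequentialAccess and GetChars():

using System.IO;
using Lucene.Net.Documents;
using Lucene.Net.Index;

class ReaderFieldExample
{
    // Tokenize a large text value from a stream instead of one big in-memory string.
    static void AddLargeTextDoc(IndexWriter writer, string pathToText)
    {
        Document doc = new Document();
        doc.Add(Field.Text("body", new StreamReader(pathToText)));  // indexed, tokenized, not stored
        writer.AddDocument(doc);
    }
}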

1500 documents with 30 columns of short String/number values is not a
lot.  You may be doing something else not Lucene related that's slowing
things down.

Otis


--- Homam S.A. [EMAIL PROTECTED] wrote:

 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is that each document is holding
 references to the string objects returned from
 ToString() on the DB fields, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector isn't getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
     Document doc = new Document();
     for (int i = 0; i < BrowseFieldNames.Length; i++) {
         doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
     }
     iw.AddDocument(doc);
 }
 
 
 
 
   


Re: Indexing a large number of DB records

2004-12-14 Thread Homam S.A.
Thanks Otis!

What do you mean by building it in batches? Does it
mean I should close the IndexWriter every 1000 rows
and reopen it? Does that release references to the
document objects so that they can be
garbage-collected?

I'm calling optimize() only at the end.

I agree that 1500 documents is very small. I'm
building the index on a PC with 512 megs, and the
indexing process is quickly gobbling up around 400
megs when I index around 1800 documents, and the whole
machine is grinding to a virtual halt. I'm using the
latest DotLucene .NET port, so maybe there's a memory
leak in it.

I have experience with AltaVista search (acquired by
FastSearch), and I used to call MakeStable() every
20,000 documents to flush memory structures to disk.
There doesn't seem to be an equivalent in Lucene.

-- Homam






--- Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Hello,
 
 There are a few things you can do:
 
 1) Don't just pull all rows from the DB at once.  Do that in batches.
 
 2) If you can get a Reader from your SqlDataReader, consider this:
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
 
 3) Give the JVM more memory to play with by using -Xms and -Xmx JVM
 parameters
 
 4) See IndexWriter's minMergeDocs parameter.
 
 5) Are you calling optimize() at some point by any chance?  Leave that
 call for the end.
 
 1500 documents with 30 columns of short String/number values is not a
 lot.  You may be doing something else not Lucene related that's slowing
 things down.
 
 Otis
 
 
 --- Homam S.A. [EMAIL PROTECTED] wrote:
 
  I'm trying to index a large number of records from the
  DB (a few millions). Each record will be stored as a
  document with about 30 fields, most of them are
  UnStored and represent small strings or numbers. No
  huge DB Text fields.
  
  But I'm running out of memory very fast, and the
  indexing is slowing down to a crawl once I hit around
  1500 records. The problem is that each document is holding
  references to the string objects returned from
  ToString() on the DB fields, and the IndexWriter is
  holding references to all these document objects in
  memory, so the garbage collector isn't getting a chance
  to clean these up.
  
  How do you guys go about indexing a large DB table?
  Here's a snippet of my code (this method is called for
  each record in the DB):
  
  private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
      Document doc = new Document();
      for (int i = 0; i < BrowseFieldNames.Length; i++) {
          doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
      }
      iw.AddDocument(doc);
  }
  
  
  
  
  