RE: Indexing a large number of DB records

2004-12-16 Thread Garrett Heaver
There were other reasons for my choice of going with a temp index - namely I
was getting terrible write times to my live index because it was stored on a
different server, and while I was writing to the live index people were
trying to search on it and were getting file-not-found exceptions. So rather
than spend hours or days trying to fix that, I took the easiest route:
building a temp index on the server which had the application and merging it
into the live index on the other server. This greatly increased my indexing
speed.
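
In case it's useful, the core of that looks roughly like this (a sketch
only - dotLucene 1.3/1.4-era names, and the paths are just placeholders
for my local temp directory and the live index share):

// Sketch: build each batch locally, then merge it into the live index.
// Namespaces/casing may differ slightly depending on your dotLucene version.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

// 1) Build the batch on the local (fast) disk.
IndexWriter temp = new IndexWriter(@"C:\idx\temp", new StandardAnalyzer(), true);
// ... AddDocument() calls for the current batch go here ...
temp.Optimize();
temp.Close();

// 2) Merge the finished batch into the live index on the share.
//    AddIndexes also optimizes the target, so this is the expensive step,
//    and nothing else should have the live index open for writing meanwhile.
IndexWriter live = new IndexWriter(@"\\ServerA\idx", new StandardAnalyzer(), false);
live.AddIndexes(new Directory[] { FSDirectory.GetDirectory(@"C:\idx\temp", false) });
live.Close();

// 3) Delete C:\idx\temp before starting the next batch.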

Best of luck
Garrett

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 15 December 2004 18:43
To: Lucene Users List
Subject: RE: Indexing a large number of DB records

Note that this really includes some extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  There is no need to call addIndexes or
optimize until the end.  Adding Documents to an index takes a constant
amount of time, regardless of the index size, because new segments are
created as documents are added, and existing segments are only touched
when merges happen.  Again, I'd run your app under a profiler to see
where the time and memory are going.
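
For illustration, the simpler loop is just this (a sketch that reuses the
field names from Homam's snippet quoted below; the path and analyzer are
placeholders):

// Sketch: one IndexWriter for the whole run, one optimize at the very end.
using System.Data.SqlClient;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

private void IndexAll(SqlDataReader rdr)
{
    // BrowseFieldNames is the same string[] field as in the quoted snippet.
    IndexWriter iw = new IndexWriter(@"C:\idx", new StandardAnalyzer(), true);
    while (rdr.Read())
    {
        Document doc = new Document();
        for (int i = 0; i < BrowseFieldNames.Length; i++)
            doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
        iw.AddDocument(doc);   // constant time; new segments are written as docs are added
    }
    iw.Optimize();             // defer the single optimize to the very end
    iw.Close();
}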

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi Homan
 
 I had a similar problem as you in that I was indexing A LOT of data
 
 Essentially how I got round it was to batch the index.
 
 What I was doing was to add 10,000 documents to a temporary index, use
 addIndexes() to merge the temporary index into the live index (which
 also optimizes the live index), then delete the temporary index. On the
 next loop I'd only query rows from the db above the highest id already
 in the live index (taken via maxDoc) and set the max rows of the query
 to 10,000, i.e.
 
 SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
 Index.MaxDoc()} ORDER BY [id_field] ASC
 
 By ensuring that the documents go into the index sequentially your
 problem is solved, and memory usage on mine (dotLucene 1.3) is low.
 
 Regards
 Garrett
 
 -Original Message-
 From: Homam S.A. [mailto:[EMAIL PROTECTED] 
 Sent: 15 December 2004 02:43
 To: Lucene Users List
 Subject: Indexing a large number of DB records
 
 I'm trying to index a large number of records from the
 DB (a few million). Each record will be stored as a
 document with about 30 fields, most of them UnStored,
 representing small strings or numbers. No huge DB text
 fields.
 
 But I'm running out of memory very fast, and the
 indexing is slowing down to a crawl once I hit around
 1500 records. The problem is each document is holding
 references to the string objects returned from
 ToString() on the DB fields, and the IndexWriter is
 holding references to all these document objects in
 memory, so the garbage collector isn't getting a chance
 to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
 private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
   Document doc = new Document();
   for (int i = 0; i < BrowseFieldNames.Length; i++) {
     doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
   }
   iw.AddDocument(doc);
 }
 
 
 
 
   
 
 





RE: Indexing a large number of DB records

2004-12-15 Thread Garrett Heaver
Hi Homan

I had a similar problem as you in that I was indexing A LOT of data

Essentially how I got round it was to batch the index.

What I was doing was to add 10,000 documents to a temporary index, use
addIndexes() to merge the temporary index into the live index (which also
optimizes the live index), then delete the temporary index. On the next loop
I'd only query rows from the db above the highest id already in the live
index (taken via maxDoc) and set the max rows of the query to 10,000, i.e.

SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
Index.MaxDoc()} ORDER BY [id_field] ASC

By ensuring that the documents go into the index sequentially your problem is
solved, and memory usage on mine (dotLucene 1.3) is low.
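
A rough sketch of one loop iteration under those assumptions (the SQL keeps
the placeholders from the query above; lastIndexedId stands in for however
you track the highest id already in the live index, and dotLucene naming may
differ slightly by version):

// Sketch: pull the next 10,000 rows, index them into a local temp index.
// The temp index then gets merged into the live one and deleted, as above.
using System.Data.SqlClient;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

private void IndexNextBatch(SqlConnection conn, int lastIndexedId)
{
    SqlCommand cmd = new SqlCommand(
        "SELECT TOP 10000 [fields] FROM [tables] " +
        "WHERE [id_field] > @lastId ORDER BY [id_field] ASC", conn);
    cmd.Parameters.Add("@lastId", lastIndexedId);

    IndexWriter temp = new IndexWriter(@"C:\idx\temp", new StandardAnalyzer(), true);
    using (SqlDataReader rdr = cmd.ExecuteReader())
    {
        while (rdr.Read())
            IndexRow(rdr, temp);   // same IndexRow shape as in Homam's snippet below
    }
    temp.Close();

    // Next: AddIndexes the temp directory into the live index and delete
    // C:\idx\temp before the following batch.
}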

Regards
Garrett

-Original Message-
From: Homam S.A. [mailto:[EMAIL PROTECTED] 
Sent: 15 December 2004 02:43
To: Lucene Users List
Subject: Indexing a large number of DB records

I'm trying to index a large number of records from the
DB (a few million). Each record will be stored as a
document with about 30 fields, most of them UnStored,
representing small strings or numbers. No huge DB text
fields.

But I'm running out of memory very fast, and the
indexing is slowing down to a crawl once I hit around
1500 records. The problem is each document is holding
references to the string objects returned from
ToString() on the DB fields, and the IndexWriter is
holding references to all these document objects in
memory, so the garbage collector isn't getting a chance
to clean these up.

How do you guys go about indexing a large DB table?
Here's a snippet of my code (this method is called for
each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
  Document doc = new Document();
  for (int i = 0; i < BrowseFieldNames.Length; i++) {
    doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
  }
  iw.AddDocument(doc);
}








C# Ports

2004-12-15 Thread Garrett Heaver
I was just wondering what tools (JLCA?) people are using to port Lucene to
C#, as I'd be well interested in converting things like the Snowball
stemmers, WordNet etc.

 

Thanks

Garrett



maxDoc()

2004-12-09 Thread Garrett Heaver
Can anyone please explain to me why maxDoc returns 0 when Luke shows 239,473
documents?

 

maxDoc returns the correct number until I delete a document. I have called
optimize after the delete, but the problem remains.
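
For reference, this is roughly the sequence (a sketch only - dotLucene-style
names, the path is a placeholder, and the delete by document number is
purely for illustration):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

IndexReader reader = IndexReader.Open(@"C:\idx");
int before = reader.MaxDoc();        // returns the expected count here
reader.Delete(0);                    // delete a single document
reader.Close();

IndexWriter writer = new IndexWriter(@"C:\idx", new StandardAnalyzer(), false);
writer.Optimize();                   // optimize after the delete
writer.Close();

IndexReader reopened = IndexReader.Open(@"C:\idx");
int after = reopened.MaxDoc();       // this is where I'm seeing 0
reopened.Close();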

 

Strange.

 

Any ideas greatly appreciated

Garrett



RE: addIndexes() Size

2004-12-07 Thread Garrett Heaver
OK, I upgraded to 1.4.3 but that didn't solve the issue - I was still ending
up with huge indexes. So I changed approach: instead of handing the
addIndexes method IndexReaders, I gave it the Directory, as the code path
is slightly different - and now the index size is what I would expect it to
be. I haven't had time to check out fully why yet, but from what I can see
the major difference between the two methods is that
addIndexes(IndexReader[]) uses the following:

 if (segmentInfos.size() == 1)  // add existing index, if any
  merger.add(new SegmentReader(segmentInfos.info(0)));

Perhaps this is resulting in an unnecessary ballooning of the index?

I'll leave it for someone with a better understanding of the underlying
index file format to answer...
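
For anyone comparing, the two call shapes look like this in the .NET port
(a sketch only - paths are placeholders, and you would use one form or the
other, not both against the same index):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

IndexWriter live = new IndexWriter(@"\\ServerA\idx", new StandardAnalyzer(), false);

// What I was doing originally: pass IndexReaders (the overload quoted above).
IndexReader tempReader = IndexReader.Open(@"C:\idx\temp");
live.AddIndexes(new IndexReader[] { tempReader });
tempReader.Close();

// What I do now: pass the Directory itself - this is the form that gives me
// the index size I expect.
live.AddIndexes(new Directory[] { FSDirectory.GetDirectory(@"C:\idx\temp", false) });

live.Close();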

Thanks
Garrett

-Original Message-
From: Garrett Heaver [mailto:[EMAIL PROTECTED] 
Sent: 06 December 2004 17:32
To: 'Lucene Users List'
Subject: RE: addIndexes() Size

Cheers for that Erik - believe it or not I'm still back at v1.3 (doh!!!)

Will try 1.4.3 tomorrow

Thanks
Garrett

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: 06 December 2004 17:27
To: Lucene Users List
Subject: Re: addIndexes() Size

There was a bug in 1.4 (and maybe 1.4.1?) that kept some index files
around that were not used.

Are you using Lucene 1.4.3?  If not, try that and see if it helps.

Erik

On Dec 6, 2004, at 12:17 PM, Garrett Heaver wrote:

 No, there are no duplicate copies - I've the correct number when I view
 through Luke and I don't overlap - the temporary index is destroyed
 after it is added to the main index. I'm currently at index version 159
 and it seems that all of my .prx files come in at around 1435 megs
 (ouch).

 Thanks
 Garrett

 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
 Sent: 06 December 2004 17:12
 To: Lucene Users List
 Subject: Re: addIndexes() Size

 If I were you, I would first use Luke to peek at the index.  You may
 find something obvious there, like multiple copies of the same
 Document.
 Does your temp index 'overlap' with the Server A index in terms of
 Documents?  If so, you will end up with multiple copies, as the
 addIndexes method doesn't detect and remove duplicate Documents.

 Otis

 --- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi.



 It's probably really simple to explain this but since I'm not up to
 speed on
 the way Lucene stores the data I'm a little confused.



 I'm building an Index, which resides on Server A, with the Lucene
 Service
 running on Server B. Now not to bore you with the details but because
 of the
 network transfer rate etc I'm running the actual index on
 \\ServerA\idx and building a temp index at
 \\ServerB\idx\temp (obviously because the local FS is much
 faster
 for the service) and then calling addIndexes to import the temp index
 to the
 ServerA index before destroying the ServerB index, holding for a bit
 and
 then checking for new documents.



 All works grand BUT the size of the resultant index on ServerA is
 HUGE in
 comparison to one I'd build from start to finish (i.e. a simple
 addDocument
 Index) - 38gig for 220,000 Unstored Items cannot be right (to give
 you an
 idea of how mad this seems, the backed up version of the database
 from which
 the data is pulled is only 2gigs)



 I've considered it being perhaps the number of Items that had to be
 integrated each time addIndexes was called - right now I'm adding
 around
 10,000 at a time (I had done 1000 at a time but this looked like it
 was
 going to end up even larger still)



 I'm holding off twiddling the minMergeDocs and mergeFactor until I
 can get a
 better understanding of what's going on here.



 Many thanks for any replies

 Garrett











addIndexes() Size

2004-12-06 Thread Garrett Heaver
Hi.

 

It's probably really simple to explain, but since I'm not up to speed on
the way Lucene stores the data I'm a little confused.

 

I'm building an index, which resides on Server A, with the Lucene service
running on Server B. Not to bore you with the details, but because of the
network transfer rate etc. I'm keeping the actual index on \\ServerA\idx
and building a temp index at \\ServerB\idx\temp (obviously because the
local FS is much faster for the service), then calling addIndexes to import
the temp index into the ServerA index, destroying the ServerB temp index,
holding for a bit and then checking for new documents.

 

All works grand BUT the size of the resultant index on ServerA is HUGE in
comparison to one I'd build from start to finish (i.e. a simple addDocument
index) - 38 gigs for 220,000 unstored items cannot be right (to give you an
idea of how mad this seems, the backed-up version of the database from which
the data is pulled is only 2 gigs).

 

I've considered that it might be the number of items that had to be
integrated each time addIndexes was called - right now I'm adding around
10,000 at a time (I had done 1,000 at a time but that looked like it was
going to end up even larger still).

 

I'm holding off twiddling the minMergeDocs and mergeFactor until I can get
a better understanding of what's going on here.
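
For reference, when I do get around to it, those two knobs are just public
fields on the writer in this generation of Lucene (a sketch - the values
are examples rather than recommendations, I've used the Java 1.4 field
names, and the .NET port may expose them with different casing):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

IndexWriter writer = new IndexWriter(@"\\ServerA\idx", new StandardAnalyzer(), false);
writer.mergeFactor = 10;      // how many same-sized segments pile up before they're merged
writer.minMergeDocs = 1000;   // how many docs are buffered in memory before a segment is flushed
writer.Close();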

 

Many thanks for any replies

Garrett

 



RE: addIndexes() Size

2004-12-06 Thread Garrett Heaver
No, there are no duplicate copies - I've the correct number when I view it
through Luke, and I don't overlap - the temporary index is destroyed after
it is added to the main index. I'm currently at index version 159 and it
seems that all of my .prx files come in at around 1435 megs (ouch).

Thanks
Garrett

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 06 December 2004 17:12
To: Lucene Users List
Subject: Re: addIndexes() Size

If I were you, I would first use Luke to peek at the index.  You may
find something obvious there, like multiple copies of the same
Document.
Does your temp index 'overlap' with the Server A index in terms of
Documents?  If so, you will end up with multiple copies, as the addIndexes
method doesn't detect and remove duplicate Documents.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi.
 
  
 
 It's probably really simple to explain this but since I'm not up to
 speed on
 the way Lucene stores the data I'm a little confused.
 
  
 
 I'm building an Index, which resides on Server A, with the Lucene
 Service
 running on Server B. Now not to bore you with the details but because
 of the
 network transfer rate etc I'm running the actual index on
 \\ServerA\idx and building a temp index at
 \\ServerB\idx\temp (obviously because the local FS is much
 faster
 for the service) and then calling addIndexes to import the temp index
 to the
 ServerA index before destroying the ServerB index, holding for a bit
 and
 then checking for new documents.
 
  
 
 All works grand BUT the size of the resultant index on ServerA is
 HUGE in
 comparison to one I'd build from start to finish (i.e. a simple
 addDocument
 Index) - 38gig for 220,000 Unstored Items cannot be right (to give
 you and
 idea of how mad this seems, the backed up version of the database
 from which
 the data is pulled is only 2gigs)
 
  
 
 I've considered it being perhaps the number of Items that had to be
 integrated each time addIndexes was called - right now I'm adding
 around
 10,000 at a time (I had done 1000 at a time but this looked like it
 was
 going to end up even larger still)
 
  
 
 I'm holding off twiddling the minMergeDocs and mergeFactor until I
 can get a
 better understanding of what's going on here.
 
  
 
 Many thanks for any replies
 
 Garrett
 
  
 
 





RE: addIndexes() Size

2004-12-06 Thread Garrett Heaver
Cheers for that Erik - believe it or not I'm still back at v1.3 (doh!!!)

Will try 1.4.3 tomorrow

Thanks
Garrett

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: 06 December 2004 17:27
To: Lucene Users List
Subject: Re: addIndexes() Size

There was a bug in 1.4 (and maybe 1.4.1?) that kept some index files 
around that were not used.

Are you using Lucene 1.4.3?  If not, try that and see if it helps.

Erik

On Dec 6, 2004, at 12:17 PM, Garrett Heaver wrote:

 No, there are no duplicate copies - I've the correct number when I view
 through Luke and I don't overlap - the temporary index is destroyed
 after it is added to the main index. I'm currently at index version 159
 and it seems that all of my .prx files come in at around 1435 megs
 (ouch).

 Thanks
 Garrett

 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
 Sent: 06 December 2004 17:12
 To: Lucene Users List
 Subject: Re: addIndexes() Size

 If I were you, I would first use Luke to peek at the index.  You may
 find something obvious there, like multiple copies of the same
 Document.
 Does your temp index 'overlap' with the Server A index in terms of
 Documents?  If so, you will end up with multiple copies, as the
 addIndexes method doesn't detect and remove duplicate Documents.

 Otis

 --- Garrett Heaver [EMAIL PROTECTED] wrote:

 Hi.



 It's probably really simple to explain this but since I'm not up to
 speed on
 the way Lucene stores the data I'm a little confused.



 I'm building an Index, which resides on Server A, with the Lucene
 Service
 running on Server B. Now not to bore you with the details but because
 of the
 network transfer rate etc I'm running the actual index on
 \\ServerA\idx and building a temp index at
 \\ServerB\idx\temp (obviously because the local FS is much
 faster
 for the service) and then calling addIndexes to import the temp index
 to the
 ServerA index before destroying the ServerB index, holding for a bit
 and
 then checking for new documents.



 All works grand BUT the size of the resultant index on ServerA is
 HUGE in
 comparison to one I'd build from start to finish (i.e. a simple
 addDocument
 Index) - 38gig for 220,000 Unstored Items cannot be right (to give
 you an
 idea of how mad this seems, the backed up version of the database
 from which
 the data is pulled is only 2gigs)



 I've considered it being perhaps the number of Items that had to be
 integrated each time addIndexes was called - right now I'm adding
 around
 10,000 at a time (I had done 1000 at a time but this looked like it
 was
 going to end up even larger still)



 I'm holding off twiddling the minMergeDocs and mergeFactor until I
 can get a
 better understanding of what's going on here.



 Many thanks for any replies

 Garrett





