Re: Unusually long data import time?

2012-02-22 Thread Glen Newton
Import times will depend on:
- hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
- Java configuration (heap size, etc)
- Lucene/Solr configuration (many ...)
- Index configuration - how many fields, indexed how; faceting, etc
- OS configuration (this usually to a lesser degree; _usually_)
- Network issues if non-local
- DB configuration (driver, etc)

If you can give more information about the above, people on this list
should be able to better indicate whether 18 hours sounds right for
your situation.

-Glen Newton

On Wed, Feb 22, 2012 at 10:14 AM, Devon Baumgarten
dbaumgar...@nationalcorp.com wrote:
> Hello,
>
> Would it be unusual for an import of 160 million documents to take 18 hours?
> Each document is less than 1kb and I have the DataImportHandler using the
> jdbc driver to connect to SQL Server 2008. The full-import query calls a
> stored procedure that contains only a select from my target table.
>
> Is there any way I can speed this up? I saw recently someone on this list
> suggested a new user could get all their Solr data imported in under an hour.
> I sure hope that's true!
>
>
> Devon Baumgarten





-- 
-
http://zzzoot.blogspot.com/
-


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Oh sure! As best as I can, anyway.

I have not set the Java heap size, or really configured it at all. 

The server running both the SQL Server and Solr has:
* 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
* 64 GB RAM
* One Solr instance (no shards)

I'm not using faceting.
My schema has these fields:
  <field name="Id" type="string" indexed="true" stored="true" />
  <field name="RecordId" type="int" indexed="true" stored="true" />
  <field name="RecordType" type="string" indexed="true" stored="true" />
  <field name="Name" type="LikeText" indexed="true" stored="true" termVectors="true" />
  <field name="NameFuzzy" type="FuzzyText" indexed="true" stored="true" termVectors="true" />
  <copyField source="Name" dest="NameFuzzy" />
  <field name="NameType" type="string" indexed="true" stored="true" />

Custom types:

*LikeText
PatternReplaceCharFilterFactory (\W+ = )
KeywordTokenizerFactory 
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
EdgeNGramFilterFactory
LengthFilterFactory (min:3, max:512)

*FuzzyText
PatternReplaceCharFilterFactory (\W+ = )
KeywordTokenizerFactory 
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
NGramFilterFactory
LengthFilterFactory (min:3, max:512)

Devon Baumgarten




RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
I changed the heap size (-Xmx1582m was as high as I could go). The import is at 
about 5% now, and from that I now estimate about 13 hours. It's hard to say, 
though; it keeps going up little by little.

If I get approval to use Solr for this project, I'll have them install a 64-bit 
JVM instead, but is there anything else I can do?


Devon Baumgarten
Application Developer




Re: Unusually long data import time?

2012-02-22 Thread Walter Underwood
In my first try with the DIH, I had several sub-entities and it was making six 
queries per document. My 20M doc load was going to take many hours, most of a 
day. I re-wrote it to eliminate those, and now it makes a single query for the 
whole load and takes 70 minutes. These are small documents, just the metadata 
for each book.
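
As a rough illustration of the single-query approach described above, a DIH data-config.xml might look like the sketch below. The table and column names (book, author, book_id and so on), the JDBC URL, and the credentials are hypothetical placeholders, not details from this thread. The point is only that one flat entity with a JOIN replaces nested sub-entities that each issue an extra query per document.

```xml
<dataConfig>
  <!-- Hypothetical connection details; adjust driver, URL and credentials -->
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=catalog"
              user="solr_reader"
              password="changeme"/>
  <document>
    <!-- One flat entity: a single SELECT feeds the whole import,
         instead of nested <entity> elements that run extra queries per row -->
    <entity name="book"
            query="SELECT b.book_id AS id, b.title, a.name AS author
                   FROM book b
                   JOIN author a ON a.author_id = b.author_id">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="author" name="author"/>
    </entity>
  </document>
</dataConfig>
```

This assumes the join yields one row per document; a one-to-many relationship would have to be collapsed on the SQL side first.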

wunder
Search Guy
Chegg








Re: Unusually long data import time?

2012-02-22 Thread Ahmet Arslan
> Would it be unusual for an import of 160 million documents
> to take 18 hours?  Each document is less than 1kb and I
> have the DataImportHandler using the jdbc driver to connect
> to SQL Server 2008. The full-import query calls a stored
> procedure that contains only a select from my target table.
>
> Is there any way I can speed this up? I saw recently someone
> on this list suggested a new user could get all their Solr
> data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?
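
For readers unfamiliar with the settings being asked about, an autoCommit block in solrconfig.xml typically looks like the sketch below. The threshold values are illustrative assumptions, not recommendations from this thread. Very frequent commits during a bulk load can slow indexing, which is presumably why the question is worth asking.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit after this many documents or milliseconds,
       whichever comes first. Values are illustrative only. -->
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <maxTime>600000</maxTime>
  </autoCommit>

  <!-- Soft commit (only in Solr versions that support it): makes new
       documents searchable without a full flush to stable storage. -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>
```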


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Ahmet,

I do not. I commented autoCommit out.

Devon Baumgarten





Re: Unusually long data import time?

2012-02-22 Thread eks dev
Devon, you ought to try updating from many threads (I do not know if
DIH can do it, check it); Lucene does a great job if fed from many
update threads.

It depends on where your time gets lost, but it is usually a) the analysis
chain or b) the database.

If it is a) and your server has spare CPU cores, throughput should scale
roughly with the number of cores.



RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Thank you everyone for your patience and suggestions.

It turns out I was doing something really unreasonable in my schema. I 
mistakenly set the EdgeNGram max gram size to 512 when I meant to set the 
LengthFilter max to 512. I brought it down to a more reasonable number, and my 
estimated import time is now down to 4 hours. Given the size of my record 
set, this is more consistent with Walter's observations in his own project.
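
For anyone hitting the same problem, the corrected field type might look roughly like this in schema.xml. The filter order follows the LikeText chain listed earlier in the thread, but the minGramSize and maxGramSize values, the stopword file name, and the replacement string are assumptions; the thread only states that the n-gram maximum had mistakenly been set to 512 and that 512 was meant for the length filter.

```xml
<fieldType name="LikeText" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Collapse runs of non-word characters (replacement value assumed) -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\W+" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Stopword file name is a placeholder -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keep the gram ceiling small: every extra gram length adds another
         token per value. 512 here was the mistake; 15 is an assumed value. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <!-- This is where 512 was intended: drop tokens outside 3..512 characters -->
    <filter class="solr.LengthFilterFactory" min="3" max="512"/>
  </analyzer>
</fieldType>
```

The same consideration would apply to the NGramFilterFactory in the FuzzyText type, where the number of grams grows even faster with the maximum gram size.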

Thanks again for your help,

Devon Baumgarten
