Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Hello,

Would it be unusual for an import of 160 million documents to take 18 hours?  
Each document is less than 1kb and I have the DataImportHandler using the jdbc 
driver to connect to SQL Server 2008. The full-import query calls a stored 
procedure that contains only a select from my target table.
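(For reference, a DataImportHandler setup like the one described above is usually wired
up in data-config.xml roughly along these lines. This is only a sketch -- the connection
URL, credentials, stored procedure name, and batchSize value are placeholders, not the
actual configuration; batchSize maps to the JDBC fetch size so the driver streams rows
instead of buffering the whole result set.)

  <dataConfig>
    <!-- Placeholder connection details; responseBuffering=adaptive is a Microsoft
         JDBC driver option that avoids holding the entire result set in memory -->
    <dataSource type="JdbcDataSource"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost;databaseName=MyDb;responseBuffering=adaptive"
                user="solr" password="secret"
                batchSize="10000"/>
    <document>
      <!-- Hypothetical stored procedure name standing in for the real one -->
      <entity name="record" query="EXEC dbo.GetAllRecords"/>
    </document>
  </dataConfig>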

Is there any way I can speed this up? I saw recently someone on this list 
suggested a new user could get all their Solr data imported in under an hour. I 
sure hope that's true!


Devon Baumgarten




RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Oh sure! As best as I can, anyway.

I have not set the Java heap size, or really configured it at all. 

The server running both the SQL Server and Solr has:
* 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
* 64 GB RAM
* One Solr instance (no shards)

I'm not using faceting.
My schema has these fields:
  <field name="Id" type="string" indexed="true" stored="true" />
  <field name="RecordId" type="int" indexed="true" stored="true" />
  <field name="RecordType" type="string" indexed="true" stored="true" />
  <field name="Name" type="LikeText" indexed="true" stored="true" termVectors="true" />
  <field name="NameFuzzy" type="FuzzyText" indexed="true" stored="true" termVectors="true" />
  <copyField source="Name" dest="NameFuzzy" />
  <field name="NameType" type="string" indexed="true" stored="true" />

Custom types:

*LikeText
PatternReplaceCharFilterFactory (\W+ = )
KeywordTokenizerFactory 
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
EdgeNGramFilterFactory
LengthFilterFactory (min:3, max:512)

*FuzzyText
PatternReplaceCharFilterFactory (\W+ = )
KeywordTokenizerFactory 
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
NGramFilterFactory
LengthFilterFactory (min:3, max:512)
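(For reference, declared in schema.xml the LikeText chain above would look roughly like
the fieldType below. The pattern, stopword file, and gram sizes shown are assumed
placeholder values, not the actual configuration; FuzzyText would be identical except
that NGramFilterFactory replaces EdgeNGramFilterFactory.)

  <fieldType name="LikeText" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- Gram sizes below are assumptions for illustration -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
      <filter class="solr.LengthFilterFactory" min="3" max="512"/>
    </analyzer>
  </fieldType>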

Devon Baumgarten


-Original Message-
From: Glen Newton [mailto:glen.new...@gmail.com] 
Sent: Wednesday, February 22, 2012 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

Import times will depend on:
- hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
- Java configuration (heap size, etc)
- Lucene/Solr configuration (many ...)
- Index configuration - how many fields, indexed how; faceting, etc
- OS configuration (this usually to a lesser degree; _usually_)
- Network issues if non-local
- DB configuration (driver, etc)

If you can give more information about the above, people on this list
should be able to better indicate whether 18 hours sounds right for
your situation.
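(For the Java and Lucene/Solr configuration points, the first knobs usually worth a look
for bulk indexing are the JVM heap and the index writer's RAM buffer. A rough
solrconfig.xml sketch -- the values are illustrative only, and depending on the Solr
version these settings live under <indexDefaults>/<mainIndex> or <indexConfig>:)

  <indexDefaults>
    <!-- A larger RAM buffer means fewer, larger segment flushes during a bulk import -->
    <ramBufferSizeMB>128</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
  </indexDefaults>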

-Glen Newton

On Wed, Feb 22, 2012 at 10:14 AM, Devon Baumgarten
dbaumgar...@nationalcorp.com wrote:
 Hello,

 Would it be unusual for an import of 160 million documents to take 18 hours?  
 Each document is less than 1kb and I have the DataImportHandler using the 
 jdbc driver to connect to SQL Server 2008. The full-import query calls a 
 stored procedure that contains only a select from my target table.

 Is there any way I can speed this up? I saw recently someone on this list 
 suggested a new user could get all their Solr data imported in under an hour. 
 I sure hope that's true!


 Devon Baumgarten





-- 
-
http://zzzoot.blogspot.com/
-


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
I changed the heap size (-Xmx1582m was as high as I could go). The import is at
about 5% now, and from that I now estimate about 13 hours. It's hard to say,
though; the estimate keeps creeping up little by little.

If I get approval to use Solr for this project, I'll have them install a 64-bit
JVM instead, but is there anything else I can do?


Devon Baumgarten
Application Developer




RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Ahmet,

I do not. I commented autoCommit out.

Devon Baumgarten



-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Wednesday, February 22, 2012 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

 Would it be unusual for an import of 160 million documents
 to take 18 hours?  Each document is less than 1kb and I
 have the DataImportHandler using the jdbc driver to connect
 to SQL Server 2008. The full-import query calls a stored
 procedure that contains only a select from my target table.
 
 Is there any way I can speed this up? I saw recently someone
 on this list suggested a new user could get all their Solr
 data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?
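(That setting lives in the update handler section of solrconfig.xml; a minimal sketch,
with placeholder thresholds:)

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>100000</maxDocs>   <!-- commit every N documents; placeholder value -->
      <maxTime>300000</maxTime>   <!-- or every N milliseconds; placeholder value -->
    </autoCommit>
  </updateHandler>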


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Thank you everyone for your patience and suggestions.

It turns out I was doing something really unreasonable in my schema. I
mistakenly set the EdgeNGram maxGramSize to 512, when I meant to set the
LengthFilter max to 512. I brought it down to a more reasonable number, and my
estimated import time is now down to 4 hours. Based on the size of my record
set, this is more consistent with Walter's observations in his own project.
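(In schema.xml terms the mix-up was roughly the following; the min values and the
corrected maxGramSize are assumptions, and only the two 512s come from the message above:)

  <!-- What was there by mistake: every token expanded into edge n-grams up to 512 characters -->
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="512"/>

  <!-- What was intended: modest edge n-grams, with LengthFilter capping token length at 512 -->
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  <filter class="solr.LengthFilterFactory" min="3" max="512"/>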

Thanks again for your help,

Devon Baumgarten



RE: Solr, SQL Server's LIKE

2012-01-04 Thread Devon Baumgarten
Great suggestion! Thanks for keeping it simple for a complete Solr newbie.

I'm going to go try this right now.

Thanks!
Devon Baumgarten


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Monday, January 02, 2012 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

On 12/29/2011 3:51 PM, Devon Baumgarten wrote:
 N-Grams get me pretty great results in general, but I don't want the results 
 for this particular search to be fuzzy. How can I prevent the fuzzy matches 
 from appearing?

 Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather
 than having a low score.

To achieve this while using the ngram filter, just do the ngram analysis 
on the index side, but not on the query side.  If you do this, you'll 
likely need a maxGramSize larger than would normally be required (which 
will make the index larger), and you might need to use the LengthFilter too.
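(A sketch of what that looks like -- the gram sizes are assumptions; the point is that
the ngram filter appears only in the index-time analyzer, so the query string has to
match as typed:)

  <fieldType name="LikeText" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- No ngram filter here, so "Albert" cannot match grams produced from "Albatross" -->
    </analyzer>
  </fieldType>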

Thanks,
Shawn



RE: Solr, SQL Server's LIKE

2011-12-30 Thread Devon Baumgarten
Hoss,

Thanks. You've answered my question. To clarify, what I should have asked for
instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me that
I didn't need n-grams to use the wildcard. Your asking me to clarify what I
meant made me realize that the n-grams are the source of all my current
problems. :)

Thanks!

Devon Baumgarten


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, December 29, 2011 7:00 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr, SQL Server's LIKE


: Thanks. I know I'll be able to utilize some of Solr's free text 
: searching capabilities in other search types in this project. The 
: product manager wants this particular search to exactly mimic LIKE%.
...
: Ex: If I search "Albatross" I want "Albert" to be excluded completely,
: rather than having a low score.

Please be specific about the types of queries you want, i.e. we need more
than one example of the type of input you want to provide, the type of
matches you want to see for that input, and the type of matches you want
to get back.

In your first message you said you need to match company titles "pretty
exactly", but then seem to contradict yourself by saying that SQL's LIKE
command fits the bill -- even though the SQL LIKE command exists
specifically for inexact matches on field values.

Based on your one example above of "Albatross", you don't need anything
special: don't use ngrams, don't use stemming, don't use fuzzy anything --
just search for Albatross and it will match "Albatross" but not
"Albert". If you want "Albatross" to match "Albatross Road", use some
basic tokenization.

If all you really care about is prefix searching (which seems suggested by
your LIKE% comment above, which I'm guessing is shorthand for something
similar to LIKE 'ABC%'), so that queries like "abc" and "abcd" both
match "abcdef" and "abcd" but neither of them match abcd,
then just use prefix queries (i.e. "abcd*") -- they should be plenty
efficient for your purposes. You only need to worry about ngrams when you
want to efficiently match in the middle of a string (i.e. TITLE LIKE
'%ABC%').


-Hoss


Solr, SQL Server's LIKE

2011-12-29 Thread Devon Baumgarten
I have been tinkering with Solr for a few weeks, and I am convinced that it 
could be very helpful in many of my upcoming projects. I am trying to decide 
whether Solr is appropriate for this one, and I haven't had luck looking for 
answers on Google.

I need to search a list of names of companies and individuals pretty exactly. 
T-SQL's LIKE operator does this with decent performance, but I have a feeling 
there is a way to configure Solr to do this better. I've tried using an edge 
N-gram tokenizer, but it feels like it might be more complicated than 
necessary. What would you suggest?

I know this sounds kind of 'Golden Hammer,' but there has been talk of other, 
more complicated (magic) searches that I don't think SQL Server can handle, 
since its tokens (as far as I know) can't be smaller than one word.

Thanks,

Devon Baumgarten



RE: Solr, SQL Server's LIKE

2011-12-29 Thread Devon Baumgarten
Erick,

Thanks. I know I'll be able to utilize some of Solr's free text searching 
capabilities in other search types in this project. The product manager wants 
this particular search to exactly mimic LIKE%.

N-Grams get me pretty great results in general, but I don't want the results 
for this particular search to be fuzzy. How can I prevent the fuzzy matches 
from appearing?

Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather
than having a low score.

Devon Baumgarten


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, December 29, 2011 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

SQL's LIKE is usually handled with ngrams if you want
*stuff* kinds of searches. Wildcards are interesting
in Solr.

Things Solr handles that aren't easy in SQL:
phrases, phrases with slop, stemming,
synonyms. And, especially, some kind
of relevance ranking.

But Solr does NOT do the things SQL is best at,
things like joins etc. Each has its sweet spot,
and trying to make one do all the functions of the
other is fraught with places to go wrong.

Not a lot of help, but free text searching is what Solr is
all about, so if your problem maps into that space,
it's a great tool!

Best
Erick

On Thu, Dec 29, 2011 at 1:06 PM, Shashi Kant sk...@sloan.mit.edu wrote:
 For a simple, hackish (albeit inefficient) approach, look up wildcard searches,

 e.g. foo*, *bar



 On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten
 dbaumgar...@nationalcorp.com wrote:
 I have been tinkering with Solr for a few weeks, and I am convinced that it 
 could be very helpful in many of my upcoming projects. I am trying to decide 
 whether Solr is appropriate for this one, and I haven't had luck looking for 
 answers on Google.

 I need to search a list of names of companies and individuals pretty 
 exactly. T-SQL's LIKE operator does this with decent performance, but I have 
 a feeling there is a way to configure Solr to do this better. I've tried 
 using an edge N-gram tokenizer, but it feels like it might be more 
 complicated than necessary. What would you suggest?

 I know this sounds kind of 'Golden Hammer,' but there has been talk of 
 other, more complicated (magic) searches that I don't think SQL Server can 
 handle, since its tokens (as far as I know) can't be smaller than one word.

 Thanks,

 Devon Baumgarten



Removing whitespace

2011-12-12 Thread Devon Baumgarten
Hello,

I am having trouble finding how to remove/ignore whitespace when indexing. The 
only answer I have found suggested that it is necessary to write my own 
tokenizer. Is this true? I want to remove whitespace and special characters 
from the phrase and create N-grams from the result.

Ultimately, the effect I am after is that searching "bobdole" would match "Bob
Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way... can
anyone lend some assistance?

Thanks!

Dev B



RE: Removing whitespace

2011-12-12 Thread Devon Baumgarten
Thanks Alireza, Steven and Koji for the quick responses!

I'll read up on those and give it a shot.

Devon Baumgarten

-Original Message-
From: Alireza Salimi [mailto:alireza.sal...@gmail.com] 
Sent: Monday, December 12, 2011 4:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Removing whitespace

That sounds like a strange requirement, but I think you can use CharFilters
instead of implementing your own Tokenizer.
Take a look at this section; maybe it helps:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
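(A minimal sketch of that approach for the "bobdole" example -- the fieldType name,
pattern, and gram sizes below are assumptions: a char filter strips whitespace and
punctuation before the keyword tokenizer, and an ngram filter then allows partial
matches.)

  <fieldType name="squashedName" class="solr.TextField">
    <analyzer>
      <!-- "Bob Dole" becomes "BobDole" here; it is lowercased by the filter below -->
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    </analyzer>
  </fieldType>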




On Mon, Dec 12, 2011 at 4:51 PM, Devon Baumgarten 
dbaumgar...@nationalcorp.com wrote:

 Hello,

 I am having trouble finding how to remove/ignore whitespace when indexing.
 The only answer I have found suggested that it is necessary to write my own
 tokenizer. Is this true? I want to remove whitespace and special characters
 from the phrase and create N-grams from the result.

 Ultimately, the effect I am after is that searching "bobdole" would match
 "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way...
 can anyone lend some assistance?

 Thanks!

 Dev B




-- 
Alireza Salimi
Java EE Developer

