Re: dataset parameters suitable for lucene application

2007-10-02 Thread Chris Harris
Hi There,

Would you mind if I pasted your data onto the wiki page at

http://wiki.apache.org/solr/SolrPerformanceData

I think it would be helpful to get some more numbers on that page, so
people can decide whether Solr is the right application for them.

Thanks,
Chris Harris, new Solr user

On 9/26/07, Xuesong Luo [EMAIL PROTECTED] wrote:
 My experience so far:
 200k number of indexes were created in 90 mins (including db time), index
 size is 200m, a keyword query on all string fields (30) takes 0.3-1 sec,
 and a keyword query on one field takes tens of milliseconds.



 -----Original Message-----
 From: Charlie Jackson [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, September 26, 2007 8:53 AM
 To: solr-user@lucene.apache.org
 Subject: RE: dataset parameters suitable for lucene application

 My experiences so far with this level of data have been good.

 Number of records: Maxed out at 8.8 million
 Database size: friggin huge (100+ GB)
 Index size: ~24 GB

 1) It took me about a day to index 8 million docs using a non-optimized
 program I wrote. It's non-optimized in the sense that it's not
 multi-threaded. It batched together groups of about 5,000 docs at a time
 to be indexed.

 2) Search times for a basic search are almost always sub-second. If we
 toss in some faceting, it takes a little longer, but I've hardly ever
 seen it go above 1-2 seconds even with the most advanced queries.

 Hope that helps.


 Charlie

 

 -----Original Message-----
 From: Law, John [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, September 26, 2007 9:28 AM
 To: solr-user@lucene.apache.org
 Subject: dataset parameters suitable for lucene application

 I am new to the list and new to lucene and solr. I am considering Lucene
 for a potential new application and need to know how well it scales.

 Following are the parameters of the dataset.

 Number of records: 7+ million
 Database size: 13.3 GB
 Index Size:  10.9 GB

 My questions are simply:

 1) Approximately how long would it take Lucene to index these documents?
 2) What would the approximate retrieval time be (i.e. search response
 time)?

 Can someone provide me with some informed guidance in this regard?

 Thanks in advance,
 John

 __
 John Law
 Director, Platform Management
 ProQuest
 789 Eisenhower Parkway
 Ann Arbor, MI 48106
 734-997-4877
 [EMAIL PROTECTED]
 www.proquest.com
 www.csa.com

 ProQuest... Start here.





Re: dataset parameters suitable for lucene application

2007-09-26 Thread Walter Underwood
That seems well within Solr's capabilities, though you should come up
with a desired queries/sec figure.

Solr's query rate varies widely with the configuration -- how many
fields, fuzzy search, highlighting, facets, etc.

Essentially, Solr uses Lucene, a modern search core. It has performance
and scaling comparable to the commercial products I know about, and I was
building enterprise search for nine years. If you need to search over
100M docs or over 1000 queries/second, you may need fancier distributed
search than is available in Solr or commercially.

Solr's big weaknesses are the quality of the stemmers, parsing document
formats (PDF, MS Word), and access control on queries. If you can live
with the stemmers, Solr will probably do the job.

I worked at Infoseek, Inktomi, Verity, and Autonomy, and I'm using
Solr here at Netflix.

wunder





RE: dataset parameters suitable for lucene application

2007-09-26 Thread Charlie Jackson
My experiences so far with this level of data have been good.

Number of records: Maxed out at 8.8 million
Database size: friggin huge (100+ GB)
Index size: ~24 GB

1) It took me about a day to index 8 million docs using a non-optimized
program I wrote. It's non-optimized in the sense that it's not
multi-threaded. It batched together groups of about 5,000 docs at a time
to be indexed.

2) Search times for a basic search are almost always sub-second. If we
toss in some faceting, it takes a little longer, but I've hardly ever
seen it go above 1-2 seconds even with the most advanced queries. 

Hope that helps.


Charlie
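
A rough sketch of the arithmetic behind the indexing figures above, plus a generic way to group documents into batches of about 5,000. It assumes "about a day" means 24 hours, and that John's 7 million records would index at the same average rate; both are assumptions, not measurements. The helper names are hypothetical and are not part of Solr or Lucene.

```python
from itertools import islice


def batches(docs, size=5000):
    """Yield successive lists of at most `size` docs from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def docs_per_second(total_docs, hours):
    """Average indexing rate implied by a doc count and a wall-clock time."""
    return total_docs / (hours * 3600)


# Charlie's figures: about 8 million docs in about a day (assumed 24 hours).
rate = docs_per_second(8_000_000, 24)

# Hypothetical extrapolation: 7 million records at the same average rate.
est_hours = 7_000_000 / rate / 3600

if __name__ == "__main__":
    print(f"average rate: {rate:.1f} docs/sec")
    print(f"estimated time for 7M docs: {est_hours:.1f} hours")
```

The estimate only scales a single-threaded, unoptimized run linearly, so it is better read as a loose upper bound than as a prediction.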








Re: dataset parameters suitable for lucene application

2007-09-26 Thread Chris Harris
By maxed out do you mean that Solr's performance became unacceptable
beyond 8.8M records, or that you only had 8.8M records to index? If
the former, can you share the particular symptoms?







RE: dataset parameters suitable for lucene application

2007-09-26 Thread Charlie Jackson
Sorry, I meant that it maxed out in the sense that my maxDoc field on
the stats page was 8.8 million, which indicates that the most docs it
has ever had was around 8.8 million. It's down to about 7.8 million
currently. I have seen no signs of a maximum number of docs Solr can
handle. 








RE: dataset parameters suitable for lucene application

2007-09-26 Thread Law, John
Thanks all! One last question...

If I had a collection of 2.5 billion docs and a demand averaging 200
queries per second, what's the confidence that Solr/Lucene could handle
this volume and execute search with sub-second response times?








Re: dataset parameters suitable for lucene application

2007-09-26 Thread Walter Underwood
No one can answer that, because it depends on how you configure Solr.
How many fields do you want to search? Are you using fuzzy search?
Facets? Highlighting?

We are searching a much smaller collection, about 250K docs, with
great success. We see 80 queries/sec on each of four servers, and
response times under 100ms. Each query searches against seven
fields and we don't use any of the features I listed above.

wunder
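
A small sketch of the capacity arithmetic implied by the numbers above. It assumes every server holds a full copy of the same index and that throughput adds up linearly across servers; that is an assumption for illustration, and it says nothing about how a 2.5 billion doc collection would behave. The function names are hypothetical.

```python
import math


def aggregate_qps(per_server_qps, servers):
    """Total queries/sec if identical replicas scale linearly."""
    return per_server_qps * servers


def servers_needed(target_qps, per_server_qps):
    """Smallest number of replicas that covers the target rate."""
    return math.ceil(target_qps / per_server_qps)


if __name__ == "__main__":
    # Figures from the message: 80 queries/sec on each of four servers.
    print(aggregate_qps(80, 4))
    # Replicas needed for 200 queries/sec at the same per-server rate.
    print(servers_needed(200, 80))
```

The per-server rate here comes from a 250K doc index searched over seven fields, so it should not be carried over to a larger or differently configured index without measuring.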

 
 
 
 



RE: dataset parameters suitable for lucene application

2007-09-26 Thread Lance Norskog
My limited experience with larger indexes is that the main issues are:
1) the logistics of copying around and backing up this much data, and
2) indexing is disk-bound. We're on SAS disks, and it makes no difference
whether we use one indexing thread or a dozen (we have small records).

Smaller returns are faster. You need to limit the search results via as many
parameters as you can, and filters are the way to do this.
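
As a sketch of what limiting results with filters can look like, the snippet below builds a Solr select URL with a main query (`q`), repeated filter queries (`fq`), and a small `rows` value. The host, port, path, and field names (`type`, `year`, `title`) are assumptions for illustration only; substitute the ones from your own installation and schema.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust to your own Solr installation.
BASE_URL = "http://localhost:8983/solr/select"


def solr_select_url(query, filters=(), rows=10, base_url=BASE_URL):
    """Build a select URL with one q, any number of fq, and a rows limit."""
    params = [("q", query)]
    # Each filter becomes its own fq parameter.
    params.extend(("fq", f) for f in filters)
    params.append(("rows", str(rows)))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Field names below are made up for the example.
    print(solr_select_url("title:lucene", ["type:article", "year:2007"]))
```

Keeping each restriction in its own `fq` lets Solr cache the filters separately from the main query, which is one reason filters are usually preferred over adding more clauses to `q`.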

 
 
 
 



Re: dataset parameters suitable for lucene application

2007-09-26 Thread Mike Klaas

On 26-Sep-07, at 10:50 AM, Law, John wrote:

 Thanks all! One last question...

 If I had a collection of 2.5 billion docs and a demand averaging 200
 queries per second, what's the confidence that Solr/Lucene could handle
 this volume and execute search with sub-second response times?

No search software can search 2.5 billion docs (assuming web-sized
documents) in 5ms on a single server.

You certainly could build such a system with Solr distributed over
100's of nodes, but this is not built into Solr currently.


-Mike
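
The 5ms figure above presumably comes from dividing one second by 200 queries, that is, the time available per query if a single server handled them one at a time. A minimal sketch of that arithmetic, together with the per-node share of the collection at a hypothetical 100 nodes (the message only says "100's"), is below. The function names are made up for this example.

```python
def serial_budget_ms(qps):
    """Milliseconds available per query if queries run strictly one at a time."""
    return 1000.0 / qps


def docs_per_node(total_docs, nodes):
    """Documents each node holds if the collection is split evenly."""
    return total_docs / nodes


if __name__ == "__main__":
    # 200 queries/sec handled serially leaves 5 ms per query.
    print(serial_budget_ms(200))
    # 2.5 billion docs spread evenly over an assumed 100 nodes.
    print(docs_per_node(2_500_000_000, 100))
```

Under that assumed split each node would hold about 25 million docs, which is a few times the index sizes reported earlier in this thread.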