largish test data set?

2007-09-17 Thread David Welton
Hi,

I'm in the process of evaluating Solr and Sphinx, and have come to
realize that actually having a large data set to run them against
would be handy.  However, I'm pretty new to both systems, so thought
that perhaps asking around might produce something useful.

What *I* mean by largish is something that won't fit into memory - say
5 or 6 gigs, which is probably puny for some and huge for others.

BTW, I would also welcome any input from others who have done the
above comparison, although what we'll be using it for is specific
enough that of course I'll need to do my own testing.

Thanks!
-- 
David N. Welton
http://www.welton.it/davidw/


Re: largish test data set?

2007-09-17 Thread Grant Ingersoll
You might be interested in the Lucene Java contrib/Benchmark task,  
which provides an indexing implementation of a download of Wikipedia  
(available at http://people.apache.org/~gsingers/wikipedia/)


It is pretty trivial to convert the indexing code to send add  
commands to Solr.
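
A minimal sketch of what "sending add commands" could look like, in Python rather than the Java of the contrib/Benchmark code. It builds a Solr XML update message (<add><doc><field name="...">) from plain dicts and POSTs it. The function names, the field names in the usage note, and the URL http://localhost:8983/solr/update (the stock example install) are assumptions for illustration, not something taken from the benchmark code.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> update message from an iterable of dicts.

    Each dict maps a field name to a value; the field names must exist
    in your Solr schema (hypothetical here).
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > for us
    return ET.tostring(add, encoding="unicode")


def post_update(xml_text, url="http://localhost:8983/solr/update"):
    """POST an update message to Solr. The default URL is an assumption."""
    req = urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Usage would be along the lines of post_update(build_add_xml(batch)) for each batch of parsed articles, followed by post_update("<commit/>") so the new documents become searchable. Sending documents in batches rather than one per request usually keeps the HTTP overhead down.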


HTH,
Grant


Re: largish test data set?

2007-09-17 Thread Daniel Alheiros
Hi Yonik.

Do you have any performance statistics about those changes?
Is it possible to upgrade to this new Lucene version using the Solr 1.2
stable version?

Regards,
Daniel


On 17/9/07 17:37, Yonik Seeley wrote:

 If you want to see what performance will be like on the next release,
 you could try upgrading Solr's internal version of lucene to trunk
 (current dev version)... there have been some fantastic improvements
 in indexing speed.
 
 For query speed/throughput, Solr 1.2 or trunk should do fine.
 
 -Yonik
 





Re: largish test data set?

2007-09-17 Thread Yonik Seeley
If you want to see what performance will be like on the next release,
you could try upgrading Solr's internal version of lucene to trunk
(current dev version)... there have been some fantastic improvements
in indexing speed.

For query speed/throughput, Solr 1.2 or trunk should do fine.

-Yonik




Re: largish test data set?

2007-09-17 Thread Karl Wettin


On 17 Sep 2007, at 12:06, David Welton wrote:



I'm in the process of evaluating solr and sphinx, and have come to
realize that actually having a large data set to run them against
would be handy.  However, I'm pretty new to both systems, so thought
that perhaps asking around might produce something useful.

What *I* mean by largish is something that won't fit into memory - say
5 or 6 gigs, which is probably puny for some and huge for others.


IMDB is about 1.2GB of data:

http://www.imdb.com/interfaces#plain

You can extract real queries from the TPB (The Pirate Bay) data
collection; it should contain about 1M queries in the movie category:


http://torrents.thepiratebay.org/3783572/db_dump_and_query_log_from_piratebay.org__summer_of_2006.3783572.TPB.torrent
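
A rough sketch of how extracted queries might be replayed against Solr, in Python. The layout of the dump is not described here, so this assumes the queries have already been pulled out into plain strings, one per query. The function name, the base URL http://localhost:8983/solr/select (the stock example install) and the rows value are all assumptions.

```python
import urllib.parse


def select_urls(queries, base="http://localhost:8983/solr/select", rows=10):
    """Yield one Solr select URL per non-empty query string.

    `queries` is any iterable of strings, e.g. an open text file with
    one query per line (an assumed format, not the dump's real layout).
    """
    for q in queries:
        q = q.strip()
        if not q:
            continue  # skip blank lines
        yield base + "?" + urllib.parse.urlencode({"q": q, "rows": rows})
```

Each URL can then be fetched with urllib.request.urlopen and timed, which gives a crude throughput figure for real-world query strings.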



--
karl