Can Solr with the StatsComponent analyze 20+ million files?

2011-08-08 Thread Fred Smith
Hi,
Currently we are in the process of figuring out how to deal with
millions of CSV files containing weather data (20+ million files). Each
file is about 500 bytes in size.
We want to calculate statistics on fields read from the file. For
example, the standard deviation of wind speed across all 20+ million files.
Processing speed isn't an important issue. The analysis routine can run
for days, if needed.

The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr
appears to be able to calculate the statistics we are interested in.

Will the StatsComponent in Solr do what we need with minimal configuration?
Can the StatsComponent only be used on a subset of the data? For
example, only look at data from certain months?
Are there other free programs out there that can parse and analyze 20+
million files?

We are still very new to Solr and really appreciate all your help.
Thanks,
Fred


Re: Can Solr with the StatsComponent analyze 20+ million files?

2011-08-08 Thread Walter Underwood
This does not seem well matched to Solr. Solr and Lucene are optimized to show 
the best few matches, not every match.

I'd use Hadoop for this. Or MarkLogic, if you'd like to talk about that 
off-list.

wunder
Lead Engineer, MarkLogic




Re: Can Solr with the StatsComponent analyze 20+ million files?

2011-08-08 Thread Markus Jelsma
> Hi,
> Currently we are in the process of figuring out how to deal with
> millions of CSV files containing weather data (20+ million files). Each
> file is about 500 bytes in size.
> We want to calculate statistics on fields read from the file. For
> example, the standard deviation of wind speed across all 20+ million files.
> Processing speed isn't an important issue. The analysis routine can run
> for days, if needed.
>
> The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr
> appears to be able to calculate the statistics we are interested in.
>
> Will the StatsComponent in Solr do what we need with minimal configuration?
> Can the StatsComponent only be used on a subset of the data? For
> example, only look at data from certain months?

If I remember correctly, it cannot.

> Are there other free programs out there that can parse and analyze 20+
> million files?

Yes. If analyzing data like yours is all you do (not search, which is
Solr's strength), then you are most likely much better off not using Solr
and instead writing map/reduce programs for Apache Hadoop, which can
analyze huge amounts of data. Hadoop can be quite difficult to start with,
so you could use the excellent Apache CouchDB database instead, which
supports map/reduce as well.

CouchDB is much easier to begin with. You could transform a sample of your
data to the JSON format, install CouchDB, load the sample, and write a
simple map/reduce function, all within 8 hours. Loading and processing all
the data will take a bit longer.
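
As a rough illustration of the map/reduce idea described above (not tied to
Hadoop or CouchDB), here is a minimal Python sketch. It assumes each CSV
file has a header row with a hypothetical `wind_speed` column; the function
names are made up for this example. The map step turns one file into a
small summary (count, mean, sum of squared deviations), and the reduce step
merges two summaries, so files can be processed in any order or in
parallel.

```python
import csv
import io
import math
from functools import reduce

EMPTY = (0, 0.0, 0.0)  # (count, mean, sum of squared deviations from the mean)


def map_file(text, field="wind_speed"):
    """Map step: summarize one CSV file's values for `field` (hypothetical name)."""
    n, mean, m2 = EMPTY
    for row in csv.DictReader(io.StringIO(text)):
        raw = row.get(field)
        if raw is None or raw.strip() == "":
            continue  # skip rows where the field is missing or blank
        x = float(raw)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford's running update
    return n, mean, m2


def combine(a, b):
    """Reduce step: merge two partial summaries into one."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    if n == 0:
        return EMPTY
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2


def sample_stddev(summary):
    """Sample standard deviation (n - 1 denominator) from a merged summary."""
    n, _, m2 = summary
    return math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")


# Two tiny in-memory "files" stand in for the real CSV files.
files = ["wind_speed\n3.0\n5.0\n", "wind_speed\n7.0\n"]
total = reduce(combine, (map_file(f) for f in files), EMPTY)
```

Because only three numbers are kept per file, memory use stays constant no
matter how many files are merged; `sample_stddev(total)` then gives the
overall figure.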

Cheers


 
> We are still very new to Solr and really appreciate all your help.
> Thanks,
> Fred


Re: Can Solr with the StatsComponent analyze 20+ million files?

2011-08-08 Thread Jonathan Rochkind

On 8/8/2011 5:10 PM, Markus Jelsma wrote:

>> Will the StatsComponent in Solr do what we need with minimal configuration?
>> Can the StatsComponent only be used on a subset of the data? For
>> example, only look at data from certain months?
>
> If I remember correctly, it cannot.


Well, if you index things properly, you could use an fq (filter query) to
restrict the results to certain months, and then use StatsComponent on top.
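
A minimal sketch of what such a request might look like, built in Python.
The base URL and the field names `obs_date` and `wind_speed` are
assumptions for illustration only; they depend entirely on how the data was
indexed. The `fq`, `stats`, and `stats.field` parameters are the standard
Solr ones.

```python
from urllib.parse import urlencode


def stats_url(base_url, stats_field, date_field, start, end):
    """Build a Solr select URL: filter by a date range, then request stats."""
    params = [
        ("q", "*:*"),                               # match all documents
        ("fq", f"{date_field}:[{start} TO {end}]"),  # restrict to the date range
        ("rows", "0"),                              # only the stats are needed
        ("stats", "true"),                          # enable StatsComponent
        ("stats.field", stats_field),               # field to compute stats on
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and field names; adjust to the actual schema.
url = stats_url(
    "http://localhost:8983/solr",
    "wind_speed",
    "obs_date",
    "2011-01-01T00:00:00Z",
    "2011-01-31T23:59:59Z",
)
```

Setting `rows=0` keeps the response small, since only the statistics
section is of interest here and not the matching documents.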


But I'd agree with others that Solr is probably not the best tool for 
this job. Solr's primary area of competency is text indexing and text 
search, not mathematical calculation. If you need a whole lot of text 
indexing and a little bit of math too, you might be able to get 
StatsComponent to do what you need, although you'll probably run into 
some tricky parts because this isn't really Solr's focus.


But if you need a whole bunch of math and no text indexing at all -- use 
a tool that has math rather than text search as its prime area of 
competency/focus. Don't make things hard for yourself by using the wrong 
tool for the job.


(StatsComponent, incidentally, does not perform well on very large 
result sets, depending on what you ask it for.)


Re: Can Solr with the StatsComponent analyze 20+ million files?

2011-08-08 Thread Fred Smith
Thank you Walter, Markus and Jonathan for your fast responses and help!
We will be looking into CouchDB (and Hadoop if necessary) to process our
data.
Thanks again,
Fred