Re: [julia-users] Re: A Big Data stress test

2014-05-01 Thread Viral Shah
This would certainly be useful - to have prepackaged large datasets for people to work with. The question is what kind of operations one would want to do on such a dataset. If you could provide a set of well-defined benchmarks (simple kernel codes that developers can work with), this could

[julia-users] Re: A Big Data stress test

2014-04-30 Thread Douglas Bates
On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote: It is sometimes difficult to obtain realistic Big data sets. A Revolution Analytics blog post yesterday http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html mentioned the

Re: [julia-users] Re: A Big Data stress test

2014-04-30 Thread Stefan Karpinski
Is 22GB too much? It seems like just uncompressing this and storing it naturally would be fine on a large machine. How big are the categorical integers? Would storing an index to an integer really help? It seems like it would only help if the integers are larger than the indices. On Wed, Apr 30,

Re: [julia-users] Re: A Big Data stress test

2014-04-30 Thread Douglas Bates
On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote: Is 22GB too much? It seems like just uncompressing this and storing it naturally would be fine on a large machine. How big are the categorical integers? Would storing an index to an integer really help? It seems like it

Re: [julia-users] Re: A Big Data stress test

2014-04-30 Thread Stefan Karpinski
Ah, ok, yes – if there aren't very many distinct values, it could definitely help. With strings it's always nice to convert from variable-length strings to fixed-size indices. On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates dmba...@gmail.com wrote: On Wednesday, April 30, 2014 1:20:26 PM UTC-5,
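
A minimal sketch of the idea discussed above, replacing variable-length strings with fixed-size indices into a pool of distinct values. This is only an illustration in Python, not code from the thread or from any Julia package; the names `pool_encode` and `pool_decode` are hypothetical. As noted above, it only pays off when there are few distinct values, or when the values are larger than the indices.

```python
from array import array


def pool_encode(values):
    """Return (pool, codes): pool holds each distinct value once,
    codes holds one fixed-size unsigned integer per input value."""
    pool = []             # distinct values, in order of first appearance
    index_of = {}         # value -> its position in pool
    codes = array("I")    # fixed-size unsigned ints, one per row
    for v in values:
        code = index_of.get(v)
        if code is None:  # first time this value is seen
            code = len(pool)
            index_of[v] = code
            pool.append(v)
        codes.append(code)
    return pool, codes


def pool_decode(pool, codes):
    """Rebuild the original column by looking each code up in the pool."""
    return [pool[c] for c in codes]


# Hypothetical example column with many repeats and few distinct values.
column = ["apple", "banana", "apple", "cherry", "banana"]
pool, codes = pool_encode(column)
assert pool_decode(pool, codes) == column
```

Each row then costs `codes.itemsize` bytes regardless of how long the original string was, and the strings themselves are stored once in `pool`.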

Re: [julia-users] Re: A Big Data stress test

2014-04-30 Thread Cameron McBride
If there is some desire for big data tests, there are a fair number of public astronomical datasets that wouldn't be too hard to package up. The catalog-level versions aren't too different from the type of dataset mentioned by Doug. There are a number of fairly simple analyses that could be done on