This would certainly be useful - to have prepackaged large datasets for
people to work with. The question is what kinds of operations one would
want to perform on such a dataset. If you could provide a set of
well-defined benchmarks (simple kernel codes that developers can work
with), this could
On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
It is sometimes difficult to obtain realistic Big data sets. A
Revolution Analytics blog post yesterday
http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
mentioned the
Is 22GB too much? It seems like just uncompressing this and storing it
naturally would be fine on a large machine. How big are the categorical
integers? Would storing an index to an integer really help? It seems like
it would only help if the integers are larger than the indices.
On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
Ah, ok, yes - if there aren't very many distinct values, it could
definitely help. With strings it's always nice to convert from
variable-length values to fixed-size indices.
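The pooling idea described above can be sketched in a few lines. This is a minimal illustration (the `pool` helper is hypothetical, not from the thread): each distinct string is assigned a small integer code once, and the data is then stored as fixed-size codes plus a table of levels, which pays off whenever the number of distinct values is small relative to the number of observations.

```python
# Minimal sketch of pooling variable-length strings into fixed-size
# integer codes, as discussed above. `pool` is a hypothetical helper.

def pool(values):
    """Map each distinct value to a small integer code.

    Returns (codes, levels): codes[i] is the index of values[i]
    in levels, and levels lists distinct values in first-seen order.
    """
    levels = []   # distinct values, in order of first appearance
    index = {}    # value -> code
    codes = []    # one small integer per observation
    for v in values:
        if v not in index:
            index[v] = len(levels)
            levels.append(v)
        codes.append(index[v])
    return codes, levels

codes, levels = pool(["cat", "dog", "cat", "bird", "dog"])
# codes  -> [0, 1, 0, 2, 1]
# levels -> ["cat", "dog", "bird"]
```

The original values are recoverable as `levels[c]` for each code `c`; with millions of rows and a handful of levels, the codes can be stored in a byte or two each instead of a variable-length string.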
On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates dmba...@gmail.com wrote:
If there is some desire for big data tests, there are a fair number of
public astronomical datasets that wouldn't be too hard to package up.
The catalog-level versions aren't too different from the type of dataset
mentioned by Doug. There are a number of fairly simple analyses that could
be done on