Hi Rob, I fear your questions will be hard to answer without additional information about what kind of simulations you plan to do. int[r][c] basically means you have a matrix of integers? You could, for example, map this to a row-oriented RDD of integer arrays or to a column-oriented RDD of integer arrays. Which option is better will depend heavily on your workload. Also have a look at the algebraic data structures that come with mllib ( https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors ).
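For illustration, a minimal sketch of both mappings in Scala (it assumes the whole matrix fits in driver memory before being distributed; the object name and dimensions are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors

    // Minimal sketch: distributing an int[r][c] matrix as an RDD.
    object MatrixToRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("matrix-rdd"))

        // Stand-in for the int[r][c] data (dimensions are illustrative).
        val matrix: Array[Array[Int]] = Array.fill(100, 20000)(0)

        // Row-oriented: one RDD element per row.
        val rowRdd = sc.parallelize(matrix)

        // Column-oriented: transpose first, one RDD element per column.
        val colRdd = sc.parallelize(matrix.transpose)

        // Or wrap rows in mllib dense vectors (note: Vectors hold
        // doubles, so the ints are widened).
        val vectorRdd = rowRdd.map(row => Vectors.dense(row.map(_.toDouble)))

        println(s"rows: ${rowRdd.count()}, cols: ${colRdd.count()}")
        sc.stop()
      }
    }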
Regards, Jeff

2015-02-25 23:58 GMT+01:00 Rob Sargent <rob.sarg...@utah.edu>:
> I have an application which might benefit from Spark's
> distribution/analysis, but I'm worried about the size and structure of my
> data set. I need to perform several thousand simulations on a rather large
> data set, and I need access to all the generated simulations. The data
> element is largely an int[r][c] where r is 100 to 1000 and c is 20-80K
> (there's more, but that array is the bulk of the problem). I have machines
> and memory capable of doing 6-10 simulations simultaneously in separate
> JVMs. Is this data structure compatible with Spark's RDD notion?
>
> If yes, I will have a slew of how-to-get-started questions, the first of
> which is how to seed the run? My thinking is to use
> org.apache.spark.api.java.FlatMapFunction starting with an EmptyRDD and
> the seed data. Would that be the way to go?
>
> Thanks
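P.S. On the seeding question: a flatMap over an EmptyRDD would have nothing to expand, so one option is to parallelize the seeds themselves and flatMap each seed into its simulations. A hypothetical sketch only (Simulation and runSimulation are placeholders, not a real API):

    import org.apache.spark.{SparkConf, SparkContext}

    object SeedRun {
      // Placeholder result type for one simulation run.
      case class Simulation(seed: Long, result: Array[Array[Int]])

      // Placeholder for the actual simulation logic.
      def runSimulation(seed: Long): Seq[Simulation] =
        Seq(Simulation(seed, Array.fill(100, 20000)(0)))

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("seed-run"))

        // One RDD element per seed; flatMap expands each seed into its runs.
        val seeds = sc.parallelize(1L to 1000L)
        val simulations = seeds.flatMap(runSimulation)

        println(s"generated ${simulations.count()} simulations")
        sc.stop()
      }
    }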