Hi Rob, I fear your questions will be hard to answer without additional information about what kind of simulations you plan to do. int[r][c] basically means you have a matrix of integers? You could, for example, map this to a row-oriented RDD of integer arrays or to a column-oriented RDD of integer arrays. Which option is better will depend heavily on your workload. Also have a look at the algebraic data structures that come with mllib ( https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors ).
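For illustration, a minimal sketch of both mappings in Scala (it assumes the whole matrix fits in driver memory before being distributed; the object name and dimensions are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors

    // Minimal sketch: distributing an int[r][c] matrix as an RDD.
    object MatrixToRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("matrix-rdd"))

        // Stand-in for the int[r][c] data (dimensions are illustrative).
        val matrix: Array[Array[Int]] = Array.fill(100, 20000)(0)

        // Row-oriented: one RDD element per row.
        val rowRdd = sc.parallelize(matrix)

        // Column-oriented: transpose first, one RDD element per column.
        val colRdd = sc.parallelize(matrix.transpose)

        // Or wrap rows in mllib dense vectors (note: Vectors hold
        // doubles, so the ints are widened).
        val vectorRdd = rowRdd.map(row => Vectors.dense(row.map(_.toDouble)))

        println(s"rows: ${rowRdd.count()}, cols: ${colRdd.count()}")
        sc.stop()
      }
    }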
Regards, Jeff

2015-02-25 23:58 GMT+01:00 Rob Sargent <rob.sarg...@utah.edu>:
> I have an application which might benefit from Spark's
> distribution/analysis, but I'm worried about the size and structure of my
> data set. I need to perform several thousand simulations on a rather large
> data set, and I need access to all the generated simulations. The data
> element is largely an int[r][c] where r is 100 to 1000 and c is 20-80K
> (there's more, but that array is the bulk of the problem). I have machines
> and memory capable of doing 6-10 simulations simultaneously in separate
> JVMs. Is this data structure compatible with Spark's RDD notion?
>
> If yes, I will have a slew of how-to-get-started questions, the first of
> which is how to seed the run? My thinking is to use
> org.apache.spark.api.java.FlatMapFunction starting with an EmptyRDD and
> the seed data. Would that be the way to go?
>
> Thanks
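P.S. On the seeding question: a flatMap over an EmptyRDD would have nothing to expand, so one option is to parallelize the seeds themselves and flatMap each seed into its simulations. A hypothetical sketch only (Simulation and runSimulation are placeholders, not a real API):

    import org.apache.spark.{SparkConf, SparkContext}

    object SeedRun {
      // Placeholder result type for one simulation run.
      case class Simulation(seed: Long, result: Array[Array[Int]])

      // Placeholder for the actual simulation logic.
      def runSimulation(seed: Long): Seq[Simulation] =
        Seq(Simulation(seed, Array.fill(100, 20000)(0)))

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("seed-run"))

        // One RDD element per seed; flatMap expands each seed into its runs.
        val seeds = sc.parallelize(1L to 1000L)
        val simulations = seeds.flatMap(runSimulation)

        println(s"generated ${simulations.count()} simulations")
        sc.stop()
      }
    }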