Hi, I got a good answer to the main question elsewhere. Would anyone
please confirm that the first code below is the right approach? For an
MVCE I am trying to adapt this example, and it seems like I am having
Java issues with types (but this is basically the right approach?):
    int count = spark.parallelize(makeRange(1, NUM_SAMPLES)).filter(
        new Function<Integer, Boolean>() {
            public Boolean call(Integer i) {
                double x = Math.random();
                double y = Math.random();
                return x*x + y*y < 1;
            }
        }).count();
    System.out.println("Pi is roughly " + 4 * count / NUM_SAMPLES);
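On the type issues, a minimal sketch rather than a definitive fix: JavaRDD.count() returns a long, so assigning it to an int does not compile, and (assuming NUM_SAMPLES is an int) 4 * count / NUM_SAMPLES is integer division, which truncates the estimate. The plain-Java stand-in below uses IntStream instead of Spark so that it runs without a cluster; IntStream.count() also returns long. The class name PiTypes and the helpers estimate() and countInside() are hypothetical names for illustration.

```java
import java.util.Random;
import java.util.stream.IntStream;

public class PiTypes {
    // Hypothetical helper: the 4.0 literal forces floating-point division.
    static double estimate(long count, int numSamples) {
        return 4.0 * count / numSamples;
    }

    // Stand-in for the Spark filter/count pair (not Spark code):
    // count() returns long here, as it does on a JavaRDD.
    static long countInside(int numSamples, long seed) {
        Random rng = new Random(seed);
        return IntStream.range(0, numSamples)
                .filter(i -> {
                    double x = rng.nextDouble();
                    double y = rng.nextDouble();
                    return x * x + y * y < 1;
                })
                .count();
    }

    public static void main(String[] args) {
        int numSamples = 100000;
        long count = countInside(numSamples, 42L); // long, not int
        System.out.println("Pi is roughly " + estimate(count, numSamples));
    }
}
```

In the Spark version the same two changes would apply: declare count as long, and write 4.0 * count / NUM_SAMPLES.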
And this is definitely the wrong approach? Will the loop inside the
function execute entirely on one partition? I want to be sure I
understood the other answer correctly. Thanks!
    JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne).flatMap(
        new FlatMapFunction<PipeLinkageData, DropResult>() {
            public Iterable<DropResult> call(PipeLinkageData pld) {
                List<DropResult> returnRDD = new ArrayList();
                // is Spark good at spreading a for loop like this?
                for (int i = 0; i < howMany; i++) {
                    returnRDD.add(pld.doDrop());
                }
                return returnRDD;
            }
        });
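To make the difference between the two shapes concrete, here is a minimal plain-Java sketch, not Spark code. parallelize() splits its input collection into partitions, so a list of one gives Spark only one element to hand out, and the loop inside call() then runs in whichever single task receives that element. An N-element range gives it N elements to spread. The sketch imitates that splitting with a hypothetical slice() helper; DropShapes, slice(), busySlices(), this version of makeRange(), and the string standing in for a PipeLinkageData are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class DropShapes {
    // Hypothetical stand-in for dividing a collection into numSlices partitions.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        for (int s = 0; s < numSlices; s++) {
            int from = (int) ((long) s * data.size() / numSlices);
            int to = (int) ((long) (s + 1) * data.size() / numSlices);
            slices.add(new ArrayList<>(data.subList(from, to)));
        }
        return slices;
    }

    // Counts the slices that hold at least one element, i.e. tasks with real work.
    static <T> long busySlices(List<T> data, int numSlices) {
        return slice(data, numSlices).stream().filter(s -> !s.isEmpty()).count();
    }

    // Hypothetical version of makeRange: the integers start..endInclusive.
    static List<Integer> makeRange(int start, int endInclusive) {
        List<Integer> range = new ArrayList<>();
        for (int i = start; i <= endInclusive; i++) {
            range.add(i);
        }
        return range;
    }

    public static void main(String[] args) {
        List<String> listOfOne = new ArrayList<>();
        listOfOne.add("pld"); // stands in for the single PipeLinkageData

        // One element: only one slice has anything to do, however many slices exist.
        System.out.println(busySlices(listOfOne, 8));           // prints 1
        // N elements: every slice gets a share of the work.
        System.out.println(busySlices(makeRange(1, 1000), 8));  // prints 8
    }
}
```

Under these assumptions, the range-plus-map() shape is the one that lets the N calls to doDrop() be distributed, and since each index produces exactly one DropResult, no flattening is needed.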
On 11/27/2015 4:18 PM, Jim wrote:
Hello there,
(Part of my problem is that the docs say "undocumented" for parallelize
<https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/SparkContext.html#parallelize%28scala.collection.Seq,%20int,%20scala.reflect.ClassTag%29>,
which leaves me reading books for examples that don't always pertain.)
I am trying to create an RDD of length N = 10^6 by executing N
operations of a Java class we have. I can have that class implement
Serializable or any Function if necessary. I don't have a fixed-length
dataset up front; I am trying to create one. I am trying to figure out
whether to create a dummy array of length N to parallelize, or to pass
it a function that runs N times.
I'm not sure which approach is valid or better. In Spark, if I start
out with a well-defined data set like the words in a doc, the
length/count of those words is already defined, and I just parallelize
some map or filter to do some operation on that data.
In my case I think it's different: I am trying to parallelize the
creation of an RDD that will contain 10^6 elements. Here's a lot more
info if you want ...
DESCRIPTION:
In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes
a PipeLinkageData and returns a DropResult.
I am thinking I could use map() or flatMap() to call a one-to-many
function. I was trying to do something like this in another question,
which never quite worked
<http://stackoverflow.com/questions/33882283/build-spark-javardd-list-from-dropresult-objects>:
    JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1, getSimCount())).map(
        new Function<Integer, DropResult>() {
            public DropResult call(Integer i) {
                return pld.doDrop();
            }
        });
Or is something like this closer to the correct approach? This version
has more context if desired:
    // pld is of type PipeLinkageData, it's already initialized
    // parallelize wants a collection passed into first param
    List<PipeLinkageData> pldListofOne = new ArrayList();
    // make an ArrayList of one
    pldListofOne.add(pld);
    int howMany = 1000000;
    JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne).flatMap(
        new FlatMapFunction<PipeLinkageData, DropResult>() {
            public Iterable<DropResult> call(PipeLinkageData pld) {
                List<DropResult> returnRDD = new ArrayList();
                // is Spark good at spreading a for loop like this?
                for (int i = 0; i < howMany; i++) {
                    returnRDD.add(pld.doDrop());
                }
                return returnRDD;
            }
        });
One other concern: is a JavaRDD correct here? I can see needing to
call FlatMapFunction, but I don't need a FlatMappedRDD? And since I am
never trying to flatten a group of arrays or lists to a single array
or list, do I really ever need to flatten anything?