Confirm this won't parallelize/partition?

Jim Lohse Sat, 28 Nov 2015 12:23:53 -0800

Hi, I got a good answer on the main question elsewhere; would anyone please confirm the first code is the right approach? For an MVCE I am trying to adapt this example, and it seems like I am having Java issues with types:


(but this is basically the right approach?)

int count = spark.parallelize(makeRange(1, NUM_SAMPLES)).filter(new Function<Integer, Boolean>() {
  public Boolean call(Integer i) {
    double x = Math.random();
    double y = Math.random();
    return x*x + y*y < 1;
  }
}).count();
System.out.println("Pi is roughly " + 4 * count / NUM_SAMPLES);
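
On the type issue: in the Java API, count() on a JavaRDD returns a long, so assigning it to an int will not compile, and 4 * count / NUM_SAMPLES is integer division in any case. Below is a minimal sketch of the same estimate with those two points addressed. It uses a plain java.util.stream pipeline as a stand-in for the RDD so that it runs without a cluster; the class name PiSketch, the method estimatePi, and the sample count are made up for illustration and are not part of the Spark example.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// Stand-in for the Spark example: a parallel stream plays the role of the RDD.
final class PiSketch {

    // Estimates pi from the fraction of random points inside the unit quarter circle.
    static double estimatePi(int numSamples) {
        // count() on a stream returns long, as count() on a JavaRDD does.
        long count = IntStream.range(0, numSamples)
                .parallel()
                .filter(i -> {
                    double x = ThreadLocalRandom.current().nextDouble();
                    double y = ThreadLocalRandom.current().nextDouble();
                    return x * x + y * y < 1;
                })
                .count();
        // 4.0 forces floating-point division; 4 * count / numSamples would truncate.
        return 4.0 * count / numSamples;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimatePi(1_000_000));
    }
}
```

In the Spark version the equivalent changes would be declaring count as long and multiplying by 4.0.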

And this is definitely the wrong approach? Will the loop inside the function execute entirely on one partition? I want to be sure I understood the other answer correctly. Thanks!

JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne).flatMap(new FlatMapFunction<PipeLinkageData, DropResult>() {
  public Iterable<DropResult> call(PipeLinkageData pld) {
    List<DropResult> returnRDD = new ArrayList<>();
    // is Spark good at spreading a for loop like this?
    for (int i = 0; i < howMany; i++) {
      returnRDD.add(pld.doDrop());
    }
    return returnRDD;
  }
});
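
As far as I understand it, parallelize splits the input collection into slices by index, and each slice becomes one partition; flatMap then calls the function once per input element, inside the task that owns that element. With a one-element list there is only one call, so the whole for loop runs in a single task no matter how many partitions exist. The sketch below is a plain-Java model of that slicing, not Spark code: PartitionSketch, slice, busySlices, and the index arithmetic are my assumptions about the behaviour, written so that it runs without Spark.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of index-based slicing; this is not the Spark API.
final class PartitionSketch {

    // Splits data into numSlices contiguous slices; slice i covers
    // the index range [i * n / numSlices, (i + 1) * n / numSlices).
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        long n = data.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) (i * n / numSlices);
            int end = (int) ((i + 1) * n / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    // Number of slices holding at least one element, i.e. tasks with work to do.
    static <T> long busySlices(List<T> data, int numSlices) {
        return slice(data, numSlices).stream().filter(s -> !s.isEmpty()).count();
    }

    public static void main(String[] args) {
        // One seed element: a single slice is non-empty, so one task runs the loop.
        System.out.println(busySlices(Collections.singletonList("pld"), 4));
        // A range of 1000 dummy elements: every slice gets work.
        List<Integer> range = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            range.add(i);
        }
        System.out.println(busySlices(range, 4));
    }
}
```

Under this model the first call prints 1 and the second prints 4, which is why parallelizing a range of N dummy elements and calling doDrop() once per element spreads the work, while the single-element flatMap does not.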




On 11/27/2015 4:18 PM, Jim wrote:

 Hello there,
(Part of my problem is that docs that say "undocumented" on parallelize <https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/SparkContext.html#parallelize%28scala.collection.Seq,%20int,%20scala.reflect.ClassTag%29> leave me reading books for examples that don't always pertain.)
I am trying to create an RDD of length N = 10^6 by executing N operations of a Java class we have; I can have that class implement Serializable or any Function if necessary. I don't have a fixed-length dataset up front, I am trying to create one. I am trying to figure out whether to create a dummy array of length N to parallelize, or pass it a function that runs N times.
Not sure which approach is valid/better. I see in Spark that if I am starting out with a well-defined data set like words in a doc, the length/count of those words is already defined and I just parallelize some map or filter to do some operation on that data.
In my case I think it's different: I am trying to parallelize the creation of an RDD that will contain 10^6 elements... here's a lot more info if you want...
DESCRIPTION:
In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes a PipeLinkageData and returns a DropResult.
I am thinking I could use map() or flatMap() to call a one-to-many function. I was trying to do something like this in another question that never quite worked <http://stackoverflow.com/questions/33882283/build-spark-javardd-list-from-dropresult-objects>:
JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1, getSimCount())).map(new Function<Integer, DropResult>() {
  public DropResult call(Integer i) {
    return pld.doDrop();
  }
});
Thinking something like this is more the correct approach? And this has more context if desired:
// pld is of type PipeLinkageData, it's already initialized
// parallelize wants a collection passed into first param
List<PipeLinkageData> pldListofOne = new ArrayList<>();
// make an ArrayList of one
pldListofOne.add(pld);
int howMany = 1000000;
JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne).flatMap(new FlatMapFunction<PipeLinkageData, DropResult>() {
  public Iterable<DropResult> call(PipeLinkageData pld) {
    List<DropResult> returnRDD = new ArrayList<>();
    // is Spark good at spreading a for loop like this?
    for (int i = 0; i < howMany; i++) {
      returnRDD.add(pld.doDrop());
    }
    return returnRDD;
  }
});
One other concern: is a JavaRDD correct here? I can see needing to call FlatMapFunction, but I don't need a FlatMappedRDD? And since I am never trying to flatten a group of arrays or lists to a single array or list, do I really ever need to flatten anything?
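
On the flattening question, the difference between the two calls is only how many outputs each input may produce: map gives exactly one output per input, while flatMap lets each input return a collection and concatenates those collections into one flat result. In the Java API both return a JavaRDD. The sketch below shows the same distinction with java.util.stream as an analogy, not with Spark; FlatMapSketch, oneToOne, oneToMany, and the "drop-" strings are illustrative names standing in for doDrop() results.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Stream analogy for map vs flatMap; the names here are illustrative only.
final class FlatMapSketch {

    // map: exactly one output element per input element.
    static List<String> oneToOne(List<Integer> inputs) {
        return inputs.stream()
                .map(i -> "drop-" + i)
                .collect(Collectors.toList());
    }

    // flatMap: each input yields a list of outputs, concatenated into one flat list.
    static List<String> oneToMany(List<Integer> inputs, int perInput) {
        return inputs.stream()
                .flatMap(i -> Collections.nCopies(perInput, "drop-" + i).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(oneToOne(Arrays.asList(1, 2, 3)));
        System.out.println(oneToMany(Arrays.asList(1), 3));
    }
}
```

If every input should produce exactly one DropResult, the one-to-one form is enough and nothing needs flattening.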
