Thanks for the quick reply. I won't be able to collect more data until Monday, but I will update the thread once I have it.
I am using Spark 1.4.0. Were there any related issues reported? I wasn't able to find any, but I may have overlooked something. I have also updated the original question to include the relevant Java files; maybe the issue is hidden there somewhere.

Ted Yu <[email protected]> wrote on Fri, 31 July 2015 at 18:09:

> Can you call collect() and log the output to get more of a clue about
> what is left?
>
> Which Spark release are you using?
>
> Cheers
>
> On Fri, Jul 31, 2015 at 9:01 AM, Warfish <[email protected]> wrote:
>
>> Hi everyone,
>>
>> I have been working with Spark for a little while now and have
>> encountered a strange problem with the JavaRDD.subtract method that is
>> giving me headaches. Consider the following piece of code:
>>
>> public static void main(String[] args) {
>>     // context is of type JavaSparkContext; FILE is the file path to my input file
>>     JavaRDD<String> rawTestSet = context.textFile(FILE);
>>     JavaRDD<String> rawTestSet2 = context.textFile(FILE);
>>
>>     // Gives 0 every time -> Correct
>>     System.out.println("rawTestSetMinusRawTestSet2 = " +
>>         rawTestSet.subtract(rawTestSet2).count());
>>
>>     // SearchData is a custom POJO that holds my data
>>     JavaRDD<SearchData> testSet = convert(rawTestSet);
>>     JavaRDD<SearchData> testSet2 = convert(rawTestSet);
>>     JavaRDD<SearchData> testSet3 = convert(rawTestSet2);
>>
>>     // These calls give numbers != 0 in cluster mode -> Incorrect
>>     System.out.println("testSetMinusTestSet2 = " +
>>         testSet.subtract(testSet2).count());
>>     System.out.println("testSetMinusTestSet3 = " +
>>         testSet.subtract(testSet3).count());
>>     System.out.println("testSet2MinusTestSet3 = " +
>>         testSet2.subtract(testSet3).count());
>> }
>>
>> private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
>>     return input.filter(new Matches("myRegex"))
>>         .map(new DoSomething())
>>         .map(new Split("mySplitParam"))
>>         .map(new ToMap())
>>         .map(new Clean())
>>         .map(new ToSearchData());
>> }
>>
>> In this code, I read a file (usually from HDFS, but the same applies
>> to local disk) and then convert the Strings into custom objects that
>> hold the data, using a chain of filter and map operations. These
>> objects are simple POJOs with overridden hashCode() and equals()
>> methods. I then apply the subtract method to several JavaRDDs that
>> contain exactly equal data.
>>
>> Note: I have omitted the POJO code and the filter and map functions to
>> keep the code concise, but I can post them later if the need arises.
>>
>> The main method shown above contains several calls to subtract, all of
>> which should return empty RDDs because the data in all RDDs should be
>> exactly the same. This works for Spark in local mode. However, when
>> executing the code on a cluster, the second block of subtract calls
>> does not result in empty sets, which tells me that it is a more
>> complicated issue. The input data in local and cluster mode was
>> exactly the same.
>>
>> Can someone shed some light on this issue? I feel like I'm overlooking
>> something rather obvious.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-JavaRDD-subtract-JavaRDD-method-in-local-vs-cluster-mode-tp24099.html
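
For reference, since the POJO code is omitted above: subtract compares elements through equals() and hashCode(), and on a cluster the elements are shuffled between JVMs by hash, so both methods have to depend only on field values. Below is a minimal sketch of what such a class could look like. The class name SearchDataSketch and its two fields are purely hypothetical stand-ins, not the actual SearchData from this thread, and this is not a confirmed diagnosis of the problem.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the SearchData POJO; the real class is not shown in this thread.
final class SearchDataSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String query;                // assumed field, for illustration only
    private final Map<String, String> fields;  // assumed field, for illustration only

    SearchDataSketch(String query, Map<String, String> fields) {
        this.query = query;
        this.fields = new HashMap<>(fields);   // defensive copy of the caller's map
    }

    // Equality is based on field values only, never on object identity.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof SearchDataSketch)) {
            return false;
        }
        SearchDataSketch that = (SearchDataSketch) other;
        return Objects.equals(query, that.query) && Objects.equals(fields, that.fields);
    }

    // The hash is derived from String and Map<String, String> hash codes, which are
    // value-based. Fields whose hashCode() is identity-based (arrays, enums, classes
    // that do not override hashCode()) could yield different values in different JVMs.
    @Override
    public int hashCode() {
        return Objects.hash(query, fields);
    }

    // Builds the same logical record from scratch on every call.
    static SearchDataSketch sample() {
        Map<String, String> fields = new HashMap<>();
        fields.put("lang", "en");
        fields.put("source", "web");
        return new SearchDataSketch("spark subtract", fields);
    }

    public static void main(String[] args) {
        SearchDataSketch a = sample();
        SearchDataSketch b = sample();
        System.out.println("equal = " + a.equals(b));
        System.out.println("same hash = " + (a.hashCode() == b.hashCode()));
    }
}
```

If two independently built instances of the real SearchData do not pass the same two checks, or if one of its fields has an identity-based hashCode(), that would be a plausible place to look first.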
