Thanks for the quick reply. I won't be able to collect more data until
Monday, but I will update the thread accordingly.

I am using Spark 1.4.0. Were there any related issues reported? I wasn't
able to find any, but I may have overlooked something. I have also updated
the original question to include the relevant Java files; maybe the issue
is hidden somewhere in there.
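
In the meantime, here is roughly how I intend to act on the collect()
suggestion. This is only a sketch, assuming the leftover RDD is small
enough to collect to the driver; the variable names match the code from my
original post:

    // Materialize the supposedly-empty difference and log whatever is left.
    JavaRDD<SearchData> leftover = testSet.subtract(testSet2);
    java.util.List<SearchData> remaining = leftover.collect();
    System.out.println("leftover count = " + remaining.size());
    for (SearchData sd : remaining) {
        // Printing the hashCode alongside the object may hint at an
        // equals()/hashCode() inconsistency across JVMs.
        System.out.println("left over: " + sd + " (hashCode=" + sd.hashCode() + ")");
    }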

Ted Yu <[email protected]> wrote on Fri, Jul 31, 2015 at 18:09:

> Can you call collect() and log the output to get more clue what is left ?
>
> Which Spark release are you using ?
>
> Cheers
>
> On Fri, Jul 31, 2015 at 9:01 AM, Warfish <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> I have been working with Spark for a little while now and have run into a
>> strange problem that is giving me headaches, involving the
>> JavaRDD.subtract method. Consider the following piece of code:
>>
>>     public static void main(String[] args) {
>>         // context is of type JavaSparkContext; FILE is the path to my input file
>>         JavaRDD<String> rawTestSet  = context.textFile(FILE);
>>         JavaRDD<String> rawTestSet2 = context.textFile(FILE);
>>
>>         // Gives 0 every time -> correct
>>         System.out.println("rawTestSetMinusRawTestSet2 = "
>>                 + rawTestSet.subtract(rawTestSet2).count());
>>
>>         // SearchData is a custom POJO that holds my data
>>         JavaRDD<SearchData> testSet  = convert(rawTestSet);
>>         JavaRDD<SearchData> testSet2 = convert(rawTestSet);
>>         JavaRDD<SearchData> testSet3 = convert(rawTestSet2);
>>
>>         // These calls give numbers != 0 in cluster mode -> incorrect
>>         System.out.println("testSetMinusTestSet2  = "
>>                 + testSet.subtract(testSet2).count());
>>         System.out.println("testSetMinusTestSet3  = "
>>                 + testSet.subtract(testSet3).count());
>>         System.out.println("testSet2MinusTestSet3 = "
>>                 + testSet2.subtract(testSet3).count());
>>     }
>>
>>     private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
>>         return input.filter(new Matches("myRegex"))
>>                     .map(new DoSomething())
>>                     .map(new Split("mySplitParam"))
>>                     .map(new ToMap())
>>                     .map(new Clean())
>>                     .map(new ToSearchData());
>>     }
>>
>> In this code, I read a file (usually from HDFS, but the same applies to a
>> local disk) and then convert the Strings into custom objects that hold the
>> data, using a chain of filter and map operations. These objects are simple
>> POJOs with overridden hashCode() and equals() methods. I then apply the
>> subtract method to several JavaRDDs that contain exactly the same data.
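>>
>> For illustration only, the POJO has roughly this shape (the field is a
>> placeholder assuming the Map produced by the ToMap() step; it is not the
>> real code):
>>
>>     public class SearchData implements java.io.Serializable {
>>         // Placeholder field; the real class holds the parsed data.
>>         private final java.util.Map<String, String> fields;
>>
>>         public SearchData(java.util.Map<String, String> fields) {
>>             this.fields = fields;
>>         }
>>
>>         @Override
>>         public boolean equals(Object o) {
>>             if (this == o) return true;
>>             if (!(o instanceof SearchData)) return false;
>>             return fields.equals(((SearchData) o).fields);
>>         }
>>
>>         @Override
>>         public int hashCode() {
>>             return fields.hashCode();
>>         }
>>     }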
>>
>> Note: I have omitted the POJO code and the filter and map functions to
>> keep the code concise, but I can post them later if the need arises.
>>
>> The main method shown above makes several calls to subtract, all of which
>> should yield empty RDDs, because the data in all the RDDs should be
>> exactly the same. This works in Spark's local mode; however, when the code
>> is executed on a cluster, the second block of subtract calls does not
>> produce empty sets, which tells me that this is a more complicated issue.
>> The input data in local and cluster mode was exactly the same.
>>
>> Can someone shed some light on this issue? I feel like I'm overlooking
>> something rather obvious.
>>
>
