Greetings,

I have created an RDD of 600,000 rows and then joined it with itself. For some reason Spark consumes all of my storage, which is more than 20 GB of free space! Is this the expected behavior of Spark, or am I doing something wrong here? The code is shown below (written in Java). I also tried to cache the RDD, but then I got a Java heap space exception! Is there a way around it? Note that the input file is only 150 MB.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Initializing Spark
JavaSparkContext sc = new JavaSparkContext(conf);

// Reading a file that has 600,000 rows and transforming it into an RDD of <Integer, Row>.
// The key is a hash code over the similarity attributes of a row (java.lang.String.hashCode),
// so that similar rows hash to the same key.
JavaPairRDD<Integer, Row> rdd1 = sc.textFile(filePath1, 7)
        .mapToPair(new PairFunction<String, Integer, Row>() {
            @Override
            public Tuple2<Integer, Row> call(String arg0) throws Exception {
                Row row = new Row(arg0, true);
                return new Tuple2<Integer, Row>(row.getHashCode(), row);
            }
        });

// Joining rdd1 with itself to pair similar rows with each other.
JavaPairRDD<Integer, Tuple2<Row, Row>> joined = rdd1.join(rdd1);

Your help is highly appreciated.

Regards,
Hasan
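P.S. Since Row is my own class, here is a rough, simplified sketch of its relevant parts, in case it helps. The field and helper names below are placeholders, not the real implementation; the real class parses more attributes.

import java.io.Serializable;

// Simplified sketch of my Row class (field and helper names are placeholders).
// It is Serializable so Spark can shuffle it during the join.
public class Row implements Serializable {
    private final String line;
    private final String similarityAttributes;

    public Row(String line, boolean parse) {
        this.line = line;
        // The boolean controls parsing in the real class; simplified away here.
        this.similarityAttributes = extractSimilarityAttributes(line);
    }

    // The join key: java.lang.String.hashCode over the similarity attributes,
    // so rows with the same attributes get the same key.
    public int getHashCode() {
        return similarityAttributes.hashCode();
    }

    private static String extractSimilarityAttributes(String line) {
        // Placeholder for the real attribute-extraction logic.
        return line;
    }
}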