That's very useful information. The cause of the weird problem is the non-determinism of the RDD before applying randomSplit: the splits re-evaluate the lineage, so each split can see different data. Caching the RDD makes it deterministic, which solves the problem. Thank you for your help.
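For anyone who hits the same issue, here is a minimal sketch of the fix, assuming the same ratings/userProducts RDDs as in the quoted code below:

// Cache the parent RDD so both splits are computed from the same
// materialized data instead of re-evaluating a non-deterministic lineage.
val userProducts = ratings.map(x => (x.user, x.product)).cache()
userProducts.count() // force materialization before splitting
val splits = userProducts.randomSplit(Array(0.7, 0.3))
val train = splits(0)
val test = splits(1)
// train and test are now disjoint samples of userProducts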
2016-02-21 11:12 GMT+07:00 Ted Yu <[email protected]>:

> Have you looked at:
> SPARK-12662 Fix DataFrame.randomSplit to avoid creating overlapping splits
>
> Cheers
>
> On Sat, Feb 20, 2016 at 7:01 PM, tuan3w <[email protected]> wrote:
>
>> I'm training a model using MLlib. When I try to split data into training
>> and test sets, I ran into a weird problem. I can't figure out what is
>> happening here.
>>
>> Here is my code from the experiment:
>>
>> val logData = rdd.map(x => (x._1, x._2)).distinct()
>> val ratings: RDD[Rating] = logData.map(x => Rating(x._1, x._2, 1))
>> val userProducts = ratings.map(x => (x.user, x.product))
>> val splits = userProducts.randomSplit(Array(0.7, 0.3))
>> val train = splits(0)
>> train.count() // 1660895
>> val test = splits(1)
>> test.count() // 712306
>> // test whether any element appears in both splits
>> train.map(x => (x._1 + "_" + x._2, 1))
>>   .join(test.map(x => (x._1 + "_" + x._2, 2))).take(5)
>> // returns res153: Array[(String, (Int, Int))] = Array((1172491_2899,(1,2)),
>> // (1206777_1567,(1,2)), (91828_571,(1,2)), (329210_2435,(1,2)),
>> // (24356_135,(1,2)))
>>
>> If I save the RDD to HDFS and load it back, this problem doesn't happen:
>>
>> userProducts.map(x => x._1 + ":" + x._2)
>>   .saveAsTextFile("/user/tuannd/test2.txt")
>> val userProducts = sc.textFile("/user/tuannd/test2.txt").map(x => {
>>   val d = x.split(":")
>>   (d(0).toInt, d(1).toInt)
>> })
>> // other steps are the same as above
>>
>> I'm using Spark 1.5.2.
>> Thanks for all your help.
