Dear All, I would like to know how, in Spark 2.0, I can split a dataframe into two dataframes when I know the exact counts the two dataframes should have. I tried using limit but got quite weird results. Also, I am looking for exact counts in the child dfs, not an approximate percentage-based split.
*Following is what I have tried:*

var dfParent = spark.read.parquet("somelocation") // let's say it has 4000 rows

I want to split the parent into two dfs with the following counts:

var dfChild1Count = 1000
var dfChild2Count = 3000

*I tried this:*

var dfChild1 = dfParent.limit(dfChild1Count)
var dfChild2 = dfParent.except(dfChild1)

*and wrote them to output HDFS directories:*

dfChild1.write.parquet("/outputfilechild1")
dfChild2.write.parquet("/outputfilechild2")

It turns out this results in some duplicate rows saved across outputfilechild1 and outputfilechild2. Could anyone explain why there are duplicates? When I sorted the parent dataframe before calling limit, it worked fine:

*dfParent = dfParent.orderBy(col("unique_col").desc)*

It seems the limit on the parent is executed twice and returns different records each time. I am not sure why it is executed twice when I call it only once. Also, is there a better way to split a df into multiple dfs when we know the exact counts of the child dfs?

Regards,
Mohit
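
P.S. One workaround I am considering, in case it helps the discussion: instead of relying on limit/except (where each action may re-evaluate the parent and pick different rows), assign a stable row number and filter on it. This is only a sketch under my own assumptions: it assumes a SparkSession named spark, that unique_col from my example gives a total ordering, and that the single-partition shuffle caused by an unpartitioned Window is acceptable for the data size.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumes an existing SparkSession `spark` and a column `unique_col`
// whose ordering is total, so row numbers are deterministic.
val dfParent = spark.read.parquet("somelocation")

// NOTE: Window.orderBy without partitionBy pulls all rows into one
// partition; fine for moderate data, a bottleneck for very large data.
val w = Window.orderBy(col("unique_col"))

// Cache so both child writes see the same materialized row numbers
// instead of recomputing (and possibly reordering) the parent twice.
val indexed = dfParent.withColumn("rn", row_number().over(w)).cache()

val dfChild1 = indexed.filter(col("rn") <= 1000).drop("rn")
val dfChild2 = indexed.filter(col("rn") > 1000).drop("rn")

dfChild1.write.parquet("/outputfilechild1")
dfChild2.write.parquet("/outputfilechild2")
```

With this approach every row lands in exactly one child, and the counts are exact (1000 and the remainder), because the split is decided by a deterministic row number rather than by whichever rows a re-executed limit happens to return.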