Dear All,

I would like to know how, in Spark 2.0, I can split a dataframe into two
dataframes when I know the exact counts the two dataframes should have. I
tried using limit but got quite weird results. Also, I am looking for exact
counts in the child dfs, not an approximate percentage-based split.

*Following is what I have tried:*

var dfParent = spark.read.parquet("somelocation"); // let's say it has 4000 rows

I want to split the parent into two dfs with the following counts:

var dfChild1Count = 1000

var dfChild2Count = 3000

*I tried this: *

var dfChild1 = dfParent.limit(dfChild1Count);

var dfChild2 = dfParent.except(dfChild1);

*and wrote that to output hdfs directories:*

dfChild1.write.parquet("/outputfilechild1");

dfChild2.write.parquet("/outputfilechild2");

It turns out this results in some rows being duplicated across the
files outputfilechild1 & outputfilechild2. Could anyone explain why there
are duplicates?

When I sorted my parent dataframe before the limit, it worked fine:

*dfParent = dfParent.orderBy(col("unique_col").desc)*

It seems the limit on the parent is evaluated twice and returns different
records each time. I am not sure why it runs twice when I call it only
once.
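For reference, this is the full sequence that worked for me once the sort
was in place ("unique_col" just stands in for any column in my data whose
values are unique):

```scala
import org.apache.spark.sql.functions.col

// Order first so that limit/except both see the same, stable row order.
val dfSorted = dfParent.orderBy(col("unique_col").desc)

val dfChild1 = dfSorted.limit(dfChild1Count)
val dfChild2 = dfSorted.except(dfChild1)

dfChild1.write.parquet("/outputfilechild1")
dfChild2.write.parquet("/outputfilechild2")
```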

Also, is there a better way to split a df into multiple dfs when we know
the exact counts of the child dfs?
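For example, I wondered whether something along these lines would be more
reliable — assigning each row a stable index with row_number over an
explicit ordering, then filtering on that index. This is only a sketch I
have not benchmarked; a window with no partitioning funnels all rows
through a single task, so it may not scale to large data:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Give every row a deterministic 1-based index ("unique_col" is again
// a placeholder for any column with unique values).
val w = Window.orderBy(col("unique_col"))
val dfIndexed = dfParent.withColumn("rn", row_number().over(w))

// Split on the index, then drop the helper column.
val dfChild1 = dfIndexed.filter(col("rn") <= dfChild1Count).drop("rn")
val dfChild2 = dfIndexed.filter(col("rn") > dfChild1Count).drop("rn")
```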




Regards,
Mohit
