Spark groupby and agg inconsistent and missing data

Saif.A.Ellafi Thu, 22 Oct 2015 08:29:21 -0700

Hello everyone,

I am doing some analytics experiments under a 4 server stand-alone cluster in a 
spark shell, mostly involving a huge database with groupBy and aggregations.


I am picking 6 groupBy columns and returning various aggregated results in a 
dataframe. GroupBy fields are of two types, most of them are StringType and the 
rest are LongType.

The data source is a splitted json file dataframe,  once the data is persisted, 
the result is consistent. But if I unload the memory and reload the data, the 
groupBy action returns different content results, missing data.

Could I be missing something? this is rather serious for my analytics, and not 
sure how to properly diagnose this situation.

Thanks,
Saif

Spark groupby and agg inconsistent and missing data

Reply via email to