You're asking whether it's more time-efficient to do a single "universal sort" of all the data versus first grouping by cf and sorting each group individually? That sounds like a question more appropriate for the Spark user list.

-n

On Wed, Jul 30, 2014 at 8:01 PM, Jianshi Huang <[email protected]> wrote:

> I need to generate from a 2TB dataset and exploded it to 4 Column Families.
>
> The result dataset is likely to be 20TB or more. I'm currently using Spark
> so I sorted the (rk, cf, cq) myself. It's huge and I'm considering how to
> optimize it.
>
> My question is:
> Should I sort and write each column family one by one, or should I put them
> all together then do sort and write?
>
> Does my question make sense?
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
