Re: Best practice for writing to HFileOutputFormat(2) with multiple Column Families

Nick Dimiduk Fri, 01 Aug 2014 09:53:38 -0700

You're asking whether it's more time-efficient to do a single "universal
sort" of all the data vs. first grouping by cf and sorting each group
individually? That seems like a question better suited to the Spark user
list.


-n


On Wed, Jul 30, 2014 at 8:01 PM, Jianshi Huang <[email protected]>
wrote:

> I need to generate HFiles from a 2TB dataset, exploding it into 4 Column
> Families.
>
> The resulting dataset is likely to be 20TB or more. I'm currently using
> Spark, so I sort by (rk, cf, cq) myself. The sort is huge and I'm
> considering how to optimize it.
>
> My question is:
> Should I sort and write each column family one by one, or should I put them
> all together and then sort and write once?
>
> Does my question make sense?
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
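
To make the two options in the question concrete, here is a minimal sketch in plain Python (not Spark, and not the HBase API). It assumes cells are modelled as hypothetical (rk, cf, cq, value) tuples and that ordinary string comparison stands in for HBase's byte-wise key ordering. It only illustrates that both approaches leave each column family in the same order; it says nothing about which one is cheaper on a 20TB dataset.

```python
from operator import itemgetter

# Hypothetical sample cells: (row key, column family, qualifier, value).
cells = [
    ("r2", "cf1", "q1", "a"),
    ("r1", "cf2", "q1", "b"),
    ("r1", "cf1", "q2", "c"),
    ("r1", "cf1", "q1", "d"),
    ("r2", "cf2", "q1", "e"),
]

def universal_sort(cells):
    """Option A: sort everything once by (rk, cf, cq), then split by family."""
    ordered = sorted(cells, key=itemgetter(0, 1, 2))
    per_family = {}
    for cell in ordered:
        # Splitting a sorted sequence keeps each family's cells in order.
        per_family.setdefault(cell[1], []).append(cell)
    return per_family

def per_family_sort(cells):
    """Option B: group by family first, then sort each group by (rk, cq)."""
    groups = {}
    for cell in cells:
        groups.setdefault(cell[1], []).append(cell)
    return {cf: sorted(group, key=itemgetter(0, 2)) for cf, group in groups.items()}

# Both options produce the same per-family ordering for this sample.
assert universal_sort(cells) == per_family_sort(cells)
```

Since the per-family output is identical, the choice comes down to shuffle and sort cost in the engine, which is why the Spark list is the right place to ask.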
