This is a very open-ended question, so it's hard to give a specific answer... it depends a lot on whether disk I/O is a bottleneck in your workload and whether you tend to analyze all of each record or only certain fields. If you are doing a lot of disk I/O and only touching a few fields, a columnar format like Parquet might help, or (simpler) just creating smaller projections of your data with only the fields you care about. Tab-delimited formats can also have less serialization overhead than JSON, so flattening the data might help as well. It really depends on your access patterns and data types.
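To make the projection idea concrete, here is a rough sketch: read the gzipped newline-delimited JSON, keep only the fields the analysis touches, and write a tab-delimited copy. The field names ("ts", "userId"), paths, and use of the Scala standard library's JSON parser are all placeholders/assumptions, so treat this as an illustration rather than a recipe:

    import org.apache.spark.SparkContext
    import scala.util.parsing.json.JSON

    object ProjectFields {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[4]", "project-fields")

        // textFile decompresses .gz input transparently, one JSON record per line.
        val lines = sc.textFile("events/*.json.gz")

        // Keep only the fields the analysis touches; "ts" and "userId" are
        // placeholders for whatever your records actually contain.
        val projected = lines.flatMap { line =>
          JSON.parseFull(line) match {
            case Some(m: Map[_, _]) =>
              val rec = m.asInstanceOf[Map[String, Any]]
              Some(Seq(rec.getOrElse("ts", ""), rec.getOrElse("userId", "")).mkString("\t"))
            case _ => None // drop lines that fail to parse
          }
        }

        // A tab-delimited projection is smaller on disk and cheaper to re-parse
        // than the full JSON records.
        projected.saveAsTextFile("events-projected")

        sc.stop()
      }
    }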
In many cases with Spark, another important question is how the user stores the data in memory, not the on-disk format. It does depend on how they are using Spark, though (see the caching sketch after the quoted thread below).

- Patrick

On Sun, Dec 8, 2013 at 3:03 PM, Andrew Ash <[email protected]> wrote:
> LZO compression at a minimum, and using Parquet as a second step,
> seems like the way to go, though I haven't tried either personally yet.
>
> Sent from my mobile phone
>
> On Dec 8, 2013, at 16:54, Ankur Chauhan <[email protected]> wrote:
>
>> Hi all,
>>
>> Sorry for posting this again, but I am interested in finding out what
>> different on-disk data formats there are for storing timeline event and
>> analytics aggregate data.
>>
>> Currently I am just using newline-delimited JSON in gzipped files. I was
>> wondering if there were any recommendations.
>>
>> -- Ankur
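As a rough illustration of the in-memory side, the RDD holding the projected data can be cached in serialized form, which usually shrinks the memory footprint at the cost of some CPU on access. The path is a placeholder and this only shows one storage-level choice (spark-shell style):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local[4]", "cache-projected")

    // MEMORY_ONLY_SER keeps cached partitions as compact serialized byte arrays
    // instead of deserialized Java objects, trading some CPU for a smaller footprint.
    val events = sc.textFile("events-projected")
      .map(_.split("\t"))
      .persist(StorageLevel.MEMORY_ONLY_SER)

    // The first action materializes the cache; subsequent passes reuse it.
    println(events.count())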
