Hi Patrick,

I agree this is a very open-ended question, but I was hoping for a general answer anyway, and I think you did hint at some of the nuances.

1. My workload is definitely bottlenecked by disk IO, simply because even when projecting onto a single column (usually 2-3 out of 20) there is a lot of data to churn through.
2. The fields are mostly HTTP headers plus some known parameter fields from a GET request, so analysis on, say, account id and user agent or IP address is fairly selective.
3. Flattening the fields and using CSV definitely looks like something I can try out; a rough sketch of what I have in mind follows below.
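This untested sketch (plain Scala/RDD API; the field names, paths and master URL are placeholders, and I am just using the JSON parser that ships with Scala for illustration) is roughly the projection step I was thinking of:

    import org.apache.spark.SparkContext

    // Placeholder master URL and application name -- sketch only.
    val sc = new SparkContext("spark://master:7077", "event-projection")

    // Read the existing newline-delimited, gzipped JSON events.
    val events = sc.textFile("hdfs:///events/2013-12/*.json.gz")

    // Keep only the handful of fields the queries actually touch and write
    // them back out tab-delimited, along the lines Patrick suggested.
    val fields = Seq("timestamp", "account_id", "user_agent", "ip_address")
    val projected = events.flatMap { line =>
      scala.util.parsing.json.JSON.parseFull(line) match {
        case Some(record: Map[_, _]) =>
          val obj = record.asInstanceOf[Map[String, Any]]
          Some(fields.map(k => obj.getOrElse(k, "").toString).mkString("\t"))
        case _ => None // skip lines that do not parse as a JSON object
      }
    }

    projected.saveAsTextFile("hdfs:///events-projected/2013-12")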
I believe Parquet files can be created with a sorted column (for example timestamp), which would make selecting the right segment of data easier too (although I don't have any experience with Parquet files). What is the recommended way of interacting (read/write) with Parquet files? I have put a rough sketch of what I am imagining in a PS below the quoted message.

-- Ankur

On 8 Dec 2013, at 17:38, Patrick Wendell <[email protected]> wrote:

> This is a very open ended question so it's hard to give a specific
> answer... it depends a lot on whether disk IO is a bottleneck in your
> workload and whether you tend to analyze all of each record or only
> certain fields. If you are doing disk IO a lot and only touching a few
> fields something like Parquet might help, or (simpler) just creating
> smaller projections of your data with only the fields you care about.
> Tab delimited formats can have less serialization overhead than JSON,
> so flattening the data might also help. It really depends on your
> access patterns and data types.
>
> In many cases with Spark another important question is how the user
> stores the data in-memory, not the on-disk format. It does depend how
> they are using Spark though.
>
> - Patrick
>
> On Sun, Dec 8, 2013 at 3:03 PM, Andrew Ash <[email protected]> wrote:
>> LZO compression at a minimum, and using Parquet as a second step,
>> seems like the way to go though I haven't tried either personally yet.
>>
>> Sent from my mobile phone
>>
>> On Dec 8, 2013, at 16:54, Ankur Chauhan <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Sorry for posting this again but I am interested in finding out what
>>> different on disk data formats for storing timeline event and analytics
>>> aggregate data.
>>>
>>> Currently I am just using newline delimited json gzipped files. I was
>>> wondering if there were any recommendations.
>>>
>>> -- Ankur
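PS: To make the Parquet question a bit more concrete, this is the kind of read/write path I am imagining. It is a rough, untested sketch using Spark's DataFrame API (spark.read.parquet / DataFrame.write.parquet); the paths, column names and filter value are made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()

    // One-off conversion: newline-delimited JSON -> Parquet, with each
    // output partition sorted by timestamp so range scans touch fewer row groups.
    val events = spark.read.json("hdfs:///events/2013-12/*.json.gz")
    events.sortWithinPartitions("timestamp")
      .write
      .parquet("hdfs:///events-parquet/2013-12")

    // Analysis side: only the selected columns are actually read off disk.
    val projected = spark.read.parquet("hdfs:///events-parquet/2013-12")
      .select("timestamp", "account_id", "user_agent", "ip_address")
      .where("account_id = '12345'")
    projected.show()

The appeal here is that the column projection and the on-disk sort order would be handled by the format itself rather than by my own parsing code.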
