Hi Patrick,

I agree this is a very open-ended question, but I was hoping for a general answer anyway, and I think you did hint at some of the nuances.

1. My workload is definitely bottlenecked by disk IO, simply because even when projecting onto a single column (usually 2-3 out of 20) there is a lot of data to churn through.
2. The fields are mostly HTTP headers plus some known parameter fields from a GET request, so analysis on, say, account id and user agent or IP address is fairly selective.
3. Flattening the fields and using CSV definitely looks like something I can try out; a rough sketch of what I have in mind follows below.
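This untested sketch (plain Scala/RDD API; the field names, paths and master URL are placeholders, and I am just using the JSON parser that ships with Scala for illustration) is roughly the projection step I was thinking of:

    import org.apache.spark.SparkContext

    // Placeholder master URL and application name -- sketch only.
    val sc = new SparkContext("spark://master:7077", "event-projection")

    // Read the existing newline-delimited, gzipped JSON events.
    val events = sc.textFile("hdfs:///events/2013-12/*.json.gz")

    // Keep only the handful of fields the queries actually touch and write
    // them back out tab-delimited, along the lines Patrick suggested.
    val fields = Seq("timestamp", "account_id", "user_agent", "ip_address")
    val projected = events.flatMap { line =>
      scala.util.parsing.json.JSON.parseFull(line) match {
        case Some(record: Map[_, _]) =>
          val obj = record.asInstanceOf[Map[String, Any]]
          Some(fields.map(k => obj.getOrElse(k, "").toString).mkString("\t"))
        case _ => None // skip lines that do not parse as a JSON object
      }
    }

    projected.saveAsTextFile("hdfs:///events-projected/2013-12")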
I believe Parquet files can be created with a sorted column (for example timestamp), which would make selecting the right segment of data easier too (although I don't have any experience with Parquet files). What is the recommended way of interacting (read/write) with Parquet files? I have put a rough sketch of what I am imagining in a PS below the quoted message.

-- Ankur

On 8 Dec 2013, at 17:38, Patrick Wendell <[email protected]> wrote:

> This is a very open ended question so it's hard to give a specific
> answer... it depends a lot on whether disk IO is a bottleneck in your
> workload and whether you tend to analyze all of each record or only
> certain fields. If you are doing disk IO a lot and only touching a few
> fields something like Parquet might help, or (simpler) just creating
> smaller projections of your data with only the fields you care about.
> Tab delimited formats can have less serialization overhead than JSON,
> so flattening the data might also help. It really depends on your
> access patterns and data types.
>
> In many cases with Spark another important question is how the user
> stores the data in-memory, not the on-disk format. It does depend how
> they are using Spark though.
>
> - Patrick
>
> On Sun, Dec 8, 2013 at 3:03 PM, Andrew Ash <[email protected]> wrote:
>> LZO compression at a minimum, and using Parquet as a second step,
>> seems like the way to go though I haven't tried either personally yet.
>>
>> Sent from my mobile phone
>>
>> On Dec 8, 2013, at 16:54, Ankur Chauhan <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Sorry for posting this again but I am interested in finding out what
>>> different on disk data formats for storing timeline event and analytics
>>> aggregate data.
>>>
>>> Currently I am just using newline delimited json gzipped files. I was
>>> wondering if there were any recommendations.
>>>
>>> -- Ankur
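PS: To make the Parquet question a bit more concrete, this is the kind of read/write path I am imagining. It is a rough, untested sketch using Spark's DataFrame API (spark.read.parquet / DataFrame.write.parquet); the paths, column names and filter value are made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()

    // One-off conversion: newline-delimited JSON -> Parquet, with each
    // output partition sorted by timestamp so range scans touch fewer row groups.
    val events = spark.read.json("hdfs:///events/2013-12/*.json.gz")
    events.sortWithinPartitions("timestamp")
      .write
      .parquet("hdfs:///events-parquet/2013-12")

    // Analysis side: only the selected columns are actually read off disk.
    val projected = spark.read.parquet("hdfs:///events-parquet/2013-12")
      .select("timestamp", "account_id", "user_agent", "ip_address")
      .where("account_id = '12345'")
    projected.show()

The appeal here is that the column projection and the on-disk sort order would be handled by the format itself rather than by my own parsing code.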
