This is a very open-ended question, so it's hard to give a specific answer... it depends a lot on whether disk I/O is a bottleneck in your workload and whether you tend to analyze all of each record or only certain fields. If you are doing a lot of disk I/O and only touching a few fields, a columnar format like Parquet might help, or (simpler) just creating smaller projections of your data with only the fields you care about. Tab-delimited formats can also have less serialization overhead than JSON, so flattening the data might help as well. It really depends on your access patterns and data types.
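To make the projection idea concrete, here is a rough sketch: read the gzipped newline-delimited JSON, keep only the fields the analysis touches, and write a tab-delimited copy. The field names ("ts", "userId"), paths, and use of the Scala standard library's JSON parser are all placeholders/assumptions, so treat this as an illustration rather than a recipe:

    import org.apache.spark.SparkContext
    import scala.util.parsing.json.JSON

    object ProjectFields {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[4]", "project-fields")

        // textFile decompresses .gz input transparently, one JSON record per line.
        val lines = sc.textFile("events/*.json.gz")

        // Keep only the fields the analysis touches; "ts" and "userId" are
        // placeholders for whatever your records actually contain.
        val projected = lines.flatMap { line =>
          JSON.parseFull(line) match {
            case Some(m: Map[_, _]) =>
              val rec = m.asInstanceOf[Map[String, Any]]
              Some(Seq(rec.getOrElse("ts", ""), rec.getOrElse("userId", "")).mkString("\t"))
            case _ => None // drop lines that fail to parse
          }
        }

        // A tab-delimited projection is smaller on disk and cheaper to re-parse
        // than the full JSON records.
        projected.saveAsTextFile("events-projected")

        sc.stop()
      }
    }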
In many cases with Spark, another important question is how the user stores the data in memory, not the on-disk format. It does depend on how they are using Spark, though (see the caching sketch after the quoted thread below).

- Patrick

On Sun, Dec 8, 2013 at 3:03 PM, Andrew Ash <[email protected]> wrote:
> LZO compression at a minimum, and using Parquet as a second step,
> seems like the way to go, though I haven't tried either personally yet.
>
> Sent from my mobile phone
>
> On Dec 8, 2013, at 16:54, Ankur Chauhan <[email protected]> wrote:
>
>> Hi all,
>>
>> Sorry for posting this again, but I am interested in finding out what
>> different on-disk data formats there are for storing timeline event and
>> analytics aggregate data.
>>
>> Currently I am just using newline-delimited JSON in gzipped files. I was
>> wondering if there were any recommendations.
>>
>> -- Ankur
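As a rough illustration of the in-memory side, the RDD holding the projected data can be cached in serialized form, which usually shrinks the memory footprint at the cost of some CPU on access. The path is a placeholder and this only shows one storage-level choice (spark-shell style):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local[4]", "cache-projected")

    // MEMORY_ONLY_SER keeps cached partitions as compact serialized byte arrays
    // instead of deserialized Java objects, trading some CPU for a smaller footprint.
    val events = sc.textFile("events-projected")
      .map(_.split("\t"))
      .persist(StorageLevel.MEMORY_ONLY_SER)

    // The first action materializes the cache; subsequent passes reuse it.
    println(events.count())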
