You can compress a CSV or tab-delimited file as well :) You can specify the codec of your choice, say Snappy, when writing out. That's what we do. You can also write out data as sequence files. RCFile should also be possible given the flexibility of the Spark API, but we haven't tried that.

On Dec 7, 2013 2:02 AM, "Ankur Chauhan" <[email protected]> wrote:
> Hi all,
>
> I am wondering what people use as the on-disk storage format. I have
> seen almost all the examples use CSV files to store and load data, but that
> seems too simplistic for obvious reasons (compressibility, to name one). I
> was just interested to find out what people use to store computation
> results. For example, consider that you did some computation on some log
> files and want to store all sorts of metrics for each and every user so
> that you can later use Shark to query it interactively. What is the
> preferred or good format to store all the data? Parquet? RCFiles? CSV? JSON?
>
> -- Ankur
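The reply above can be sketched in Scala. This is a minimal illustration, not a definitive recipe: the paths, the `(user, count)` placeholder data, and the local `SparkContext` setup are assumptions; the two output calls (`saveAsTextFile` with a codec class, and `saveAsSequenceFile` on a pair RDD) are standard Spark RDD API.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit conversions for pair-RDD methods
import org.apache.hadoop.io.compress.SnappyCodec

object CompressedOutput {
  def main(args: Array[String]): Unit = {
    // Hypothetical local context just for the sketch.
    val sc = new SparkContext("local", "compressed-output")

    // Placeholder per-user metrics -- stand-in for real computation results.
    val metrics = sc.parallelize(Seq(("alice", 3), ("bob", 7)))

    // Write CSV lines, compressed with Snappy by passing the codec class.
    metrics.map { case (user, n) => s"$user,$n" }
      .saveAsTextFile("/tmp/metrics_csv", classOf[SnappyCodec])

    // Or write the (key, value) pairs out as a Hadoop SequenceFile.
    metrics.saveAsSequenceFile("/tmp/metrics_seq")

    sc.stop()
  }
}
```

Note that Snappy support requires the native Snappy libraries to be available to Hadoop on the cluster; swapping in `GzipCodec` works out of the box if they are not.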
