What is the schema for the data?
If every field is a string, then you could end up in this situation: Avro encodes each string as a varint length prefix followed by its UTF-8 bytes, so an all-string record carries roughly as much per-field overhead as CSV's delimiters, and the binary encoding buys you nothing. Your best bet is to use compression for the Avro data (see the sketch at the end of this message). If you have a lot of CSV files that you want to convert to compressed Avro, there are some command-line tools in the Kite SDK [1] that might help. Check out this example:

http://kitesdk.org/docs/current/guide/Using-the-Kite-CLI-to-Create-a-Dataset/

-Joey

[1] http://kitesdk.org/docs/current/

--
Joey Echeverria

On Fri, Sep 19, 2014 at 3:31 AM, diplomatic Guru <[email protected]> wrote:
> I've been experimenting with MapReduce jobs using CSV and the Avro format.
> What I find strange is that the Avro output is larger than the CSV input.
>
> For example, I exported some data as CSV, which came to about 1.6GB. I then
> wrote a schema and a MapReduce job to take that CSV, serialize it, and
> write the output back to HDFS.
>
> When I checked the file size of the output, it was 2.4GB. I assumed that
> the size would be smaller because the data is converted into binary, but I
> was wrong. Do you know what the reason is, and could you refer me to some
> documentation on this?
>
> I've checked the .avro file and I can see that the header contains the
> schema and the rest is data blocks.
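
P.S. Here is a minimal standalone sketch of the compression suggestion using the core Avro file API. The class name, the two-field all-string schema, and the naive comma splitting are illustrative assumptions, not your actual job; CodecFactory.snappyCodec() needs snappy-java on the classpath, and CodecFactory.deflateCodec(6) works with no extra dependencies.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    /** Illustrative converter: assumes a two-string-field schema and
     *  simple comma-separated input with no quoting or escaping. */
    public class CsvToCompressedAvro {
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"string\"}]}");

      public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             DataFileWriter<GenericRecord> out =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
          // The key line: choose a codec before creating the file.
          out.setCodec(CodecFactory.snappyCodec()); // or CodecFactory.deflateCodec(6)
          out.create(SCHEMA, new File(args[1]));
          String line;
          while ((line = in.readLine()) != null) {
            String[] fields = line.split(",", 2);
            if (fields.length < 2) {
              continue; // skip malformed lines for this sketch
            }
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("id", fields[0]);
            record.put("value", fields[1]);
            out.append(record);
          }
        }
      }
    }

In your MapReduce job the same idea applies at the output format rather than on a DataFileWriter; if I remember right, avro-mapred exposes an output codec setting on the job configuration ("avro.output.codec"), but check the AvroJob javadoc for the version you're running.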
