What is the schema for the data?

If every field is a string, then you could end up in this situation: Avro
writes strings as length-prefixed UTF-8 bytes, so an all-string schema saves
nothing over delimited text, and the container's block framing adds a little
overhead on top. Your best bet is to enable compression for the Avro data.

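If you're on the newer org.apache.avro.mapreduce API, that's just a couple
of calls in the job driver. A minimal sketch (assuming an existing driver;
yourSchemaJson and the output path are placeholders):

    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyOutputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "csv-to-avro");

    // Declare the output schema so AvroKeyOutputFormat knows what to write.
    Schema schema = new Schema.Parser().parse(yourSchemaJson);
    AvroJob.setOutputKeySchema(job, schema);
    job.setOutputFormatClass(AvroKeyOutputFormat.class);

    // Enable compression on the Avro container file blocks.
    // "deflate" is built into Avro; "snappy" needs the snappy libraries.
    FileOutputFormat.setCompressOutput(job, true);
    job.getConfiguration().set("avro.output.codec", "snappy");

    FileOutputFormat.setOutputPath(job, new Path("/path/to/output"));

Since CSV-ish data is mostly text, either codec will usually bring the
output well below the original CSV size.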

If you have a lot of CSV files that you want to convert to compressed Avro,
there are some command line tools in the Kite SDK[1] that might help.

Check out this example:

http://kitesdk.org/docs/current/guide/Using-the-Kite-CLI-to-Create-a-Dataset/

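For reference, the flow that guide walks through looks roughly like this
(placeholder file and dataset names; the guide has the exact options):

    # infer an Avro schema from a CSV sample
    kite-dataset csv-schema ratings.csv --class Rating -o rating.avsc

    # create a dataset with that schema
    kite-dataset create ratings --schema rating.avsc

    # import the CSV records, writing them as Avro
    kite-dataset csv-import ratings.csv ratings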

-Joey

[1] http://kitesdk.org/docs/current/
--
Joey Echeverria

On Fri, Sep 19, 2014 at 3:31 AM, diplomatic Guru <[email protected]>
wrote:

> I've been experimenting with a MapReduce job using CSV and Avro formats.
> What I find strange is that the Avro format is larger than CSV.
> For example, I exported some data in CSV, which is about 1.6GB. I then
> wrote a schema and a MapReduce job to take that CSV, serialize it, and
> write the output back to HDFS.
> When I checked the file size of the output, it was 2.4GB. I assumed that
> the size would be smaller because it converts the data into binary, but I
> was wrong. Do you know what the reason is? Could you refer me to some
> documentation on this?
> I've checked the .avro file and I could see that the header contains the
> schema and the rest are data blocks.
