Hi, I currently have a bunch of data in JSON format in HDFS. I would like to use Pig to load it, dedupe it, and store it back using Snappy compression.
Currently I do something like this:

    raw = LOAD '$INPUT' USING PigJsonLoader();
    uniq = DISTINCT raw;
    STORE uniq INTO '$OUTPUT' USING PigStorage();

If I add the following properties to the Pig job, it seems to write the files with a '.snappy' extension:

    <property>
        <name>mapred.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapred.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
    <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
    </property>

Is this all I need to do, or do I need to write the output in a different format? And is there a way to load the Snappy-compressed JSON data back, or do I need to implement a new load function? Any help is much appreciated. Thanks.
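For what it's worth, here is a sketch of the same job with the compression properties set from within the Pig script itself via `SET`, instead of passing them as job configuration. This assumes the Snappy native libraries are installed on the cluster; `PigJsonLoader` is the loader from the original script. Note that `mapred.output.compression.type` only applies to SequenceFile output, so for text output from `PigStorage` it should have no effect:

    -- Sketch: enable Snappy compression for the job's output
    -- (assumes Snappy native libs are available on every node)
    SET mapred.output.compress true;
    SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

    raw  = LOAD '$INPUT' USING PigJsonLoader();
    uniq = DISTINCT raw;      -- dedupe
    STORE uniq INTO '$OUTPUT' USING PigStorage();

On the loading side: Hadoop's text input path transparently decompresses input files whose extension matches a registered codec (such as `.snappy`), so if `PigJsonLoader` is built on the standard text input format, reading the compressed output back with the same loader should generally work without writing a new LoadFunc. This depends on how `PigJsonLoader` is implemented, so it's worth testing on a small file first.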
