Hi

I currently have a bunch of data in JSON format in HDFS. I would like to use
Pig to load it, dedupe it, and store it back using Snappy compression.

Currently I do something like this:

raw = LOAD '$INPUT' USING PigJsonLoader();
uniq = DISTINCT raw;
STORE uniq INTO '$OUTPUT' USING PigStorage();

If I add the following properties to the Pig job, it seems to write the files
with a '.snappy' extension:

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
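
(I believe the same properties can also be set from inside the Pig script
itself with SET statements rather than in the job configuration XML; this is
just the inline form of the exact same property names above, assuming a Pig
version that supports SET:

-- Inline equivalents of the XML properties above
SET mapred.output.compress true;
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type BLOCK;

Either way, the settings should apply to the job's output.)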

Is this all I need to do, or do I need to write it in a different format?
And is there a way to load the Snappy-compressed JSON data back in, or do I
need to implement a new load function?

Any help is much appreciated.

Thanks
