I should probably mention that my attempt to use the 'hadoop' command for this task failed (the file is fairly large, about 80GB compressed):

$ HADOOP_HEAPSIZE=3000 hadoop fs -text /path/to/file | wc -c
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
        at java.lang.StringCoding.encode(StringCoding.java:344)
        at java.lang.StringCoding.encode(StringCoding.java:387)
        at java.lang.String.getBytes(String.java:956)
        at org.apache.hadoop.fs.FsShell$TextRecordInputStream.read(FsShell.java:391)
        at java.io.InputStream.read(InputStream.java:179)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
        at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:122)
        at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:50)
        at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:427)
        at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1934)
        at org.apache.hadoop.fs.FsShell.text(FsShell.java:421)
        at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1597)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1798)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1916)

On Sat, Nov 23, 2013 at 3:14 PM, Robert Dyer <[email protected]> wrote:

> Is there an easy way to get the uncompressed size of a sequence file that
> is block compressed? I am using the Snappy compressor.
>
> I realize I can obviously just decompress them to temporary files to get
> the size, but I would assume there is an easier way. Perhaps an existing
> tool that my search did not turn up?
>
> If not, I will have to run a MR job load each compressed block and read
> the Snappy header to get the size. I need to do this for a large number of
> files so I'd prefer a simple CLI tool (sort of like 'hadoop fs -du').
>
> - Robert
