Ok, it was reading the postscript (via OrcProto$PostScript.parseFrom), which is the very first thing the reader does.
The first thing to try is to see if you have a proper postscript somewhere in the file. If you are on Mac or Linux, try:

% strings -n 3 -t d example/decimal.orc | grep ORC

Replacing example/decimal.orc with your ORC file. You'll get an output like:

0 ORC
16333 ORC

which are the offsets where "ORC" is located. The ORC format puts it once at the front of the file (so that the "file" command can detect the format) and once at the end of the postscript. (There is always one byte after the last ORC, which is the length of the postscript, so the total length of the file should be the final offset + 4.)

.. Owen

On Tue, Sep 26, 2017 at 1:36 AM, Yonatan Augarten <[email protected]> wrote:

> No, the file is invalid. The problem is that our code sometimes generates invalid ORC files.
> The code is always called from a single thread, and it performs a series of "addRowBatch" actions on a writer.
> The file is then closed and loaded to a hive table.
> This works 99% of the times, but in some cases the resulting file is somehow corrupt.
> See below the stack trace of an attempt to run orcfiledump on this file.
>
> Thanks for your help,
> Yoni.
>
> Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
>     at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:99)
>     at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:498)
>     at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16466)
>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16424)
>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16562)
>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16557)
>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.parseFrom(OrcProto.java:16910)
>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:374)
>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:311)
>     at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
>     at org.apache.hadoop.hive.ql.io.orc.FileDump.printMetaData(FileDump.java:96)
>     at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:81)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:497)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>
> On Tue, Sep 26, 2017 at 12:11 AM, Owen O'Malley <[email protected]> wrote:
>
>> On Mon, Sep 25, 2017 at 12:47 PM, Yonatan Augarten <[email protected]> wrote:
>>
>>> Would you say that it's likely that this error (*Protocol message contained an invalid tag (zero)*) is caused by the wrong version?
>>
>> No, it is likely something else. However, I haven't seen that error coming out of the ORC reader before. Can you give me the whole stack trace? Are you sure that it is a valid ORC file?
>>
>> Thanks,
>> Owen
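
If "strings" is not available, the same marker check from the top of this message can be approximated in a few lines of Python. This is only a rough sketch under the layout described there ("ORC" at offset 0, "ORC" again just before the final byte, and the final byte holding the postscript length). The function names check_orc_markers and check_orc_file are made up for illustration and are not part of any ORC tooling. It does not parse the postscript itself.

```python
import os

MAGIC = b"ORC"


def check_orc_markers(head, tail):
    """Check the two "ORC" markers given the first 3 and last 4 bytes of a file.

    Returns (header_ok, tail_ok, postscript_length). postscript_length is
    None when fewer than 4 tail bytes are available.
    """
    header_ok = head == MAGIC
    if len(tail) < len(MAGIC) + 1:
        return header_ok, False, None
    tail_ok = tail[:len(MAGIC)] == MAGIC   # "ORC" at the end of the postscript
    postscript_length = tail[len(MAGIC)]   # final byte: length of the postscript
    return header_ok, tail_ok, postscript_length


def check_orc_file(path):
    """Read only the first 3 and last 4 bytes of `path` and check the markers."""
    with open(path, "rb") as f:
        head = f.read(len(MAGIC))
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(size - (len(MAGIC) + 1), 0))
        tail = f.read(len(MAGIC) + 1)
    return check_orc_markers(head, tail)
```

Calling check_orc_file("example/decimal.orc") returns a tuple; if the second element is False, the file does not end with "ORC" followed by a length byte, which would be consistent with a truncated or partially written tail.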
