Thank you for the detailed explanation! Interesting. I'm getting the following (very strange) output (including the spaces before the 0):
> 0 ORC&
> 10288812 ORC
> 14991902 ORC
> 33162184 ORC_R
>
The file size is 39845888 bytes.

On Tue, Sep 26, 2017 at 11:49 AM, Owen O'Malley <[email protected]> wrote:

> Ok, it was reading the postscript (via OrcProto$Postscript.parseFrom),
> which is the very first thing it does.
>
> The first thing to try is to see if you have a proper postscript somewhere
> in the file. If you are on Mac or Linux,
> try:
>
> % strings -n 3 -t d example/decimal.orc | grep ORC
>
> Replacing example/decimal.orc with your ORC file. You'll get an output
> like:
>
> 0 ORC
> 16333 ORC
>
> which are the offsets where "ORC" is located. The ORC format puts it once
> at the front of the file (so that the "file" command can detect the format)
> and once at the end of the postscript. (There is always one byte after the
> last ORC, which is the length of the postscript, so the total length of the
> file should be the final offset + 4.)
>
> .. Owen
>
> On Tue, Sep 26, 2017 at 1:36 AM, Yonatan Augarten <[email protected]>
> wrote:
>
>> No, the file is invalid. The problem is that our code sometimes generates
>> invalid ORC files.
>> The code is always called from a single thread, and it performs a series
>> of "addRowBatch" actions on a writer.
>> The file is then closed and loaded to a hive table.
>> This works 99% of the times, but in some cases the resulting file is
>> somehow corrupt.
>> See below the stack trace of an attempt to run orcfiledump on this file.
>>
>> Thanks for your help,
>> Yoni.
>>
>> Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
>>     at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:99)
>>     at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:498)
>>     at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16466)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16424)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16562)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16557)
>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.parseFrom(OrcProto.java:16910)
>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:374)
>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:311)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
>>     at org.apache.hadoop.hive.ql.io.orc.FileDump.printMetaData(FileDump.java:96)
>>     at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:81)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>     at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>
>> On Tue, Sep 26, 2017 at 12:11 AM, Owen O'Malley <[email protected]>
>> wrote:
>>
>>> On Mon, Sep 25, 2017 at 12:47 PM, Yonatan Augarten <[email protected]>
>>> wrote:
>>>
>>>> Would you say that it's likely that this error (*Protocol message
>>>> contained an invalid tag (zero)*) is caused by the wrong version?
>>>>
>>>
>>> No, it is likely something else. However, I haven't seen that error
>>> coming out of the ORC reader before. Can you give me the whole stack trace?
>>> Are you sure that it is a valid ORC file?
>>>
>>> Thanks,
>>> Owen
>>>
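
For anyone who wants to script the check Owen describes (the last "ORC" should sit exactly 4 bytes before the end of the file), here is a minimal Python sketch. It is a rough heuristic only, not what the ORC reader itself does. The function names are made up for illustration, it scans the raw bytes for "ORC" instead of calling strings, and it assumes the file is small enough to read into memory.

```python
from pathlib import Path
from typing import List

MAGIC = b"ORC"


def find_magic_offsets(data: bytes) -> List[int]:
    """Return every byte offset at which the ASCII bytes "ORC" occur."""
    offsets = []
    pos = data.find(MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(MAGIC, pos + 1)
    return offsets


def ends_with_postscript_marker(data: bytes) -> bool:
    """True if the data ends with "ORC" followed by exactly one more byte.

    This mirrors the rule of thumb from the thread: file length should
    equal the offset of the final "ORC" plus 4.
    """
    return len(data) >= 4 and data[-4:-1] == MAGIC


def describe(path: str) -> str:
    """Summarise the file size, the last "ORC" offset, and the tail check."""
    data = Path(path).read_bytes()  # whole file in memory (assumption: it fits)
    offsets = find_magic_offsets(data)
    last = offsets[-1] if offsets else None
    expected = None if last is None else last + 4
    return (
        f"size={len(data)} last_ORC={last} "
        f"expected_size={expected} tail_ok={ends_with_postscript_marker(data)}"
    )
```

Calling describe() on the suspect file and on a known-good one should make the difference visible. With the numbers reported above, the final offset plus 4 is 33162188, well short of the 39845888-byte file size, so by this rule the file does not end in a postscript.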
