Indeed, the problem was a corner case in which our app would crash without closing the file.
Thank you for your help.

On Wed, Sep 27, 2017 at 1:02 AM, Owen O'Malley <[email protected]> wrote:

> The extra characters after the instances of ORC appear because the
> following bytes look like valid characters and the strings command is a
> generic tool. Of course you could accidentally get 0x4f, 0x52, 0x43 "ORC"
> in the file, but that is relatively unlikely.
>
> Your output implies that you used Writer.writeIntermediateFooter to put
> intermediate footers into the file. Since there is a large gap from the
> last offset to the length of the file, I would guess that your application
> didn't close the writer to get the final footer at the end of the file.
> Try passing 33162188 as the ReaderOptions.maxLength(). You should get a
> valid reader then and be able to read the data before that footer
> (ignoring the last 6 MB of data in the file).
>
> .. Owen
>
> On Tue, Sep 26, 2017 at 12:09 PM, Yonatan Augarten <[email protected]>
> wrote:
>
>> Thank you for the detailed explanation!
>>
>> Interesting. I'm getting the following (very strange) output (including
>> the spaces before the 0):
>>
>>>       0 ORC&
>>> 10288812 ORC
>>> 14991902 ORC
>>> 33162184 ORC_R
>>
>> The file size is 39845888 bytes.
>>
>> On Tue, Sep 26, 2017 at 11:49 AM, Owen O'Malley <[email protected]>
>> wrote:
>>
>>> Ok, it was reading the postscript (via OrcProto$PostScript.parseFrom),
>>> which is the very first thing it does.
>>>
>>> The first thing to try is to see if you have a proper postscript
>>> somewhere in the file. If you are on Mac or Linux, try:
>>>
>>> % strings -n 3 -t d example/decimal.orc | grep ORC
>>>
>>> Replace example/decimal.orc with your ORC file. You'll get an output
>>> like:
>>>
>>>       0 ORC
>>>   16333 ORC
>>>
>>> which are the offsets where "ORC" is located. The ORC format puts it
>>> once at the front of the file (so that the "file" command can detect the
>>> format) and once at the end of the postscript. (There is always one byte
>>> after the last ORC, which is the length of the postscript, so the total
>>> length of the file should be the final offset + 4.)
>>>
>>> .. Owen
>>>
>>> On Tue, Sep 26, 2017 at 1:36 AM, Yonatan Augarten <[email protected]>
>>> wrote:
>>>
>>>> No, the file is invalid. The problem is that our code sometimes
>>>> generates invalid ORC files.
>>>> The code is always called from a single thread, and it performs a
>>>> series of "addRowBatch" actions on a writer.
>>>> The file is then closed and loaded into a Hive table.
>>>> This works 99% of the time, but in some cases the resulting file is
>>>> somehow corrupt.
>>>> Below is the stack trace of an attempt to run orcfiledump on this file.
>>>>
>>>> Thanks for your help,
>>>> Yoni.
>>>>
>>>> Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
>>>> at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:99)
>>>> at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:498)
>>>> at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
>>>> at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16466)
>>>> at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16424)
>>>> at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16562)
>>>> at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16557)
>>>> at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
>>>> at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
>>>> at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>>>> at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.parseFrom(OrcProto.java:16910)
>>>> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:374)
>>>> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:311)
>>>> at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
>>>> at org.apache.hadoop.hive.ql.io.orc.FileDump.printMetaData(FileDump.java:96)
>>>> at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:81)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>>> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>>
>>>> On Tue, Sep 26, 2017 at 12:11 AM, Owen O'Malley <[email protected]>
>>>> wrote:
>>>>
>>>>> On Mon, Sep 25, 2017 at 12:47 PM, Yonatan Augarten <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Would you say that it's likely that this error (*Protocol message
>>>>>> contained an invalid tag (zero)*) is caused by the wrong version?
>>>>>
>>>>> No, it is likely something else. However, I haven't seen that error
>>>>> coming out of the ORC reader before. Can you give me the whole stack
>>>>> trace?
>>>>> Are you sure that it is a valid ORC file?
>>>>>
>>>>> Thanks,
>>>>> Owen
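
For anyone who lands on this thread with the same symptom, the offset check from the thread (find every "ORC", then take the last offset + 4 as a candidate length) can be sketched in a few lines of Python. This is only a rough stand-in for the strings/grep pipeline, not part of ORC itself: the function names are made up for illustration, it reads the whole file into memory, and, as noted above, a hit may be a coincidental 0x4f 0x52 0x43 in the data rather than a real postscript.

```python
from pathlib import Path
from typing import List, Tuple

MAGIC = b"ORC"  # the bytes 0x4f 0x52 0x43 mentioned in the thread


def find_magic_offsets(data: bytes) -> List[int]:
    """Return every byte offset at which the sequence "ORC" starts."""
    offsets = []
    pos = data.find(MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(MAGIC, pos + 1)
    return offsets


def candidate_max_length(offset: int) -> int:
    """Length of the file if a postscript ended at this "ORC".

    Per the thread: the 3 magic bytes are followed by exactly one byte
    (the postscript length), so the file would end at offset + 4.
    """
    return offset + len(MAGIC) + 1


def summarize(path: str) -> Tuple[List[int], int]:
    """Hypothetical helper: return (offsets of "ORC", file size in bytes)."""
    data = Path(path).read_bytes()
    return find_magic_offsets(data), len(data)
```

With the numbers reported in the thread, candidate_max_length(33162184) gives 33162188, the value suggested for ReaderOptions.maxLength(), while the file itself is 39845888 bytes. A gap like that between the last candidate and the real size is the sign that the writer was never closed.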
