Thanks for your reply! We will take a shot at creating a derived PType for Optional. We already use Optional for nullable fields in our codebase, so it will be consistent with what we have.
I did log a JIRA[1] for documenting this oddity.

[1] https://issues.apache.org/jira/browse/CRUNCH-407

-Surbhi

On Fri, May 30, 2014 at 4:00 PM, Josh Wills <[email protected]> wrote:

> One thought would be to use an Option-style type for the value of the
> PTable, so that you would have PTable<Key, Optional<T>>, using Guava's
> Optional wrapper class, described here:
>
> https://code.google.com/p/guava-libraries/wiki/UsingAndAvoidingNullExplained
>
> You would need to create a derived PType for Optionals, perhaps using
> PTypeFamily.collections(ptype), so that a zero-element collection would map
> to Optional.absent() and a single-element collection would map to
> Optional.of(value). If you're going to have null values in your
> PCollections, null-checking your code is the sort of thing you're certainly
> doing anyway, so adding a bit of boilerplate to handle the null values
> officially is probably a good idea.
>
> Unfortunately, the org.apache.avro.mapred.Pair stuff is part of the
> underlying Avro MR framework; it isn't something we directly control in
> Crunch.
>
> On Fri, May 30, 2014 at 1:07 PM, Surbhi Mungre <[email protected]>
> wrote:
>
>> In our codebase there are various places where we are using PTables
>> with null values, because we were under the impression that PTables
>> support null values like PCollections do. Based on Josh's reply below, we
>> will have to replace our PTables with PCollections of Pair so that null
>> values can be supported. If we make this change, then we will have to
>> convert the PCollection back to a PTable wherever we want to do an
>> operation supported only by PTables, like groupByKey. Replacing all
>> PTables with PCollections is going to add a lot of unnecessary complexity.
>> Does anyone know of a better workaround for this problem?
>>
>> Also, a PTable and a PCollection of Pair should essentially behave in
>> the same manner. In this case, functionally a PTable does not seem like an
>> extension of a PCollection.
>> So, is it possible for PTables to use "org.apache.crunch.Pair" under the
>> covers instead of "org.apache.avro.mapred.Pair"?
>>
>> -Surbhi
>>
>>> *From:* Josh Wills <[email protected]>
>>> *Reply-To:* "[email protected]" <[email protected]>
>>> *Date:* Thursday, February 6, 2014 2:10 PM
>>> *To:* "[email protected]" <[email protected]>
>>> *Subject:* Re: crunch 0.8.2+6-cdh4.4.0
>>>
>>> Hey Stephen-
>>>
>>> Thanks for putting this together; I had an inkling of what the issue was
>>> when you sent the stack trace, and I managed to verify it using your unit
>>> test. The org.apache.avro.mapred.Pair schema that is used in PTable isn't
>>> defined by Crunch and doesn't allow null values. If you change your
>>> example code to use Avros.pairs() instead of tableOf(), the test works
>>> properly, because Crunch tuples do permit nulls.
>>>
>>> I'm unsure of the best way to fix this, if we want to. We could force
>>> AvroFileTarget to convert its inputs to a Crunch tuple and not allow it
>>> to write out an Avro Pair schema, but I'd be concerned that this would
>>> break other use cases that I'm not aware of.
>>>
>>> J
>>>
>>> On Thu, Feb 6, 2014 at 9:35 AM, Stephen Durfey <[email protected]>
>>> wrote:
>>>
>>> Attached is a unit test that emits a null Avro record that causes this
>>> stack trace to occur. The file that is being read in doesn't matter, as
>>> its contents are ignored; I just needed to read in something to use the
>>> MRPipeline. Also, this test fails on both 0.8.2+6-cdh4.4.0 and
>>> 0.8.0-cdh4.3.0 (I haven't tried versions older than this), which leads me
>>> to believe that this test isn't completely representative of the
>>> complexity in my code base. Nevertheless, it is the exact same exception
>>> with the same emit behavior.
>>>
>>> Side note: in MemPipeline, when writing a PTable as a text file, there
>>> is an assumption that all key and value pairs are non-null.
>>> This assumption could be elsewhere in MemPipeline#write as well, but I
>>> haven't tried the other flows.
>>>
>>> On Wed, Jan 29, 2014 at 4:46 PM, Stephen Durfey <[email protected]>
>>> wrote:
>>>
>>> This is the full stack trace. I removed some parts of the stack trace
>>> that emit the null values:
>>>
>>> org.apache.crunch.CrunchRuntimeException:
>>> org.apache.avro.file.DataFileWriter$AppendWriteException:
>>> java.lang.NullPointerException: null of <specific avro model name> in
>>> field value of org.apache.avro.mapred.Pair
>>>   at org.apache.crunch.impl.mr.emit.MultipleOutputEmitter.emit(MultipleOutputEmitter.java:45)
>>>   at org.apache.crunch.MapFn.process(MapFn.java:34)
>>>   at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
>>>   at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
>>>   at org.apache.crunch.MapFn.process(MapFn.java:34)
>>>   at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:99)
>>>   at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
>>>   <code that emits the null value>
>>>   at org.apache.crunch.impl.mr.run.RTNode.cleanup(RTNode.java:118)
>>>   at org.apache.crunch.impl.mr.run.RTNode.cleanup(RTNode.java:121)
>>>   at org.apache.crunch.impl.mr.run.RTNode.cleanup(RTNode.java:121)
>>>   at org.apache.crunch.impl.mr.run.RTNode.cleanup(RTNode.java:121)
>>>   at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:63)
>>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:166)
>>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
>>>   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>>>   at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>> Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException:
>>> java.lang.NullPointerException: null of <specific avro model name> in
>>> field value of org.apache.avro.mapred.Pair
>>>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
>>>   at org.apache.crunch.types.avro.AvroOutputFormat$1.write(AvroOutputFormat.java:87)
>>>   at org.apache.crunch.types.avro.AvroOutputFormat$1.write(AvroOutputFormat.java:84)
>>>   at org.apache.crunch.io.CrunchOutputs.write(CrunchOutputs.java:128)
>>>   at org.apache.crunch.impl.mr.emit.MultipleOutputEmitter.emit(MultipleOutputEmitter.java:41)
>>>   ... 23 more
>>> Caused by: java.lang.NullPointerException: null of <specific avro model
>>> name> in field value of org.apache.avro.mapred.Pair
>>>   at org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:93)
>>>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:87)
>>>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
>>>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:257)
>>>   ... 27 more
>>> Caused by: java.lang.NullPointerException
>>>   at org.apache.avro.generic.GenericData.getField(GenericData.java:537)
>>>   at org.apache.avro.generic.GenericData.getField(GenericData.java:552)
>>>   at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
>>>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
>>>   at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
>>>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
>>>   ... 29 more
>>>
>>> On Wed, Jan 29, 2014 at 4:30 PM, Micah Whitacre <[email protected]>
>>> wrote:
>>>
>>> That actually seems likely, as the persistence of fixed bytes in the
>>> Buffered vs. Direct encoders differs.
>>>
>>> Stephen, can you include the full stack trace of the NPE? That will help
>>> to verify whether differences in the encoders are at fault.
>>>
>>> On Wed, Jan 29, 2014 at 4:14 PM, Josh Wills <[email protected]>
>>> wrote:
>>>
>>> Maybe the binary encoder change?
>>>
>>> On Jan 29, 2014 3:19 PM, "Durfey,Stephen" <[email protected]>
>>> wrote:
>>>
>>> Sorry for the initial confusion. The exceptions that I am seeing look
>>> like they were caused by another exception way up in my console log (that
>>> I originally missed). I think the true exception is an Avro exception. It
>>> is an Avro PType, and the original NPE is coming from
>>> GenericData#getField, when the GenericDatumWriter is serializing. The
>>> exception is for a null value in the org.apache.avro.mapred.Pair object
>>> (I believe this Pair is created in PairMapFn?) when a specific type of
>>> Avro record is expected. I'm having a difficult time figuring out what
>>> changed between versions. Without changing anything in my code, just
>>> changing Crunch versions causes these exceptions to be thrown.
>>>
>>> - Stephen
>>>
>>> *From:* Josh Wills <[email protected]>
>>> *Reply-To:* "[email protected]" <[email protected]>
>>> *Date:* Tuesday, January 28, 2014 at 4:08 PM
>>> *To:* "[email protected]" <[email protected]>
>>> *Subject:* Re: crunch 0.8.2+6-cdh4.4.0
>>>
>>> Hey Stephen,
>>>
>>> Slightly confused here, question inlined.
>>>
>>> On Tue, Jan 28, 2014 at 12:59 PM, Durfey,Stephen
>>> <[email protected]> wrote:
>>>
>>> This question is specifically about this version maintained by Cloudera.
>>>
>>> I was looking to update our Crunch version from 0.8.0-cdh4.3.0 to
>>> 0.8.2+6-cdh4.4.0.
>>> In the process, some of our tests started failing with
>>> NullPointerExceptions. I've discovered why these exceptions are
>>> happening, but I'm having trouble tracking down where.
>>>
>>> The exceptions occur when we emit a Pair<POJO, null> that uses an Avro
>>> PType. Previously this worked just fine: by the time CrunchOutputs
>>> started writing to a sequence file, the value would be an instance of
>>> NullWritable, and it would successfully pull off the output type for
>>> serialization (in SequenceFile.BlockCompressWriter#append(k, v)). After
>>> the version change, the value was 'null' rather than NullWritable by the
>>> time it got down to the sequence-file write.
>>>
>>> It's an AvroType that's getting written to a Sequence File? Is that
>>> right?
>>>
>>> Any thoughts?
>>>
>>> - Stephen
>>>
>>> CONFIDENTIALITY NOTICE: This message and any included attachments are
>>> from Cerner Corporation and are intended only for the addressee. The
>>> information contained in this message is confidential and may constitute
>>> inside or non-public information under international, federal, or state
>>> securities laws. Unauthorized forwarding, printing, copying,
>>> distribution, or use of such information is strictly prohibited and may
>>> be unlawful. If you are not the addressee, please promptly delete this
>>> message and notify the sender of the delivery error by e-mail, or you may
>>> call Cerner's corporate offices in Kansas City, Missouri, U.S.A. at
>>> (+1) (816) 221-1024.

> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
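[Editor's note] The derived-PType idea Josh suggests above — encoding an absent value as a zero-element collection and a present value as a one-element collection — can be sketched with plain Java. This is a minimal, self-contained illustration, not Crunch code: the class and method names are made up for this sketch, and java.util.Optional stands in for Guava's Optional (Guava uses Optional.absent()/Optional.of() rather than empty()/of()). In Crunch you would put these two conversions inside MapFns and build the derived PType over PTypeFamily.collections(ptype).

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the collection <-> Optional mapping for a
// derived PType. Empty collection == absent value; one-element
// collection == present value.
public final class OptionalCoding {

    // What the derived PType's input MapFn would do: turn the
    // underlying collection back into an Optional.
    static <T> Optional<T> fromCollection(Collection<T> c) {
        return c.isEmpty() ? Optional.empty() : Optional.of(c.iterator().next());
    }

    // What the output MapFn would do: turn the Optional into the
    // collection representation that actually gets serialized.
    static <T> Collection<T> toCollection(Optional<T> opt) {
        return opt.isPresent()
                ? Collections.singletonList(opt.get())
                : Collections.<T>emptyList();
    }

    public static void main(String[] args) {
        Optional<String> present = fromCollection(List.of("value"));
        Optional<String> absent = fromCollection(List.<String>of());
        System.out.println(present.orElse("<absent>"));   // value
        System.out.println(absent.orElse("<absent>"));    // <absent>
        System.out.println(toCollection(present).size()); // 1
        System.out.println(toCollection(absent).size());  // 0
    }
}
```

Because the null never reaches the Avro layer — it is represented structurally as an empty collection — the non-nullable org.apache.avro.mapred.Pair value schema is never asked to serialize a null.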
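[Editor's note] The nullability difference discussed above comes down to the Avro schema. A record field declared directly with a record or primitive schema cannot hold null, which is why GenericDatumWriter throws a NullPointerException for the value field of org.apache.avro.mapred.Pair; a field that is meant to hold null must be declared as a union with "null". A minimal illustration follows — the record and type names are invented for this example, and this is not the actual Pair schema:

```json
{
  "type": "record",
  "name": "NullablePairLike",
  "fields": [
    {"name": "key", "type": "string"},
    {"name": "value", "type": ["null", "string"]}
  ]
}
```

With the value field declared as the union ["null", "string"], a null value serializes as the "null" branch; had it been declared simply as "string" (as the avro.mapred.Pair value field effectively is for your model), serializing null fails exactly as in the stack trace above.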
