On Wed, Apr 24, 2013 at 4:02 PM, Micah Whitacre <[email protected]>wrote:
> > I think it's just that. It seems relatively low-risk to me (e.g., we > already use AvroKey in the AvroPairConverter for PTables). > > Ok sounds good. Do you want me to log a bug for this? > Yes please. I'm running the small fix through regression tests now. > > > I'm also curious if you're looking at Parquet for this use case? > > Yeah was going to look at it after Trevni. It's Avro support is not as > far along (looks like ~16 days). The goal was to hopefully help get > support for both into Crunch eventually and we can choose whichever is > better for our job. > Fair enough. > > On Wed, Apr 24, 2013 at 5:52 PM, Josh Wills <[email protected]> wrote: > >> >> >> >> On Wed, Apr 24, 2013 at 3:49 PM, Micah Whitacre <[email protected]>wrote: >> >>> Is the change simply: >>> >>> private AvroWrapper<K> getWrapper() { >>> if (wrapper == null) { >>> // wrapper = new AvroWrapper<K>(); >>> wrapper = new AvroKey<K>(); >>> } >>> return wrapper; >>> } >>> >>> Or are there more changes I might be missing? Doing that got me past >>> the ClassCastException (though still trying to get my code working). >>> >>> As I indicated I'm still just trying to prove out my code and if it pans >>> out we can probably wait till the 0.7.0 release (assuming the current ~2 >>> month release cycle). I'll leave it to you to evaluate the risk. >>> >> >> I think it's just that. It seems relatively low-risk to me (e.g., we >> already use AvroKey in the AvroPairConverter for PTables). >> >> >>> >>> I'm guessing the injecting a converter issue will be more significant if >>> I try out the other Trevni format[1] where I'd need the converter to >>> support AvroValue instead of NullWritable. So I'm fine with holding off a >>> rushed change before a release in lieu of a more holistic solution to both >>> parts. >>> >>> [1] - >>> http://avro.apache.org/docs/current/api/java/org/apache/trevni/avro/mapreduce/AvroTrevniKeyValueOutputFormat.html >>> >> >> I'm also curious if you're looking at Parquet for this use case? >> >> >>> >>> >>> >>> On Wed, Apr 24, 2013 at 5:29 PM, Josh Wills <[email protected]> wrote: >>> >>>> Hey Micah, >>>> >>>> It seems like having the AvroKeyConverter use the AvroKey as the return >>>> type instead of AvroWrapper is the easiest way to solve this, since AvroKey >>>> is a subclass of AvroWrapper. That said, I agree, that's a thorny problem. >>>> We're just getting ready for the 0.6.0 release, but I'd be fine to get the >>>> switch in there if that solved this problem for you. >>>> >>>> J >>>> >>>> >>>> On Wed, Apr 24, 2013 at 3:23 PM, Micah Whitacre >>>> <[email protected]>wrote: >>>> >>>>> As an alternative to the standard AvroInput/OutputFormat, I've been >>>>> playing around with how to support alternate Avro file types like >>>>> Trevni[1], which give benefits when we want to only retrieve a subset of >>>>> the Avro object. >>>>> >>>>> Picking one of the implementations >>>>> (AvroTrevniKeyInputFormat/AvroTrevniKeyOutputFormat)[2], I implemented the >>>>> various Source/Target/SourceTarget implementations. When I started trying >>>>> to test it out (to see if I did any of it right), I hit the issue that the >>>>> AvroKeyConverter only produces AvroWrapper objects and the output format >>>>> requires AvroKey. So I get ClassCastExceptions CrunchOutputs.write(...) >>>>> method. >>>>> >>>>> Caused by: java.lang.ClassCastException: >>>>> org.apache.avro.mapred.AvroWrapper cannot be cast to >>>>> org.apache.avro.mapred.AvroKey >>>>> at >>>>> org.apache.trevni.avro.mapreduce.AvroTrevniKeyRecordWriter.write(AvroTrevniKeyRecordWriter.java:34) >>>>> at org.apache.crunch.io.CrunchOutputs.write(CrunchOutputs.java:129) >>>>> >>>>> I was hoping that the target would be able to take any PCollection<? >>>>> extends AvroType> but it looks like I'd need to implement my own PType and >>>>> force consumers to use that just to change the converter to produce >>>>> AvroKey >>>>> instead. >>>>> >>>>> Is implementing a custom PType the only way to inject an alternate >>>>> converter? That seems like a high cost on the implementation side and >>>>> forcing a restriction onto others in the pipeline who are generally happy >>>>> with the standard AvroType and shouldn't be burdened with how the data >>>>> might be stored later on in the processing. >>>>> >>>>> Thoughts? >>>>> >>>>> [1] - http://avro.apache.org/docs/current/trevni/spec.html >>>>> [2] - >>>>> http://avro.apache.org/docs/current/api/java/org/apache/trevni/avro/mapreduce/AvroTrevniKeyOutputFormat.html >>>>> >>>> >>>> >>>> >>>> -- >>>> Director of Data Science >>>> Cloudera <http://www.cloudera.com> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills> >>>> >>> >>> >> >> >> -- >> Director of Data Science >> Cloudera <http://www.cloudera.com> >> Twitter: @josh_wills <http://twitter.com/josh_wills> >> > > -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
