Committed this as CRUNCH-334. Thanks Magnus!
On Tue, Jan 28, 2014 at 1:07 AM, Magnus Runesson <[email protected]> wrote:

> Thanks! Looks like it works for me.
>
> Here is a patch to expose it to scrunch:
>
> diff --git a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
> index 89b331b..b77b042 100644
> --- a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
> +++ b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
> @@ -19,11 +19,14 @@ package org.apache.crunch.scrunch
>
>  import org.apache.crunch.io.{From => from, To => to, At => at}
>  import org.apache.crunch.types.avro.AvroType
> -import org.apache.hadoop.fs.Path;
> +import org.apache.hadoop.fs.Path
> +import org.apache.hadoop.conf.Configuration
> +;
>
>  trait From {
>    def avroFile[T](path: String, atype: AvroType[T]) = from.avroFile(path, atype)
>    def avroFile[T](path: Path, atype: AvroType[T]) = from.avroFile(path, atype)
> +  def avroFile[T](path: Path, conf: Configuration) = from.avroFile(path, conf)
>    def textFile(path: String) = from.textFile(path)
>    def textFile(path: Path) = from.textFile(path)
>
>  }
>
>
> On 1/28/14 2:04 AM, Josh Wills wrote:
>
> Patch is here: https://issues.apache.org/jira/browse/CRUNCH-333
>
>
> On Mon, Jan 27, 2014 at 10:08 AM, Josh Wills <[email protected]> wrote:
>
>> Of course. I wrote up a little patch that adds a method to From.java to
>> open the Avro file and pull out the schema and return a Source of
>> GenericData.Record, but I had to roll to some meetings before I got a
>> chance to test it. I'll post something later this evening ET.
>>
>> On Jan 27, 2014 11:56 AM, "Magnus Runesson" <[email protected]> wrote:
>>
>>> Thanks for quick answer.
>>>
>>> It is totally OK and reasonable to take one file in a directory and
>>> assume all other has the same schema.
>>>
>>>
>>> On 2014-01-27 18:27, Josh Wills wrote:
>>>
>>> No, I haven't written a way to do that yet, and I feel bad about it-- a
>>> Clouderan asked me for just such a feature a couple of weeks ago and it
>>> slipped my mind. I don't think it's hard to do, just a little tedious and
>>> will require refreshing my memory of the Avro APIs. There's also the
>>> potential issue that multiple Avro files in the same input directory can
>>> have different schemas, so the one we would end up reading might be
>>> somewhat arbitrary (e.g., based on the timestamp of the files in the
>>> directory, or some such thing)-- is that ok?
>>>
>>>
>>> On Mon, Jan 27, 2014 at 9:12 AM, Magnus Runesson <[email protected]> wrote:
>>>
>>>> Can I in (s)crunch read an Avro-file to GenericRecord without provide
>>>> the schema? I want crunch to get the schema from the avro-file it reads.
>>>> How do I do it?
>>>>
>>>> /Magnus
>>>>
>>>
>>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
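
For anyone reading this thread later: the idea discussed above (open one Avro file, take the schema from its header, and read the records as generic records without supplying a schema) can be sketched with the plain Avro Java API from Scala. This is only a minimal illustration under some assumptions, not the code committed in CRUNCH-333 or CRUNCH-334: it reads a local java.io.File rather than a Hadoop Path, and the object name AvroSchemaFromFile and the helpers schemaOf, readAll and writeExample are made up for the example.

```scala
import java.io.File

import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

// Hypothetical helper object; the names are illustrative and not part of Crunch.
object AvroSchemaFromFile {

  // Return the writer schema stored in the header of an Avro container file.
  def schemaOf(file: File): Schema = {
    val reader = new DataFileReader[GenericRecord](file, new GenericDatumReader[GenericRecord]())
    try reader.getSchema finally reader.close()
  }

  // Read every record without passing a schema; the reader takes it from the file header.
  def readAll(file: File): List[GenericRecord] = {
    val reader = new DataFileReader[GenericRecord](file, new GenericDatumReader[GenericRecord]())
    try {
      val records = List.newBuilder[GenericRecord]
      while (reader.hasNext) records += reader.next()
      records.result()
    } finally reader.close()
  }

  // Write a small sample file so the sketch is self-contained.
  // Assumes the schema is a record with a single long field named "id".
  def writeExample(file: File, schema: Schema, ids: Seq[Long]): Unit = {
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    try {
      writer.create(schema, file)
      ids.foreach { id =>
        val rec = new GenericData.Record(schema)
        rec.put("id", java.lang.Long.valueOf(id))
        writer.append(rec)
      }
    } finally writer.close()
  }

  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"Example","fields":[{"name":"id","type":"long"}]}""")
    val file = File.createTempFile("example", ".avro")
    file.deleteOnExit()

    writeExample(file, schema, Seq(1L, 2L, 3L))

    // Neither call below is given the schema; both recover it from the file.
    println(schemaOf(file).toString(true))
    readAll(file).foreach(println)
  }
}
```

As noted in the thread, when a directory holds several Avro files with different schemas, whichever file is opened this way decides the schema, so the result is only as reliable as the assumption that all files share one.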
