Re: Reading Avro to GenericRecord

Magnus Runesson Tue, 28 Jan 2014 01:09:03 -0800

Thanks! Looks like it works for me.

Here is a patch to expose it to scrunch:

diff --git a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
index 89b331b..b77b042 100644
--- a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
+++ b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
@@ -19,11 +19,14 @@ package org.apache.crunch.scrunch

 import org.apache.crunch.io.{From => from, To => to, At => at}
 import org.apache.crunch.types.avro.AvroType
-import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.conf.Configuration
+;

 trait From {

   def avroFile[T](path: String, atype: AvroType[T]) = from.avroFile(path, atype)
   def avroFile[T](path: Path, atype: AvroType[T]) = from.avroFile(path, atype)
+  def avroFile[T](path: Path, conf: Configuration) = from.avroFile(path, conf)

   def textFile(path: String) = from.textFile(path)
   def textFile(path: Path) = from.textFile(path)
 }
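
For anyone following along, here is a minimal sketch of what "pull the schema out of the file" means at the plain Avro level. It is not the code in CRUNCH-333; it only illustrates that an Avro container file carries its writer schema in the header, so a reader can recover it without being given one. The object name AvroSchemaFromFile, the helper readSchema, and the sample "User" schema are made up for this example, and it assumes the Avro library is on the classpath and the file is on the local filesystem.

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

// Hypothetical helper object, for illustration only.
object AvroSchemaFromFile {

  // Open an Avro container file and return the schema stored in its header.
  // No schema is passed to the reader; it is taken from the file itself.
  def readSchema(file: File): Schema = {
    val reader = new DataFileReader[GenericRecord](file, new GenericDatumReader[GenericRecord]())
    try reader.getSchema finally reader.close()
  }

  def main(args: Array[String]): Unit = {
    // Made-up schema, used only to produce a sample file to read back.
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}""")
    val file = File.createTempFile("users", ".avro")
    file.deleteOnExit()

    // Write one record so that the file has a valid header and data block.
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, file)
    val record = new GenericData.Record(schema)
    record.put("name", "magnus")
    writer.append(record)
    writer.close()

    // Recover the schema from the file and print its name.
    println(readSchema(file).getFullName)
  }
}
```

With the patch above applied, the scrunch side would presumably reduce to a call like avroFile(path, conf), with the schema lookup happening inside Crunch.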


On 1/28/14 2:04 AM, Josh Wills wrote:

Patch is here: https://issues.apache.org/jira/browse/CRUNCH-333

On Mon, Jan 27, 2014 at 10:08 AM, Josh Wills <[email protected]> wrote:


    Of course. I wrote up a little patch that adds a method to
    From.java to open the Avro file and pull out the schema and return
    a Source of GenericData.Record, but I had to roll to some meetings
    before I got a chance to test it. I'll post something later this
    evening ET.

    On Jan 27, 2014 11:56 AM, "Magnus Runesson" <[email protected]> wrote:

        Thanks for the quick answer.

        It is totally OK and reasonable to take one file in a
        directory and assume all the others have the same schema.


        On 2014-01-27 18:27, Josh Wills wrote:

        No, I haven't written a way to do that yet, and I feel bad
        about it-- a Clouderan asked me for just such a feature a
        couple of weeks ago and it slipped my mind. I don't think
        it's hard to do, just a little tedious and will require
        refreshing my memory of the Avro APIs. There's also the
        potential issue that multiple Avro files in the same input
        directory can have different schemas, so the one we would end
        up reading might be somewhat arbitrary (e.g., based on the
        timestamp of the files in the directory, or some such
        thing)-- is that ok?


        On Mon, Jan 27, 2014 at 9:12 AM, Magnus Runesson
        <[email protected]> wrote:

            Can I, in (s)crunch, read an Avro file into GenericRecord
            without providing the schema? I want crunch to get the
            schema from the Avro file it reads. How do I do that?

            /Magnus





--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
