Thanks! Looks like it works for me.
Here is a patch to expose it to scrunch:
diff --git a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
index 89b331b..b77b042 100644
--- a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
+++ b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
@@ -19,11 +19,14 @@ package org.apache.crunch.scrunch
 import org.apache.crunch.io.{From => from, To => to, At => at}
 import org.apache.crunch.types.avro.AvroType
-import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.conf.Configuration
+;
 trait From {
   def avroFile[T](path: String, atype: AvroType[T]) =
     from.avroFile(path, atype)
   def avroFile[T](path: Path, atype: AvroType[T]) =
     from.avroFile(path, atype)
+  def avroFile[T](path: Path, conf: Configuration) =
+    from.avroFile(path, conf)
   def textFile(path: String) = from.textFile(path)
   def textFile(path: Path) = from.textFile(path)
 }
On 1/28/14 2:04 AM, Josh Wills wrote:
Patch is here: https://issues.apache.org/jira/browse/CRUNCH-333
On Mon, Jan 27, 2014 at 10:08 AM, Josh Wills wrote:
Of course. I wrote up a little patch that adds a method to
From.java to open the Avro file and pull out the schema and return
a Source of GenericData.Record, but I had to roll to some meetings
before I got a chance to test it. I'll post something later this
evening ET.
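For readers following the thread, the approach described above (open the Avro file, pull out its schema, and build a source of generic records from it) might look roughly like the sketch below. This is not the CRUNCH-333 patch itself: `AvroSchemaSketch` and `readSchema` are hypothetical names, and the sketch assumes the avro-mapred artifact is on the classpath so that `FsInput` is available.

```scala
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.mapred.FsInput
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, not part of Crunch or Scrunch.
object AvroSchemaSketch {
  // Opens an Avro data file through the Hadoop FileSystem API and returns
  // the writer schema stored in the file header. No records are decoded.
  def readSchema(file: Path, conf: Configuration): Schema = {
    val reader = new DataFileReader[GenericRecord](
      new FsInput(file, conf),
      new GenericDatumReader[GenericRecord]())
    try reader.getSchema
    finally reader.close() // also closes the underlying input
  }
}
```

The returned schema could then presumably be wrapped with `Avros.generics(schema)` and handed to the existing `from.avroFile(path, atype)` overload, which is roughly what a schema-less `avroFile` would need to do internally.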
On Jan 27, 2014 11:56 AM, "Magnus Runesson" wrote:
Thanks for the quick answer.
It is totally OK and reasonable to take one file in a
directory and assume all the others have the same schema.
On 2014-01-27 18:27, Josh Wills wrote:
No, I haven't written a way to do that yet, and I feel bad
about it-- a Clouderan asked me for just such a feature a
couple of weeks ago and it slipped my mind. I don't think
it's hard to do, just a little tedious and will require
refreshing my memory of the Avro APIs. There's also the
potential issue that multiple Avro files in the same input
directory can have different schemas, so the one we would end
up reading might be somewhat arbitrary (e.g., based on the
timestamp of the files in the directory, or some such
thing)-- is that ok?
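One way to make that "somewhat arbitrary" choice predictable is to always take the most recently modified data file in the directory. A rough sketch of that idea follows; `SchemaFilePicker` and `newestAvroFile` are hypothetical names, and filtering on a `.avro` suffix is an assumption about how the files are named, not something Crunch requires.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, not part of Crunch or Scrunch.
object SchemaFilePicker {
  // Returns the most recently modified "*.avro" file directly under `dir`,
  // or None if the directory contains no such file.
  def newestAvroFile(dir: Path, conf: Configuration): Option[Path] = {
    val fs = dir.getFileSystem(conf)
    val candidates =
      fs.listStatus(dir).filter(_.getPath.getName.endsWith(".avro"))
    if (candidates.isEmpty) None
    else Some(candidates.maxBy(_.getModificationTime).getPath)
  }
}
```

If the files in a directory really do carry different schemas, any single-file rule like this one will still silently pick just one of them, so it only fits the case where one schema per directory is a safe assumption.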
On Mon, Jan 27, 2014 at 9:12 AM, Magnus Runesson wrote:
Can I read an Avro file into a GenericRecord in (S)crunch
without providing the schema? I want Crunch to get the
schema from the Avro file it reads. How do I do that?
/Magnus
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>