Re: How can I read this avro file using spark & scala?
Check this link: https://github.com/databricks/spark-avro (the home page for the spark-avro project).

Thanks,
Vishnu

On Wed, Feb 11, 2015 at 10:19 PM, Todd wrote:
> Databricks provides sample code on its website, but I can't find it right now.
>
> At 2015-02-12 00:43:07, "captainfranz" wrote:
> > I am confused as to whether Avro support was merged into Spark 1.2 or whether
> > it is still an independent library. I see some people writing
> > sqlContext.avroFile, similarly to jsonFile, but this does not work for me,
> > nor do I see it in the Scala docs.
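For reference, a minimal read sketch along the lines of the spark-avro README of the time; the file and field names come from the sample data in that repo and are illustrative:

    import org.apache.spark.sql.SQLContext
    import com.databricks.spark.avro._

    // Assumes a SparkContext sc, e.g. in spark-shell with the spark-avro jar on the classpath
    val sqlContext = new SQLContext(sc)

    // avroFile is added to SQLContext by the implicits in com.databricks.spark.avro
    val episodes = sqlContext.avroFile("episodes.avro")
    episodes.registerTempTable("episodes")
    sqlContext.sql("SELECT title FROM episodes").collect().foreach(println)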
Re: How can I read this avro file using spark & scala?
I am confused as to whether Avro support was merged into Spark 1.2 or whether it is still an independent library. I see some people writing sqlContext.avroFile, similarly to jsonFile, but this does not work for me, nor do I see it in the Scala docs.
Re: How can I read this avro file using spark & scala?
Thanks for the feedback, I filed a couple of issues:
https://github.com/databricks/spark-avro/issues

On Fri, Nov 21, 2014 at 5:04 AM, thomas j wrote:
> I've been able to load a different avro file based on GenericRecord with:
>
>     val person = sqlContext.avroFile("/tmp/person.avro")
>
> When I try to call `first()` on it, I get "NotSerializableException"
> exceptions again. Apart from this I want to transform the records into
> pairs of (user_id, record). Is there any way to specify the column name
> ("user_id") instead of needing to know/calculate the offset somehow?
>
> Thanks again
Re: How can I read this avro file using spark & scala?
I have also been struggling with reading Avro. Very glad to hear that there is a new Avro library coming in Spark 1.2 (which, by the way, seems to have a lot of other very useful improvements). In the meantime, I have been able to piece together several snippets/tips that I found from various sources and I am now able to read/write Avro correctly. From my understanding, you basically need three pieces:

1. Use the Kryo serializer.
2. Register your Avro classes. I have done this using Twitter chill 0.4.0.
3. Read/write Avro with a snippet of code like the one you posted.

Here is the relevant code (hopefully all of it). MyAvroClass stands for your own generated specific-record class, and the paths are placeholders:

    // All of the following are needed in order to read/write Avro files
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import org.apache.hadoop.fs.{ FileSystem, Path }
    // Uncomment the following line if you want to use generic Avro; this example uses specific
    //import org.apache.avro.generic.GenericData
    import org.apache.avro.Schema
    import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat }
    import org.apache.avro.mapred.AvroKey

    // Kryo/Avro serialization stuff
    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.avro.AvroSerializer
    import org.apache.spark.serializer.{ KryoSerializer, KryoRegistrator }
    import org.apache.spark.{ SparkConf, SparkContext }

    object MyApp {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.kryo.registrator", "MyRegistrator")
        val sc = new SparkContext(conf)

        // MyAvroClass is a generated specific record, so its schema ships with the class
        val schema: Schema = MyAvroClass.getClassSchema
        val inputPath = "in.avro"   // adjust paths as needed
        val outputPath = "out"

        // Read
        val readJob = new Job()
        AvroJob.setInputKeySchema(readJob, schema)
        val rdd = sc.newAPIHadoopFile(inputPath,
            classOf[AvroKeyInputFormat[MyAvroClass]],
            classOf[AvroKey[MyAvroClass]],
            classOf[NullWritable],
            readJob.getConfiguration)
          .map { t => t._1.datum }

        // Write
        val rddAvroWritable = rdd.map { s => (new AvroKey(s), NullWritable.get) }
        val writeJob = new Job()
        AvroJob.setOutputKeySchema(writeJob, schema)
        writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[MyAvroClass]])
        rddAvroWritable.saveAsNewAPIHadoopFile(outputPath,
          classOf[AvroKey[MyAvroClass]],
          classOf[NullWritable],
          classOf[AvroKeyOutputFormat[MyAvroClass]],
          writeJob.getConfiguration)
      }
    }

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // One line like the following for each of your Avro classes if you use specific Avro.
        // If you use generic Avro, chill also has a function for that: GenericRecordSerializer
        kryo.register(classOf[MyAvroClass],
          AvroSerializer.SpecificRecordBinarySerializer[MyAvroClass])
      }
    }

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Fri, Nov 21, 2014 at 7:04 AM, thomas j wrote:
> I've been able to load a different avro file based on GenericRecord with:
>
>     val person = sqlContext.avroFile("/tmp/person.avro")
>
> When I try to call `first()` on it, I get "NotSerializableException"
> exceptions again.
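For reference, a possible sbt dependency line for the chill-avro helpers used above; the coordinates are an assumption based on the chill project and the "chill 0.4.0" version mentioned in the message:

    // Assumed coordinates; verify against the chill project before relying on them
    libraryDependencies += "com.twitter" %% "chill-avro" % "0.4.0"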
Re: How can I read this avro file using spark & scala?
I've been able to load a different avro file based on GenericRecord with:

    val person = sqlContext.avroFile("/tmp/person.avro")

When I try to call `first()` on it, I get "NotSerializableException" exceptions again:

    person.first()

    ...
    14/11/21 12:59:17 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 20)
    java.io.NotSerializableException: org.apache.avro.generic.GenericData$Record
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        ...

Apart from this, I want to transform the records into pairs of (user_id, record). I can do this by specifying the offset of the user_id column with something like this:

    person.map(r => (r.getInt(2), r)).take(4)

Is there any way to specify the column name ("user_id") instead of needing to know/calculate the offset somehow?

Thanks again

On Fri, Nov 21, 2014 at 11:48 AM, thomas j wrote:
> Thanks for the pointer Michael.
>
> I've downloaded Spark 1.2.0 from
> https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and
> built the spark-avro repo you linked to. When I run it against the example
> avro file linked to in the documentation it works. However, when I try to
> load my avro file I receive "java.lang.RuntimeException: Unsupported type
> LONG".
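On the column-name question, a sketch of one way to avoid hard-coding the offset, assuming Spark 1.2's SchemaRDD API:

    // Look up the field's position by name in the schema instead of hard-coding it.
    // (The serialization issue above is separate; Kryo registration as described
    // elsewhere in this thread addresses that.)
    val idx = person.schema.fieldNames.indexOf("user_id")
    person.map(r => (r.getInt(idx), r)).take(4)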
Re: How can I read this avro file using spark & scala?
Thanks for the pointer Michael.

I've downloaded Spark 1.2.0 from https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and built the spark-avro repo you linked to.

When I run it against the example avro file linked to in the documentation it works. However, when I try to load my avro file (linked to in my original question) I receive the following error:

    java.lang.RuntimeException: Unsupported type LONG
        at scala.sys.package$.error(package.scala:27)
        at com.databricks.spark.avro.AvroRelation.com$databricks$spark$avro$AvroRelation$$toSqlType(AvroRelation.scala:116)
        at com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:97)
        at com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:96)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        ...

If this is useful, I'm happy to try loading the various different avro files I have, to help battle-test spark-avro.

Thanks

On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust wrote:
> One option (starting with Spark 1.2, which is currently in preview) is to
> use the Avro library for Spark SQL. This is very new, but we would love to
> get feedback: https://github.com/databricks/spark-avro
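As an aside, when the library rejects a type like this it can help to dump the file's embedded schema and find the offending field. A sketch using the plain Avro Java API (the file name is taken from this thread; the Avro jar is assumed to be on the classpath):

    import java.io.File
    import org.apache.avro.file.DataFileReader
    import org.apache.avro.generic.{ GenericDatumReader, GenericRecord }

    // Open the Avro container file and pretty-print the schema it carries,
    // to see which fields are declared as "long".
    val reader = new DataFileReader(new File("sample.avro"),
      new GenericDatumReader[GenericRecord]())
    println(reader.getSchema.toString(true))
    reader.close()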
Re: How can I read this avro file using spark & scala?
One option (starting with Spark 1.2, which is currently in preview) is to use the Avro library for Spark SQL. This is very new, but we would love to get feedback: https://github.com/databricks/spark-avro

On Thu, Nov 20, 2014 at 10:19 AM, al b wrote:
> I've read several posts of people struggling to read avro in spark. The
> examples I've tried don't work. When I try this solution (
> https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files )
> I get errors:
>
>     spark java.io.NotSerializableException: org.apache.avro.mapred.AvroWrapper
>
> How can I read the following sample file in spark using scala?
>
> http://www.4shared.com/file/SxnYcdgJce/sample.html
>
> Thomas
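For completeness, a common workaround at the time for the NotSerializableException in the quoted question was to extract plain values from each record inside the map, before anything is shipped off the executors. A sketch assuming the old-API input format from the quoted Stack Overflow answer; the field name is hypothetical:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{ AvroInputFormat, AvroWrapper }
    import org.apache.hadoop.io.NullWritable

    // Read with the old-API Avro input format (assumes a SparkContext sc)
    val avroRdd = sc.hadoopFile("sample.avro",
      classOf[AvroInputFormat[GenericRecord]],
      classOf[AvroWrapper[GenericRecord]],
      classOf[NullWritable])

    // "user_id" is a hypothetical field: pull serializable values out of the
    // non-serializable wrapper before they are sent back to the driver
    avroRdd.map { case (k, _) => k.datum().get("user_id").toString }
      .take(4)
      .foreach(println)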