BTW, I'll add that we are hoping to publish a new version of the Avro library for Spark 1.3 shortly. It should have improved support for writing data both programmatically and from SQL.
On Fri, Mar 13, 2015 at 2:01 PM, Kevin Peng <kpe...@gmail.com> wrote:
> Markus,
>
> Thanks. That makes sense. I was able to get this to work with spark-shell,
> passing in the jar built from git. I did notice that I couldn't get
> AvroSaver.save to work with SQLContext, but it works with HiveContext. Not
> sure if that is an issue, but for me it is fine.
>
> Once again, thanks for the help.
>
> Kevin
>
> On Fri, Mar 13, 2015 at 1:57 PM, M. Dale <medal...@yahoo.com> wrote:
>
>> I probably did not do a good enough job explaining the problem. If you
>> used Maven with the default Maven repository, you have an old version of
>> spark-avro that does not contain AvroSaver and does not have the
>> saveAsAvro method implemented.
>>
>> Assuming you use the default Maven repo location:
>>
>> cd ~/.m2/repository/com/databricks/spark-avro_2.10/0.1
>> jar tvf spark-avro_2.10-0.1.jar | grep AvroSaver
>>
>> This comes up empty. The jar file does not contain this class because
>> AvroSaver.scala wasn't added until January 21; the published jar is from
>> November 14. So:
>>
>> git clone g...@github.com:databricks/spark-avro.git
>> cd spark-avro
>> sbt publish-m2
>>
>> This publishes the latest master code (which includes AvroSaver etc.) to
>> your local Maven repo, and Maven will pick up the latest version of
>> spark-avro on this machine.
>>
>> Now you should be able to compile and run.
>>
>> HTH,
>> Markus
>>
>> On 03/12/2015 11:55 PM, Kevin Peng wrote:
>>
>> Dale,
>>
>> I basically have the same Maven dependency as above, but my code will not
>> compile because it cannot reference AvroSaver, though the saveAsAvro
>> reference compiles fine, which is weird. Even though saveAsAvro compiles
>> for me, it errors out when running the Spark job because the method is
>> not implemented (the job quits with an "unimplemented method" error or
>> something along those lines).
>>
>> I will try going through spark-shell and passing in the jar built from
>> GitHub, since I haven't tried that quite yet.
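For anyone following along at home, here is a minimal save sketch pulling together what worked for Kevin: a HiveContext plus the two-argument AvroSaver.save call that appears later in this thread. It assumes a post-January spark-avro master build on the classpath and an existing `sc` from spark-shell; the table name and output path are placeholders, and the exact signature may differ in your build:

```
import com.databricks.spark.avro.AvroSaver
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the SparkContext spark-shell provides

// Any SchemaRDD works here; this query is just an illustrative placeholder.
val records = hiveContext.sql("SELECT * FROM some_table")

// Writes the SchemaRDD out as Avro files under the given directory.
AvroSaver.save(records, "/tmp/records-avro")
```

Per Kevin's report above, the same call against a plain SQLContext did not work for him, so a HiveContext is probably the safer starting point.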
>>
>> On Thu, Mar 12, 2015 at 6:44 PM, M. Dale <medal...@yahoo.com> wrote:
>>
>>> Short answer: if you downloaded spark-avro from the
>>> repo.maven.apache.org repo, you might be using an old version
>>> (pre-November 14, 2014) - see the timestamps at
>>> http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>>> There have been lots of changes at
>>> https://github.com/databricks/spark-avro since then.
>>>
>>> Databricks, thank you for sharing the Avro code!
>>>
>>> Could you please push out the latest version, or update the version
>>> number and republish to repo.maven.apache.org (I have no idea how jars
>>> get there)? Or is there a different repository that users should point
>>> to for this artifact?
>>>
>>> Workaround: download from https://github.com/databricks/spark-avro,
>>> build with the latest functionality (still version 0.1), and add it to
>>> your local Maven or Ivy repo.
>>>
>>> Long version:
>>> I used a default Maven build and declared my dependency on:
>>>
>>> <dependency>
>>>   <groupId>com.databricks</groupId>
>>>   <artifactId>spark-avro_2.10</artifactId>
>>>   <version>0.1</version>
>>> </dependency>
>>>
>>> Maven downloaded the 0.1 version from
>>> http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>>> and included it in my app code jar.
>>>
>>> From spark-shell:
>>>
>>> import com.databricks.spark.avro._
>>> import org.apache.spark.sql.SQLContext
>>> val sqlContext = new SQLContext(sc)
>>>
>>> // This schema includes LONG for time in millis (
>>> https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl
>>> )
>>> val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
>>> java.lang.RuntimeException: Unsupported type LONG
>>>
>>> However, after checking out the spark-avro code from its GitHub repo and
>>> adding a test case against the MailRecord Avro, everything ran fine.
>>>
>>> So I built the Databricks spark-avro locally on my box and then put it
>>> in my local Maven repo - everything worked from spark-shell when adding
>>> that jar as a dependency.
>>>
>>> Hope this helps for the "save" case as well. In the pre-November 14
>>> version, avro.scala says:
>>>
>>> // TODO: Implement me.
>>> implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
>>>   def saveAsAvroFile(path: String): Unit = ???
>>> }
>>>
>>> Markus
>>>
>>> On 03/12/2015 07:05 PM, kpeng1 wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am currently trying to write out a SchemaRDD to Avro. I noticed that
>>>> there is a Databricks spark-avro library, and I have included it in my
>>>> dependencies, but it looks like I am not able to access the AvroSaver
>>>> object. On compilation of the job I get this:
>>>>
>>>> error: not found: value AvroSaver
>>>> [ERROR] AvroSaver.save(resultRDD, args(4))
>>>>
>>>> I also tried calling saveAsAvro on the resultRDD (the actual RDD with
>>>> the results), and that passes compilation, but when I run the code I
>>>> get an error that says saveAsAvro is not implemented. I am using
>>>> version 0.1 of spark-avro_2.10.
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-writing-in-avro-tp22021.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
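To tie the thread together: once a post-January spark-avro build is on the classpath, both save styles discussed here should compile and run. The sketch below combines the two call forms shown verbatim in this thread (the explicit AvroSaver.save call from the compile error and the saveAsAvroFile implicit stubbed with ??? in 0.1); it assumes a Spark 1.x spark-shell session and a HiveContext per Kevin's observation, and the query and paths are placeholders:

```
import com.databricks.spark.avro._          // brings saveAsAvroFile into scope
import com.databricks.spark.avro.AvroSaver
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val resultRDD = hiveContext.sql("SELECT * FROM some_table")

// Style 1: the explicit saver object, as in the original compile error.
AvroSaver.save(resultRDD, "/tmp/out1.avro")

// Style 2: the implicit enrichment of SchemaRDD - this is the method that
// was stubbed out with ??? in the 0.1 jar on repo.maven.apache.org.
resultRDD.saveAsAvroFile("/tmp/out2.avro")
```

If either call fails with "not found: value AvroSaver" or a NotImplementedError, that is the symptom described above: an old 0.1 jar is still being picked up from the default Maven repository.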