BTW, I'll add that we are hoping to publish a new version of the Avro library for Spark 1.3 shortly. It should have improved support for writing data both programmatically and from SQL.
On Fri, Mar 13, 2015 at 2:01 PM, Kevin Peng <kpe...@gmail.com> wrote:
> Markus,
>
> Thanks. That makes sense. I was able to get this to work with spark-shell,
> passing in the jar built from git. I did notice that I couldn't get
> AvroSaver.save to work with SQLContext, but it works with HiveContext. Not
> sure if that is an issue, but for me it is fine.
>
> Once again, thanks for the help.
>
> Kevin
>
> On Fri, Mar 13, 2015 at 1:57 PM, M. Dale <medal...@yahoo.com> wrote:
>
>> I probably did not do a good enough job explaining the problem. If you
>> used Maven with the default Maven repository, you have an old version of
>> spark-avro that does not contain AvroSaver and does not have the
>> saveAsAvro method implemented.
>>
>> Assuming you use the default Maven repo location:
>>
>> cd ~/.m2/repository/com/databricks/spark-avro_2.10/0.1
>> jar tvf spark-avro_2.10-0.1.jar | grep AvroSaver
>>
>> This comes up empty. The jar file does not contain this class because
>> AvroSaver.scala wasn't added until January 21; the published jar is from
>> November 14. So:
>>
>> git clone g...@github.com:databricks/spark-avro.git
>> cd spark-avro
>> sbt publish-m2
>>
>> This publishes the latest master code (which includes AvroSaver etc.) to
>> your local Maven repo, and Maven will pick up the latest version of
>> spark-avro on this machine.
>>
>> Now you should be able to compile and run.
>>
>> HTH,
>> Markus
>>
>> On 03/12/2015 11:55 PM, Kevin Peng wrote:
>>
>> Dale,
>>
>> I basically have the same Maven dependency as above, but my code will not
>> compile because it cannot reference AvroSaver, though the saveAsAvro
>> reference compiles fine, which is weird. Even though saveAsAvro compiles
>> for me, it errors out when running the Spark job because the method is
>> not implemented (the job quits with an "unimplemented method" error or
>> something along those lines).
>>
>> I will try going through spark-shell and passing in the jar built from
>> GitHub, since I haven't tried that quite yet.
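For anyone following along at home, here is a minimal save sketch pulling together what worked for Kevin: a HiveContext plus the two-argument AvroSaver.save call that appears later in this thread. It assumes a post-January spark-avro master build on the classpath and an existing `sc` from spark-shell; the table name and output path are placeholders, and the exact signature may differ in your build:

```
import com.databricks.spark.avro.AvroSaver
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the SparkContext spark-shell provides

// Any SchemaRDD works here; this query is just an illustrative placeholder.
val records = hiveContext.sql("SELECT * FROM some_table")

// Writes the SchemaRDD out as Avro files under the given directory.
AvroSaver.save(records, "/tmp/records-avro")
```

Per Kevin's report above, the same call against a plain SQLContext did not work for him, so a HiveContext is probably the safer starting point.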
>>
>> On Thu, Mar 12, 2015 at 6:44 PM, M. Dale <medal...@yahoo.com> wrote:
>>
>>> Short answer: if you downloaded spark-avro from the
>>> repo.maven.apache.org repo, you might be using an old version
>>> (pre-November 14, 2014) - see the timestamps at
>>> http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>>> There have been lots of changes at
>>> https://github.com/databricks/spark-avro since then.
>>>
>>> Databricks, thank you for sharing the Avro code!
>>>
>>> Could you please push out the latest version, or update the version
>>> number and republish to repo.maven.apache.org (I have no idea how jars
>>> get there)? Or is there a different repository that users should point
>>> to for this artifact?
>>>
>>> Workaround: download from https://github.com/databricks/spark-avro,
>>> build with the latest functionality (still version 0.1), and add it to
>>> your local Maven or Ivy repo.
>>>
>>> Long version:
>>> I used a default Maven build and declared my dependency on:
>>>
>>> <dependency>
>>>   <groupId>com.databricks</groupId>
>>>   <artifactId>spark-avro_2.10</artifactId>
>>>   <version>0.1</version>
>>> </dependency>
>>>
>>> Maven downloaded the 0.1 version from
>>> http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>>> and included it in my app code jar.
>>>
>>> From spark-shell:
>>>
>>> import com.databricks.spark.avro._
>>> import org.apache.spark.sql.SQLContext
>>> val sqlContext = new SQLContext(sc)
>>>
>>> // This schema includes LONG for time in millis (
>>> https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl
>>> )
>>> val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
>>> java.lang.RuntimeException: Unsupported type LONG
>>>
>>> However, after checking out the spark-avro code from its GitHub repo and
>>> adding a test case against the MailRecord Avro, everything ran fine.
>>>
>>> So I built the Databricks spark-avro locally on my box and then put it
>>> in my local Maven repo - everything worked from spark-shell when adding
>>> that jar as a dependency.
>>>
>>> Hope this helps for the "save" case as well. In the pre-November 14
>>> version, avro.scala says:
>>>
>>> // TODO: Implement me.
>>> implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
>>>   def saveAsAvroFile(path: String): Unit = ???
>>> }
>>>
>>> Markus
>>>
>>> On 03/12/2015 07:05 PM, kpeng1 wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am currently trying to write out a SchemaRDD to Avro. I noticed that
>>>> there is a Databricks spark-avro library, and I have included it in my
>>>> dependencies, but it looks like I am not able to access the AvroSaver
>>>> object. On compilation of the job I get this:
>>>>
>>>> error: not found: value AvroSaver
>>>> [ERROR] AvroSaver.save(resultRDD, args(4))
>>>>
>>>> I also tried calling saveAsAvro on the resultRDD (the actual RDD with
>>>> the results), and that passes compilation, but when I run the code I
>>>> get an error that says saveAsAvro is not implemented. I am using
>>>> version 0.1 of spark-avro_2.10.
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-writing-in-avro-tp22021.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
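To tie the thread together: once a post-January spark-avro build is on the classpath, both save styles discussed here should compile and run. The sketch below combines the two call forms shown verbatim in this thread (the explicit AvroSaver.save call from the compile error and the saveAsAvroFile implicit stubbed with ??? in 0.1); it assumes a Spark 1.x spark-shell session and a HiveContext per Kevin's observation, and the query and paths are placeholders:

```
import com.databricks.spark.avro._          // brings saveAsAvroFile into scope
import com.databricks.spark.avro.AvroSaver
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val resultRDD = hiveContext.sql("SELECT * FROM some_table")

// Style 1: the explicit saver object, as in the original compile error.
AvroSaver.save(resultRDD, "/tmp/out1.avro")

// Style 2: the implicit enrichment of SchemaRDD - this is the method that
// was stubbed out with ??? in the 0.1 jar on repo.maven.apache.org.
resultRDD.saveAsAvroFile("/tmp/out2.avro")
```

If either call fails with "not found: value AvroSaver" or a NotImplementedError, that is the symptom described above: an old 0.1 jar is still being picked up from the default Maven repository.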