Re: spark sql writing in avro

2015-03-13 Thread Michael Armbrust
BTW, I'll add that we are hoping to publish a new version of the Avro
library for Spark 1.3 shortly.  It should have improved support for writing
data both programmatically and from SQL.


Re: spark sql writing in avro

2015-03-13 Thread Kevin Peng
Markus,

Thanks. That makes sense. I was able to get this to work from spark-shell
by passing in the jar built from git. I did notice that I couldn't get
AvroSaver.save to work with SQLContext, but it does work with HiveContext.
I'm not sure whether that is a bug, but it is fine for me.

Once again, thanks for the help.

Kevin


Re: spark sql writing in avro

2015-03-13 Thread M. Dale
I probably did not do a good enough job of explaining the problem. If you
used Maven with the default Maven repository, you have an old version of
spark-avro that does not contain AvroSaver and does not have the saveAsAvro
method implemented:


Assuming you use the default Maven repo location:
cd ~/.m2/repository/com/databricks/spark-avro_2.10/0.1
jar tvf spark-avro_2.10-0.1.jar | grep AvroSaver

The grep comes up empty: the jar file does not contain this class because
AvroSaver.scala wasn't added until Jan 21, while the jar file is from 14 November.
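
For anyone who would rather run the same check from Scala (for example in a REPL) than with jar and grep, here is a minimal sketch using the JDK's java.util.jar classes. The names JarCheck, entriesMatching and writeDemoJar are made up for illustration, and the demo jar written in main only stands in for the real jar under ~/.m2. On Scala 2.13 and later, scala.jdk.CollectionConverters is the preferred replacement for the JavaConverters import.

```scala
import java.io.{File, FileOutputStream}
import java.util.jar.{JarEntry, JarFile, JarOutputStream}
import scala.collection.JavaConverters._

object JarCheck {
  // Names of all jar entries containing `needle`; roughly the names that
  // `jar tvf some.jar | grep needle` would list.
  def entriesMatching(jarPath: String, needle: String): List[String] = {
    val jar = new JarFile(jarPath)
    try jar.entries().asScala.map(_.getName).filter(_.contains(needle)).toList
    finally jar.close()
  }

  // Hypothetical helper for the demo: writes a jar with the given empty entries.
  def writeDemoJar(file: File, entryNames: Seq[String]): Unit = {
    val out = new JarOutputStream(new FileOutputStream(file))
    try entryNames.foreach { name =>
      out.putNextEntry(new JarEntry(name))
      out.closeEntry()
    } finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val demo = File.createTempFile("demo", ".jar")
    demo.deleteOnExit()
    writeDemoJar(demo, Seq("com/example/Reader.class"))
    // An empty result means no entry in the jar mentions AvroSaver.
    println(entriesMatching(demo.getPath, "AvroSaver").isEmpty)
  }
}
```

To check the real artifact, pass the path of spark-avro_2.10-0.1.jar in your local repository to entriesMatching instead of the demo jar.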


So:
git clone g...@github.com:databricks/spark-avro.git
cd spark-avro
sbt publish-m2

This publishes the latest master code (which includes AvroSaver etc.) to
your local Maven repo, so Maven on this machine will pick up the latest
version of spark-avro.


Now you should be able to compile and run.

HTH,
Markus


Re: spark sql writing in avro

2015-03-12 Thread Kevin Peng
Dale,

I basically have the same Maven dependency as above, but my code will not
compile because it cannot reference AvroSaver, though the saveAsAvro
reference compiles fine, which is weird. Even though saveAsAvro compiles
for me, it errors out when running the Spark job because it is not
implemented (the job quits and says non-implemented method or something
along those lines).

I will try going through the spark shell and passing in the jar built from
GitHub, since I haven't tried that yet.



Re: spark sql writing in avro

2015-03-12 Thread M. Dale

Short answer: if you downloaded spark-avro from the repo.maven.apache.org
repo you might be using an old version (pre-November 14, 2014) -
see timestamps at 
http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/

Lots of changes at https://github.com/databricks/spark-avro since then.

Databricks, thank you for sharing the Avro code!!!

Could you please push out the latest version or update the version
number and republish to repo.maven.apache.org (I have no idea how jars get
there). Or is there a different repository that users should point to for
this artifact?

Workaround: download the source from https://github.com/databricks/spark-avro,
build it to get the latest functionality (the version is still 0.1), and add
the resulting jar to your local Maven or Ivy repo.

Long version:
I used a default Maven build and declared my dependency on:


<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>0.1</version>
</dependency>


Maven downloaded the 0.1 version from 
http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/ 
and included it in my app code jar.


From spark-shell:

import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

// This schema includes LONG for time in millis:
// https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl
val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
java.lang.RuntimeException: Unsupported type LONG

However, when I checked out the spark-avro code from its GitHub repo and
added a test case against the MailRecord Avro file, everything ran fine.

So I built the Databricks spark-avro locally on my box and put it in my
local Maven repo. Everything worked from spark-shell when I added that jar
as a dependency.

Hope this helps for the "save" case as well. In the pre-14NOV version,
avro.scala says:

  // TODO: Implement me.
  implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
    def saveAsAvroFile(path: String): Unit = ???
  }
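
This stub would also explain why the call compiles but fails at run time. In Scala, ??? is a Predef method of type Nothing that throws scala.NotImplementedError when evaluated, so a method whose body is ??? type-checks for any return type. A small self-contained sketch (no Spark needed; StubDemo, Saver and stubThrows are made-up names standing in for the implicit class above):

```scala
object StubDemo {
  // Stand-in for the stubbed class: the body type-checks because `???`
  // has type Nothing, which conforms to Unit.
  class Saver {
    def saveAsAvroFile(path: String): Unit = ???
  }

  // Calls the stub and reports whether it threw scala.NotImplementedError.
  def stubThrows(): Boolean =
    try { new Saver().saveAsAvroFile("/tmp/out.avro"); false }
    catch { case _: NotImplementedError => true }

  def main(args: Array[String]): Unit =
    println(s"stub threw NotImplementedError: ${stubThrows()}")
}
```

The explicit try/catch is deliberate: scala.util.Try relies on NonFatal, which in Scala 2.10 does not catch NotImplementedError.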

Markus

On 03/12/2015 07:05 PM, kpeng1 wrote:

Hi All,

I am currently trying to write out a SchemaRDD to Avro. I noticed that there
is a Databricks spark-avro library and I have included it in my
dependencies, but it looks like I am not able to access the AvroSaver
object. On compilation of the job I get this:
error: not found: value AvroSaver
[ERROR] AvroSaver.save(resultRDD, args(4))

I also tried calling saveAsAvro on the resultRDD (the actual RDD with the
results) and that passes compilation, but when I run the code I get an error
that says saveAsAvro is not implemented. I am using version 0.1 of
spark-avro_2.10.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-writing-in-avro-tp22021.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org