Re: [Spark 2.0] Error during codegen for Java POJO

2016-08-05 Thread Andy Grove
I tracked this down in the end. It turns out the POJO was not actually
declared 'public'. It seems like this should be detected as an error before
code generation, though?
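For anyone who hits the same thing, here is a minimal sketch of the shape
the bean encoder needs (the field name is taken from the error message; the
real JGeo obviously has more fields):

// The class itself must be declared public (top-level, or a public static
// nested class) so that the generated code can call its accessors.
public class JGeo {
    private String logrecno;

    // Public no-arg constructor for the bean encoder.
    public JGeo() {}

    public String getLogrecno() { return logrecno; }
    public void setLogrecno(String logrecno) { this.logrecno = logrecno; }
}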


Thanks,

Andy.

--

Andy Grove
Chief Architect
www.agildata.com


On Fri, Aug 5, 2016 at 8:28 AM, Andy Grove <andy.gr...@agildata.com> wrote:

> Hi,
>
> I've run into another issue upgrading a Spark example written in Java from
> Spark 1.6 to 2.0. The code is here:
>
> https://github.com/AgilData/apache-spark-examples/blob/
> spark_2.0/src/main/java/JRankCountiesBySexUsingDataset.java
>
> The runtime error is:
>
> org.codehaus.commons.compiler.CompileException: File 'generated.java',
> Line 85, Column 139: No applicable constructor/method found for zero actual
> parameters; candidates are: "public java.lang.String JGeo.getLogrecno()"
>
> I'd appreciate some advice on how to fix this, or whether this is a valid
> Spark bug that I should file a JIRA for.
>
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> www.agildata.com
>
>


Re: Java and SparkSession

2016-08-05 Thread Andy Grove
Ah, you still have to wrap it in a JavaSparkContext rather than use
sparkSession.sparkContext() directly ... that makes sense.
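
For the archives, a minimal sketch of the pattern (the app name is a
placeholder): SparkSession for SQL and Dataset work, with a JavaSparkContext
wrapped around its SparkContext for the RDD API.

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("example")
    .getOrCreate();

// SparkSession takes over the SQLContext role for SQL / Dataset work ...
Dataset<Row> df = spark.sql("SELECT 1");

// ... but for the Java RDD API you still wrap the underlying SparkContext.
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());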

Thanks for your help.


Thanks,

Andy.

--

Andy Grove
Chief Architect
www.agildata.com


On Fri, Aug 5, 2016 at 12:03 PM, Everett Anderson <ever...@nuna.com> wrote:

> Hi,
>
> Can you say more about what goes wrong?
>
> I was migrating my code and began using this for initialization:
>
>    SparkConf sparkConf = new SparkConf().setAppName(...)
>    SparkSession sparkSession =
>        new SparkSession.Builder().config(sparkConf).getOrCreate();
>    JavaSparkContext jsc =
>        new JavaSparkContext(sparkSession.sparkContext());
>
> and then used SparkSession instead of the SQLContext, which seemed to work.
>
>
> On Thu, Aug 4, 2016 at 9:41 PM, Andy Grove <andy.gr...@agildata.com>
> wrote:
>
>> From some brief experiments using Java with Spark 2.0 it looks like Java
>> developers should stick to SparkContext and SQLContext rather than using
>> the new SparkSession API?
>>
>> It would be great if someone could confirm if that is the intention or
>> not.
>>
>> Thanks,
>>
>> Andy.
>>
>> --
>>
>> Andy Grove
>> Chief Architect
>> www.agildata.com
>>
>>
>


[Spark 2.0] Error during codegen for Java POJO

2016-08-05 Thread Andy Grove
Hi,

I've run into another issue upgrading a Spark example written in Java from
Spark 1.6 to 2.0. The code is here:

https://github.com/AgilData/apache-spark-examples/blob/spark_2.0/src/main/java/JRankCountiesBySexUsingDataset.java

The runtime error is:

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line
85, Column 139: No applicable constructor/method found for zero actual
parameters; candidates are: "public java.lang.String JGeo.getLogrecno()"

I'd appreciate some advice on how to fix this, or whether this is a valid
Spark bug that I should file a JIRA for.


Thanks,

Andy.

--

Andy Grove
Chief Architect
www.agildata.com


Java and SparkSession

2016-08-04 Thread Andy Grove
From some brief experiments using Java with Spark 2.0, it looks like Java
developers should stick to SparkContext and SQLContext rather than using
the new SparkSession API?

It would be great if someone could confirm if that is the intention or not.

Thanks,

Andy.

--

Andy Grove
Chief Architect
www.agildata.com


Re: Regression in Java RDD sortBy() in Spark 2.0

2016-08-04 Thread Andy Grove
Moments after sending this I tracked the issue down to a subsequent
.top(10) transformation. It ran without error in Spark 1.6 (though who
knows how it was sorting, since the POJO doesn't implement Comparable),
whereas in Spark 2.0 it fails if the POJO is not Comparable.

The new behavior is better for sure.
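
For the archives, a hedged sketch of one workaround (getter names taken from
the code in my original message): either make the POJO implement Comparable,
or pass an explicit comparator to top(), as below.

import java.io.Serializable;
import java.util.Comparator;
import java.util.List;

// The comparator is shipped to the executors, so the lambda has to be
// serializable -- hence the intersection cast.
List<JPopulationSummary> top10 = popSummary.top(10,
    (Comparator<JPopulationSummary> & Serializable) (a, b) ->
        Double.compare(a.getMale() * 1.0 / a.getFemale(),
                       b.getMale() * 1.0 / b.getFemale()));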

Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Thu, Aug 4, 2016 at 10:25 PM, Andy Grove <andy.gr...@agildata.com> wrote:

> Hi,
>
> I have some working Java code with Spark 1.6 that I am upgrading to Spark
> 2.0
>
> I have this valid RDD:
>
> JavaRDD<JPopulationSummary> popSummary
>
> I want to sort using a function I provide for performing comparisons:
>
> popSummary
> .sortBy((Function<JPopulationSummary, Object>) p ->
> p.getMale() * 1.0f / p.getFemale(), true, 1)
>
> The code fails at runtime with the following error.
>
> Caused by: java.lang.ClassCastException: JPopulationSummary cannot be cast
> to java.lang.Comparable
> at org.spark_project.guava.collect.NaturalOrdering.
> compare(NaturalOrdering.java:28)
> at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:
> 153)
> at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
> at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
> at org.spark_project.guava.collect.Ordering.max(Ordering.java:551)
> at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:667)
> at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
> at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$
> anonfun$30.apply(RDD.scala:1374)
> at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$
> anonfun$30.apply(RDD.scala:1371)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$
> anonfun$apply$23.apply(RDD.scala:766)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$
> anonfun$apply$23.apply(RDD.scala:766)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Even if the POJO did implement Comparable, Spark shouldn't care since I
> provided the comparator I want to sort by.
>
> Am I doing something wrong or is this a regression?
>
> Thanks,
>
> Andy.
>
> --
>
> Andy Grove
> Chief Architect
> www.agildata.com
>
>


Regression in Java RDD sortBy() in Spark 2.0

2016-08-04 Thread Andy Grove
Hi,

I have some working Java code with Spark 1.6 that I am upgrading to Spark
2.0.

I have this valid RDD:

JavaRDD<JPopulationSummary> popSummary

I want to sort using a function I provide for performing comparisons:

popSummary
    .sortBy((Function<JPopulationSummary, Object>) p ->
        p.getMale() * 1.0f / p.getFemale(), true, 1)

The code fails at runtime with the following error.

Caused by: java.lang.ClassCastException: JPopulationSummary cannot be cast
to java.lang.Comparable
at
org.spark_project.guava.collect.NaturalOrdering.compare(NaturalOrdering.java:28)
at
scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
at org.spark_project.guava.collect.Ordering.max(Ordering.java:551)
at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:667)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1374)
at
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1371)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Even if the POJO did implement Comparable, Spark shouldn't care since I
provided the comparator I want to sort by.

Am I doing something wrong or is this a regression?

Thanks,

Andy.

--

Andy Grove
Chief Architect
www.agildata.com


Re: Inferring schema from GenericRowWithSchema

2016-05-17 Thread Andy Grove
Hmm. I see. Yes, I guess that won't work then.

I don't understand what you are proposing about UDFRegistration. I only see
methods that take tuples of various sizes (1 .. 22).
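
If the suggestion is the Java-facing overloads, those do take an explicit
DataType for the return type, which would carry the schema. A sketch in Java
syntax, assuming an existing SQLContext -- this is my reading of the
suggestion, not a confirmed recommendation:

import java.util.Arrays;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType addressType = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("state", DataTypes.StringType, true),
    DataTypes.createStructField("zipcode", DataTypes.IntegerType, true)));

// The explicit return type supplies the schema, so nothing needs to be
// inferred from GenericRowWithSchema.
sqlContext.udf().register("Address",
    (UDF2<String, Integer, Row>) (state, zip) -> RowFactory.create(state, zip),
    addressType);

sqlContext.sql("SELECT Address('NY', 12345)").show(10);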

On Tue, May 17, 2016 at 1:00 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> I don't think that you will be able to do that.  ScalaReflection is based
> on the TypeTag of the object, and thus the schema of any particular object
> won't be available to it.
>
> Instead I think you want to use the register functions in UDFRegistration
> that take a schema. Does that make sense?
>
> On Tue, May 17, 2016 at 11:48 AM, Andy Grove <andy.gr...@agildata.com>
> wrote:
>
>>
>> Hi,
>>
>> I have a requirement to create types dynamically in Spark and then
>> instantiate those types from Spark SQL via a UDF.
>>
>> I tried doing the following:
>>
>> val addressType = StructType(List(
>>   new StructField("state", DataTypes.StringType),
>>   new StructField("zipcode", DataTypes.IntegerType)
>> ))
>>
>> sqlContext.udf.register("Address", (args: Seq[Any]) => new
>> GenericRowWithSchema(args.toArray, addressType))
>>
>> sqlContext.sql("SELECT Address('NY', 12345)").show(10)
>>
>> This seems reasonable to me but this fails with:
>>
>> Exception in thread "main" java.lang.UnsupportedOperationException:
>> Schema for type
>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is not
>> supported
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:685)
>> at
>> org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:130)
>>
>> It looks like it would be simple to update ScalaReflection to be able to
>> infer the schema from a GenericRowWithSchema, but before I file a JIRA and
>> submit a patch I wanted to see if there is already a way of achieving this.
>>
>> Thanks,
>>
>> Andy.
>>
>>
>>
>


Inferring schema from GenericRowWithSchema

2016-05-17 Thread Andy Grove
Hi,

I have a requirement to create types dynamically in Spark and then
instantiate those types from Spark SQL via a UDF.

I tried doing the following:

val addressType = StructType(List(
  new StructField("state", DataTypes.StringType),
  new StructField("zipcode", DataTypes.IntegerType)
))

sqlContext.udf.register("Address", (args: Seq[Any]) => new
GenericRowWithSchema(args.toArray, addressType))

sqlContext.sql("SELECT Address('NY', 12345)").show(10)

This seems reasonable to me, but it fails with:

Exception in thread "main" java.lang.UnsupportedOperationException: Schema
for type org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is
not supported
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:685)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:130)

It looks like it would be simple to update ScalaReflection to be able to
infer the schema from a GenericRowWithSchema, but before I file a JIRA and
submit a patch I wanted to see if there is already a way of achieving this.

Thanks,

Andy.


Re: Use case for RDD and Data Frame

2016-02-16 Thread Andy Grove
This blog post should be helpful:

http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
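
A quick illustration of the difference, as a hedged sketch in Java against
the 1.6 API (not taken from the blog post):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class RddVsDataFrame {
    // Public bean so the DataFrame schema can be inferred from it.
    public static class Person implements java.io.Serializable {
        private String name;
        private int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("rdd-vs-df").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<Person> people = sc.parallelize(Arrays.asList(
            new Person("Ann", 34), new Person("Bob", 17)));

        // RDD: typed objects and arbitrary lambdas, opaque to the optimizer.
        long adultsRdd = people.filter(p -> p.getAge() >= 18).count();

        // DataFrame: named columns and expressions that Catalyst can optimize.
        DataFrame df = sqlContext.createDataFrame(people, Person.class);
        long adultsDf = df.filter(df.col("age").geq(18)).count();

        System.out.println(adultsRdd + " / " + adultsDf);
        sc.stop();
    }
}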


Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Tue, Feb 16, 2016 at 9:05 AM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:

> Gurus,
>
> What are the main differences between a Resilient Distributed Dataset (RDD)
> and a DataFrame (DF)?
>
> Where can one use an RDD without transforming it to a DF?
>
> Regards and obliged
>


Re: Scala MatchError in Spark SQL

2016-01-20 Thread Andy Grove
Catalyst is expecting a class that implements org.apache.spark.sql.Row or
scala.Product and is instead finding a Java class. I've run into this issue
a number of times. DataFrame doesn't work so well with Java. Here's a blog
post with more information on this:

http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/


Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Wed, Jan 20, 2016 at 7:07 AM, raghukiran <raghuki...@gmail.com> wrote:

> Hi,
>
> I created a custom UserDefinedType in Java as follows:
>
> SQLPoint = new UserDefinedType() {
> //overriding serialize, deserialize, sqlType, userClass functions here
> }
>
> When creating a dataframe, I am following the manual mapping, I have a
> constructor for JavaPoint - JavaPoint(double x, double y) and a Customer
> record as follows:
>
> public class CustomerRecord {
> private int id;
> private String name;
> private Object location;
>
> //setters and getters follow here
> }
>
> Following the example in Spark source, when I create a RDD as follows:
>
> sc.textFile(inputFileName).map(new Function<String, CustomerRecord>() {
> //call method
> CustomerRecord rec = new CustomerRecord();
> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
> });
>
> This results in a MatchError. The stack trace is as follows:
>
> scala.MatchError: [B@45aa3dd5 (of class [B)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> at
>
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
> at
>
> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(S

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Andy Grove
I'm talking about implementing CustomerRecord as a Scala case class, rather
than as a Java class. Scala case classes implement the scala.Product trait,
which Catalyst is looking for.


Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Wed, Jan 20, 2016 at 10:21 AM, Raghu Ganti <raghuki...@gmail.com> wrote:

> Is it not internal to the Catalyst implementation? I should not be
> modifying the Spark source to get things to work, do I? :-)
>
> On Wed, Jan 20, 2016 at 12:21 PM, Raghu Ganti <raghuki...@gmail.com>
> wrote:
>
>> Case classes where?
>>
>> On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove <andy.gr...@agildata.com>
>> wrote:
>>
>>> Honestly, moving to Scala and using case classes is the path of least
>>> resistance in the long term.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Andy.
>>>
>>> --
>>>
>>> Andy Grove
>>> Chief Architect
>>> AgilData - Simple Streaming SQL that Scales
>>> www.agildata.com
>>>
>>>
>>> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti <raghuki...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your reply, Andy.
>>>>
>>>> Yes, that is what I concluded based on the Stack trace. The problem is
>>>> stemming from Java implementation of generics, but I thought this will go
>>>> away if you compiled against Java 1.8, which solves the issues of proper
>>>> generic implementation.
>>>>
>>>> Any ideas?
>>>>
>>>> Also, are you saying that in order for my example to work, I would need
>>>> to move to Scala and have the UDT implemented in Scala?
>>>>
>>>>
>>>> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove <andy.gr...@agildata.com>
>>>> wrote:
>>>>
>>>>> Catalyst is expecting a class that implements scala.Row or
>>>>> scala.Product and is instead finding a Java class. I've run into this 
>>>>> issue
>>>>> a number of times. Dataframe doesn't work so well with Java. Here's a blog
>>>>> post with more information on this:
>>>>>
>>>>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andy.
>>>>>
>>>>> --
>>>>>
>>>>> Andy Grove
>>>>> Chief Architect
>>>>> AgilData - Simple Streaming SQL that Scales
>>>>> www.agildata.com
>>>>>
>>>>>
>>>>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran <raghuki...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I created a custom UserDefinedType in Java as follows:
>>>>>>
>>>>>> SQLPoint = new UserDefinedType() {
>>>>>> //overriding serialize, deserialize, sqlType, userClass functions here
>>>>>> }
>>>>>>
>>>>>> When creating a dataframe, I am following the manual mapping, I have a
>>>>>> constructor for JavaPoint - JavaPoint(double x, double y) and a
>>>>>> Customer
>>>>>> record as follows:
>>>>>>
>>>>>> public class CustomerRecord {
>>>>>> private int id;
>>>>>> private String name;
>>>>>> private Object location;
>>>>>>
>>>>>> //setters and getters follow here
>>>>>> }
>>>>>>
>>>>>> Following the example in Spark source, when I create a RDD as follows:
>>>>>>
>>>>>> sc.textFile(inputFileName).map(new Function<String, CustomerRecord>()
>>>>>> {
>>>>>> //call method
>>>>>> CustomerRecord rec = new CustomerRecord();
>>>>>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>>>>>> });
>>>>>>
>>>>>> This results in a MatchError. The stack trace is as follows:
>>>>>>
>>>>>> scala.MatchError: [B@45aa3dd5 (of class [B)
>>>>>> at
>>>>>>
>>>>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>>>>>> at

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Andy Grove
I would walk through a Spark tutorial in Scala. It will be the best way to
learn this.

In brief though, a Scala case class is like a Java bean / pojo but has a
more concise syntax (no getters/setters).

case class Person(firstName: String, lastName: String, age: Int)


Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Wed, Jan 20, 2016 at 10:28 AM, Raghu Ganti <raghuki...@gmail.com> wrote:

> Ah, OK! I am a novice to Scala - will take a look at Scala case classes.
> It would be awesome if you can provide some pointers.
>
> Thanks,
> Raghu
>
> On Wed, Jan 20, 2016 at 12:25 PM, Andy Grove <andy.gr...@agildata.com>
> wrote:
>
>> I'm talking about implementing CustomerRecord as a scala case class,
>> rather than as a Java class. Scala case classes implement the scala.Product
>> trait, which Catalyst is looking for.
>>
>>
>> Thanks,
>>
>> Andy.
>>
>> --
>>
>> Andy Grove
>> Chief Architect
>> AgilData - Simple Streaming SQL that Scales
>> www.agildata.com
>>
>>
>> On Wed, Jan 20, 2016 at 10:21 AM, Raghu Ganti <raghuki...@gmail.com>
>> wrote:
>>
>>> Is it not internal to the Catalyst implementation? I should not be
>>> modifying the Spark source to get things to work, do I? :-)
>>>
>>> On Wed, Jan 20, 2016 at 12:21 PM, Raghu Ganti <raghuki...@gmail.com>
>>> wrote:
>>>
>>>> Case classes where?
>>>>
>>>> On Wed, Jan 20, 2016 at 12:21 PM, Andy Grove <andy.gr...@agildata.com>
>>>> wrote:
>>>>
>>>>> Honestly, moving to Scala and using case classes is the path of least
>>>>> resistance in the long term.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andy.
>>>>>
>>>>> --
>>>>>
>>>>> Andy Grove
>>>>> Chief Architect
>>>>> AgilData - Simple Streaming SQL that Scales
>>>>> www.agildata.com
>>>>>
>>>>>
>>>>> On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti <raghuki...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for your reply, Andy.
>>>>>>
>>>>>> Yes, that is what I concluded based on the Stack trace. The problem
>>>>>> is stemming from Java implementation of generics, but I thought this will
>>>>>> go away if you compiled against Java 1.8, which solves the issues of 
>>>>>> proper
>>>>>> generic implementation.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Also, are you saying that in order for my example to work, I would
>>>>>> need to move to Scala and have the UDT implemented in Scala?
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove <andy.gr...@agildata.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Catalyst is expecting a class that implements scala.Row or
>>>>>>> scala.Product and is instead finding a Java class. I've run into this 
>>>>>>> issue
>>>>>>> a number of times. Dataframe doesn't work so well with Java. Here's a 
>>>>>>> blog
>>>>>>> post with more information on this:
>>>>>>>
>>>>>>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andy.
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Andy Grove
>>>>>>> Chief Architect
>>>>>>> AgilData - Simple Streaming SQL that Scales
>>>>>>> www.agildata.com
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran <raghuki...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I created a custom UserDefinedType in Java as follows:
>>>>>>>>
>>>>>>>> SQLPoint = new UserDefinedType() {
>>>>>>>> //overriding serialize, deserialize, sqlType, userClass functions here

Re: Scala MatchError in Spark SQL

2016-01-20 Thread Andy Grove
Honestly, moving to Scala and using case classes is the path of least
resistance in the long term.



Thanks,

Andy.

--

Andy Grove
Chief Architect
AgilData - Simple Streaming SQL that Scales
www.agildata.com


On Wed, Jan 20, 2016 at 10:19 AM, Raghu Ganti <raghuki...@gmail.com> wrote:

> Thanks for your reply, Andy.
>
> Yes, that is what I concluded based on the Stack trace. The problem is
> stemming from Java implementation of generics, but I thought this will go
> away if you compiled against Java 1.8, which solves the issues of proper
> generic implementation.
>
> Any ideas?
>
> Also, are you saying that in order for my example to work, I would need to
> move to Scala and have the UDT implemented in Scala?
>
>
> On Wed, Jan 20, 2016 at 10:27 AM, Andy Grove <andy.gr...@agildata.com>
> wrote:
>
>> Catalyst is expecting a class that implements scala.Row or scala.Product
>> and is instead finding a Java class. I've run into this issue a number of
>> times. Dataframe doesn't work so well with Java. Here's a blog post with
>> more information on this:
>>
>> http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
>>
>>
>> Thanks,
>>
>> Andy.
>>
>> --
>>
>> Andy Grove
>> Chief Architect
>> AgilData - Simple Streaming SQL that Scales
>> www.agildata.com
>>
>>
>> On Wed, Jan 20, 2016 at 7:07 AM, raghukiran <raghuki...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I created a custom UserDefinedType in Java as follows:
>>>
>>> SQLPoint = new UserDefinedType() {
>>> //overriding serialize, deserialize, sqlType, userClass functions here
>>> }
>>>
>>> When creating a dataframe, I am following the manual mapping, I have a
>>> constructor for JavaPoint - JavaPoint(double x, double y) and a Customer
>>> record as follows:
>>>
>>> public class CustomerRecord {
>>> private int id;
>>> private String name;
>>> private Object location;
>>>
>>> //setters and getters follow here
>>> }
>>>
>>> Following the example in Spark source, when I create a RDD as follows:
>>>
>>> sc.textFile(inputFileName).map(new Function<String, CustomerRecord>() {
>>> //call method
>>> CustomerRecord rec = new CustomerRecord();
>>> rec.setLocation(SQLPoint.serialize(new JavaPoint(x, y)));
>>> });
>>>
>>> This results in a MatchError. The stack trace is as follows:
>>>
>>> scala.MatchError: [B@45aa3dd5 (of class [B)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>>> at
>>>
>>> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
>>> at
>>>
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>>
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>>
>>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> at
>>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> at
>>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>> at
>>> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
>>> at
>>>
>>> org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
>>> at scala.collection.Iterator$$anon$11