A small update: I was able to find a solution with good performance: Brickhouse's collect (a Hive UDAF). It also accepts structs as input, which is an acceptable workaround, though still not perfect (support for UDTs would be better). The built-in Hive 'collect_list' seems to check that its input parameters are of primitive types, which is what caused the issue I reported earlier. UDTs as values still throw scala.MatchError (see the stack trace below).
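In case it helps, this is roughly how the workaround looks on my side. Treat it as an untested sketch: I'm assuming the UDAF class name brickhouse.udf.collect.CollectUDAF from the Brickhouse sources, and the gpx table with lat/lon columns comes from my own schema.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class CollectWithBrickhouse {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("collect-structs"));
        HiveContext sqlContext = new HiveContext(sc.sc());

        // The Brickhouse jar needs to be on the classpath (or ADD JAR it first).
        sqlContext.sql(
            "CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF'");

        // Feed plain structs instead of the UDT: this aggregates into
        // array<struct<lat:double,lon:double>> without hitting the MatchError.
        DataFrame points = sqlContext.sql(
            "SELECT collect(named_struct('lat', lat, 'lon', lon)) AS pts FROM gpx");
        points.show();
      }
    }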
Anyway, it seems that adding support for aggregating all values into collections, without compromising performance, to Spark SQL UDAFs would be a great improvement (a rough sketch of what I mean is below the stack trace).

-JP

createpoint is a UDF that creates a UDT out of a coordinate pair. The collect function should collect the points into an array, ready for further processing with different geographical functions.

15/09/13 12:38:13 INFO parse.ParseDriver: Parsing command: select collect(pt) from (SELECT createpoint(lat,lon) as pt from gpx) t2
15/09/13 12:38:13 INFO parse.ParseDriver: Parse Completed

Exception in thread "main" scala.MatchError: com.eaglepeaks.engine.GeomUDT@7a8e35d1 (of class com.eaglepeaks.engine.GeomUDT)
        at org.apache.spark.sql.hive.HiveInspectors$class.toInspector(HiveInspectors.scala:618)
        at org.apache.spark.sql.hive.HiveGenericUDAF.toInspector(hiveUDFs.scala:445)
        at org.apache.spark.sql.hive.HiveInspectors$class.toInspector(HiveInspectors.scala:710)
        at org.apache.spark.sql.hive.HiveGenericUDAF.toInspector(hiveUDFs.scala:445)
        at org.apache.spark.sql.hive.HiveGenericUDAF$$anonfun$inspectors$1.apply(hiveUDFs.scala:463)
        at org.apache.spark.sql.hive.HiveGenericUDAF$$anonfun$inspectors$1.apply(hiveUDFs.scala:463)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.hive.HiveGenericUDAF.inspectors$lzycompute(hiveUDFs.scala:463)
        at org.apache.spark.sql.hive.HiveGenericUDAF.inspectors(hiveUDFs.scala:463)
        at org.apache.spark.sql.hive.HiveGenericUDAF.objectInspector$lzycompute(hiveUDFs.scala:457)
        at org.apache.spark.sql.hive.HiveGenericUDAF.objectInspector(hiveUDFs.scala:456)
        at org.apache.spark.sql.hive.HiveGenericUDAF.dataType(hiveUDFs.scala:465)
        at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:140)
        at org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:223)
        at org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:223)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.catalyst.plans.logical.Aggregate.output(basicOperators.scala:223)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic$$anonfun$apply$17.applyOrElse(Analyzer.scala:962)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic$$anonfun$apply$17.applyOrElse(Analyzer.scala:954)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic$.apply(Analyzer.scala:954)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic$.apply(Analyzer.scala:953)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
        at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:910)
        at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:910)
        at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
        at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719)
        at com.eaglepeaks.engine.SparkEngine.main(SparkEngine.java:114)
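For comparison, this is the kind of native Spark SQL UDAF I mean above: a rough, untested sketch using the 1.5 UserDefinedAggregateFunction API that collects struct-typed points into an array. It only works with plain structs (not the UDT), and every update() has to copy and replace the whole buffer list, which is exactly where the performance goes; the names CollectPointsUdaf and collect_points are just placeholders.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.MutableAggregationBuffer;
    import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class CollectPointsUdaf extends UserDefinedAggregateFunction {

      // struct<lat:double,lon:double> stands in for the point UDT.
      private final StructType pointType = DataTypes.createStructType(new StructField[] {
          DataTypes.createStructField("lat", DataTypes.DoubleType, false),
          DataTypes.createStructField("lon", DataTypes.DoubleType, false)});

      @Override
      public StructType inputSchema() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("pt", pointType, false)});
      }

      @Override
      public StructType bufferSchema() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("pts",
                DataTypes.createArrayType(pointType), false)});
      }

      @Override
      public DataType dataType() {
        return DataTypes.createArrayType(pointType);
      }

      @Override
      public boolean deterministic() {
        return true;
      }

      @Override
      public void initialize(MutableAggregationBuffer buffer) {
        buffer.update(0, new ArrayList<Row>());
      }

      @Override
      public void update(MutableAggregationBuffer buffer, Row input) {
        // The buffer array has to be copied and replaced wholesale on every row.
        List<Row> pts = new ArrayList<Row>(buffer.<Row>getList(0));
        pts.add(input.getStruct(0));
        buffer.update(0, pts);
      }

      @Override
      public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
        List<Row> pts = new ArrayList<Row>(buffer1.<Row>getList(0));
        pts.addAll(buffer2.<Row>getList(0));
        buffer1.update(0, pts);
      }

      @Override
      public Object evaluate(Row buffer) {
        return buffer.getList(0);
      }
    }

It would be registered with sqlContext.udf().register("collect_points", new CollectPointsUdaf()) and called the same way as the Brickhouse version above, e.g. SELECT collect_points(named_struct('lat', lat, 'lon', lon)) FROM gpx.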