A small update: I was able to find a solution with good performance: Brickhouse's collect (a Hive UDAF). It also accepts structs as input, which is an acceptable workaround, though still not perfect (support for UDTs would be better). The built-in Hive 'collect_list' seems to check that its input parameters are of primitive types, which is what caused the issue I reported earlier. UDTs as values still throw scala.MatchError (see the stack trace below).
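In case it helps, this is roughly how the workaround looks on my side. Treat it as an untested sketch: I'm assuming the UDAF class name brickhouse.udf.collect.CollectUDAF from the Brickhouse sources, and the gpx table with lat/lon columns comes from my own schema.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class CollectWithBrickhouse {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("collect-structs"));
        HiveContext sqlContext = new HiveContext(sc.sc());

        // The Brickhouse jar needs to be on the classpath (or ADD JAR it first).
        sqlContext.sql(
            "CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF'");

        // Feed plain structs instead of the UDT: this aggregates into
        // array<struct<lat:double,lon:double>> without hitting the MatchError.
        DataFrame points = sqlContext.sql(
            "SELECT collect(named_struct('lat', lat, 'lon', lon)) AS pts FROM gpx");
        points.show();
      }
    }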
Anyway, it seems that adding support for aggregating all values into collections, without compromising performance, to Spark SQL UDAFs would be a great improvement (a rough sketch of what I mean is below the stack trace).

-JP

createpoint is a UDF that creates a UDT out of a coordinate pair. The collect function should collect the points into an array, ready for further processing with different geographical functions.

15/09/13 12:38:13 INFO parse.ParseDriver: Parsing command: select collect(pt) from (SELECT createpoint(lat,lon) as pt from gpx) t2
15/09/13 12:38:13 INFO parse.ParseDriver: Parse Completed

Exception in thread "main" scala.MatchError: com.eaglepeaks.engine.GeomUDT@7a8e35d1 (of class com.eaglepeaks.engine.GeomUDT)
        at org.apache.spark.sql.hive.HiveInspectors$class.toInspector(HiveInspectors.scala:618)
        at org.apache.spark.sql.hive.HiveGenericUDAF.toInspector(hiveUDFs.scala:445)
        at org.apache.spark.sql.hive.HiveInspectors$class.toInspector(HiveInspectors.scala:710)
        at org.apache.spark.sql.hive.HiveGenericUDAF.toInspector(hiveUDFs.scala:445)
        at org.apache.spark.sql.hive.HiveGenericUDAF$$anonfun$inspectors$1.apply(hiveUDFs.scala:463)
        at org.apache.spark.sql.hive.HiveGenericUDAF$$anonfun$inspectors$1.apply(hiveUDFs.scala:463)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.hive.HiveGenericUDAF.inspectors$lzycompute(hiveUDFs.scala:463)
        at org.apache.spark.sql.hive.HiveGenericUDAF.inspectors(hiveUDFs.scala:463)
        at org.apache.spark.sql.hive.HiveGenericUDAF.objectInspector$lzycompute(hiveUDFs.scala:457)
        at org.apache.spark.sql.hive.HiveGenericUDAF.objectInspector(hiveUDFs.scala:456)
        at org.apache.spark.sql.hive.HiveGenericUDAF.dataType(hiveUDFs.scala:465)
        at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:140)
        at org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:223)
        at org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:223)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.catalyst.plans.logical.Aggregate.output(basicOperators.scala:223)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic$$anonfun$apply$17.applyOrElse(Analyzer.scala:962)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic$$anonfun$apply$17.applyOrElse(Analyzer.scala:954)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic$.apply(Analyzer.scala:954)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic$.apply(Analyzer.scala:953)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
        at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:910)
        at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:910)
        at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
        at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719)
        at com.eaglepeaks.engine.SparkEngine.main(SparkEngine.java:114)
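For comparison, this is the kind of native Spark SQL UDAF I mean above: a rough, untested sketch using the 1.5 UserDefinedAggregateFunction API that collects struct-typed points into an array. It only works with plain structs (not the UDT), and every update() has to copy and replace the whole buffer list, which is exactly where the performance goes; the names CollectPointsUdaf and collect_points are just placeholders.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.MutableAggregationBuffer;
    import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class CollectPointsUdaf extends UserDefinedAggregateFunction {

      // struct<lat:double,lon:double> stands in for the point UDT.
      private final StructType pointType = DataTypes.createStructType(new StructField[] {
          DataTypes.createStructField("lat", DataTypes.DoubleType, false),
          DataTypes.createStructField("lon", DataTypes.DoubleType, false)});

      @Override
      public StructType inputSchema() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("pt", pointType, false)});
      }

      @Override
      public StructType bufferSchema() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("pts",
                DataTypes.createArrayType(pointType), false)});
      }

      @Override
      public DataType dataType() {
        return DataTypes.createArrayType(pointType);
      }

      @Override
      public boolean deterministic() {
        return true;
      }

      @Override
      public void initialize(MutableAggregationBuffer buffer) {
        buffer.update(0, new ArrayList<Row>());
      }

      @Override
      public void update(MutableAggregationBuffer buffer, Row input) {
        // The buffer array has to be copied and replaced wholesale on every row.
        List<Row> pts = new ArrayList<Row>(buffer.<Row>getList(0));
        pts.add(input.getStruct(0));
        buffer.update(0, pts);
      }

      @Override
      public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
        List<Row> pts = new ArrayList<Row>(buffer1.<Row>getList(0));
        pts.addAll(buffer2.<Row>getList(0));
        buffer1.update(0, pts);
      }

      @Override
      public Object evaluate(Row buffer) {
        return buffer.getList(0);
      }
    }

It would be registered with sqlContext.udf().register("collect_points", new CollectPointsUdaf()) and called the same way as the Brickhouse version above, e.g. SELECT collect_points(named_struct('lat', lat, 'lon', lon)) FROM gpx.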