This is native R data.frame behavior. While arr is a character vector of length 2,

    > arr
    [1] "rows= 50" "cols= 2"
    > length(arr)
    [1] 2
when it is put into a data.frame, the character vector is split across 2 rows,

    > data.frame(key, strings = arr, stringsAsFactors = F)
      key  strings
    1   a rows= 50
    2   a  cols= 2
    > b <- data.frame(key, strings = arr, stringsAsFactors = F)
    > sapply(b, class)
            key     strings
    "character" "character"
    > b[1,1]
    [1] "a"
    > b[1,2]
    [1] "rows= 50"
    > b[2,2]
    [1] "cols= 2"

and each element ends up in a separate row of the character column. This causes a schema mismatch when you set the schema to structField('strings', 'array<string>'): Spark then expects a string array per row, not a plain string.

_____________________________
From: shirisht <shirish.tatiko...@gmail.com>
Sent: Tuesday, October 25, 2016 11:51 PM
Subject: SparkR issue with array types in gapply()
To: <dev@spark.apache.org>

Hello,

I am getting an exception from catalyst when array types are used in the return schema of the gapply() function. Following is a (made-up) example:

------------------------------------------------------------
iris$flag = base::sample(1:2, nrow(iris), T, prob = c(0.5, 0.5))
irisdf = createDataFrame(iris)

foo = function(key, x) {
  nr = nrow(x)
  nc = ncol(x)
  arr = c(paste("rows=", nr), paste("cols=", nc))
  data.frame(key, strings = arr, stringsAsFactors = F)
}

outSchema = structType(
  structField('key', 'integer'),
  structField('strings', 'array<string>')
)

result = SparkR::gapply(irisdf, "flag", foo, outSchema)
d = SparkR::collect(result)
------------------------------------------------------------

This code throws the following error:

    java.lang.RuntimeException: java.lang.String is not a valid external type for schema of array<string>
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Any thoughts?

Thank you,
Shirish

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-issue-with-array-types-in-gapply-tp19568.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
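For what it's worth, a possible direction for the example in this thread (a base-R sketch, not verified against a Spark cluster): wrap the character vector in a list column via I(), so each output row carries the whole vector rather than one string per row. The names split_df and list_df are illustrative, not from the original code.

    # The behavior described above: a length-2 character vector is
    # recycled into 2 rows, one plain string per row.
    key <- "a"
    arr <- c("rows= 50", "cols= 2")
    split_df <- data.frame(key, strings = arr, stringsAsFactors = FALSE)
    nrow(split_df)  # 2 rows

    # A list column built with I(): one row whose 'strings' cell
    # holds the entire character vector, i.e. an array-shaped value.
    list_df <- data.frame(key, stringsAsFactors = FALSE)
    list_df$strings <- I(list(arr))
    nrow(list_df)         # 1 row
    list_df$strings[[1]]  # the full vector c("rows= 50", "cols= 2")

In foo this would mean returning data.frame(key, strings = I(list(arr)), stringsAsFactors = F); whether SparkR's gapply serializer accepts such an AsIs list column for an array<string> field in this Spark version is an assumption that would need testing.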