This is native R data.frame behavior.

Here, arr is a character vector of length 2:
> arr
[1] "rows= 50" "cols= 2"
> length(arr)
[1] 2


When it is placed in an R data.frame, the character vector is split into 2 rows (the length-1 key is recycled):


> data.frame(key, strings = arr, stringsAsFactors = F)
  key  strings
1   a rows= 50
2   a  cols= 2


> b <- data.frame(key, strings = arr, stringsAsFactors = F)
> sapply(b, class)
        key     strings
"character" "character"
> b[1,1]
[1] "a"
> b[1,2]
[1] "rows= 50"
> b[2,2]
[1] "cols= 2"


Each element therefore becomes a separate entry in the character column. This
causes a schema mismatch: Spark expects an array of strings per row, not a
plain string, when the schema declares structField('strings', 'array<string>').
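
A workaround worth trying (a sketch, and assuming SparkR's gapply
serialization accepts list columns for array types) is to wrap the vector
in I(list(...)), so the returned data.frame has a single row whose
'strings' cell holds the entire character vector:

foo <- function(key, x) {
  nr <- nrow(x)
  nc <- ncol(x)
  arr <- c(paste("rows=", nr), paste("cols=", nc))
  # I(list(arr)) keeps the whole vector in one cell instead of
  # recycling it into two rows, so 'strings' becomes a list column
  data.frame(key, strings = I(list(arr)), stringsAsFactors = FALSE)
}

With this version each group yields one row, and the list-of-character
column lines up with structField('strings', 'array<string>').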


_____________________________
From: shirisht <shirish.tatiko...@gmail.com>
Sent: Tuesday, October 25, 2016 11:51 PM
Subject: SparkR issue with array types in gapply()
To: <dev@spark.apache.org>


Hello,

I am getting an exception from Catalyst when array types are used in the
return schema of the gapply() function.

Following is a (made-up) example:

------------------------------------------------------------
iris$flag = base::sample(1:2, nrow(iris), T, prob = c(0.5, 0.5))
irisdf = createDataFrame(iris)

foo = function(key, x) {
  nr = nrow(x)
  nc = ncol(x)
  arr = c(paste("rows=", nr), paste("cols=", nc))
  data.frame(key, strings = arr, stringsAsFactors = F)
}

outSchema = structType(structField('key', 'integer'),
                       structField('strings', 'array<string>'))
result = SparkR::gapply(irisdf, "flag", foo, outSchema)
d = SparkR::collect(result)
------------------------------------------------------------

This code throws the following error:

java.lang.RuntimeException: java.lang.String is not a valid external type for schema of array<string>
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
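
For what it's worth, the per-group shape can be inspected locally without
Spark (a quick check; assumes the flag column added in the snippet above):

# foo returns a 2-row data.frame of plain strings for a single group,
# while the schema expects one array<string> value per row
local <- foo(1L, iris[iris$flag == 1, ])
str(local)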

Any thoughts?

Thank you,
Shirish






