Hi all,

I am using Spark 2.1 and want to transform a nested JSON dataset. I read it into a DataFrame, but processing the nested part failed. Here is the schema of the DataFrame:

root
 |-- deviceid: string (nullable = true)
 |-- app: struct (nullable = true)
 |    |-- appList: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- appName: string (nullable = true)
 |    |    |    |-- appVersion: string (nullable = true)
 |    |    |    |-- pkgName: string (nullable = true)
 |    |-- appName: string (nullable = true)
 |    |-- appVersion: string (nullable = true)
 |    |-- firstUseTime: string (nullable = true)
 |    |-- installTime: string (nullable = true)
 |    |-- pkgName: string (nullable = true)
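For reference, the schema above comes from printSchema() on a DataFrame loaded roughly like this (the path is just an example):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ParquetTest").getOrCreate()
    // read the nested JSON into a DataFrame
    val df = spark.read.json("/path/to/device_apps.json")
    df.printSchema()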
I want to retrieve the data under appList and merge it. What I did was define a case class:

    case class AppInfo(appName: String, appVersion: String, pkgName: String)

and read the list with getList[AppInfo]. It compiles successfully, but at runtime I get a ClassCastException:

    java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to com.zhishu.data.etl.ParquetTest$AppInfo
        at com.zhishu.data.etl.ParquetTest$$anonfun$2.apply(ParquetTest.scala:75)
        at com.zhishu.data.etl.ParquetTest$$anonfun$2.apply(ParquetTest.scala:56)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Is there an easy way to implement what I want to do?

Thanks and Regards,
Tony
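P.S. In case it helps, here is a minimal version of the code that triggers the exception. Paths and some names are simplified, so the line numbers in the trace above come from my real job, not from this sketch:

    import org.apache.spark.sql.SparkSession

    object ParquetTest {
      case class AppInfo(appName: String, appVersion: String, pkgName: String)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ParquetTest").getOrCreate()
        import spark.implicits._

        val df = spark.read.json("/path/to/device_apps.json")

        val firstApps = df.map { row =>
          val app = row.getStruct(row.fieldIndex("app"))
          val list = app.getList[AppInfo](app.fieldIndex("appList"))
          // The ClassCastException is thrown here: the list elements are
          // really GenericRowWithSchema, not AppInfo
          list.get(0).appName
        }
        firstApps.show()
      }
    }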