Hi all,

I am using Spark 2.1 and want to transform a nested JSON dataset. I read it into a DataFrame, but processing the nested part failed. Here is the schema of the DataFrame:

root
 |-- deviceid: string (nullable = true)
 |-- app: struct (nullable = true)
 |    |-- appList: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- appName: string (nullable = true)
 |    |    |    |-- appVersion: string (nullable = true)
 |    |    |    |-- pkgName: string (nullable = true)
 |    |-- appName: string (nullable = true)
 |    |-- appVersion: string (nullable = true)
 |    |-- firstUseTime: string (nullable = true)
 |    |-- installTime: string (nullable = true)
 |    |-- pkgName: string (nullable = true)
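For reference, the schema above comes from printSchema() on a DataFrame loaded roughly like this (the path is just an example):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ParquetTest").getOrCreate()
    // read the nested JSON into a DataFrame
    val df = spark.read.json("/path/to/device_apps.json")
    df.printSchema()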
I want to retrieve the data under appList and merge it. What I did was define a case class:

    case class AppInfo(appName: String, appVersion: String, pkgName: String)

and read the list with getList[AppInfo]. It compiles successfully, but at runtime I get a ClassCastException:

    java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to com.zhishu.data.etl.ParquetTest$AppInfo
        at com.zhishu.data.etl.ParquetTest$$anonfun$2.apply(ParquetTest.scala:75)
        at com.zhishu.data.etl.ParquetTest$$anonfun$2.apply(ParquetTest.scala:56)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Is there an easy way to implement what I want to do?

Thanks and Regards,
Tony
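P.S. In case it helps, here is a minimal version of the code that triggers the exception. Paths and some names are simplified, so the line numbers in the trace above come from my real job, not from this sketch:

    import org.apache.spark.sql.SparkSession

    object ParquetTest {
      case class AppInfo(appName: String, appVersion: String, pkgName: String)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ParquetTest").getOrCreate()
        import spark.implicits._

        val df = spark.read.json("/path/to/device_apps.json")

        val firstApps = df.map { row =>
          val app = row.getStruct(row.fieldIndex("app"))
          val list = app.getList[AppInfo](app.fieldIndex("appList"))
          // The ClassCastException is thrown here: the list elements are
          // really GenericRowWithSchema, not AppInfo
          list.get(0).appName
        }
        firstApps.show()
      }
    }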