You are looking for LATERAL VIEW explode <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode> in HiveQL.
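
For reference, the general form documented in the Hive manual is `SELECT ... FROM base_table LATERAL VIEW explode(array_column) table_alias AS column_alias`. How the nested field names with spaces need to be quoted in your Spark version is something I have not checked, so treat that part as an open detail. To show what explode does to your data, here is a plain-Python sketch of the semantics (not Spark or Hive code): each element of the "RDD Info" array becomes its own row, with the enclosing "Stage ID" repeated. The function name explode_rdd_info and the second RDD entry in the sample are made up for illustration; only the field names come from the log quoted below.

```python
import json

# One log line, trimmed from the event in the question.
# The entry with "RDD ID": 4 is a made-up second element for illustration.
line = """
{
  "Event": "SparkListenerStageCompleted",
  "Stage Info": {
    "Stage ID": 1,
    "RDD Info": [
      {"RDD ID": 5, "Name": "5", "Number of Partitions": 2,
       "Memory Size": 0, "Disk Size": 0},
      {"RDD ID": 4, "Name": "4", "Number of Partitions": 2,
       "Memory Size": 0, "Disk Size": 0}
    ]
  }
}
"""


def explode_rdd_info(event):
    """Yield one flat row per element of the "RDD Info" array.

    This mimics what LATERAL VIEW explode does: the array is unrolled and
    the parent column ("Stage ID") is repeated on every resulting row.
    """
    stage = event.get("Stage Info", {})
    # A missing or null array produces no rows, like explode on an empty array.
    for rdd in stage.get("RDD Info") or []:
        yield {
            "Stage ID": stage.get("Stage ID"),
            "RDD ID": rdd.get("RDD ID"),
            "Name": rdd.get("Name"),
            "Memory Size": rdd.get("Memory Size"),
            "Disk Size": rdd.get("Disk Size"),
        }


event = json.loads(line)
rows = list(explode_rdd_info(event))
for row in rows:
    print(row)
```

Each printed row corresponds to one RDD, so "RDD ID" can serve as the key of the resulting flat table, as you suggested.
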

On Mon, May 4, 2015 at 7:49 AM, Giovanni Paolo Gibilisco <gibb...@gmail.com> wrote:

> Hi, I'm trying to parse log files generated by Spark using SparkSQL.
>
> In the JSON elements related to the StageCompleted event we have a nested
> structure containing an array of elements with RDD Info (see the log below
> as an example, omitting some parts).
>
> {
>   "Event": "SparkListenerStageCompleted",
>   "Stage Info": {
>     "Stage ID": 1,
>     ...
>     "RDD Info": [
>       {
>         "RDD ID": 5,
>         "Name": "5",
>         "Storage Level": {
>           "Use Disk": false,
>           "Use Memory": false,
>           "Use Tachyon": false,
>           "Deserialized": false,
>           "Replication": 1
>         },
>         "Number of Partitions": 2,
>         "Number of Cached Partitions": 0,
>         "Memory Size": 0,
>         "Tachyon Size": 0,
>         "Disk Size": 0
>       },
>       ...
>
> When I register the log as a table, SparkSQL is able to generate the
> correct schema, which for the RDD Info element looks like
>
>  |-- RDD Info: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- Disk Size: long (nullable = true)
>  |    |    |-- Memory Size: long (nullable = true)
>  |    |    |-- Name: string (nullable = true)
>
> My problem is that if I try to query the table I can only get array
> buffers out of it:
>
> "SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD Info` FROM stageEndInfos"
> Stage ID    RDD Info
> 1           ArrayBuffer([0,0,...
> 0           ArrayBuffer([0,0,...
> 2           ArrayBuffer([0,0,...
>
> or:
>
> "SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos"
> RDD ID
> ArrayBuffer(5, 4, 3)
> ArrayBuffer(2, 1, 0)
> ArrayBuffer(9, 6,...
>
> Is there a way to explode the arrays in the rows in order to build a
> single table? (Knowing that the RDD ID is unique and can be used as a
> primary key.)
>
> Thanks!
>
> How can I get
>