Re: SparkSQL Nested structure

Michael Armbrust Mon, 04 May 2015 16:07:16 -0700

You are looking for LATERAL VIEW explode
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode>
in HiveQL.
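
For readers unfamiliar with it, the general HiveQL form is `FROM <table> LATERAL VIEW explode(<array column>) <table alias> AS <column alias>`, which produces one output row per array element. The plain-Python sketch below is not Spark or Hive API; it only imitates that row-level effect on a record shaped like the quoted log excerpt. The helper name `explode_rdd_info` and the second RDD entry (ID 4) are hypothetical and exist only for illustration.

```python
import json

# Sample record modeled on the log excerpt in the quoted message.
# The second RDD entry (ID 4) is invented for illustration.
record = json.loads("""
{
  "Event": "SparkListenerStageCompleted",
  "Stage Info": {
    "Stage ID": 1,
    "RDD Info": [
      {"RDD ID": 5, "Name": "5", "Memory Size": 0, "Disk Size": 0},
      {"RDD ID": 4, "Name": "4", "Memory Size": 0, "Disk Size": 0}
    ]
  }
}
""")


def explode_rdd_info(rec):
    """Yield one flat row per element of rec["Stage Info"]["RDD Info"].

    Hypothetical helper: it mimics what LATERAL VIEW explode does to a
    single input row, repeating the parent column (Stage ID) for each
    array element.
    """
    stage = rec["Stage Info"]
    # A missing or empty array yields no rows, as with a plain
    # (non-OUTER) LATERAL VIEW.
    for rdd in stage.get("RDD Info") or []:
        yield {
            "Stage ID": stage["Stage ID"],
            "RDD ID": rdd["RDD ID"],
            "Name": rdd["Name"],
            "Memory Size": rdd["Memory Size"],
            "Disk Size": rdd["Disk Size"],
        }


rows = list(explode_rdd_info(record))
for row in rows:
    print(row)
```

Each printed row pairs the stage with exactly one RDD, so RDD ID can serve as the key of the flattened table, which is the shape asked for in the quoted message.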


On Mon, May 4, 2015 at 7:49 AM, Giovanni Paolo Gibilisco <gibb...@gmail.com>
wrote:

> Hi, I'm trying to parse log files generated by Spark using SparkSQL.
>
> In the JSON elements related to the StageCompleted event we have a nested
> structure containing an array of elements with RDD Info (see the log below
> as an example, omitting some parts).
>
> {
>     "Event": "SparkListenerStageCompleted",
>     "Stage Info": {
>       "Stage ID": 1,
>       ...
>       "RDD Info": [
>         {
>           "RDD ID": 5,
>           "Name": "5",
>           "Storage Level": {
>             "Use Disk": false,
>             "Use Memory": false,
>             "Use Tachyon": false,
>             "Deserialized": false,
>             "Replication": 1
>           },
>           "Number of Partitions": 2,
>           "Number of Cached Partitions": 0,
>           "Memory Size": 0,
>           "Tachyon Size": 0,
>           "Disk Size": 0
>         },
> ...
>
> When I register the log as a table, SparkSQL generates the correct
> schema, which for the RDD Info element looks like:
>
>  |-- RDD Info: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- Disk Size: long (nullable = true)
>  |    |    |-- Memory Size: long (nullable = true)
>  |    |    |-- Name: string (nullable = true)
>
> My problem is that if I try to query the table I can only get array
> buffers out of it:
>
> "SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD
> Info` FROM stageEndInfos"
> Stage ID RDD Info
> 1        ArrayBuffer([0,0,...
> 0        ArrayBuffer([0,0,...
> 2        ArrayBuffer([0,0,...
>
> or:
>
> "SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos"
> RDD ID
> ArrayBuffer(5, 4, 3)
> ArrayBuffer(2, 1, 0)
> ArrayBuffer(9, 6,...
>
> Is there a way to explode the arrays in the rows in order to build a
> single table, given that the RDD ID is unique and can be used as a
> primary key?
>
> Thanks!
>
> How can I get
>
