This problem has been discussed before, but I don't think there is a straightforward way to read only col_g.
2015-12-30 17:48 GMT+08:00 lin <kurtt....@gmail.com>:
> Hi all,
>
> We are trying to read from nested parquet data. The SQL is "select
> col_b.col_d.col_g from some_table" and the data schema for some_table is:
>
> root
>  |-- col_a: long (nullable = false)
>  |-- col_b: struct (nullable = true)
>  |    |-- col_c: string (nullable = true)
>  |    |-- col_d: array (nullable = true)
>  |    |    |-- element: struct (containsNull = true)
>  |    |    |    |-- col_e: integer (nullable = true)
>  |    |    |    |-- col_f: string (nullable = true)
>  |    |    |    |-- col_g: long (nullable = true)
>
> We expected to see only col_g read and parsed from the parquet files;
> however, we actually observed the whole col_b being read and parsed.
>
> As we dug in a little, it seems that col_g is a GetArrayStructFields,
> col_d is a GetStructField, and only col_b is an AttributeReference, so
> PhysicalOperation.collectProjectsAndFilters() returns col_b instead of
> col_g as the projection.
>
> So we wonder, is there any way to read and parse only col_g instead of
> the whole col_b? We use Spark 1.5.1 and Parquet 1.7.0.
>
> Thanks! :)
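For anyone wanting to verify this behavior themselves, a minimal spark-shell sketch (Spark 1.5.x) is below; the path and temp table name are illustrative, not from the original report. The scan node in the physical plan lists the columns actually requested from the Parquet files, and with the pruning limitation described above it shows col_b rather than col_b.col_d.col_g:

```scala
// Sketch only: adjust the path to your own nested Parquet data.
val df = sqlContext.read.parquet("/path/to/some_table")
df.registerTempTable("some_table")

val q = sqlContext.sql("select col_b.col_d.col_g from some_table")

// The Parquet scan in the printed plan lists the requested columns.
// Because collectProjectsAndFilters() only collects AttributeReferences
// (here, col_b), the scan output shows the whole col_b struct.
q.explain(true)
```

Comparing this with a query on a top-level column (e.g. `select col_a from some_table`) makes the difference visible: the top-level query's scan requests only col_a, while the nested query's scan pulls the entire col_b struct.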