This problem has been discussed before, but I don't think there is a straightforward way to read only col_g.
2015-12-30 17:48 GMT+08:00 lin <kurtt....@gmail.com>:
> Hi all,
>
> We are trying to read from nested parquet data. The SQL is "select
> col_b.col_d.col_g from some_table" and the data schema for some_table is:
>
> root
>  |-- col_a: long (nullable = false)
>  |-- col_b: struct (nullable = true)
>  |    |-- col_c: string (nullable = true)
>  |    |-- col_d: array (nullable = true)
>  |    |    |-- element: struct (containsNull = true)
>  |    |    |    |-- col_e: integer (nullable = true)
>  |    |    |    |-- col_f: string (nullable = true)
>  |    |    |    |-- col_g: long (nullable = true)
>
> We expected to see only col_g read and parsed from the parquet files;
> however, we actually observed the whole col_b being read and parsed.
>
> As we dug in a little, it seems that col_g is a GetArrayStructFields,
> col_d is a GetStructField, and only col_b is an AttributeReference, so
> PhysicalOperation.collectProjectsAndFilters() returns col_b instead of
> col_g as the projection.
>
> So we wonder, is there any way to read and parse only col_g instead of
> the whole col_b? We use Spark 1.5.1 and Parquet 1.7.0.
>
> Thanks! :)
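For anyone wanting to verify this behavior themselves, a minimal spark-shell sketch (Spark 1.5.x) is below; the path and temp table name are illustrative, not from the original report. The scan node in the physical plan lists the columns actually requested from the Parquet files, and with the pruning limitation described above it shows col_b rather than col_b.col_d.col_g:

```scala
// Sketch only: adjust the path to your own nested Parquet data.
val df = sqlContext.read.parquet("/path/to/some_table")
df.registerTempTable("some_table")

val q = sqlContext.sql("select col_b.col_d.col_g from some_table")

// The Parquet scan in the printed plan lists the requested columns.
// Because collectProjectsAndFilters() only collects AttributeReferences
// (here, col_b), the scan output shows the whole col_b struct.
q.explain(true)
```

Comparing this with a query on a top-level column (e.g. `select col_a from some_table`) makes the difference visible: the top-level query's scan requests only col_a, while the nested query's scan pulls the entire col_b struct.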