There is not a super easy way to do what you are asking, since in
general parquet needs to read all the data in a column. As far as I
understand, it does not have indexes that would let you jump to a
specific entry in a column. There is support for pushing down
predicates, but unfortunately it is turned off by default (in Spark
1.2) due to bugs in the parquet library. Even with this feature on, I
believe you still read the data and just skip the cost of materializing
the rows.
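If you want to experiment with it anyway, it is controlled by a
SQLContext setting. A minimal, untested sketch, assuming the
spark.sql.parquet.filterPushdown key from the 1.2-era configuration:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  // Off by default in Spark 1.2 because of known bugs in the bundled
  // parquet library; enable at your own risk.
  sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")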

One thing that could speed up that particular query is to sort by 'rid
before storing to parquet. Parquet keeps statistics on the min/max
values of each column in a given row group, so (when filter pushdown is
turned on) it can completely skip row groups that cannot contain a
given 'rid.
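
For example, something like this untested sketch against the 1.2
SchemaRDD API (the /tmp/logs_sorted output path is just a placeholder):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext._  // brings in the Scala DSL, e.g. 'rid.asc

  // Rewrite the data sorted by the filter column so each row group
  // covers a narrow range of 'rid values and can be skipped using
  // the min/max statistics.
  val logs = sqlContext.parquetFile("/tmp/logs")
  logs.orderBy('rid.asc).saveAsParquetFile("/tmp/logs_sorted")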

Michael

On Tue, Dec 2, 2014 at 12:43 PM, Vishnusaran Ramaswamy <
[email protected]> wrote:

> Hi,
>
> I have 16 GB of parquet files in the /tmp/logs/ folder with the following
> schema:
>
> request_id(String), module(String), payload(Array[Byte])
>
> Most of my 16 GB of data is in the payload field; the request_id and
> module fields take less than 200 MB.
>
> I want to load the payload only when my filter condition matches.
>
> import org.apache.spark.sql.SQLContext
>
> val sqlContext = new SQLContext(sc)
> val files = sqlContext.parquetFile("/tmp/logs")
> files.registerTempTable("logs")
> val filteredLogs = sqlContext.sql(
>   "select request_id, payload from logs where rid = 'dd4455ee' and module = 'query'")
>
> When I run filteredLogs.collect.foreach(println), I see all 16 GB of
> data being loaded.
>
> How do I load only the columns used in the filter first, and then load
> the payload just for the rows matching the filter criteria?
>
> Let me know if this can be done in a different way.
>
> Thank you,
> Vishnu.
>