Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Reynold Xin
What query did you run? Parquet should have predicate and column pushdown,
i.e. if your query only needs to read 3 columns, then only 3 will be read.
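
As a rough sketch of what such a query could look like with the Spark 1.2-era API (the column names name, age and city are hypothetical, since the schema of people.parquet is not given in this thread, and the local master is only an assumption for trying it out):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ColumnPruningSketch {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption for experimentation only.
    val conf = new SparkConf().setAppName("column-pruning-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Same call as in the original question; this only sets up the relation.
    val parquetFile = sqlContext.parquetFile("people.parquet")
    parquetFile.registerTempTable("people")

    // Hypothetical column names: the query names only the columns it needs,
    // which is what allows the Parquet scan to skip the others.
    val projected = sqlContext.sql("SELECT name, age, city FROM people")
    projected.collect().foreach(println)

    sc.stop()
  }
}
```

The point of the sketch is that the column selection is expressed in the query, not in the parquetFile call itself.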

On Mon, Jan 12, 2015 at 10:20 PM, Ajay Srivastava <
a_k_srivast...@yahoo.com.invalid> wrote:

> Hi,
> I am trying to read a parquet file using -
>
> val parquetFile = sqlContext.parquetFile("people.parquet")
>
> I see no way to specify that I want to read only some columns from disk,
> for example when the Parquet file has 10 columns and I want to read only
> 3 of them.
>
> We have done an experiment -
> Table1 - Parquet file containing 10 columns
> Table2 - Parquet file containing only 3 columns which were used in query
>
> The query times on Table1 and Table2 differ hugely. The query on Table1
> takes more than twice as long as the one on Table2, which makes me think
> that Spark is reading all the columns from disk for Table1 even though it
> needs only 3.
>
> How can I make sure that it reads only 3 of the 10 columns from disk?
>
>
> Regards,
> Ajay
>
>


Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Ajay Srivastava
Setting spark.sql.hive.convertMetastoreParquet to true has fixed this.
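
For anyone finding this later, a minimal sketch of applying that setting, assuming the tables are Parquet tables registered in the Hive metastore and queried through a HiveContext (the table name table1 and the columns col1, col2, col3 are hypothetical placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ConvertMetastoreParquetSketch {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption for experimentation only.
    val conf = new SparkConf().setAppName("convert-metastore-parquet-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Ask Spark SQL to use its built-in Parquet support for metastore
    // Parquet tables instead of going through the Hive SerDe.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

    // Hypothetical table and column names.
    val result = hiveContext.sql("SELECT col1, col2, col3 FROM table1")
    result.collect().foreach(println)

    sc.stop()
  }
}
```

The same key can also be set with a SQL statement such as SET spark.sql.hive.convertMetastoreParquet=true before running the query.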

Regards,
Ajay
