Hi,

I think that the speed of ORC reads has been improved in the latest
versions. Any chance you could upgrade to the latest version?
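
For what it's worth, Spark 2.3 already ships a second, newer ORC reader that
can be switched on by configuration; a minimal sketch in spark-shell (the
property names are the documented Spark 2.3 ones, and whether this helps for
transactional tables is an assumption worth testing):

    // switch to the newer, ORC 1.4 based reader ("hive" is the default in 2.3)
    spark.conf.set("spark.sql.orc.impl", "native")
    // push query filters down into the ORC files
    spark.conf.set("spark.sql.orc.filterPushdown", "true")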

Regards,
Gourav Sengupta

On 17 Oct 2018 6:11 am, "daily" <asos...@foxmail.com> wrote:

Hi,

Spark version: 2.3.0
Hive version: 2.1.0

Best regards.


------------------ Original Message ------------------
*From:* "Gourav Sengupta"<gourav.sengu...@gmail.com>;
*Sent:* Tuesday, 16 October 2018, 6:35 PM
*To:* "daily"<asos...@foxmail.com>;
*Cc:* "user"<user@spark.apache.org>; "dev"<d...@spark.apache.org>;
*Subject:* Re: SparkSQL read Hive transactional table

Hi,

Can I please ask which versions of Hive and Spark you are using?

Regards,
Gourav Sengupta

On Tue, Oct 16, 2018 at 2:42 AM daily <asos...@foxmail.com> wrote:

> Hi,
>
> I use the HCatalog Streaming Mutation API to write data to a Hive
> transactional table, and then I use SparkSQL to read data back from that
> table. I get the correct result.
> However, SparkSQL takes much more time to read a Hive ORC bucketed
> transactional table, because it reads all columns rather than only the
> columns involved in the SQL.
> My question is: why does SparkSQL read all columns of a Hive ORC bucketed
> transactional table instead of only the columns involved in the SQL? Is it
> possible to make SparkSQL read only the columns involved in the SQL?
>
>
>
> For example:
> Hive Table:
> create table dbtest.t_a1 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6
> int) partitioned by (sd string, st string) clustered by (t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='true');
>
> create table dbtest.t_a2 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6
> int) partitioned by (sd string, st string) clustered by (t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='false');
>
> SparkSQL:
> select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
> select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
>
> SparkSQL stage input sizes:
>
> dbtest.t_a1 = 113.9 GB
> dbtest.t_a2 = 96.5 MB
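>
> To confirm which columns Spark actually requests, the physical plan is
> worth checking; a minimal sketch in spark-shell (the ReadSchema comment
> below is what I would expect if column pruning works, not guaranteed
> output):
>
>     spark.sql("select sum(t1), sum(t2) from dbtest.t_a2 group by t0").explain()
>     // a pruned scan should show a FileScan orc node with
>     // ReadSchema: struct<t0:string,t1:string,t2:double>
>     // if the transactional table t_a1 is read through HiveTableScan
>     // instead, that path may not prune columns in the same way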
>
>
>
> Best regards.
>
>
>
>
