Re: orc vs parquet aggregation, orc is really slow

Maurin Lenglart Sat, 16 Apr 2016 01:14:22 -0700

Hi,

I will put the date in the correct format in the future and see if that 
changes anything.
The query that I sent is just one example of a possible aggregation. Many 
others are possible on the same table, so I am not sure that sorting the data 
for all of them would actually have an impact.

I am using the latest release of Cloudera and I didn't modify any versions. Do 
you think that I should try to update Hive manually?

thanks

From: Jörn Franke <jornfra...@gmail.com>
Date: Saturday, April 16, 2016 at 1:02 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

A general recommendation, aside from the issue itself: do not store dates as 
strings. I recommend making them ints here. It will be much faster in both cases.
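
A minimal sketch of the int-date idea, assuming a yyyymmdd encoding. The 
encoding and the helper names date_to_int and int_to_date are illustrative 
assumptions, not something taken from the original table:

```python
from datetime import date, datetime


def date_to_int(value: str) -> int:
    """Encode an ISO 'YYYY-MM-DD' date string as a yyyymmdd integer."""
    parsed = datetime.strptime(value, "%Y-%m-%d").date()
    return parsed.year * 10000 + parsed.month * 100 + parsed.day


def int_to_date(value: int) -> date:
    """Decode a yyyymmdd integer back into a date."""
    return date(value // 10000, (value // 100) % 100, value % 100)


# yyyymmdd integers sort in the same order as the dates they encode,
# so the range filter from the query can be expressed on ints directly.
start, end = date_to_int("2016-01-06"), date_to_int("2016-04-02")
where_clause = f"event_date >= {start} AND event_date <= {end}"
```

With this encoding the filter compares integers instead of strings, while 
the bounds keep the same meaning as in the original query.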

It could be that you load the data differently into the two tables. Generally, 
for tables like these, you should insert the data sorted in both cases.
It could also be that you compress the files in one case and not in the other. 
It is always good practice to spell out all options in the CREATE TABLE 
statement, even the default ones.
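
As a rough sketch of what "all options in the statement" could look like, the 
helper below generates the DDL for both formats with the compression codec 
written out. The table names, the helper create_table_ddl, and the chosen 
codecs are hypothetical, and it assumes the table properties orc.compress and 
parquet.compression are honored by your Hive version:

```python
# Hypothetical helper: builds HiveQL with the compression option spelled out.
COMPRESSION_PROPERTY = {
    "ORC": "orc.compress",
    "PARQUET": "parquet.compression",
}


def create_table_ddl(table: str, file_format: str, codec: str) -> str:
    """Return a CREATE TABLE statement with an explicit compression codec."""
    fmt = file_format.upper()
    prop = COMPRESSION_PROPERTY[fmt]  # raises KeyError for unknown formats
    return (
        f"CREATE TABLE {table} (\n"
        "  bookings DOUBLE,\n"
        "  dealviews INT\n"
        ")\n"
        "PARTITIONED BY (event_date STRING)\n"
        f"STORED AS {fmt}\n"
        f"TBLPROPERTIES ('{prop}'='{codec}')"
    )


# Assumed codecs, chosen only for illustration.
orc_ddl = create_table_ddl("myTable_orc", "orc", "ZLIB")
parquet_ddl = create_table_ddl("myTable_parquet", "parquet", "SNAPPY")
```

Generating both statements from one template makes it easier to see that the 
two tables differ only in format and codec, which keeps the comparison fair.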

Your Hive version seems a little outdated. Do you use Spark as an execution 
engine? If so, you should upgrade to a newer version of Hive. The Spark 
execution engine on Hive is still a little more experimental than Tez. It also 
depends on which distribution you are using.

Normally I would expect both of them to perform similarly.

On 16 Apr 2016, at 09:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,
I am executing one query:
  SELECT `event_date` AS `event_date`,
         sum(`bookings`) AS `bookings`,
         sum(`dealviews`) AS `dealviews`
  FROM myTable
  WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
  GROUP BY `event_date`
  LIMIT 20000

My table was created with something like:
  CREATE TABLE myTable (
    bookings   DOUBLE,
    dealviews  INT
  )
  PARTITIONED BY (event_date STRING)
  STORED AS ORC;  -- or STORED AS PARQUET for the other table

Parquet takes 9 seconds of cumulative CPU time.
ORC takes 50 seconds of cumulative CPU time.

For ORC I have tried setting 
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
but it didn't change anything.

Am I missing something, or is Parquet better for this type of query?

I am using Spark 1.6.0 with Hive 1.1.0.

thanks
