Hi, I will put the date in the correct format in the future and see if that changes anything. The query that I sent is just an example of one possible aggregation; I have many of them possible on the same table, so I am not sure that sorting for all of them would actually have an impact.
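For reference, the date-as-int format Jörn suggests below can be sketched in plain Python (a standalone illustration; `date_to_int` is a hypothetical helper name, not part of any query in this thread):

```python
# Sketch of storing dates as ints instead of strings.
# '2016-01-06' becomes 20160106; a range filter stays correct because
# lexicographic order on 'YYYY-MM-DD' strings matches numeric order on
# the YYYYMMDD ints, while comparisons and min/max column statistics
# (used for predicate pushdown) are cheaper on ints.

def date_to_int(s: str) -> int:
    """Convert 'YYYY-MM-DD' to the int YYYYMMDD (hypothetical helper)."""
    y, m, d = s.split("-")
    return int(y) * 10000 + int(m) * 100 + int(d)

# Same window as the WHERE clause in the query below.
lo, hi = date_to_int("2016-01-06"), date_to_int("2016-04-02")

dates = ["2015-12-31", "2016-01-06", "2016-03-15", "2016-04-03"]
in_range = [s for s in dates if lo <= date_to_int(s) <= hi]
print(in_range)  # ['2016-01-06', '2016-03-15']
```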
I am using the latest release of Cloudera and I didn't modify any versions. Do you think that I should try to manually update Hive? Thanks.

From: Jörn Franke <jornfra...@gmail.com>
Date: Saturday, April 16, 2016 at 1:02 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Generally, a recommendation (besides the issue): do not store dates as strings. I recommend making them ints; it will be much faster in both cases. It could be that you load the data differently into the two tables. Generally, for these tables you should insert the data sorted in both cases. It could also be that in one case you compress the file and in the other you do not. It is always good practice to spell out all options in the CREATE TABLE statement, even the default ones. Your Hive seems a little bit outdated. Do you use Spark as an execution engine? Then you should upgrade to a newer version of Hive. The Spark execution engine on Hive is still a little more experimental than Tez. It also depends on which distribution you are using. Normally I would expect both formats to perform similarly.

On 16 Apr 2016, at 09:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,
I am executing one query:

    SELECT `event_date` as `event_date`,
           sum(`bookings`) as `bookings`,
           sum(`dealviews`) as `dealviews`
    FROM myTable
    WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
    GROUP BY `event_date`
    LIMIT 20000

My table was created with something like:

    CREATE TABLE myTable (
      bookings  DOUBLE,
      dealviews INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC  -- or PARQUET

Parquet takes 9 seconds of cumulative CPU; ORC takes 50 seconds of cumulative CPU.
For ORC I have tried hiveContext.setConf("spark.sql.orc.filterPushdown", "true"), but it didn't change anything. Am I missing something, or is Parquet simply better for this type of query? I am using Spark 1.6.0 with Hive 1.1.0. Thanks
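A hedged sketch of the setConf call above (Spark 1.6-era PySpark API; a config fragment, not run here since it needs a live Spark/Hive installation — the app name is made up for illustration). Note that Spark configuration keys are case-sensitive, so a capitalised spelling like "Spark.Sql.Orc.FilterPushdown" would be silently treated as an unknown key:

```python
# Config sketch (not executable without a Spark 1.6 cluster).
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-pushdown-check")  # hypothetical app name
hiveContext = HiveContext(sc)

# Lowercase key as documented; the conf is read when ORC files are scanned.
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
```

Worth noting: in Spark 1.6 this ORC pushdown flag defaulted to false, while the Parquet equivalent (spark.sql.parquet.filterPushdown) defaulted to true, which on its own can explain part of the gap you are seeing.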