RE: Benchmarking Hive Changes

java8964 Wed, 05 Mar 2014 08:08:02 -0800

Your files are too small for any meaningful test of these 3 file types.
Most of the 23 seconds is spent on preparing and starting your MR job and 
shutting it down.
You need at least gigabytes of data to compare the performance of these 3 
types and get any meaningful result.
But as long as it is Hive on top of MapReduce, it will be really hard to 
achieve an "interactive" result. MapReduce is a batch-mode framework, period.
You should consider Impala, Spark, or Stinger if you are really looking for 
"interactive".
Yong
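
One way to see where the start-up cost comes from is to ask Hive for the query 
plan. This is only a minimal sketch: it reuses the query quoted further down 
in this thread and assumes the sample_07 and sample_08 tables from the sandbox. 
EXPLAIN prints the plan and does not run the job.

```sql
-- Prints the stage plan without executing the query.
-- On Hive over MapReduce, each "Map Reduce" stage listed in the plan is
-- typically launched as its own job, with its own start-up and shutdown cost.
EXPLAIN
SELECT s07.description, s07.salary, s08.salary,
  s08.salary - s07.salary
FROM
  sample_07 s07 JOIN sample_08 s08
ON (s07.code = s08.code)
WHERE
  s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC;
```

With 46K input files, the per-stage overhead shown here is likely to dominate 
whichever storage format is used.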

Date: Wed, 5 Mar 2014 09:02:32 -0500
Subject: Re: Benchmarking Hive Changes
From: [email protected]
To: [email protected]

Yes, I'm using the HortonWorks Data Platform 2.0 Sandbox which is a standalone 
box.

But shame on me, it looks like the files are both very tiny (46K). I'm seeing 
about 23 seconds per query, which appears to be mostly MR start-up time.

So I'm going to find a new data set and try again. Are there any 
optimizations that can be done to reduce the start-up time?

Ultimately I'm trying to compare the response time in Hive versus an EDW 
platform. Of course I still expect the EDW to perform better, but with the 
advancements in the newer versions of Hive I'm hoping for at least a 
reasonable response for a user wishing to do interactive querying, 
specifically using Hive. I know you can get really good performance out of 
Impala, but I am not yet interested in going that route.
Anthony Mattas
[email protected]

On Wed, Mar 5, 2014 at 8:47 AM, java8964 <[email protected]> wrote:

Are you running this on a single standalone box? How large are your test 
files, and how long did the jobs for each type take?
Yong

> From: [email protected]
> Subject: Benchmarking Hive Changes
> Date: Tue, 4 Mar 2014 21:31:42 -0500
> To: [email protected]
> 
> I’ve been trying to benchmark some of the Hive enhancements in Hadoop 2.0 
> using the HDP Sandbox. 
> 
> I took one of their example queries and executed it with the tables stored as 
> TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized execution and 
> predicate pushdown.
> 
> SELECT s07.description, s07.salary, s08.salary,
>   s08.salary - s07.salary
> FROM
>   sample_07 s07 JOIN sample_08 s08
> ON (s07.code = s08.code)
> WHERE
>   s07.salary < s08.salary
> SORT BY s08.salary - s07.salary DESC
> 
> Ultimately there was not much difference in performance between any of the 
> executions. Can someone clarify for me whether I need an actual full cluster 
> to see performance improvements, or whether I’m missing something else? I 
> thought at minimum I would have seen an improvement moving to ORC from 
> TEXTFILE.
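
The setup described in the original message can be sketched roughly as below. 
The thread does not show how the tables were converted or which settings were 
used, so the table names sample_07_orc and sample_08_orc are hypothetical, and 
the SET properties are the standard Hive ones for these features rather than 
anything confirmed by the poster.

```sql
-- Hypothetical ORC copies of the sandbox tables (names are assumptions).
CREATE TABLE sample_07_orc STORED AS ORC AS SELECT * FROM sample_07;
CREATE TABLE sample_08_orc STORED AS ORC AS SELECT * FROM sample_08;

-- Vectorized execution: process rows in batches instead of one at a time.
SET hive.vectorized.execution.enabled=true;

-- Predicate pushdown in the optimizer, and filter pushdown to the ORC reader.
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

-- Same query as in the message, pointed at the ORC tables.
SELECT s07.description, s07.salary, s08.salary,
  s08.salary - s07.salary
FROM
  sample_07_orc s07 JOIN sample_08_orc s08
ON (s07.code = s08.code)
WHERE
  s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC;
```

Swapping STORED AS ORC for STORED AS RCFILE or STORED AS TEXTFILE gives the 
other two variants, so the same query can be timed against each format once 
the data set is large enough to matter.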
