Hi Yong, I'm confused. I'm using Hive 0.12.0; shouldn't that be using "Stinger" by default? Or are there configurations that have to be enabled?
Anthony Mattas
anth...@mattas.net

On Wed, Mar 5, 2014 at 11:06 AM, java8964 <java8...@hotmail.com> wrote:
> Your files are too small for any meaningful test of these 3 file types.
>
> Most of the 23 seconds is spent preparing/starting your MR job and
> shutting it down.
>
> You need at least gigabytes of data to compare the performance of these
> 3 types and get any meaningful result.
>
> But as long as it is Hive on top of MapReduce, it will be really hard to
> achieve an "interactive" result. MapReduce is a batch mode, period.
>
> You do want to consider Impala/Spark or Apache Stinger if you really are
> looking for "interactive".
>
> Yong
>
> ------------------------------
> Date: Wed, 5 Mar 2014 09:02:32 -0500
> Subject: Re: Benchmarking Hive Changes
> From: anth...@mattas.net
> To: user@hadoop.apache.org
>
> Yes, I'm using the HortonWorks Data Platform 2.0 Sandbox, which is a
> standalone box.
>
> But shame on me, it looks like the files are both very tiny (46K). I'm
> seeing about 23 seconds per query, which appears mostly to be starting up
> MR.
>
> So I'm going to find a new data set and try again. Are there any types of
> optimizations that can be done to reduce the start-up time?
>
> Ultimately I'm trying to compare the response time in Hive versus an EDW
> platform. Of course I still expect the EDW to perform better,
> but with the advancements in the newer versions of Hive I'm hoping for at
> least a reasonable response for a user wishing to do interactive querying.
> Specifically using Hive; I know you can get really good performance out of
> Impala, but am not yet interested in going that route.
>
> Anthony Mattas
> anth...@mattas.net
>
> On Wed, Mar 5, 2014 at 8:47 AM, java8964 <java8...@hotmail.com> wrote:
>
> Are you running on a standalone single box? How large are your test files,
> and how long did the jobs of each type take?
>
> Yong
>
> > From: anth...@mattas.net
> > Subject: Benchmarking Hive Changes
> > Date: Tue, 4 Mar 2014 21:31:42 -0500
> > To: user@hadoop.apache.org
> >
> > I've been trying to benchmark some of the Hive enhancements in Hadoop
> > 2.0 using the HDP Sandbox.
> >
> > I took one of their example queries and executed it with the tables
> > stored as TEXTFILE, RCFILE, and ORC. I also tried enabling
> > vectorized execution and predicate pushdown.
> >
> > SELECT s07.description, s07.salary, s08.salary,
> >   s08.salary - s07.salary
> > FROM
> >   sample_07 s07 JOIN sample_08 s08
> > ON ( s07.code = s08.code)
> > WHERE
> >   s07.salary < s08.salary
> > SORT BY s08.salary-s07.salary DESC
> >
> > Ultimately there was not much difference in performance between any of
> > the executions. Can someone clarify for me whether I need an actual full
> > cluster to see performance improvements, or if I'm missing something
> > else? I thought at minimum I would have seen an improvement moving to
> > ORC from TEXTFILE.
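
For reference, the setup described in the original message (an ORC copy of a table, plus predicate pushdown and vectorized execution) might look roughly like the HiveQL below. This is only a sketch under assumptions: the table name sample_07_orc is hypothetical, and whether each property is recognised depends on the Hive build in use. In Apache Hive, hive.vectorized.execution.enabled and hive.execution.engine were added in release 0.13, so a 0.12.0 install may not honour them unless the distribution has backported them.

```sql
-- Hypothetical ORC copy of the sample_07 table used in the query above.
CREATE TABLE sample_07_orc STORED AS ORC
AS SELECT * FROM sample_07;

-- Predicate pushdown in the query planner.
SET hive.optimize.ppd=true;
-- Lets the ORC reader use the pushed-down predicates to skip data.
SET hive.optimize.index.filter=true;

-- Vectorized execution (Apache Hive 0.13+; applies to ORC tables).
SET hive.vectorized.execution.enabled=true;

-- Switch from MapReduce to Tez (Apache Hive 0.13+, requires Tez to be installed).
-- SET hive.execution.engine=tez;
```

With 46K input files, none of these settings would be expected to change the roughly 23-second wall time much, since, as noted in the thread, most of that time goes to starting and stopping the MapReduce job.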