I typically rewrite the query so that it reads from a limited subset of the table instead of the whole table.

Change

select really_expensive_select_clause
from
really_big_table
where
something=something
group by something

to

select really_expensive_select_clause
from
(
select
*
from
really_big_table
limit 100
)t
where
something=something
group by something
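
As a concrete sketch of the same pattern (the table page_views and the columns country and bytes_sent are made-up names, used only for illustration): the inner LIMIT caps the rows that reach the filter and the aggregation. Because those 100 rows are an arbitrary subset, the output is only useful for checking that the query runs and returns sensibly shaped results, not for checking the numbers.

```sql
-- Hypothetical table and columns, for illustration only.
SELECT country,
       SUM(bytes_sent) AS total_bytes
FROM (
  SELECT *
  FROM page_views
  LIMIT 100          -- cap the input before the WHERE and GROUP BY run
) t                  -- Hive requires an alias on the subquery
WHERE country IS NOT NULL
GROUP BY country;
```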

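The settings and the sampling idea from the quoted message below could be tried together roughly like this. This is a sketch under assumptions: page_views and country are hypothetical names, whether the settings take effect depends on the Hive release, and on a table that is not bucketed the sampling clause still scans the whole table.

```sql
-- Settings named in the quoted reply; behaviour depends on the Hive version.
SET hive.limit.optimize.enable=true;
SET hive.exec.mode.local.auto=true;

-- Hypothetical table: keep roughly 1 row in 100 by bucketing on rand().
SELECT country,
       COUNT(*) AS n
FROM page_views TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s
GROUP BY country;
```
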

On Tue, Mar 5, 2013 at 10:57 AM, Dean Wampler
<dean.wamp...@thinkbiganalytics.com> wrote:
> Unfortunately, it will still go through the whole thing, then just limit the
> output. However, there's a flag that I think only works in more recent Hive
> releases:
>
> set hive.limit.optimize.enable=true
>
> This is supposed to apply limiting earlier in the data stream, so it will
> give different results than limiting just the output.
>
> Like Chuck said, you might consider sampling, but unless your table is
> organized into buckets, you'll still scan the whole table, though perhaps
> without doing all of the computation over it.
>
> Also, if you have a small sample data set:
>
> set hive.exec.mode.local.auto=true
>
> will cause Hive to bypass the Job and Task Trackers, calling APIs directly,
> when it can do the whole thing in a single process. Not "lightning fast",
> but faster.
>
> dean
>
> On Tue, Mar 5, 2013 at 12:48 PM, Joey D'Antoni <jdant...@yahoo.com> wrote:
>>
>> Just add a limit 1 to the end of your query.
>>
>>
>>
>>
>> On Mar 5, 2013, at 1:45 PM, Kyle B <kbi...@gmail.com> wrote:
>>
>> Hello,
>>
>> I was wondering if there is a way to quick-verify a Hive query before it
>> is run against a big dataset? The tables I am querying against have millions
>> of records, and I'd like to verify my Hive query before I run it against all
>> records.
>>
>> Is there a way to test the query against a small subset of the data,
>> without going into full MapReduce? As silly as this sounds, is there a way
>> to MapReduce without the overhead of MapReduce? That way I can check my
>> query is doing what I want before I run it against all records.
>>
>> Thanks,
>>
>> -Kyle
>
>
>
>
> --
> Dean Wampler, Ph.D.
> thinkbiganalytics.com
> +1-312-339-1330
>
