Parquet makes more sense, particularly for the kind of query you have.

Still, you might want to be careful about creating huge gzipped files.
Impala's gzip decompressor for Parquet is also not streaming.
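For readers following along: the Avro-to-Parquet rewrite discussed in this thread would normally be a Hive CTAS. Here is a small Python sketch that builds such a statement (the target table name is hypothetical, and note that a plain CTAS drops the `day` partitioning; re-partitioning would need `PARTITIONED BY` plus Hive's dynamic-partition settings):

```python
# Sketch: build the Hive CTAS statement that rewrites a gzipped Avro
# table as Parquet. The destination table name is a made-up example.

def parquet_ctas(src_table: str, dst_table: str) -> str:
    """Return a CREATE TABLE AS SELECT statement that materializes
    src_table as a Parquet-backed table. Caveat: a plain CTAS like
    this does not preserve partitioning."""
    return (
        f"CREATE TABLE {dst_table} "
        f"STORED AS PARQUET "
        f"AS SELECT * FROM {src_table}"
    )

stmt = parquet_ctas("adhoc_data_fast.log", "adhoc_data_fast.log_parquet")
print(stmt)
# The statement would then be executed in the Hive shell or via:
#   hive -e "<stmt>"
```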

On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <[email protected]> wrote:

> So as a workaround, does it make sense to convert the table to Parquet
> with Hive?
>
> And I think it would be better to mention this in the Avro table
> documentation, because it is unexpected behavior for many users.
>
> Alex Behm <[email protected]> wrote on Thursday, April 6, 2017, at 02:52:
>
>> Gzip supports streaming decompression, but we currently only implement
>> that for text tables.
>>
>> Doing streaming decompression certainly makes sense for Avro as well.
>>
>> I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
>> improvement.
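For reference, streaming decompression keeps memory bounded by the read-chunk size no matter how large the uncompressed payload is. A minimal Python sketch of the idea using zlib:

```python
import zlib

def stream_decompress(fileobj, chunk_size=64 * 1024):
    """Yield decompressed chunks of a gzip stream read from fileobj,
    keeping memory bounded by roughly chunk_size regardless of the
    total uncompressed size -- this is what a streaming decoder buys
    you over a one-shot allocation."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header/trailer.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        out = d.decompress(chunk)
        if out:
            yield out
    tail = d.flush()
    if tail:
        yield tail
```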
>>
>> On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <[email protected]>
>> wrote:
>>
>> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <[email protected]> wrote:
>> > Will Impala load the whole file into memory? That sounds horrible. And
>> > according to "show partitions adhoc_data_fast.log", the compressed files
>> > are no bigger than 4GB:
>>
>> The *uncompressed* size of one of your files is 50GB. Gzip needs to
>> allocate memory for that.
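As a sanity check, the true uncompressed size of a .gz file can be measured by streaming through it, as in this Python sketch. (The ISIZE field in the gzip trailer stores the size modulo 2^32 only, so it cannot be trusted for files that expand past 4 GB.)

```python
import gzip

def uncompressed_size(path, chunk_size=1024 * 1024):
    """Stream through a .gz file and count the uncompressed bytes.
    The gzip trailer's ISIZE field is only the size mod 2**32, so
    files that expand past 4 GB have to be measured like this."""
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```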
>>
>> >
>> > | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
>> > | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
>> >
>> >
>> > Marcel Kornacker <[email protected]> wrote on Thursday, April 6, 2017, at 12:58 AM:
>> >>
>> >> Apparently you have a gzipped file that is >=50GB. You either need to
>> >> break up those files, or run on larger machines.
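If the underlying data were plain gzipped text, a large .gz could be re-chunked by streaming, as in this Python sketch (the part-naming scheme is made up here). Note that a byte-level split like this is not valid for Avro container files, which carry a header and schema; those would have to be rewritten via Hive or avro-tools instead.

```python
import gzip

def split_gzip(src_path, dst_prefix, max_uncompressed=1 << 30,
               chunk_size=1 << 20):
    """Re-chunk one large .gz file into several smaller .gz files,
    each holding at most max_uncompressed bytes of uncompressed data.
    Streams throughout, so memory stays bounded by chunk_size.
    Returns the number of part files written."""
    part, written, out = 0, 0, None
    with gzip.open(src_path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            while chunk:
                if out is None:
                    # Hypothetical naming scheme: <prefix>.00000.gz, ...
                    out = gzip.open(f"{dst_prefix}.{part:05d}.gz", "wb")
                room = max_uncompressed - written
                out.write(chunk[:room])
                written += len(chunk[:room])
                chunk = chunk[room:]
                if written >= max_uncompressed:
                    out.close()
                    out, part, written = None, part + 1, 0
    if out is not None:
        out.close()
        part += 1
    return part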
>> >>
>> >> On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > I've been using Impala in production for a while. But since yesterday,
>> >> > some queries have been reporting "memory limit exceeded". Then I tried
>> >> > a very simple count query, and it still exceeds the memory limit.
>> >> >
>> >> > The query is:
>> >> >
>> >> > select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
>> >> > day<='2017-04-06';
>> >> >
>> >> > And the response in the Impala shell is:
>> >> >
>> >> > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
>> >> > http://szq7.appadhoc.com:25000)
>> >> > Query progress can be monitored at:
>> >> >
>> >> > http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
>> >> > WARNINGS:
>> >> > Memory limit exceeded
>> >> > GzipDecompressor failed to allocate 54525952000 bytes.
>> >> >
>> >> > I have many nodes, and each of them has lots of memory available
>> >> > (~60 GB). The query fails very quickly after I execute it, and the
>> >> > nodes show almost no memory usage.
>> >> >
>> >> > The table "adhoc_data_fast.log" is an Avro table, compressed with
>> >> > gzip and partitioned by the field "day". Each partition has no more
>> >> > than one billion rows.
>> >> >
>> >> > My Impala version is:
>> >> >
>> >> > hdfs@szq7:/home/ubuntu$ impalad --version
>> >> > impalad version 2.7.0-cdh5.9.1 RELEASE (build
>> >> > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
>> >> > Built on Wed Jan 11 13:39:25 PST 2017
>> >> >
>> >> > Can anyone help with this? Thanks very much!
>> >> >
>>
>>
>>
