Is the snappy decompressor for AVRO or Parquet streaming?

Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 9:27 AM:
> I'd say following the best practices with Parquet should work fine. Create
> snappy-compressed Parquet files of roughly 256MB in size.
> If you want to stick with Avro, then yes, you'll just have to create
> smaller files.
>
> On Wed, Apr 5, 2017 at 6:23 PM, Bin Wang <[email protected]> wrote:
>
> > So the best I can do to work around this for now is to split the files
> > into smaller files?
> >
> > Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 9:14 AM:
> >
> > > Parquet makes more sense, particularly for the kind of query you have.
> > >
> > > Still, you might want to be careful with creating huge gzipped files.
> > > Impala's gzip decompressor for Parquet is also not streaming.
> > >
> > > On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <[email protected]> wrote:
> > >
> > > > So as a workaround, does it make sense to convert it to a Parquet
> > > > table with Hive?
> > > >
> > > > And I think it would be better to mention this in the Avro table
> > > > documentation, because it is unexpected behavior for many users.
> > > >
> > > > Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 02:52:
> > > >
> > > > > Gzip supports streaming decompression, but we currently only
> > > > > implement that for text tables.
> > > > >
> > > > > Doing streaming decompression certainly makes sense for Avro as
> > > > > well. I filed https://issues.apache.org/jira/browse/IMPALA-5170
> > > > > for this improvement.
> > > > >
> > > > > On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <[email protected]> wrote:
> > > > >
> > > > > > On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <[email protected]> wrote:
> > > > > >
> > > > > > > Will Impala load the whole file into memory? That sounds
> > > > > > > horrible. And according to "show partitions adhoc_data_fast.log",
> > > > > > > the compressed files are no bigger than 4GB:
> > > > > >
> > > > > > The *uncompressed* size of one of your files is 50GB. Gzip needs
> > > > > > to allocate memory for that.
> > > > > > > | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false |
> > > > > > > | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
> > > > > > > | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false |
> > > > > > > | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
> > > > > > >
> > > > > > > Marcel Kornacker <[email protected]> wrote on Thu, Apr 6, 2017 at 12:58 AM:
> > > > > > >
> > > > > > > > Apparently you have a gzipped file that is >=50GB. You either need to
> > > > > > > > break up those files, or run on larger machines.
> > > > > > > >
> > > > > > > > On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I've been using Impala in production for a while, but since
> > > > > > > > > yesterday some queries have been failing with "memory limit
> > > > > > > > > exceeded". Even a very simple count query hits the memory limit.
> > > > > > > > >
> > > > > > > > > The query is:
> > > > > > > > >
> > > > > > > > > select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
> > > > > > > > > day<='2017-04-06';
> > > > > > > > >
> > > > > > > > > And the response in the Impala shell is:
> > > > > > > > >
> > > > > > > > > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
> > > > > > > > > http://szq7.appadhoc.com:25000)
> > > > > > > > > Query progress can be monitored at:
> > > > > > > > > http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
> > > > > > > > > WARNINGS:
> > > > > > > > > Memory limit exceeded
> > > > > > > > > GzipDecompressor failed to allocate 54525952000 bytes.
> > > > > > > > >
> > > > > > > > > I have many nodes, and each of them has plenty of memory
> > > > > > > > > available (~60 GB). The query fails very quickly after I
> > > > > > > > > execute it, and the nodes show almost no memory usage.
> > > > > > > > >
> > > > > > > > > The table "adhoc_data_fast.log" is an Avro table compressed
> > > > > > > > > with gzip and partitioned by the field "day". Each partition
> > > > > > > > > has no more than one billion rows.
> > > > > > > > >
> > > > > > > > > My Impala version is:
> > > > > > > > >
> > > > > > > > > hdfs@szq7:/home/ubuntu$ impalad --version
> > > > > > > > > impalad version 2.7.0-cdh5.9.1 RELEASE (build
> > > > > > > > > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
> > > > > > > > > Built on Wed Jan 11 13:39:25 PST 2017
> > > > > > > > >
> > > > > > > > > Can anyone help with this? Thanks very much!
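[Editor's note] The streaming-versus-buffered distinction discussed in this thread can be illustrated outside Impala. A non-streaming decompressor must materialize the entire uncompressed payload at once, which is exactly why a file that inflates to ~50GB triggers a single ~50GB allocation ("GzipDecompressor failed to allocate 54525952000 bytes"), while a streaming reader's peak memory is bounded by its chunk size. A minimal Python sketch using the standard-library gzip module; file names and the line-counting task are illustrative, not Impala's actual code:

```python
import gzip

def count_lines_streaming(path, chunk_size=1 << 20):
    """Streaming: decompress in fixed-size chunks, so peak memory stays
    near chunk_size regardless of the uncompressed size."""
    count = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

def count_lines_buffered(path):
    """Non-streaming: one read() materializes the whole uncompressed
    payload, so a file that inflates to 50GB needs ~50GB of memory."""
    with gzip.open(path, "rb") as f:
        data = f.read()  # single allocation for the entire output
    return data.count(b"\n")
```

Both functions return the same answer; only their peak memory differs, which mirrors the difference between Impala's text-table gzip path (streaming) and its Avro/Parquet gzip path (buffered) as described above.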

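[Editor's note] The workaround suggested above, breaking one huge gzipped file into smaller ones, can be done with any streaming tool. A hedged Python sketch follows; names and the 256MB default (chosen to echo the Parquet file-size advice in the thread) are illustrative. Important caveat: cutting at line boundaries is only safe for line-oriented text data; Avro container files are a binary format and would need to be rewritten with Avro-aware tools instead (for example via the Hive conversion discussed in the thread).

```python
import gzip
import os

def split_gzip_lines(src, dest_dir, max_uncompressed=256 * 1024 * 1024):
    """Stream one large gzip file into several smaller gzip files,
    cutting only at line boundaries so no record is split. Peak memory
    stays near one line plus the compressor's buffers.
    NOTE: only valid for line-oriented text data, NOT for Avro files."""
    os.makedirs(dest_dir, exist_ok=True)
    parts, out, written = [], None, 0
    with gzip.open(src, "rb") as f:
        for line in f:
            # Start a new output part when none is open or the current
            # one has reached the uncompressed-size budget.
            if out is None or written >= max_uncompressed:
                if out is not None:
                    out.close()
                name = os.path.join(dest_dir, "part-%05d.gz" % len(parts))
                out = gzip.open(name, "wb")
                parts.append(name)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Because both reading and writing are streamed, this re-chunking itself never needs more than a small, bounded amount of memory, unlike the failed 50GB allocation that started this thread.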