Is the snappy decompressor for AVRO or Parquet streaming?

Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 9:27 AM:
> I'd say following the best practices with Parquet should work fine. Create
> snappy-compressed Parquet files of roughly 256MB in size.
> If you want to stick with Avro, then yes, you'll just have to create
> smaller files.
>
> On Wed, Apr 5, 2017 at 6:23 PM, Bin Wang <[email protected]> wrote:
>
> > So the best I can do to work around this for now is to split the files
> > into smaller files?
> >
> > Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 9:14 AM:
> >
> > > Parquet makes more sense, particularly for the kind of query you have.
> > >
> > > Still, you might want to be careful with creating huge gzipped files.
> > > Impala's gzip decompressor for Parquet is also not streaming.
> > >
> > > On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <[email protected]> wrote:
> > >
> > > > So as a workaround, does it make sense to convert it to a Parquet
> > > > table with Hive?
> > > >
> > > > And I think it would be better to mention this in the Avro table
> > > > documentation, because it is unexpected behavior for many users.
> > > >
> > > > Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 02:52:
> > > >
> > > > > Gzip supports streaming decompression, but we currently only
> > > > > implement that for text tables.
> > > > >
> > > > > Doing streaming decompression certainly makes sense for Avro as
> > > > > well. I filed https://issues.apache.org/jira/browse/IMPALA-5170
> > > > > for this improvement.
> > > > >
> > > > > On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <[email protected]> wrote:
> > > > >
> > > > > > On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <[email protected]> wrote:
> > > > > >
> > > > > > > Will Impala load the whole file into memory? That sounds
> > > > > > > horrible. And according to "show partitions adhoc_data_fast.log",
> > > > > > > the compressed files are no bigger than 4GB:
> > > > > >
> > > > > > The *uncompressed* size of one of your files is 50GB. Gzip needs
> > > > > > to allocate memory for that.
> > > > > > > | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false |
> > > > > > > | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
> > > > > > > | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false |
> > > > > > > | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
> > > > > > >
> > > > > > > Marcel Kornacker <[email protected]> wrote on Thu, Apr 6, 2017 at 12:58 AM:
> > > > > > >
> > > > > > > > Apparently you have a gzipped file that is >=50GB. You either need to
> > > > > > > > break up those files, or run on larger machines.
> > > > > > > >
> > > > > > > > On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I've been using Impala in production for a while, but since
> > > > > > > > > yesterday some queries have been failing with "memory limit
> > > > > > > > > exceeded". Even a very simple count query hits the memory limit.
> > > > > > > > >
> > > > > > > > > The query is:
> > > > > > > > >
> > > > > > > > > select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
> > > > > > > > > day<='2017-04-06';
> > > > > > > > >
> > > > > > > > > And the response in the Impala shell is:
> > > > > > > > >
> > > > > > > > > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
> > > > > > > > > http://szq7.appadhoc.com:25000)
> > > > > > > > > Query progress can be monitored at:
> > > > > > > > > http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
> > > > > > > > > WARNINGS:
> > > > > > > > > Memory limit exceeded
> > > > > > > > > GzipDecompressor failed to allocate 54525952000 bytes.
> > > > > > > > >
> > > > > > > > > I have many nodes, and each of them has plenty of memory
> > > > > > > > > available (~60 GB). The query fails very quickly after I
> > > > > > > > > execute it, and the nodes show almost no memory usage.
> > > > > > > > >
> > > > > > > > > The table "adhoc_data_fast.log" is an Avro table compressed
> > > > > > > > > with gzip and partitioned by the field "day". Each partition
> > > > > > > > > has no more than one billion rows.
> > > > > > > > >
> > > > > > > > > My Impala version is:
> > > > > > > > >
> > > > > > > > > hdfs@szq7:/home/ubuntu$ impalad --version
> > > > > > > > > impalad version 2.7.0-cdh5.9.1 RELEASE (build
> > > > > > > > > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
> > > > > > > > > Built on Wed Jan 11 13:39:25 PST 2017
> > > > > > > > >
> > > > > > > > > Can anyone help with this? Thanks very much!
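[Editor's note] The streaming-versus-buffered distinction discussed in this thread can be illustrated outside Impala. A non-streaming decompressor must materialize the entire uncompressed payload at once, which is exactly why a file that inflates to ~50GB triggers a single ~50GB allocation ("GzipDecompressor failed to allocate 54525952000 bytes"), while a streaming reader's peak memory is bounded by its chunk size. A minimal Python sketch using the standard-library gzip module; file names and the line-counting task are illustrative, not Impala's actual code:

```python
import gzip

def count_lines_streaming(path, chunk_size=1 << 20):
    """Streaming: decompress in fixed-size chunks, so peak memory stays
    near chunk_size regardless of the uncompressed size."""
    count = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

def count_lines_buffered(path):
    """Non-streaming: one read() materializes the whole uncompressed
    payload, so a file that inflates to 50GB needs ~50GB of memory."""
    with gzip.open(path, "rb") as f:
        data = f.read()  # single allocation for the entire output
    return data.count(b"\n")
```

Both functions return the same answer; only their peak memory differs, which mirrors the difference between Impala's text-table gzip path (streaming) and its Avro/Parquet gzip path (buffered) as described above.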

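[Editor's note] The workaround suggested above, breaking one huge gzipped file into smaller ones, can be done with any streaming tool. A hedged Python sketch follows; names and the 256MB default (chosen to echo the Parquet file-size advice in the thread) are illustrative. Important caveat: cutting at line boundaries is only safe for line-oriented text data; Avro container files are a binary format and would need to be rewritten with Avro-aware tools instead (for example via the Hive conversion discussed in the thread).

```python
import gzip
import os

def split_gzip_lines(src, dest_dir, max_uncompressed=256 * 1024 * 1024):
    """Stream one large gzip file into several smaller gzip files,
    cutting only at line boundaries so no record is split. Peak memory
    stays near one line plus the compressor's buffers.
    NOTE: only valid for line-oriented text data, NOT for Avro files."""
    os.makedirs(dest_dir, exist_ok=True)
    parts, out, written = [], None, 0
    with gzip.open(src, "rb") as f:
        for line in f:
            # Start a new output part when none is open or the current
            # one has reached the uncompressed-size budget.
            if out is None or written >= max_uncompressed:
                if out is not None:
                    out.close()
                name = os.path.join(dest_dir, "part-%05d.gz" % len(parts))
                out = gzip.open(name, "wb")
                parts.append(name)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Because both reading and writing are streamed, this re-chunking itself never needs more than a small, bounded amount of memory, unlike the failed 50GB allocation that started this thread.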