Bin, for your original problem, have you confirmed that your files are highly (10x or more) compressed? Maybe the original problem was due to corrupt files, and they don't actually decompress to such a huge size.
On Thu, Apr 6, 2017 at 12:41 AM, Bin Wang <[email protected]> wrote:

> Hi,
>
> I've dug into it and found two files that trigger this problem. After
> removing them from the partition, I can query it again. But these two
> files are small both before and after compression (smaller than 400 MB).
> The only problem is that they have some corrupt data at the end. Is this
> a bug in Impala?
>
> Regards,
> Bin Wang
>
> Bin Wang <[email protected]> wrote on Thu, Apr 6, 2017 at 12:05 PM:
>
>> I converted these Avro files to JSON with avro-tools, and the JSON files
>> are no larger than 1 GB, so Impala should be able to read them. Some of
>> the Avro files are corrupt.
>>
>> 16M  log.2017-04-05.1491321605834.avro.json
>> 308M log.2017-04-05.1491323647211.avro.json
>> 103M log.2017-04-05.1491327241311.avro.json
>> 150M log.2017-04-05.1491330839609.avro.json
>> 397M log.2017-04-05.1491334439092.avro.json
>> 297M log.2017-04-05.1491338038503.avro.json
>> 160M log.2017-04-05.1491341639694.avro.json
>> 95M  log.2017-04-05.1491345239969.avro.json
>> 360M log.2017-04-05.1491348843931.avro.json
>> 338M log.2017-04-05.1491352442955.avro.json
>> 71M  log.2017-04-05.1491359648079.avro.json
>> 161M log.2017-04-05.1491363247597.avro.json
>> 628M log.2017-04-05.1491366845827.avro.json
>> 288M log.2017-04-05.1491370445873.avro.json
>> 162M log.2017-04-05.1491374045830.avro.json
>> 90M  log.2017-04-05.1491377650935.avro.json
>> 269M log.2017-04-05.1491381249597.avro.json
>> 620M log.2017-04-05.1491384846366.avro.json
>> 70M  log.2017-04-05.1491388450262.avro.json
>> 30M  log.2017-04-05.1491392047694.avro.json
>> 114M log.2017-04-05.1491395648818.avro.json
>> 370M log.2017-04-05.1491399246407.avro.json
>> 359M log.2017-04-05.1491402846469.avro.json
>> 218M log.2017-04-05.1491406180615.avro.json
>> 29M  log.2017-04-05.1491409790105.avro.json
>> 3.9M log.2017-04-05.1491413385884.avro.json
>> 9.3M log.2017-04-05.1491416981829.avro.json
>> 8.3M log.2017-04-05.1491420581588.avro.json
>> 2.3M log.2017-04-05.1491424180191.avro.json
>> 25M  log.2017-04-05.1491427781339.avro.json
>> 24M  log.2017-04-05.1491431382552.avro.json
>> 5.7M log.2017-04-05.1491434984679.avro.json
>> 35M  log.2017-04-05.1491438586674.avro.json
>> 5.8M log.2017-04-05.1491442192541.avro.json
>> 23M  log.2017-04-05.1491445789230.avro.json
>> 4.3M log.2017-04-05.1491449386630.avro.json
>>
>> Bin Wang <[email protected]> wrote on Thu, Apr 6, 2017 at 11:34 AM:
>>
>> And here is another question: how does Impala estimate the unzipped file
>> size? All the gzipped files are no bigger than 300 MB, so I think it
>> should be OK to unzip them.
>>
>> Bin Wang <[email protected]> wrote on Thu, Apr 6, 2017 at 9:31 AM:
>>
>> Is the snappy decompressor for Avro or Parquet streaming?
>>
>> Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 9:27 AM:
>>
>> I'd say following the best practices with Parquet should work fine.
>> Create snappy-compressed Parquet files of roughly 256 MB in size.
>> If you want to stick with Avro, then yes, you'll just have to create
>> smaller files.
>>
>> On Wed, Apr 5, 2017 at 6:23 PM, Bin Wang <[email protected]> wrote:
>>
>> So the best I can do to work around this for now is to split the files
>> into smaller files?
>>
>> Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 9:14 AM:
>>
>> Parquet makes more sense, particularly for the kind of query you have.
>>
>> Still, you might want to be careful with creating huge gzipped files.
>> Impala's gzip decompressor for Parquet is also not streaming.
>>
>> On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <[email protected]> wrote:
>>
>> So as a workaround, does it make sense to convert it to a Parquet table
>> with Hive?
>>
>> And I think it would be better to mention this in the Avro table
>> documentation, because it is unexpected behavior for many users.
>>
>> Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 02:52:
>>
>> Gzip supports streaming decompression, but we currently only implement
>> that for text tables.
>>
>> Doing streaming decompression certainly makes sense for Avro as well.
>> I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
>> improvement.
>>
>> On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <[email protected]> wrote:
>>
>> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <[email protected]> wrote:
>> > Will Impala load the whole file into memory? That sounds horrible. And
>> > according to "show partitions adhoc_data_fast.log", the compressed
>> > files are no bigger than 4 GB:
>>
>> The *uncompressed* size of one of your files is 50 GB. Gzip needs to
>> allocate memory for that.
>>
>> > | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
>> > | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
>>
>> Marcel Kornacker <[email protected]> wrote on Thu, Apr 6, 2017 at 12:58 AM:
>>
>> Apparently you have a gzipped file that is >= 50 GB. You either need to
>> break up those files or run on larger machines.
>>
>> On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <[email protected]> wrote:
>>
>> Hi,
>>
>> I've been using Impala in production for a while, but since yesterday
>> some queries have been reporting "memory limit exceeded". Even a very
>> simple count query fails with it.
>>
>> The query is:
>>
>>   select count(0) from adhoc_data_fast.log where day >= '2017-04-04' and day <= '2017-04-06';
>>
>> And the response in the Impala shell is:
>>
>>   Query submitted at: 2017-04-06 00:41:00 (Coordinator: http://szq7.appadhoc.com:25000)
>>   Query progress can be monitored at:
>>   http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
>>   WARNINGS:
>>   Memory limit exceeded
>>   GzipDecompressor failed to allocate 54525952000 bytes.
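For reference, the allocation size in that warning is the uncompressed buffer reserved for a single gzip stream, which matches the ">= 50 GB" estimate discussed in this thread:

```python
# Convert the failed allocation from the WARNINGS output above to GiB.
failed_alloc_bytes = 54_525_952_000
print(failed_alloc_bytes / 2**30)  # 50.78125
```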
>> I have many nodes and each of them has lots of memory available (~60 GB).
>> The query fails very quickly after I execute it, and the nodes show
>> almost no memory usage.
>>
>> The table "adhoc_data_fast.log" is an Avro table, encoded with gzip and
>> partitioned by the field "day". Each partition has no more than one
>> billion rows.
>>
>> My Impala version is:
>>
>>   hdfs@szq7:/home/ubuntu$ impalad --version
>>   impalad version 2.7.0-cdh5.9.1 RELEASE (build 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
>>   Built on Wed Jan 11 13:39:25 PST 2017
>>
>> Can anyone help with this? Thanks very much!
