There are some known outstanding issues that reference the word "corrupt":
https://issues.apache.org/jira/issues/?jql=project%20%3D%20impala%20and%20resolution%20is%20empty%20and%20text%20~%20corrupt

Feel free to post a new JIRA if you believe you are running into a new bug.

On Thu, Apr 6, 2017 at 12:41 AM, Bin Wang <[email protected]> wrote:

> Hi,
>
> I've dug into it and found two files that trigger this problem. After
> removing them from the partition, I can query it again. But these two
> files are small both before and after compression (smaller than 400 MB).
> The only problem is that they have some corrupt data at the end. Is this
> a bug in Impala?
>
> Regards,
> Bin Wang
>
> On Thu, Apr 6, 2017 at 12:05 PM, Bin Wang <[email protected]> wrote:
>
>> I converted these Avro files to JSON with avro-tools, and the JSON files
>> are no larger than 1 GB, so Impala should be able to read them. Some of
>> the Avro files are corrupt.
>>
>> 16M log.2017-04-05.1491321605834.avro.json
>> 308M log.2017-04-05.1491323647211.avro.json
>> 103M log.2017-04-05.1491327241311.avro.json
>> 150M log.2017-04-05.1491330839609.avro.json
>> 397M log.2017-04-05.1491334439092.avro.json
>> 297M log.2017-04-05.1491338038503.avro.json
>> 160M log.2017-04-05.1491341639694.avro.json
>> 95M log.2017-04-05.1491345239969.avro.json
>> 360M log.2017-04-05.1491348843931.avro.json
>> 338M log.2017-04-05.1491352442955.avro.json
>> 71M log.2017-04-05.1491359648079.avro.json
>> 161M log.2017-04-05.1491363247597.avro.json
>> 628M log.2017-04-05.1491366845827.avro.json
>> 288M log.2017-04-05.1491370445873.avro.json
>> 162M log.2017-04-05.1491374045830.avro.json
>> 90M log.2017-04-05.1491377650935.avro.json
>> 269M log.2017-04-05.1491381249597.avro.json
>> 620M log.2017-04-05.1491384846366.avro.json
>> 70M log.2017-04-05.1491388450262.avro.json
>> 30M log.2017-04-05.1491392047694.avro.json
>> 114M log.2017-04-05.1491395648818.avro.json
>> 370M log.2017-04-05.1491399246407.avro.json
>> 359M log.2017-04-05.1491402846469.avro.json
>> 218M log.2017-04-05.1491406180615.avro.json
>> 29M log.2017-04-05.1491409790105.avro.json
>> 3.9M log.2017-04-05.1491413385884.avro.json
>> 9.3M log.2017-04-05.1491416981829.avro.json
>> 8.3M log.2017-04-05.1491420581588.avro.json
>> 2.3M log.2017-04-05.1491424180191.avro.json
>> 25M log.2017-04-05.1491427781339.avro.json
>> 24M log.2017-04-05.1491431382552.avro.json
>> 5.7M log.2017-04-05.1491434984679.avro.json
>> 35M log.2017-04-05.1491438586674.avro.json
>> 5.8M log.2017-04-05.1491442192541.avro.json
>> 23M log.2017-04-05.1491445789230.avro.json
>> 4.3M log.2017-04-05.1491449386630.avro.json
>>
>> On Thu, Apr 6, 2017 at 11:34 AM, Bin Wang <[email protected]> wrote:
>>
>> And here is another question: how does Impala estimate the unzipped file
>> size? All the gzipped files are no bigger than 300 MB, so I would think
>> they are OK to unzip.
>>
>> On Thu, Apr 6, 2017 at 9:31 AM, Bin Wang <[email protected]> wrote:
>>
>> Is the snappy decompressor for Avro or Parquet streaming?
>>
>> On Thu, Apr 6, 2017 at 9:27 AM, Alex Behm <[email protected]> wrote:
>>
>> I'd say following the best practices with Parquet should work fine.
>> Create snappy-compressed Parquet files of roughly 256 MB in size.
>> If you want to stick with Avro, then yes, you'll just have to create
>> smaller files.
>>
>> On Wed, Apr 5, 2017 at 6:23 PM, Bin Wang <[email protected]> wrote:
>>
>> So the best workaround for now is to split the files into smaller files?
>>
>> On Thu, Apr 6, 2017 at 9:14 AM, Alex Behm <[email protected]> wrote:
>>
>> Parquet makes more sense, particularly for the kind of query you have.
>>
>> Still, you might want to be careful with creating huge gzipped files.
>> Impala's gzip decompressor for Parquet is also not streaming.
>>
>> On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <[email protected]> wrote:
>>
>> So as a workaround, does it make sense to convert it to a Parquet table
>> with Hive?
>>
>> And I think this is worth mentioning in the Avro table documentation,
>> because it is unexpected behavior for many users.
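On the question of how the unzipped size can be estimated: a standard gzip stream ends with a 4-byte ISIZE field holding the uncompressed length modulo 2**32, which is the usual cheap way to estimate output size without decompressing (the thread does not confirm that Impala reads it this way, so treat this as background, not Impala internals). A minimal Python sketch:

```python
import gzip
import os
import struct
import tempfile

def gzip_isize(path):
    """Return the ISIZE field from a gzip file's trailer: the
    uncompressed length of the (last) member, modulo 2**32."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)        # ISIZE is the final 4 bytes
        (isize,) = struct.unpack("<I", f.read(4))
    return isize

# Usage: the trailer matches the real uncompressed size for output < 4 GiB.
demo = os.path.join(tempfile.gettempdir(), "demo.gz")
with gzip.open(demo, "wb") as f:
    f.write(b"x" * 100_000)
print(gzip_isize(demo))  # 100000
```

Note that because ISIZE is stored modulo 2**32, it wraps for outputs over 4 GiB, so for a file that really expands to ~50 GB the trailer alone under-reports the size.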
>> On Thu, Apr 6, 2017 at 02:52, Alex Behm <[email protected]> wrote:
>>
>> Gzip supports streaming decompression, but we currently only implement
>> that for text tables.
>>
>> Doing streaming decompression certainly makes sense for Avro as well.
>>
>> I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
>> improvement.
>>
>> On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <[email protected]> wrote:
>>
>> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <[email protected]> wrote:
>> > Will Impala load the whole file into memory? That sounds horrible. And
>> > according to "show partitions adhoc_data_fast.log", the compressed
>> > files are no bigger than 4 GB:
>>
>> The *uncompressed* size of one of your files is 50GB. Gzip needs to
>> allocate memory for that.
>>
>> > | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
>> > | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
>> >
>> > On Thu, Apr 6, 2017 at 12:58 AM, Marcel Kornacker <[email protected]> wrote:
>> >>
>> >> Apparently you have a gzipped file that is >=50GB. You either need to
>> >> break up those files, or run on larger machines.
>> >>
>> >> On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > I've been using Impala in production for a while, but since
>> >> > yesterday some queries have been reporting "memory limit exceeded".
>> >> > Then I tried a very simple count query, and it still hit the memory
>> >> > limit.
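The streaming vs. non-streaming distinction above is the crux of the failure: a streaming decompressor consumes fixed-size input chunks and emits output incrementally, while a non-streaming one must allocate a buffer for the entire uncompressed result up front. A small Python sketch of the streaming approach, purely illustrative and not Impala's implementation:

```python
import gzip
import io
import zlib

# Build an in-memory gzip stream to stand in for a compressed file.
raw = b"log line\n" * 200_000            # ~1.8 MB uncompressed
src = io.BytesIO(gzip.compress(raw))

# Streaming decompression: feed fixed-size input chunks and consume the
# output incrementally, instead of allocating one buffer for the whole
# uncompressed result up front.
d = zlib.decompressobj(wbits=31)         # wbits=31: expect gzip framing
out_len = 0
while chunk := src.read(64 * 1024):
    out_len += len(d.decompress(chunk))  # pass max_length= to also bound output
out_len += len(d.flush())
print(out_len == len(raw))  # True
```

Here each `decompress()` call yields only the bytes decodable from the input seen so far, so peak memory is driven by chunk sizes rather than the 50 GB uncompressed total.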
>> >> >
>> >> > The query is:
>> >> >
>> >> > select count(0) from adhoc_data_fast.log where day>='2017-04-04'
>> >> > and day<='2017-04-06';
>> >> >
>> >> > And the response in the Impala shell is:
>> >> >
>> >> > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
>> >> > http://szq7.appadhoc.com:25000)
>> >> > Query progress can be monitored at:
>> >> > http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
>> >> > WARNINGS:
>> >> > Memory limit exceeded
>> >> > GzipDecompressor failed to allocate 54525952000 bytes.
>> >> >
>> >> > I have many nodes, and each of them has plenty of memory available
>> >> > (~60 GB). The query fails very quickly after I execute it, and the
>> >> > nodes show almost no memory usage.
>> >> >
>> >> > The table "adhoc_data_fast.log" is an Avro table, encoded with gzip
>> >> > and partitioned by the field "day". Each partition has no more than
>> >> > one billion rows.
>> >> >
>> >> > My Impala version is:
>> >> >
>> >> > hdfs@szq7:/home/ubuntu$ impalad --version
>> >> > impalad version 2.7.0-cdh5.9.1 RELEASE (build
>> >> > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
>> >> > Built on Wed Jan 11 13:39:25 PST 2017
>> >> >
>> >> > Can anyone help with this? Thanks very much!
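As noted earlier in the thread, the two problem files turned out to have corrupt data at the end. Assuming the files are whole-file gzip-compressed (which the GzipDecompressor error suggests), here is a sketch of pre-screening a .gz file for that kind of damage; the `gzip_ok` helper is hypothetical, not part of any Impala or Hadoop tooling:

```python
import gzip
import os
import tempfile
import zlib

def gzip_ok(path, chunk_size=1 << 20):
    """Stream-decode a gzip file end to end.

    Returns True only if the stream decodes cleanly (checksum verified)
    and no leftover bytes follow the gzip member."""
    d = zlib.decompressobj(wbits=31)   # wbits=31: expect gzip framing
    try:
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                d.decompress(chunk)
    except zlib.error:
        return False                   # corrupt bytes inside the stream
    # d.eof: trailer reached and checksum verified;
    # d.unused_data: junk appended after the gzip member
    return bool(d.eof and not d.unused_data)

# Usage: a clean file passes; one with junk appended or truncated fails.
tmp = tempfile.gettempdir()
good = gzip.compress(b"sample record\n" * 1000)
for name, blob in [("good.gz", good),
                   ("junk.gz", good + b"corrupt tail"),
                   ("trunc.gz", good[:-5])]:
    with open(os.path.join(tmp, name), "wb") as f:
        f.write(blob)
print([gzip_ok(os.path.join(tmp, n)) for n in ("good.gz", "junk.gz", "trunc.gz")])
# [True, False, False]
```

Note that gzipped Avro produced by other pipelines may instead use per-block codecs inside the Avro container, in which case a check like this does not apply as-is.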
