Hi,

I've dug into it and found two files that trigger this problem.
After removing them from the partition, I can query it. But these two
files are small both before and after compression (smaller than 400 MB). The
only problem is that they have some corrupt data at the end. Is this a bug in
Impala?
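As an aside, a corrupt or truncated tail can often be detected up front by stream-decoding the file and catching the decode error. This is only a sketch for plain .gz files, and `gzip_tail_is_intact` is a hypothetical helper, not part of Impala or avro-tools; Avro container files would need an Avro-aware check instead:

```python
import gzip

# Hypothetical helper: stream-decode a .gz file in fixed-size chunks to
# check for a corrupt or truncated tail, without holding the uncompressed
# data in memory. A truncated stream raises EOFError; garbage raises
# BadGzipFile/OSError.
def gzip_tail_is_intact(path, chunk_size=64 * 1024):
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (EOFError, gzip.BadGzipFile, OSError):
        return False
```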

Regards,
Bin Wang

Bin Wang <[email protected]> wrote on Thu, Apr 6, 2017 at 12:05 PM:

> I converted these Avro files to JSON with avro-tools, and the JSON files are
> no larger than 1 GB, so Impala should be able to read them. Some of the Avro
> files are corrupt.
>
> 16M     log.2017-04-05.1491321605834.avro.json
> 308M    log.2017-04-05.1491323647211.avro.json
> 103M    log.2017-04-05.1491327241311.avro.json
> 150M    log.2017-04-05.1491330839609.avro.json
> 397M    log.2017-04-05.1491334439092.avro.json
> 297M    log.2017-04-05.1491338038503.avro.json
> 160M    log.2017-04-05.1491341639694.avro.json
> 95M     log.2017-04-05.1491345239969.avro.json
> 360M    log.2017-04-05.1491348843931.avro.json
> 338M    log.2017-04-05.1491352442955.avro.json
> 71M     log.2017-04-05.1491359648079.avro.json
> 161M    log.2017-04-05.1491363247597.avro.json
> 628M    log.2017-04-05.1491366845827.avro.json
> 288M    log.2017-04-05.1491370445873.avro.json
> 162M    log.2017-04-05.1491374045830.avro.json
> 90M     log.2017-04-05.1491377650935.avro.json
> 269M    log.2017-04-05.1491381249597.avro.json
> 620M    log.2017-04-05.1491384846366.avro.json
> 70M     log.2017-04-05.1491388450262.avro.json
> 30M     log.2017-04-05.1491392047694.avro.json
> 114M    log.2017-04-05.1491395648818.avro.json
> 370M    log.2017-04-05.1491399246407.avro.json
> 359M    log.2017-04-05.1491402846469.avro.json
> 218M    log.2017-04-05.1491406180615.avro.json
> 29M     log.2017-04-05.1491409790105.avro.json
> 3.9M    log.2017-04-05.1491413385884.avro.json
> 9.3M    log.2017-04-05.1491416981829.avro.json
> 8.3M    log.2017-04-05.1491420581588.avro.json
> 2.3M    log.2017-04-05.1491424180191.avro.json
> 25M     log.2017-04-05.1491427781339.avro.json
> 24M     log.2017-04-05.1491431382552.avro.json
> 5.7M    log.2017-04-05.1491434984679.avro.json
> 35M     log.2017-04-05.1491438586674.avro.json
> 5.8M    log.2017-04-05.1491442192541.avro.json
> 23M     log.2017-04-05.1491445789230.avro.json
> 4.3M    log.2017-04-05.1491449386630.avro.json
>
> Bin Wang <[email protected]> wrote on Thu, Apr 6, 2017 at 11:34 AM:
>
> And here is another question: how does Impala estimate the uncompressed file
> size? All the gzipped files are no bigger than 300 MB, so I think it should
> be OK to decompress them.
>
> Bin Wang <[email protected]> wrote on Thu, Apr 6, 2017 at 9:31 AM:
>
> Is the Snappy decompressor for Avro or Parquet streaming?
>
> Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 9:27 AM:
>
> I'd say following the best practices with Parquet should work fine. Create
> snappy-compressed Parquet files of roughly 256MB in size.
> If you want to stick with Avro, then yes, you'll just have to create
> smaller files.
>
> On Wed, Apr 5, 2017 at 6:23 PM, Bin Wang <[email protected]> wrote:
>
> So the best workaround for now is to split the files into smaller files?
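For line-oriented gzipped data, that splitting could be sketched as below. The hypothetical `split_gzip_lines` helper is illustrative only: Avro container files are binary and must instead be split on their own block boundaries (e.g. by rewriting them with avro-tools):

```python
import gzip

# Illustrative sketch: re-chunk a large gzipped *line-oriented* file into
# smaller gzipped parts so each part decompresses within a memory budget.
# Reads the source as a stream, so memory stays bounded throughout.
def split_gzip_lines(src, prefix, max_uncompressed=256 * 1024 * 1024):
    part, written, out = 0, 0, None
    with gzip.open(src, "rb") as f:
        for line in f:  # streamed line iteration, one line in memory at a time
            if out is None or written >= max_uncompressed:
                if out is not None:
                    out.close()
                out = gzip.open(f"{prefix}.{part:04d}.gz", "wb")
                part, written = part + 1, 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return part  # number of parts created
```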
>
> Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 9:14 AM:
>
> Parquet makes more sense, particularly for the kind of query you have.
>
> Still, you might want to be careful with creating huge gzipped files.
> Impala's gzip decompressor for Parquet is also not streaming.
>
> On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <[email protected]> wrote:
>
> So as a workaround, does it make sense to convert the table to Parquet
> with Hive?
>
> And I think this behavior should be mentioned in the Avro table
> documentation, because it is unexpected for many users.
>
> Alex Behm <[email protected]> wrote on Thu, Apr 6, 2017 at 02:52:
>
> Gzip supports streaming decompression, but we currently only implement
> that for text tables.
>
> Doing streaming decompression certainly makes sense for Avro as well.
>
> I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
> improvement.
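To illustrate what streaming decompression buys: the sketch below decodes a gzip file in fixed-size chunks, so peak memory stays bounded regardless of the uncompressed size. This shows the general technique only, not Impala's implementation, and `count_uncompressed_bytes` is a hypothetical name:

```python
import gzip

# Minimal sketch of streaming gzip decompression: the file is decoded in
# fixed-size chunks, so only chunk_size bytes of uncompressed data are
# held in memory at once, even for a 50 GB uncompressed payload.
def count_uncompressed_bytes(path, chunk_size=64 * 1024):
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # at most chunk_size bytes at a time
            if not chunk:
                break
            total += len(chunk)
    return total
```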
>
> On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <[email protected]>
> wrote:
>
> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <[email protected]> wrote:
> > Will Impala load the whole file into memory? That sounds horrible. And
> > according to "show partitions adhoc_data_fast.log", the compressed files
> > are no bigger than 4 GB:
>
> The *uncompressed* size of one of your files is 50GB, and the
> (non-streaming) gzip decompressor needs to allocate memory for all of it.
>
> >
> > | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
> > | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
> >
> >
> > Marcel Kornacker <[email protected]> wrote on Thu, Apr 6, 2017 at 12:58 AM:
> >>
> >> Apparently you have a gzipped file that is >=50GB. You either need to
> >> break up those files, or run on larger machines.
> >>
> >> On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > I've been using Impala in production for a while, but since yesterday
> >> > some queries report "memory limit exceeded". Then I tried a very simple
> >> > count query, and it still exceeds the memory limit.
> >> >
> >> > The query is:
> >> >
> >> > select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
> >> > day<='2017-04-06';
> >> >
> >> > And the response in the Impala shell is:
> >> >
> >> > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
> >> > http://szq7.appadhoc.com:25000)
> >> > Query progress can be monitored at:
> >> >
> >> >
> http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
> >> > WARNINGS:
> >> > Memory limit exceeded
> >> > GzipDecompressor failed to allocate 54525952000 bytes.
> >> >
> >> > I have many nodes, each with lots of memory available (~60 GB).
> >> > The query fails very quickly after I execute it, and the nodes show
> >> > almost no memory usage.
> >> >
> >> > The table "adhoc_data_fast.log" is an Avro table, compressed with gzip
> >> > and partitioned by the field "day". Each partition has no more than one
> >> > billion rows.
> >> >
> >> > My Impala version is:
> >> >
> >> > hdfs@szq7:/home/ubuntu$ impalad --version
> >> > impalad version 2.7.0-cdh5.9.1 RELEASE (build
> >> > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
> >> > Built on Wed Jan 11 13:39:25 PST 2017
> >> >
> >> > Can anyone help with this? Thanks very much!
> >> >
