Dejan, thanks for following up. Are you willing to share the data to help us reproduce? Is there a convenient place you can upload the data?
On Wed, Dec 28, 2016 at 10:58 AM, Dejan Prokić <[email protected]> wrote: > I opened the issue: https://issues.cloudera.org/browse/IMPALA-4715 > > Explain plan is for exact query I sent in previous mail, "e.date_id = > 2420" is probably not shown because that is the only partition currently. I > backed up only that partition and dropped original table. > > *Dejan Prokić* | Data Engineer | Nordeus > > 2016-12-28 16:31 GMT+01:00 Tim Armstrong <[email protected]>: > >> I notice that " e.date_id = 2420" doesn't appear anywhere in the explain >> plan - was the explain plan just for a slightly different query? >> >> It does look like IMPALA-4049 but the Impala server that ships with CDH >> 5.8.3 should have the fix: https://github.com/cloudera/Im >> pala/commits/cdh5-2.6.0_5.8.3. It sounds like that's what you're running. >> >> It could be a parquet bug like https://issues.cloudera.org/br >> owse/IMPALA-4539 or https://issues.cloudera.org/browse/IMPALA-4444. >> >> I think we really need to just try and reproduce it - would you be able >> to file an issue on https://issues.cloudera.org? >> >> >> On Tue, Dec 27, 2016 at 10:35 PM, Dejan Prokić <[email protected]> >> wrote: >> >>> Hi Tim, >>> >>> Thanks for the quick response! >>> >>> I am running version 2.6.0, more precisely: >>> Server version: impalad version 2.6.0-cdh5.8.3 RELEASE (build >>> c644f476b774db9db87a619628f7a6ecc5f843e0) >>> >>> Here is explain plan: >>> +----------------------------------------------------------- >>> -------------------------+ >>> | Explain String >>> | >>> +----------------------------------------------------------- >>> -------------------------+ >>> | Estimated Per-Host Requirements: Memory=882.00MB >>> VCores=2 | >>> | WARNING: The following tables are missing relevant table and/or column >>> statistics. | >>> | dejanp.event >>> | >>> | >>> | >>> | 09:EXCHANGE [UNPARTITIONED] >>> | >>> | | >>> | >>> | 08:AGGREGATE [FINALIZE] >>> | >>> | | output: count:merge(*) >>> | >>> | | group by: won >>> | >>> | | >>> | >>> | 07:EXCHANGE [HASH(won)] >>> | >>> | | >>> | >>> | 06:AGGREGATE [STREAMING] >>> | >>> | | output: count(*) >>> | >>> | | group by: max(CASE WHEN key = 'won' THEN value >>> END) | >>> | | >>> | >>> | 01:SUBPLAN >>> | >>> | | >>> | >>> | |--05:NESTED LOOP JOIN [CROSS JOIN] >>> | >>> | | | >>> | >>> | | |--02:SINGULAR ROW SRC >>> | >>> | | | >>> | >>> | | 04:AGGREGATE [FINALIZE] >>> | >>> | | | output: max(CASE WHEN key = 'won' THEN value >>> END) | >>> | | | >>> | >>> | | 03:UNNEST [e.event_map] >>> | >>> | | >>> | >>> | 00:SCAN HDFS [dejanp.event e] >>> | >>> | partitions=1/1 files=64 size=18.76GB >>> | >>> | predicates: e.event_id = 10012 >>> | >>> +----------------------------------------------------------- >>> -------------------------+ >>> >>> Looks very similar to https://issues.cloudera.org/browse/IMPALA-4049 >>> Looking at the list of bug fixes, I should upgrade Impala to the latest >>> version. >>> >>> Dejan >>> >>> >>> *Dejan Prokić* | Data Engineer | Nordeus >>> >>> 2016-12-27 17:18 GMT+01:00 Tim Armstrong <[email protected]>: >>> >>>> Hi Dejan, >>>> Thanks for the bug report! >>>> >>>> There are a few issues related to nested types that you could be >>>> hitting. It depends on the versions of Impala - we fixed different cases in >>>> different versions. What version are you running? Also if you have the >>>> "explain" plan available that would help in narrowing down the possible >>>> cases. >>>> >>>> I did a quick search that I think pulls up all the relevant JIRAs: >>>> https://issues.cloudera.org/issues/?jql=project%20%3D%20IMPA >>>> LA%20AND%20labels%20%3D%20correctness%20AND%20text%20~%20nes >>>> ted%20ORDER%20BY%20updated%20DESC >>>> >>>> Could be https://issues.cloudera.org/browse/IMPALA-4049 or >>>> https://issues.cloudera.org/browse/IMPALA-3311 >>>> >>>> - Tim >>>> >>>> On Tue, Dec 27, 2016 at 1:58 AM, Dejan Prokić <[email protected]> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> I noticed a bug when selecting data from a table with complex type. >>>>> This is definition of my table: >>>>> >>>>> CREATE TABLE event ( >>>>> user_id BIGINT, >>>>> event_id INT, >>>>> event_type_id INT, >>>>> ts_bigint BIGINT, >>>>> processed_ts_bigint BIGINT, >>>>> event_map MAP<STRING, STRING>, >>>>> server_id INT >>>>> ) >>>>> PARTITIONED BY (date_id INT) >>>>> STORED AS PARQUET; >>>>> >>>>> I want to run this query: >>>>> >>>>> WITH >>>>> battles AS ( >>>>> SELECT >>>>> * >>>>> FROM event e, >>>>> (SELECT >>>>> MAX(CASE WHEN key='won' THEN value END) AS won >>>>> FROM e.event_map) m >>>>> WHERE e.date_id = 2420 AND e.event_id = 10012 >>>>> ) >>>>> SELECT won, count(*) FROM battles >>>>> GROUP BY won; >>>>> >>>>> Map element with key = 'won' has only 'true' and 'false' values. If >>>>> battles subquery has more than 5 million rows the query returns strange >>>>> results, and when it has less rows query returns correct data. >>>>> >>>>> Are you aware of this issue? I didn't send you my data since it is >>>>> pretty large for mail (20 GB). >>>>> >>>>> Thanks >>>>> >>>>> *Dejan Prokić* | Data Engineer | Nordeus >>>>> >>>> >>>> >>> >> >
