Dejan, thanks for following up. Are you willing to share the data to help
us reproduce? Is there a convenient place you can upload the data?

On Wed, Dec 28, 2016 at 10:58 AM, Dejan Prokić <[email protected]> wrote:

> I opened the issue: https://issues.cloudera.org/browse/IMPALA-4715
>
> Explain plan is for exact query I sent in previous mail, "e.date_id =
> 2420" is probably not shown because that is the only partition currently. I
> backed up only that partition and dropped original table.
>
> *Dejan Prokić* | Data Engineer | Nordeus
>
> 2016-12-28 16:31 GMT+01:00 Tim Armstrong <[email protected]>:
>
>> I notice that " e.date_id = 2420" doesn't appear anywhere in the explain
>> plan - was the explain plan just for a slightly different query?
>>
>> It does look like IMPALA-4049 but the Impala server that ships with CDH
>> 5.8.3 should have the fix: https://github.com/cloudera/Im
>> pala/commits/cdh5-2.6.0_5.8.3. It sounds like that's what you're running.
>>
>> It could be a parquet bug like https://issues.cloudera.org/br
>> owse/IMPALA-4539 or https://issues.cloudera.org/browse/IMPALA-4444.
>>
>> I think we really need to just try and reproduce it - would you be able
>> to file an issue on https://issues.cloudera.org?
>>
>>
>> On Tue, Dec 27, 2016 at 10:35 PM, Dejan Prokić <[email protected]>
>> wrote:
>>
>>> Hi Tim,
>>>
>>> Thanks for the quick response!
>>>
>>> I am running version 2.6.0, more precisely:
>>> Server version: impalad version 2.6.0-cdh5.8.3 RELEASE (build
>>> c644f476b774db9db87a619628f7a6ecc5f843e0)
>>>
>>> Here is explain plan:
>>> +-----------------------------------------------------------
>>> -------------------------+
>>> | Explain String
>>> |
>>> +-----------------------------------------------------------
>>> -------------------------+
>>> | Estimated Per-Host Requirements: Memory=882.00MB
>>> VCores=2                          |
>>> | WARNING: The following tables are missing relevant table and/or column
>>> statistics. |
>>> | dejanp.event
>>> |
>>> |
>>> |
>>> | 09:EXCHANGE [UNPARTITIONED]
>>>                                          |
>>> | |
>>> |
>>> | 08:AGGREGATE [FINALIZE]
>>>                                         |
>>> | |  output: count:merge(*)
>>>                                           |
>>> | |  group by: won
>>>                                         |
>>> | |
>>> |
>>> | 07:EXCHANGE [HASH(won)]
>>>                                          |
>>> | |
>>> |
>>> | 06:AGGREGATE [STREAMING]
>>>                                         |
>>> | |  output: count(*)
>>>                                           |
>>> | |  group by: max(CASE WHEN key = 'won' THEN value
>>> END)                             |
>>> | |
>>> |
>>> | 01:SUBPLAN
>>> |
>>> | |
>>> |
>>> | |--05:NESTED LOOP JOIN [CROSS JOIN]
>>> |
>>> | |  |
>>> |
>>> | |  |--02:SINGULAR ROW SRC
>>> |
>>> | |  |
>>> |
>>> | |  04:AGGREGATE [FINALIZE]
>>>                                      |
>>> | |  |  output: max(CASE WHEN key = 'won' THEN value
>>> END)                            |
>>> | |  |
>>> |
>>> | |  03:UNNEST [e.event_map]
>>>                                         |
>>> | |
>>> |
>>> | 00:SCAN HDFS [dejanp.event e]
>>> |
>>> |    partitions=1/1 files=64 size=18.76GB
>>> |
>>> |    predicates: e.event_id = 10012
>>> |
>>> +-----------------------------------------------------------
>>> -------------------------+
>>>
>>> Looks very similar to https://issues.cloudera.org/browse/IMPALA-4049
>>> Looking at the list of bug fixes, I should upgrade Impala to the latest
>>> version.
>>>
>>> Dejan
>>>
>>>
>>> *Dejan Prokić* | Data Engineer | Nordeus
>>>
>>> 2016-12-27 17:18 GMT+01:00 Tim Armstrong <[email protected]>:
>>>
>>>> Hi Dejan,
>>>>   Thanks for the bug report!
>>>>
>>>> There are a few issues related to nested types that you could be
>>>> hitting. It depends on the versions of Impala - we fixed different cases in
>>>> different versions. What version are you running? Also if you have the
>>>> "explain" plan available that would help in narrowing down the possible
>>>> cases.
>>>>
>>>> I did a quick search that I think pulls up all the relevant JIRAs:
>>>> https://issues.cloudera.org/issues/?jql=project%20%3D%20IMPA
>>>> LA%20AND%20labels%20%3D%20correctness%20AND%20text%20~%20nes
>>>> ted%20ORDER%20BY%20updated%20DESC
>>>>
>>>> Could be https://issues.cloudera.org/browse/IMPALA-4049 or
>>>> https://issues.cloudera.org/browse/IMPALA-3311
>>>>
>>>> - Tim
>>>>
>>>> On Tue, Dec 27, 2016 at 1:58 AM, Dejan Prokić <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I noticed a bug when selecting data from a table with complex type.
>>>>> This is definition of my table:
>>>>>
>>>>> CREATE TABLE event (
>>>>>   user_id BIGINT,
>>>>>   event_id INT,
>>>>>   event_type_id INT,
>>>>>   ts_bigint BIGINT,
>>>>>   processed_ts_bigint BIGINT,
>>>>>   event_map MAP<STRING, STRING>,
>>>>>   server_id INT
>>>>> )
>>>>> PARTITIONED BY (date_id INT)
>>>>> STORED AS PARQUET;
>>>>>
>>>>> I want to run this query:
>>>>>
>>>>> WITH
>>>>> battles AS (
>>>>> SELECT
>>>>>   *
>>>>> FROM event e,
>>>>>   (SELECT
>>>>>     MAX(CASE WHEN key='won' THEN value END) AS won
>>>>>   FROM e.event_map) m
>>>>> WHERE e.date_id = 2420 AND e.event_id = 10012
>>>>> )
>>>>> SELECT won, count(*) FROM battles
>>>>> GROUP BY won;
>>>>>
>>>>> Map element with key = 'won' has only 'true' and 'false' values. If
>>>>> battles subquery has more than 5 million rows the query returns strange
>>>>> results, and when it has less rows query returns correct data.
>>>>>
>>>>> Are you aware of this issue? I didn't send you my data since it is
>>>>> pretty large for mail (20 GB).
>>>>>
>>>>> Thanks
>>>>>
>>>>> *Dejan Prokić* | Data Engineer | Nordeus
>>>>>
>>>>
>>>>
>>>
>>
>

Reply via email to