This sounds like a bug in the version you are using (3.2). On the current
3.4 version, I tried a simpler test case with sample timestamp data and
could not reproduce the problem using the same query pattern you ran. It is
likely that your timestamp issue has already been fixed, but to confirm
that you will need to provide a sample data file (assuming it does not
contain sensitive data) and attach it to the JIRA using the link I sent
earlier.
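
For background, here is a minimal Python sketch of the kind of min/max
row-group pruning that parquet_read_statistics enables. It is a conceptual
illustration only, not Impala's actual implementation; the RowGroup class,
the count_in_range function, and the sample values are all made up for the
example. It relies on the fact that strings in 'YYYY-MM-DD HH:MM:SS' form
sort lexicographically in chronological order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group of string timestamps."""
    values: List[str]

    @property
    def min_stat(self) -> str:
        return min(self.values)

    @property
    def max_stat(self) -> str:
        return max(self.values)


def count_in_range(groups: List[RowGroup], lo: str, hi: str,
                   use_stats: bool) -> int:
    """Count values v with lo <= v <= hi, optionally pruning by min/max."""
    total = 0
    for g in groups:
        # A group may be skipped only when its whole [min, max] range lies
        # outside [lo, hi]; otherwise every row must still be checked.
        if use_stats and (g.max_stat < lo or g.min_stat > hi):
            continue
        total += sum(1 for v in g.values if lo <= v <= hi)
    return total


# Made-up sample data: three row groups of string timestamps.
groups = [
    RowGroup(["2017-01-15 08:00:00", "2018-04-30 23:59:59"]),
    RowGroup(["2018-04-01 00:00:00", "2019-06-01 12:00:00"]),
    RowGroup(["2020-09-01 00:00:00", "2021-01-01 00:00:00"]),
]
lo, hi = "2018-05-01 00:00:00", "2020-08-10 17:03:00"

with_stats = count_in_range(groups, lo, hi, use_stats=True)
without_stats = count_in_range(groups, lo, hi, use_stats=False)
```

In this sketch with_stats and without_stats are equal, because pruning only
skips groups that cannot contain a match. A count that changes when
statistics are turned on, as in your third test, points to either bad
statistics in the files or a bug in how the bounds are compared.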

Aman

On Mon, Aug 10, 2020 at 3:19 PM Sri Harsha Chavali <
sriharsha.chav...@outlook.com> wrote:

> Another observation: the query below returns the same result set with or
> without the property set. Note the to_date() function around now().
>
> select count(1)
> from dbname.tablename a
>   where a.testdate <= to_date(now())
>   and a.testdate >= '2018-05-01 00:00:00';
>
>
> Thank you,
> Harsha
>
> Sent from Outlook <http://aka.ms/weboutlook>
> ------------------------------
> *From:* Sri Harsha Chavali <sriharsha.chav...@outlook.com>
> *Sent:* Monday, August 10, 2020 5:03 PM
> *To:* user@impala.apache.org <user@impala.apache.org>
> *Subject:* Re: Improper Rowresults from Impala query
>
> Hi Aman,
>
> Thank you for the quick response. I tried three things.
> 1. Removed all filters and only had a.testdate <= now() and it's a perfect
> match.
> select count(1)
> from dbname.tablename a
>   where a.testdate <= now();
> set parquet_read_statistics=false;
> 5879452
> set parquet_read_statistics=true;
> 5879452
>
> 2. Removed all filters and only had a.testdate >='2018-05-01 00:00:00';
> and it's a perfect match.
> select count(1)
> from dbname.tablename a
>   where a.testdate >= '2018-05-01 00:00:00';
> set parquet_read_statistics=false;
> 12906263
> set parquet_read_statistics=true;
> 12906263
>
> 3. Kept only the two filters a.testdate <= now() and a.testdate >=
> '2018-05-01 00:00:00' together, and found the discrepancy.
> select count(1)
> from dbname.tablename a
>   where a.testdate <= now()
>   and a.testdate >= '2018-05-01 00:00:00';
> set parquet_read_statistics=false;
> 1687250
> set parquet_read_statistics=true;
> 12892421
>
> I eliminated the Parquet files one by one and the issue persisted with
> every file. I also inspected the files with the parquet-tools command-line
> tool and they looked fine.
>
> I also created duplicate tables using Hive and Impala (using CTAS) and
> still see the issue with the newly created tables. Any input on why the
> combination of filters might cause the issue?
>
> Thank you,
> Harsha
>
> ------------------------------
> *From:* Aman Sinha <amansi...@gmail.com>
> *Sent:* Monday, August 10, 2020 3:52 PM
> *To:* user@impala.apache.org <user@impala.apache.org>
> *Subject:* Re: Improper Rowresults from Impala query
>
> Harsha,
> To rule out issues with other data types, could you check with just the
> testdate column? For example:
> SELECT COUNT(*) FROM dbname.tablename a WHERE a.testdate >=
> '2018-05-01 00:00:00'
> Is the result different with and without parquet_read_statistics?
>
> There are two possibilities:
> (a) The Parquet stats for one or more of those files may be corrupted (I am
> not sure how they were created). Can you narrow down the set of Parquet
> files? Does it happen even with one Parquet file?
> (b) There could be a timestamp-related bug in pruning using the Parquet
> stats.
> Either way, you may want to file a JIRA and provide a sample file if
> possible at https://issues.apache.org/jira/projects/IMPALA/
>
> -Aman
>
> On Mon, Aug 10, 2020 at 9:18 AM Sri Harsha Chavali <
> sriharsha.chav...@outlook.com> wrote:
>
> Hi All,
>
> We recently upgraded from Impala 2.12 to 3.2 (CDH Impala). We are facing
> an issue where one of our queries returns wrong results when there is a
> predicate (WHERE condition) on the timestamp field (stored as a string in
> our case). Below is a sample query that fails on our end. The table is a
> Parquet table loaded using Hive.
>
> select a.testidid,a.testdate from dbname.tablename a where a.testdate <=
> now() and a.testdate >= '2018-05-01 00:00:00' and a.type = 'TEST' and
> a.context != 123 and a.status in ('OPEN','CLOSED') and a.context = 1234 and
> a.testid = 123456;
>
> I researched further, looked at the plan, and found that row-group
> filtering might be happening in my case. I disabled the property below
> and the results were correct.
>
> set parquet_read_statistics=false;
>
> Do you think this might be related to an existing bug or am I doing
> something wrong?
>
> Thank you,
> Harsha
>
>
>
