[jira] [Commented] (IMPALA-12322) return wrong timestamp when scan kudu timestamp with timezone

2024-05-31 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851147#comment-17851147
 ] 

Csaba Ringhofer commented on IMPALA-12322:
--

[~eyizoha] convert_kudu_utc_timestamps only affects reading, so if Impala 
writes a Kudu table, it will read back a different timestamp than the one it 
wrote.

In IMPALA-12370 there is some discussion about how to configure the writing 
behavior. Do you think that convert_kudu_utc_timestamps should also govern 
writing, or should that get a separate query option?
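The read-side-only conversion can be modeled with a short sketch (the helper names and the Asia/Shanghai zone are illustrative assumptions, not Impala code): writing stores the wall clock as timezone-agnostic micros, while reading with the conversion enabled maps UTC to the local zone, so a round trip changes the value.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LOCAL_TZ = ZoneInfo("Asia/Shanghai")  # example server/query timezone

def write_timestamp(ts_local: datetime) -> int:
    # Write path (no conversion): the wall-clock fields are stored as if
    # they were UTC, i.e. timezone-agnostic epoch micros.
    return int(ts_local.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

def read_timestamp(micros: int, convert_utc: bool) -> datetime:
    utc = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
    if convert_utc:
        # Read path with convert_kudu_utc_timestamps: UTC -> local zone.
        return utc.astimezone(LOCAL_TZ).replace(tzinfo=None)
    return utc.replace(tzinfo=None)

written = datetime(2024, 1, 1, 12, 0, 0)
micros = write_timestamp(written)
# Without conversion the round trip is lossless; with conversion the
# value read back differs from what was written.
assert read_timestamp(micros, False) == written
assert read_timestamp(micros, True) != written
```

This is why a write-side counterpart (or making the same option govern both directions) would restore round-trip consistency.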

> return wrong timestamp when scan kudu timestamp with timezone
> -
>
> Key: IMPALA-12322
> URL: https://issues.apache.org/jira/browse/IMPALA-12322
> Project: IMPALA
>  Issue Type: Bug
>Affects Versions: Impala 4.1.1
> Environment: impala 4.1.1
>Reporter: daicheng
>Assignee: Zihao Ye
>Priority: Major
> Attachments: image-2022-04-24-00-01-05-746-1.png, 
> image-2022-04-24-00-01-05-746.png, image-2022-04-24-00-01-37-520.png, 
> image-2022-04-24-00-03-14-467-1.png, image-2022-04-24-00-03-14-467.png, 
> image-2022-04-24-00-04-16-240-1.png, image-2022-04-24-00-04-16-240.png, 
> image-2022-04-24-00-04-52-860-1.png, image-2022-04-24-00-04-52-860.png, 
> image-2022-04-24-00-05-52-086-1.png, image-2022-04-24-00-05-52-086.png, 
> image-2022-04-24-00-07-09-776-1.png, image-2022-04-24-00-07-09-776.png, 
> image-2023-07-28-20-31-09-457.png, image-2023-07-28-22-27-38-521.png, 
> image-2023-07-28-22-29-40-083.png, image-2023-07-28-22-36-17-460.png, 
> image-2023-07-28-22-36-37-884.png, image-2023-07-28-22-38-19-728.png
>
>
> impala version is 3.1.0-cdh6.1
> I have set the system timezone to Asia/Shanghai:
> !image-2022-04-24-00-01-37-520.png!
> !image-2022-04-24-00-01-05-746.png!
> here is the bug:
> *step 1*
> I have a parquet file with two columns as below, and read it with impala-shell 
> and Spark (timezone=Shanghai)
> !image-2022-04-24-00-03-14-467.png|width=1016,height=154!
> !image-2022-04-24-00-04-16-240.png|width=944,height=367!
> both results are exactly right.
> *step 2*
> create a Kudu table with impala-shell:
> CREATE TABLE default.test_{_}test{_}_test_time2 (id BIGINT, t 
> TIMESTAMP, PRIMARY KEY (id)) STORED AS KUDU;
> note: Kudu version 1.8
> then insert 2 rows into the table with Spark:
> !image-2022-04-24-00-04-52-860.png|width=914,height=279!
> *step 3*
> read it with Spark (timezone=Shanghai); Spark reads the Kudu table with the 
> kudu-client API. Here is the result:
> !image-2022-04-24-00-05-52-086.png|width=914,height=301!
> the result is still exactly right.
> but when reading it with impala-shell:
> !image-2022-04-24-00-07-09-776.png|width=915,height=154!
> the result is 8 hours late
> *conclusion*
> it seems like the Impala timezone setting doesn't work when the Kudu column 
> type is timestamp, but it works fine with Parquet files. I don't know why.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12370) Add an option to customize timezone when working with UNIXTIME_MICROS columns of Kudu tables

2024-05-31 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851132#comment-17851132
 ] 

Csaba Ringhofer commented on IMPALA-12370:
--

> That will free the users from the inconvenience of running their clusters in 
> the UTC timezone
The timezone doesn't need to be set at the server level in Impala; it can be set 
per query using the query option "timezone", e.g. set timezone=CET;

> Ideally, the setting should be per Kudu table, but a system-wide flag is also 
> an option.
The query option convert_kudu_utc_timestamps only affects reading, so there 
could be a writing-related one too, e.g. write_kudu_utc_timestamps (or 
convert_kudu_utc_timestamps could be changed to also affect writing).

I agree that the ideal would be to be able to override this per table, for 
example with a table property like "impala.use_kudu_utc_timestamps" which would 
override both convert_kudu_utc_timestamps and write_kudu_utc_timestamps.
It would be even better if other components also respected this property, so 
that if it is false, they would write in the timezone-agnostic "Impala" way. 
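The difference between the two write conventions discussed here can be sketched as follows (illustrative Python under the assumption that CET is the non-UTC server zone; this is a model, not either engine's code):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("CET")  # a server/query timezone other than UTC

wall_clock = datetime(2024, 1, 1, 12, 0, 0)  # what the user "sees"

# Spark-style writer: convert the local wall clock to UTC before taking
# the Unix epoch micros.
spark_micros = int(wall_clock.replace(tzinfo=tz).timestamp() * 1_000_000)

# Timezone-agnostic "Impala" writer: store the wall-clock fields as if
# they were already UTC.
impala_micros = int(wall_clock.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

# The stored values differ by the UTC offset (1 hour for CET in January),
# which is exactly the skew seen when mixing the two writers on one table.
assert impala_micros - spark_micros == 3600 * 1_000_000
```

A per-table property would let readers and writers agree on which of these two interpretations a given table uses.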

> Add an option to customize timezone when working with UNIXTIME_MICROS columns 
> of Kudu tables
> 
>
> Key: IMPALA-12370
> URL: https://issues.apache.org/jira/browse/IMPALA-12370
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: timezone
>
> Impala uses the timezone of its server when converting Unix epoch time stored 
> in a Kudu table in a column of UNIXTIME_MICROS type (legacy type name 
> TIMESTAMP) into a timestamp.  As one can see, the former (the values stored in 
> a column of the UNIXTIME_MICROS type) does not contain information about the 
> timezone, but the latter (the resulting timestamp returned by Impala) does, and 
> Impala's convention makes sense and works fine if the data is 
> being written and read by Impala or by another application that uses the same 
> convention.
> However, Spark uses a different convention.  Spark applications convert 
> timestamps to the UTC timezone before representing the result as Unix epoch 
> time.  So, when a Spark application stores timestamp data in a Kudu table, 
> there is a difference in the result timestamps upon reading the stored data 
> via Impala if Impala servers are running in other than the UTC timezone.
> As of now, the workaround is to run Impala servers in the UTC timezone, so 
> the convention used by Spark produces the same result as the convention used 
> by Impala when converting between timestamps and Unix epoch times.
> In this context, it would be great to make it possible to customize the 
> timezone that's used by Impala when working with UNIXTIME_MICROS/TIMESTAMP 
> values stored in Kudu tables.  That would free users from the 
> inconvenience of running their clusters in the UTC timezone if they use a mix 
> of Spark/Impala applications to work with the same data stored in Kudu 
> tables.  Ideally, the setting should be per Kudu table, but a system-wide 
> flag is also an option.
> This is similar to IMPALA-1658.



--



[jira] [Commented] (IMPALA-12656) impala-shell cannot be installed on Python 3.11

2024-05-29 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850457#comment-17850457
 ] 

Csaba Ringhofer commented on IMPALA-12656:
--

I also bumped into this and tried building the python-sasl PRs.
https://github.com/cloudera/python-sasl/pull/32 worked with 3.11 and 3.12 but 
broke 2.7 (at least in my environment). The other PR only fixes 3.11, and had 
other build failures with 3.12.

I think that this is a good reason to drop Python 2.7 support.

> impala-shell cannot be installed on Python 3.11
> ---
>
> Key: IMPALA-12656
> URL: https://issues.apache.org/jira/browse/IMPALA-12656
> Project: IMPALA
>  Issue Type: Bug
>Affects Versions: Impala 4.3.0
>Reporter: Michael Smith
>Priority: Major
>  Labels: python3
>
> Trying to {{pip install impala-shell}} fails with
> {code:java}
>       clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG 
> -g -fwrapv -O3 -Wall -isysroot 
> /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk -Isasl 
> -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11
>  -c sasl/saslwrapper.cpp -o 
> build/temp.macosx-14-arm64-cpython-311/sasl/saslwrapper.o
>       sasl/saslwrapper.cpp:196:12: fatal error: 'longintrepr.h' file not found
>         #include "longintrepr.h"
>                  ^~~
>       1 error generated. {code}
> Python 3.11 moved this file to a subdirectory in 
> [https://github.com/python/cpython/commit/8e5de40f90476249e9a2e5ef135143b5c6a0b512.]
> Adopting [https://github.com/cloudera/python-sasl/pull/31] or 
> [https://github.com/cloudera/python-sasl/pull/32] might fix it. But they need 
> to be included in a new release of sasl on pypi.org.



--



[jira] [Updated] (IMPALA-12656) impala-shell cannot be installed on Python 3.11

2024-05-29 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12656:
-
Priority: Critical  (was: Major)

> impala-shell cannot be installed on Python 3.11
> ---
>
> Key: IMPALA-12656
> URL: https://issues.apache.org/jira/browse/IMPALA-12656
> Project: IMPALA
>  Issue Type: Bug
>Affects Versions: Impala 4.3.0
>Reporter: Michael Smith
>Priority: Critical
>  Labels: python3
>
> Trying to {{pip install impala-shell}} fails with
> {code:java}
>       clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG 
> -g -fwrapv -O3 -Wall -isysroot 
> /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk -Isasl 
> -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11
>  -c sasl/saslwrapper.cpp -o 
> build/temp.macosx-14-arm64-cpython-311/sasl/saslwrapper.o
>       sasl/saslwrapper.cpp:196:12: fatal error: 'longintrepr.h' file not found
>         #include "longintrepr.h"
>                  ^~~
>       1 error generated. {code}
> Python 3.11 moved this file to a subdirectory in 
> [https://github.com/python/cpython/commit/8e5de40f90476249e9a2e5ef135143b5c6a0b512.]
> Adopting [https://github.com/cloudera/python-sasl/pull/31] or 
> [https://github.com/cloudera/python-sasl/pull/32] might fix it. But they need 
> to be included in a new release of sasl on pypi.org.



--



[jira] [Commented] (IMPALA-11512) BINARY support in Iceberg

2024-05-23 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848937#comment-17848937
 ] 

Csaba Ringhofer commented on IMPALA-11512:
--

BINARY columns seem to be working with Iceberg, but test coverage seems very 
limited. I didn't find any test with a partition spec on BINARY columns.

> BINARY support in Iceberg
> -
>
> Key: IMPALA-11512
> URL: https://issues.apache.org/jira/browse/IMPALA-11512
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Frontend
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: impala-iceberg
>




--



[jira] [Resolved] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows

2024-05-17 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer resolved IMPALA-12990.
--
Fix Version/s: Impala 4.4.0
   Resolution: Fixed

> impala-shell broken if Iceberg delete deletes 0 rows
> 
>
> Key: IMPALA-12990
> URL: https://issues.apache.org/jira/browse/IMPALA-12990
> Project: IMPALA
>  Issue Type: Bug
>  Components: Clients
>Reporter: Csaba Ringhofer
>Assignee: Csaba Ringhofer
>Priority: Major
>  Labels: iceberg
> Fix For: Impala 4.4.0
>
>
> Happens only with Python 3
> {code}
> impala-python3 shell/impala_shell.py
> create table icebergupdatet (i int, s string) stored as iceberg;
> alter table icebergupdatet set tblproperties("format-version"="2");
> delete from icebergupdatet where i=0;
> Unknown Exception : '>' not supported between instances of 'NoneType' and 
> 'int'
> Traceback (most recent call last):
>   File "shell/impala_shell.py", line 1428, in _execute_stmt
> if is_dml and num_rows == 0 and num_deleted_rows > 0:
> TypeError: '>' not supported between instances of 'NoneType' and 'int'
> {code}
> The same error should also happen when the delete removes > 0 rows but the 
> impala server has an older version that doesn't set TDmlResult.rows_deleted
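One way to make the failing check from the traceback robust on Python 3 is to coalesce None to 0 before comparing. This is a sketch with a hypothetical helper name; the actual fix in impala_shell.py may differ.

```python
def should_report_deleted_rows(is_dml, num_rows, num_deleted_rows):
    # TDmlResult.rows_deleted may be None when the server is older and
    # doesn't set it; on Python 3 `None > 0` raises TypeError (on
    # Python 2 it silently evaluated to False).
    return bool(is_dml and num_rows == 0 and (num_deleted_rows or 0) > 0)

# The crashing case (delete removed 0 rows, field left unset) no longer raises:
assert should_report_deleted_rows(True, 0, None) is False
assert should_report_deleted_rows(True, 0, 3) is True
assert should_report_deleted_rows(True, 0, 0) is False
```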



--



[jira] [Created] (IMPALA-13056) HBaseTableScanner's timeout handling looks broken

2024-05-03 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-13056:


 Summary: HBaseTableScanner's timeout handling looks broken
 Key: IMPALA-13056
 URL: https://issues.apache.org/jira/browse/IMPALA-13056
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Reporter: Csaba Ringhofer


https://gerrit.cloudera.org/#/c/12660/ rewrote some JNI exception handling code 
and accidentally eliminated the timeout handling in 
https://github.com/apache/impala/blob/7ad94006563b88d9221b4ac978dbf5b4fc0a3ca1/be/src/exec/hbase/hbase-table-scanner.cc#L518



--



[jira] [Updated] (IMPALA-13052) Sampling aggregate result sizes are underestimated

2024-05-02 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-13052:
-
Description: 
Sampling aggregates (sample, appx_median, histogram) return a string that can 
be quite large, but the planner assumes it to have a fixed small size.

Examples:
select sample(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.45 KB (this is a single row sent by a host)

select appx_median(l_orderkey) from tpch.lineitem;
according to plan: row-size=8B
in reality: TotalBytesSent: 254.68 KB (this is a single row sent by a host)

select histogram(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.35 KB (this is a single row sent by a host)

This may also be relevant for datasketches functions; I haven't checked those 
yet.

This can lead to highly underestimating the memory needs of grouping 
aggregators:
select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 
limit 1
04:AGGREGATE  FINALIZE Peak Mem:  2.19 GB   Est. Peak Mem:  18.00 MB
01:AGGREGATE STREAMING  Peak Mem:   2.37 GB   Est. Peak Mem:  45.79 MB

Enforcing PREAGG_BYTES_LIMIT also doesn't seem to work well: setting a 40 MB 
limit only decreased peak memory to 1.5 GB. My guess is that the pre-aggregation 
logic is not prepared for aggregation states that grow during execution, so it 
can decide not to add another group to the hash table, but can't deny 
increasing an existing one's state.
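Why a fixed per-row size estimate fails for these functions can be shown with a toy model (hypothetical simplification, not Impala's planner or aggregate code): the aggregation state grows with the number of inputs up to a cap, unlike a fixed-size state such as sum().

```python
PLANNED_ROW_SIZE = 12  # bytes: roughly what the planner assumes for sample()

def sample_agg_state(values, cap=20_000):
    # A sample()-like aggregate keeps up to `cap` input values, so its
    # serialized state grows with the input until the cap is reached.
    state = []
    for v in values:
        if len(state) < cap:
            state.append(v)
    return state

state = sample_agg_state(range(100_000))
actual_size = len(state) * 8  # ~8 bytes per int64 value
# Hundreds of KB of real state vs. a 12 B planner estimate, matching the
# ~254 KB TotalBytesSent observed above.
assert actual_size > 100 * 1024
assert actual_size > PLANNED_ROW_SIZE
```

With one such state per group in a grouping aggregation, the planner's memory estimate is off by the same factor per group, which is consistent with the 18 MB estimate vs. 2.19 GB actual peak shown above.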


  was:
Sampling aggregates (sample, appx_median, histogram) return a string that can 
be quite large, but the planner assumes it to have a fixed small size.

Examples:
select sample(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.45 KB (this is a single row sent by a host)

select appx_median(l_orderkey) from tpch.lineitem;
according to plan: row-size=8B
in reality: TotalBytesSent: 254.68 KB (this is a single row sent by a host)

select histogram(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.35 KB (this is a single row sent by a host)

This may also be relevant for datasketches functions; I haven't checked those yet.

This can lead to highly underestimating the memory needs of grouping 
aggregators:
select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 
limit 1
04:AGGREGATE  FINALIZE Peak Mem:  2.19 GB   Est. Peak Mem:  18.00 MB
01:AGGREGATE STREAMING  Peak Mem:   2.37 GB   Est. Peak Mem:  45.79 MB


> Sampling aggregate result sizes are underestimated
> --
>
> Key: IMPALA-13052
> URL: https://issues.apache.org/jira/browse/IMPALA-13052
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Major
>
> Sampling aggregates (sample, appx_median, histogram) return a string that can 
> be quite large, but the planner assumes it to have a fixed small size.
> Examples:
> select sample(l_orderkey) from tpch.lineitem;
> according to plan: row-size=12B
> in reality: TotalBytesSent: 254.45 KB (this is a single row sent by a host)
> select appx_median(l_orderkey) from tpch.lineitem;
> according to plan: row-size=8B
> in reality: TotalBytesSent: 254.68 KB (this is a single row sent by a host)
> select histogram(l_orderkey) from tpch.lineitem;
> according to plan: row-size=12B
> in reality: TotalBytesSent: 254.35 KB (this is a single row sent by a host)
> This may also be relevant for datasketches functions; I haven't checked those 
> yet.
> This can lead to highly underestimating the memory needs of grouping 
> aggregators:
> select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 
> limit 1
> 04:AGGREGATE  FINALIZE Peak Mem:  2.19 GB   Est. Peak Mem:  18.00 MB
> 01:AGGREGATE STREAMING  Peak Mem:   2.37 GB   Est. Peak Mem:  45.79 MB
> Enforcing PREAGG_BYTES_LIMIT also doesn't seem to work well: setting a 40 MB 
> limit only decreased peak memory to 1.5 GB. My guess is that the pre-aggregation 
> logic is not prepared for aggregation states that grow during execution, 
> so it can decide not to add another group to the hash table, but can't deny 
> increasing an existing one's state.



--



[jira] [Updated] (IMPALA-13052) Sampling aggregate result sizes are underestimated

2024-05-02 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-13052:
-
Description: 
Sampling aggregates (sample, appx_median, histogram) return a string that can 
be quite large, but the planner assumes it to have a fixed small size.

Examples:
select sample(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.45 KB (this is a single row sent by a host)

select appx_median(l_orderkey) from tpch.lineitem;
according to plan: row-size=8B
in reality: TotalBytesSent: 254.68 KB (this is a single row sent by a host)

select histogram(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.35 KB (this is a single row sent by a host)

This may also be relevant for datasketches functions; I haven't checked those yet.

This can lead to highly underestimating the memory needs of grouping 
aggregators:
select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 
limit 1
04:AGGREGATE  FINALIZE Peak Mem:  2.19 GB   Est. Peak Mem:  18.00 MB
01:AGGREGATE STREAMING  Peak Mem:   2.37 GB   Est. Peak Mem:  45.79 MB

  was:
Sampling aggregates (sample, appx_median, histogram) return a string that can 
be quite large, but the planner assumes it to have a fixed small size.

Examples:
select sample(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.45 KB (this is a single row sent by a host)

select appx_median(l_orderkey) from tpch.lineitem;
according to plan: row-size=8B
in reality: TotalBytesSent: 254.68 KB (this is a single row sent by a host)

select histogram(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.35 KB (this is a single row sent by a host)

This may also be relevant for datasketches functions.



> Sampling aggregate result sizes are underestimated
> --
>
> Key: IMPALA-13052
> URL: https://issues.apache.org/jira/browse/IMPALA-13052
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Major
>
> Sampling aggregates (sample, appx_median, histogram) return a string that can 
> be quite large, but the planner assumes it to have a fixed small size.
> Examples:
> select sample(l_orderkey) from tpch.lineitem;
> according to plan: row-size=12B
> in reality: TotalBytesSent: 254.45 KB (this is a single row sent by a host)
> select appx_median(l_orderkey) from tpch.lineitem;
> according to plan: row-size=8B
> in reality: TotalBytesSent: 254.68 KB (this is a single row sent by a host)
> select histogram(l_orderkey) from tpch.lineitem;
> according to plan: row-size=12B
> in reality: TotalBytesSent: 254.35 KB (this is a single row sent by a host)
> This may also be relevant for datasketches functions; I haven't checked those 
> yet.
> This can lead to highly underestimating the memory needs of grouping 
> aggregators:
> select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 
> limit 1
> 04:AGGREGATE  FINALIZE Peak Mem:  2.19 GB   Est. Peak Mem:  18.00 MB
> 01:AGGREGATE STREAMING  Peak Mem:   2.37 GB   Est. Peak Mem:  45.79 MB



--



[jira] [Created] (IMPALA-13052) Sampling aggregate result sizes are underestimated

2024-05-02 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-13052:


 Summary: Sampling aggregate result sizes are underestimated
 Key: IMPALA-13052
 URL: https://issues.apache.org/jira/browse/IMPALA-13052
 Project: IMPALA
  Issue Type: Bug
Reporter: Csaba Ringhofer


Sampling aggregates (sample, appx_median, histogram) return a string that can 
be quite large, but the planner assumes it to have a fixed small size.

Examples:
select sample(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.45 KB (this is a single row sent by a host)

select appx_median(l_orderkey) from tpch.lineitem;
according to plan: row-size=8B
in reality: TotalBytesSent: 254.68 KB (this is a single row sent by a host)

select histogram(l_orderkey) from tpch.lineitem;
according to plan: row-size=12B
in reality: TotalBytesSent: 254.35 KB (this is a single row sent by a host)

This may also be relevant for datasketches functions.




--



[jira] [Updated] (IMPALA-13048) Shuffle hint on joins is ignored in some cases

2024-04-30 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-13048:
-
Description: 
I noticed that the shuffle hint is ignored without any warning in some cases.

The shuffle hint is not applied in this query:

{code}
explain select  * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on 
a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col;
{code}
result plan
{code}
PLAN-ROOT SINK
|
07:EXCHANGE [UNPARTITIONED]
|
04:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: a3.tinyint_col = a2.tinyint_col
|  runtime filters: RF000 <- a2.tinyint_col
|  row-size=267B cardinality=80
|
|--06:EXCHANGE [BROADCAST]
|  |
|  03:HASH JOIN [INNER JOIN, BROADCAST]
|  |  hash predicates: a1.id = a2.id
|  |  runtime filters: RF002 <- a2.id
|  |  row-size=178B cardinality=8
|  |
|  |--05:EXCHANGE [BROADCAST]
|  |  |
|  |  00:SCAN HDFS [functional.alltypestiny a2]
|  | HDFS partitions=4/4 files=4 size=460B
|  | row-size=89B cardinality=8
|  |
|  01:SCAN HDFS [functional.alltypes a1]
| HDFS partitions=24/24 files=24 size=478.45KB
| runtime filters: RF002 -> a1.id
| row-size=89B cardinality=7.30K
|
02:SCAN HDFS [functional.alltypessmall a3]
   HDFS partitions=4/4 files=4 size=6.32KB
   runtime filters: RF000 -> a3.tinyint_col
   row-size=89B cardinality=100
{code}

If the first two tables' positions are swapped, then the hint is applied:
{code}
explain select  * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on 
a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col;
{code}

  was:
I noticed that the shuffle hint is ignored without any warning in some cases.

The shuffle hint is not applied in this query:

{code}
explain select  * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on 
a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col;
{code}
result plan
{code}
PLAN-ROOT SINK
|
07:EXCHANGE [UNPARTITIONED]
|
04:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: a3.tinyint_col = a2.tinyint_col
|  runtime filters: RF000 <- a2.tinyint_col
|  row-size=267B cardinality=80
|
|--06:EXCHANGE [BROADCAST]
|  |
|  03:HASH JOIN [INNER JOIN, BROADCAST]
|  |  hash predicates: a1.id = a2.id
|  |  runtime filters: RF002 <- a2.id
|  |  row-size=178B cardinality=8
|  |
|  |--05:EXCHANGE [BROADCAST]
|  |  |
|  |  00:SCAN HDFS [functional.alltypestiny a2]
|  | HDFS partitions=4/4 files=4 size=460B
|  | row-size=89B cardinality=8
|  |
|  01:SCAN HDFS [functional.alltypes a1]
| HDFS partitions=24/24 files=24 size=478.45KB
| runtime filters: RF002 -> a1.id
| row-size=89B cardinality=7.30K
|
02:SCAN HDFS [functional.alltypessmall a3]
   HDFS partitions=4/4 files=4 size=6.32KB
   runtime filters: RF000 -> a3.tinyint_col
   row-size=89B cardinality=100
{code}

If the first two tables' positions are swapped, then the hint is applied:
{code}
explain select  * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on 
a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col;
{code}


> Shuffle hint on joins is ignored in some cases
> --
>
> Key: IMPALA-13048
> URL: https://issues.apache.org/jira/browse/IMPALA-13048
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Major
>
> I noticed that the shuffle hint is ignored without any warning in some cases.
> The shuffle hint is not applied in this query:
> {code}
> explain select  * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on 
> a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col;
> {code}
> result plan
> {code}
> PLAN-ROOT SINK
> |
> 07:EXCHANGE [UNPARTITIONED]
> |
> 04:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: a3.tinyint_col = a2.tinyint_col
> |  runtime filters: RF000 <- a2.tinyint_col
> |  row-size=267B cardinality=80
> |
> |--06:EXCHANGE [BROADCAST]
> |  |
> |  03:HASH JOIN [INNER JOIN, BROADCAST]
> |  |  hash predicates: a1.id = a2.id
> |  |  runtime filters: RF002 <- a2.id
> |  |  row-size=178B cardinality=8
> |  |
> |  |--05:EXCHANGE [BROADCAST]
> |  |  |
> |  |  00:SCAN HDFS [functional.alltypestiny a2]
> |  | HDFS partitions=4/4 files=4 size=460B
> |  | row-size=89B cardinality=8
> |  |
> |  01:SCAN HDFS [functional.alltypes a1]
> | HDFS partitions=24/24 files=24 size=478.45KB
> | runtime filters: RF002 -> a1.id
> | row-size=89B cardinality=7.30K
> |
> 02:SCAN HDFS [functional.alltypessmall a3]
>HDFS partitions=4/4 files=4 size=6.32KB
>runtime filters: RF000 -> a3.tinyint_col
>row-size=89B cardinality=100
> {code}
> If the first two tables' positions are swapped, then the hint is applied:
> {code}
> explain select  * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on 
> a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col;
> {code}



--

[jira] [Created] (IMPALA-13048) Shuffle hint on joins is ignored in some cases

2024-04-30 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-13048:


 Summary: Shuffle hint on joins is ignored in some cases
 Key: IMPALA-13048
 URL: https://issues.apache.org/jira/browse/IMPALA-13048
 Project: IMPALA
  Issue Type: Bug
Reporter: Csaba Ringhofer


I noticed that the shuffle hint is ignored without any warning in some cases.

The shuffle hint is not applied in this query:

{code}
explain select  * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on 
a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col;
{code}
result plan
{code}
PLAN-ROOT SINK
|
07:EXCHANGE [UNPARTITIONED]
|
04:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: a3.tinyint_col = a2.tinyint_col
|  runtime filters: RF000 <- a2.tinyint_col
|  row-size=267B cardinality=80
|
|--06:EXCHANGE [BROADCAST]
|  |
|  03:HASH JOIN [INNER JOIN, BROADCAST]
|  |  hash predicates: a1.id = a2.id
|  |  runtime filters: RF002 <- a2.id
|  |  row-size=178B cardinality=8
|  |
|  |--05:EXCHANGE [BROADCAST]
|  |  |
|  |  00:SCAN HDFS [functional.alltypestiny a2]
|  | HDFS partitions=4/4 files=4 size=460B
|  | row-size=89B cardinality=8
|  |
|  01:SCAN HDFS [functional.alltypes a1]
| HDFS partitions=24/24 files=24 size=478.45KB
| runtime filters: RF002 -> a1.id
| row-size=89B cardinality=7.30K
|
02:SCAN HDFS [functional.alltypessmall a3]
   HDFS partitions=4/4 files=4 size=6.32KB
   runtime filters: RF000 -> a3.tinyint_col
   row-size=89B cardinality=100
{code}

If the first two tables' positions are swapped, then the hint is applied:
{code}
explain select  * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on 
a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col;
{code}



--



[jira] [Created] (IMPALA-13040) SIGSEGV in QueryState::UpdateFilterFromRemote

2024-04-26 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-13040:


 Summary: SIGSEGV in  QueryState::UpdateFilterFromRemote
 Key: IMPALA-13040
 URL: https://issues.apache.org/jira/browse/IMPALA-13040
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Reporter: Csaba Ringhofer


{code}
Crash reason:  SIGSEGV /SEGV_MAPERR
Crash address: 0x48
Process uptime: not available

Thread 114 (crashed)
 0  libpthread.so.0 + 0x9d00
rax = 0x00019e57ad00   rdx = 0x2a656720
rcx = 0x059a9860   rbx = 0x
rsi = 0x00019e57ad00   rdi = 0x0038
rbp = 0x7f6233d544e0   rsp = 0x7f6233d544a8
 r8 = 0x06a53540r9 = 0x0039
r10 = 0x   r11 = 0x000a
r12 = 0x00019e57ad00   r13 = 0x7f62a2f997d0
r14 = 0x7f6233d544f8   r15 = 0x1632c0f0
rip = 0x7f62a2f96d00
Found by: given as instruction pointer in context
 1  
impalad!impala::QueryState::UpdateFilterFromRemote(impala::UpdateFilterParamsPB 
const&, kudu::rpc::RpcContext*) [query-state.cc : 1033 + 0x5]
rbp = 0x7f6233d54520   rsp = 0x7f6233d544f0
rip = 0x015c0837
Found by: previous frame's frame pointer
 2  
impalad!impala::DataStreamService::UpdateFilterFromRemote(impala::UpdateFilterParamsPB
 const*, impala::UpdateFilterResultPB*, kudu::rpc::RpcContext*) 
[data-stream-service.cc : 134 + 0xb]
rbp = 0x7f6233d54640   rsp = 0x7f6233d54530
rip = 0x017c05de
Found by: previous frame's frame pointer
{code}

The line that crashes is 
https://github.com/apache/impala/blob/b39cd79ae84c415e0aebec2c2b4d7690d2a0cc7a/be/src/runtime/query-state.cc#L1033
My guess is that the actual segfault is within WaitForPrepare(), but it was 
inlined. I'm not sure whether a remote filter can arrive even before 
QueryState::Init has finished - that would explain the issue, as 
instances_prepared_barrier_ is not yet created at that point.
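The suspected race can be illustrated with a minimal Python model (class and method names are hypothetical; Impala's actual code is C++): the barrier that the remote-filter handler waits on is only created in Init(), so an RPC that arrives first dereferences an uninitialized member.

```python
import threading

class QueryStateModel:
    def __init__(self):
        # Analogous to instances_prepared_barrier_: created later, in init().
        self.instances_prepared_barrier = None

    def init(self, num_instances):
        self.instances_prepared_barrier = threading.Barrier(num_instances)

    def update_filter_from_remote(self):
        # Fails if the remote filter RPC races ahead of init(): in Python
        # this raises AttributeError, in C++ it dereferences a null member
        # (consistent with the crash address 0x48, a small struct offset).
        self.instances_prepared_barrier.wait()

qs = QueryStateModel()
try:
    qs.update_filter_from_remote()  # filter arrives before init() finished
except AttributeError as e:
    print("same class of bug as the SIGSEGV:", e)
```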



--



[jira] [Updated] (IMPALA-12320) test_topic_updates_unblock fails in ASAN build

2024-04-26 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12320:
-
Priority: Critical  (was: Major)

> test_topic_updates_unblock fails in ASAN build
> --
>
> Key: IMPALA-12320
> URL: https://issues.apache.org/jira/browse/IMPALA-12320
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Zoltán Borók-Nagy
>Assignee: Joe McDonnell
>Priority: Critical
>  Labels: broken-build
>
> h3. Error Message
> AssertionError: alter table tpcds.store_sales recover partitions query took 
> less time than 1 msec assert 9622 > 1 + where 9622 =  ApplyResult.get of  0x7f1ab45b6d10>>() + where  > = 
> .get
> h3. Stacktrace
> {noformat}
> custom_cluster/test_topic_update_frequency.py:82: in 
> test_topic_updates_unblock
> non_blocking_query_options=non_blocking_query_options)
> custom_cluster/test_topic_update_frequency.py:132: in __run_topic_update_test
> assert slow_query_future.get() > blocking_query_min_time, \
> E   AssertionError: alter table tpcds.store_sales recover partitions query 
> took less time than 1 msec
> E   assert 9622 > 1
> E+  where 9622 =  >()
> E+where  > = 
> .get
> {noformat}
>  






[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg

2024-04-24 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840486#comment-17840486
 ] 

Csaba Ringhofer commented on IMPALA-12266:
--

Saw this test failing again.
select * from special_chars;
Could not resolve table reference: 'special_chars'

Looked into the coordinator log:
{code}
I0422 03:48:38.383420 19888 Frontend.java:2127] 
1f4e0654b999662f:b6f1b015] Analyzing query: select * from special_chars 
db: test_convert_table_cdba7383
...
I0422 03:48:42.862898  1012 ImpaladCatalog.java:232] Deleting: 
TABLE:test_convert_table_cdba7383.special_chars version: 7785 size: 77
I0422 03:48:42.862920  1012 ImpaladCatalog.java:232] Deleting: 
TABLE:test_convert_table_cdba7383.special_chars_tmp_5eb06c80 version: 7786 
size: 714
I0422 03:48:42.862967  1012 ImpaladCatalog.java:232] Adding: CATALOG_SERVICE_ID 
version: 7786 size: 60
...
I0422 03:48:42.863464 19888 jni-util.cc:302] 1f4e0654b999662f:b6f1b015] 
org.apache.impala.common.AnalysisException: Could not resolve table reference: 
'special_chars'
at org.apache.impala.analysis.Analyzer.resolvePath(Analyzer.java:1458)
...
I0422 03:48:46.893426  1012 ImpaladCatalog.java:232] Adding: 
TABLE:test_convert_table_cdba7383.special_chars version: 7794 size: 84
{code}
I am not familiar with how converting to Iceberg works, but based on the logs:
1. special_chars_tmp_5eb06c80 is created,
2. special_chars is deleted,
3. special_chars is recreated.

If the table is queried between steps 2 and 3, the coordinator will think that
it doesn't exist.
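The visibility gap can be illustrated with a toy catalog (a sketch of the suspected mechanism, not Impala's catalog code):

```python
def migrate_non_atomic(cat):
    """Mirrors steps 2-3 above: the name disappears, then reappears."""
    old = cat.pop("special_chars")           # step 2: table deleted
    # A concurrent "select * from special_chars" here fails with
    # "Could not resolve table reference".
    cat["special_chars"] = "iceberg_" + old  # step 3: table recreated

def migrate_atomic(cat):
    """Publishing the new version with a single replacement leaves no
    window in which the name does not resolve."""
    cat["special_chars"] = "iceberg_" + cat["special_chars"]
```

Both variants end in the same state; only the atomic one keeps the name resolvable throughout.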

> Sporadic failure after migrating a table to Iceberg
> ---
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Quanlong Huang
>Priority: Major
>  Labels: impala-iceberg
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   

[jira] [Created] (IMPALA-13037) EventsProcessorStressTest can hang

2024-04-24 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-13037:


 Summary: EventsProcessorStressTest can hang
 Key: IMPALA-13037
 URL: https://issues.apache.org/jira/browse/IMPALA-13037
 Project: IMPALA
  Issue Type: Bug
  Components: Catalog, Infrastructure
Reporter: Csaba Ringhofer


The test failed with a timeout.

From mvn.log the last line is:
20:17:53 [INFO] Running 
org.apache.impala.catalog.events.EventsProcessorStressTest

Things seem to be hanging from 2024.04.22 20:17:53 to 2024.04.23
The test seems to be waiting for a Hive query.

From FeSupport.INFO:
{code}
I0422 20:17:55.478875  7949 RandomHiveQueryRunner.java:1102] Client 0 running 
hive query set 2: 
insert into table events_stress_db_0.stress_test_tbl_0_alltypes_part partition 
(year,month) select * from functional.alltypes limit 100
   create database if not exists events_stress_db_0
   drop table if exists events_stress_db_0.stress_test_tbl_0_alltypes_part 
   create table if not exists 
events_stress_db_0.stress_test_tbl_0_alltypes_part  like  functional.alltypes 
   set hive.exec.dynamic.partition.mode = nonstrict
   set hive.exec.max.dynamic.partitions = 1
   set hive.exec.max.dynamic.partitions.pernode = 1
   set tez.session.am.dag.submit.timeout.secs = 2
I0422 20:17:55.478940  7949 HiveJdbcClientPool.java:102] Executing sql : create 
database if not exists events_stress_db_0
I0422 20:17:55.493497  7768 MetastoreShim.java:843] EventId: 33414 EventType: 
COMMIT_TXN transaction id: 2075
I0422 20:17:55.493682  7768 MetastoreEvents.java:302] Total number of events 
received: 6 Total number of events filtered out: 0
I0422 20:17:55.494762  7768 MetastoreEvents.java:825] EventId: 33407 EventType: 
CREATE_DATABASE Successfully added database events_stress_db_0
I0422 20:17:55.508478  7949 HiveJdbcClientPool.java:102] Executing sql : drop 
table if exists events_stress_db_0.stress_test_tbl_0_alltypes_part 
I0422 20:17:55.516858  7768 MetastoreEvents.java:825] EventId: 33410 EventType: 
CREATE_TABLE Successfully added table events_stress_db_0.stress_test_tbl_0_part
I0422 20:17:55.518288  7768 CatalogOpExecutor.java:4713] EventId: 33413 Table 
events_stress_db_0.stress_test_tbl_0_part is not loaded. Skipping add partitions
I0422 20:17:55.519479  7768 MetastoreEventsProcessor.java:1340] Time elapsed in 
processing event batch: 178.895ms
I0422 20:17:55.521183  7768 MetastoreEventsProcessor.java:1120] Latest event in 
HMS: id=33420, time=1713842275. Last synced event: id=33414, time=1713842275.
I0422 20:17:55.533375  7949 HiveJdbcClientPool.java:102] Executing sql : create 
table if not exists events_stress_db_0.stress_test_tbl_0_alltypes_part  like  
functional.alltypes 
I0422 20:17:55.611153  7949 HiveJdbcClientPool.java:102] Executing sql : set 
hive.exec.dynamic.partition.mode = nonstrict
I0422 20:17:55.616571  7949 HiveJdbcClientPool.java:102] Executing sql : set 
hive.exec.max.dynamic.partitions = 1
I0422 20:17:55.619197  7949 HiveJdbcClientPool.java:102] Executing sql : set 
hive.exec.max.dynamic.partitions.pernode = 1
I0422 20:17:55.621069  7949 HiveJdbcClientPool.java:102] Executing sql : set 
tez.session.am.dag.submit.timeout.secs = 2
I0422 20:17:55.622972  7949 HiveJdbcClientPool.java:102] Executing sql : insert 
into table events_stress_db_0.stress_test_tbl_0_alltypes_part partition 
(year,month) select * from functional.alltypes limit 100
I0422 20:17:57.163591  7950 CatalogServiceCatalog.java:2747] Refreshing table 
metadata: events_stress_db_0.stress_test_tbl_0_part
I0422 20:17:57.829802  7768 MetastoreEventsProcessor.java:982] Received 6 
events. First event id: 33416.
I0422 20:17:57.833026  7768 MetastoreShim.java:843] EventId: 33417 EventType: 
COMMIT_TXN transaction id: 2076
I0422 20:17:57.833222  7768 MetastoreShim.java:843] EventId: 33419 EventType: 
COMMIT_TXN transaction id: 2077
I0422 20:17:57.84  7768 MetastoreShim.java:843] EventId: 33421 EventType: 
COMMIT_TXN transaction id: 2078
I0422 20:17:57.834242  7768 MetastoreShim.java:843] EventId: 33424 EventType: 
COMMIT_TXN transaction id: 2079
I0422 20:17:57.834323  7768 MetastoreEvents.java:302] Total number of events 
received: 6 Total number of events filtered out: 0
I0422 20:17:57.834570  7768 CatalogOpExecutor.java:4862] EventId: 33416 Table 
events_stress_db_0.stress_test_tbl_0_part is not loaded. Not processing the 
event.
I0422 20:17:57.837756  7768 MetastoreEvents.java:825] EventId: 33423 EventType: 
CREATE_TABLE Successfully added table 
events_stress_db_0.stress_test_tbl_0_alltypes_part
I0422 20:17:57.838668  7768 MetastoreEventsProcessor.java:1340] Time elapsed in 
processing event batch: 8.625ms
I0422 20:17:57.840027  7768 MetastoreEventsProcessor.java:1120] Latest event in 
HMS: id=33425, time=1713842275. Last synced event: id=33424, time=1713842275.
I0422 20:18:03.143219  7768 MetastoreEventsProcessor.java:982] Received 0 
events. First event id: 

[jira] [Created] (IMPALA-13026) Creating openai-api-key-secret fails sporadically

2024-04-22 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-13026:


 Summary: Creating openai-api-key-secret fails sporadically
 Key: IMPALA-13026
 URL: https://issues.apache.org/jira/browse/IMPALA-13026
 Project: IMPALA
  Issue Type: Bug
  Components: Infrastructure
Reporter: Csaba Ringhofer


Data load fails from time to time with the following error:
{code}
00:27:17.680 Error loading data. The end of the log file is:
00:27:17.680 04:15:15 
/data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/bin/load-data.py 
--workloads functional-query -e core --table_formats kudu/none/none --force 
--impalad localhost --hive_hs2_hostport localhost:11050 --hdfs_namenode 
localhost:20500
00:27:17.680 04:15:15 Executing Hadoop command: ... hadoop credential create 
openai-api-key-secret -value secret -provider 
localjceks://file/data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/testdata/jceks/test.jceks
...

00:27:17.680 java.io.IOException: Credential openai-api-key-secret already 
exists in 
localjceks://file/data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/testdata/jceks/test.jceks
00:27:17.680at 
org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.createCredentialEntry(AbstractJavaKeyStoreProvider.java:234)
00:27:17.680at 
org.apache.hadoop.security.alias.CredentialShell$CreateCommand.execute(CredentialShell.java:354)
00:27:17.680at 
org.apache.hadoop.tools.CommandShell.run(CommandShell.java:72)
00:27:17.680at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
00:27:17.680at 
org.apache.hadoop.security.alias.CredentialShell.main(CredentialShell.java:437)
00:27:17.680 04:15:15 Error executing Hadoop command, exiting
{code}

My guess is that this happens when "hadoop credential create" is called 
concurrently from different data loader processes.
https://github.com/apache/impala/blob/9b05a205fec397fa1e19ae467b1cc406ca43d948/bin/load-data.py#L323
Ideally this would be called in the serial phase of data load.
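If the command has to stay in the parallel phase, another option is to tolerate the duplicate. A sketch with a toy credential store (not the actual Hadoop API; class and function names are illustrative):

```python
class ToyCredentialStore:
    """Toy model of the jceks provider: create() fails on duplicates,
    which is what two concurrent load-data.py processes run into."""
    def __init__(self):
        self._creds = {}

    def create(self, name, value):
        if name in self._creds:
            raise IOError("Credential %s already exists" % name)
        self._creds[name] = value

def ensure_credential(store, name, value):
    # Treat "already exists" as success instead of aborting the data load.
    try:
        store.create(name, value)
    except IOError:
        pass
```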







[jira] [Comment Edited] (IMPALA-13024) Several tests timeout waiting for admission

2024-04-21 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839337#comment-17839337
 ] 

Csaba Ringhofer edited comment on IMPALA-13024 at 4/21/24 8:15 AM:
---

>Slot based admission is not enabled when using default groups
This was also my assumption, but it seems that it is enforced by default.
Reproduced slot starvation locally:

Run one query with more fragment instance than core count in one impala-shell:
set mt_dop=32;
select sleep(1000*60) from tpcds.store_sales limit 200; -- 

Run a query in another impala-shell:
select * from functional.alltypestiny;
ERROR: Admission for query exceeded timeout 6ms in pool default-pool. 
Queued reason: Not enough admission control slots available on host 
csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. 
Additional Details: Not Applicable

UPDATE:
I understand now what is happening: the limit is only enforced on 
coordinator-only queries.
While "select * from alltypestiny" failed, the much larger "select * from 
alltypes" ran without issues. The reason is that the former query runs on a 
single node.

From impalad.INFO:
"I0421 10:10:57.505287 1586078 admission-controller.cc:1962] Trying to admit 
id=91442a9fa1d2512d:db5337c2 in pool_name=default-pool 
executor_group_name=empty group (using coordinator only) 
per_host_mem_estimate=20.00 MB dedicated_coord_mem_estimate=120.00 MB 
max_requests=-1 max_queued=200 max_mem=-1.00 B is_trivial_query=false
I0421 10:10:57.505345 1586078 admission-controller.cc:1971] Stats: 
agg_num_running=1, agg_num_queued=1, agg_mem_reserved=4.02 MB,  
local_host(local_mem_admitted=516.57 MB, local_trivial_running=0, 
num_admitted_running=1, num_queued=1, backend_mem_reserved=4.02 MB, 
topN_query_stats: queries=[d84f2a7efee0998a:45ac1206], 
total_mem_consumed=4.02 MB, fraction_of_pool_total_mem=1; pool_level_stats: 
num_running=1, min=4.02 MB, max=4.02 MB, pool_total_mem=4.02 MB, 
average_per_query=4.02 MB)
I0421 10:10:57.505407 1586078 admission-controller.cc:2227] Could not dequeue 
query id=91442a9fa1d2512d:db5337c2 reason: Not enough admission control 
slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 
are already in use."
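The slot check behind the queued reason reduces to simple arithmetic (a simplified model of the per-host check, not the actual admission controller code):

```python
def can_admit(needed_slots, slots_in_use, host_slots):
    """Simplified per-host slot check behind the queued reason
    "Needed 1 slots but 32/24 are already in use"."""
    return slots_in_use + needed_slots <= host_slots

# The mt_dop=32 query holds 32 slots on a 24-slot host, so even a
# 1-slot coordinator-only query is rejected until it finishes.
```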



was (Author: csringhofer):
>Slot based admission is not enabled when using default groups
This was also my assumption, but it seems that it is enforced by default.
Reproduced slot starvation locally:

Run one query with more fragment instance than core count in one impala-shell:
set mt_dop=32;
select sleep(1000*60) from tpcds.store_sales limit 200; -- 

Run a query in another impala-shell:
select * from functional.alltypestiny;
ERROR: Admission for query exceeded timeout 6ms in pool default-pool. 
Queued reason: Not enough admission control slots available on host 
csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. 
Additional Details: Not Applicable


> Several tests timeout waiting for admission
> ---
>
> Key: IMPALA-13024
> URL: https://issues.apache.org/jira/browse/IMPALA-13024
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Critical
>
> A bunch of seemingly unrelated tests failed with the following message:
> Example: 
> query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol:
>  beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, 
> 'default_spillable_buffer_size': '256k'} | table_format: parquet/none] 
> {code}
> ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded 
> timeout 6ms in pool default-pool. Queued reason: Not enough admission 
> control slots available on host ... . Needed 1 slots but 18/16 are already in 
> use. Additional Details: Not Applicable
> {code}
> This happened in an ASAN build. Another test also failed which may be related 
> to the cause:
> custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots
>  
> {code}
> Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the 
> expected states [4], last known state 5
> {code}
> test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338






[jira] [Commented] (IMPALA-13024) Several tests timeout waiting for admission

2024-04-21 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839337#comment-17839337
 ] 

Csaba Ringhofer commented on IMPALA-13024:
--

>Slot based admission is not enabled when using default groups
This was also my assumption, but it seems that it is enforced by default.
Reproduced slot starvation locally:

Run one query with more fragment instance than core count in one impala-shell:
set mt_dop=32;
select sleep(1000*60) from tpcds.store_sales limit 200; -- 

Run a query in another impala-shell:
select * from functional.alltypestiny;
ERROR: Admission for query exceeded timeout 6ms in pool default-pool. 
Queued reason: Not enough admission control slots available on host 
csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. 
Additional Details: Not Applicable


> Several tests timeout waiting for admission
> ---
>
> Key: IMPALA-13024
> URL: https://issues.apache.org/jira/browse/IMPALA-13024
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Critical
>
> A bunch of seemingly unrelated tests failed with the following message:
> Example: 
> query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol:
>  beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, 
> 'default_spillable_buffer_size': '256k'} | table_format: parquet/none] 
> {code}
> ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded 
> timeout 6ms in pool default-pool. Queued reason: Not enough admission 
> control slots available on host ... . Needed 1 slots but 18/16 are already in 
> use. Additional Details: Not Applicable
> {code}
> This happened in an ASAN build. Another test also failed which may be related 
> to the cause:
> custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots
>  
> {code}
> Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the 
> expected states [4], last known state 5
> {code}
> test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338






[jira] [Created] (IMPALA-13024) Several tests timeout waiting for admission

2024-04-20 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-13024:


 Summary: Several tests timeout waiting for admission
 Key: IMPALA-13024
 URL: https://issues.apache.org/jira/browse/IMPALA-13024
 Project: IMPALA
  Issue Type: Bug
Reporter: Csaba Ringhofer


A bunch of seemingly unrelated tests failed with the following message:
Example: 
query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol:
 beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, 
'default_spillable_buffer_size': '256k'} | table_format: parquet/none] 
{code}
ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded timeout 
6ms in pool default-pool. Queued reason: Not enough admission control slots 
available on host ... . Needed 1 slots but 18/16 are already in use. Additional 
Details: Not Applicable
{code}

This happened in an ASAN build. Another test also failed which may be related 
to the cause:
custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots
 
{code}
Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the 
expected states [4], last known state 5
{code}
test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338






[jira] [Created] (IMPALA-13021) Failed test: test_iceberg_deletes_and_updates_and_optimize

2024-04-19 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-13021:


 Summary: Failed test: test_iceberg_deletes_and_updates_and_optimize
 Key: IMPALA-13021
 URL: https://issues.apache.org/jira/browse/IMPALA-13021
 Project: IMPALA
  Issue Type: Bug
Reporter: Csaba Ringhofer


{code}
test_iceberg_deletes_and_updates_and_optimize
run_tasks([deleter, updater, optimizer, checker])
stress/stress_util.py:46: in run_tasks
pool.map_async(Task.run, tasks).get(timeout_seconds)
Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/multiprocessing/pool.py:568:
 in get
raise TimeoutError
E   TimeoutError
{code}
This happened in an exhaustive test run with data cache.






[jira] [Updated] (IMPALA-5323) Support Kudu BINARY

2024-04-10 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-5323:

Fix Version/s: Impala 4.4.0

> Support Kudu BINARY
> ---
>
> Key: IMPALA-5323
> URL: https://issues.apache.org/jira/browse/IMPALA-5323
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Reporter: Pavel Martynov
>Assignee: Csaba Ringhofer
>Priority: Major
>  Labels: kudu
> Fix For: Impala 4.4.0
>
>
> I tried to 'CREATE EXTERNAL TABLE STORED AS KUDU' on a table with a BINARY 
> Kudu column type and got the error: Kudu type 'binary' is not supported 
> in Impala.
> This limitation is not documented; I checked:
> https://impala.incubator.apache.org/docs/build/html/topics/impala_kudu.html
> https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations
> There are some suggestions that the Kudu BINARY type could be mapped to 
> Impala's STRING type:
> https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Does-impala-support-binary-data-type/td-p/24366
> https://groups.google.com/a/cloudera.org/forum/#!msg/impala-user/muguKJU3c3I/_oArmoxSlDMJ






[jira] [Resolved] (IMPALA-5323) Support Kudu BINARY

2024-04-10 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer resolved IMPALA-5323.
-
Resolution: Fixed

> Support Kudu BINARY
> ---
>
> Key: IMPALA-5323
> URL: https://issues.apache.org/jira/browse/IMPALA-5323
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Reporter: Pavel Martynov
>Assignee: Csaba Ringhofer
>Priority: Major
>  Labels: kudu
> Fix For: Impala 4.4.0
>
>
> I tried to 'CREATE EXTERNAL TABLE STORED AS KUDU' on a table with a BINARY 
> Kudu column type and got the error: Kudu type 'binary' is not supported 
> in Impala.
> This limitation is not documented; I checked:
> https://impala.incubator.apache.org/docs/build/html/topics/impala_kudu.html
> https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations
> There are some suggestions that the Kudu BINARY type could be mapped to 
> Impala's STRING type:
> https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Does-impala-support-binary-data-type/td-p/24366
> https://groups.google.com/a/cloudera.org/forum/#!msg/impala-user/muguKJU3c3I/_oArmoxSlDMJ






[jira] [Work started] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows

2024-04-10 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-12990 started by Csaba Ringhofer.

> impala-shell broken if Iceberg delete deletes 0 rows
> 
>
> Key: IMPALA-12990
> URL: https://issues.apache.org/jira/browse/IMPALA-12990
> Project: IMPALA
>  Issue Type: Bug
>  Components: Clients
>Reporter: Csaba Ringhofer
>Assignee: Csaba Ringhofer
>Priority: Major
>  Labels: iceberg
>
> Happens only with Python 3
> {code}
> impala-python3 shell/impala_shell.py
> create table icebergupdatet (i int, s string) stored as iceberg;
> alter table icebergupdatet set tblproperties("format-version"="2");
> delete from icebergupdatet where i=0;
> Unknown Exception : '>' not supported between instances of 'NoneType' and 
> 'int'
> Traceback (most recent call last):
>   File "shell/impala_shell.py", line 1428, in _execute_stmt
> if is_dml and num_rows == 0 and num_deleted_rows > 0:
> TypeError: '>' not supported between instances of 'NoneType' and 'int'
> {code}
> The same error should also happen when the delete removes > 0 rows, but the 
> Impala server is an older version that doesn't set TDmlResult.rows_deleted.






[jira] [Commented] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows

2024-04-10 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835793#comment-17835793
 ] 

Csaba Ringhofer commented on IMPALA-12990:
--

https://gerrit.cloudera.org/#/c/21284

> impala-shell broken if Iceberg delete deletes 0 rows
> 
>
> Key: IMPALA-12990
> URL: https://issues.apache.org/jira/browse/IMPALA-12990
> Project: IMPALA
>  Issue Type: Bug
>  Components: Clients
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: iceberg
>
> Happens only with Python 3
> {code}
> impala-python3 shell/impala_shell.py
> create table icebergupdatet (i int, s string) stored as iceberg;
> alter table icebergupdatet set tblproperties("format-version"="2");
> delete from icebergupdatet where i=0;
> Unknown Exception : '>' not supported between instances of 'NoneType' and 
> 'int'
> Traceback (most recent call last):
>   File "shell/impala_shell.py", line 1428, in _execute_stmt
> if is_dml and num_rows == 0 and num_deleted_rows > 0:
> TypeError: '>' not supported between instances of 'NoneType' and 'int'
> {code}
> The same error should also happen when the delete removes > 0 rows, but the 
> Impala server is an older version that doesn't set TDmlResult.rows_deleted.






[jira] [Created] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows

2024-04-10 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12990:


 Summary: impala-shell broken if Iceberg delete deletes 0 rows
 Key: IMPALA-12990
 URL: https://issues.apache.org/jira/browse/IMPALA-12990
 Project: IMPALA
  Issue Type: Bug
  Components: Clients
Reporter: Csaba Ringhofer


Happens only with Python 3
{code}
impala-python3 shell/impala_shell.py

create table icebergupdatet (i int, s string) stored as iceberg;
alter table icebergupdatet set tblproperties("format-version"="2");
delete from icebergupdatet where i=0;
Unknown Exception : '>' not supported between instances of 'NoneType' and 'int'
Traceback (most recent call last):
  File "shell/impala_shell.py", line 1428, in _execute_stmt
if is_dml and num_rows == 0 and num_deleted_rows > 0:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
{code}

The same error should also happen when the delete removes > 0 rows, but the 
Impala server is an older version that doesn't set TDmlResult.rows_deleted.
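A None-safe version of the check would treat a missing TDmlResult.rows_deleted as zero (a sketch of one possible fix, not the actual patch):

```python
def report_zero_rows_deleted(is_dml, num_rows, num_deleted_rows):
    """None-safe variant of the impala_shell.py condition; older servers
    leave rows_deleted unset (None), which this treats as 0."""
    deleted = num_deleted_rows if num_deleted_rows is not None else 0
    return is_dml and num_rows == 0 and deleted > 0
```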






[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values

2024-04-09 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12987:
-
Description: 
Inserting strings with "\0" characters into partition columns leads to errors 
in both Iceberg and Hive tables.

The issue is more severe in Iceberg tables, as from that point the table can't 
be read in Impala or Hive:
{code}
create table iceberg_unicode (s string, p string) partitioned by spec 
(identity(p)) stored as iceberg;
insert into iceberg_unicode select "a", "a\0a";
ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
hdfs://localhost:20500/test-warehouse/iceberg_unicode
CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
catalog server log for more details.
{code}

The partition directory created above seems truncated:
hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a

In partitioned Hive tables the insert also returns an error, but the new 
partition is not created and the table remains usable. The error is similar to 
IMPALA-11499's.

Note that Java handles \0 characters in Unicode in a special way, which may be 
related: 
https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542
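The truncated p=a directory is consistent with the string being cut at the NUL somewhere along a C-string boundary; a minimal illustration of the suspected mechanism (not Impala's actual code path):

```python
partition_value = "a\0a"

# A C-style API that stops at the first NUL sees only the prefix,
# which would explain the truncated partition directory name.
as_c_string = partition_value.split("\0", 1)[0]
```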


  was:
Inserting strings with "\0" values to partition columns leads errors both in 
Iceberg and Hive tables. 

The issue is more severe in Iceberg tables as from this point the table can't 
be read in Impala or Hive:
{code}
create table iceberg_unicode (s string, p string) partitioned by spec 
(identity(p)) stored as iceberg;
insert into iceberg_unicode select "a", "a\0a";
ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
hdfs://localhost:20500/test-warehouse/iceberg_unicode
CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
catalog server log for more details.
{code}

The partition directory created above seems truncated:
hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a

In partitioned Hive tables the insert also returns an error, but the new
partition is not created and the table remains usable. The error is similar to
IMPALA-11499's.

Note that Java handles \0 characters in Unicode in a special way, which may be
related: 
https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542



> Errors with \0 character in partition values
> 
>
> Key: IMPALA-12987
> URL: https://issues.apache.org/jira/browse/IMPALA-12987
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Critical
>  Labels: iceberg
>
> Inserting strings with "\0" values into partition columns leads to errors in both
> Iceberg and Hive tables.
> The issue is more severe in Iceberg tables as from this point the table can't 
> be read in Impala or Hive:
> {code}
> create table iceberg_unicode (s string, p string) partitioned by spec 
> (identity(p)) stored as iceberg;
> insert into iceberg_unicode select "a", "a\0a";
> ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
> hdfs://localhost:20500/test-warehouse/iceberg_unicode
> CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
> paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
> catalog server log for more details.
> {code}
> The partition directory created above seems truncated:
> hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a
> In partitioned Hive tables the insert also returns an error, but the new
> partition is not created and the table remains usable. The error is similar
> to IMPALA-11499's.
> Note that Java handles \0 characters in Unicode in a special way, which may
> be related: 
> https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values

2024-04-09 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12987:
-
Description: 
Inserting strings with "\0" values into partition columns leads to errors in both
Iceberg and Hive tables.

The issue is more severe in Iceberg tables as from this point the table can't 
be read in Impala or Hive:
{code}
create table iceberg_unicode (s string, p string) partitioned by spec 
(identity(p)) stored as iceberg;
insert into iceberg_unicode select "a", "a\0a";
ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
hdfs://localhost:20500/test-warehouse/iceberg_unicode
CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
catalog server log for more details.
{code}

The partition directory created above seems truncated:
hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a

In partitioned Hive tables the insert also returns an error, but the new
partition is not created and the table remains usable. The error is similar to
IMPALA-11499's.

Note that Java handles \0 characters in Unicode in a special way, which may be
related: 
https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542


  was:
Inserting strings with "\0" values into partition columns leads to errors in both
Iceberg and Hive tables.

The issue is more severe in Iceberg tables as from this point the table can't 
be read in Impala or Hive:
{code}
create table iceberg_unicode (s string, p string) partitioned by spec 
(identity(p)) stored as iceberg;
insert into iceberg_unicode select "a", "a\0a";
ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
hdfs://localhost:20500/test-warehouse/iceberg_unicode
CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
catalog server log for more details.
{code}

In partitioned Hive tables the insert also returns an error, but the new
partition is not created and the table remains usable. The error is similar to
IMPALA-11499's.



> Errors with \0 character in partition values
> 
>
> Key: IMPALA-12987
> URL: https://issues.apache.org/jira/browse/IMPALA-12987
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Critical
>  Labels: iceberg
>
> Inserting strings with "\0" values into partition columns leads to errors in both
> Iceberg and Hive tables.
> The issue is more severe in Iceberg tables as from this point the table can't 
> be read in Impala or Hive:
> {code}
> create table iceberg_unicode (s string, p string) partitioned by spec 
> (identity(p)) stored as iceberg;
> insert into iceberg_unicode select "a", "a\0a";
> ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
> hdfs://localhost:20500/test-warehouse/iceberg_unicode
> CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
> paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
> catalog server log for more details.
> {code}
> The partition directory created above seems truncated:
> hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a
> In partitioned Hive tables the insert also returns an error, but the new
> partition is not created and the table remains usable. The error is similar
> to IMPALA-11499's.
> Note that Java handles \0 characters in Unicode in a special way, which may
> related: 
> https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542






[jira] [Created] (IMPALA-12987) Errors with \0 character in partition values

2024-04-09 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12987:


 Summary: Errors with \0 character in partition values
 Key: IMPALA-12987
 URL: https://issues.apache.org/jira/browse/IMPALA-12987
 Project: IMPALA
  Issue Type: Bug
Reporter: Csaba Ringhofer


Inserting strings with "\0" values into partition columns leads to errors in both
Iceberg and Hive tables.

The issue is more severe in Iceberg tables, as from this point the table
can't be read in Impala or Hive:
{code}
 create table iceberg_unicode (s string, p string) partitioned by spec 
(identity(p)) stored as iceberg;
insert into iceberg_unicode select "a", "a\0a";
ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
hdfs://localhost:20500/test-warehouse/iceberg_unicode
CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
catalog server log for more details.
{code}

In partitioned Hive tables the insert also returns an error, but the new
partition is not created and the table remains usable. The error is similar to
IMPALA-11499's.







[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values

2024-04-09 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12987:
-
Description: 
Inserting strings with "\0" values into partition columns leads to errors in both
Iceberg and Hive tables.

The issue is more severe in Iceberg tables as from this point the table can't 
be read in Impala or Hive:
{code}
create table iceberg_unicode (s string, p string) partitioned by spec 
(identity(p)) stored as iceberg;
insert into iceberg_unicode select "a", "a\0a";
ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
hdfs://localhost:20500/test-warehouse/iceberg_unicode
CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
catalog server log for more details.
{code}

In partitioned Hive tables the insert also returns an error, but the new
partition is not created and the table remains usable. The error is similar to
IMPALA-11499's.


  was:
Inserting strings with "\0" values into partition columns leads to errors in both
Iceberg and Hive tables.

The issue is more severe in Iceberg tables, as from this point the table
can't be read in Impala or Hive:
{code}
 create table iceberg_unicode (s string, p string) partitioned by spec 
(identity(p)) stored as iceberg;
insert into iceberg_unicode select "a", "a\0a";
ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
hdfs://localhost:20500/test-warehouse/iceberg_unicode
CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
catalog server log for more details.
{code}

In partitioned Hive tables the insert also returns an error, but the new
partition is not created and the table remains usable. The error is similar to
IMPALA-11499's.



> Errors with \0 character in partition values
> 
>
> Key: IMPALA-12987
> URL: https://issues.apache.org/jira/browse/IMPALA-12987
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Critical
>  Labels: iceberg
>
> Inserting strings with "\0" values into partition columns leads to errors in both
> Iceberg and Hive tables.
> The issue is more severe in Iceberg tables as from this point the table can't 
> be read in Impala or Hive:
> {code}
> create table iceberg_unicode (s string, p string) partitioned by spec 
> (identity(p)) stored as iceberg;
> insert into iceberg_unicode select "a", "a\0a";
> ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table 
> hdfs://localhost:20500/test-warehouse/iceberg_unicode
> CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 
> paths for table default.iceberg_unicode: failed to load 1 paths. Check the 
> catalog server log for more details.
> {code}
> In partitioned Hive tables the insert also returns an error, but the new
> partition is not created and the table remains usable. The error is similar
> to IMPALA-11499's.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources

2024-04-08 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12969:
-
Priority: Critical  (was: Major)

> DeserializeThriftMsg may leak JNI resources
> ---
>
> Key: IMPALA-12969
> URL: https://issues.apache.org/jira/browse/IMPALA-12969
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Critical
> Fix For: Impala 4.4.0
>
>
> JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements 
> call, but this is not done in case there is an error during deserialization:
> [https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66]
>  






[jira] [Resolved] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources

2024-04-08 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer resolved IMPALA-12969.
--
Fix Version/s: Impala 4.4.0
   Resolution: Fixed

> DeserializeThriftMsg may leak JNI resources
> ---
>
> Key: IMPALA-12969
> URL: https://issues.apache.org/jira/browse/IMPALA-12969
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Major
> Fix For: Impala 4.4.0
>
>
> JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements 
> call, but this is not done in case there is an error during deserialization:
> [https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66]
>  






[jira] [Created] (IMPALA-12978) IMPALA-12544 made impala-shell incompatible with old impala servers

2024-04-08 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12978:


 Summary: IMPALA-12544 made impala-shell incompatible with old 
impala servers
 Key: IMPALA-12978
 URL: https://issues.apache.org/jira/browse/IMPALA-12978
 Project: IMPALA
  Issue Type: Bug
  Components: Clients
Reporter: Csaba Ringhofer


IMPALA-12544 uses "progress.total_fragment_instances > 0:", but
total_fragment_instances is None if the server is older and does not know this 
Thrift member yet (added in IMPALA-12048). 
[https://github.com/apache/impala/blob/fb3c379f395635f9f6927b40694bc3dd95a2866f/shell/impala_shell.py#L1320]

 

This leads to error messages in interactive shell sessions when progress 
reporting is enabled.
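A minimal sketch of the failure mode and a backward-compatible guard (variable names are illustrative, not the actual impala-shell code):

```python
# Older servers do not send total_fragment_instances, so the Thrift
# field deserializes to None on the client side.
total_fragment_instances = None

# A bare comparison like the one introduced by IMPALA-12544 raises
# TypeError in Python 3 when the value is None.
try:
    has_progress = total_fragment_instances > 0
except TypeError:
    has_progress = False

# A compatible check short-circuits on None before comparing.
has_progress = total_fragment_instances is not None and total_fragment_instances > 0
print(has_progress)  # False
```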






[jira] [Created] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources

2024-04-03 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12969:


 Summary: DeserializeThriftMsg may leak JNI resources
 Key: IMPALA-12969
 URL: https://issues.apache.org/jira/browse/IMPALA-12969
 Project: IMPALA
  Issue Type: Bug
Reporter: Csaba Ringhofer


JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements 
call, but this is not done in case there is an error during deserialization:

[https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66]
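The fix follows the usual acquire/release discipline: the release must run on every path, including the error path. A language-neutral sketch in Python (the real code is C++ JNI; the function and parameter names here are illustrative):

```python
def deserialize(acquire, release, parse, data):
    """Acquire a buffer, parse it, and always release the buffer,
    even when parsing fails -- the analogue of pairing
    GetByteArrayElements with ReleaseByteArrayElements."""
    buf = acquire(data)
    try:
        return parse(buf)
    finally:
        release(buf)  # runs on both the success and the error path
```

In the C++ code the same effect is typically achieved with an RAII guard so that early returns cannot skip the release call.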

 






[jira] [Updated] (IMPALA-12968) Early EndDataStream RPC could be responded earlier

2024-04-03 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12968:
-
Description: 
When a producer fragment sends no rows and finishes before the receiver is
initialized, the EndDataStream RPC is stored as an early sender and is responded
to when the receiver is registered.

[https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150]

While it is important to record that the EOS has happened, so that the sender can
be unregistered from the receiver, the RPC itself could be responded to right
after it is stored in the early-sender map.
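The proposed ordering can be sketched as follows, in Python with hypothetical names (the real code is C++ in krpc-data-stream-mgr.cc):

```python
def handle_early_end_data_stream(early_senders, sender_id, respond):
    # Record that this sender reached EOS so the receiver can
    # unregister it once the receiver is created...
    early_senders.add(sender_id)
    # ...but respond to the RPC immediately instead of holding the
    # response until the receiver registers.
    respond()
```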

  was:
When a producer fragment sends no rows and finishes before the receiver is
initialized, the EndDataStream RPC is stored as an early sender and is responded
to when the receiver is registered.

[https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150]

While it is important to store the information that the EOS has happened to 
unregister the sender from the receiver, the RPC itself could be responded 
right after it was stored in the early sender map.


> Early EndDataStream RPC could be responded earlier
> --
>
> Key: IMPALA-12968
> URL: https://issues.apache.org/jira/browse/IMPALA-12968
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Csaba Ringhofer
>Priority: Minor
>  Labels: krpc
>
> When a producer fragment sends no rows and finishes before the receiver is
> initialized, the EndDataStream RPC is stored as an early sender and is responded
> to when the receiver is registered.
> [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150]
> While it is important to store the information that the EOS has happened to 
> unregister the sender from the receiver, the RPC itself could be responded 
> right after it was stored in the early sender map.






[jira] [Created] (IMPALA-12968) Early EndDataStream RPC could be responded earlier

2024-04-03 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12968:


 Summary: Early EndDataStream RPC could be responded earlier
 Key: IMPALA-12968
 URL: https://issues.apache.org/jira/browse/IMPALA-12968
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Reporter: Csaba Ringhofer


When a producer fragment sends no rows and finishes before the receiver is
initialized, the EndDataStream RPC is stored as an early sender and is responded
to when the receiver is registered.

[https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150]

While it is important to store the information that the EOS has happened to 
unregister the sender from the receiver, the RPC itself could be responded 
right after it was stored in the early sender map.






[jira] [Comment Edited] (IMPALA-10349) Revisit constant folding on non-ASCII strings

2024-03-25 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830545#comment-17830545
 ] 

Csaba Ringhofer edited comment on IMPALA-10349 at 3/25/24 3:55 PM:
---

Also bumped into this related to pushing down to Kudu:
{code:java}
explain select count(*) from functional_kudu.alltypes where string_col = "á";

-- kudu predicates: string_col = 'á'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("a", "")

-- kudu predicates: string_col = 'a'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("á", "")

-- not pushed down to Kudu:

-- predicates: string_col = concat('á', '') 

{code}
>I think we should allow folding non-ASCII strings if they are legal UTF-8 
>strings.

[~stigahuang] Do you know why it is not possible to fold strings that are not
valid UTF-8?

Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will
be folded to a StringLiteral. It would be useful to also fold expressions like
cast(unhex("aa") as binary) to be able to push them down to Kudu.
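The suggested criterion (fold only when the result is legal UTF-8) is cheap to check; a sketch of such a validity test:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True when data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("á".encode("utf-8")))  # True
print(is_valid_utf8(b"\xaa"))              # False: lone continuation byte
```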


was (Author: csringhofer):
Also bumped into this related to pushing down to Kudu:

{code}

explain select count(*) from functional_kudu.alltypes where string_col = "á";

-- kudu predicates: string_col = 'á'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("a", "")

-- kudu predicates: string_col = 'a'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("á", "")

-- not pushed down to Kudu:

-- predicates: string_col = concat('á', '') 

{code} 

>I think we should allow folding non-ASCII strings if they are legal UTF-8 
>strings.

[~stigahuang] Do you know why it is not possible to fold strings that are not
valid UTF-8?

Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will
be folded to a StringLiteral. It would be useful to also fold expressions like
cast(unhex("aa") as binary) to be able to push them down to Kudu.

 

> Revisit constant folding on non-ASCII strings
> -
>
> Key: IMPALA-10349
> URL: https://issues.apache.org/jira/browse/IMPALA-10349
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Quanlong Huang
>Priority: Critical
>
> Constant folding may produce non-ASCII strings. In such cases, we currently 
> abandon folding the constant. See commit message of IMPALA-1788 or codes 
> here: 
> [https://github.com/apache/impala/blob/9672d945963e1ca3c8699340f92d7d6ce1d91c9f/fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java#L274-L282]
> I think we should allow folding non-ASCII strings if they are legal UTF-8 
> strings.
> Example of constant folding work:
> {code:java}
> Query: explain select * from functional.alltypes where string_col = 
> substr('123', 1, 1)
> +-+
> | Explain String  |
> +-+
> | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 |
> | Per-Host Resource Estimates: Memory=160MB   |
> | Codegen disabled by planner |
> | |
> | PLAN-ROOT SINK  |
> | |   |
> | 01:EXCHANGE [UNPARTITIONED] |
> | |   |
> | 00:SCAN HDFS [functional.alltypes]  |
> |HDFS partitions=24/24 files=24 size=478.45KB |
> |predicates: string_col = '1' |
> |row-size=89B cardinality=730 |
> +-+
> {code}
> Example of constant folding doesn't work:
> {code:java}
> Query: explain select * from functional.alltypes where string_col = 
> substr('引擎', 1, 3)
> +-+
> | Explain String  |
> +-+
> | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 |
> | Per-Host Resource Estimates: Memory=160MB   |
> | Codegen disabled by planner |
> | |
> | PLAN-ROOT SINK  |
> | |   |
> | 01:EXCHANGE [UNPARTITIONED] |
> | |   |
> | 00:SCAN HDFS 

[jira] [Comment Edited] (IMPALA-10349) Revisit constant folding on non-ASCII strings

2024-03-25 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830545#comment-17830545
 ] 

Csaba Ringhofer edited comment on IMPALA-10349 at 3/25/24 3:55 PM:
---

Also bumped into this related to pushing down to Kudu:
{code:java}
explain select count(*) from functional_kudu.alltypes where string_col = "á";

-- kudu predicates: string_col = 'á'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("a", "")

-- kudu predicates: string_col = 'a'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("á", "")

-- not pushed down to Kudu:

-- predicates: string_col = concat('á', '') 

{code}
>I think we should allow folding non-ASCII strings if they are legal UTF-8 
>strings.

[~stigahuang] Do you know why it is not possible to fold strings that are not
valid UTF-8?

Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will
be folded to a StringLiteral. It would be useful to also fold expressions like
cast(unhex("aa") as binary) to be able to push them down to Kudu.


was (Author: csringhofer):
Also bumped into this related to pushing down to Kudu:
{code:java}
explain select count(*) from functional_kudu.alltypes where string_col = "á";

-- kudu predicates: string_col = 'á'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("a", "")

-- kudu predicates: string_col = 'a'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("á", "")

-- not pushed down to Kudu:

-- predicates: string_col = concat('á', '') 

{code}
>I think we should allow folding non-ASCII strings if they are legal UTF-8 
>strings.

[~stigahuang] Do you know why it is not possible to fold strings that are not
valid UTF-8?

Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will
be folded to a StringLiteral. It would be useful to also fold expressions like
cast(unhex("aa") as binary) to be able to push them down to Kudu.

> Revisit constant folding on non-ASCII strings
> -
>
> Key: IMPALA-10349
> URL: https://issues.apache.org/jira/browse/IMPALA-10349
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Quanlong Huang
>Priority: Critical
>
> Constant folding may produce non-ASCII strings. In such cases, we currently 
> abandon folding the constant. See commit message of IMPALA-1788 or codes 
> here: 
> [https://github.com/apache/impala/blob/9672d945963e1ca3c8699340f92d7d6ce1d91c9f/fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java#L274-L282]
> I think we should allow folding non-ASCII strings if they are legal UTF-8 
> strings.
> Example of constant folding work:
> {code:java}
> Query: explain select * from functional.alltypes where string_col = 
> substr('123', 1, 1)
> +-+
> | Explain String  |
> +-+
> | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 |
> | Per-Host Resource Estimates: Memory=160MB   |
> | Codegen disabled by planner |
> | |
> | PLAN-ROOT SINK  |
> | |   |
> | 01:EXCHANGE [UNPARTITIONED] |
> | |   |
> | 00:SCAN HDFS [functional.alltypes]  |
> |HDFS partitions=24/24 files=24 size=478.45KB |
> |predicates: string_col = '1' |
> |row-size=89B cardinality=730 |
> +-+
> {code}
> Example of constant folding doesn't work:
> {code:java}
> Query: explain select * from functional.alltypes where string_col = 
> substr('引擎', 1, 3)
> +-+
> | Explain String  |
> +-+
> | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 |
> | Per-Host Resource Estimates: Memory=160MB   |
> | Codegen disabled by planner |
> | |
> | PLAN-ROOT SINK  |
> | |   |
> | 01:EXCHANGE [UNPARTITIONED] |
> | |   |
> | 00:SCAN HDFS 

[jira] [Commented] (IMPALA-10349) Revisit constant folding on non-ASCII strings

2024-03-25 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830545#comment-17830545
 ] 

Csaba Ringhofer commented on IMPALA-10349:
--

Also bumped into this related to pushing down to Kudu:

{code}

explain select count(*) from functional_kudu.alltypes where string_col = "á";

-- kudu predicates: string_col = 'á'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("a", "")

-- kudu predicates: string_col = 'a'

explain select count(*) from functional_kudu.alltypes where string_col = 
concat("á", "")

-- not pushed down to Kudu:

-- predicates: string_col = concat('á', '') 

{code} 

>I think we should allow folding non-ASCII strings if they are legal UTF-8 
>strings.

[~stigahuang] Do you know why it is not possible to fold strings that are not
valid UTF-8?

Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will
be folded to a StringLiteral. It would be useful to also fold expressions like
cast(unhex("aa") as binary) to be able to push them down to Kudu.

 

> Revisit constant folding on non-ASCII strings
> -
>
> Key: IMPALA-10349
> URL: https://issues.apache.org/jira/browse/IMPALA-10349
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Quanlong Huang
>Priority: Critical
>
> Constant folding may produce non-ASCII strings. In such cases, we currently 
> abandon folding the constant. See commit message of IMPALA-1788 or codes 
> here: 
> [https://github.com/apache/impala/blob/9672d945963e1ca3c8699340f92d7d6ce1d91c9f/fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java#L274-L282]
> I think we should allow folding non-ASCII strings if they are legal UTF-8 
> strings.
> Example of constant folding work:
> {code:java}
> Query: explain select * from functional.alltypes where string_col = 
> substr('123', 1, 1)
> +-+
> | Explain String  |
> +-+
> | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 |
> | Per-Host Resource Estimates: Memory=160MB   |
> | Codegen disabled by planner |
> | |
> | PLAN-ROOT SINK  |
> | |   |
> | 01:EXCHANGE [UNPARTITIONED] |
> | |   |
> | 00:SCAN HDFS [functional.alltypes]  |
> |HDFS partitions=24/24 files=24 size=478.45KB |
> |predicates: string_col = '1' |
> |row-size=89B cardinality=730 |
> +-+
> {code}
> Example of constant folding doesn't work:
> {code:java}
> Query: explain select * from functional.alltypes where string_col = 
> substr('引擎', 1, 3)
> +-+
> | Explain String  |
> +-+
> | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 |
> | Per-Host Resource Estimates: Memory=160MB   |
> | Codegen disabled by planner |
> | |
> | PLAN-ROOT SINK  |
> | |   |
> | 01:EXCHANGE [UNPARTITIONED] |
> | |   |
> | 00:SCAN HDFS [functional.alltypes]  |
> |HDFS partitions=24/24 files=24 size=478.45KB |
> |predicates: string_col = substr('引擎', 1, 3)|
> |row-size=89B cardinality=730 |
> +-+
> {code}






[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables

2024-03-22 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829953#comment-17829953
 ] 

Csaba Ringhofer commented on IMPALA-12927:
--

I think the best approach would be to check the table property "json.binary.format":
 * if not set, give a clear error message
 * if base64, do base64 decoding
 * if rawstring, handle it the way Hive does: 
[https://github.com/apache/hive/blame/f216bbb632752f467321869cee03adf9477409cf/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonReader.java#L455]

Note that I don't know exactly how special characters are handled in the
rawstring case.
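For the base64 case, decoding per value is straightforward; a sketch with a hypothetical row (the property name json.binary.format comes from Hive, but this is not Impala's actual scanner code):

```python
import base64
import json

# Hypothetical JSON row where the binary column is base64-encoded,
# i.e. the table has json.binary.format=base64.
row = json.loads('{"id": 1, "string_col": "ascii", "binary_col": "YmluYXJ5MQ=="}')

# Decode the stored text back to the raw bytes of the BINARY value.
raw = base64.b64decode(row["binary_col"])
print(raw)  # b'binary1'
```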

> Support reading BINARY columns in JSON tables
> -
>
> Key: IMPALA-12927
> URL: https://issues.apache.org/jira/browse/IMPALA-12927
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Csaba Ringhofer
>Assignee: Zihao Ye
>Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive 
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> ++--++
> | id | string_col   | binary_col |
> ++--++
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> ++--++
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, 
> type: STRING, data: 'binary1'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'binary2'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'árvíztűrőtükörfúró'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '你好hello'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '��'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '�D3"'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> {code}
> The single file in the table looks like this:
> {code}
>  hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"}
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Comment Edited] (IMPALA-12927) Support reading BINARY columns in JSON tables

2024-03-21 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829614#comment-17829614
 ] 

Csaba Ringhofer edited comment on IMPALA-12927 at 3/21/24 3:47 PM:
---

[~Eyizoha]  About AuxColumnType: fyi there is an ongoing refactor to remove 
that class and make it easier to decide whether a column is STRING or BINARY: 
[https://gerrit.cloudera.org/#/c/21157/]

About encoding of BINARY columns: I looked at the Hive code, but it doesn't 
match with the encoding I see in the files.

[https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135]

Current Apache Hive seems to default to using base64 encoding, while it can be 
altered with tbl property "json.binary.format". In the JSON tables in Impala's 
dataload the files are certainly not base64 encoded and "json.binary.format" is 
also not set, so it doesn't seem to work like the current Hive codebase. It is 
possible that this is related to differences between Apache Impala's Hive 
dependency and current Apache Hive.

Currently Impala base64 decodes the BINARY columns:
{code:java}
Hive:

create table tjsonbinary (s string, b binary) stored as JSONFILE;

insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary)));

Impala:

select * from tjsonbinary;

+--+--+
| s    | b    |
+--+--+
| abcd | abcd |
+--+--+

{code}
What do you think about disabling BINARY column reading in JSON until Hive 
compatibility is clarified? My concern is that besides error messages and 
nulled values this may actually lead to correctness issues as many strings are 
both valid utf8 strings and base64 strings, so Impala may return unintended 
results.
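The ambiguity can be demonstrated in a few lines of Python (my own illustration, not Impala code): many ordinary ASCII strings are themselves valid base64, so an unconditional base64 decode produces different bytes silently instead of failing.

```python
import base64

# "abcd" is a perfectly ordinary string, yet it is also a valid base64
# payload, so a reader that always base64-decodes BINARY columns will
# silently return different bytes rather than raise an error.
raw = "abcd"
decoded = base64.b64decode(raw, validate=True)  # no error is raised
print(decoded)  # b'i\xb7\x1d' - not the original b'abcd'
```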



> Support reading BINARY columns in JSON tables
> -
>
> Key: IMPALA-12927
> URL: https://issues.apache.org/jira/browse/IMPALA-12927
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Csaba Ringhofer
>Assignee: Zihao Ye
>Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive 
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> ++--++
> | id | string_col   | binary_col |
> ++--++
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> ++--++
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, 
> type: STRING, data: 'binary1'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'binary2'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 

[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables

2024-03-21 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829614#comment-17829614
 ] 

Csaba Ringhofer commented on IMPALA-12927:
--

[~Eyizoha]  About AuxColumnType: fyi there is an ongoing refactor to remove 
that class and make it easier to decide whether a column is STRING or BINARY: 
[https://gerrit.cloudera.org/#/c/21157/]

About encoding of BINARY columns: I looked at the Hive code, but it doesn't 
match with the encoding I see in the files.

[https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135]

Current Apache Hive seems to default to using base64 encoding, while it can be 
altered with tbl property "json.binary.format". In the JSON tables in Impala's 
dataload the files are certainly not base64 encoded and "json.binary.format" is 
also not set, so it doesn't seem to work like the current Hive codebase. It is 
possible that this is related to differences between Apache Impala's Hive 
dependency and current Apache Hive.

Currently Impala base64 decodes the BINARY columns:

{code}

Hive:

create table tjsonbinary (s string, b binary) stored as JSONFILE;

insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary)));

Impala:

select * from tjsonbinary;

+--+--+
| s    | b    |
+--+--+
| abcd | abcd |
+--+--+

{code}

What do you think about disabling BINARY column reading in JSON until Hive 
compatibility is clarified? My concern is that besides error messages and 
nulled values this may actually lead to correctness issues as many strings are 
both valid utf8 strings and base64 strings, so Impala may return unintended 
results.

> Support reading BINARY columns in JSON tables
> -
>
> Key: IMPALA-12927
> URL: https://issues.apache.org/jira/browse/IMPALA-12927
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Csaba Ringhofer
>Assignee: Zihao Ye
>Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive 
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> ++--++
> | id | string_col   | binary_col |
> ++--++
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> ++--++
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, 
> type: STRING, data: 'binary1'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'binary2'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'árvíztűrőtükörfúró'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '你好hello'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '��'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '�D3"'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> {code}
> The single file in the table looks like this:
> {code}
>  hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"}
> {code}
>  




[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables

2024-03-20 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829192#comment-17829192
 ] 

Csaba Ringhofer commented on IMPALA-12927:
--

[~Eyizoha] I see that BINARY tests are explicitly skipped for JSON, but I 
couldn't find any discussion about this in the commit that added the JSON scanner:

[https://gerrit.cloudera.org/#/c/19699/33/tests/query_test/test_scanners.py]

Do you have an idea on what to do with BINARY columns? I am not familiar with 
Hive's JSON files, so I don't know what the intended encoding for BINARY 
columns is. I know that the JSON format doesn't support binary values, so 
generally some encoding (e.g. base64) is used to convert byte arrays to an 
ASCII representation. 
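As a sketch of why base64 is a common choice here (my own example, not tied to Hive's actual implementation): arbitrary bytes, including ones that are not valid UTF-8, round-trip losslessly through an ASCII-only representation.

```python
import base64

# Arbitrary bytes that are not valid UTF-8 still round-trip cleanly
# through base64, which is why it is a common way to embed binary
# values in text-only formats such as JSON.
payload = b"\xde\xad\xbe\xef"
encoded = base64.b64encode(payload).decode("ascii")
print(encoded)  # 3q2+7w==
assert base64.b64decode(encoded) == payload
```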

> Support reading BINARY columns in JSON tables
> -
>
> Key: IMPALA-12927
> URL: https://issues.apache.org/jira/browse/IMPALA-12927
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Csaba Ringhofer
>Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive 
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> ++--++
> | id | string_col   | binary_col |
> ++--++
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> ++--++
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, 
> type: STRING, data: 'binary1'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'binary2'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'árvíztűrőtükörfúró'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '你好hello'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '��'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '�D3"'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before 
> offset: 481
> {code}
> The single file in the table looks like this:
> {code}
>  hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"}
> {code}
>  






[jira] [Created] (IMPALA-12927) Support reading BINARY columns in JSON tables

2024-03-20 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12927:


 Summary: Support reading BINARY columns in JSON tables
 Key: IMPALA-12927
 URL: https://issues.apache.org/jira/browse/IMPALA-12927
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Csaba Ringhofer


Currently Impala cannot correctly read BINARY columns in JSON files written by 
Hive and returns runtime errors:

{code}

select * from functional_json.binary_tbl;
++--++
| id | string_col   | binary_col |
++--++
| 1  | ascii        | NULL       |
| 2  | ascii        | NULL       |
| 3  | null         | NULL       |
| 4  | empty        |            |
| 5  | valid utf8   | NULL       |
| 6  | valid utf8   | NULL       |
| 7  | invalid utf8 | NULL       |
| 8  | invalid utf8 | NULL       |
++--++
WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, type: 
STRING, data: 'binary1'
Error parsing row: file: 
hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 
481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
data: 'binary2'
Error parsing row: file: 
hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 
481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
data: 'árvíztűrőtükörfúró'
Error parsing row: file: 
hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 
481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
data: '你好hello'
Error parsing row: file: 
hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 
481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
data: '��'
Error parsing row: file: 
hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 
481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
data: '�D3"'
Error parsing row: file: 
hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 
481

{code}

The single file in the table looks like this:

{code}

 hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0

{"id":1,"string_col":"ascii","binary_col":"binary1"}
{"id":2,"string_col":"ascii","binary_col":"binary2"}
{"id":3,"string_col":"null","binary_col":null}
{"id":4,"string_col":"empty","binary_col":""}
{"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
{"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
{"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"}
{"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"}

{code}

 






[jira] [Commented] (IMPALA-12899) Temporary workaround for BINARY in complex types

2024-03-19 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828387#comment-17828387
 ] 

Csaba Ringhofer commented on IMPALA-12899:
--

base64 encoding seems a sane and widely used approach to me. I would suggest 
the following:
 # implement it first with base64 encoding
 # if there is demand to handle this differently, add a query option like 
binary_column_encoding_in_json=base64 / skip / hive_style_unquoted_string

I would avoid a "lossy" solution as default, so one where the original binary 
value can't be decoded from the output.
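To illustrate what "lossy" means here (a hypothetical sketch, not Impala or Hive behavior): once invalid byte sequences are mapped to replacement characters during a bytes-to-text conversion, the original value can no longer be reconstructed.

```python
# One plausible lossy mapping: decoding non-UTF-8 bytes to text with
# replacement characters. The original bytes are lost; this is the kind
# of default the comment above argues against.
raw = b"\xff\xfe"  # not valid UTF-8
lossy = raw.decode("utf-8", errors="replace")
print(lossy)  # two U+FFFD replacement characters
assert lossy.encode("utf-8") != raw  # cannot recover the original bytes
```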

> Temporary workaround for BINARY in complex types
> 
>
> Key: IMPALA-12899
> URL: https://issues.apache.org/jira/browse/IMPALA-12899
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Daniel Becker
>Assignee: Daniel Becker
>Priority: Major
>
> The BINARY type is currently not supported inside complex types and a 
> cross-component decision is probably needed to support it (see IMPALA-11491). 
> We would like to enable EXPAND_COMPLEX_TYPES for Iceberg metadata tables 
> (IMPALA-12612), which requires that queries with BINARY inside complex types 
> don't fail. Enabling EXPAND_COMPLEX_TYPES is a more prioritised issue than 
> IMPALA-11491, so we should come up with a temporary solution, e.g. NULLing 
> BINARY values in complex types and logging a warning, or setting these BINARY 
> values to a warning string.






[jira] [Created] (IMPALA-12902) Event replication is can be broken if hms_event_incremental_refresh_transactional_table=false

2024-03-14 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12902:


 Summary: Event replication is can be broken if 
hms_event_incremental_refresh_transactional_table=false
 Key: IMPALA-12902
 URL: https://issues.apache.org/jira/browse/IMPALA-12902
 Project: IMPALA
  Issue Type: Bug
  Components: Catalog
Reporter: Csaba Ringhofer


when setting hms_event_incremental_refresh_transactional_table=false 
metadata.test_event_processing.TestEventProcessing.test_event_based_replication 
fails at the following assert:

[https://github.com/apache/impala/blob/6c0c26146d956ad771cee27283c1371b9c23adce/tests/metadata/test_event_processing_base.py#L234]

 

Based on the logs catalogd only sees alter_database and transaction events in 
this case, so if the transaction events (COMMIT_TXN) are ignored, then it 
doesn't detect the change in the table.

This seems strange as the commit that added the test is older than the one that 
added hms_event_incremental_refresh_transactional_table

[https://github.com/apache/impala/commit/e53d649f8a88f42a70237fe7c2663baa126fed1a]

vs

[https://github.com/apache/impala/commit/097b10104f23e0927d5b21b43a79f6cc10425f59]

 

So it is not clear to me how the test could pass originally. One possibility is 
that different events were generated in HMS at that time. 






[jira] [Updated] (IMPALA-12902) Event replication can be broken if hms_event_incremental_refresh_transactional_table=false

2024-03-14 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12902:
-
Summary: Event replication can be broken if 
hms_event_incremental_refresh_transactional_table=false  (was: Event 
replication is can be broken if 
hms_event_incremental_refresh_transactional_table=false)

> Event replication can be broken if 
> hms_event_incremental_refresh_transactional_table=false
> --
>
> Key: IMPALA-12902
> URL: https://issues.apache.org/jira/browse/IMPALA-12902
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Priority: Major
>
> when setting hms_event_incremental_refresh_transactional_table=false 
> metadata.test_event_processing.TestEventProcessing.test_event_based_replication
>  fails at the following assert:
> [https://github.com/apache/impala/blob/6c0c26146d956ad771cee27283c1371b9c23adce/tests/metadata/test_event_processing_base.py#L234]
>  
> Based on the logs catalogd only sees alter_database and transaction events in 
> this case, so if the transaction events (COMMIT_TXN) are ignored, then it 
> doesn't detect the change in the table.
> This seems strange as the commit that added the test is older than the one 
> that added hms_event_incremental_refresh_transactional_table
> [https://github.com/apache/impala/commit/e53d649f8a88f42a70237fe7c2663baa126fed1a]
> vs
> [https://github.com/apache/impala/commit/097b10104f23e0927d5b21b43a79f6cc10425f59]
>  
> So it is not clear to me how the test could pass originally. One possibility 
> is that different events were generated in HMS at that time. 






[jira] [Created] (IMPALA-12895) REFRESH doesn't detect changes in partition locations in ACID tables

2024-03-12 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12895:


 Summary: REFRESH doesn't detect changes in partition locations in 
ACID tables
 Key: IMPALA-12895
 URL: https://issues.apache.org/jira/browse/IMPALA-12895
 Project: IMPALA
  Issue Type: Bug
  Components: Catalog
Reporter: Csaba Ringhofer


This was discovered by running test 
metadata.test_event_processing.TestEventProcessing.test_transact_partition_location_change_from_hive
 when flag hms_event_incremental_refresh_transactional_table  is set to false.

[https://github.com/apache/impala/blob/ab6c9467f6347671b971dbce4c640bea032b6ed9/tests/metadata/test_event_processing.py#L164]

 

When hms_event_incremental_refresh_transactional_table  is true (default), the 
alter partition event is processed correctly and the location change is 
detected. But if it is false or event processing is turned off, the change is 
not detected and running REFRESH on the table also doesn't update the location.

The different handling based on the flag seems intentional:

https://github.com/apache/impala/blob/ab6c9467f6347671b971dbce4c640bea032b6ed9/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L2606

 

This seems to be an old issue, while the test was added in a recent commit:

[https://github.com/apache/impala/commit/32b29ff36fb3e05fd620a6714de88805052d0117]

 






[jira] [Work started] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled

2024-03-07 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-12835 started by Csaba Ringhofer.

> Transactional tables are unsynced when 
> hms_event_incremental_refresh_transactional_table is disabled
> 
>
> Key: IMPALA-12835
> URL: https://issues.apache.org/jira/browse/IMPALA-12835
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Csaba Ringhofer
>Priority: Critical
>
> There are some test failures when 
> hms_event_incremental_refresh_transactional_table is disabled:
>  * 
> tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events
>  * 
> tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication
> I can reproduce the issue locally:
> {noformat}
> $ bin/start-impala-cluster.py 
> --catalogd_args=--hms_event_incremental_refresh_transactional_table=false
> impala-shell> create table txn_tbl (id int, val int) stored as parquet 
> tblproperties 
> ('transactional'='true','transactional_properties'='insert_only');
> impala-shell> describe txn_tbl;  -- make the table loaded in Impala
> hive> insert into txn_tbl values(101, 200);
> impala-shell> select * from txn_tbl; {noformat}
> Impala shows no results until a REFRESH runs on this table.






[jira] [Commented] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled

2024-03-07 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824490#comment-17824490
 ] 

Csaba Ringhofer commented on IMPALA-12835:
--

https://gerrit.cloudera.org/#/c/21116/

> Transactional tables are unsynced when 
> hms_event_incremental_refresh_transactional_table is disabled
> 
>
> Key: IMPALA-12835
> URL: https://issues.apache.org/jira/browse/IMPALA-12835
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Csaba Ringhofer
>Priority: Critical
>
> There are some test failures when 
> hms_event_incremental_refresh_transactional_table is disabled:
>  * 
> tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events
>  * 
> tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication
> I can reproduce the issue locally:
> {noformat}
> $ bin/start-impala-cluster.py 
> --catalogd_args=--hms_event_incremental_refresh_transactional_table=false
> impala-shell> create table txn_tbl (id int, val int) stored as parquet 
> tblproperties 
> ('transactional'='true','transactional_properties'='insert_only');
> impala-shell> describe txn_tbl;  -- make the table loaded in Impala
> hive> insert into txn_tbl values(101, 200);
> impala-shell> select * from txn_tbl; {noformat}
> Impala shows no results until a REFRESH runs on this table.






[jira] [Closed] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS

2024-03-01 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer closed IMPALA-12812.

Resolution: Invalid

> Send reload event after ALTER TABLE RECOVER PARTITIONS
> --
>
> Key: IMPALA-12812
> URL: https://issues.apache.org/jira/browse/IMPALA-12812
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Csaba Ringhofer
>Priority: Major
>
> IMPALA-11808 added support for sending reload events after REFRESH to allow 
> other Impala clusters connecting to the same HMS to also reload their tables. 
> REFRESH is often used when files in external tables are written directly to 
> the filesystem without notifying HMS, so Impala needs to update its cache and 
> can't rely on HMS notifications.
> The same could be useful for ALTER TABLE RECOVER PARTITIONS.  -It detects 
> partition directories that were only created in the FS but not in HMS and 
> creates them in HMS too.-  - UPDATE: the previous sentence was not true with 
> current Impala.  It also reloads the table (similarly to other DDLs) and 
> detects new files in existing partitions.
> An HMS event is created for the new partitions but there is no event that 
> would indicate that there are new files in existing partitions. As ALTER 
> TABLE RECOVER PARTITIONS is called when the user expects changes in the 
> filesystem (similarly to REFRESH), it could be useful to send a reload event 
> after it is finished.






[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS

2024-03-01 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12812:
-
Description: 
IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala clusters connecting to the same HMS to also reload their tables. 
REFRESH is often used when files in external tables are written directly to the 
filesystem without notifying HMS, so Impala needs to update its cache and can't 
rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS.  -It detects 
partition directories that were only created in the FS but not in HMS and 
creates them in HMS too.-  - UPDATE: the previous sentence was not true with 
current Impala.  It also reloads the table (similarly to other DDLs) and 
detects new files in existing partitions.

An HMS event is created for the new partitions but there is no event that would 
indicate that there are new files in existing partitions. As ALTER TABLE 
RECOVER PARTITIONS is called when the user expects changes in the filesystem 
(similarly to REFRESH), it could be useful to send a reload event after it is 
finished.

  was:
IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala clusters connecting to the same HMS to also reload their tables. 
REFRESH is often used when files in external tables are written directly to the 
filesystem without notifying HMS, so Impala needs to update its cache and can't 
rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS.  {-}It detects 
partition directories that were only created in the FS but not in HMS and 
creates them in HMS too. I{-}t also reloads the table (similarly to other DDLs) 
and detects new files in existing partitions. - UPDATE: the previous sentence 
was not true with current Impala.

An HMS event is created for the new partitions but there is no event that would 
indicate that there are new files in existing partitions. As ALTER TABLE 
RECOVER PARTITIONS is called when the user expects changes in the filesystem 
(similarly to REFRESH), it could be useful to send a reload event after it is 
finished.


> Send reload event after ALTER TABLE RECOVER PARTITIONS
> --
>
> Key: IMPALA-12812
> URL: https://issues.apache.org/jira/browse/IMPALA-12812
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Csaba Ringhofer
>Priority: Major
>
> IMPALA-11808 added support for sending reload events after REFRESH to allow 
> other Impala cluster connecting to the same HMS to also reload their tables. 
> REFRESH is often used when in external tables the files are written directly 
> to filesystem without notifying HMS, so Impala needs to update its cache and 
> can't rely on HMS notifications.
> The same could be useful for ALTER TABLE RECOVER PARTITIONS.  -It detects 
> partition directories that were only created in the FS but not in HMS and 
> creates them in HMS too.-  - UPDATE: the previous sentence was not true with 
> current Impala.  It also reloads the table (similarly to other DDLs) and 
> detects new files in existing partitions.
> An HMS event is created for the new partitions but there is no event that 
> would indicate that there are new files in existing partitions. As ALTER 
> TABLE RECOVER PARTITIONS is called when the user expects changes in the 
> filesystem (similarly to REFRESH), it could be useful to send a reload event 
> after it is finished.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS

2024-03-01 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12812:
-
Description: 
IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala cluster connecting to the same HMS to also reload their tables. 
REFRESH is often used when in external tables the files are written directly to 
filesystem without notifying HMS, so Impala needs to update its cache and can't 
rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS.  {-}It detects 
partition directories that were only created in the FS but not in HMS and 
creates them in HMS too. I{-}t also reloads the table (similarly to other DDLs) 
and detects new files in existing partitions. - UPDATE: the previous sentence 
was not true with current Impala.

An HMS event is created for the new partitions but there is no event that would 
indicate that there are new files in existing partitions. As ALTER TABLE 
RECOVER PARTITIONS is called when the user expects changes in the filesystem 
(similarly to REFRESH), it could be useful to send a reload event after it is 
finished.

  was:
IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala cluster connecting to the same HMS to also reload their tables. 
REFRESH is often used when in external tables the files are written directly to 
filesystem without notifying HMS, so Impala needs to update its cache and can't 
rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}- It detects 
partition directories that were only created in the FS but not in HMS and 
creates them in HMS too.-{-}It also reloads the table (similarly to other DDLs) 
and detects new files in existing partitions. - UPDATE: the previous sentence 
was not true with current Impala.

An HMS event is created for the new partitions but there is no event that would 
indicate that there are new files in existing partitions. As ALTER TABLE 
RECOVER PARTITIONS is called when the user expects changes in the filesystem 
(similarly to REFRESH), it could be useful to send a reload event after it is 
finished.


> Send reload event after ALTER TABLE RECOVER PARTITIONS
> --
>
> Key: IMPALA-12812
> URL: https://issues.apache.org/jira/browse/IMPALA-12812
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Csaba Ringhofer
>Priority: Major
>
> IMPALA-11808 added support for sending reload events after REFRESH to allow 
> other Impala cluster connecting to the same HMS to also reload their tables. 
> REFRESH is often used when in external tables the files are written directly 
> to filesystem without notifying HMS, so Impala needs to update its cache and 
> can't rely on HMS notifications.
> The same could be useful for ALTER TABLE RECOVER PARTITIONS.  {-}It detects 
> partition directories that were only created in the FS but not in HMS and 
> creates them in HMS too. I{-}t also reloads the table (similarly to other 
> DDLs) and detects new files in existing partitions. - UPDATE: the previous 
> sentence was not true with current Impala.
> An HMS event is created for the new partitions but there is no event that 
> would indicate that there are new files in existing partitions. As ALTER 
> TABLE RECOVER PARTITIONS is called when the user expects changes in the 
> filesystem (similarly to REFRESH), it could be useful to send a reload event 
> after it is finished.






[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS

2024-03-01 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12812:
-
Description: 
IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala cluster connecting to the same HMS to also reload their tables. 
REFRESH is often used when in external tables the files are written directly to 
filesystem without notifying HMS, so Impala needs to update its cache and can't 
rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}- It detects 
partition directories that were only created in the FS but not in HMS and 
creates them in HMS too.-{-}It also reloads the table (similarly to other DDLs) 
and detects new files in existing partitions. - UPDATE: the previous sentence 
was not true with current Impala.

An HMS event is created for the new partitions but there is no event that would 
indicate that there are new files in existing partitions. As ALTER TABLE 
RECOVER PARTITIONS is called when the user expects changes in the filesystem 
(similarly to REFRESH), it could be useful to send a reload event after it is 
finished.

  was:
IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala cluster connecting to the same HMS to also reload their tables. 
REFRESH is often used when in external tables the files are written directly  
to filesystem without notifying HMS, so Impala needs to update its cache and 
can't rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects 
partition directories that were only created in the FS but not in HMS and 
creates them in HMS too.- It also reloads the table (similarly to other DDLs) 
and detects new files in existing partitions. - UPDATE: the previous sentence 
was not true with current Impala. 

An HMS event is created for the new partitions but there is no event that would 
indicate that there are new files in existing partitions. As ALTER TABLE 
RECOVER PARTITIONS is called when the user expects changes in the filesystem 
(similarly to REFRESH), it could be useful to send a reload event after it is 
finished.


> Send reload event after ALTER TABLE RECOVER PARTITIONS
> --
>
> Key: IMPALA-12812
> URL: https://issues.apache.org/jira/browse/IMPALA-12812
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Csaba Ringhofer
>Priority: Major
>
> IMPALA-11808 added support for sending reload events after REFRESH to allow 
> other Impala cluster connecting to the same HMS to also reload their tables. 
> REFRESH is often used when in external tables the files are written directly 
> to filesystem without notifying HMS, so Impala needs to update its cache and 
> can't rely on HMS notifications.
> The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}- It detects 
> partition directories that were only created in the FS but not in HMS and 
> creates them in HMS too.-{-}It also reloads the table (similarly to other 
> DDLs) and detects new files in existing partitions. - UPDATE: the previous 
> sentence was not true with current Impala.
> An HMS event is created for the new partitions but there is no event that 
> would indicate that there are new files in existing partitions. As ALTER 
> TABLE RECOVER PARTITIONS is called when the user expects changes in the 
> filesystem (similarly to REFRESH), it could be useful to send a reload event 
> after it is finished.






[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS

2024-03-01 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12812:
-
Description: 
IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala cluster connecting to the same HMS to also reload their tables. 
REFRESH is often used when in external tables the files are written directly  
to filesystem without notifying HMS, so Impala needs to update its cache and 
can't rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects 
partition directories that were only created in the FS but not in HMS and 
creates them in HMS too.- It also reloads the table (similarly to other DDLs) 
and detects new files in existing partitions. - UPDATE: the previous sentence 
was not true with current Impala. 

An HMS event is created for the new partitions but there is no event that would 
indicate that there are new files in existing partitions. As ALTER TABLE 
RECOVER PARTITIONS is called when the user expects changes in the filesystem 
(similarly to REFRESH), it could be useful to send a reload event after it is 
finished.

  was:
IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala cluster connecting to the same HMS to also reload their tables. 
REFRESH is often used when in external tables the files are written directly  
to filesystem without notifying HMS, so Impala needs to update its cache and 
can't rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects 
partition directories that were only created in the FS but not in HMS and 
creates them in HMS too. It also reloads the table (similarly to other DDLs) 
and detects new files in existing partitions. An HMS event is created for the 
new partitions but there is no event that would indicate that there are new 
files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when 
the user expects changes in the filesystem (similarly to REFRESH), it could be 
useful to send a reload event after it is finished.


> Send reload event after ALTER TABLE RECOVER PARTITIONS
> --
>
> Key: IMPALA-12812
> URL: https://issues.apache.org/jira/browse/IMPALA-12812
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Csaba Ringhofer
>Priority: Major
>
> IMPALA-11808 added support for sending reload events after REFRESH to allow 
> other Impala cluster connecting to the same HMS to also reload their tables. 
> REFRESH is often used when in external tables the files are written directly  
> to filesystem without notifying HMS, so Impala needs to update its cache and 
> can't rely on HMS notifications.
> The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects 
> partition directories that were only created in the FS but not in HMS and 
> creates them in HMS too.- It also reloads the table (similarly to other DDLs) 
> and detects new files in existing partitions. - UPDATE: the previous sentence 
> was not true with current Impala. 
> An HMS event is created for the new partitions but there is no event that 
> would indicate that there are new files in existing partitions. As ALTER 
> TABLE RECOVER PARTITIONS is called when the user expects changes in the 
> filesystem (similarly to REFRESH), it could be useful to send a reload event 
> after it is finished.






[jira] [Commented] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS

2024-03-01 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822631#comment-17822631
 ] 

Csaba Ringhofer commented on IMPALA-12812:
--

I was wrong about this one:
"An HMS event is created for the new partitions but there is no event that 
would indicate that there are new files in existing partitions. "
At the moment no refresh is done on partitions that already exist in HMS.
A valid workaround is to call REFRESH after ALTER TABLE RECOVER PARTITIONS; 
REFRESH will both detect new files and send the reload event.
Closing the issue as it wouldn't be that useful.
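
A sketch of the workaround above (the table name is illustrative):

{noformat}
ALTER TABLE events RECOVER PARTITIONS;  -- registers new partition dirs in HMS
REFRESH events;                         -- detects new files in existing
                                        -- partitions and sends the reload
                                        -- event for other clusters
{noformat}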


> Send reload event after ALTER TABLE RECOVER PARTITIONS
> --
>
> Key: IMPALA-12812
> URL: https://issues.apache.org/jira/browse/IMPALA-12812
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Csaba Ringhofer
>Priority: Major
>
> IMPALA-11808 added support for sending reload events after REFRESH to allow 
> other Impala cluster connecting to the same HMS to also reload their tables. 
> REFRESH is often used when in external tables the files are written directly  
> to filesystem without notifying HMS, so Impala needs to update its cache and 
> can't rely on HMS notifications.
> The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects 
> partition directories that were only created in the FS but not in HMS and 
> creates them in HMS too. It also reloads the table (similarly to other DDLs) 
> and detects new files in existing partitions. An HMS event is created for the 
> new partitions but there is no event that would indicate that there are new 
> files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called 
> when the user expects changes in the filesystem (similarly to REFRESH), it 
> could be useful to send a reload event after it is finished.






[jira] [Comment Edited] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS

2024-03-01 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822631#comment-17822631
 ] 

Csaba Ringhofer edited comment on IMPALA-12812 at 3/1/24 4:14 PM:
--

I was wrong about this one:
" It also reloads the table (similarly to other DDLs) and detects new files in 
existing partitions. "
At the moment no refresh is done on partitions that already exist in HMS.
A valid workaround is to call REFRESH after ALTER TABLE RECOVER PARTITIONS; 
REFRESH will both detect new files and send the reload event.
Closing the issue as it wouldn't be that useful.



was (Author: csringhofer):
I was wrong about this one:
"An HMS event is created for the new partitions but there is no event that 
would indicate that there are new files in existing partitions. "
At the moment no refresh is done on partitions that already exist in HMS.
A valid workaround is to call REFRESH after ALTER TABLE RECOVER PARTITIONS; 
REFRESH will both detect new files and send the reload event.
Closing the issue as it wouldn't be that useful.


> Send reload event after ALTER TABLE RECOVER PARTITIONS
> --
>
> Key: IMPALA-12812
> URL: https://issues.apache.org/jira/browse/IMPALA-12812
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Csaba Ringhofer
>Priority: Major
>
> IMPALA-11808 added support for sending reload events after REFRESH to allow 
> other Impala cluster connecting to the same HMS to also reload their tables. 
> REFRESH is often used when in external tables the files are written directly  
> to filesystem without notifying HMS, so Impala needs to update its cache and 
> can't rely on HMS notifications.
> The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects 
> partition directories that were only created in the FS but not in HMS and 
> creates them in HMS too. It also reloads the table (similarly to other DDLs) 
> and detects new files in existing partitions. An HMS event is created for the 
> new partitions but there is no event that would indicate that there are new 
> files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called 
> when the user expects changes in the filesystem (similarly to REFRESH), it 
> could be useful to send a reload event after it is finished.






[jira] [Assigned] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled

2024-02-22 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer reassigned IMPALA-12835:


Assignee: Csaba Ringhofer

> Transactional tables are unsynced when 
> hms_event_incremental_refresh_transactional_table is disabled
> 
>
> Key: IMPALA-12835
> URL: https://issues.apache.org/jira/browse/IMPALA-12835
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Csaba Ringhofer
>Priority: Critical
>
> There are some test failures when 
> hms_event_incremental_refresh_transactional_table is disabled:
>  * 
> tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events
>  * 
> tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication
> I can reproduce the issue locally:
> {noformat}
> $ bin/start-impala-cluster.py 
> --catalogd_args=--hms_event_incremental_refresh_transactional_table=false
> impala-shell> create table txn_tbl (id int, val int) stored as parquet 
> tblproperties 
> ('transactional'='true','transactional_properties'='insert_only');
> impala-shell> describe txn_tbl;  -- make the table loaded in Impala
> hive> insert into txn_tbl values(101, 200);
> impala-shell> select * from txn_tbl; {noformat}
> Impala shows no results until a REFRESH runs on this table.






[jira] [Commented] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled

2024-02-22 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819713#comment-17819713
 ] 

Csaba Ringhofer commented on IMPALA-12835:
--

I think that what actually broke this is IMPALA-11534.
Without hms_event_incremental_refresh_transactional_table, the only event 
catalogd processes during an INSERT to an unpartitioned ACID table is the 
ALTER_TABLE event. Since IMPALA-11534, most ALTER_TABLE events do not lead to 
reloading file metadata, so while HMS metadata will be reloaded, the file 
listing won't be refreshed (even though the validWriteIdList is refreshed).

Note that this issue only occurs with unpartitioned tables; partitioned tables 
are refreshed correctly when processing the ALTER_PARTITION events.

> Transactional tables are unsynced when 
> hms_event_incremental_refresh_transactional_table is disabled
> 
>
> Key: IMPALA-12835
> URL: https://issues.apache.org/jira/browse/IMPALA-12835
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Quanlong Huang
>Priority: Critical
>
> There are some test failures when 
> hms_event_incremental_refresh_transactional_table is disabled:
>  * 
> tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events
>  * 
> tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication
> I can reproduce the issue locally:
> {noformat}
> $ bin/start-impala-cluster.py 
> --catalogd_args=--hms_event_incremental_refresh_transactional_table=false
> impala-shell> create table txn_tbl (id int, val int) stored as parquet 
> tblproperties 
> ('transactional'='true','transactional_properties'='insert_only');
> impala-shell> describe txn_tbl;  -- make the table loaded in Impala
> hive> insert into txn_tbl values(101, 200);
> impala-shell> select * from txn_tbl; {noformat}
> Impala shows no results until a REFRESH runs on this table.






[jira] [Updated] (IMPALA-12827) Precondition was hit in MutableValidReaderWriteIdList

2024-02-21 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12827:
-
Description: 
The callstack below led to stopping metastore event processor during an abort 
transaction event:
{code}
MetastoreEventsProcessor.java:899] Unexpected exception received while 
processing event
Java exception follows:
java.lang.IllegalStateException
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:486)
at 
org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274)
at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101)
at 
org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885)
at 
org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775)
at 
org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761)
at 
org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522)
at 
org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052)
at 
org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
{code}

Precondition: 
https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274

I was not able to reproduce this so far.



  was:
The callstack below led to stopping metastore event processor during an abort 
transaction event:
{code}
MetastoreEventsProcessor.java:899] Unexpected exception received while 
processing event
Java exception follows:
java.lang.IllegalStateException
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:486)
at 
org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274)
at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101)
at 
org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885)
at 
org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775)
at 
org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761)
at 
org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522)
at 
org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052)
at 
org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
{code}

Precondition: 
https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274

I was not able to reproduce this yet.




> Precondition was hit in MutableValidReaderWriteIdList
> -
>
> Key: IMPALA-12827
> URL: https://issues.apache.org/jira/browse/IMPALA-12827
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: ACID, catalog
>
> The callstack below led to stopping metastore event processor during an abort 
> transaction event:
> {code}
> 

[jira] [Updated] (IMPALA-12827) Precondition was hit in MutableValidReaderWriteIdList

2024-02-21 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12827:
-
Labels: catalog  (was: )

> Precondition was hit in MutableValidReaderWriteIdList
> -
>
> Key: IMPALA-12827
> URL: https://issues.apache.org/jira/browse/IMPALA-12827
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: catalog
>
> The callstack below led to stopping metastore event processor during an abort 
> transaction event:
> {code}
> MetastoreEventsProcessor.java:899] Unexpected exception received while 
> processing event
> Java exception follows:
> java.lang.IllegalStateException
>   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:486)
>   at 
> org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274)
>   at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885)
>   at 
> org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775)
>   at 
> org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761)
>   at 
> org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522)
>   at 
> org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052)
>   at 
> org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> {code}
> Precondition: 
> https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274
> I was not able to reproduce this yet.






[jira] [Updated] (IMPALA-12827) Precondition was hit in MutableValidReaderWriteIdList

2024-02-21 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12827:
-
Labels: ACID catalog  (was: catalog)

> Precondition was hit in MutableValidReaderWriteIdList
> -
>
> Key: IMPALA-12827
> URL: https://issues.apache.org/jira/browse/IMPALA-12827
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: ACID, catalog
>
> The callstack below led to stopping metastore event processor during an abort 
> transaction event:
> {code}
> MetastoreEventsProcessor.java:899] Unexpected exception received while 
> processing event
> Java exception follows:
> java.lang.IllegalStateException
>   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:486)
>   at 
> org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274)
>   at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885)
>   at 
> org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775)
>   at 
> org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761)
>   at 
> org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522)
>   at 
> org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052)
>   at 
> org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> {code}
> Precondition: 
> https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274
> I was not able to reproduce this yet.






[jira] [Created] (IMPALA-12827) Precondition was hit in MutableValidReaderWriteIdList

2024-02-21 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12827:


 Summary: Precondition was hit in MutableValidReaderWriteIdList
 Key: IMPALA-12827
 URL: https://issues.apache.org/jira/browse/IMPALA-12827
 Project: IMPALA
  Issue Type: Bug
Reporter: Csaba Ringhofer


The call stack below led to stopping the metastore event processor during an 
abort-transaction event:
{code}
MetastoreEventsProcessor.java:899] Unexpected exception received while 
processing event
Java exception follows:
java.lang.IllegalStateException
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:486)
at 
org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274)
at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101)
at 
org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885)
at 
org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775)
at 
org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761)
at 
org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522)
at 
org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052)
at 
org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
{code}

Precondition: 
https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274

I was not able to reproduce this yet.
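The exact invariant behind the failed checkState() is not shown in the report, but the general failure mode can be illustrated with a small, hypothetical sketch (the class and method names below are invented for illustration, not Impala's actual code): a write-id list whose abort operation requires the id to already be tracked as open, tripped by an abort event that arrives out of order.

```python
# Hypothetical sketch (NOT Impala's implementation) of how a write-id list
# precondition can be hit: marking a write id aborted is only valid if the
# id was already seen as open, mirroring a checkState()-style invariant.

class WriteIdList:
    def __init__(self):
        self.high_watermark = 0   # highest write id seen so far
        self.open_ids = set()
        self.aborted_ids = set()

    def add_open_write_id(self, write_id):
        self.open_ids.add(write_id)
        self.high_watermark = max(self.high_watermark, write_id)

    def add_aborted_write_ids(self, write_ids):
        for wid in write_ids:
            # Precondition: an aborted id must already be tracked as open.
            if wid not in self.open_ids:
                raise AssertionError(
                    "write id %d aborted before it was seen as open" % wid)
            self.open_ids.discard(wid)
            self.aborted_ids.add(wid)

ids = WriteIdList()
ids.add_open_write_id(7)
ids.add_aborted_write_ids([7])       # fine: 7 was tracked as open
try:
    ids.add_aborted_write_ids([9])   # abort event arrives before open event
except AssertionError as e:
    failure = str(e)
```

An event processor that consumes HMS events out of order (or misses one) can hit exactly this kind of state check.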








[jira] [Created] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS

2024-02-13 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12812:


 Summary: Send reload event after ALTER TABLE RECOVER PARTITIONS
 Key: IMPALA-12812
 URL: https://issues.apache.org/jira/browse/IMPALA-12812
 Project: IMPALA
  Issue Type: Improvement
Reporter: Csaba Ringhofer


IMPALA-11808 added support for sending reload events after REFRESH to allow 
other Impala clusters connected to the same HMS to also reload their tables. 
REFRESH is often used with external tables when files are written directly to 
the filesystem without notifying HMS, so Impala needs to update its cache and 
cannot rely on HMS notifications.

The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects 
partition directories that were created only in the FS but not in HMS and 
creates them in HMS too. It also reloads the table (similarly to other DDLs) 
and detects new files in existing partitions. An HMS event is created for the 
new partitions, but there is no event that would indicate that there are new 
files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when 
the user expects changes in the filesystem (similarly to REFRESH), it could be 
useful to send a reload event after it finishes.






[jira] [Commented] (IMPALA-12543) test_iceberg_self_events failed in JDK11 build

2024-02-10 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816289#comment-17816289
 ] 

Csaba Ringhofer commented on IMPALA-12543:
--

[~stigahuang]
Do you think that this can cause correctness issues, or should it only lead 
to unnecessary table reloading and failed tests?

If I understand correctly what happens is:
1. alter table starts in CatalogOpExecutor
2. table level lock is taken
3. HMS RPC starts (CatalogOpExecutor.applyAlterTable())
4. HMS generates the event
5. HMS RPC returns
6. table is reloaded
7. catalog version is added to inflight event list
8. table lock is released

Meanwhile the event processor thread fetches the new event after step 4 and 
before step 7, and because of IMPALA-12461 (part 1), it can also finish 
self-event checking before reaching step 7. Before IMPALA-12461 it would have 
needed to wait for step 8.

Currently adding to inflight event list happens here:
https://github.com/apache/impala/blob/11d2fe4fc00a1e6ef2d3a45825be9845456adc1d/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L1307

Would it be a problem to move this before the HMS RPC, e.g. into 
CatalogOpExecutor.applyAlterTable()?
In case the RPC or table loading fails, we could remove the inflight event.
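The ordering problem and the proposed fix can be sketched in a few lines of Python (all names below are illustrative stand-ins, not Impala APIs): registering the self-event marker only after the RPC leaves a window in which the event processor classifies the event as external, while registering it before the RPC closes that window.

```python
# Illustrative model of the race (names are invented, not Impala's APIs).
events = []       # stands in for the HMS notification log
inflight = set()  # stands in for the table's in-flight (self) event list

def hms_alter_table(marker):
    events.append(marker)          # step 4: HMS generates the event

def is_self_event(marker):
    return marker in inflight      # what the event processor checks

# Current ordering: RPC first (steps 3-5), marker added later (step 7).
hms_alter_table("v42")
race_view = is_self_event("v42")   # event processor polls in the window
inflight.add("v42")
assert race_view is False          # event misclassified as external -> reload

# Proposed ordering: add the marker before the RPC, undo it on failure.
inflight.add("v43")
try:
    hms_alter_table("v43")
except Exception:
    inflight.discard("v43")        # RPC failed, no event will ever match
    raise
fixed_view = is_self_event("v43")
assert fixed_view is True          # self event recognized even if polled early
```

The cleanup path on failure matters: without it, a stale marker could later match an unrelated event.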


> test_iceberg_self_events failed in JDK11 build
> --
>
> Key: IMPALA-12543
> URL: https://issues.apache.org/jira/browse/IMPALA-12543
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Riza Suminto
>Assignee: Riza Suminto
>Priority: Major
>  Labels: broken-build
> Attachments: catalogd.INFO, std_err.txt
>
>
> test_iceberg_self_events failed in JDK11 build with following error.
>  
> {code:java}
> Error Message
> assert 0 == 1
> Stacktrace
> custom_cluster/test_events_custom_configs.py:637: in test_iceberg_self_events
>     check_self_events("ALTER TABLE {0} ADD COLUMN j INT".format(tbl_name))
> custom_cluster/test_events_custom_configs.py:624: in check_self_events
>     assert tbls_refreshed_before == tbls_refreshed_after
> E   assert 0 == 1 {code}
> This test still pass before IMPALA-11387 merged.
>  






[jira] [Comment Edited] (IMPALA-12455) Create set of disjunct bloom filters for keys in partitioned builds

2024-02-08 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815717#comment-17815717
 ] 

Csaba Ringhofer edited comment on IMPALA-12455 at 2/8/24 3:23 PM:
--

>waiting on receiving EOS signals from all senders below it.
agree

>but the fastest join builder still need to wait for the slowest join builder 
>to complete before it can publish its own bloom filter.
yes, they would still need EOS from right side child before publishing any 
filters

Besides avoiding coordinator aggregation work, I expect bloom filter building 
to be faster because the individual bloom filters would be smaller, so more 
likely to fit into the CPU cache.

A solution to "waiting for all senders to send EOS" could be to build bloom 
filters on the sender side (before the exchange node) instead of in the hash 
join builder (after the exchange node). As individual senders would know 
earlier that they are finished, they could send their bloom filter without 
waiting for the slowest one.

This would also help in distributing work in case of broadcast joins, as no 
builder would have to process the whole dataset. On the other hand, this would 
introduce aggregation work in the broadcast case, which is not necessary at 
the moment.





was (Author: csringhofer):
>waiting on receiving EOS signals from all senders below it.
agree

>but the fastest join builder still need to wait for the slowest join builder 
>to complete before it can publish its own bloom filter.
yes, they would still need EOS from right side child before publishing any 
filters

Besides avoiding coordinator aggregation work, I expect bloom filter building 
to be faster because the individual bloom filters would be smaller, so more 
likely to fit into the CPU cache.

A solution to "waiting for all senders to send EOS" could be to build bloom 
filters on the sender side (before the exchange node) instead of in the hash 
join builder (after the exchange node). As individual senders would know 
earlier that they are finished, they could send their bloom filter without 
waiting for the slowest one.

This would also help in distributing work in case of broadcast joins, as no 
builder would have to process the whole dataset. On the other hand, this would 
introduce aggregation work in the broadcast case, which is not necessary at 
the moment.




> Create set of disjunct bloom filters for keys in partitioned builds
> ---
>
> Key: IMPALA-12455
> URL: https://issues.apache.org/jira/browse/IMPALA-12455
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: bloom-filter, performance, runtime-filters
>
> Currently Impala aggregates bloom filters from different instances of the 
> join builder by OR-ing them to a final filter. This could be avoided by 
> having num_instances smaller bloom filters and choosing the correct one 
> during lookup by doing the same hashing as used in partitioning. Builders 
> would only need to write a single small filter as they have only keys from a 
> single partition. This would make runtime filter producers faster and much 
> more scalable, while it shouldn't have a major effect on consumers.
> One caveat is that we push down the current bloom filter to Kudu as it is, so 
> this optimization wouldn't be applicable in filters consumed by Kudu scans.






[jira] [Comment Edited] (IMPALA-12455) Create set of disjunct bloom filters for keys in partitioned builds

2024-02-08 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815717#comment-17815717
 ] 

Csaba Ringhofer edited comment on IMPALA-12455 at 2/8/24 3:20 PM:
--

>waiting on receiving EOS signals from all senders below it.
agree

>but the fastest join builder still need to wait for the slowest join builder 
>to complete before it can publish its own bloom filter.
yes, they would still need EOS from right side child before publishing any 
filters

Besides avoiding coordinator aggregation work, I expect bloom filter building 
to be faster because the individual bloom filters would be smaller, so more 
likely to fit into the CPU cache.

A solution to "waiting for all senders to send EOS" could be to build bloom 
filters on the sender side (before the exchange node) instead of in the hash 
join builder (after the exchange node). As individual senders would know 
earlier that they are finished, they could send their bloom filter without 
waiting for the slowest one.

This would also help in distributing work in case of broadcast joins, as no 
builder would have to process the whole dataset. On the other hand, this would 
introduce merging work in the broadcast case, which is not necessary at the 
moment.





was (Author: csringhofer):
>waiting on receiving EOS signals from all senders below it.
agree

>but the fastest join builder still need to wait for the slowest join builder 
>to complete before it can publish its own bloom filter.
yes, they would still need EOS from right side child before publishing any 
filters

Besides avoiding coordinator aggregation work, I expect bloom filter building 
to be faster because the individual bloom filters would be smaller, so more 
likely to fit into the CPU cache.

An alternative solution could be to build bloom filters on the sender side 
(before the exchange node) instead of in the hash join builder (after the 
exchange node). This would make the optimization suggested in this Jira 
impossible, but would help with the issue you raised, as the senders would 
know earlier that they are finished and wouldn't need to wait for all senders 
to hit EOS before publishing bloom filters.


> Create set of disjunct bloom filters for keys in partitioned builds
> ---
>
> Key: IMPALA-12455
> URL: https://issues.apache.org/jira/browse/IMPALA-12455
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: bloom-filter, performance, runtime-filters
>
> Currently Impala aggregates bloom filters from different instances of the 
> join builder by OR-ing them to a final filter. This could be avoided by 
> having num_instances smaller bloom filters and choosing the correct one 
> during lookup by doing the same hashing as used in partitioning. Builders 
> would only need to write a single small filter as they have only keys from a 
> single partition. This would make runtime filter producers faster and much 
> more scalable, while it shouldn't have a major effect on consumers.
> One caveat is that we push down the current bloom filter to Kudu as it is, so 
> this optimization wouldn't be applicable in filters consumed by Kudu scans.






[jira] [Comment Edited] (IMPALA-12455) Create set of disjunct bloom filters for keys in partitioned builds

2024-02-08 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815717#comment-17815717
 ] 

Csaba Ringhofer edited comment on IMPALA-12455 at 2/8/24 3:21 PM:
--

>waiting on receiving EOS signals from all senders below it.
agree

>but the fastest join builder still need to wait for the slowest join builder 
>to complete before it can publish its own bloom filter.
yes, they would still need EOS from right side child before publishing any 
filters

Besides avoiding coordinator aggregation work, I expect bloom filter building 
to be faster because the individual bloom filters would be smaller, so more 
likely to fit into the CPU cache.

A solution to "waiting for all senders to send EOS" could be to build bloom 
filters on the sender side (before the exchange node) instead of in the hash 
join builder (after the exchange node). As individual senders would know 
earlier that they are finished, they could send their bloom filter without 
waiting for the slowest one.

This would also help in distributing work in case of broadcast joins, as no 
builder would have to process the whole dataset. On the other hand, this would 
introduce aggregation work in the broadcast case, which is not necessary at 
the moment.





was (Author: csringhofer):
>waiting on receiving EOS signals from all senders below it.
agree

>but the fastest join builder still need to wait for the slowest join builder 
>to complete before it can publish its own bloom filter.
yes, they would still need EOS from right side child before publishing any 
filters

Besides avoiding coordinator aggregation work, I expect bloom filter building 
to be faster because the individual bloom filters would be smaller, so more 
likely to fit into the CPU cache.

A solution to "waiting for all senders to send EOS" could be to build bloom 
filters on the sender side (before the exchange node) instead of in the hash 
join builder (after the exchange node). As individual senders would know 
earlier that they are finished, they could send their bloom filter without 
waiting for the slowest one.

This would also help in distributing work in case of broadcast joins, as no 
builder would have to process the whole dataset. On the other hand, this would 
introduce merging work in the broadcast case, which is not necessary at the 
moment.




> Create set of disjunct bloom filters for keys in partitioned builds
> ---
>
> Key: IMPALA-12455
> URL: https://issues.apache.org/jira/browse/IMPALA-12455
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: bloom-filter, performance, runtime-filters
>
> Currently Impala aggregates bloom filters from different instances of the 
> join builder by OR-ing them to a final filter. This could be avoided by 
> having num_instances smaller bloom filters and choosing the correct one 
> during lookup by doing the same hashing as used in partitioning. Builders 
> would only need to write a single small filter as they have only keys from a 
> single partition. This would make runtime filter producers faster and much 
> more scalable, while it shouldn't have a major effect on consumers.
> One caveat is that we push down the current bloom filter to Kudu as it is, so 
> this optimization wouldn't be applicable in filters consumed by Kudu scans.






[jira] [Commented] (IMPALA-12455) Create set of disjunct bloom filters for keys in partitioned builds

2024-02-08 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815717#comment-17815717
 ] 

Csaba Ringhofer commented on IMPALA-12455:
--

>waiting on receiving EOS signals from all senders below it.
agree

>but the fastest join builder still need to wait for the slowest join builder 
>to complete before it can publish its own bloom filter.
yes, they would still need EOS from right side child before publishing any 
filters

Besides avoiding coordinator aggregation work, I expect bloom filter building 
to be faster because the individual bloom filters would be smaller, so more 
likely to fit into the CPU cache.

An alternative solution could be to build bloom filters on the sender side 
(before the exchange node) instead of in the hash join builder (after the 
exchange node). This would make the optimization suggested in this Jira 
impossible, but would help with the issue you raised, as the senders would 
know earlier that they are finished and wouldn't need to wait for all senders 
to hit EOS before publishing bloom filters.
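As a loose illustration of the disjoint-filter idea from the issue description (the sizes, hash functions, and names below are invented for the sketch, not Impala's actual scheme): each builder instance fills only the small filter for its own hash partition, and probes select the filter using the same partitioning hash, so no OR-aggregation of full-size filters is needed.

```python
# Hypothetical sketch of per-partition disjoint bloom filters. Each join
# builder instance writes only the filter for its own hash partition; a
# probe picks the filter with the same partitioning hash. Parameters are
# illustrative only.
NUM_INSTANCES = 4
BITS_PER_FILTER = 64

# One small bloom filter (stored as an int bitmap) per builder instance.
filters = [0] * NUM_INSTANCES

def partition_of(key):
    # Same hash as used by the exchange partitioning, so build and probe
    # sides agree on which small filter holds a given key.
    return hash(key) % NUM_INSTANCES

def bloom_bits(key):
    # Two cheap bit positions per key (illustrative hashing scheme).
    h = hash((key, 0x9e3779b9))
    return (h % BITS_PER_FILTER, (h // BITS_PER_FILTER) % BITS_PER_FILTER)

def insert(key):
    p = partition_of(key)
    for b in bloom_bits(key):
        filters[p] |= 1 << b

def may_contain(key):
    # Probe only the one matching small filter, which is more likely to
    # stay in the CPU cache than a single large OR-ed filter.
    p = partition_of(key)
    return all(filters[p] & (1 << b) for b in bloom_bits(key))

for k in ["a", "b", "c"]:
    insert(k)
```

A bloom filter can report false positives but never false negatives, so every inserted key must pass `may_contain`.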


> Create set of disjunct bloom filters for keys in partitioned builds
> ---
>
> Key: IMPALA-12455
> URL: https://issues.apache.org/jira/browse/IMPALA-12455
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Csaba Ringhofer
>Priority: Major
>  Labels: bloom-filter, performance, runtime-filters
>
> Currently Impala aggregates bloom filters from different instances of the 
> join builder by OR-ing them to a final filter. This could be avoided by 
> having num_instances smaller bloom filters and choosing the correct one 
> during lookup by doing the same hashing as used in partitioning. Builders 
> would only need to write a single small filter as they have only keys from a 
> single partition. This would make runtime filter producers faster and much 
> more scalable, while it shouldn't have a major effect on consumers.
> One caveat is that we push down the current bloom filter to Kudu as it is, so 
> this optimization wouldn't be applicable in filters consumed by Kudu scans.






[jira] [Resolved] (IMPALA-12746) Bump jackson-databind version to 2.15

2024-01-26 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer resolved IMPALA-12746.
--
Fix Version/s: Impala 4.4.0
   Resolution: Fixed

> Bump jackson-databind version to 2.15
> -
>
> Key: IMPALA-12746
> URL: https://issues.apache.org/jira/browse/IMPALA-12746
> Project: IMPALA
>  Issue Type: Task
>Reporter: Csaba Ringhofer
>Priority: Major
> Fix For: Impala 4.4.0
>
>







[jira] [Created] (IMPALA-12746) Bump jackson-databind version to 2.15

2024-01-23 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12746:


 Summary: Bump jackson-databind version to 2.15
 Key: IMPALA-12746
 URL: https://issues.apache.org/jira/browse/IMPALA-12746
 Project: IMPALA
  Issue Type: Task
Reporter: Csaba Ringhofer









[jira] [Commented] (IMPALA-5078) Break up expr-test.cc

2023-12-21 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799678#comment-17799678
 ] 

Csaba Ringhofer commented on IMPALA-5078:
-

[~sy117] I had a work-in-progress patch for this that moves timestamp/date 
related functions to a separate file and also collects some shared 
functionality into a common header:
https://github.com/csringhofer/Impala/commit/0b8967fa7aa24c9df2d6327c1594e811ba853572

It still needs a lot of cleanup, but at least it compiles. Feel free to use it 
or ignore it.
Besides cleanup, it would be nice to move some other functionality, e.g. string 
or decimal functions, to separate files.

> Break up expr-test.cc
> -
>
> Key: IMPALA-5078
> URL: https://issues.apache.org/jira/browse/IMPALA-5078
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Henry Robinson
>Assignee: Csaba Ringhofer
>Priority: Minor
>  Labels: newbie, ramp-up
> Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot 
> 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, 
> Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png
>
>
> {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs 
> to start slowing down a bit. Let's see if we can refactor it enough to have a 
> couple of test files. Maybe moving all the string instructions into a 
> separate {{expr-string-test.cc}}, and having a common header will be enough 
> to make it a bit more manageable. 






[jira] [Commented] (IMPALA-5078) Break up expr-test.cc

2023-12-21 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799582#comment-17799582
 ] 

Csaba Ringhofer commented on IMPALA-5078:
-

[~sy117] Sure, feel free to reassign it to yourself if you would still like to 
work on it.
>Do you think you could give me until December 22nd 5 pm PST?
There is no hard deadline, it is not an urgent task :) I started breaking it 
up because there were some non-deterministic test issues in expr-test, and 
having fewer tests in one file would help with pinpointing the issue.

> Break up expr-test.cc
> -
>
> Key: IMPALA-5078
> URL: https://issues.apache.org/jira/browse/IMPALA-5078
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Henry Robinson
>Assignee: Csaba Ringhofer
>Priority: Minor
>  Labels: newbie, ramp-up
> Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot 
> 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, 
> Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png
>
>
> {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs 
> to start slowing down a bit. Let's see if we can refactor it enough to have a 
> couple of test files. Maybe moving all the string instructions into a 
> separate {{expr-string-test.cc}}, and having a common header will be enough 
> to make it a bit more manageable. 






[jira] [Created] (IMPALA-12661) ASAN heap-use-after-free in IcebergMetadataScanNode

2023-12-21 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12661:


 Summary: ASAN heap-use-after-free in IcebergMetadataScanNode
 Key: IMPALA-12661
 URL: https://issues.apache.org/jira/browse/IMPALA-12661
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Reporter: Csaba Ringhofer
 Attachments: asan.txt

See asan.txt for details.






[jira] [Updated] (IMPALA-12661) ASAN heap-use-after-free in IcebergMetadataScanNode

2023-12-21 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12661:
-
Attachment: asan.txt

> ASAN heap-use-after-free in IcebergMetadataScanNode
> ---
>
> Key: IMPALA-12661
> URL: https://issues.apache.org/jira/browse/IMPALA-12661
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Csaba Ringhofer
>Priority: Critical
> Attachments: asan.txt
>
>
> See asan.txt for details.






[jira] [Updated] (IMPALA-12660) TSAN error in ImpalaServer::QueryStateRecord::Init

2023-12-21 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12660:
-
Attachment: tsan.txt

> TSAN error in ImpalaServer::QueryStateRecord::Init
> --
>
> Key: IMPALA-12660
> URL: https://issues.apache.org/jira/browse/IMPALA-12660
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Csaba Ringhofer
>Priority: Critical
> Attachments: tsan.txt
>
>
> See error in tsan.txt






[jira] [Created] (IMPALA-12660) TSAN error in ImpalaServer::QueryStateRecord::Init

2023-12-21 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12660:


 Summary: TSAN error in ImpalaServer::QueryStateRecord::Init
 Key: IMPALA-12660
 URL: https://issues.apache.org/jira/browse/IMPALA-12660
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Csaba Ringhofer


See error in tsan.txt






[jira] [Assigned] (IMPALA-11921) test_large_sql seems to be flaky

2023-12-19 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer reassigned IMPALA-11921:


Assignee: Csaba Ringhofer  (was: Fang-Yu Rao)

> test_large_sql seems to be flaky
> 
>
> Key: IMPALA-11921
> URL: https://issues.apache.org/jira/browse/IMPALA-11921
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Fang-Yu Rao
>Assignee: Csaba Ringhofer
>Priority: Major
>  Labels: broken-build
>
> We observed the following failure in an ASAN run.
> {code}
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/shell/test_shell_commandline.py:1026:
>  in test_large_sql assert actual_time_s <= time_limit_s, ( E   
> AssertionError: It took 21.0015001297 seconds to execute the query. Time 
> limit is 20 seconds. E   assert 21.001500129699707 <= 20
> {code}
> We have not seen this failure for a while since IMPALA-7428.






[jira] [Work started] (IMPALA-11921) test_large_sql seems to be flaky

2023-12-19 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-11921 started by Csaba Ringhofer.

> test_large_sql seems to be flaky
> 
>
> Key: IMPALA-11921
> URL: https://issues.apache.org/jira/browse/IMPALA-11921
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Fang-Yu Rao
>Assignee: Csaba Ringhofer
>Priority: Major
>  Labels: broken-build
>
> We observed the following failure in an ASAN run.
> {code}
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/shell/test_shell_commandline.py:1026:
>  in test_large_sql assert actual_time_s <= time_limit_s, ( E   
> AssertionError: It took 21.0015001297 seconds to execute the query. Time 
> limit is 20 seconds. E   assert 21.001500129699707 <= 20
> {code}
> We have not seen this failure for a while since IMPALA-7428.






[jira] [Created] (IMPALA-12655) PlannerTest.testProcessingCost seems flaky

2023-12-19 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12655:


 Summary: PlannerTest.testProcessingCost seems flaky
 Key: IMPALA-12655
 URL: https://issues.apache.org/jira/browse/IMPALA-12655
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Reporter: Csaba Ringhofer


This is probably caused by IMPALA-12601 
https://github.com/apache/impala/commit/8661f922d3ccb21da73b9f7f8734d9113429e9bb

The error was caused by this line:
https://github.com/apache/impala/blob/68fe57ff8492a7afdf14a62cabd3e2b0fcade9d1/testdata/workloads/functional-planner/queries/PlannerTest/tpcds-processing-cost.test#L8185

In the actual plan the following appeared here:
fk/pk conjuncts: assumed fk/pk

[~rizaon]






[jira] [Commented] (IMPALA-5078) Break up expr-test.cc

2023-12-19 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798543#comment-17798543
 ] 

Csaba Ringhofer commented on IMPALA-5078:
-

[~sy117] I assumed that you no longer plan to work on this issue and assigned 
it to myself. Feel free to comment if you still have plans for this!

> Break up expr-test.cc
> -
>
> Key: IMPALA-5078
> URL: https://issues.apache.org/jira/browse/IMPALA-5078
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Henry Robinson
>Assignee: Csaba Ringhofer
>Priority: Minor
>  Labels: newbie, ramp-up
> Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot 
> 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, 
> Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png
>
>
> {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs 
> to start slowing down a bit. Let's see if we can refactor it enough to have a 
> couple of test files. Maybe moving all the string instructions into a 
> separate {{expr-string-test.cc}}, and having a common header will be enough 
> to make it a bit more manageable. 






[jira] [Assigned] (IMPALA-5078) Break up expr-test.cc

2023-12-19 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer reassigned IMPALA-5078:
---

Assignee: Csaba Ringhofer  (was: Sean Yeh)

> Break up expr-test.cc
> -
>
> Key: IMPALA-5078
> URL: https://issues.apache.org/jira/browse/IMPALA-5078
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Henry Robinson
>Assignee: Csaba Ringhofer
>Priority: Minor
>  Labels: newbie, ramp-up
> Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot 
> 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, 
> Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png
>
>
> {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs 
> to start slowing down a bit. Let's see if we can refactor it enough to have a 
> couple of test files. Maybe moving all the string instructions into a 
> separate {{expr-string-test.cc}}, and having a common header will be enough 
> to make it a bit more manageable. 






[jira] [Updated] (IMPALA-12650) test_create_unicode_table fails on non-HDFS filesystems

2023-12-18 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12650:
-
Description: 
It seems that some tests need Hive, which doesn't run on some filesystems during 
tests:
{code}
describe test_create_unicode_table_da361d5b.testtbl_orc;

-- 2023-12-16 15:58:54,097 INFO MainThread: Started query 
204a7290ec8a73b0:7456aad0
-- connecting to localhost:11050 with impyla
-- 2023-12-16 15:58:54,105 INFO MainThread: Could not connect to ('::1', 
11050, 0, 0)
Traceback (most recent call last):
  File 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
 line 137, in open
handle.connect(sockaddr)
  File 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
 line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
-- 2023-12-16 15:58:54,105 INFO MainThread: Could not connect to 
('127.0.0.1', 11050)
Traceback (most recent call last):
  File 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
 line 137, in open
handle.connect(sockaddr)
  File 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
 line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
-- 2023-12-16 15:58:54,105 ERRORMainThread: Could not connect to any of 
[('::1', 11050, 0, 0), ('127.0.0.1', 11050)]
{code}

  was:
It seems that some tests need Hive which doesn't run in some file system during 
tests:
{code}
update functional_parquet.iceberg_int_partitioned set file__position = 42;

-- connecting to localhost:11050 with impyla
-- 2023-12-13 10:49:12,017 INFO MainThread: Could not connect to ('::1', 
11050, 0, 0)
Traceback (most recent call last):
  File 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
 line 137, in open
handle.connect(sockaddr)
  File 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
 line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
{code}


> test_create_unicode_table fails on non-HDFS filesystems
> ---
>
> Key: IMPALA-12650
> URL: https://issues.apache.org/jira/browse/IMPALA-12650
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Assignee: Zoltán Borók-Nagy
>Priority: Major
>
> It seems that some tests need Hive which doesn't run in some file system 
> during tests:
> {code}
> describe test_create_unicode_table_da361d5b.testtbl_orc;
> -- 2023-12-16 15:58:54,097 INFO MainThread: Started query 
> 204a7290ec8a73b0:7456aad0
> -- connecting to localhost:11050 with impyla
> -- 2023-12-16 15:58:54,105 INFO MainThread: Could not connect to ('::1', 
> 11050, 0, 0)
> Traceback (most recent call last):
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
>  line 137, in open
> handle.connect(sockaddr)
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
>  line 228, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> -- 2023-12-16 15:58:54,105 INFO MainThread: Could not connect to 
> ('127.0.0.1', 11050)
> Traceback (most recent call last):
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
>  line 137, in open
> handle.connect(sockaddr)
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
>  line 228, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> -- 2023-12-16 15:58:54,105 ERRORMainThread: Could not connect to any of 
> [('::1', 11050, 0, 0), ('127.0.0.1', 11050)]
> {code}





[jira] [Updated] (IMPALA-12650) test_create_unicode_table fails on non-HDFS filesystems

2023-12-18 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12650:
-
Summary: test_create_unicode_table fails on non-HDFS filesystems  (was: 
test_iceberg_negative fails on non-HDFS filesystems)

> test_create_unicode_table fails on non-HDFS filesystems
> ---
>
> Key: IMPALA-12650
> URL: https://issues.apache.org/jira/browse/IMPALA-12650
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Assignee: Zoltán Borók-Nagy
>Priority: Major
>
> It seems that some tests need Hive which doesn't run in some file system 
> during tests:
> {code}
> update functional_parquet.iceberg_int_partitioned set file__position = 42;
> -- connecting to localhost:11050 with impyla
> -- 2023-12-13 10:49:12,017 INFO MainThread: Could not connect to ('::1', 
> 11050, 0, 0)
> Traceback (most recent call last):
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
>  line 137, in open
> handle.connect(sockaddr)
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
>  line 228, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> {code}






[jira] [Commented] (IMPALA-12650) test_iceberg_negative fails on non-HDFS filesystems

2023-12-18 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798259#comment-17798259
 ] 

Csaba Ringhofer commented on IMPALA-12650:
--

Reusing this for the same error in test_create_unicode_table

> test_iceberg_negative fails on non-HDFS filesystems
> ---
>
> Key: IMPALA-12650
> URL: https://issues.apache.org/jira/browse/IMPALA-12650
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Csaba Ringhofer
>Assignee: Zoltán Borók-Nagy
>Priority: Major
>
> It seems that some tests need Hive which doesn't run in some file system 
> during tests:
> {code}
> update functional_parquet.iceberg_int_partitioned set file__position = 42;
> -- connecting to localhost:11050 with impyla
> -- 2023-12-13 10:49:12,017 INFO MainThread: Could not connect to ('::1', 
> 11050, 0, 0)
> Traceback (most recent call last):
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
>  line 137, in open
> handle.connect(sockaddr)
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
>  line 228, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> {code}






[jira] [Created] (IMPALA-12650) test_iceberg_negative fails on non-HDFS filesystems

2023-12-18 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12650:


 Summary: test_iceberg_negative fails on non-HDFS filesystems
 Key: IMPALA-12650
 URL: https://issues.apache.org/jira/browse/IMPALA-12650
 Project: IMPALA
  Issue Type: Bug
Reporter: Csaba Ringhofer


It seems that some tests need Hive, which doesn't run on some filesystems during 
tests:
{code}
update functional_parquet.iceberg_int_partitioned set file__position = 42;

-- connecting to localhost:11050 with impyla
-- 2023-12-13 10:49:12,017 INFO MainThread: Could not connect to ('::1', 
11050, 0, 0)
Traceback (most recent call last):
  File 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
 line 137, in open
handle.connect(sockaddr)
  File 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
 line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
{code}






[jira] [Created] (IMPALA-12647) Add Hive compatible way to get modified row count in DMLs (HS2)

2023-12-18 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12647:


 Summary: Add Hive compatible way to get modified row count in DMLs 
(HS2)
 Key: IMPALA-12647
 URL: https://issues.apache.org/jira/browse/IMPALA-12647
 Project: IMPALA
  Issue Type: New Feature
  Components: Clients
Reporter: Csaba Ringhofer


e.g. after
 insert into t values (1);
the client prints "modified 1 row(s)"

Hive and Impala implemented this in incompatible ways, using different HS2 
"dialects":
- HIVE-14388 added support using TGetOperationStatusResp.numModifiedRows
- IMPALA-7290 added support using TCloseImpalaOperationResp,TDmlResult

https://github.com/apache/hive/blob/fd92b3926393f0366b87cd55d5a0ad27968f18db/service-rpc/if/TCLIService.thrift#L1120
https://github.com/apache/impala/blob/4114fe8db6ec80b2e1679e946555f91ab7043f2e/common/thrift/ImpalaService.thrift#L966

The Impala patch is newer (probably we didn't know about the Hive solution?); 
on the other hand it is based on a much older solution in Beeswax. The Impala 
solution is also more "advanced" and contains extra information relevant to 
Kudu upserts/inserts.

Currently impala-shell uses the Impala solution, while in Hive-compatible strict 
HS2 mode it doesn't return the modified row count.

impyla doesn't support the modified row count: 
https://github.com/cloudera/impyla/issues/302
There is an extension function that parses Kudu-related row counts from the 
profile:
https://github.com/cloudera/impyla/blob/76f0ba3221e1ff26037e36afbe4a5591168157ce/impala/hiveserver2.py#L205

Ideally there would be a solution supported by both components, so clients 
wouldn't need to adapt to specific dialects.
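The profile-parsing workaround mentioned above can be sketched in a few lines. 
The counter name (`NumModifiedRows`) and the profile layout below are 
illustrative assumptions for this sketch, not impyla's documented API:

```python
import re

def modified_row_count(profile_text):
    """Sum all 'NumModifiedRows: N' counters found in a text query profile.

    Hypothetical sketch: the counter name and profile layout are assumptions,
    not a documented interface.
    """
    total = 0
    found = False
    for match in re.finditer(r"NumModifiedRows:\s*(\d+)", profile_text):
        total += int(match.group(1))
        found = True
    # Return None when no counter is present, so callers can tell
    # "no DML counters" apart from "0 rows modified".
    return total if found else None

# Made-up profile fragment for illustration:
profile = """
KuduTableSink:
   - NumModifiedRows: 3
KuduTableSink:
   - NumModifiedRows: 2
"""
```

A client using this would still be coupled to Impala's profile format, which is 
exactly the dialect-specific adaptation the paragraph above argues against.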






[jira] [Commented] (IMPALA-12630) TestOrcStats.test_orc_stats fails in count-start on lineitem with filter

2023-12-14 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796632#comment-17796632
 ] 

Csaba Ringhofer commented on IMPALA-12630:
--

I was curious and ran this locally.
My local env (with a somewhat old dataload):
 - NumOrcStripes: 12 (12)
 - RowsRead: 13.50K (13501)
vs profiles uploaded:
 - NumOrcStripes: 8 (8)
 - RowsRead: 20.00K (2)

I think that the recent commit Revert "IMPALA-9923: Load ORC serially to hack 
around flakiness" 
https://github.com/apache/impala/commit/b03e8ef95c856f499d17ea7815831e30e2e9f467
 led to this indeterminism in dataload.
[~rizaon]



> TestOrcStats.test_orc_stats fails in count-start on lineitem with filter
> 
>
> Key: IMPALA-12630
> URL: https://issues.apache.org/jira/browse/IMPALA-12630
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Quanlong Huang
>Priority: Critical
> Attachments: profile_1134.txt, profile_949.txt
>
>
> Saw the test failed several times recently:
> https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/949
> https://jenkins.impala.io/job/ubuntu-20.04-from-scratch/1134
> {noformat}
> query_test/test_orc_stats.py:41: in test_orc_stats
> self.run_test_case('QueryTest/orc-stats', vector, use_db=unique_database)
> common/impala_test_suite.py:776: in run_test_case
> update_section=pytest.config.option.update_results)
> common/test_result_verifier.py:683: in verify_runtime_profile
> % (function, field, expected_value, actual_value, op, actual))
> E   AssertionError: Aggregation of SUM over RowsRead did not match expected 
> results.
> E   EXPECTED VALUE:
> E   13501
> E   
> E   
> E   ACTUAL VALUE:
> E   2
> E   
> E   OP:
> E   : {noformat}
> The query is
> {code:sql}
> select count(*) from tpch_orc_def.lineitem where l_orderkey = 1609411
> {code}






[jira] [Created] (IMPALA-12594) KrpcDataStreamSender's mem estimate is different than real usage

2023-12-04 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12594:


 Summary: KrpcDataStreamSender's mem estimate is different than 
real usage
 Key: IMPALA-12594
 URL: https://issues.apache.org/jira/browse/IMPALA-12594
 Project: IMPALA
  Issue Type: Bug
  Components: Backend, Frontend
Reporter: Csaba Ringhofer


IMPALA-6684 added memory estimates for KrpcDataStreamSender, but there are a 
few gaps between how the frontend estimates memory and how the backend 
actually allocates it:
The frontend uses the following formula:
buffer_size = num_channels * 2 * (tuple_buffer_length + 
compressed_buffer_length)
This accounts for the serialization and compression buffers of each 
OutboundRowBatch.

This can both under- and overestimate:
1. it doesn't take into account the RowBatch used by channels during partitioned 
exchange to collect rows belonging to a single channel 
https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L232

2. it ignores the adjustment to the RowBatch capacity above based on the flag 
data_stream_sender_buffer_size 
https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L379
This adjustment can either increase or decrease the capacity to reach the desired 
total size (16K by default).

Note that the adjustment above ignores var-len data, so it can massively 
underestimate in some cases. Meanwhile, the frontend logic calculates string 
sizes if stats are present. Ideally both would be improved and synced to 
use both data_stream_sender_buffer_size and the string sizes for the estimate 
(I am not sure about collection types).
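Written out as a function, the frontend estimate described above is (parameter 
names are illustrative, not the planner's actual identifiers):

```python
def krpc_sender_mem_estimate(num_channels, tuple_buffer_length,
                             compressed_buffer_length):
    """Frontend memory estimate for KrpcDataStreamSender.

    Two OutboundRowBatches per channel, each with a serialization buffer
    and a compression buffer. Per the description above, this ignores the
    per-channel RowBatch used in partitioned exchange and the
    data_stream_sender_buffer_size capacity adjustment.
    """
    return num_channels * 2 * (tuple_buffer_length + compressed_buffer_length)

# e.g. 4 channels with 16 KB tuple and compression buffers each:
estimate = krpc_sender_mem_estimate(4, 16 * 1024, 16 * 1024)
```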






[jira] [Commented] (IMPALA-12373) Implement Small String Optimization for StringValue

2023-11-17 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787132#comment-17787132
 ] 

Csaba Ringhofer commented on IMPALA-12373:
--

>I think we don't need NULL termination so we can store actually 11 chars with 
>libc++'s technique.
In case the StringValue is inside a tuple, it may be possible to store even 
more chars in the tuple.

The idea is to reserve some bytes before the StringValue; if, based on the 
last bit, it is a small string but the length is > 11, then we could assume that 
the string starts len - 11 bytes before the address of the StringValue. This could 
speed things up a bit in case we have stats about avg/max string length, as the 
number of extra bytes could be chosen to minimize waste.
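A minimal model of the 12-byte slot being discussed: inline the payload in the 
pointer bytes when it fits, otherwise keep a pointer. The tag bit in the length 
field and the exact encoding are illustrative assumptions, not Impala's actual 
design:

```python
import struct

SMALL_CAP = 8  # bytes that fit inside the 8-byte ptr field

def pack_string_value(s: bytes) -> bytes:
    """Pack a 12-byte StringValue-like slot: 8-byte ptr area + 4-byte len.

    Small strings are stored inline in the ptr bytes; the high bit of len
    tags them (an assumption made for this sketch).
    """
    if len(s) <= SMALL_CAP:
        buf = s.ljust(SMALL_CAP, b"\0")
        return struct.pack("<8sI", buf, len(s) | 0x80000000)
    # For long strings, a real implementation stores an address; id(s)
    # merely stands in for a pointer here.
    return struct.pack("<QI", id(s), len(s))

def unpack_small(packed: bytes) -> bytes:
    buf, tagged_len = struct.unpack("<8sI", packed)
    assert tagged_len & 0x80000000, "not a small string"
    return buf[: tagged_len & 0x7FFFFFFF]
```

Note the slot stays 12 bytes either way, so the tuple layout is unchanged; only 
the interpretation of the ptr bytes differs.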

> Implement Small String Optimization for StringValue
> ---
>
> Key: IMPALA-12373
> URL: https://issues.apache.org/jira/browse/IMPALA-12373
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Assignee: Zoltán Borók-Nagy
>Priority: Major
>  Labels: Performance
> Attachments: small_string.cpp
>
>
> Implement Small String Optimization for StringValue.
> Current memory layout of StringValue is:
> {noformat}
>   char* ptr;  // 8 byte
>   int len;// 4 byte
> {noformat}
> For small strings with size up to 8 we could store the string contents in the 
> bytes of the 'ptr'. Something like that:
> {noformat}
>   union {
> char* ptr;
> char small_buf[sizeof(ptr)];
>   };
>   int len;
> {noformat}
> Many C++ string implementations use the {{Small String Optimization}} to 
> speed up work with small strings. For example:
> {code:java}
> Microsoft STL, libstdc++, libc++, Boost, Folly.{code}






[jira] [Commented] (IMPALA-12463) Allow batching of non consecutive metastore events

2023-09-30 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770671#comment-17770671
 ] 

Csaba Ringhofer commented on IMPALA-12463:
--

>But it will be very hard to support referential integrity and transactions in 
>the future if the event processor starts to reorder events.

I would separate external, ACID and Iceberg tables here.

External: having cross-table consistency for external tables seems like a lost 
cause to me. Anyone can write into them at any time, and we may or may not get an 
event about that. Even if there is an event, after the resulting refresh 
catalogd won't only see the files created by that insert, but also later inserts 
to the same partition.

Hive ACID: there will be a commit event, and according to the logic in the 
description, that should "cut" the batching of events, so we won't batch events 
before and after the commit together. This means that events for actually 
different INSERTs won't be batched. I am not 100% sure about ALTER PARTITION 
events, but generally altering should also have its own transaction, and as for 
INSERT, the batching won't go past those.
Note that I didn't check how exactly we handle events for ACID tables - ideally 
Impala should only do a refresh after the commit, and only if it changed the 
validWriteIds.

Iceberg: I guess that catalogd won't see partition level events at all, so the 
batching doesn't seem relevant.



> Allow batching of non consecutive metastore events
> --
>
> Key: IMPALA-12463
> URL: https://issues.apache.org/jira/browse/IMPALA-12463
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Assignee: Joe McDonnell
>Priority: Major
> Attachments: concurrent_metadata_load.py
>
>
> Currently Impala tries to batch events like partition insert/creation only if:
> 1. the next event is for the same table as the previous one
> 2. the next event's id is the previous one's + 1
> 3. the next event has the same type as the previous one
> (2 can be stricter than 1 if some events were filtered between the two)
> See 
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L315
> Another limit is that only events in the same batch from HMS can be merged. 
> Currently 1000 events are polled at the same time: 
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L218
> Making this configurable could be also useful.
> Event batching could be improved by batching all events to the current one if 
> they modify the same table, unless they are "cut" by:
> a. an event on the same table but with a different type
> b. a rename table event where the original or the new name is the same as the 
> current event
> If such an event occurs, the events after that can be only merged to a newer 
> event.
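The proposed merging and "cut" rules can be sketched roughly as follows. The 
event representation (2-tuples for table events, a 3-tuple for renames) is 
illustrative, not the catalogd's actual classes:

```python
def batch_events(events):
    """Group events per the proposed rules: an event joins the open batch for
    its table if the types match; an event of a different type on that table,
    or a rename involving it, "cuts" (closes) the open batch.

    'events' is a list of (table, event_type) tuples; a rename is
    ("RENAME", old_name, new_name).
    """
    open_batches = {}  # table name -> open batch (list of events)
    finished = []
    for ev in events:
        if ev[0] == "RENAME":
            _, old, new = ev
            # A rename cuts any open batch on either name.
            for t in (old, new):
                if t in open_batches:
                    finished.append(open_batches.pop(t))
            finished.append([ev])
            continue
        table, etype = ev
        batch = open_batches.get(table)
        if batch and batch[-1][1] == etype:
            batch.append(ev)  # same table, same type: merge
        else:
            if batch:
                finished.append(open_batches.pop(table))  # cut
            open_batches[table] = [ev]
    finished.extend(open_batches.values())
    return finished
```

Unlike the current consecutive-event-id rule, this lets inserts on one table 
merge across interleaved events on other tables, while renames and type changes 
still act as barriers.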






[jira] [Updated] (IMPALA-12476) Single thread permission check can bottleneck table loading

2023-09-29 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12476:
-
Summary: Single thread permission check can bottleneck table loading  (was: 
Single thread permission check can bottleneck table loadin)

> Single thread permission check can bottleneck table loading
> ---
>
> Key: IMPALA-12476
> URL: https://issues.apache.org/jira/browse/IMPALA-12476
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Priority: Major
>
> Partitioned tables use multiple threads to list files in different 
> partitions, but access permission checks are done before this on a single 
> thread. IMPALA-7320 optimized this for full table loads (more exactly if a 
> high percentage of partitions have changed), but in some cases we still do 
> namenode RPCs on a single thread to get access level:
> 1. as mentioned above, if only a small subset of partitions are changed
> 2. if the path has ACL (access control list), then after getting file status 
> an extra getAclStatus RPC is done, leading to partition_count number of RPCs 
> on a single thread if ACL is enabled for all partitions
> 3. if there is some error when doing the optimized path
> 1. is especially problematic for metastore event processing, as partition 
> events will often change  only a subset of partitions. Even if all partitions 
> are changed, the catalogd may not process them in one batch (see 
> IMPALA-12463), leading to choosing the unoptimized path for several smaller 
> batches
> Besides the optimization in  IMPALA-7320, there is no good reason for doing 
> access level checks on a single thread, so both case 1 and 2 could be made 
> faster by moving them to the multithreaded stage of table loading.
> Note it is also a question whether all these access permission checks are 
> really needed, see  IMPALA-12472.
> An anomaly caused by doing these on a single thread is that the effect of 
> flag max_hdfs_partitions_parallel_load can be ambiguous - while it can 
> significantly speed up loading tables with multiple partitions, if the 
> namenode (or the thread that communicates with namenode) is contended, then 
> parallel loads will get an unfair share of the limited resources, meaning the 
> tables where large amount of work is done on single thread can actually get 
> slower.






[jira] [Updated] (IMPALA-12476) Single thread permission checks can bottleneck table loading

2023-09-29 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12476:
-
Summary: Single thread permission checks can bottleneck table loading  
(was: Single thread permission check can bottleneck table loading)

> Single thread permission checks can bottleneck table loading
> 
>
> Key: IMPALA-12476
> URL: https://issues.apache.org/jira/browse/IMPALA-12476
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Priority: Major
>
> Partitioned tables use multiple threads to list files in different 
> partitions, but access permission checks are done before this on a single 
> thread. IMPALA-7320 optimized this for full table loads (more exactly if a 
> high percentage of partitions have changed), but in some cases we still do 
> namenode RPCs on a single thread to get access level:
> 1. as mentioned above, if only a small subset of partitions are changed
> 2. if the path has ACL (access control list), then after getting file status 
> an extra getAclStatus RPC is done, leading to partition_count number of RPCs 
> on a single thread if ACL is enabled for all partitions
> 3. if there is some error when doing the optimized path
> 1. is especially problematic for metastore event processing, as partition 
> events will often change  only a subset of partitions. Even if all partitions 
> are changed, the catalogd may not process them in one batch (see 
> IMPALA-12463), leading to choosing the unoptimized path for several smaller 
> batches
> Besides the optimization in  IMPALA-7320, there is no good reason for doing 
> access level checks on a single thread, so both case 1 and 2 could be made 
> faster by moving them to the multithreaded stage of table loading.
> Note it is also a question whether all these access permission checks are 
> really needed, see  IMPALA-12472.
> An anomaly caused by doing these on a single thread is that the effect of 
> flag max_hdfs_partitions_parallel_load can be ambiguous - while it can 
> significantly speed up loading tables with multiple partitions, if the 
> namenode (or the thread that communicates with namenode) is contended, then 
> parallel loads will get an unfair share of the limited resources, meaning the 
> tables where large amount of work is done on single thread can actually get 
> slower.






[jira] [Created] (IMPALA-12476) Single thread permission check can bottleneck table loadin

2023-09-29 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created IMPALA-12476:


 Summary: Single thread permission check can bottleneck table loadin
 Key: IMPALA-12476
 URL: https://issues.apache.org/jira/browse/IMPALA-12476
 Project: IMPALA
  Issue Type: Improvement
  Components: Catalog
Reporter: Csaba Ringhofer


Partitioned tables use multiple threads to list files in different partitions, 
but access permission checks are done before this on a single thread. 
IMPALA-7320 optimized this for full table loads (more exactly if a high 
percentage of partitions have changed), but in some cases we still do namenode 
RPCs on a single thread to get access level:
1. as mentioned above, if only a small subset of partitions are changed
2. if the path has ACL (access control list), then after getting file status an 
extra getAclStatus RPC is done, leading to partition_count number of RPCs on a 
single thread if ACL is enabled for all partitions
3. if there is some error when doing the optimized path

1. is especially problematic for metastore event processing, as partition 
events will often change only a subset of partitions. Even if all partitions 
are changed, the catalogd may not process them in one batch (see IMPALA-12463), 
leading to choosing the unoptimized path for several smaller batches.

Besides the optimization in IMPALA-7320, there is no good reason for doing 
access level checks on a single thread, so both case 1 and 2 could be made faster 
by moving them to the multithreaded stage of table loading.

Note it is also a question whether all these access permission checks are 
really needed, see IMPALA-12472.

An anomaly caused by doing these on a single thread is that the effect of the flag 
max_hdfs_partitions_parallel_load can be ambiguous - while it can significantly 
speed up loading tables with multiple partitions, if the namenode (or the 
thread that communicates with the namenode) is contended, then parallel loads will 
get an unfair share of the limited resources, meaning that tables where a large 
amount of work is done on a single thread can actually get slower.
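Moving the checks to the multithreaded stage could look roughly like this, with 
`check_fn` standing in for the namenode getFileStatus/getAclStatus RPCs (the 
function and its parameters are a sketch, not existing Impala code):

```python
from concurrent.futures import ThreadPoolExecutor

def check_access_parallel(partition_paths, check_fn, max_threads=8):
    """Run the per-partition access check on a thread pool instead of a
    single thread, mirroring how file listing is already parallelized
    via max_hdfs_partitions_parallel_load.

    Returns a dict mapping each partition path to its access level.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(zip(partition_paths, pool.map(check_fn, partition_paths)))
```

Since each check is an independent RPC, there is no shared state to protect; the 
single-threaded stage exists for historical reasons rather than correctness.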







[jira] [Commented] (IMPALA-12472) Skip permission check when refreshing in event processor

2023-09-29 Thread Csaba Ringhofer (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770356#comment-17770356
 ] 

Csaba Ringhofer commented on IMPALA-12472:
--

A related comment:
https://issues.apache.org/jira/browse/IMPALA-7539?focusedCommentId=16876527=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16876527

It suggests completely skipping the permission check within the warehouse directory. 
Adding a flag for this would be simple and could be a huge performance 
improvement:
HdfsTable checks here whether to assume read+write access on a given filesystem 
or whether access checks are needed. We could also pass the path to check whether 
it is a subdirectory of a path where we assume read+write access.

I would consider creating a flag that can take a list of paths instead of 
having a bool on whether to skip the check on the warehouse. This would allow more 
fine-grained control, e.g. assuming that some external locations have read+write 
access.
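The path-list flag could boil down to a prefix check like the following. This is a minimal Python sketch under assumed names (ASSUMED_RW_PREFIXES and assume_read_write are hypothetical, not Impala's actual flag or API):

```python
import posixpath

# Hypothetical flag value: path prefixes under which read+write access
# is assumed, instead of a single bool for the warehouse directory.
ASSUMED_RW_PREFIXES = ["/warehouse", "/external/trusted"]

def assume_read_write(path, prefixes=ASSUMED_RW_PREFIXES):
    path = posixpath.normpath(path)
    for prefix in prefixes:
        prefix = posixpath.normpath(prefix)
        # True if path equals the prefix or is a subdirectory of it.
        # Appending "/" avoids matching sibling dirs like /warehouse2.
        if path == prefix or path.startswith(prefix + "/"):
            return True
    return False
```

The "/" suffix in the comparison is the important detail: without it, /warehouse2/db would incorrectly be treated as trusted.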

> Skip permission check when refreshing in event processor
> 
>
> Key: IMPALA-12472
> URL: https://issues.apache.org/jira/browse/IMPALA-12472
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Priority: Major
>
> Saw callstacks where most of EventProcessor's time is spent in rechecking 
> access level for partition directories
> {code}
> org.apache.impala.catalog.HdfsTable.getAvailableAccessLevel
> org.apache.impala.catalog.HdfsTable.createOrUpdatePartitionBuilder
> org.apache.impala.catalog.HdfsTable.createPartitionBuilder
> org.apache.impala.catalog.HdfsTable.reloadPartitions
> org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames
> org.apache.impala.service.CatalogOpExecutor.reloadPartitionsIfExist
> org.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.reloadPartitions
> org.apache.impala.catalog.events.MetastoreEvents$BatchPartitionEvent.process
> {code}
> HdfsTable.getAvailableAccessLevel() does a getFileStatus() call, and if the 
> access control list bit is set in the status, a getAclStatus() call to the 
> namenode.
> It is questionable whether we should recheck this during table refreshes 
> for directories that were already checked in the past, as it can be expensive 
> and is unlikely to change. AFAIK having stale data shouldn't cause security 
> issues: if Impala has no right to access/modify the file, the namenode 
> will simply not allow the operation (coordinators/executors use the same 
> username as catalogd for HDFS ops).
> Note that the whole access level check is skipped for most filesystems 
> other than HDFS (see HdfsTable.assumeReadWriteAccess()).
> Currently catalogd checks this for each partition-level event (even if they 
> are batched). While checking it once during CREATE PARTITION makes sense, 
> rechecking it for every INSERT and ALTER seems like overkill - especially since 
> an INSERT shouldn't reduce access rights on a partitioned table.
> Besides the event processor, rechecking during REFRESH and reloads after 
> DML/DDLs is also questionable. If there was an actual change, INVALIDATE 
> METADATA can be used to reload the table from scratch.






[jira] [Updated] (IMPALA-12461) Avoid write lock on the table during self-event detection

2023-09-29 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12461:
-
Epic Link: IMPALA-11532

> Avoid write lock on the table during self-event detection
> -
>
> Key: IMPALA-12461
> URL: https://issues.apache.org/jira/browse/IMPALA-12461
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Assignee: Csaba Ringhofer
>Priority: Major
>
> Saw some callstacks like this:
> {code}
>     at 
> org.apache.impala.catalog.CatalogServiceCatalog.tryLock(CatalogServiceCatalog.java:468)
>     at 
> org.apache.impala.catalog.CatalogServiceCatalog.tryWriteLock(CatalogServiceCatalog.java:436)
>     at 
> org.apache.impala.catalog.CatalogServiceCatalog.evaluateSelfEvent(CatalogServiceCatalog.java:1008)
>     at 
> org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.isSelfEvent(MetastoreEvents.java:609)
>     at 
> org.apache.impala.catalog.events.MetastoreEvents$BatchPartitionEvent.process(MetastoreEvents.java:1942)
> {code}
> At this point it was already checked, based on the service id, that the event 
> comes from Impala, and now we are checking the table's self-event list. Taking 
> the table lock can be problematic as other DDLs may take the write lock at the 
> same time.
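One way to avoid the table write lock is to guard the self-event list with its own small lock that is only held for the membership check. A Python sketch of the idea (illustrative only; SelfEventList is a hypothetical class, not Impala's implementation):

```python
from threading import Lock

class SelfEventList:
    """Per-table list of version numbers written by this catalogd.

    Guarded by a fine-grained lock held only for the check itself,
    so isSelfEvent-style checks need not contend with DDLs holding
    the table write lock.
    """

    def __init__(self):
        self._lock = Lock()
        self._versions = set()

    def add(self, version):
        # Called when catalogd performs a DDL/DML and records the
        # version it expects to see back from the metastore.
        with self._lock:
            self._versions.add(version)

    def is_self_event(self, version):
        # Check-and-remove atomically under the small lock; no table
        # write lock is needed, so event processing cannot block on DDLs.
        with self._lock:
            if version in self._versions:
                self._versions.remove(version)
                return True
            return False
```

The check-and-remove is done in one critical section so that two concurrent events carrying the same version cannot both be classified as self-events.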






[jira] [Updated] (IMPALA-12463) Allow batching of non consecutive metastore events

2023-09-29 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12463:
-
Epic Link: IMPALA-11532

> Allow batching of non consecutive metastore events
> --
>
> Key: IMPALA-12463
> URL: https://issues.apache.org/jira/browse/IMPALA-12463
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Priority: Major
> Attachments: concurrent_metadata_load.py
>
>
> Currently Impala tries to batch events like partition inserts/creation only if:
> 1. the next event is for the same table as the previous one
> 2. the next event's id is the previous one's + 1
> 3. the next event has the same type as the previous one
> (2 can be stricter than 1 if some events were filtered between the two)
> See 
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L315
> Another limit is that only events in the same batch from HMS can be merged. 
> Currently 1000 events are polled at a time: 
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L218
> Making this configurable could also be useful.
> Event batching could be improved by merging all events that modify the same 
> table into the current one, unless they are "cut" by:
> a. an event on the same table but with a different type
> b. a rename table event where the original or the new name is the same as the 
> current event's table
> If such an event occurs, the events after it can only be merged into a newer 
> event.
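The relaxed batching rule can be sketched as follows. This is a simplified Python illustration of the proposal, not Impala's code: events are (event_id, table, type) tuples, a differently-typed event on the same table cuts that table's open batch, and for brevity a RENAME is treated as never mergeable (the full rule would also cut the batch of the rename's other name).

```python
def batch_events(events):
    """Group events so that same-table, same-type events merge into one
    batch, even when events for other tables are interleaved."""
    batches = []
    open_batches = {}  # table -> index of its currently open batch
    for event_id, table, etype in events:
        idx = open_batches.get(table)
        if (idx is not None
                and batches[idx][0][2] == etype
                and etype != "RENAME"):
            # Same table, same type as the open batch: merge.
            batches[idx].append((event_id, table, etype))
        else:
            # A different type (or a rename) cuts the open batch;
            # later events can only merge into this newer batch.
            batches.append([(event_id, table, etype)])
            open_batches[table] = len(batches) - 1
    return batches
```

Compared with the current rule, this merges events 1 and 3 below even though event 2 (on another table) sits between them, while the ALTER still cuts the INSERT batch.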






[jira] [Updated] (IMPALA-12472) Skip permission check when refreshing in event processor

2023-09-29 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-12472:
-
Epic Link: IMPALA-11532

> Skip permission check when refreshing in event processor
> 
>
> Key: IMPALA-12472
> URL: https://issues.apache.org/jira/browse/IMPALA-12472
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Priority: Major
>






[jira] [Work started] (IMPALA-12461) Avoid write lock on the table during self-event detection

2023-09-29 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-12461 started by Csaba Ringhofer.

> Avoid write lock on the table during self-event detection
> -
>
> Key: IMPALA-12461
> URL: https://issues.apache.org/jira/browse/IMPALA-12461
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Assignee: Csaba Ringhofer
>Priority: Major
>






[jira] [Closed] (IMPALA-12475) Epic: event processor performance issues and observability

2023-09-29 Thread Csaba Ringhofer (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer closed IMPALA-12475.

Resolution: Duplicate

It turned out that there is already an epic like this: IMPALA-11532

>  Epic: event processor performance issues and observability
> ---
>
> Key: IMPALA-12475
> URL: https://issues.apache.org/jira/browse/IMPALA-12475
> Project: IMPALA
>  Issue Type: Epic
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Priority: Major
>






